For the MySQL performance problem, the following approach works: in a distributed deployment we run several machines, give each machine a different initial value, and set the step equal to the number of machines. For example, with two machines, set the step to 2, TicketServer1's initial value to 1 (1, 3, 5, 7, 9, 11...) and TicketServer2's initial value to 2 (2, 4, 6, 8, 10...). This is the primary-key generation strategy the Flickr team wrote about in 2010 (Ticket Servers: Distributed Unique Primary Keys on the Cheap). As shown below, to implement this scheme we configure the two machines accordingly: TicketServer1 starts issuing from 1, TicketServer2 starts from 2, and each machine increments by 2 after every issue.
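A sketch of the corresponding settings. auto_increment_increment and auto_increment_offset are MySQL's real system variables for this; in practice you would set them in my.cnf rather than per session:

    -- TicketServer1: issues 1, 3, 5, 7, ...
    SET GLOBAL auto_increment_increment = 2;  -- step = number of machines
    SET GLOBAL auto_increment_offset    = 1;  -- initial value

    -- TicketServer2: issues 2, 4, 6, 8, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 2;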
For the third issue, DB availability, we currently use one master and two slaves deployed across machine rooms, with semi-synchronous replication [5] between master and slaves. We also use the company's Atlas database middleware (now open-sourced and renamed DBProxy) to handle master-slave failover. This scheme can of course degrade to asynchronous replication in some situations, and in extremely rare cases can still produce inconsistent data, but the probability is very small. If your system must guarantee 100% strong consistency, you can choose a strongly consistent MySQL solution built on a Paxos-like algorithm, such as MySQL Group Replication, which recently went GA in MySQL 5.7. The operational cost and effort will increase accordingly, so choose based on your actual situation.
Assuming you're looking to generate sequential ids, you can use Redis and the INCR command without worrying about race conditions. Since Redis is (mostly) single threaded, you are assured that every request will get its own unique id from it.
Furthermore, you don't need to check the id key's existence/initialize it because Redis will do that for you (i.e. if you INCR a non-existent key, it will be first created and set to 0 automatically).
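A minimal sketch with redis-py (connection details and key name are just examples):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def next_id() -> int:
        # INCR is atomic; a missing key is treated as 0, so the first call returns 1.
        return r.incr("ticket:id")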
MongoDB’s ObjectIDs are 12-byte (96-bit) hexadecimal numbers that are made up of -
a 4-byte epoch timestamp in seconds,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
This is smaller than the earlier 128-bit UUID, but still larger than what we normally have in a single MySQL auto-increment field (a 64-bit bigint value).
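A small sketch that unpacks those four fields from a 12-byte ObjectId given as a 24-character hex string (pure Python, no MongoDB driver):

    def decode_objectid(oid_hex: str) -> dict:
        raw = bytes.fromhex(oid_hex)                        # 12 bytes
        return {
            "timestamp": int.from_bytes(raw[0:4], "big"),   # 4-byte epoch seconds
            "machine":   int.from_bytes(raw[4:7], "big"),   # 3-byte machine identifier
            "pid":       int.from_bytes(raw[7:9], "big"),   # 2-byte process id
            "counter":   int.from_bytes(raw[9:12], "big"),  # 3-byte counter
        }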
Database Ticket Servers
This approach uses a centralized database server to generate unique incrementing IDs. It’s like a centralized auto-increment. This approach is used by Flickr.
The problem with this approach is that the ticket server can become a write bottleneck. Moreover, you introduce one more component in your infrastructure that you need to manage and scale.
Twitter Snowflake
Twitter snowflake is a dedicated network service for generating 64-bit unique IDs at high scale. The IDs generated by this service are roughly time sortable.
The IDs are made up of the following components:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
They have reserved 1 bit for future use. Since the IDs use a timestamp as the first component, they are sortable.
The IDs generated by Twitter Snowflake fit in 64 bits and are time sortable, which is great. That's what we want.
The above generator uses the system’s MAC address to create a unique identifier for the Node. You can also supply a NodeID to the sequence generator. That will guarantee uniqueness.
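A sketch of packing the three components into a 64-bit integer with the bit widths listed above (the custom epoch value is an arbitrary assumption, and clock rollback is not handled):

    import threading, time

    CUSTOM_EPOCH_MS = 1288834974657          # assumed custom epoch, in milliseconds
    NODE_BITS, SEQ_BITS = 10, 12
    MAX_SEQ = (1 << SEQ_BITS) - 1

    class Snowflake:
        def __init__(self, node_id: int):
            assert 0 <= node_id < (1 << NODE_BITS)
            self.node_id, self.last_ms, self.seq = node_id, -1, 0
            self.lock = threading.Lock()

        def next_id(self) -> int:
            with self.lock:
                now = int(time.time() * 1000)
                if now == self.last_ms:
                    self.seq = (self.seq + 1) & MAX_SEQ
                    if self.seq == 0:            # 4096 ids used in this ms: wait
                        while now <= self.last_ms:
                            now = int(time.time() * 1000)
                else:
                    self.seq = 0
                self.last_ms = now
                return ((now - CUSTOM_EPOCH_MS) << (NODE_BITS + SEQ_BITS)) \
                       | (self.node_id << SEQ_BITS) | self.seq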
Ordered: They may be ordered to allow comparison of items stamped with IDs.
Meaningful: They may embed certain information about the point at which the ID was generated, for example the time (however inexact).
Distributed: They may be generated by a machine without consulting any other machine (ie. they must be truly distributed).
Compact: They may be represented in a specific format for space or performance reasons. Normally, you’d want to represent this in as compact a format as possible but you often trade off longevity to do this (UNIX timestamps are a great example of compact albeit finite time representation in action).
So called “time oracles” are dedicated servers normally equipped with an atomic clock or GPS device whose sole purpose is for keeping time in as precise a manner as possible. Time oracle servers are used frequently in environments where super accurate time is important (such as trading platforms and financial applications), but they are prohibitively expensive and not a viable solution for those working in an environment where deploying custom hardware is impossible (think cloud environments).
To combat this, software such as NTP is available, specifically written to ameliorate this problem for those who cannot access a dedicated time oracle. The idea is basically that large organisations donate access to their time oracle servers by adding them to a global pool of NTP servers which the masses can use to adjust their clocks against. This is, unfortunately, not without its problems. It's an error-prone process, as it involves network hops and jitter, and is thus susceptible to all kinds of inaccuracies that the underlying algorithms do their best to abate but cannot avoid. It's also really important to note that NTP is, by default, allowed to move time backwards to compensate for clock drift, something you must disable if you want to ensure any ID generation based on time can never generate duplicates.
Redis fitted our needs well as it allowed us to run arbitrary scripts using Lua; however, you could use any service that allows this. The ability to run Lua scripts is important in the case of Redis, as it otherwise lacks the ability to conditionally run commands in a transaction.
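For instance, a compare-and-set style update ("write only if this version is newer") needs a script, since MULTI/EXEC alone cannot branch on the current value. A sketch with redis-py (the key layout is illustrative):

    import redis

    r = redis.Redis()
    SET_IF_NEWER = """
    local cur = tonumber(redis.call('HGET', KEYS[1], 'version') or '-1')
    if tonumber(ARGV[1]) > cur then
      redis.call('HSET', KEYS[1], 'version', ARGV[1])
      redis.call('HSET', KEYS[1], 'value', ARGV[2])
      return 1
    end
    return 0
    """
    set_if_newer = r.register_script(SET_IF_NEWER)
    updated = set_if_newer(keys=["item:42"], args=[7, "payload"])  # 1 if written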
At a high level, Wormhole propagates changes issued in one system to all systems that need to reflect those changes – within and across data centers.
One application of Wormhole is to connect our user databases (UDBs) with other services so they can operate based on the most current data. Here are three examples of systems that receive updates via Wormhole:
Caches – Refills and invalidation messages need to be sent to each cache so they stay in sync with their local database and consistent with each other.
Services – Services such as Graph Search that build specialized social indexes need to remain current and receive updates to the underlying data.
Data warehouse – The Data warehouses and bulk processing systems (Hadoop, Hbase, Hive, Oracle) receive streaming, real-time updates instead of relying on periodic database dumps.
The Wormhole system has three primary components:
Producer – Services and web tier use the Wormhole producer library to embed messages that get written to the binary log of the UDBs.
Publisher – The Wormhole publisher tails the binary log, extracts the embedded messages, and makes them available as real-time streams that can be subscribed to.
Consumer – Each interested consumer subscribes to relevant updates.
In order to satisfy a wide variety of use-cases, Wormhole has the following properties:
Data partitioning: On the databases, the user data is partitioned, or sharded, across a large number of machines. Updates are ordered within a shard but not necessarily across shards. This partitioning also isolates failures, allowing the rest of the system to keep working even when one or more storage machines have failed. Wormhole maintains a separate publish-subscribe stream per shard–think parallel wormholes in space.
Rewind in time: To deal with different failures (network, software, and hardware), services need to be able to go back to an earlier data checkpoint and start applying updates from that point onward. Wormhole supports check-pointing, state management, and a rewind feature.
Thanks to all of these properties, we are able to run Wormhole at a huge scale. For example, on the UDB deployment alone, Wormhole processes over 1 trillion messages every day (significantly more than 10 million messages every second). Like any system at Facebook's scale, Wormhole is engineered to deal with failure of individual components, integrate with monitoring systems, perform automatic remediation, enable capacity planning, automate provisioning, and adapt to sudden changes in usage patterns.
how do FB apps use mc?
read:
v = get(k) (computes hash(k) to choose mc server)
if v is nil {
v = fetch from DB
set(k, v)
}
write:
v = new value
send k,v to DB
delete(k)
application determines relationship of mc to DB
mc doesn't know anything about DB
FB uses mc as a "look-aside" cache
real data is in the DB
cached value (if any) should be same as DB
look-aside is much trickier than it looks -- consistency
paper is trying to integrate mutually-oblivious storage layers
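A runnable sketch of the look-aside read/write pattern above, assuming pymemcache for the mc side and a hypothetical fetch_from_db/write_to_db pair standing in for the real database:

    from pymemcache.client.hash import HashClient

    mc = HashClient([("mc1", 11211), ("mc2", 11211)])   # client hashes k to pick a server

    def read(k, fetch_from_db):
        v = mc.get(k)
        if v is None:
            v = fetch_from_db(k)      # real data lives in the DB
            mc.set(k, v)              # fill the cache on a miss
        return v

    def write(k, v, write_to_db):
        write_to_db(k, v)             # update the DB first
        mc.delete(k)                  # then invalidate (not update) the cached copy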
cache is critical:
not really about reducing user-visible delay
mostly about surviving huge load!
cache misses and failures can create intolerable DB load
they can tolerate modest staleness: no freshness guarantee
stale data nevertheless a big headache
want to avoid unbounded staleness (e.g. missing a delete() entirely)
want read-your-own-writes
each performance fix brings a new source of staleness
huge "fan-out" => parallel fetch, in-cast congestion
will partition or replication yield most mc throughput?
partition: server i, key k -> mc server hash(k)
replicate: server i, key k -> mc server hash(i)
partition is more memory efficient (one copy of each k/v)
partition works well if no key is very popular
partition forces each web server to talk to many mc servers (overhead)
replication works better if a few keys are very popular
performance and regions (Section 5)
Q: what is the point of regions -- multiple complete replicas?
lower RTT to users (east coast, west coast)
parallel reads of popular data due to replication
(note DB replicas help only read performance, no write performance)
maybe hot replica for main site failure?
Q: why not partition users over regions?
i.e. why not east-coast users' data in east-coast region, &c
social net -> not much locality
very different from e.g. e-mail
Q: why OK performance despite all writes forced to go to the master region?
writes would need to be sent to all regions anyway -- replicas
users probably wait for round-trip to update DB in master region
only 100ms, not so bad
users do not wait for all effects of writes to finish
i.e. for all stale cached values to be deleted
performance within a region (Section 4)
multiple mc clusters *within* each region
cluster == complete set of mc cache servers
i.e. a replica, at least of cached data
why multiple clusters per region?
why not add more and more mc servers to a single cluster?
1. adding mc servers to cluster doesn't help single popular keys
replicating (one copy per cluster) does help
2. more mcs in cluster -> each client req talks to more servers
and more in-cast congestion at requesting web servers
client requests fetch 20 to 500 keys! over many mc servers
MUST request them in parallel (otherwise total latency too large)
so all replies come back at the same time
network switches, NIC run out of buffers
3. hard to build network for single big cluster
uniform client/server access
so cross-section b/w must be large -- expensive
two clusters -> 1/2 the cross-section b/w
but -- replicating is a waste of RAM for less-popular items
"regional pool" shared by all clusters
unpopular objects (no need for many copies)
decided by *type* of object
frees RAM to replicate more popular objects
bringing up new mc cluster was a serious performance problem
new cluster has 0% hit rate
if clients use it, will generate big spike in DB load
if ordinarily 1% miss rate, and (let's say) 2 clusters,
adding "cold" third cluster will causes misses for 33% of ops.
i.e. 30x spike in DB load!
thus the clients of new cluster first get() from existing cluster (4.3)
and set() into new cluster
basically lazy copy of existing cluster to new cluster
better 2x load on existing cluster than 30x load on DB
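A sketch of that warm-up read path at the client (pymemcache-style clients; fetch_from_db is a hypothetical DB accessor):

    from pymemcache.client.hash import HashClient

    cold_mc = HashClient([("cold-mc-1", 11211)])   # new, empty cluster
    warm_mc = HashClient([("warm-mc-1", 11211)])   # existing cluster with a high hit rate

    def warmup_get(k, fetch_from_db):
        v = cold_mc.get(k)
        if v is not None:
            return v
        v = warm_mc.get(k)            # ~2x load on the warm cluster ...
        if v is None:
            v = fetch_from_db(k)      # ... beats a ~30x load spike on the DB
        cold_mc.set(k, v)             # lazily copy into the new cluster
        return v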
important practical networking problems:
n^2 TCP connections is too much state
thus UDP for client get()s
UDP is not reliable or ordered
thus TCP for client set()s
and mcrouter to reduce n in n^2
small request per packet is not efficient (for TCP or UDP)
per-packet overhead (interrupt &c) is too high
thus mcrouter batches many requests into each packet
mc server failure?
can't have DB servers handle the misses -- too much load
can't shift load to one other mc server -- too much
can't re-partition all data -- time consuming
Gutter -- pool of idle servers, clients only use after mc server fails
The Question:
why don't clients send invalidates to Gutter servers?
my guess: would double delete() traffic
and send too many delete()s to small gutter pool
since any key might be in the gutter pool
thundering herd
one client updates DB and delete()s a key
lots of clients get() but miss
they all fetch from DB
they all set()
not good: needless DB load
mc gives just the first missing client a "lease"
lease = permission to refresh from DB
mc tells others "try get() again in a few milliseconds"
effect: only one client reads the DB and does set()
others re-try get() later and hopefully hit
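A sketch of the lease protocol from the client side. Stock memcached has no lease commands, so get_with_lease/set_with_lease below are hypothetical wrappers around Facebook's modified server:

    import time

    def read_with_lease(k, mc, fetch_from_db, retry_s=0.01):
        while True:
            v, lease = mc.get_with_lease(k)      # hypothetical API of the modified mc
            if v is not None:
                return v                         # hit
            if lease is not None:                # this client won the right to refill
                v = fetch_from_db(k)
                mc.set_with_lease(k, v, lease)   # ignored if the lease was invalidated
                return v
            time.sleep(retry_s)                  # "try get() again in a few milliseconds"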
let's talk about consistency now
the big truth
hard to get both consistency (== freshness) and performance
performance for reads = many copies
many copies = hard to keep them equal
what is their consistency goal?
*not* read sees latest write
since not guaranteed across clusters
more like "not more than a few seconds stale"
i.e. eventual
*and* writers see their own writes
read-your-own-writes is a big driving force
first, how are DB replicas kept consistent across regions?
one region is master
master DBs distribute log of updates to DBs in slave regions
slave DBs apply
slave DBs are complete replicas (not caches)
DB replication delay can be considerable (many seconds)
how do we feel about the consistency of the DB replication scheme?
good: eventual consistency, b/c single ordered write stream
bad: longish replication delay -> stale reads
how do they keep mc content consistent w/ DB content?
1. DBs send invalidates (delete()s) to all mc servers that might cache
2. writing client also invalidates mc in local cluster
for read-your-writes
why did they have consistency problems in mc?
client code to copy DB to mc wasn't atomic:
1. writes: DB update ... mc delete()
2. read miss: DB read ... mc set()
so *concurrent* clients had races
what were the races and fixes?
Race 1:
k not in cache
C1 get(k), misses
C1 v = read k from DB
C2 updates k in DB
C2 and DB delete(k) -- does nothing
C1 set(k, v)
now mc has stale data, delete(k) has already happened
will stay stale indefinitely, until key is next written
solved with leases -- C1 gets a lease, but C2's delete() invalidates lease,
so mc ignores C1's set
key still missing, so next reader will refresh it from DB
Race 2:
during cold cluster warm-up
remember clients try get() in warm cluster, copy to cold cluster
k starts with value v1
C1 updates k to v2 in DB
C1 delete(k) -- in cold cluster
C2 get(k), miss -- in cold cluster
C2 v1 = get(k) from warm cluster, hits
C2 set(k, v1) into cold cluster
now mc has stale v1, but delete() has already happened
will stay stale indefinitely, until key is next written
solved with two-second hold-off, just used on cold clusters
after C1 delete(), cold ignores set()s for two seconds
by then, delete() will propagate via DB to warm cluster
Race 3:
k starts with value v1
C1 is in a slave region
C1 updates k=v2 in master DB
C1 delete(k) -- local region
C1 get(k), miss
C1 read local DB -- sees v1, not v2!
later, v2 arrives from master DB
solved by "remote mark"
C1 delete() marks key "remote"
get()/miss yields "remote"
tells C1 to read from *master* region
"remote" cleared when new data arrives from master region
Q: aren't all these problems caused by clients copying DB data to mc?
why not instead have DB send new values to mc, so clients only read mc?
then there would be no racing client updates &c, just ordered writes
A:
1. DB doesn't generally know how to compute values for mc
generally client app code computes them from DB results,
i.e. mc content is often not simply a literal DB record
2. would increase read-your-own writes delay
3. DB doesn't know what's cached, would end up sending lots
of values for keys that aren't cached
PNUTS does take this alternate approach of master-updates-all-copies
FB/mc lessons for storage system designers?
cache is vital to throughput survival, not just a latency tweak
need flexible tools for controlling partition vs replication
need better ideas for integrating storage layers with consistency
However, this will not happen, because if the memcached server has recently given out a lease token, it will not give out another. In other words, B does not get a lease token in this scenario. Instead, what B receives is a hot miss result, which tells client B that another client (namely client A) is already headed to the DB for the value, and that client B will have to try again later.
So, leases work in practice because only one client is headed to the DB at any given time for a particular value in the cache.
Furthermore, when a new delete comes in, memcached will know that the next lease-set (the one from client A) is already stale, so it will accept the value (because it's newer) but will mark it as stale; the next client to ask for that value will get a new lease token, and clients after that will once again get hot miss results.
Note also that hot miss results include the latest stale value, so a client can decide to go ahead with slightly stale data, or try again (with exponential backoff) to get the most up-to-date data possible.
Every time we update the software we have to restart the server, and then it takes a while for the cache to fill back up. The fix is to put the cached data somewhere before the update (a System V Shared Memory Region; I'm not sure exactly what that is...), so this data can keep being used afterwards.
https://medium.com/@shagun/scaling-memcache-at-facebook-1ba77d71c082 In Facebook's context, users consume much more content than they create, so the workload is read-intensive and caching helps to reduce it. Facebook uses Memcache clusters where each Memcache instance is a demand-filled, look-aside cache. This means that if a client requests data from Memcache and the data is not available, the client fetches the data from the database and populates the cache for further requests.
In the case of get requests, UDP performs better than TCP, and get errors are treated as cache misses, though insertion is not performed. This design choice seems practical as only 0.25% of requests fail due to late, dropped, or out-of-order packets. Though the response size is very small, the variation is quite large, with mean = 954 bytes and median = 135 bytes. Set and delete operations are still performed over TCP (for reliability), though the connections are coalesced to improve efficiency.
Within a cluster, data is distributed across hundreds of servers through consistent hashing. A very high request rate combined with large fanout leads to all-to-all communication between Memcache servers and clients, and even a single server can become a bottleneck for many requests. Clients construct a DAG representing the dependencies between data so that more independent requests are fired concurrently. Facebook also provides a standalone proxy called mcrouter that presents the Memcache server interface and routes requests/replies to/from other servers. Along with these, flow control in the form of a sliding-window mechanism is provided to limit incast congestion.
Lease
Leases are used to address stale sets (when a web server writes a stale value into the cache) and thundering herds (when a key undergoes heavy read and write activity). When a client experiences a cache miss, Memcache gives it a lease (a 64-bit token bound to the requested key). This lease is verified by Memcache when the client tries to set the value. If Memcache receives a delete request for the key, the lease is invalidated and the value cannot be set. To mitigate thundering herds, Memcache returns a token only once every 10 seconds per key. If a read request comes within 10 seconds of a token being issued, the client is notified to retry after a short time, by which point the updated value is expected to be in the cache. In situations where returning stale data is not much of a problem, the client does not have to wait, and stale data (at most 10 seconds old) is returned.
Memcache Pools
Since different workloads can have different access patterns and requirements, Memcache clusters are partitioned into pools. The default pool is called wildcard and then there are separate pools for keys that can not reside in the wildcard pool. A small pool can be provisioned for keys for which cache miss is not expensive. Data is replicated within a pool when request rate is quite high and data can easily fit into a few machines. In Facebook’s context, replication seems to work better than sharding though it has the additional overhead of maintaining consistency.
In case a few Memcache servers fail, a small set of machines, called gutters, take over. In case of more widespread failures, requests are diverted to alternate clusters.
Cold cluster Warmup
When a new cluster is brought online, it takes time to get populated and initially the cache hit rate is very low. So a technique called Cold Cluster Warmup is employed, where a new cold cluster is populated with data from a warm cluster instead of from the database cluster. This way the cold cluster comes to full capacity within a few hours. But additional care is needed to account for race conditions. One example: a client in the cold cluster makes an update, and before this update reaches the warm cluster, another request for the same key is made from the cold cluster; the item in the cold cache would then be indefinitely inconsistent. To avoid this, Memcache rejects add operations for 2 seconds (the hold-off time) once a delete operation has taken place. So if a value is updated in a cold cluster and a subsequent request is made within 2 seconds, the add operation is rejected, indicating that the data has been updated. 2 seconds is chosen because most updates seem to propagate within this time.
Across Region consistency
Regions, each comprising a storage cluster and several front-end clusters, are deployed in different locations. Only one of the regions contains the master database and the rest act as replicas. This adds the challenge of synchronisation.
Writes from the master region send invalidations only within the master region. Sending invalidations outside may lead to a race where deletes arrive before the data has been replicated. Facebook's mcsqueal daemon helps avoid that, at the cost of serving stale data for some time.
Writes from a non-master region are handled differently. Suppose a user updates his data from a non-master region with a large replication lag. A cache refill from a replica database is allowed only after the replication stream has caught up. A remote-marker mechanism is used to minimise the probability of reading stale data. The presence of a marker indicates that the data in the replica is stale, and the query is redirected to the master region. When a web server updates a key k, it sets a remote marker rk in the region, performs the write to the master database with key k, and deletes k in the local cluster. When it tries to read k next time, it will experience a cache miss, will check whether rk exists, and will redirect its query to the master cluster. Had rk not been set, the query would have gone to the local cluster itself. Here latency is introduced to make sure the most up-to-date data is read.
Single Server Improvements
Facebook introduced many improvements for Memcache servers running as single instances as well. This includes extending Memcache to support automatic expansion of the hash table without the look-up time drifting to O(n), making the server multi-threaded using a global lock and giving each thread its own UDP port to reduce contention.
Memcache uses an Adaptive Slab Allocator which organizes memory into slab classes — pre-allocated, uniformly sized chunks of memory. Items are stored in the smallest slab class that can fit the data. Slab sizes start at 64 bytes and reach up to 1MB. Each slab class maintains a free-list of available chunks and requests memory in 1MB slabs when the free-list is empty. If no more free memory can be allocated, eviction takes place in Least Recently Used (LRU) fashion. The allocator periodically rebalances slab assignments to keep up with the workload: memory is freed in less used slabs and given to more heavily used slabs.
Most of the items are evicted lazily from memory, though some are evicted as soon as they expire. Short-lived items are placed into a circular buffer of linked lists (indexed by seconds until expiration) — called the Transient Item Cache — based on the expiration time of the item. Every second, all of the items in the bucket at the head of the buffer are evicted and the head advances by one. By adding a short expiration time to a heavily used set of keys whose items have short useful lifespans, the proportion of the Memcache pool used by this key family was reduced from 6% to 0.3% without affecting the hit rate.
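A toy sketch of that circular buffer (the bucket count and the evict callback are made up for illustration):

    class TransientItemCache:
        """Proactively drop short-lived keys instead of waiting for lazy eviction."""
        def __init__(self, horizon_s=60):
            self.buckets = [set() for _ in range(horizon_s)]
            self.head = 0

        def add(self, key, ttl_s):
            slot = (self.head + min(ttl_s, len(self.buckets) - 1)) % len(self.buckets)
            self.buckets[slot].add(key)

        def tick(self, evict):
            # Called once per second: evict everything in the head bucket, advance head.
            for key in self.buckets[self.head]:
                evict(key)
            self.buckets[self.head] = set()
            self.head = (self.head + 1) % len(self.buckets)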
Lessons Learnt
Memcache's design decisions are driven by data and not just intuition. Facebook seems to have experimented with a lot of configurations before arriving at decisions like using UDP for read operations and choosing values for parameters like the hold-off time. This is how it should be — data-driven decisions. In the entire process of developing Memcache, Facebook focused on optimizations that affect a good number of users and use cases instead of optimizing for each possible use case.
Facebook separated caching layer from persistence layer which makes it easier to mould the 2 layers individually. By handling the cache insertion logic in the application itself, Facebook made it easier to plug Memcache with different data stores. By modeling the probability of reading the stale data as a parameter, they allowed the latency and hit rate on the persistence store to be configurable to some extent thus broadening the scope for Memcache. Facebook breaks down a single page load request into multiple requests, thereby allowing for different kind of stale data tolerance for each component. For example, the servers serving the profile picture can cache content longer than the servers serving messages. This helps to lower the response latency and reduce the load on the data layer.
Facebook's Memcache is one of the many solutions aimed at solving a rather common problem — a scalable, distributed key-value store. Amazon's Dynamo solves it well for a write-intensive workload with small data sizes, and Facebook solves it for a read-intensive workload. Other contenders in the list are Cassandra, Voldemort, Redis, Twitter's Memcache and more. The sheer number of possible, well-known solutions suggests that there are many dimensions to the problem, some of which are yet unknown, and that we still do not have a single solution that can be used for all use cases.
Every time a new cache box is added to Memcache infrastructure, it has empty cache. This box is called 'cold'. Every request to this box will end up in 'miss', so clients will spend lots of resources to fill the cache. In the worst case this will affect performance of the entire site, because on every 'miss' on the cold box client would spend much more time to actually get the data.
Mcrouter offers a way to 'warm up' cold cache boxes without impacting performance. The idea is simple: when we have a cold box and receive a 'miss', try to get the value from a 'warm' box (one with a filled cache) and set it back into the cold box.
Explanation: All sets and deletes go to the "cold" route handle. Gets are attempted on the "cold" route handle, and in case of a miss, data is fetched from the "warm" route handle (where the request is likely to result in a cache hit). If "warm" returns a hit, the response is forwarded to the client and an asynchronous request updates the value in the "cold" route handle.
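A Python stand-in for the behaviour just described, written as a composable route handle (this is not mcrouter's real C++ or config syntax; async_pool could be a concurrent.futures.ThreadPoolExecutor):

    class WarmUpRoute:
        """Gets: try cold, fall back to warm, refill cold asynchronously."""
        def __init__(self, cold, warm, async_pool):
            self.cold, self.warm, self.async_pool = cold, warm, async_pool

        def get(self, k):
            v = self.cold.get(k)
            if v is not None:
                return v
            v = self.warm.get(k)                             # likely a hit
            if v is not None:
                self.async_pool.submit(self.cold.set, k, v)  # async refill of cold
            return v

        def set(self, k, v):
            return self.cold.set(k, v)       # sets go only to the cold route handle

        def delete(self, k):
            return self.cold.delete(k)       # deletes go only to the cold route handle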
Since any client that wants to talk to memcached can already speak the standard ASCII memcached protocol, we use that as the common API and enter the picture silently. To a client, mcrouter looks like a memcached server. To a server, mcrouter looks like a normal memcached client. But mcrouter’s feature-rich configurability makes it more than a simple proxy.
Some features of mcrouter are listed below. In the following, a “destination” is a memcached host (or some other cache service that understands the memcached protocol) and “pool” is a set of destinations configured for some workload — e.g., a sharded pool with a specified hashing function, or a replicated pool with multiple copies of the data on separate hosts. Finally, pools can be organized into multiple clusters.
Connection pooling: Multiple clients can connect to a single mcrouter instance and share the outgoing connections, reducing the number of open connections to memcached instances.
Multiple hashing schemes: Mcrouter provides a proven consistent hashing algorithm (furc_hash) that allows distribution of keys across many memcached instances. Hostname hashing is useful for selecting a unique replica per client. There are a number of other hashes useful in specialized applications.
Prefix routing: Mcrouter can route keys according to common key prefixes. For example, you can send all keys starting with “foo” to one pool, “bar” prefix to another pool, and everything else to a “wildcard” pool. This is a simple way to separate different workloads.
Replicated pools: A replicated pool has the same data on multiple hosts. Writes are replicated to all hosts in the pool, while reads are routed to a single replica chosen separately for each client. This could be done either due to per-host packet limitations where a sharded pool would not be able to handle the read rate; or for increased availability of the data (one replica going down doesn’t affect availability due to automatic failover).
Cold cache warm up: Mcrouter can smooth the performance impact of starting a brand new empty cache host or set of hosts (as large as an entire cluster) by automatically refilling it from a designated “warm” cache.
Broadcast operations: By adding a special prefix to a key in a request, it’s easy to replicate the same request into multiple pools and/or clusters.
Reliable delete stream: In a demand-filled look-aside cache, it’s important to ensure all deletes are eventually delivered to guarantee consistency. Mcrouter supports logging delete commands to disk in cases when the destination is not accessible (due to a network outage or other failure). A separate process then replays those deletes asynchronously. This is done transparently to the client — the original delete command is always reported as successful.
Quality of service: Mcrouter allows throttling the rate of any type of request (e.g., get/set/delete) at any level (per-host, per-pool, per-cluster), rejecting requests over a specified limit. We also support rate limit requests to slow delivery.
Large values: Mcrouter can automatically split/re-stitch large values that would not normally fit in a memcached slab.
Multi-level caches: Mcrouter supports local/remote cache setup, where values would be looked up locally first and automatically set in a local cache from remote after fetching.
Production traffic shadowing: When testing new cache hardware, we found it extremely useful to be able to route a complete copy of production traffic from clients. Mcrouter supports flexible shadowing configuration. It’s possible to shadow test a different pool size (re-hashing the key space), shadow only a fraction of the key space, or vary shadowing settings dynamically at runtime.
Online reconfiguration: Mcrouter monitors its configuration files and automatically reloads them on any file change; this loading and parsing is done on a background thread and new requests are routed according to the new configuration as soon as it’s ready. There’s no extra latency from client’s point of view.
Flexible routing: Configuration is specified as a graph of small routing modules called "route handles," which share a common interface (route a request and return a reply) and which can be composed freely. Route handles are easy to understand, create, and test individually, allowing for arbitrarily complex logic when used together. For example, an "all-sync" route handle is set up with multiple child route handles (which themselves could be arbitrary route handles); it passes a request to all of its children and waits for all of the replies to come back before returning one of them. Other examples include, among many others, "all-async" (send to all but don't wait for replies), "all-majority" (for consensus polling), and "failover" (send to every child in order until a non-error reply is returned). Expanding a pool can be done quickly by using a "cold cache warmup" route handle on the pool (with the old set of servers as the warm pool); moving this route handle up the stack allows an entire cluster to be warmed up from a warm cluster. (A toy sketch of composing route handles follows this list.)
Destination health monitoring and automatic failover: Mcrouter keeps track of the health status of each destination. If mcrouter marks a destination as unresponsive, it will fail over incoming requests to an alternate destination automatically (fast failover) without attempting to send them to the original destination. At the same time health check requests will be sent in the background, and as soon as a health check is successful, mcrouter will revert to using the original destination. We distinguish between “soft errors” (e.g., data timeouts) that are allowed to happen a few times in a row and “hard errors” (e.g., connection refused) that cause a host to be marked unresponsive immediately. Needless to say, all of this is completely transparent to the client.
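The toy sketch referenced under "Flexible routing" above: two route handles sharing one route(request) interface, so they can be nested arbitrarily (a Python stand-in, not mcrouter's actual implementation):

    class FailoverRoute:
        """Send to each child in order until one returns a non-error reply."""
        def __init__(self, children):
            self.children = children

        def route(self, request):
            last_exc = None
            for child in self.children:
                try:
                    return child.route(request)
                except IOError as exc:        # treat I/O failures as route errors
                    last_exc = exc
            raise last_exc

    class AllSyncRoute:
        """Send to all children, wait for every reply, then return one of them."""
        def __init__(self, children):
            self.children = children

        def route(self, request):
            replies = [child.route(request) for child in self.children]
            return replies[0]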
Take the last example (low-churn vs. high-churn): server A's keys almost never change and cycle within a small set, while server B's keys could be almost anything. If they share the same pool, server B, constantly bringing in new keys, will keep evicting A's keys, whereas ideally the pool A uses should hardly ever have to replace a key. That wastes a lot of performance.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
Memcached servers are indeed independent of each other. Memcached server is just an efficient key-value store implemented as in-memory hash table.
What makes memcached distributed is the client, which in most implementations can connect to a pool of servers. Typical implementations use consistent hashing, which means that when you add or remove server to/from a pool of N servers, you only have to remap 1/N keys.
Typically keys are not duplicated on various hosts, as memcached is not meant to be persistent store and gives no guarantees that your value will persist (for example when running out of assigned memory, memcached server drops least recently used (LRU) elements). Thus it’s assumed that your application should handle missing keys.
Here is where consistent hashing is advantageous: Suppose server 2 will be removed from the memcache server pool. What would happen when we want to fetch key k1? Nothing strange would happen. We plot k1 still on the same position in the continuum, and the first server-dot it will hit is still s1.
However, when fetching k3, which is stored on s2, it would miss the s2-dot (since it has been removed), and will be moved to server s3. Same goes for k2, which moves from s2 to s1.
In fact, the more server-dots we place onto the continuum, the fewer key misses we get when a server is removed (or added). A good number would be around 100 to 200 dots, since more dots would result in a slower lookup on the continuum (this has to be a very fast process, of course). The more servers you add, the better consistent hashing will perform.
Instead of almost 100% of the key-movements you have when using a standard modulus algorithm, the consistent hashing algorithm would maybe invalidate 10-25% of your keys (these numbers drop down quickly the more servers you use) which means the pressure on your backend system (like the database) will be much less than it would when using modulus.
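A minimal consistent-hash ring along the lines described above (md5 and 100 dots per server are arbitrary choices for this sketch):

    import bisect, hashlib

    class Ring:
        def __init__(self, servers, dots_per_server=100):
            self.ring = sorted(
                (self._hash("%s#%d" % (s, i)), s)
                for s in servers for i in range(dots_per_server)
            )
            self.points = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def server_for(self, key):
            # Walk clockwise to the first server dot at or after the key's position.
            i = bisect.bisect(self.points, self._hash(key)) % len(self.points)
            return self.ring[i][1]

    ring = Ring(["s1:11211", "s2:11211", "s3:11211"])
    print(ring.server_for("k1"))   # removing s2 only remaps keys owned by s2's dots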
Pros:
reduces database load.
perfect for websites with high database load
significantly reduces the number of retrieval requests to database
cuts down the I/O access
Cons:
not a persistent data store
not a database
not application specific
cannot cache large objects (key size limited to 250 characters, max value = 1MB) https://www.adayinthelifeof.nl/2011/02/06/memcache-internals/
Internally, all objects have a "counter". This counter holds a timestamp. Every time a new object is created, that counter is set to the current time. When an object gets fetched, it resets that counter to the current time as well. As soon as memcache needs to "evict" an object to make room for newer objects, it will find the lowest counter. That is the object that isn't fetched, or was fetched the longest time ago (and probably isn't needed that much, otherwise the counter would be closer to the current timestamp).
In effect this creates a simple system that uses the cache very efficiently: if an object isn't used, it's kicked out of the system.
But high-performance systems like memcache can get into trouble working that way. The reason is that the malloc() and free() functions are not really optimized for this kind of program; memory gets fragmented easily, which means a lot of memory is wasted.
In order to combat this malloc() problem, memcache does its own memory management by default. Memcache's memory manager will allocate the maximum amount of memory from the operating system that you have set (for instance, 64MB, but probably more) through one malloc() call. From that point on, it will use its own memory management system called the slab allocator.
Slab allocation
When memcache starts, it partitions its allocated memory into smaller parts called pages. Each page is 1MB large (coincidentally, the maximum size of an object you can store in memcache).
Each of those pages can be assigned to a slab-class, or can be unassigned (being a free page).
A slab-class decides how large the objects can be that are stored inside that particular page. Each page that is designated to a particular slab-class will be divided into smaller parts called chunks.
The chunks in each slab have the same size so there cannot be 2 different sized chunks inside the same page.
For instance, there could be a page with 64byte chunks (slab class 1), a page with 128byte chunks (slab class 2) and so on, until we get the largest slab with only 1 chunk (the 1MB chunk).
There can be multiple pages for each slab-class, but as soon as a page is assigned a slab-class (and thus, split up in chunks), it cannot be changed to another slab-class.
The smallest chunk size starts at 80 bytes and increases by a factor of 1.25 (rounded up until the next power of 2). So the second smallest chunk size would be 100, etc. You can actually find this out by issuing the "-vv" flag when starting memcache.
You can also set the factor (-f) and the initial chunk-size (-s), but unless you really know what you are doing, don’t change the initial values.
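A quick way to see the chunk-size progression for a given start size and growth factor (80 bytes and 1.25 are the defaults mentioned above; real memcached also applies its own alignment and rounding):

    def slab_sizes(start=80, factor=1.25, page_size=1024 * 1024):
        size, sizes = start, []
        while size < page_size:
            sizes.append(size)
            size = int(size * factor)     # memcached rounds/aligns this differently
        sizes.append(page_size)           # the final 1MB slab holds a single chunk
        return sizes

    print(slab_sizes()[:6])   # [80, 100, 125, 156, 195, 243]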
Memcache will initially create one page per slab class, and the rest of the pages will be free (whichever slab class needs a page gets a page).
Now that memcache has partitioned the memory, it can add data to the slabs. As soon as a complete page is full (all chunks in the page are filled) and we need to add another piece of data, it will fetch a new free page, assign it to the specified slab class, partition it into chunks, and use the first available chunk to store the data.
But as soon as there are no more pages left that we can designate to our slab-class, it will use the LRU-algorithm to evict one of the existing chunks to make room. This means that when we need a 128byte chunk, it will evict a 128byte chunk, even though there might be a 256byte chunk that is even older. Each slab-class has its own LRU.