Sunday, March 24, 2019

Memcached - Product Design

https://pdos.csail.mit.edu/6.824/notes/l-memcached.txt
how do FB apps use mc?
  read:
    v = get(k) (computes hash(k) to choose mc server)
    if v is nil {
      v = fetch from DB
      set(k, v)
    }
  write:
    v = new value
    send k,v to DB
    delete(k)
  application determines relationship of mc to DB
    mc doesn't know anything about DB
  FB uses mc as a "look-aside" cache
    real data is in the DB
    cached value (if any) should be same as DB
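
  a rough Python sketch of this look-aside pattern -- MC and db are toy dict
  stand-ins, not real memcached/MySQL client APIs:
    # toy look-aside cache: MC wraps a dict; db is a plain dict
    class MC(dict):
        def get(self, k): return super().get(k)     # None on miss
        def set(self, k, v): self[k] = v
        def delete(self, k): self.pop(k, None)

    db, mc = {"user:1": "alice"}, MC()

    def read(k):
        v = mc.get(k)            # hash(k) would pick the mc server
        if v is None:            # cache miss
            v = db[k]            # real data lives in the DB
            mc.set(k, v)         # populate the cache for later readers
        return v

    def write(k, v):
        db[k] = v                # DB is the source of truth
        mc.delete(k)             # invalidate; next read repopulates

    print(read("user:1"))        # miss -> DB -> set -> "alice"
    write("user:1", "bob")
    print(read("user:1"))        # miss again (deleted) -> "bob"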

  look-aside is much trickier than it looks -- consistency
    paper is trying to integrate mutually-oblivious storage layers
  cache is critical:
    not really about reducing user-visible delay
    mostly about surviving huge load!
    cache misses and failures can create intolerable DB load
  they can tolerate modest staleness: no freshness guarantee
  stale data nevertheless a big headache
    want to avoid unbounded staleness (e.g. missing a delete() entirely)
    want read-your-own-writes
    each performance fix brings a new source of staleness
  huge "fan-out" => parallel fetch, in-cast congestion

will partition or replication yield most mc throughput?
  partition: server i, key k -> mc server hash(k)
  replicate: server i, key k -> mc server hash(i)
  partition is more memory efficient (one copy of each k/v)
  partition works well if no key is very popular
  partition forces each web server to talk to many mc servers (overhead)
  replication works better if a few keys are very popular
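
  a toy sketch contrasting the two placements (illustrative only, not FB code):
    import hashlib

    SERVERS = ["mc0", "mc1", "mc2"]

    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def partitioned(key):
        # partition: each key lives on exactly one server, chosen by hash(key)
        return SERVERS[h(key) % len(SERVERS)]

    def replicated(web_server_id, key):
        # replicate: every server holds every key; hash(client) spreads the load
        return SERVERS[h(web_server_id) % len(SERVERS)]

    print(partitioned("timeline:42"))         # same server no matter who asks
    print(replicated("web7", "timeline:42"))  # server depends on the asking client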

performance and regions (Section 5)

Q: what is the point of regions -- multiple complete replicas?
   lower RTT to users (east coast, west coast)
   parallel reads of popular data due to replication
   (note DB replicas help only read performance, not write performance)
   maybe hot replica for main site failure?

Q: why not partition users over regions?
   i.e. why not east-coast users' data in east-coast region, &c
   social net -> not much locality
   very different from e.g. e-mail


Q: why OK performance despite all writes forced to go to the master region?
   writes would need to be sent to all regions anyway -- replicas
   users probably wait for round-trip to update DB in master region
     only 100ms, not so bad
   users do not wait for all effects of writes to finish
     i.e. for all stale cached values to be deleted
   
performance within a region (Section 4)

multiple mc clusters *within* each region
  cluster == complete set of mc cache servers
    i.e. a replica, at least of cached data

why multiple clusters per region?
  why not add more and more mc servers to a single cluster?
  1. adding mc servers to cluster doesn't help single popular keys
     replicating (one copy per cluster) does help
  2. more mcs in cluster -> each client req talks to more servers
     and more in-cast congestion at requesting web servers
     client requests fetch 20 to 500 keys! over many mc servers
     MUST request them in parallel (otherwise total latency too large)
     so all replies come back at the same time
     network switches, NIC run out of buffers
  3. hard to build network for single big cluster
     uniform client/server access
     so cross-section b/w must be large -- expensive
     two clusters -> 1/2 the cross-section b/w

but -- replicating is a waste of RAM for less-popular items
  "regional pool" shared by all clusters
  unpopular objects (no need for many copies)
  decided by *type* of object
  frees RAM to replicate more popular objects

bringing up new mc cluster was a serious performance problem
  new cluster has 0% hit rate
  if clients use it, will generate big spike in DB load
    if ordinarily 1% miss rate, and (let's say) 2 clusters,
      adding a "cold" third cluster will cause misses for 33% of ops.
    i.e. 30x spike in DB load!
  thus the clients of new cluster first get() from existing cluster (4.3)
    and set() into new cluster
    basically lazy copy of existing cluster to new cluster
  better 2x load on existing cluster than 30x load on DB

important practical networking problems:
  n^2 TCP connections is too much state
    thus UDP for client get()s
  UDP is not reliable or ordered
    thus TCP for client set()s
    and mcrouter to reduce n in n^2
  small request per packet is not efficient (for TCP or UDP)
    per-packet overhead (interrupt &c) is too high
    thus mcrouter batches many requests into each packet
    
mc server failure?
  can't have DB servers handle the misses -- too much load
  can't shift load to one other mc server -- too much
  can't re-partition all data -- time consuming
  Gutter -- pool of idle servers, clients only use after mc server fails

The Question:
  why don't clients send invalidates to Gutter servers?
  my guess: would double delete() traffic
    and send too many delete()s to small gutter pool
    since any key might be in the gutter pool

thundering herd
  one client updates DB and delete()s a key
  lots of clients get() but miss
    they all fetch from DB
    they all set()
  not good: needless DB load
  mc gives just the first missing client a "lease"
    lease = permission to refresh from DB
    mc tells others "try get() again in a few milliseconds"
  effect: only one client reads the DB and does set()
    others re-try get() later and hopefully hit
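
  a rough Python sketch of lease handling as described above (the class and
  names are illustrative, not memcached's real interface); note that delete()
  also invalidates an outstanding lease, which is the fix for Race 1 below:
    import itertools

    RETRY = "try again in a few ms"

    class LeaseCache:
        def __init__(self):
            self.data = {}
            self.lease = {}                  # key -> outstanding lease token
            self.tokens = itertools.count(1)

        def get(self, k):
            if k in self.data:
                return ("hit", self.data[k])
            if k in self.lease:              # someone is already refilling from DB
                return ("miss", RETRY)
            t = next(self.tokens)
            self.lease[k] = t
            return ("miss", t)               # only this client reads the DB + set()s

        def lease_set(self, k, v, token):
            if self.lease.get(k) == token:   # token still valid?
                self.data[k] = v
                del self.lease[k]
                return True
            return False                     # lease was invalidated: drop the set

        def delete(self, k):
            self.data.pop(k, None)
            self.lease.pop(k, None)          # a delete() invalidates any lease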

let's talk about consistency now

the big truth
  hard to get both consistency (== freshness) and performance
  performance for reads = many copies
  many copies = hard to keep them equal

what is their consistency goal?
  *not* read sees latest write
    since not guaranteed across clusters
  more like "not more than a few seconds stale"
    i.e. eventual
  *and* writers see their own writes
    read-your-own-writes is a big driving force

first, how are DB replicas kept consistent across regions?
  one region is master
  master DBs distribute log of updates to DBs in slave regions
  slave DBs apply
  slave DBs are complete replicas (not caches)
  DB replication delay can be considerable (many seconds)

how do we feel about the consistency of the DB replication scheme?
  good: eventual consistency, b/c single ordered write stream
  bad: longish replication delay -> stale reads

how do they keep mc content consistent w/ DB content?
  1. DBs send invalidates (delete()s) to all mc servers that might cache
  2. writing client also invalidates mc in local cluster
     for read-your-writes

why did they have consistency problems in mc?
  client code to copy DB to mc wasn't atomic:
    1. writes: DB update ... mc delete()
    2. read miss: DB read ... mc set()
  so *concurrent* clients had races

what were the races and fixes?

Race 1:
  k not in cache
  C1 get(k), misses
  C1 v = read k from DB
    C2 updates k in DB
    C2 and DB delete(k) -- does nothing
  C1 set(k, v)
  now mc has stale data, delete(k) has already happened
  will stay stale indefinitely, until key is next written
  solved with leases -- C1 gets a lease, but C2's delete() invalidates lease,
    so mc ignores C1's set
    key still missing, so next reader will refresh it from DB

Race 2:
  during cold cluster warm-up
  remember clients try get() in warm cluster, copy to cold cluster
  k starts with value v1
  C1 updates k to v2 in DB
  C1 delete(k) -- in cold cluster
  C2 get(k), miss -- in cold cluster
  C2 v1 = get(k) from warm cluster, hits
  C2 set(k, v1) into cold cluster
  now mc has stale v1, but delete() has already happened
    will stay stale indefinitely, until key is next written
  solved with two-second hold-off, just used on cold clusters
    after C1 delete(), cold ignores set()s for two seconds
    by then, delete() will propagate via DB to warm cluster
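
  a sketch of the hold-off idea, assuming the two-second window above (the
  class name is illustrative, not memcached's interface):
    import time

    HOLDOFF = 2.0                            # seconds, per the paper

    class HoldoffCache(dict):
        def __init__(self):
            super().__init__()
            self.deleted_at = {}             # key -> time of last delete()

        def delete(self, k):
            self.pop(k, None)
            self.deleted_at[k] = time.time()

        def set(self, k, v):
            # ignore set()s for HOLDOFF seconds after a delete, so a value lazily
            # copied from the warm cluster can't resurrect stale data
            if time.time() - self.deleted_at.get(k, 0.0) < HOLDOFF:
                return False                 # caller should re-read the DB later
            self[k] = v
            return True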

Race 3:
  k starts with value v1
  C1 is in a slave region
  C1 updates k=v2 in master DB
  C1 delete(k) -- local region
  C1 get(k), miss
  C1 read local DB  -- sees v1, not v2!
  later, v2 arrives from master DB
  solved by "remote mark"
    C1 delete() marks key "remote"
    get()/miss yields "remote"
      tells C1 to read from *master* region
    "remote" cleared when new data arrives from master region

Q: aren't all these problems caused by clients copying DB data to mc?
   why not instead have DB send new values to mc, so clients only read mc?
     then there would be no racing client updates &c, just ordered writes
A:
  1. DB doesn't generally know how to compute values for mc
     generally client app code computes them from DB results,
       i.e. mc content is often not simply a literal DB record
  2. would increase read-your-own writes delay
  3. DB doesn't know what's cached, would end up sending lots
     of values for keys that aren't cached

PNUTS does take this alternate approach of master-updates-all-copies

FB/mc lessons for storage system designers?
  cache is vital to throughput survival, not just a latency tweak
  need flexible tools for controlling partition vs replication
  need better ideas for integrating storage layers with consistency

https://www.quora.com/How-does-the-lease-token-solve-the-stale-sets-problem-in-Facebooks-memcached-servers


However, this will not happen because if the memcached server has recently given a lease token out, it will not give out another. In other words, B does not get a lease token in this scenario. Instead what B receives is a hot miss result which tells client B that another client (namely client A) is already headed to the DB for the value, and that client B will have to try again later.

So, leases work in practice because only one client is headed to the DB at any given time for a particular value in the cache.

Furthermore, when a new delete comes in, memcached will know that the next lease-set (the one from client A) is already stale, so it will accept the value (because it's newer) but it will mark it as stale and the next client to ask for that value will get a new lease token, and clients after that will once again get hot miss results.

Note also that hot miss results include the latest stale value, so a client can make the decision to go forth with slightly stale data, or try again (via exponential backoff) to get the most up-to-date data possible.
Scaling Memcache at Facebook
https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf
https://www.youtube.com/watch?v=6phA3IAcEJ8&t=67s
https://medium.com/@shagun/scaling-memcache-at-facebook-1ba77d71c082
https://zhuanlan.zhihu.com/p/20734038
https://zhuanlan.zhihu.com/p/20761071
https://zhuanlan.zhihu.com/p/20827183
FB's data centers are spread across different locations and need to keep their data consistent with each other; this part is about how correctness is maintained at both the database and Memcache layers.
First, the underlying data store is MySQL (not quite the open-source version; it has some Facebook-specific enhancements) in a master-slave replication setup: one region acts as the master region and accepts all writes, and MySQL's own replication broadcasts the updates to the other slave regions. That's the database layer: simple and direct.
What about the Memcache layer? Writes from the master region are easy; they follow exactly the same flow as before:
1. The web server sends the update to the database and then invalidates the local (same-cluster) cache
2. Web servers in the same cluster that hit a cache miss fetch the new data from the database and write it into the cache
But a non-master region cannot copy this flow, because the update is sent to the master region and may take a long time to be replicated back to the current region. Once the local cache entry is deleted, another reader will see a cache miss while the new data has not yet reached the local region's database; it will then read the already-stale value from the local database and write it into the cache. Once that happens, the stale data may live on for who knows how long.
To avoid this, we must guarantee that while the local cache entry has been deleted but the local database does not yet have the latest data, reads go to the master region for the correct value. The steps are as follows. Suppose a web server wants to update a key k:
1. First set a marker Rk somewhere. The marker means "I have updated this data, but the update has not yet reached my region's database."
2. Send k and Rk together in the SQL statement to the master region
3. Delete k in the local cluster
4. After the master region applies the update, it replicates the same update to the local region
5. When the local region receives it, it deletes k in the other clusters and removes the earlier marker Rk
If another read happens after step 3 but before step 5, it will see that Rk exists, so instead of reading the local database it goes directly to the master region's database, which largely guarantees it does not read stale data.
That's the rough idea. As for implementation details, Rk is stored in the regional pool, shared by the whole region (it is quite satisfying that exactly when you need something, an existing mechanism is right there to reuse).
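
A toy Python sketch of steps 1-5 above; plain dicts stand in for the regional pool, the master and local databases, and the local memcache cluster (not real APIs):
  regional_pool, master_db, local_db, local_mc = {}, {}, {}, {}

  def update_from_non_master_region(k, v):
      regional_pool["Rk:" + k] = True      # step 1: set the marker Rk
      master_db[k] = v                     # step 2: send the write to the master region
      local_mc.pop(k, None)                # step 3: delete k in the local cluster
      # steps 4-5 happen on the storage side: replication applies the write to
      # local_db and then clears the marker Rk

  def read_from_non_master_region(k):
      if k in local_mc:
          return local_mc[k]
      if regional_pool.get("Rk:" + k):     # marker present: local DB may be stale,
          v = master_db[k]                 # so go to the master region's database
      else:
          v = local_db[k]
      local_mc[k] = v
      return v
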
The paper also mentions a historical detail: before Facebook needed to expand to multiple regions, invalidating stale cache entries was done entirely by the web servers. As noted earlier, having web servers do this across clusters is already painful, and doing it across regions would be impractical (and prone to race conditions), so the logic was moved to the backend. The lesson is that systems evolve gradually; there is no perfect system built in one step.

D. Single Server Improvement

This part covers how to improve the performance of a single server.

D.1 Basic optimizations

There are three basic optimizations:
1) Allow the hash table to expand automatically (otherwise, when there are too many collisions, look-up time tends toward O(n))
2) Go from single-threaded to multi-threaded, protecting the data structures with a global lock
3) Give each thread its own port for communication
Then, building on that:
1) Replace the global lock with finer-grained locks, which improves throughput
2) Use UDP instead of TCP for communication, which improves throughput by roughly 10% (the paper ran two experiments, showing 8% and 13%)
D.2 Slab Allocator
The slab allocator is Memcached's memory management module. Its algorithm is roughly:
1. Define a range of chunk sizes (slab classes), a geometric sequence from 64 bytes up to 1 MB (factor 1.07)
2. Each slab class corresponds to a set of pre-allocated memory chunks
3. Whenever something needs to be stored, find the slab class of the matching size, look for a free chunk in that class, and put the data there. If no chunk is free, evict an unused chunk and store the data in the space that frees up.
The modification the paper proposes is to dynamically re-assign the memory given to each slab class. For example, originally the 64-byte class always has 100 chunks and the 1 MB class always has 10 chunks, both fixed. But as the workload changes, a server might store mostly items under 64 bytes in the morning and mostly 0.99 MB items in the evening. Ideally, in the morning we would hand the memory behind the 1 MB chunks to the 64-byte class, and in the evening hand the 64-byte memory back to the 1 MB class. That is the rough idea.
Concretely, continuing the example: if the 64-byte class keeps evicting to make room for new data, and the items it evicts are 20% younger than the least-recently-used items of the other slab classes, then the 64-byte class is short on memory; so we find the globally least-recently-used chunk among the other slab classes, resize it, and hand it to the 64-byte class that urgently needs the memory.
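
A rough Python sketch of this rebalancing decision (the 20% figure comes from the text above; the data layout and names are my own illustration, not memcached's actual code):
  def pick_rebalance(slab_classes):
      """slab_classes: {chunk_size: {"evicting": bool, "lru_age": seconds since
      that class's least-recently-used item was last touched}}"""
      ages = [c["lru_age"] for c in slab_classes.values()]
      avg_lru_age = sum(ages) / len(ages)
      for size, c in slab_classes.items():
          # a class that is evicting items noticeably "younger" than the other
          # classes' least-recently-used items is starved for memory
          if c["evicting"] and c["lru_age"] < 0.8 * avg_lru_age:
              donor = max(slab_classes, key=lambda s: slab_classes[s]["lru_age"])
              return donor, size             # move memory from donor to the needy class
      return None

  print(pick_rebalance({64: {"evicting": True,  "lru_age": 30},
                        1024: {"evicting": False, "lru_age": 600}}))
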
D.3 Transient Item Cache
Memcached is a procrastinator about eviction: it does not evict anything until memory is full. In most cases that is fine. But some keys are only useful for a very short window and are never used again afterwards, yet they stay in the cache occupying memory (picture someone lingering in the bathroom until the next person shows up) until they reach the tail of the eviction queue.
It is not really a big problem, they are just annoying; so how do we clear them out? The approach: for keys with short expiration times, put them in a separate data structure, check it periodically, and proactively delete any that have expired, even if memory is not yet full.
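
A small sketch of such a structure, assuming per-second expiration buckets that are flushed proactively (the class and names are illustrative, not memcached's implementation):
  from collections import defaultdict

  class TransientItemCache:
      def __init__(self, horizon=60):
          self.buckets = defaultdict(set)  # seconds-until-expiry -> keys
          self.now = 0
          self.horizon = horizon           # only "short" TTLs are tracked here

      def add(self, key, ttl):
          if ttl < self.horizon:
              self.buckets[self.now + ttl].add(key)

      def tick(self, cache):
          # called once per second: proactively evict every key in the bucket
          # that just expired, even though memory may not be full yet
          self.now += 1
          for key in self.buckets.pop(self.now, set()):
              cache.pop(key, None)
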
D.4 Software Upgrade
Every software update requires restarting the server, and then it takes a while before the cache fills back up with data. The fix is to put the cached data somewhere before the upgrade (a System V shared memory region; I'm not sure exactly what that is...), so that data can keep being used afterwards.

https://medium.com/@shagun/scaling-memcache-at-facebook-1ba77d71c082
In Facebook's context, users consume much more content than they create, so the workload is read-intensive and caching helps reduce it. Facebook uses Memcache clusters where each Memcache instance is a demand-filled, look-aside cache. This means that if a client requests data from Memcache and the data is not available, the client fetches the data from the database and populates the cache for further requests.

In the case of get requests, UDP performs better than TCP, and get errors are treated as cache misses, though insertion is not performed. This design choice seems practical as only 0.25% of requests fail due to late, dropped, or out-of-order packets. Though the response size is very small, the variation is quite large, with mean = 954 bytes and median = 135 bytes. Set and delete operations are still performed over TCP (for reliability), though the connections are coalesced to improve efficiency.

Within a cluster, data is distributed across hundreds of servers through consistent hashing. A very high request rate combined with large fanout leads to all-to-all communication between Memcache clients and servers, and even a single server can become a bottleneck for many requests. Clients construct a DAG representing the dependencies between data items so that independent requests can be fired concurrently. Facebook also provides a standalone proxy called mcrouter that acts as an interface to the Memcache servers and routes requests/replies to/from other servers. Along with these, a sliding-window flow control mechanism limits incast congestion.
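
A rough sketch of this fan-out, grouping one request's keys by destination server and fetching the groups in parallel (the hashing and the fetch_batch callback are simplified stand-ins, not Facebook's client):
  import hashlib
  from collections import defaultdict
  from concurrent.futures import ThreadPoolExecutor

  SERVERS = ["mc0", "mc1", "mc2", "mc3"]

  def server_for(key):
      return SERVERS[int(hashlib.md5(key.encode()).hexdigest(), 16) % len(SERVERS)]

  def multiget(keys, fetch_batch):
      # fetch_batch(server, keys) -> {key: value} stands in for one batched
      # network request to a single memcache server
      groups = defaultdict(list)
      for k in keys:
          groups[server_for(k)].append(k)          # one batch per destination
      out = {}
      with ThreadPoolExecutor() as pool:           # fire the batches in parallel
          for result in pool.map(lambda item: fetch_batch(*item), groups.items()):
              out.update(result)
      return out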

Lease

Leases are used to address stale sets (when a web server writes a stale value into the cache) and thundering herds (when a key undergoes heavy read and write activity). When a client experiences a cache miss, Memcache gives it a lease (a 64-bit token bound to the requested key). This lease is verified by Memcache when the client tries to set the value. If Memcache receives a delete request for the key, the lease is invalidated and the value cannot be set. To mitigate thundering herds, Memcache returns a token only once every 10 seconds per key. If a read request comes within 10 seconds of a token being issued, the client is notified to retry after a short time, by which point the updated value is expected to be set in the cache. In situations where returning stale data is not much of a problem, the client does not have to wait and stale data (at most 10 seconds old) is returned.

Memcache Pools

Since different workloads can have different access patterns and requirements, Memcache clusters are partitioned into pools. The default pool is called wildcard, and then there are separate pools for keys that cannot reside in the wildcard pool. A small pool can be provisioned for keys for which a cache miss is not expensive. Data is replicated within a pool when the request rate is quite high and the data can easily fit into a few machines. In Facebook's context, replication seems to work better than sharding, though it has the additional overhead of maintaining consistency.
In case a few Memcache servers fail, a small set of machines, called Gutter, takes over. In case of more widespread failures, requests are diverted to alternate clusters.

Cold cluster Warmup

When a new cluster is brought online, it takes time to get populated and initially the cache hit rate is very low. So a technique called Cold Cluster Warmup is employed, where a new cold cluster is populated with data from a warm cluster instead of the database. This way the cold cluster comes to full capacity within a few hours. But additional care is needed to account for race conditions. One example: a client in the cold cluster makes an update, and before this update reaches the warm cluster, another request for the same key is made from the cold cluster; the item in the cold cache would then be indefinitely inconsistent. To avoid this, Memcache rejects add operations for 2 seconds (called the holdoff time) once a delete operation has taken place. So if a value is updated in a cold cluster and a subsequent request is made within 2 seconds, the add operation is rejected, indicating that the data has been updated. 2 seconds is chosen because most updates seem to propagate within this time.

Across Region consistency

Regions, each comprising a storage cluster and several front-end clusters, are deployed in different locations. Only one of the regions contains the master database and the rest act as replicas. This adds the challenge of synchronisation.
Writes from the master region send invalidations only within the master region. Sending invalidations outside may lead to a race where deletes arrive before the data has been replicated. Facebook's mcsqueal daemon helps avoid that, at the cost of serving stale data for some time.
Writes from a non-master region are handled differently. Suppose a user updates his data from a non-master region with a large replication lag. A cache refill from a replica database is allowed only after the replication stream has caught up. A remote marker mechanism is used to minimise the probability of reading stale data. The presence of a marker indicates that the data in the replica is stale, and the query is redirected to the master region. When a web server updates a key k, it sets a remote marker rk in the region, performs the write to the master database for key k, and deletes k in the local cluster. The next time it tries to read k, it will experience a cache miss, will check whether rk exists, and will redirect its query to the master region. Had rk not been set, the query would have gone to the local cluster itself. Here latency is introduced to make sure the most up-to-date data is read.

Single Server Improvements

Facebook introduced many improvements for Memcache servers running as single instances as well. These include extending Memcache to support automatic expansion of the hash table without the look-up time drifting toward O(n), making the server multi-threaded with a global lock, and giving each thread its own UDP port to reduce contention.
Memcache uses an Adaptive Slab Allocator which organizes memory into slab classes: pre-allocated, uniformly sized chunks of memory. Items are stored in the smallest slab class that can fit the data. Slab sizes start at 64 bytes and reach up to 1 MB. Each slab class maintains a free-list of available chunks and requests more memory in 1 MB slabs when the free-list is empty. If no more free memory can be allocated, eviction takes place in Least Recently Used (LRU) fashion. The allocator periodically rebalances slab assignments to keep up with the workload: memory is freed in less used slabs and given to more heavily used slabs.
Most items are evicted lazily from memory, though some are evicted as soon as they expire. Short-lived items are placed into a circular buffer of linked lists (indexed by seconds until expiration), called the Transient Item Cache, based on the item's expiration time. Every second, all of the items in the bucket at the head of the buffer are evicted and the head advances by one. By adding a short expiration time to a heavily used set of keys whose items have short useful lifespans, the proportion of the Memcache pool used by this key family was reduced from 6% to 0.3% without affecting the hit rate.




Lessons Learnt

Memcache's design decisions are driven by data and not just intuition. Facebook seems to have experimented with a lot of configurations before arriving at decisions like using UDP for read operations and choosing values for parameters like the holdoff time. This is how it should be: data-driven decisions. In the entire process of developing Memcache, Facebook focused on optimizations that affect a good number of users and use cases instead of optimizing for each possible use case.
Facebook separated the caching layer from the persistence layer, which makes it easier to evolve the two layers individually. By handling the cache insertion logic in the application itself, Facebook made it easier to plug Memcache into different data stores. By modeling the probability of reading stale data as a parameter, they allowed the latency and hit rate on the persistence store to be configurable to some extent, thus broadening the scope for Memcache. Facebook breaks down a single page load request into multiple requests, thereby allowing a different kind of staleness tolerance for each component. For example, the servers serving the profile picture can cache content longer than the servers serving messages. This helps to lower the response latency and reduce the load on the data layer.
Facebook's Memcache is one of many solutions aimed at a rather common problem: a scalable, distributed key-value store. Amazon's Dynamo solves it well for a write-intensive workload with small data sizes, and Facebook solves it for a read-intensive workload. Other contenders in the list are Cassandra, Voldemort, Redis, Twitter's Memcache and more. The sheer number of possible, well-known solutions suggests that there are many dimensions to the problem, some of which are yet unknown, and that we still do not have a single solution that can be used for all use cases.


Every time a new cache box is added to the Memcache infrastructure, it has an empty cache. This box is called 'cold'. Every request to this box will end up in a 'miss', so clients will spend lots of resources filling the cache. In the worst case this will affect performance of the entire site, because on every 'miss' on the cold box the client spends much more time to actually get the data.
Mcrouter offers a way to 'warm up' cold cache boxes without impact on performance. The idea is simple: when we have a cold box and receive a 'miss', try to get the value from a 'warm' box (one with a filled cache) and set it back into the cold box.
Here is the setup for this logic:
 {
   "pools": {
     "cold": { "servers": [ /* cold hosts */ ] },
     "warm": { "servers": [ /* warm hosts */ ] }
   },
   "route": {
     "type": "WarmUpRoute",
     "cold": "PoolRoute|cold",
     "warm": "PoolRoute|warm"
   }
 }
Explanation: All sets and deletes go to the "cold" route handle. Gets are attempted on the "cold" route handle and in case of a miss, data is fetched from the "warm" route handle (where the request is likely to result in a cache hit). If "warm" returns a hit, the response is forwarded to the client and an asynchronous request updates the value in the "cold" route handle.
Since any client that wants to talk to memcached can already speak the standard ASCII memcached protocol, we use that as the common API and enter the picture silently. To a client, mcrouter looks like a memcached server. To a server, mcrouter looks like a normal memcached client. But mcrouter’s feature-rich configurability makes it more than a simple proxy.
Some features of mcrouter are listed below. In the following, a “destination” is a memcached host (or some other cache service that understands the memcached protocol) and “pool” is a set of destinations configured for some workload — e.g., a sharded pool with a specified hashing function, or a replicated pool with multiple copies of the data on separate hosts. Finally, pools can be organized into multiple clusters.
  • Connection pooling: Multiple clients can connect to a single mcrouter instance and share the outgoing connections, reducing the number of open connections to memcached instances.
  • Multiple hashing schemes: Mcrouter provides a proven consistent hashing algorithm (furc_hash) that allows distribution of keys across many memcached instances. Hostname hashing is useful for selecting a unique replica per client. There are a number of other hashes useful in specialized applications.
  • Prefix routing: Mcrouter can route keys according to common key prefixes. For example, you can send all keys starting with “foo” to one pool, “bar” prefix to another pool, and everything else to a “wildcard” pool. This is a simple way to separate different workloads.
  • Replicated pools: A replicated pool has the same data on multiple hosts. Writes are replicated to all hosts in the pool, while reads are routed to a single replica chosen separately for each client. This could be done either due to per-host packet limitations where a sharded pool would not be able to handle the read rate; or for increased availability of the data (one replica going down doesn't affect availability due to automatic failover).
  • Cold cache warm up: Mcrouter can smooth the performance impact of starting a brand new empty cache host or set of hosts (as large as an entire cluster) by automatically refilling it from a designated "warm" cache.
  • Broadcast operations: By adding a special prefix to a key in a request, it's easy to replicate the same request into multiple pools and/or clusters.
  • Reliable delete stream: In a demand-filled look-aside cache, it's important to ensure all deletes are eventually delivered to guarantee consistency. Mcrouter supports logging delete commands to disk in cases when the destination is not accessible (due to a network outage or other failure). A separate process then replays those deletes asynchronously. This is done transparently to the client; the original delete command is always reported as successful.
  • Quality of service: Mcrouter allows throttling the rate of any type of request (e.g., get/set/delete) at any level (per-host, per-pool, per-cluster), rejecting requests over a specified limit. We also support rate limiting requests to slow delivery.
  • Large values: Mcrouter can automatically split/re-stitch large values that would not normally fit in a memcached slab.
  • Multi-level caches: Mcrouter supports a local/remote cache setup, where values are looked up locally first and automatically set in the local cache after being fetched from the remote.
  • Production traffic shadowing: When testing new cache hardware, we found it extremely useful to be able to route a complete copy of production traffic from clients. Mcrouter supports flexible shadowing configuration. It's possible to shadow test a different pool size (re-hashing the key space), shadow only a fraction of the key space, or vary shadowing settings dynamically at runtime.
  • Online reconfiguration: Mcrouter monitors its configuration files and automatically reloads them on any file change; this loading and parsing is done on a background thread and new requests are routed according to the new configuration as soon as it's ready. There's no extra latency from the client's point of view.
  • Flexible routing: Configuration is specified as a graph of small routing modules called "route handles," which share a common interface (route a request and return a reply) and which can be composed freely. Route handles are easy to understand, create, and test individually, allowing for arbitrarily complex logic when used together. For example: An "all-sync" route handle will be set up with multiple child route handles (which themselves could be arbitrary route handles). It will pass a request to all of its children and wait for all of the replies to come back before returning one of these replies. Other examples include, among many others, "all-async" (send to all but don't wait for replies), "all-majority" (for consensus polling), and "failover" (send to every child in order until a non-error reply is returned). Expanding a pool can be done quickly by using a "cold cache warmup" route handle on the pool (with the old set of servers as the warm pool). Moving this route handle up the stack will allow an entire cluster to be warmed up from a warm cluster.
  • Destination health monitoring and automatic failover: Mcrouter keeps track of the health status of each destination. If mcrouter marks a destination as unresponsive, it will fail over incoming requests to an alternate destination automatically (fast failover) without attempting to send them to the original destination. At the same time health check requests will be sent in the background, and as soon as a health check is successful, mcrouter will revert to using the original destination. We distinguish between "soft errors" (e.g., data timeouts) that are allowed to happen a few times in a row and "hard errors" (e.g., connection refused) that cause a host to be marked unresponsive immediately. Needless to say, all of this is completely transparent to the client.


https://www.quora.com/In-scaling-Memcache-at-Facebook-4-1-Regional-Invalidations-is-there-an-example-of-how-McSqueal-actually-derives-the-keys-to-be-deleted-based-on-an-SQL-statement
I think you are asking about how memcached gets invalidated when SQL changes rows in the database. Facebook has a technology called wormhole that captures row changes from the MySQL binary logs and publishes them on a message bus. These messages can be used for invalidation.

https://zhuanlan.zhihu.com/p/20734038
A.1.1 The first optimization is to issue data requests in batches, with each batch fetched concurrently. Facebook's internal asynchronous programming model lets a web server build up the dependency relationships among its data as a DAG (directed acyclic graph); data items with no dependencies between them can then be requested concurrently as one batch. The average batch size is 24 keys.
A.1.2 Client-server communication
Memcached servers do not talk to each other; all the work happens on the client. The client's main job is, given a key, to figure out which memcached server to fetch the data from. This process is called routing. The client routing logic can be embedded into the web server as a library, or run as a standalone program.
Client-server communication uses both UDP and TCP. Get requests use UDP, which avoids the handshake TCP needs to establish a connection. If the client detects a dropped packet, it simply reports an error and does not try to recover the lost packet; the web server treats this as a cache miss. Using UDP lowered average latency by roughly 20%. For reliability, updates and deletes use TCP instead.
If there are too many packets in the network, incast congestion occurs (roughly: the network has many switches and routers, and if a pile of packets is sent all at once and arrives at a switch simultaneously, the switch cannot keep up). The countermeasure is to avoid sending a large number of packets at the same time, by limiting on the client how many packets are outstanding at once (implemented as a client-side queue). The number of packets allowed in flight at a time is called the window size. If this value is too small, sending is slow and latency rises; if it is too large, the flood of packets overwhelms the network switches, packets may get dropped and be treated as cache misses, and latency rises again. So this value has to be tuned.
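
A toy sketch of such a client-side window (illustrative only, not the real client library):
  from collections import deque

  class Window:
      def __init__(self, size):
          self.size = size                 # the "window size" being tuned
          self.in_flight = 0
          self.queue = deque()

      def submit(self, request):
          if self.in_flight < self.size:
              self.in_flight += 1
              return request               # ok to send now
          self.queue.append(request)       # hold back to avoid incast congestion
          return None

      def on_reply(self):
          self.in_flight -= 1
          if self.queue:
              self.in_flight += 1
              return self.queue.popleft()  # a reply frees a slot for the next send
          return None
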
A.2.1 Leases. Leases solve two problems:
1. Stale sets, where a client writes an out-of-date value into memcache
2. The thundering herd problem.
The implementation details are as follows: on each cache miss, memcache returns a token to the client, in effect saying "I don't have this data; go fetch it, then come back with this token to update me." If multiple clients read at the same time and all miss, they each receive their own token. But when accepting updates, say A and B obtained tokens one after the other: once A has updated the value with its token, B's later update with its token is no longer accepted. And if the system receives a delete request, no token is honored at all anymore.
Going a step further, we can limit how often tokens are issued, e.g. at most one every 10 seconds per key. For the clients that do not get a token: since someone else is already off fetching the data, why send them too? Just have them wait a short while and retry. Most of the time the updated value arrives within a few milliseconds, and their retry is a cache hit.
This is quite an elegant design.
A.2.2 Stale values
If the application layer can tolerate slightly stale data, this can be exploited to further reduce system load. When a key-value pair is deleted (by a delete request, or because the cache is full and needs space), it is moved into a temporary data structure where it lives on for a short while longer. Incoming requests may be served this value, marked as "stale". For most of Facebook's use cases, stale values are tolerable.
A.2.3 Memcache pools
Different use cases can have very different memcache access patterns. Some have high access rates, some low; a cache miss is sometimes no big deal and sometimes very expensive (e.g. when recomputing the value takes a very long time); sometimes the keys used this week are basically the same as last week's (low-churn), sometimes very different (high-churn).
Take the last example (low-churn vs. high-churn): server A's keys never change and cycle within a small set, while server B's keys could be almost anything. If they share one pool, B's constant stream of new keys will keep evicting A's keys, whereas ideally A's pool would hardly ever need to replace a key. That wastes a lot of performance.
So setting up different pools for different needs is well worth it.
A.2.4 Replication
The last load-reduction technique is replication within each pool. Imagine the following scenario: we have one memcache machine that can sustain a request rate of 500K requests/s, and we now want to support 1M/s. Add a machine and split the data in half between the two? But clients usually request many keys as one batch. If each request previously read 100 keys, it now becomes two requests of 50 keys each, sent to the two machines at the same time. Since each machine is still limited to 500K requests/s, the total request rate is still 500K/s.
So how do we break through? Add a machine and copy the data to it in full. Now a client still sends one read with 100 keys, but 50% of the requests go to the old machine and 50% to the new one. This supports a request rate of 500K/s * 2.
A.3 Handling failures
Failures come in large and small. For a large-scale loss of servers, the affected cluster is simply taken offline and web server requests are redirected to the memcache of other clusters. This is uncommon, and the response is fairly blunt.
Small outages need more care. A small incident does not justify decommissioning the whole cluster, but left alone it can easily cascade: too many cache misses send all the web servers straight to the backend services, the backend cannot cope and goes down, or the backend database cannot cope and goes down, and then the whole site is down, full stop. The paper's answer is a pool of dedicated standby memcache servers (Gutter). As soon as a get request gets no response at all, the client assumes that server is down and queries Gutter instead. The effect of this optimization:
"In practice, this system reduces the rate of client-visible failures by 99% and converts 10%–25% of failures into hits each day"
https://zhuanlan.zhihu.com/p/20761071
B. Region Replication
(Region is hard to translate; think of it as roughly the concept of a data center)
Earlier we talked about grouping a bunch of web servers and a bunch of memcached servers into a cluster. But a cluster cannot scale up indefinitely, because connections within a cluster are many-to-many, i.e. up to N^2 connections, so the larger N gets:
1. Hot-key traffic keeps growing, until the server holding a hot key can no longer cope
2. The network congests and packets get dropped
3. ...
So the paper builds multiple clusters (to distinguish them, a cluster containing some number of web servers and memcached servers is called a frontend cluster from here on) and combines them with one storage cluster to form a region. All frontend clusters share the same backend service.
This is essentially trading space for time: the same key may have a copy in every frontend cluster, which brings consistency problems (addressed below), but it lowers latency and raises availability.
B.1 Regional Invalidation
The problem to solve here: a server in cluster A has modified a value in the backend database; how do we propagate the news that this value changed to all frontend clusters, so they can delete the corresponding stale entries from their local memcached?
A simple approach is to make the web server that performed the modification notify every memcached in every cluster of the region, but for all the reasons mentioned above this performs poorly and is prone to routing misconfiguration, so this is only done within the same cluster.
For the other clusters, the solution is to let the storage cluster broadcast the message. The storage layer is MySQL, and MySQL keeps a log of data updates. The first step is to add the corresponding memcached keys to the log content; then a process watches the log, and whenever it sees an update it parses out the keys and broadcasts the invalidation to all the memcached servers.
One thing to note: we should not let the database talk to memcached directly, for reasons including but not limited to:
1. Far too many connections
2. The routing logic would have to move into the backend
So each cluster has dedicated servers running the mcrouter process (recall that the routing logic can be embedded as a library or run as a standalone process). The broadcast messages are sent to these processes, which then forward them to the memcached servers in their own cluster.
B.2 Regional Pool
Each frontend cluster has its own memcached pools; let's call them cluster pools. In contrast, a regional pool is, as the name suggests, a memcached pool shared by all clusters. Why set up such a pool?
The main reason is that some data is accessed infrequently yet takes up a lot of memory; replicating a copy of it in every cluster costs a lot of extra memory. Putting it in the regional pool instead reduces the memory footprint. The regional pool's latency is relatively higher (requests have to cross cluster boundaries), but since this data is not accessed often, that matters much less.
The paper mentions that what goes into the regional pool is currently decided by human judgment; I don't know whether this has been automated by now.
B.3 Cold Cluster Warmup
When we bring up a new cluster, or take a cluster down for maintenance, and so on, that cluster's cache is essentially empty, so almost everything is a cache miss; it would take a long time for the cache to fill up, and this also puts a lot of pressure on the backend.
The fix is easy to come up with: on a cache miss, allow the cold cluster to treat another "warm cluster" (one whose cache already holds plenty of data) as if it were the storage cluster and read from it. A warm-up that would otherwise take days completes within a few hours.
But this creates a new problem. Imagine the following scenario: a server in cluster A updates a piece of data in storage; after the update completes, the backend sends invalidations to the other clusters. Meanwhile another server in cluster A reads this data, gets a cache miss, and fetches it from cluster B; if cluster B happens not to have received the invalidation yet (message delivery takes time too), the read returns successfully, and the server then writes the actually-stale data into cluster A's memcached. Cluster A's memcached now holds a stale value, and that stale value may stick around for a very long or even unbounded time.
The fix: memcached supports locking a key for a period of time after a delete, during which add operations are refused. By setting this hold-off time to a value greater than zero (2 seconds) in the cold cluster, in the scenario above the updating server has already issued a delete against the local memcached, so when the second server tries to write the stale data it gets an error; it then knows storage has a newer value, reads the latest value, and writes that instead. In theory stale data can still appear, but the probability is greatly reduced.
http://www.shuatiblog.com/blog/2015/01/24/distributed-caching-memcached/
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
Memcached servers are indeed independent of each other. A Memcached server is just an efficient key-value store implemented as an in-memory hash table.

• It cannot cache large objects
What makes memcached distributed is the client, which in most implementations can connect to a pool of servers. Typical implementations use consistent hashing, which means that when you add or remove a server to/from a pool of N servers, you only have to remap 1/N of the keys.
Typically keys are not duplicated on various hosts, as memcached is not meant to be a persistent store and gives no guarantees that your value will persist (for example, when running out of assigned memory, a memcached server drops least recently used (LRU) elements). Thus it's assumed that your application should handle missing keys.
https://miafish.wordpress.com/2015/03/06/memcached/
Here is where consistent hashing is advantageous: suppose server 2 will be removed from the memcache server pool. What would happen when we want to fetch key k1? Nothing strange: we plot k1 on the same position in the continuum, and the first server-dot it hits is still s1.
However, when fetching k3, which is stored on s2, it would miss the s2-dot (since it has been removed), and will be moved to server s3. The same goes for k2, which moves from s2 to s1.
In fact, the more server-dots we place onto the continuum, the fewer key-misses we get when a server gets removed (or added). A good number would be around 100 to 200 dots, since more dots would result in a slower lookup on the continuum (this has to be a very fast process of course). The more servers you add, the better consistent hashing will perform.
Instead of the almost 100% of key-movements you get with a standard modulus algorithm, the consistent hashing algorithm might invalidate only 10-25% of your keys (these numbers drop quickly the more servers you use), which means the pressure on your backend system (like the database) will be much less than it would be when using modulus.
Pros:
reduces database load
perfect for websites with high database load
significantly reduces the number of retrieval requests to the database
cuts down the I/O access
Cons:
not a persistent data store
not a database
not application specific
cannot cache large objects (key size = 255 chars, max value = 1MB)
https://www.adayinthelifeof.nl/2011/02/06/memcache-internals/
Internally, all objects have a "counter". This counter holds a timestamp. Every time a new object is created, that counter will be set to the current time. When an object gets fetched, it will reset that counter to the current time as well. As soon as memcache needs to "evict" an object to make room for newer objects, it will find the lowest counter. That is the object that isn't fetched, or was fetched the longest time ago (and probably isn't needed that much, otherwise the counter would be close to the current timestamp).
In effect this creates a simple system that uses the cache very efficiently. If it isn't used, it's kicked out of the system.

But high performance systems like memcache can get into trouble working that way. The reason is that the malloc() and free() functions are not really optimized for such programs. Memory gets fragmented easily, which means a lot of memory gets wasted.

In order to combat this "malloc()" problem, memcache does its own memory management by default. Memcache's memory manager will allocate the maximum amount of memory from the operating system that you have set (for instance, 64MB, but probably more) through one malloc() call. From that point on, it will use its own memory management system called the slab allocator.

Slab allocation
When memcache starts, it partitions its allocated memory into smaller parts called pages.
Each page is 1MB large (coincidentally, the maximum size of an object you can store in memcache).
Each of those pages can be assigned to a slab-class, or can be unassigned (being a free page).

A slab-class decides how large the objects stored inside that particular page can be. Each page that is designated to a particular slab-class will be divided into smaller parts called chunks.

The chunks in each slab have the same size, so there cannot be two different-sized chunks inside the same page.

For instance, there could be a page with 64-byte chunks (slab class 1), a page with 128-byte chunks (slab class 2) and so on, until we get the largest slab with only one chunk (the 1MB chunk).

There can be multiple pages for each slab-class, but as soon as a page is assigned a slab-class (and thus split up into chunks), it cannot be changed to another slab-class.

The smallest chunk-size starts at 80 bytes and increases with a factor of 1.25 (rounded up until the next power of 2). So the second smallest chunk size would be 100, etc. You can actually find them out by issuing the "-vv" flag when starting memcache.

You can also set the factor (-f) and the initial chunk-size (-s), but unless you really know what you are doing, don't change the initial values.

Memcache will initially create one page per slab-class and the rest of the pages will be free (whichever slab class needs a page gets a page).

Now that memcache has partitioned the memory, it can add data to the slabs. As soon as a complete page is full (all chunks in the page are filled) and we need to add another piece of data, it will fetch a new free page, assign it to the specified slab-class, partition it into chunks and use the first available chunk to store the data.

But as soon as there are no more pages left that we can designate to our slab-class, it will use the LRU algorithm to evict one of the existing chunks to make room. This means that when we need a 128-byte chunk, it will evict a 128-byte chunk, even though there might be a 256-byte chunk that is even older. Each slab-class has its own LRU.

Consistent hashing
$server_id = hashfunc($key) % $servercount;

The trouble with this system: as soon as $servercount (the number of servers) changes, almost 100% of all keys will change server as well.
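
A small Python sketch of a consistent-hash ring with multiple dots per server; removing one of three servers remaps only roughly that server's share of keys, instead of nearly all of them (parameters are illustrative):
  import bisect, hashlib

  class Ring:
      def __init__(self, servers, dots=100):
          # each server gets `dots` points on the continuum
          self.ring = sorted(
              (int(hashlib.md5(f"{s}#{i}".encode()).hexdigest(), 16), s)
              for s in servers for i in range(dots))

      def lookup(self, key):
          h = int(hashlib.md5(key.encode()).hexdigest(), 16)
          i = bisect.bisect(self.ring, (h,)) % len(self.ring)
          return self.ring[i][1]           # first server-dot clockwise of the key

  keys = ["k%d" % i for i in range(10000)]
  before = Ring(["s1", "s2", "s3"])
  after = Ring(["s1", "s3"])               # s2 removed from the pool
  moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
  print(moved / len(keys))                 # roughly 1/3 of keys move, not ~100%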
