I recently wrote a few troubleshooting logs on our company wiki and figured they were worth sharing, so I'm posting them here.
The business team reported instability when writing to the memcache cluster: when storing a list of creative and ad objects in mc, sometimes the write succeeded and the value could be read back, and sometimes the write reported success but the value could not be read back.
Reproducing the problem through the same proxy showed the same mix of behaviors, so the proxy was ruled out.
During the investigation we also found that the affected keys were shorter than 20B and the values were roughly 9K~10K.
Status of each node in the cluster, taking 10.0.0.1:11211 as an example:
============10.0.0.1:11211
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
1 96B 604774s 158 1021044 no 0 0 0
2 120B 3600s 2 10013 no 0 0 0
3 152B 186s 1 1 no 0 0 0
4 192B 6157491s 1 69 no 0 0 0
5 240B 1721785s 1 835 no 0 0 0
6 304B 1721786s 1 2673 no 0 0 0
7 384B 1721783s 1 91 no 0 0 0
8 480B 1721786s 1 6 no 0 0 0
9 600B 140686s 2 1 no 0 0 0
10 752B 125s 7 11 no 0 0 0
11 944B 121s 4 940 no 0 0 0
12 1.2K 120s 9 4666 yes 3 562 0
13 1.4K 121s 5 1447 yes 2047 495 0
14 1.8K 437s 754 1209 no 0 0 0
15 2.3K 83618s 58 261 yes 575 138922 0
16 2.8K 172787s 558 80573 no 0 0 0
17 3.5K 172780s 576 131417 yes 96835 172745 0
18 4.4K 172788s 2869 169486 no 0 0 0
19 5.5K 3576s 90 16560 yes 3357047 3577 0
20 6.9K 118334s 7 988 yes 1 72968 0
21 8.7K 82s 6 708 yes 12016644 85 88
22 10.8K 1s 2 188 yes 393841058 1 8640
23 13.6K 1s 1 75 yes 118541885 1 1153
24 16.9K 59s 1 60 yes 1262831 60 14
25 21.2K 338s 1 16 no 0 0 0
26 26.5K 144s 1 5 no 0 0 0
27 33.1K 21s 1 1 no 0 0 0
28 41.4K 5s 1 2 no 0 0 0
30 64.7K 23s 1 2 no 0 0 0
31 80.9K 0s 1 0 no 0 0 0
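(The per-slab snapshot above is in the format printed by the memcached-tool script that ships with the memcached source; assuming that script is on hand, each node can be dumped with something like the following.)

```bash
$ memcached-tool 10.0.0.1:11211 display
```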
Looking at the status of each node, the slab classes were Full to varying degrees.
Three nodes were in fairly bad shape, with very high eviction counts.
Connecting a client directly to individual nodes, reads and writes were fine on some nodes while others showed the problem described earlier, so the issue was clearly on specific cluster nodes.
It was basically clear that the problem was related to the Full state and heavy evictions of the 5.5K~16.9K slab classes on nodes like 10.0.0.1:11211.
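A minimal version of that direct per-node check over the text protocol, sized to hit the same slab class as the problem values (key name, TTL, and the grep filter are illustrative):

```bash
# Store a ~9.5K value directly on one node, then read it straight back.
VAL=$(head -c 9500 /dev/zero | tr '\0' 'x')
printf 'set test:ad:123 0 600 9500\r\n%s\r\nget test:ad:123\r\nquit\r\n' "$VAL" |
  nc 10.0.0.1 11211 | grep -aE '^(STORED|VALUE|END)'
# Healthy node: STORED, then VALUE test:ad:123 0 9500, then END.
# Problem node: STORED but only END on the get, matching the symptom above.
```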
Memory usage
The memory limit (maxmemory) is 5GB per instance, and each instance reports only about 1.5GB in use.
Memory isn't full, so why can't the data be stored?
Back to the node status again, taking 10.0.0.1:11211 as an example:
Adding up all the allocated pages: 158+2+1+1+1+1+1+1+2+7+4+9+5+754+58+558+576+2869+90+7+6+2+1+1+1+1+1+1+1+1 = 5121 pages, roughly 5GB.
5GB is exactly the maxmemory assigned to each node.
So every memory page has already been assigned to some slab class. Even though some of those pages are now free after their items were reclaimed, the free pages are not returned to a global pool where other slab classes could use them.
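A quick way to double-check that sum straight from the tool output (a rough one-liner, assuming the column layout shown above and 1MB pages):

```bash
$ memcached-tool 10.0.0.1:11211 display | awk 'NR > 1 { pages += $4 } END { print pages " pages ~ " pages " MB" }'
```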
Look at the row with a chunk size of 1.8K: it has 754 pages assigned but only 1209 items, i.e. roughly 2MB of actual data (1209 × 1.8K) sitting in 754MB of allocated space. A serious waste.
Why can't mc reclaim free space that has already been allocated to a slab class?
Problem located: mc does not reclaim free space that has already been assigned to slab classes.
1. Set up my own standalone 64MB mc node.
2. Fill it with 4K values:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 6s 64 14720 yes 2056 2 0
3. Delete all the data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 64 0 yes 0 0 0
4. Fill it again with 1K values:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
12 1.2K 9s 2 885 no 66222 0 0
18 4.4K 0s 63 0 yes 0 0 0
This time many of the 1K values were evicted, even though the 4K data from the previous round had already been deleted: most of its pages were still assigned to the 4.4K slab class.
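For reference, a rough sketch of this fill/delete/refill test over the plain text protocol (key names, counts, and netcat behaviour are illustrative, not the exact commands used):

```bash
#!/bin/bash
# Rough reproduction sketch against a local 64MB memcached instance.
HOST=localhost
PORT=11211

fill() {  # fill <prefix> <value_bytes> <count>
  local val; val=$(head -c "$2" /dev/zero | tr '\0' 'a')
  { for i in $(seq 1 "$3"); do
      printf 'set %s:%d 0 0 %d\r\n%s\r\n' "$1" "$i" "$2" "$val"
    done; printf 'quit\r\n'; } | nc "$HOST" "$PORT" >/dev/null
}

wipe() {  # wipe <prefix> <count>
  { for i in $(seq 1 "$2"); do printf 'delete %s:%d\r\n' "$1" "$i"; done
    printf 'quit\r\n'; } | nc "$HOST" "$PORT" >/dev/null
}

fill big 4000 20000     # step 2: ~4K values until the 64MB instance is full
wipe big 20000          # step 3: delete everything just written
fill small 1000 70000   # step 4: ~1K values; Evicted climbs in class 12
```

After the refill, the reassignment counters reported by `stats` were: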
STAT slab_reassign_running 0
STAT slabs_moved 2
The stats show that some slab reassignment did happen (the instance was started with the slab_reassign and slab_automove options), but it falls far short of what we need.
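For reference, the test instance behind these numbers was started roughly like this (size and port are illustrative; the option names come from the 1.4.11 release notes quoted below), and the two counters above can be pulled out of the general `stats` output:

```bash
$ memcached -d -m 64 -p 11211 -o slab_reassign,slab_automove
$ echo "stats" | nc localhost 11211 | grep -E 'slab_reassign_running|slabs_moved'
```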
The 1.4.11 release notes say:
Slab Reassign
Long running instances of memcached may run into an issue where all available memory has been assigned to a specific slab class (say items of roughly size 100 bytes). Later the application starts storing more of its data into a different slab class (items around 200 bytes). Memcached could not use the 100 byte chunks to satisfy the 200 byte requests, and thus you would be able to store very few 200 byte items.
1.4.11 introduces the ability to reassign slab pages. This is a beta feature and the commands may change for the next few releases, so please keep this in mind. When the commands are finalized they will be noted in the release notes.
Slab reassignment can only be enabled at start time:
```
$ memcached -o slab_reassign
```
Once all memory has been assigned and used by items, you may use a command to reassign memory.
```
$ echo "slabs reassign 1 4" | nc localhost 11211
```
That will return an error code indicating success, or a need to retry later. Success does not mean that the slab was moved, but that a background thread will attempt to move the memory as quickly as it can.
Slab Automove
While slab reassign is a manual feature, there is also the start of an automatic memory reassignment algorithm.
```
$ memcached -o slab_reassign,slab_automove
```
The above enables it on startup. slab_automove requires slab_reassign first be enabled.
automove itself may also be enabled or disabled at runtime:
```
$ echo "slabs automove 0" | nc localhost 11211
```
The algorithm is slow and conservative. If a slab class is seen as having the highest eviction count 3 times 10 seconds apart, it will take a page from a slab class which has had zero evictions in the last 30 seconds and move the memory.
There are lots of cases where this will not be sufficient, and we invite the community to help improve upon the algorithm. Included in the source directory is scripts/mc_slab_mover. See perldoc for more information:
```
$ perldoc ./scripts/mc_slab_mover
```
It implements the same algorithm as built into memcached, and you may modify it to better suit your needs and improve on the script or port it to other languages. Please provide patches!
Slab Reassign Implementation
Slab page reassignment requires some tradeoffs:
- All items larger than 500k (even if they're under 730k) take 1MB of space
- When memory is reassigned, all items that were in the 1MB page are evicted
- When slab reassign is enabled, an extra background thread is used
Let's try `echo "slabs reassign 1 4" | nc localhost 11211` and then fill with 1K data again:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
12 1.2K 6s 7 6195 yes 193356 1 0
18 4.4K 0s 58 0 yes 0 0 0
4 pages were migrated. Next, run `echo "slabs reassign 1 10" | nc localhost 11211` and write the 1K data one more time:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
12 1.2K 50s 14 12390 yes 254268 31 0
18 4.4K 0s 51 0 yes 0 0 0
STAT slab_reassign_running 0
STAT slabs_moved 13
STAT bytes 13232520
Now write a round of 2K data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
12 1.2K 223s 13 11505 yes 254268 31 0
15 2.3K 15s 2 902 yes 32651 4 0
18 4.4K 0s 51 0 yes 0 0 0
STAT slab_reassign_running 0
STAT slabs_moved 14
STAT bytes 14154480
Reassignment is getting faster, but it is still far from what we need, and we cannot keep running slabs reassign by hand.
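One stopgap before upgrading would be to script the manual reassign, which is what scripts/mc_slab_mover (mentioned in the release notes above) does. A very rough sketch of the idea, not the bundled script itself:

```bash
#!/bin/bash
# Every 10s, move one page from the class with the most free chunks to the class
# with the most evictions. Illustrative only; scripts/mc_slab_mover in the memcached
# source tree is the real, more careful implementation (it tracks eviction deltas).
HOST=localhost
PORT=11211
while sleep 10; do
  dst=$(echo "stats items" | nc "$HOST" "$PORT" |
        awk -F'[ :]' '$4 == "evicted" && $5+0 > max { max = $5+0; id = $3 } END { print id }')
  src=$(echo "stats slabs" | nc "$HOST" "$PORT" |
        awk -F'[ :]' '$3 == "free_chunks" && $4+0 > best { best = $4+0; id = $2 } END { print id }')
  if [ -n "$src" ] && [ -n "$dst" ] && [ "$src" != "$dst" ]; then
    echo "slabs reassign $src $dst" | nc "$HOST" "$PORT"
  fi
done
```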
We were running mc 1.4.13. Does a newer version of mc solve this?
I downloaded the latest release, 1.4.33, and repeated the tests above:
Fill with 4K values:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 11s 64 14720 yes 2056 1 0
Delete all the data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 64 0 yes 0 0 0
Fill with 1K values:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
12 1.2K 8s 1 885 yes 66222 1 0
18 4.4K 0s 64 0 yes 0 0 0
Seemingly no change at all.
Time to pull the latest source and see how memory is allocated for a new item.
```c
item *do_item_alloc(char *key, const size_t nkey, const unsigned int flags,
                    const rel_time_t exptime, const int nbytes) {
    ... /* local declarations (i, it, suffix, nsuffix) elided in this excerpt */
    /* Compute the total space the item will occupy */
    size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix);
    if (settings.use_cas) {
        ntotal += sizeof(uint64_t);
    }

    /* Map the total size to a slab class id */
    unsigned int id = slabs_clsid(ntotal);
    if (id == 0)
        return 0;

    /* If no memory is available, attempt a direct LRU juggle/eviction */
    /* This is a race in order to simplify lru_pull_tail; in cases where
     * locked items are on the tail, you want them to fall out and cause
     * occasional OOM's, rather than internally work around them.
     * This also gives one fewer code path for slab alloc/free
     */
    /* TODO: if power_largest, try a lot more times? or a number of times
     * based on how many chunks the new object should take up?
     * or based on the size of an object lru_pull_tail() says it evicted?
     * This is a classical GC problem if "large items" are of too varying of
     * sizes. This is actually okay here since the larger the data, the more
     * bandwidth it takes, the more time we can loop in comparison to serving
     * and replacing small items.
     */
    for (i = 0; i < 10; i++) {
        uint64_t total_bytes;
        /* Try to reclaim memory first */
        if (!settings.lru_maintainer_thread) {
            /* No maintainer thread: see if the COLD LRU has anything to give back */
            lru_pull_tail(id, COLD_LRU, 0, 0);
        }
        it = slabs_alloc(ntotal, id, &total_bytes, 0); /* try to grab a chunk from the slab class */
        if (settings.expirezero_does_not_evict)
            total_bytes -= noexp_lru_size(id);
        if (it == NULL) {
            if (settings.lru_maintainer_thread) {
                lru_pull_tail(id, HOT_LRU, total_bytes, 0);
                lru_pull_tail(id, WARM_LRU, total_bytes, 0);
                if (lru_pull_tail(id, COLD_LRU, total_bytes, LRU_PULL_EVICT) <= 0)
                    break;
            } else {
                if (lru_pull_tail(id, COLD_LRU, 0, LRU_PULL_EVICT) <= 0)
                    break;
            }
        } else {
            break;
        }
    }

    if (i > 0) {
        /* It took more than one attempt to get the memory */
        pthread_mutex_lock(&lru_locks[id]);
        itemstats[id].direct_reclaims += i;
        pthread_mutex_unlock(&lru_locks[id]);
    }

    if (it == NULL) {
        pthread_mutex_lock(&lru_locks[id]);
        itemstats[id].outofmemory++;
        pthread_mutex_unlock(&lru_locks[id]);
        return NULL;
    }

    assert(it->slabs_clsid == 0);
    //assert(it != heads[id]);

    /* Refcount is seeded to 1 by slabs_alloc() */
    it->next = it->prev = 0;

    /* Items are initially loaded into the HOT_LRU. This is '0' but I want at
     * least a note here. Compiler (hopefully?) optimizes this out.
     */
    if (settings.lru_maintainer_thread) {
        if (exptime == 0 && settings.expirezero_does_not_evict) {
            id |= NOEXP_LRU;
        } else {
            id |= HOT_LRU;
        }
    } else {
        /* There is only COLD in compat-mode */
        id |= COLD_LRU;
    }
    it->slabs_clsid = id;
    ...
    return it;
}
```
This is the path taken when a new key/value arrives but there is no free chunk: up to 10 attempts are made to reclaim space and retry the allocation.
```c
static void *lru_maintainer_thread(void *arg) {
    rel_time_t last_crawler_check = 0;
    struct crawler_expired_data cdata;
    memset(&cdata, 0, sizeof(struct crawler_expired_data));
    pthread_mutex_init(&cdata.lock, NULL);
    cdata.crawl_complete = true; // kick off the crawler.

    /* Endless loop: reclaim space for one specific slab_clsid, or for all of them */
    while (do_run_lru_maintainer_thread) {
        ...
        /* We were asked to immediately wake up and poke a particular slab
         * class due to a low watermark being hit.
         * did_moves: how much juggling was done this round; used to decide how
         * long the maintainer thread sleeps before the next pass. */
        if (lru_maintainer_check_clsid != 0) {
            did_moves = lru_maintainer_juggle(lru_maintainer_check_clsid);
            lru_maintainer_check_clsid = 0;
        } else {
            for (i = POWER_SMALLEST; i < MAX_NUMBER_OF_SLAB_CLASSES; i++) {
                did_moves += lru_maintainer_juggle(i);
            }
        }
        ...
        /* Once per second at most */
        if (settings.lru_crawler && last_crawler_check != current_time) {
            lru_maintainer_crawler_check(&cdata);
            last_crawler_check = current_time;
        }
    }
    return NULL;
}
```
```c
/* Loop up to N times:
 * - if HOT_LRU holds too many items, move them to COLD_LRU
 * - if WARM_LRU holds too many items, move them to COLD_LRU
 * - if COLD_LRU holds too many items, re-queue them at the COLD_LRU tail
 * 1000 loops with 1ms min sleep gives us under 1m items shifted/sec. The
 * locks can't handle much more than that. Leaving a TODO for how to
 * autoadjust in the future.
 */
static int lru_maintainer_juggle(const int slabs_clsid) {
    int i;
    int did_moves = 0;
    bool mem_limit_reached = false;
    uint64_t total_bytes = 0;
    unsigned int chunks_perslab = 0;
    unsigned int chunks_free = 0;
    /* TODO: if free_chunks below high watermark, increase aggressiveness */
    /* These counters live in the slabclass and global structures; read them
     * to decide whether this class needs reclaiming */
    chunks_free = slabs_available_chunks(slabs_clsid, &mem_limit_reached,
            &total_bytes, &chunks_perslab);
    if (settings.expirezero_does_not_evict)
        total_bytes -= noexp_lru_size(slabs_clsid);

    /* If slab automove is enabled at any level and this class holds more than
     * 2.5 pages worth of free chunks, hand pages from this class back to the
     * global pool */
    if (settings.slab_automove > 0 && chunks_free > (chunks_perslab * 2.5)) {
        /* Move free pages from slabs_clsid to SLAB_GLOBAL_PAGE_POOL, keeping
         * at least two pages in the source class */
        slabs_reassign(slabs_clsid, SLAB_GLOBAL_PAGE_POOL);
    }

    /* Juggle HOT/WARM up to N times */
    for (i = 0; i < 1000; i++) {
        int do_more = 0;
        if (lru_pull_tail(slabs_clsid, HOT_LRU, total_bytes, LRU_PULL_CRAWL_BLOCKS) ||
            lru_pull_tail(slabs_clsid, WARM_LRU, total_bytes, LRU_PULL_CRAWL_BLOCKS)) {
            do_more++;
        }
        do_more += lru_pull_tail(slabs_clsid, COLD_LRU, total_bytes, LRU_PULL_CRAWL_BLOCKS);
        if (do_more == 0)
            break;
        did_moves++;
    }
    return did_moves;
}
```
```c
/*** LRU MAINTENANCE THREAD ***/

/* Returns number of items remove, expired, or evicted.
 * Callable from worker threads or the LRU maintainer thread */
/* Called from two places: the lru_maintainer thread, and do_item_alloc
 * when allocating memory for a new item. */
static int lru_pull_tail(const int orig_id, const int cur_lru,
        const uint64_t total_bytes, uint8_t flags) {
    item *it = NULL;
    int id = orig_id;
    int removed = 0;
    if (id == 0)
        return 0;

    int tries = 5;
    item *search;
    item *next_it;
    void *hold_lock = NULL;
    unsigned int move_to_lru = 0;
    uint64_t limit = 0;

    id |= cur_lru; /* combine with HOT/WARM/COLD to pick the right LRU queue for this class */
    pthread_mutex_lock(&lru_locks[id]);
    search = tails[id];
    /* We walk up *only* for locked items, and if bottom is expired. */
    for (; tries > 0 && search != NULL; tries--, search=next_it) {
        /* we might relink search mid-loop, so search->prev isn't reliable */
        next_it = search->prev;
        if (search->nbytes == 0 && search->nkey == 0 && search->it_flags == 1) {
            /* We are a crawler, ignore it. */
            if (flags & LRU_PULL_CRAWL_BLOCKS) {
                pthread_mutex_unlock(&lru_locks[id]);
                return 0;
            }
            tries++;
            continue;
        }
        uint32_t hv = hash(ITEM_key(search), search->nkey);
        /* Attempt to hash item lock the "search" item. If locked, no
         * other callers can incr the refcount. Also skip ourselves. */
        if ((hold_lock = item_trylock(hv)) == NULL)
            continue;

        /* Now see if the item is refcount locked */
        if (refcount_incr(&search->refcount) != 2) {
            /* Note pathological case with ref'ed items in tail.
             * Can still unlink the item, but it won't be reusable yet */
            itemstats[id].lrutail_reflocked++;
            /* In case of refcount leaks, enable for quick workaround. */
            /* WARNING: This can cause terrible corruption */
            if (settings.tail_repair_time &&
                    search->time + settings.tail_repair_time < current_time) {
                itemstats[id].tailrepairs++;
                search->refcount = 1;
                /* This will call item_remove -> item_free since refcnt is 1 */
                do_item_unlink_nolock(search, hv);
                item_trylock_unlock(hold_lock);
                continue;
            }
        }

        /* Expired or flushed */
        if ((search->exptime != 0 && search->exptime < current_time)
            || item_is_flushed(search)) {
            itemstats[id].reclaimed++;
            if ((search->it_flags & ITEM_FETCHED) == 0) {
                itemstats[id].expired_unfetched++;
            }
            /* refcnt 2 -> 1 */
            do_item_unlink_nolock(search, hv);
            /* refcnt 1 -> 0 -> item_free */
            do_item_remove(search);
            item_trylock_unlock(hold_lock);
            removed++;
            /* If all we're finding are expired, can keep going */
            continue;
        }

        /* If we're HOT_LRU or WARM_LRU and over size limit, send to COLD_LRU.
         * If we're COLD_LRU, send to WARM_LRU unless we need to evict
         */
        switch (cur_lru) {
            case HOT_LRU:
                limit = total_bytes * settings.hot_lru_pct / 100;
            case WARM_LRU:
                if (limit == 0)
                    limit = total_bytes * settings.warm_lru_pct / 100;
                if (sizes_bytes[id] > limit) {
                    itemstats[id].moves_to_cold++;
                    move_to_lru = COLD_LRU;
                    do_item_unlink_q(search);
                    it = search;
                    removed++;
                    break;
                } else if ((search->it_flags & ITEM_ACTIVE) != 0) {
                    /* Only allow ACTIVE relinking if we're not too large. */
                    itemstats[id].moves_within_lru++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    do_item_update_nolock(search);
                    do_item_remove(search);
                    item_trylock_unlock(hold_lock);
                } else {
                    /* Don't want to move to COLD, not active, bail out */
                    it = search;
                }
                break;
            case COLD_LRU:
                it = search; /* No matter what, we're stopping */
                if (flags & LRU_PULL_EVICT) {
                    if (settings.evict_to_free == 0) {
                        /* Don't think we need a counter for this. It'll OOM. */
                        break;
                    }
                    itemstats[id].evicted++;
                    itemstats[id].evicted_time = current_time - search->time;
                    if (search->exptime != 0)
                        itemstats[id].evicted_nonzero++;
                    if ((search->it_flags & ITEM_FETCHED) == 0) {
                        itemstats[id].evicted_unfetched++;
                    }
                    LOGGER_LOG(NULL, LOG_EVICTIONS, LOGGER_EVICTION, search);
                    do_item_unlink_nolock(search, hv);
                    removed++;
                    if (settings.slab_automove == 2) {
                        slabs_reassign(-1, orig_id);
                    }
                } else if ((search->it_flags & ITEM_ACTIVE) != 0
                        && settings.lru_maintainer_thread) {
                    itemstats[id].moves_to_warm++;
                    search->it_flags &= ~ITEM_ACTIVE;
                    move_to_lru = WARM_LRU;
                    do_item_unlink_q(search);
                    removed++;
                }
                break;
        }
        if (it != NULL)
            break;
    }

    pthread_mutex_unlock(&lru_locks[id]);

    if (it != NULL) {
        if (move_to_lru) {
            it->slabs_clsid = ITEM_clsid(it);
            it->slabs_clsid |= move_to_lru;
            item_link_q(it);
        }
        do_item_remove(it);
        item_trylock_unlock(hold_lock);
    }

    return removed;
}
```
This is essentially the core of how the lru_maintainer thread reclaims space.
To summarize:
The maintainer thread sits in a while loop, repeatedly sweeping over all slab classes; any class holding more than 2.5 pages worth of free chunks is handed to the slab rebalancer so its spare pages can be returned to the global pool.
It also continuously maintains the LRU queues: when the hot or warm LRU of a class exceeds its memory quota, items are pushed down to the cold LRU, and items at the tail of the cold LRU are either evicted or, if still active, promoted back to the warm LRU.
So each slabclass_t has three LRU queues (hot, warm, cold), and when memory finally runs short, data in the cold LRU is evicted first.
In addition, the newest mc also supports an LRU crawler thread that works together with the maintainer thread; the crawler walks all items currently in memcache and checks, among other things, whether they have expired.
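For the retest below, the 64MB test instance was restarted with the LRU maintainer enabled. Assuming the same setup as before, the invocation looks roughly like this (size and port are illustrative):

```bash
# lru_maintainer added on top of the earlier options; append ",lru_crawler" to
# also enable the optional crawler thread discussed above.
$ memcached -d -m 64 -p 11211 -o slab_reassign,slab_automove,lru_maintainer
```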
1. Startup parameters: start the instance with the lru_maintainer option added (as in the invocation sketched above).
2. Write 4K data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 27s 64 14720 yes 2056 4 0
3. Delete everything:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 2 0 yes 0 0 0
4. Write 8K data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 2 0 yes 0 0 0
21 8.7K 27s 62 7316 yes 1071 1 0
5. Delete the 8K data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 2 0 yes 0 0 0
21 8.7K 0s 2 0 yes 0 0 0
6. Fill with 6K data:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 2 0 yes 0 0 0
20 6.9K 7s 60 8820 yes 2363 3 0
21 8.7K 0s 2 0 yes 0 0 0
7. Write 4K data with no expiry and 8K data with a 600s expiry:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 484s 11 2000 yes 0 0 0
21 8.7K 9s 26 2999 yes 0 0 0
8. Write 5K data with no expiry:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 2 0 yes 0 0 0
19 5.5K 5s 33 5999 yes 0 0 0
21 8.7K 0s 2 0 yes 0 0 0
9. Write 5K data with no expiry and 8K data with a 600s expiry:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 0s 2 0 yes 0 0 0
19 5.5K 57s 18 3000 yes 0 0 0
21 8.7K 8s 26 2999 yes 0 0 0
10. Write 4K data with an expiry:
# Item_Size Max_age Pages Count Full? Evicted Evict_Time OOM
18 4.4K 5s 9 1999 yes 0 0 0
19 5.5K 176s 18 3000 yes 0 0 0
21 8.7K 127s 10 1000 yes 0 0 0
The expired 8K data has been reclaimed.
Enabling the lru_maintainer thread makes a big difference. Note that also enabling the crawler thread takes locks and may affect mc performance (this needs a proper benchmark).
The experiments above show that 1.4.33 reclaims memory well even after all pages have been assigned.
For a cluster whose pages are already fully assigned, upgrading to the newer mc helps in two ways. First, it can cache data that 1.4.13 could no longer store, which improves space utilization under slab calcification and raises the mc hit rate. Second, because items are split across hot, warm, and cold LRUs, replacement space can be found quickly, greatly reducing the time spent searching for expired space to reclaim and further improving performance.
All of our mc instances have since been upgraded to 1.4.33.
https://github.com/memcached/memcached
https://github.com/memcached/memcached/wiki/ReleaseNotes1411