作为一个内存数据库,内存肯定是最重要的系统资源,但是机器的内存又不可能无限大,加上内存资源本身比较昂贵,因此尽可能减少内存的占用肯定是Redis的一个核心目标。本文将详细介绍Redis是如何通过数据过期和主动驱逐来最大化的利用内存空间。
Redis的数据管理包括 TTL过期淘汰 和 主动驱逐 。TTL过期淘汰又可分为 主动淘汰和被动淘汰。
- 主动淘汰: 通过后台定时线程检查,主动删除过期的key
- 被动淘汰: client访问的时候进行检查,如果已经过期,则删除数据
主动驱逐则是当内存空间不足时的一个数据淘汰策略。用户可以通过配置设置不同的淘汰策略,通过文档介绍可以知道redis包含以下几种驱逐策略:
# volatile-lru -> Evict using approximated LRU, only keys with an expire set.
# allkeys-lru -> Evict any key using approximated LRU.
# volatile-lfu -> Evict using approximated LFU, only keys with an expire set.
# allkeys-lfu -> Evict any key using approximated LFU.
# volatile-random -> Remove a random key having an expire set.
# allkeys-random -> Remove a random key, any key.
# volatile-ttl -> Remove the key with the nearest expire time (minor TTL)
# noeviction -> Don't evict anything, just return an error on write operations.
EXPIRE 淘汰
被动淘汰
被动淘汰的实现非常简单,就是当client访问一个key的时候,redis会检查该key的过期时间,如果数据已经过期,则进行主动淘汰
robj *lookupKey(redisDb *db, robj *key, int flags) {
dictEntry *de = dictFind(db->dict,key->ptr);
robj *val = NULL;
if (de) {
val = dictGetVal(de);
/* Forcing deletion of expired keys on a replica makes the replica
* inconsistent with the master. We forbid it on readonly replicas, but
* we have to allow it on writable replicas to make write commands
* behave consistently.
*
* It's possible that the WRITE flag is set even during a readonly
* command, since the command may trigger events that cause modules to
* perform additional writes. */
int is_ro_replica = server.masterhost && server.repl_slave_ro;
int force_delete_expired = flags & LOOKUP_WRITE && !is_ro_replica;
if (expireIfNeeded(db, key, force_delete_expired)) {
/* The key is no longer valid. */
val = NULL;
}
}
...
}
核心实现为在lookupKey的时候,通过 expireIfNeeded
来判断数据是否过期,如果过期则删除并设置value位空。 数据删除的详细实现为 deleteExpiredKeyAndPropagate
。该函数会判断是否设置了 lazyfree_lazy_expire
,如果设置了则通过 dbAsyncDelete
进行一步删除,否则直接同步删除。之所以需要进行一步删除,是因为由于redis是单线程处理io的,如果删除的key比较大,那么可能导致析构key的内存空间的时候出现阻塞影响client请求。异步删除的详细实现本文先不做详细介绍。
void deleteExpiredKeyAndPropagate(redisDb *db, robj *keyobj) {
mstime_t expire_latency;
latencyStartMonitor(expire_latency);
if (server.lazyfree_lazy_expire)
dbAsyncDelete(db,keyobj); // 异步删除
else
dbSyncDelete(db,keyobj);
latencyEndMonitor(expire_latency);
latencyAddSampleIfNeeded("expire-del",expire_latency);
notifyKeyspaceEvent(NOTIFY_EXPIRED,"expired",keyobj,db->id);
signalModifiedKey(NULL, db, keyobj);
propagateDeletion(db,keyobj,server.lazyfree_lazy_expire);
server.stat_expiredkeys++;
}
被动删除的缺点很明显,如果一个数据没有被访问到,那么即便已经过期了也不会被删除,为了解决这一问题,redis还实现了主动删除。
主动淘汰
主动淘汰的核心逻辑为 activeExpireCycle
,该函数有两个入口,对应着两个不同处理逻辑,一个是在 beforeSleep
的时候通过 ACTIVE_EXPIRE_CYCLE_FAST
模式进行,另一个通过 databasesCron
后台线程进行以 ACTIVE_EXPIRE_CYCLE_SLOW
驱动.
ACTIVE_EXPIRE_CYCLE_FAST
/* This function gets called every time Redis is entering the
* main loop of the event driven library, that is, before to sleep
* for ready file descriptors.
/
void beforeSleep(struct aeEventLoop *eventLoop){
...
/* Run a fast expire cycle (the called function will return
* ASAP if a fast cycle is not needed). */
if (server.active_expire_enabled && server.masterhost == NULL)
activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST);
}
redis会在函数进入阻塞之前调用beforeSleep,同时通过判断 active_expire_enabled
决定是否进行FAST 模式。执行FAST 模式还有两个其他条件限制:
if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
/* Don't start a fast cycle if the previous cycle did not exit
* for time limit, unless the percentage of estimated stale keys is
* too high. Also never repeat a fast cycle for the same period
* as the fast cycle total duration itself. */
if (!timelimit_exit &&
server.stat_expired_stale_perc < config_cycle_acceptable_stale)
return;
if (start < last_fast_cycle + (long long)config_cycle_fast_duration*2)
return;
last_fast_cycle = start;
}
- 判断上一轮执行是否因为超时而推出,且过期数据的比例低于配置的阈值,如果不是则不需要执行fast模式,slow模式即可应对数据清理
- 距离上一次FAST是否超过时间间隔,默认间隔为 2000us
ACTIVE_EXPIRE_CYCLE_SLOW
SLOW 模式的核心逻辑包含以下几个配置
void activeExpireCycle(int type) {
/* Adjust the running parameters according to the configured expire
* effort. The default effort is 1, and the maximum configurable effort
* is 10. */
unsigned long
effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
2*effort,
config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE-
effort;
...
if (dbs_per_call > server.dbnum || timelimit_exit)
dbs_per_call = server.dbnum;
...
timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;
timelimit_exit = 0;
if (timelimit <= 0) timelimit = 1;
...
if (num > config_keys_per_loop)
num = config_keys_per_loop;
for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
...
do {
if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
elapsed = ustime()-start;
if (elapsed > timelimit) {
timelimit_exit = 1;
server.stat_expired_time_cap_reached_count++;
break;
}
}
} while (sampled == 0 ||
(expired*100/sampled) > config_cycle_acceptable_stale);
}
config_keys_per_loop
每次执行扫描的key数量,默认20 可以通过 active_expire_effort
进行调整,越大每次扫描的key越多,淘汰越快,同时CPU 开销也会更大。
config_cycle_slow_time_perc
每次扫描执行的时间,如果执行超时 则会设置 timelimit_exit
为true,表明过期的key比较多,需要尽快进行清理
默认的 dbs_per_call
为16,表示每次执行需要扫描的db数,如果上一轮退出时设置了 timelimit_exit
则需要扫描所有db,尽可能快淘汰所有数据。
执行扫描时,会一次遍历所有的db,每轮循环最大采样的数据量为 config_keys_per_loop,采样时,记录过期的key数量,如果过期的key数量比例低于 config_cycle_acceptable_stale
,则该db无需进行继续清理,进入下一个db继续扫描。同时每执行 16次 (iteration & 0xf)
会判断是否执行超时,如果超时则设置 timelimit_exit
并退出当前循环。
主动驱逐
主动驱逐是redis在内存空间即将不足的时候设置的驱逐策略。根据驱逐算法可以分为LRU和LFU ,本文将详细介绍下redis LRU 的实现。标准的LRU 实现为 hash + 双链表 的模式,但是这种模式会引入pre和next指针,带来额外的内存开销,同时每次访问都需要更新指针排序,带来额外的cpu开销。因此redis并没有采用这种方式,而是实现了近似的 采样LRU ,每次随机采样N个key,默认为5个,可以通过 maxmemory-samples
进行配置。然后对采样的key进行排序,淘汰最老的数据。虽然采用了采样可能导致不精确,但是在符合长尾的请求模式里,采样模式和标准模式的淘汰效果几乎一致。
Redis LRU 实现
竟然是LRU 那么肯定需要一个字段来表示最最近的访问时间,为了减少内存的开销,redis使用了 lru::LRU_BITS
来存储最近访问时间,该字段为一个24bit的位域 , 每次key被访问时,该值会被更新为系统的lru_lock,通过直接读取server的lru_lock,减少了gettimeofday()的调用,从这里可以看到,redis某些地方真的是把性能优化到了极致。
typedef struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock) or
* LFU data (least significant 8 bits frequency
* and most significant 16 bits access time). */
int refcount;
void *ptr;
} robj;
robj *lookupKey(redisDb *db, robj *key, int flags) {
...
if (!hasActiveChildProcess() && !(flags & LOOKUP_NOTOUCH)){
if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
updateLFU(val);
} else {
val->lru = LRU_CLOCK();
}
}
}
LRU_CLOCK
为server的全局变量,更新频率取决于 server.hz
,由于lru只有24bit因此lru_lock的最大取值范围为 #define LRU_CLOCK_MAX ((1<
采样实现
void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) {
int j, k, count;
dictEntry *samples[server.maxmemory_samples];
count = dictGetSomeKeys(sampledict,samples,server.maxmemory_samples);
for (j = 0; j < count; j++) {
unsigned long long idle;
sds key;
robj *o;
dictEntry *de;
de = samples[j];
key = dictGetKey(de);
/* If the dictionary we are sampling from is not the main
* dictionary (but the expires one) we need to lookup the key
* again in the key dictionary to obtain the value object. */
if (server.maxmemory_policy != MAXMEMORY_VOLATILE_TTL) {
if (sampledict != keydict) de = dictFind(keydict, key);
o = dictGetVal(de);
}
/* Calculate the idle time according to the policy. This is called
* idle just because the code initially handled LRU, but is in fact
* just a score where an higher score means better candidate. */
if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
idle = estimateObjectIdleTime(o);
} else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
/* When we use an LRU policy, we sort the keys by idle time
* so that we expire keys starting from greater idle time.
* However when the policy is an LFU one, we have a frequency
* estimation, and we want to evict keys with lower frequency
* first. So inside the pool we put objects using the inverted
* frequency subtracting the actual frequency to the maximum
* frequency of 255. */
idle = 255-LFUDecrAndReturn(o);
} else if (server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL) {
/* In this case the sooner the expire the better. */
idle = ULLONG_MAX - (long)dictGetVal(de);
} else {
serverPanic("Unknown eviction policy in evictionPoolPopulate()");
}
/* Insert the element inside the pool.
* First, find the first empty bucket or the first populated
* bucket that has an idle time smaller than our idle time. */
k = 0;
while (k < EVPOOL_SIZE &&
pool[k].key &&
pool[k].idle < idle) k++;
if (k == 0 && pool[EVPOOL_SIZE-1].key != NULL) {
/* Can't insert if the element is < the worst element we have
* and there are no empty buckets. */
continue;
} else if (k < EVPOOL_SIZE && pool[k].key == NULL) {
/* Inserting into empty position. No setup needed before insert. */
} else {
/* Inserting in the middle. Now k points to the first element
* greater than the element to insert. */
if (pool[EVPOOL_SIZE-1].key == NULL) {
/* Free space on the right? Insert at k shifting
* all the elements from k to end to the right. */
/* Save SDS before overwriting. */
sds cached = pool[EVPOOL_SIZE-1].cached;
memmove(pool+k+1,pool+k,
sizeof(pool[0])*(EVPOOL_SIZE-k-1));
pool[k].cached = cached;
} else {
/* No free space on right? Insert at k-1 */
k--;
/* Shift all elements on the left of k (included) to the
* left, so we discard the element with smaller idle time. */
sds cached = pool[0].cached; /* Save SDS before overwriting. */
if (pool[0].key != pool[0].cached) sdsfree(pool[0].key);
memmove(pool,pool+1,sizeof(pool[0])*k);
pool[k].cached = cached;
}
}
/* Try to reuse the cached SDS string allocated in the pool entry,
* because allocating and deallocating this object is costly
* (according to the profiler, not my fantasy. Remember:
* premature optimization bla bla bla. */
int klen = sdslen(key);
if (klen > EVPOOL_CACHED_SDS_SIZE) {
pool[k].key = sdsdup(key);
} else {
memcpy(pool[k].cached,key,klen+1);
sdssetlen(pool[k].cached,klen);
pool[k].key = pool[k].cached;
}
pool[k].idle = idle;
pool[k].dbid = dbid;
}
}
- 执行evict的时候,每次会采样随机读取N个key,然后排序比较该N个key的ttl,将ttl最小的key放入到
evictionPoolEntry
, evictionPoolEntry 的最大容量为EVPOOL_SIZE(16)。 - 如果pool已经放满, 那么判断即将被插入的key的lru,如果比pool的都大,则拒绝放入,否则查找合适的位置进行插入
-
performEvictions
遍历完所有db,然后判断是否有数据可以进行淘汰,如果有,则遍历evictionPoolEntry
,然后读取需要淘汰的key,此时在判断是allkeys-lru 还是volatile-lru,然后获取到bestkey进行驱逐。
redis采样以后选择最小lru的key以后并不是直接进行删除,而是先放入排序pool,最终在pool里进行淘汰,通过pool可以进一步提高采样lru的准确率。
总结
无论是EXPIRCE还是EVICT,都可以从设计上看到对性能的权衡取舍。 主动expire的时候需要权衡每次扫描的key数量,在尽可能多淘汰过期数据的同时减低cpu的使用。EVICT的时候,则通过牺牲时间精度减少内存占用,同时通过牺牲LRU 的准确率,来降低LRU 标准实现的开销。