Redis is a high-performance cache system that keeps cached data in memory. In some deployments the memory consumed by cached data has to be bounded, so Redis automatically evicts some of it. Eviction can be based on when data was last accessed (LRU), on how frequently it is accessed (LFU), or on some combination of the two. Redis supports both LRU and LFU eviction; this document focuses on the LRU implementation.
Compared with other systems that implement LRU, a distinguishing feature of Redis is that it does not keep all cached keys on a global linked list; instead it approximates LRU by randomly sampling a small number of keys and evicting the best candidate among them.
LRU configuration parameters
Redis has three configuration parameters related to the LRU feature:

- maxmemory: the memory limit for cached data. When the memory consumed by cached data exceeds this value, eviction is triggered. A value of 0 means the amount of cached data is unlimited, i.e. the LRU feature is disabled.
- maxmemory_policy: the eviction policy. It defines which kinds of keys participate in eviction and how victims are selected.
- maxmemory_samples: the sampling precision. The larger this value, the closer the behavior is to a true LRU algorithm, but the more CPU time it costs, so eviction runs slower.

To use the LRU feature it is enough to set maxmemory; maxmemory_policy and maxmemory_samples have default values and are discussed in detail below.
All three parameters can be set in the configuration file, which Redis reads at startup, or changed at runtime with the CONFIG SET command.
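For example, the parameters can be set like this (illustrative values only; note that redis.conf and CONFIG SET spell the names with dashes, while the underscore forms above are the internal field names):

# redis.conf (illustrative values)
maxmemory 100mb
maxmemory-policy allkeys-lru
maxmemory-samples 5

# or at runtime, from redis-cli
CONFIG SET maxmemory 100mb
CONFIG SET maxmemory-policy allkeys-lru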
Eviction policies
Cached data in Redis may or may not have a timeout attribute, so Redis uses two separate hash tables in each database structure to manage cached data. The database structure is defined as follows (redis.h):
typedef struct redisDb {
dict *dict; /* The keyspace for this DB */
dict *expires; /* Timeout of keys with a timeout set */
dict *blocking_keys; /* Keys with clients waiting for data (BLPOP) */
dict *ready_keys; /* Blocked keys that received a PUSH */
dict *watched_keys; /* WATCHED keys for MULTI/EXEC CAS */
int id;
long long avg_ttl; /* Average TTL, just for stats */
} redisDb;
The expires hash table stores only the keys that have a timeout, while the dict hash table stores keys both with and without a timeout; in other words, dict holds the full set of cached keys.
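As a simplified illustration of how the two tables relate (a hypothetical helper, not actual Redis code): when a timeout is set on a key, the value object stays in db->dict and db->expires only records the expiration time under the same key. The real code (setExpire in db.c) stores the timestamp in the dict entry's integer field; the sketch below squeezes it into the value pointer for brevity.

/* Hypothetical sketch: a key with a timeout appears in BOTH tables. */
void sketch_set_with_ttl(redisDb *db, sds key, robj *value, long long when) {
    dictAdd(db->dict, key, value);                  /* full key space      */
    dictAdd(db->expires, key, (void *)(long)when);  /* timeout bookkeeping */
}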
Redis provides six eviction policies, i.e. the parameter maxmemory_policy can take six values:

- noeviction: if cached data exceeds the maxmemory limit and the command the client is executing would allocate memory, an error is returned to the client.
- allkeys-lru: all cached keys (with and without a timeout) participate in LRU eviction.
- volatile-lru: only keys with a timeout participate in LRU eviction.
- allkeys-random: all cached keys (with and without a timeout) participate in eviction, but victims are chosen at random rather than by the LRU algorithm.
- volatile-random: only keys with a timeout participate in eviction, and victims are chosen at random rather than by the LRU algorithm.
- volatile-ttl: only keys with a timeout participate in eviction; victims are chosen by their TTL (the key closest to expiring is evicted) rather than by the LRU algorithm.

Because volatile-lru, volatile-random and volatile-ttl operate only on the subset of keys that have a timeout, they may be unable to free enough memory.
Also note that giving cached data a timeout attribute costs additional memory, so when memory pressure is high, think carefully before setting timeouts.
Processing flow overview
When the LRU feature is enabled, the processing flow is as follows:

- A client sends a command; Redis parses it and allocates memory for the command and its arguments.
- While processing the command, Redis checks whether the memory used by cached data exceeds maxmemory; if the limit is exceeded, memory is freed according to the configured eviction policy.

Redis LRU implementation details
The entry point for command processing in Redis is the function processCommand, implemented as follows (redis.c):
int processCommand(redisClient *c) {
/* The QUIT command is handled separately. Normal command procs will
* go through checking for replication and QUIT will cause trouble
* when FORCE_REPLICATION is enabled and would be implemented in
* a regular command proc. */
if (!strcasecmp(c->argv[0]->ptr,"quit")) {
addReply(c,shared.ok);
c->flags |= REDIS_CLOSE_AFTER_REPLY;
return REDIS_ERR;
}
/* Now lookup the command and check ASAP about trivial error conditions
* such as wrong arity, bad command name and so forth. */
c->cmd = c->lastcmd = lookupCommand(c->argv[0]->ptr);
if (!c->cmd) {
flagTransaction(c);
addReplyErrorFormat(c,"unknown command '%s'",
(char*)c->argv[0]->ptr);
return REDIS_OK;
} else if ((c->cmd->arity > 0 && c->cmd->arity != c->argc) ||
(c->argc < -c->cmd->arity)) {
flagTransaction(c);
addReplyErrorFormat(c,"wrong number of arguments for '%s' command",
c->cmd->name);
return REDIS_OK;
}
/* Check if the user is authenticated */
if (server.requirepass && !c->authenticated && c->cmd->proc != authCommand)
{
flagTransaction(c);
addReply(c,shared.noautherr);
return REDIS_OK;
}
/* Handle the maxmemory directive.
*
* First we try to free some memory if possible (if there are volatile
* keys in the dataset). If there are not the only thing we can do
* is returning an error. */
if (server.maxmemory) {
int retval = freeMemoryIfNeeded();
/* freeMemoryIfNeeded may flush slave output buffers. This may result
* into a slave, that may be the active client, to be freed. */
if (server.current_client == NULL) return REDIS_ERR;
/* It was impossible to free enough memory, and the command the client
* is trying to execute is denied during OOM conditions? Error. */
if ((c->cmd->flags & REDIS_CMD_DENYOOM) && retval == REDIS_ERR) {
flagTransaction(c);
addReply(c, shared.oomerr);
return REDIS_OK;
}
}
/* Don't accept write commands if there are problems persisting on disk
* and if this is a master instance. */
if (((server.stop_writes_on_bgsave_err &&
server.saveparamslen > 0 &&
server.lastbgsave_status == REDIS_ERR) ||
server.aof_last_write_status == REDIS_ERR) &&
server.masterhost == NULL &&
(c->cmd->flags & REDIS_CMD_WRITE ||
c->cmd->proc == pingCommand))
{
flagTransaction(c);
if (server.aof_last_write_status == REDIS_OK)
addReply(c, shared.bgsaveerr);
else
addReplySds(c,
sdscatprintf(sdsempty(),
"-MISCONF Errors writing to the AOF file: %s\r\n",
strerror(server.aof_last_write_errno)));
return REDIS_OK;
}
/* Don't accept write commands if there are not enough good slaves and
* user configured the min-slaves-to-write option. */
if (server.masterhost == NULL &&
server.repl_min_slaves_to_write &&
server.repl_min_slaves_max_lag &&
c->cmd->flags & REDIS_CMD_WRITE &&
server.repl_good_slaves_count < server.repl_min_slaves_to_write)
{
flagTransaction(c);
addReply(c, shared.noreplicaserr);
return REDIS_OK;
}
/* Don't accept write commands if this is a read only slave. But
* accept write commands if this is our master. */
if (server.masterhost && server.repl_slave_ro &&
!(c->flags & REDIS_MASTER) &&
c->cmd->flags & REDIS_CMD_WRITE)
{
addReply(c, shared.roslaveerr);
return REDIS_OK;
}
/* Only allow SUBSCRIBE and UNSUBSCRIBE in the context of Pub/Sub */
if (c->flags & REDIS_PUBSUB &&
c->cmd->proc != pingCommand &&
c->cmd->proc != subscribeCommand &&
c->cmd->proc != unsubscribeCommand &&
c->cmd->proc != psubscribeCommand &&
c->cmd->proc != punsubscribeCommand) {
addReplyError(c,"only (P)SUBSCRIBE / (P)UNSUBSCRIBE / QUIT allowed in this context");
return REDIS_OK;
}
/* Only allow INFO and SLAVEOF when slave-serve-stale-data is no and
* we are a slave with a broken link with master. */
if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED &&
server.repl_serve_stale_data == 0 &&
!(c->cmd->flags & REDIS_CMD_STALE))
{
flagTransaction(c);
addReply(c, shared.masterdownerr);
return REDIS_OK;
}
/* Loading DB? Return an error if the command has not the
* REDIS_CMD_LOADING flag. */
if (server.loading && !(c->cmd->flags & REDIS_CMD_LOADING)) {
addReply(c, shared.loadingerr);
return REDIS_OK;
}
/* Lua script too slow? Only allow a limited number of commands. */
if (server.lua_timedout &&
c->cmd->proc != authCommand &&
c->cmd->proc != replconfCommand &&
!(c->cmd->proc == shutdownCommand &&
c->argc == 2 &&
tolower(((char*)c->argv[1]->ptr)[0]) == 'n') &&
!(c->cmd->proc == scriptCommand &&
c->argc == 2 &&
tolower(((char*)c->argv[1]->ptr)[0]) == 'k'))
{
flagTransaction(c);
addReply(c, shared.slowscripterr);
return REDIS_OK;
}
/* Exec the command */
if (c->flags & REDIS_MULTI &&
c->cmd->proc != execCommand && c->cmd->proc != discardCommand &&
c->cmd->proc != multiCommand && c->cmd->proc != watchCommand)
{
queueMultiCommand(c);
addReply(c,shared.queued);
} else {
call(c,REDIS_CALL_FULL);
if (listLength(server.ready_keys))
handleClientsBlockedOnLists();
}
return REDIS_OK;
}
Note that by the time this function is called, Redis has already parsed the command and its arguments and allocated memory for them; the client object's argv field points to that memory.
The block that handles the maxmemory directive calls freeMemoryIfNeeded to free cache memory. If freeMemoryIfNeeded fails, i.e. it cannot free enough memory, and the client's command is one that would increase memory usage (flagged REDIS_CMD_DENYOOM), an OOM error is returned to the client.
Before analyzing the eviction function freeMemoryIfNeeded, let us look at how Redis tracks memory allocation. It is quite simple: Redis wraps memory allocation and deallocation in its own interface and keeps a global variable used_memory that records how much memory the cached data currently occupies. Redis' dynamic allocation function is (zmalloc.c):
void *zmalloc(size_t size) {
void *ptr = malloc(size+PREFIX_SIZE);
if (!ptr) zmalloc_oom_handler(size);
#ifdef HAVE_MALLOC_SIZE
update_zmalloc_stat_alloc(zmalloc_size(ptr));
return ptr;
#else
*((size_t*)ptr) = size;
update_zmalloc_stat_alloc(size+PREFIX_SIZE);
return (char*)ptr+PREFIX_SIZE;
#endif
}
After the libc function malloc successfully allocates a block, the macro update_zmalloc_stat_alloc updates the used-memory counter. The macro is defined as (zmalloc.c):
#define update_zmalloc_stat_alloc(__n) do { \
size_t _n = (__n); \
if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
if (zmalloc_thread_safe) { \
update_zmalloc_stat_add(_n); \
} else { \
used_memory += _n; \
} \
} while(0)
The macro simply adds the size of the block allocated by malloc to the global variable used_memory. The statement if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); rounds the allocated size up to a multiple of sizeof(long). Since malloc generally hands out memory with some alignment anyway, this assumption is reasonable. Strictly speaking, Redis could skip the adjustment; what matters is that zmalloc and zfree apply the same convention so the counter stays consistent.
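As a standalone illustration of this rounding step (not Redis code): on a typical 64-bit platform where sizeof(long) is 8, a 10-byte allocation is accounted as 16 bytes.

#include <stdio.h>

/* Round a size up to a multiple of sizeof(long), the same arithmetic
 * used by update_zmalloc_stat_alloc. */
static size_t round_up(size_t n) {
    if (n & (sizeof(long) - 1))
        n += sizeof(long) - (n & (sizeof(long) - 1));
    return n;
}

int main(void) {
    /* On a 64-bit system (sizeof(long) == 8) this prints:
     * 10 -> 16, 16 -> 16, 17 -> 24 */
    printf("%zu -> %zu\n", (size_t)10, round_up(10));
    printf("%zu -> %zu\n", (size_t)16, round_up(16));
    printf("%zu -> %zu\n", (size_t)17, round_up(17));
    return 0;
}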
When zfree releases memory, the global variable used_memory is updated again, this time subtracting the freed size.
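For reference, the free path uses a symmetric macro; a sketch of its shape, modeled on the same zmalloc.c (treat the exact text as approximate):

#define update_zmalloc_stat_free(__n) do { \
    size_t _n = (__n); \
    if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
    if (zmalloc_thread_safe) { \
        update_zmalloc_stat_sub(_n); \
    } else { \
        used_memory -= _n; \
    } \
} while(0)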
The function freeMemoryIfNeeded performs the eviction of cached data; its implementation is (redis.c):
int freeMemoryIfNeeded(void) {
size_t mem_used, mem_tofree, mem_freed;
int slaves = listLength(server.slaves);
mstime_t latency;
/* Remove the size of slaves output buffers and AOF buffer from the
* count of used memory. */
mem_used = zmalloc_used_memory();
if (slaves) {
listIter li;
listNode *ln;
listRewind(server.slaves,&li);
while((ln = listNext(&li))) {
redisClient *slave = listNodeValue(ln);
unsigned long obuf_bytes = getClientOutputBufferMemoryUsage(slave);
if (obuf_bytes > mem_used)
mem_used = 0;
else
mem_used -= obuf_bytes;
}
}
if (server.aof_state != REDIS_AOF_OFF) {
mem_used -= sdslen(server.aof_buf);
mem_used -= aofRewriteBufferSize();
}
/* Check if we are over the memory limit. */
if (mem_used <= server.maxmemory) return REDIS_OK;
if (server.maxmemory_policy == REDIS_MAXMEMORY_NO_EVICTION)
return REDIS_ERR; /* We need to free memory, but policy forbids. */
/* Compute how much memory we need to free. */
mem_tofree = mem_used - server.maxmemory;
mem_freed = 0;
latencyStartMonitor(latency);
while (mem_freed < mem_tofree) {
int j, k, keys_freed = 0;
for (j = 0; j < server.dbnum; j++) {
long bestval = 0; /* just to prevent warning */
sds bestkey = NULL;
struct dictEntry *de;
redisDb *db = server.db+j;
dict *dict;
if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM)
{
dict = server.db[j].dict;
} else {
dict = server.db[j].expires;
}
if (dictSize(dict) == 0) continue;
/* volatile-random and allkeys-random policy */
if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_RANDOM ||
server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_RANDOM)
{
de = dictGetRandomKey(dict);
bestkey = dictGetKey(de);
}
/* volatile-lru and allkeys-lru policy */
else if (server.maxmemory_policy == REDIS_MAXMEMORY_ALLKEYS_LRU ||
server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
{
for (k = 0; k < server.maxmemory_samples; k++) {
sds thiskey;
long thisval;
robj *o;
de = dictGetRandomKey(dict);
thiskey = dictGetKey(de);
/* When policy is volatile-lru we need an additional lookup
* to locate the real key, as dict is set to db->expires. */
if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_LRU)
de = dictFind(db->dict, thiskey);
o = dictGetVal(de);
thisval = estimateObjectIdleTime(o);
/* Higher idle time is better candidate for deletion */
if (bestkey == NULL || thisval > bestval) {
bestkey = thiskey;
bestval = thisval;
}
}
}
/* volatile-ttl */
else if (server.maxmemory_policy == REDIS_MAXMEMORY_VOLATILE_TTL) {
for (k = 0; k < server.maxmemory_samples; k++) {
sds thiskey;
long thisval;
de = dictGetRandomKey(dict);
thiskey = dictGetKey(de);
thisval = (long) dictGetVal(de);
/* Expire sooner (minor expire unix timestamp) is better
* candidate for deletion */
if (bestkey == NULL || thisval < bestval) {
bestkey = thiskey;
bestval = thisval;
}
}
}
/* Finally remove the selected key. */
if (bestkey) {
long long delta;
robj *keyobj = createStringObject(bestkey,sdslen(bestkey));
propagateExpire(db,keyobj);
/* We compute the amount of memory freed by dbDelete() alone.
* It is possible that actually the memory needed to propagate
* the DEL in AOF and replication link is greater than the one
* we are freeing removing the key, but we can't account for
* that otherwise we would never exit the loop.
*
* AOF and Output buffer memory will be freed eventually so
* we only care about memory used by the key space. */
delta = (long long) zmalloc_used_memory();
dbDelete(db,keyobj);
delta -= (long long) zmalloc_used_memory();
mem_freed += delta;
server.stat_evictedkeys++;
notifyKeyspaceEvent(REDIS_NOTIFY_EVICTED, "evicted",
keyobj, db->id);
decrRefCount(keyobj);
keys_freed++;
/* When the memory to free starts to be big enough, we may
* start spending so much time here that is impossible to
* deliver data to the slaves fast enough, so we force the
* transmission here inside the loop. */
if (slaves) flushSlavesOutputBuffers();
}
}
if (!keys_freed) {
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("eviction-cycle",latency);
return REDIS_ERR; /* nothing to free... */
}
}
latencyEndMonitor(latency);
latencyAddSampleIfNeeded("eviction-cycle",latency);
return REDIS_OK;
}
This function does not itself free the replica (slave) output buffers or the AOF buffers; those are released by their own logic. Therefore the first block of the function subtracts the replicas' output buffer sizes from the local variable mem_used, and the next block subtracts the AOF buffer sizes.
The check if (mem_used <= server.maxmemory) return REDIS_OK; returns immediately, without evicting anything, when the memory currently used by cached data is within the configured maxmemory.
If the used memory exceeds maxmemory but the eviction policy forbids freeing memory (noeviction), the function returns an error.
Next, the local variable mem_tofree holds the amount of memory that must be evicted and mem_freed the amount evicted so far. The loop while (mem_freed < mem_tofree) evicts cached data; its logic can be summarized as follows:

- Starting from database 0 (Redis has 16 databases by default), select the hash table of that database according to the eviction policy; if that hash table is empty, move on to the next database.
- Select a victim key from that hash table according to the policy and delete the corresponding entry from the database.
- If no key can be selected in any database, i.e. the required amount of memory cannot be freed, the function returns an error.
- If the memory freed so far is still below the target, start another pass from database 0.

If the policy is allkeys-random or volatile-random, a key is chosen at random from the corresponding hash table and evicted.
If the policy is allkeys-lru or volatile-lru, then according to the configured sample size maxmemory_samples, that many keys are picked at random from the hash table and the one with the lowest heat (the longest idle time) is evicted.

If the policy is volatile-ttl, maxmemory_samples keys are picked at random and the one that will expire soonest is evicted.
So the larger maxmemory_samples is, the more accurately the eviction candidate is chosen, but the more CPU time each eviction consumes, reducing throughput.
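To make the sampling idea concrete, here is a minimal self-contained sketch of sampled (approximate) LRU selection; it is not Redis code, and pick_victim, cache_entry and the sample data are made up for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* A toy cache entry: just a key name and a last-access timestamp. */
typedef struct {
    const char *key;
    time_t last_access;
} cache_entry;

/* Approximate LRU: sample `samples` random entries (like maxmemory_samples)
 * and return the one that has been idle the longest. */
static const cache_entry *pick_victim(const cache_entry *entries, int n,
                                      int samples, time_t now) {
    const cache_entry *best = NULL;
    double best_idle = -1;
    for (int i = 0; i < samples; i++) {
        const cache_entry *e = &entries[rand() % n];
        double idle = difftime(now, e->last_access);
        if (best == NULL || idle > best_idle) {
            best = e;
            best_idle = idle;
        }
    }
    return best;
}

int main(void) {
    time_t now = time(NULL);
    cache_entry entries[] = {
        { "user:1", now - 10  },
        { "user:2", now - 300 },
        { "user:3", now - 5   },
        { "user:4", now - 120 },
    };
    srand((unsigned)now);
    /* With samples == 3 the coldest key "user:2" is usually, but not
     * always, chosen; that is exactly the approximation Redis accepts. */
    const cache_entry *victim = pick_victim(entries, 4, 3, now);
    printf("evict %s\n", victim->key);
    return 0;
}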
The function that returns a random key from a database hash table is dictGetRandomKey, implemented as (dict.c):
/* Return a random entry from the hash table. Useful to
* implement randomized algorithms */
dictEntry *dictGetRandomKey(dict *d)
{
dictEntry *he, *orighe;
unsigned int h;
int listlen, listele;
if (dictSize(d) == 0) return NULL;
if (dictIsRehashing(d)) _dictRehashStep(d);
if (dictIsRehashing(d)) {
do {
h = random() % (d->ht[0].size+d->ht[1].size);
he = (h >= d->ht[0].size) ? d->ht[1].table[h - d->ht[0].size] :
d->ht[0].table[h];
} while(he == NULL);
} else {
do {
h = random() & d->ht[0].sizemask;
he = d->ht[0].table[h];
} while(he == NULL);
}
/* Now we found a non empty bucket, but it is a linked
* list and we need to get a random element from the list.
* The only sane way to do so is counting the elements and
* select a random index. */
listlen = 0;
orighe = he;
while(he) {
he = he->next;
listlen++;
}
listele = random() % listlen;
he = orighe;
while(listele--) he = he->next;
return he;
}
The function is straightforward and works in two steps:

- randomly pick a non-empty bucket of the hash table;
- count the entries in that bucket's linked list and pick one of them at random.

Under the LRU model, data that is accessed frequently is hot and data that is rarely accessed is cold. The following describes how Redis tracks this heat for cached data.
The Redis object structure is defined as (redis.h):
typedef struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:REDIS_LRU_BITS; /* lru time (relative to server.lruclock) */
int refcount;
void *ptr;
} robj;
So every object carries an lru field, which uses the low 24 bits of an unsigned value (the width defined by REDIS_LRU_BITS).
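For reference, the related constants are defined in redis.h roughly as follows (taken from the 2.8-era source this article is based on; values may differ in other versions):

#define REDIS_LRU_BITS 24
#define REDIS_LRU_CLOCK_MAX ((1<<REDIS_LRU_BITS)-1) /* Max value of obj->lru */
#define REDIS_LRU_CLOCK_RESOLUTION 10 /* LRU clock resolution in seconds */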
Whenever a Redis command accesses cached data it calls lookupKey, implemented as (db.c):
robj *lookupKey(redisDb *db, robj *key) {
dictEntry *de = dictFind(db->dict,key->ptr);
if (de) {
robj *val = dictGetVal(de);
/* Update the access time for the ageing algorithm.
* Don't do it if we have a saving child, as this will trigger
* a copy on write madness. */
if (server.rdb_child_pid == -1 && server.aof_child_pid == -1)
val->lru = server.lruclock;
return val;
} else {
return NULL;
}
}
This function updates the object's lru field to the current global server.lruclock (it skips the update while an RDB or AOF child process is running, to avoid triggering copy-on-write). The lru field is also initialized to server.lruclock when an object is created.
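For completeness, a sketch of the creation path, modeled on createObject in object.c (treat the exact body as approximate):

robj *createObject(int type, void *ptr) {
    robj *o = zmalloc(sizeof(*o));
    o->type = type;
    o->encoding = REDIS_ENCODING_RAW;
    o->ptr = ptr;
    o->refcount = 1;
    /* Stamp the object with the current LRU clock at creation time. */
    o->lru = server.lruclock;
    return o;
}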
The global server.lruclock is updated by updateLRUClock, which is called from serverCron (redis.c):
void updateLRUClock(void) {
server.lruclock = (server.unixtime/REDIS_LRU_CLOCK_RESOLUTION) &
REDIS_LRU_CLOCK_MAX;
}
The global server.unixtime in turn is updated by updateCachedTime, also called from serverCron (redis.c):
/* We take a cached value of the unix time in the global state because with
* virtual memory and aging there is to store the current time in objects at
* every object access, and accuracy is not needed. To access a global var is
* a lot faster than calling time(NULL) */
void updateCachedTime(void) {
server.unixtime = time(NULL);
server.mstime = mstime();
}
serverCron is a timer callback that runs periodically. With the global server.hz set to 10, serverCron runs every 1000/10 = 100 milliseconds, so server.lruclock is refreshed every 100 milliseconds; because of the division by REDIS_LRU_CLOCK_RESOLUTION, its value only advances once per resolution interval. Like the lru field of the object structure, server.lruclock uses only the low 24 bits of an unsigned value.
So when lookupKey updates an object's lru heat value, it does not call a system function to get the current timestamp; it uses the cached approximation server.lruclock instead, which avoids a system call on every access and improves performance.
The function estimateObjectIdleTime estimates an object's idle time (the inverse of its heat); it is implemented as (object.c):
/* Given an object returns the min number of seconds the object was never
* requested, using an approximated LRU algorithm. */
unsigned long estimateObjectIdleTime(robj *o) {
if (server.lruclock >= o->lru) {
return (server.lruclock - o->lru) * REDIS_LRU_CLOCK_RESOLUTION;
} else {
return ((REDIS_LRU_CLOCK_MAX - o->lru) + server.lruclock) *
REDIS_LRU_CLOCK_RESOLUTION;
}
}
The idea is that the larger the difference between the global server.lruclock and an object's lru value, the colder the object. However, server.lruclock can wrap around (it overflows once it would exceed REDIS_LRU_CLOCK_MAX), so an object's lru value may be numerically larger than server.lruclock; the computation therefore has to take the relative order of the two values into account.
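A small worked example of both branches (a standalone sketch using the same arithmetic; REDIS_LRU_CLOCK_RESOLUTION is assumed to be 10 seconds here, and the clock values are made up):

#include <stdio.h>

#define REDIS_LRU_BITS 24
#define REDIS_LRU_CLOCK_MAX ((1 << REDIS_LRU_BITS) - 1)
#define REDIS_LRU_CLOCK_RESOLUTION 10 /* assumed resolution, in seconds */

/* Same arithmetic as estimateObjectIdleTime(), with the two clock values
 * passed in explicitly so it can be exercised directly. */
static unsigned long idle_seconds(unsigned lruclock, unsigned obj_lru) {
    if (lruclock >= obj_lru)
        return (unsigned long)(lruclock - obj_lru) * REDIS_LRU_CLOCK_RESOLUTION;
    /* The clock wrapped past REDIS_LRU_CLOCK_MAX after the object was touched. */
    return ((unsigned long)(REDIS_LRU_CLOCK_MAX - obj_lru) + lruclock) *
           REDIS_LRU_CLOCK_RESOLUTION;
}

int main(void) {
    /* Normal case: the clock is ahead of the object's stamp. */
    printf("%lu\n", idle_seconds(1000, 400));                  /* 6000 */
    /* Wrapped case: the object was stamped near the maximum and the
     * clock has restarted from a small value: (5 + 5) * 10 = 100. */
    printf("%lu\n", idle_seconds(5, REDIS_LRU_CLOCK_MAX - 5));
    return 0;
}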
Summary
Redis does not manage all cached data on a single global linked list; instead it uses an approximate algorithm to emulate LRU eviction. In my view the reasons include:

- A global linked list costs memory: each node needs a pair of list pointers, an extra 16 bytes per key on a 64-bit system, which adds up quickly when there are many keys.
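A sketch of what a strict LRU would need per object (hypothetical structure, not Redis code), which is where the 16-byte figure comes from:

#include <stdio.h>

/* A strict LRU typically threads every object onto a doubly linked list
 * so it can be moved to the head in O(1) on every access. */
typedef struct lru_node {
    struct lru_node *prev;   /* 8 bytes on a 64-bit system */
    struct lru_node *next;   /* 8 bytes on a 64-bit system */
    void *value;             /* the cached object itself   */
} lru_node;

int main(void) {
    /* The two list pointers alone add 16 bytes of overhead per cached
     * object, before counting the cost of updating the list on every hit. */
    printf("per-object list overhead: %zu bytes\n", 2 * sizeof(void *));
    return 0;
}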