记一次redis读取超时的排查过程（SADD惹的祸）

问题背景

在业务使用redis过程中，出现了read timeout 的异常。

问题排查

直接原因

运维查询redis慢查询日志，发现在异常时间节点，有redis慢查询日志，执行sadd 命令花费了1秒钟。但由于redis是单线程应用，执行单条命令的阻塞，会造成其他命令的排队等候，导致read timeout。

深入排查-为什么sadd这么慢呢

为什么sadd这么慢呢？查阅redis文档看到，sadd操作的复杂度是O（1）的，实际使用本机docker搭建redis进行测试，使用脚本进行sadd，直到800W以上的量级才偶尔出现100毫秒以上的情况。（测试过程详见后面）

搭建redis环境

偷懒在本机就行测试，使用docker跑起了redis应用，过程如下：

docker pull redis # 使用redis3.x版本
docker run  -itv ~/redis.conf:/redis.conf -p 32768:6379 --name myredis5 -d redis redis-server /redis.conf

测试脚本

#coding=utf-8

import time
import redis
import random


r = redis.Redis(host='x.x.x.x', port=xxxx, decode_responses=True)   

k = 'key4'

tarr = []

st = time.clock()

st2 = time.clock()
r.sadd(k, 1) # 创建连接也会有耗时

for i in range(1, 1600000):
    t1 = time.clock() * 1000
    rn = random.randint(100000000000, 20000000000000)
    r.sadd(k, rn)
    t2 = time.clock() * 1000
    c = t2 - t1
    tarr.append(str(c))
    if c > 100:
        print i, c

print time.clock()

s = "\n".join(tarr)
with open('./result.txt', 'w') as f:
    f.write(s)

测试结果

到达800W的时候开始偶尔出现sadd需要100ms的现象。

问题分析

查询了很多资料，无意中看到redis del操作复杂度为O（n），这里补充一下超时的更多背景，举例如下：

慢查询日志时间：16号00点00分01秒，命令为sadd prefix_20180215，且key有过期时间。

看到这里答案已经呼之欲出，是不是sadd触发了redis是过期删除操作，同时由于del命令的复杂度为O（n），时间花在了删除过期数据上。

测试重现

for i in range(1, 1000000):
    t1 = time.clock() * 1000
    rn = random.randint(100000000000, 20000000000000)
    r.sadd(k, rn)
    t2 = time.clock() * 1000
    c = t2 - t1
    tarr.append(str(c))
    if c > 100:
        print i, c

x = int(time.time())
x += 10 #延时10每秒过期
r.expire(k, 10)


while True: 
    y = time.time()
    t1 = time.clock() * 1000
    rn = random.randint(1, 1000000000)
    r.sadd(k, rn)
    t2 = time.clock() * 1000
    tarr.append(str(c))
    if c > 100:#复现sadd慢查询的情况
        print i, c

    if y > x + 5: # 超时时间就break
        break

print time.clock()

重现的步骤很简单，

给某个key sadd上足够的数据（百万级）
给key设置一个相对过期时间。
持续调用sadd命令，记录调用时间。
最后观察redis的慢查询日志。

如猜想一样，慢查询日志中出现了SADD命令，耗时1秒。

解决方案与总结

由于redis 对于集合键的del操作复杂度均为O（n），所以对于集合键，最好设置通过分片，避免单个key的值过大。

另外，redis4.0已经通过配置支持延时删除，可以通过lazyfree_lazy_expire/azyfree_lazy_eviction/lazyfree_lazy_server_del 来实现异步删除的操作，避免异步阻塞

延伸阅读

最后，让我们来看看redis3.x和4.x处理删除key的源码吧。

redis 有三种淘汰key的机制，分别为

del命令

被动淘汰（当请求命令对应的键过期时进行删除）
主动删除（redis主动对键进行淘汰，回收内存）

我们先看看redis3.x版本中上面三种淘汰机制的入口代码。

del命令 - delCommand

void delCommand(client *c) {
    int deleted = 0, j;

    for (j = 1; j < c->argc; j++) {
        expireIfNeeded(c->db,c->argv[j]);
        if (dbDelete(c->db,c->argv[j])) {
            signalModifiedKey(c->db,c->argv[j]);
            notifyKeyspaceEvent(NOTIFY_GENERIC,
                "del",c->argv[j],c->db->id);
            server.dirty++;
            deleted++;
        }
    }
    addReplyLongLong(c,deleted);
}

处理流程相当的简单，先检查键是否过期，然后调用dbDelete进行删除

被动淘汰 - expireIfNeeded

int expireIfNeeded(redisDb *db, robj *key) {
    mstime_t when = getExpire(db,key);  //获取过期时间
    mstime_t now;

    if (when < 0) return 0; /* No expire for this key */

    /* Don't expire anything while loading. It will be done later. */
    if (server.loading) return 0;

    /* If we are in the context of a Lua script, we claim that time is
     * blocked to when the Lua script started. This way a key can expire
     * only the first time it is accessed and not in the middle of the
     * script execution, making propagation to slaves / AOF consistent.
     * See issue #1525 on Github for more information. */
    now = server.lua_caller ? server.lua_time_start : mstime(); // 过去当前时间

    /* If we are running in the context of a slave, return ASAP:
     * the slave key expiration is controlled by the master that will
     * send us synthesized DEL operations for expired keys.
     *
     * Still we try to return the right information to the caller,
     * that is, 0 if we think the key should be still valid, 1 if
     * we think the key is expired at this time. */
    if (server.masterhost != NULL) return now > when;

    /* Return when this key has not expired */
    if (now <= when) return 0; 

    /* Delete the key */
    server.stat_expiredkeys++;
    propagateExpire(db,key); // 把过期时间传递出去（从库、AOF备份等）
    notifyKeyspaceEvent(NOTIFY_EXPIRED, 
        "expired",key,db->id); // 对db内的键发生的变动进行通知，适用于pubsub 通过pubsub来传递消息，可以用来作为redis的执行监控
    return dbDelete(db,key);
}

主动淘汰 - serverCron

server.c文件

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
   
   /**
    * sth not important
    */
    ...
    
    /* We need to do a few operations on clients asynchronously. */
    clientsCron();

    /* Handle background operations on Redis databases. */
    databasesCron();

    /**
      * sth not important
      */ 
      ...
    
    server.cronloops++;
    return 1000/server.hz;
}

/* This function handles 'background' operations we are required to do
 * incrementally in Redis databases, such as active key expiring, resizing,
 * rehashing. */
void databasesCron(void) {
    /* Expire keys by random sampling. Not required for slaves
     * as master will synthesize DELs for us. */
    if (server.active_expire_enabled && server.masterhost == NULL)
        activeExpireCycle(ACTIVE_EXPIRE_CYCLE_SLOW);
    /**
     * sth not important 
     */
     
}

/* Try to expire a few timed out keys. The algorithm used is adaptive and
 * will use few CPU cycles if there are few expiring keys, otherwise
 * it will get more aggressive to avoid that too much memory is used by
 * keys that can be removed from the keyspace.
 *
 * No more than CRON_DBS_PER_CALL databases are tested at every
 * iteration.
 *
 * This kind of call is used when Redis detects that timelimit_exit is
 * true, so there is more work to do, and we do it more incrementally from
 * the beforeSleep() function of the event loop.
 *
 * Expire cycle type:
 *
 * If type is ACTIVE_EXPIRE_CYCLE_FAST the function will try to run a
 * "fast" expire cycle that takes no longer than EXPIRE_FAST_CYCLE_DURATION
 * microseconds, and is not repeated again before the same amount of time.
 *
 * If type is ACTIVE_EXPIRE_CYCLE_SLOW, that normal expire cycle is
 * executed, where the time limit is a percentage of the REDIS_HZ period
 * as specified by the REDIS_EXPIRELOOKUPS_TIME_PERC define. */

void activeExpireCycle(int type) {

    int dbs_per_call = CRON_DBS_PER_CALL;

    /* We usually should test CRON_DBS_PER_CALL per iteration, with
     * two exceptions:
     *
     * 1) Don't test more DBs than we have.
     * 2) If last time we hit the time limit, we want to scan all DBs
     * in this iteration, as there is work to do in some DB and we don't want
     * expired keys to use memory for too much time. */
    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum; //每次清理扫描的数据库数

    /* We can use at max ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC percentage of CPU time
     * per iteration. Since this function gets called with a frequency of
     * server.hz times per second, the following is the max amount of
     * microseconds we can spend in this function. */
    timelimit = 1000000*ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC/server.hz/100;
    timelimit_exit = 0;
    if (timelimit <= 0) timelimit = 1;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST)
        timelimit = ACTIVE_EXPIRE_CYCLE_FAST_DURATION; /* in microseconds. */

    for (j = 0; j < dbs_per_call; j++) {
        int expired;
        redisDb *db = server.db+(current_db % server.dbnum);

        /* Increment the DB now so we are sure if we run out of time
         * in the current DB we'll restart from the next. This allows to
         * distribute the time evenly across DBs. */
        current_db++;

        /* Continue to expire if at the end of the cycle more than 25%
         * of the keys were expired. */
         // 如果有超过25%的键过期了则继续扫描
        do {
            unsigned long num, slots;
            long long now, ttl_sum;
            int ttl_samples;

            /* If there is nothing to expire try next DB ASAP. */
            if ((num = dictSize(db->expires)) == 0) { //当前没有需要过期的键
                db->avg_ttl = 0;
                break;
            }
            slots = dictSlots(db->expires);
            now = mstime();

            /* When there are less than 1% filled slots getting random
             * keys is expensive, so stop here waiting for better times...
             * The dictionary will be resized asap. */
            if (num && slots > DICT_HT_INITIAL_SIZE &&
                (num*100/slots < 1)) break;

            /* The main collection cycle. Sample random keys among keys
             * with an expire set, checking for expired ones. */
            expired = 0;
            ttl_sum = 0;
            ttl_samples = 0;

            if (num > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP) 
                num = ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP; // 3.2.11为20次

            while (num--) {
                dictEntry *de;
                long long ttl;

                if ((de = dictGetRandomKey(db->expires)) == NULL) break; //随机获取一个键
                ttl = dictGetSignedIntegerVal(de)-now;
                if (activeExpireCycleTryExpire(db,de,now)) expired++;
                if (ttl > 0) {
                    /* We want the average TTL of keys yet not expired. */
                    ttl_sum += ttl;
                    ttl_samples++;
                }
            }

            /**
             * 这里有一些控制删除时间的逻辑和其他逻辑。
             */

            if (timelimit_exit) return;
            /* We don't repeat the cycle if there are less than 25% of keys
             * found expired in the current DB. */
        } while (expired > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4); // 20次 / 4 
    }
}

/* ======================= Cron: called every 100 ms ======================== */

/* Helper function for the activeExpireCycle() function.
 * This function will try to expire the key that is stored in the hash table
 * entry 'de' of the 'expires' hash table of a Redis database.
 *
 * If the key is found to be expired, it is removed from the database and
 * 1 is returned. Otherwise no operation is performed and 0 is returned.
 *
 * When a key is expired, server.stat_expiredkeys is incremented.
 *
 * The parameter 'now' is the current time in milliseconds as is passed
 * to the function to avoid too many gettimeofday() syscalls. */
int activeExpireCycleTryExpire(redisDb *db, dictEntry *de, long long now) {
    long long t = dictGetSignedIntegerVal(de);
    if (now > t) {
        sds key = dictGetKey(de);
        robj *keyobj = createStringObject(key,sdslen(key));

        propagateExpire(db,keyobj);
        dbDelete(db,keyobj);
        notifyKeyspaceEvent(NOTIFY_EXPIRED,
            "expired",keyobj,db->id);
        decrRefCount(keyobj);
        server.stat_expiredkeys++;
        return 1;
    } else {
        return 0;
    }
}

主动删除的调用路径为serverCron -> databasesCron -> activeExpireCycle -> activeExpireCycleTryExpire，我们主要看看activeExpireCycleTryExpire。

主动淘汰是通过随机采样来进行删除的，随机的算法很简单，就是通过random来进行的，先random出slot，然后random出slot上的链表中的某个节点。另外会根据删除时间长短和过期键的数量来决定一次主动淘汰的扫描db数量和次数。

顺带说说，serverCron是redis的周期任务，通过定时器注册，databasesCron除了主动淘汰键，还会做rehash、resize等事情。

底层调用

三种机制虽然不同，但他们调用的底层都是相同的——dbDelete方法。

db.c 文件


/* Delete a key, value, and associated expiration entry if any, from the DB */
int dbDelete(redisDb *db, robj *key) {
    /* Deleting an entry from the expires dict will not free the sds of
     * the key, because it is shared with the main dictionary. */
    if (dictSize(db->expires) > 0) dictDelete(db->expires,key->ptr);
    if (dictDelete(db->dict,key->ptr) == DICT_OK) {
        if (server.cluster_enabled) slotToKeyDel(key);
        return 1;
    } else {
        return 0;
    }
}

dict.c文件

int dictDelete(dict *ht, const void *key) {
    return dictGenericDelete(ht,key,0);
}

/* Search and remove an element */
static int dictGenericDelete(dict *d, const void *key, int nofree)
{
    unsigned int h, idx;
    dictEntry *he, *prevHe;
    int table;

    if (d->ht[0].size == 0) return DICT_ERR; /* d->ht[0].table is NULL */
    if (dictIsRehashing(d)) _dictRehashStep(d);
    h = dictHashKey(d, key);

    for (table = 0; table <= 1; table++) {
        idx = h & d->ht[table].sizemask;
        he = d->ht[table].table[idx];
        prevHe = NULL;
        while(he) {
            if (key==he->key || dictCompareKeys(d, key, he->key)) {
                /* Unlink the element from the list */
                if (prevHe)
                    prevHe->next = he->next;
                else
                    d->ht[table].table[idx] = he->next;
                if (!nofree) {
                    dictFreeKey(d, he);
                    dictFreeVal(d, he);
                }
                zfree(he);
                d->ht[table].used--;
                return DICT_OK;
            }
            prevHe = he;
            he = he->next;
        }
        if (!dictIsRehashing(d)) break;
    }
    return DICT_ERR; /* not found */
}

/* ------------------------------- Macros ------------------------------------*/
#define dictFreeVal(d, entry) \
    if ((d)->type->valDestructor) \
        (d)->type->valDestructor((d)->privdata, (entry)->v.val)

server.c

/* Db->dict, keys are sds strings, vals are Redis objects. */
dictType dbDictType = {
    dictSdsHash,                /* hash function */
    NULL,                       /* key dup */
    NULL,                       /* val dup */
    dictSdsKeyCompare,          /* key compare */
    dictSdsDestructor,          /* key destructor */
    dictObjectDestructor   /* val destructor */
};

void dictObjectDestructor(void *privdata, void *val)
{
    DICT_NOTUSED(privdata);

    if (val == NULL) return; /* Values of swapped out keys as set to NULL */
    decrRefCount(val);
}

object.c

void decrRefCount(robj *o) {
    if (o->refcount <= 0) serverPanic("decrRefCount against refcount <= 0");
    if (o->refcount == 1) {
        switch(o->type) {
        case OBJ_STRING: freeStringObject(o); break;
        case OBJ_LIST: freeListObject(o); break;
        case OBJ_SET: freeSetObject(o); break;
        case OBJ_ZSET: freeZsetObject(o); break;
        case OBJ_HASH: freeHashObject(o); break;
        default: serverPanic("Unknown object type"); break;
        }
        zfree(o);
    } else {
        o->refcount--;
    }
}

void freeSetObject(robj *o) {
    switch (o->encoding) {
    case OBJ_ENCODING_HT:
        dictRelease((dict*) o->ptr);
        break;
    case OBJ_ENCODING_INTSET:
        zfree(o->ptr);
        break;
    default:
        serverPanic("Unknown set encoding type");
    }
}

可以看到核心的删除是在dictFreeVal里，对应了一个宏，这个宏调用的是对应dictType的 valDestructor，也就是dbDictType里指定的dictObjectDestructor函数，对应的删除操作在decrRefCount（严格来说是通过引用计数来管理声明周期）

decrRefCount内对每种数据类型有对应的释放方法，我们来看set的释放方法freeSetObject方法。根据Set的两种数据类型有两种处理方式，intset只需要释放指针就好了，如果是哈希表则调用dictRelease方法。

dict.c

/* Clear & Release the hash table */
void dictRelease(dict *d)
{
    _dictClear(d,&d->ht[0],NULL);
    _dictClear(d,&d->ht[1],NULL);
    zfree(d);
}

/* Destroy an entire dictionary */
int _dictClear(dict *d, dictht *ht, void(callback)(void *)) {
    unsigned long i;

    /* Free all the elements */
    for (i = 0; i < ht->size && ht->used > 0; i++) {
        dictEntry *he, *nextHe;

        if (callback && (i & 65535) == 0) callback(d->privdata);

        if ((he = ht->table[i]) == NULL) continue;
        while(he) {
            nextHe = he->next;
            dictFreeKey(d, he);
            dictFreeVal(d, he);
            zfree(he);
            ht->used--;
            he = nextHe;
        }
    }
    /* Free the table and the allocated cache structure */
    zfree(ht->table);
    /* Re-initialize the table */
    _dictReset(ht);
    return DICT_OK; /* never fails */
}

至此（dictClear方法）我们可以看到这是一个O（N）的过程，需要遍历ht每一个元素并进行删除，所以都存在阻塞redis的风险。（即使是主动淘汰的机制）

这一点在redis4.x系列已经通过延迟删除解决。

记一次redis读取超时的排查过程（SADD惹的祸）