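Redis is an in-memory NoSQL database. When I read its source code a while ago, I was struck by how impressive the author's work is: the data structures in the source squeeze an extremely high utilization out of memory (PS: I read the Redis 5.0.5 source).
#redis db descriptor structure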
typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB: stores the familiar key-value pairs */
    dict *expires;              /* Timeout of keys with a timeout set: stores each key's expire time */
    int id;                     /* Database ID */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP): tracks blocking list operations */
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    long long avg_ttl;          /* Average TTL, just for stats */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;
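In Redis a db is defined as redisDb. Start with the first three fields of the definition: the familiar dictionary dict that stores the data, the expires dictionary that holds expire times, and the id of the db.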
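typedef struct redisObject {
    unsigned type:4;       // data type: string, list, hash, etc.
    unsigned encoding:4;   // how the data behind ptr is encoded: long integer, doubly linked list, ziplist, etc.
    unsigned lru:LRU_BITS; /* LRU time (relative to global lru_clock: the Unix
                            * timestamp in seconds, modulo 2^LRU_BITS) or
                            * LFU data (least significant 8 bits frequency: an
                            * 8-bit access counter that grows roughly
                            * logarithmically, and most significant 16 bits
                            * access time: a 16-bit minute count, which wraps
                            * after about 45 days). */
    int refcount;          // reference count
    void *ptr;             // pointer to the data
} robj;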
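To make the LFU half of the lru field more concrete, here is a minimal sketch of how 16 bits of minutes and an 8-bit counter can share 24 bits, and how a counter can grow "like a logarithm". All toy_* names, the initial value 5 and the log_factor parameter are assumptions of this sketch, not a copy of the Redis code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_LFU_INIT_VAL 5   /* assumed starting counter for a new key */

/* Pack a 16-bit minute count (high bits) and an 8-bit counter (low bits). */
uint32_t toy_lfu_pack(uint16_t minutes, uint8_t counter) {
    return ((uint32_t)minutes << 8) | counter;
}

uint16_t toy_lfu_minutes(uint32_t lru) { return (uint16_t)(lru >> 8); }
uint8_t toy_lfu_counter(uint32_t lru) { return (uint8_t)(lru & 0xFF); }

/* Probabilistic increment: the larger the counter already is, the less
 * likely one more access bumps it, so it grows roughly logarithmically
 * with the number of accesses and saturates at 255. */
uint8_t toy_lfu_log_incr(uint8_t counter, double log_factor) {
    assert(log_factor >= 0.0);
    if (counter == 255) return 255;
    double r = (double)rand() / ((double)RAND_MAX + 1.0);   /* r in [0, 1) */
    double base = (double)counter - TOY_LFU_INIT_VAL;
    if (base < 0) base = 0;
    double p = 1.0 / (base * log_factor + 1.0);
    if (r < p) counter++;
    return counter;
}
```

With a log factor of 10, the first hit above the initial value always counts, while later increments need more and more accesses on average. A 16-bit minute count covers 65536 minutes, which is about 45 days.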
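sds deserves a special mention. The figure below shows the layout used when storing a string. It introduces the sds (simple dynamic string) object, which is just an n-byte header followed by the bytes of the value. When the string holds a number, it is converted to and stored as an int. All of this shows the author's intent to use memory as compactly as possible.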
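As a rough illustration of the "header + value bytes" idea, here is a minimal sketch with a single fixed-size header placed right before the string bytes. Redis itself chooses among several header sizes depending on the string length; the toy_* names and the one-size header below are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the string bytes. */
typedef struct toy_sdshdr {
    uint32_t len;    /* bytes in use */
    uint32_t alloc;  /* bytes allocated for buf, excluding the terminator */
    char buf[];      /* value bytes, always NUL-terminated */
} toy_sdshdr;

typedef char *toy_sds;   /* callers only see a pointer to buf */

toy_sds toy_sds_new(const char *init, size_t len) {
    toy_sdshdr *h = malloc(sizeof(*h) + len + 1);
    assert(h != NULL);
    h->len = (uint32_t)len;
    h->alloc = (uint32_t)len;
    memcpy(h->buf, init, len);
    h->buf[len] = '\0';
    return h->buf;
}

/* Step back from the string pointer to its header. */
static toy_sdshdr *toy_sds_hdr(toy_sds s) {
    return (toy_sdshdr *)(s - offsetof(toy_sdshdr, buf));
}

/* O(1) and binary safe: the length comes from the header, not strlen(). */
size_t toy_sds_len(toy_sds s) { return toy_sds_hdr(s)->len; }

void toy_sds_free(toy_sds s) { if (s) free(toy_sds_hdr(s)); }
```

Because the pointer handed to callers points at buf, the value can still be passed to ordinary C string functions, while the length lookup never has to scan the bytes.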
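As the source relationships in the figure below show, the Redis dictionary is in effect a two-dimensional structure: an array of buckets, each holding a linked list of entries, so collisions are resolved by separate chaining. Note that two hash tables are used to store the data. The reason is rehashing: for performance, Redis uses an incremental rehashing strategy, where each request rehashes only a few entries until the whole job is done, and that requires two hash tables to hold the data in the meantime.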
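The incremental rehashing idea can be sketched in a few dozen lines. This is a toy model under my own assumptions (integer keys, a fixed multiplicative hash, one bucket migrated per operation, no duplicate check on insert), and every toy_* name is hypothetical; it is not the dict.c code.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_entry {
    long key, val;
    struct toy_entry *next;              /* separate chaining */
} toy_entry;

typedef struct toy_table {
    toy_entry **buckets;
    unsigned long size, used;
} toy_table;

typedef struct toy_dict {
    toy_table ht[2];                     /* ht[1] is only used while rehashing */
    long rehashidx;                      /* -1 when no rehash is in progress */
} toy_dict;

static unsigned long toy_hash(long key) { return (unsigned long)key * 2654435761UL; }

static void toy_table_init(toy_table *t, unsigned long size) {
    t->buckets = size ? calloc(size, sizeof(toy_entry *)) : NULL;
    assert(size == 0 || t->buckets != NULL);
    t->size = size;
    t->used = 0;
}

void toy_dict_init(toy_dict *d) {
    toy_table_init(&d->ht[0], 4);
    toy_table_init(&d->ht[1], 0);
    d->rehashidx = -1;
}

/* Migrate at most n non-empty buckets from ht[0] to ht[1]. */
void toy_rehash_step(toy_dict *d, int n) {
    if (d->rehashidx < 0) return;
    while (n-- > 0 && d->ht[0].used > 0) {
        while (d->ht[0].buckets[d->rehashidx] == NULL) d->rehashidx++;
        toy_entry *e = d->ht[0].buckets[d->rehashidx];
        while (e) {
            toy_entry *next = e->next;
            unsigned long i = toy_hash(e->key) % d->ht[1].size;
            e->next = d->ht[1].buckets[i];
            d->ht[1].buckets[i] = e;
            d->ht[0].used--;
            d->ht[1].used++;
            e = next;
        }
        d->ht[0].buckets[d->rehashidx++] = NULL;
    }
    if (d->ht[0].used == 0) {            /* finished: ht[1] becomes the main table */
        free(d->ht[0].buckets);
        d->ht[0] = d->ht[1];
        toy_table_init(&d->ht[1], 0);
        d->rehashidx = -1;
    }
}

toy_entry *toy_dict_find(toy_dict *d, long key) {
    toy_rehash_step(d, 1);               /* every request pays for a little rehashing */
    for (int t = 0; t <= 1; t++) {       /* a key may live in either table */
        if (d->ht[t].size == 0) continue;
        toy_entry *e = d->ht[t].buckets[toy_hash(key) % d->ht[t].size];
        for (; e; e = e->next) if (e->key == key) return e;
    }
    return NULL;
}

void toy_dict_add(toy_dict *d, long key, long val) {
    toy_rehash_step(d, 1);
    if (d->rehashidx < 0 && d->ht[0].used >= d->ht[0].size) {
        toy_table_init(&d->ht[1], d->ht[0].size * 2);    /* start a rehash */
        d->rehashidx = 0;
    }
    toy_table *t = d->rehashidx >= 0 ? &d->ht[1] : &d->ht[0];  /* new keys go to the new table */
    toy_entry *e = malloc(sizeof(*e));
    assert(e != NULL);
    e->key = key;
    e->val = val;
    unsigned long i = toy_hash(key) % t->size;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    t->used++;
}
```

The point of the second table shows up in toy_dict_find: while a rehash is in flight a key can be in either table, so both are checked, and no single request ever has to move the whole keyspace.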
#t_hash.c implementation of the hset command
int hashTypeSet(robj *o, sds field, sds value, int flags) {
    int update = 0;

    if (o->encoding == OBJ_ENCODING_ZIPLIST) {
        /* ... elided: update or insert the field/value pair ... */

        /* Convert the ziplist to a dict once it holds too many entries */
        if (hashTypeLength(o) > server.hash_max_ziplist_entries)
            hashTypeConvert(o, OBJ_ENCODING_HT);
    } else if (o->encoding == OBJ_ENCODING_HT) {
        /* ... elided: update or insert the field/value pair ... */
    } else {
        serverPanic("Unknown hash encoding");
    }

    /* Free SDS strings we did not reference elsewhere if the flags
     * want this function to be responsible. */
    if (flags & HASH_SET_TAKE_FIELD && field) sdsfree(field);
    if (flags & HASH_SET_TAKE_VALUE && value) sdsfree(value);
    return update;
}
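ziplist is a very memory-compact list in Redis, used to store lists whose elements are strings and integers. It supports push (append, i.e. insert at the tail), pop, delete and insert. All data is stored in little-endian format. If a string represents a number, it is automatically converted and stored as a number.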
zlbytes: 4 bytes, total number of bytes in the ziplist
zltail: 4 bytes, offset of the last entry, used for pop operations
zllen: 2 bytes, number of entries
zlend: 1 byte, end marker, fixed at 0xff
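To see how the header fields line up in memory, here is a small sketch that decodes them from a raw byte buffer. The zl_header type and the helper names are made up for this example; only the field sizes, the little-endian order and the 0xff end marker come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of the 10-byte header (hypothetical helper type). */
typedef struct zl_header {
    uint32_t zlbytes;   /* total bytes used by the ziplist */
    uint32_t zltail;    /* offset of the last entry */
    uint16_t zllen;     /* number of entries */
} zl_header;

/* Read little-endian integers byte by byte, independent of host order. */
static uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint16_t load_le16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Returns 1 and fills *h if buf looks like a well-formed ziplist, else 0. */
int zl_read_header(const unsigned char *buf, size_t buflen, zl_header *h) {
    assert(buf != NULL && h != NULL);
    if (buflen < 11) return 0;              /* 10-byte header + 1-byte end marker */
    h->zlbytes = load_le32(buf);
    h->zltail  = load_le32(buf + 4);
    h->zllen   = load_le16(buf + 8);
    if (h->zlbytes != buflen) return 0;
    if (h->zltail < 10 || h->zltail >= h->zlbytes) return 0;
    return buf[h->zlbytes - 1] == 0xFF;     /* zlend */
}
```

An empty ziplist is therefore 11 bytes: the header, with zltail pointing just past it, followed directly by the end marker.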
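1. Delete a kv: remove the kv directly and shift the memory after it forward
2. Update a kv: remove v (shifting the memory after it forward), then insert the new value at v's original position (shifting the memory after it back)
3. Add a kv: append directly at the tail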
The list object is implemented on top of quicklist, whose data structures are defined as follows:
typedef struct quicklistNode {
struct quicklistNode *prev;
struct quicklistNode *next;
unsigned char *zl;
unsigned int sz; /* ziplist size in bytes */
unsigned int count : 16; /* count of items in ziplist */
unsigned int encoding : 2; /* RAW==1 or LZF==2 */
unsigned int container : 2; /* NONE==1 or ZIPLIST==2 */
unsigned int recompress : 1; /* was this node previous compressed? */
unsigned int attempted_compress : 1; /* node can't compress; too small */
unsigned int extra : 10; /* more bits to steal for future usage */
} quicklistNode;
typedef struct quicklist {
quicklistNode *head;
quicklistNode *tail;
unsigned long count; /* total count of all entries in all ziplists */
unsigned long len; /* number of quicklistNodes */
int fill : 16; /* fill factor for individual nodes */
unsigned int compress : 16; /* depth of end nodes not to compress;0=off */
} quicklist;
#t_set.c set add operation
int setTypeAdd(robj *subject, sds value) {
    long long llval;
    if (subject->encoding == OBJ_ENCODING_HT) {
        /* ... elided ... */
    } else if (subject->encoding == OBJ_ENCODING_INTSET) {
        if (isSdsRepresentableAsLongLong(value,&llval) == C_OK) {
            /* ... elided: the value is an integer ... */
        } else {
            /* ... elided: the value is not an integer ... */
            serverAssert(dictAdd(subject->ptr,sdsdup(value),NULL) == DICT_OK);
            return 1;
        }
    } else {
        serverPanic("Unknown set encoding");
    }
    return 0;
}
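From the source, an intset simply stores the set's elements in an integer array. contents is an array whose element width changes dynamically: it starts out as an array of 16-bit elements, and when a value being stored does not fit in the current element width, the width is extended to 32 or 64 bits. This again reflects the author's drive to make the most of memory. Of course, growing and shrinking the array also costs time, so this is a trade of time for space.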
typedef struct intset {
    uint32_t encoding;  // element width: 16-bit, 32-bit or 64-bit
    uint32_t length;    // number of elements
    int8_t contents[];  // array holding the data
} intset;
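The width upgrade can be sketched as follows. This is a simplified model under my own assumptions: it copies into a fresh buffer on every insert and finds the position with a linear scan, whereas a real implementation would resize in place and search more cleverly. The toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct toy_intset {
    uint32_t encoding;   /* bytes per element: 2, 4 or 8 */
    uint32_t length;     /* number of elements */
    int8_t *contents;    /* length * encoding bytes, kept sorted */
} toy_intset;

/* Smallest element width (in bytes) able to hold v. */
static uint32_t width_for(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return 8;
    if (v < INT16_MIN || v > INT16_MAX) return 4;
    return 2;
}

static int64_t get_as(const int8_t *c, uint32_t pos, uint32_t enc) {
    if (enc == 8) { int64_t v; memcpy(&v, c + pos * 8, 8); return v; }
    if (enc == 4) { int32_t v; memcpy(&v, c + pos * 4, 4); return v; }
    int16_t v; memcpy(&v, c + pos * 2, 2); return v;
}

static void set_as(int8_t *c, uint32_t pos, uint32_t enc, int64_t val) {
    if (enc == 8) { memcpy(c + pos * 8, &val, 8); }
    else if (enc == 4) { int32_t v = (int32_t)val; memcpy(c + pos * 4, &v, 4); }
    else { int16_t v = (int16_t)val; memcpy(c + pos * 2, &v, 2); }
}

int64_t toy_intset_get(const toy_intset *s, uint32_t pos) {
    return get_as(s->contents, pos, s->encoding);
}

/* Returns 1 if val was added, 0 if it was already present. */
int toy_intset_add(toy_intset *s, int64_t val) {
    uint32_t need = width_for(val);
    uint32_t enc = need > s->encoding ? need : s->encoding;  /* upgrade only, never shrink */
    uint32_t pos = 0;
    while (pos < s->length && toy_intset_get(s, pos) < val) pos++;
    if (pos < s->length && toy_intset_get(s, pos) == val) return 0;

    /* Rewrite every element at the (possibly wider) encoding, leaving a gap at pos. */
    int8_t *grown = malloc((size_t)(s->length + 1) * enc);
    assert(grown != NULL);
    for (uint32_t i = 0; i < pos; i++)
        set_as(grown, i, enc, toy_intset_get(s, i));
    set_as(grown, pos, enc, val);
    for (uint32_t i = pos; i < s->length; i++)
        set_as(grown, i + 1, enc, toy_intset_get(s, i));
    free(s->contents);
    s->contents = grown;
    s->encoding = enc;
    s->length++;
    return 1;
}
```

Starting from `toy_intset s = {2, 0, NULL};`, small values keep 2 bytes per element, and the first value outside the 16-bit range forces every stored element to be rewritten at the wider size. That rewrite is the time cost mentioned above.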
#t_zset.c zset add operation
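int zsetAdd(robj *zobj, double score, sds ele, int *flags, double *newscore) {
    int incr = (*flags & ZADD_INCR) != 0;
    /* ... elided ... */
    if (zobj->encoding == OBJ_ENCODING_ZIPLIST) {
        /* ziplist encoding */
        /* ... elided: lookup and insertion of the element ... */
        /* Convert to the skiplist encoding when there are too many
         * entries or the element is too long */
        if (zzlLength(zobj->ptr) > server.zset_max_ziplist_entries)
            zsetConvert(zobj,OBJ_ENCODING_SKIPLIST);
        if (sdslen(ele) > server.zset_max_ziplist_value)
            zsetConvert(zobj,OBJ_ENCODING_SKIPLIST);
        if (newscore) *newscore = score;
        *flags |= ZADD_ADDED;
        return 1;
    } else if (zobj->encoding == OBJ_ENCODING_SKIPLIST) {
        /* skiplist encoding */
        /* ... elided ... */
    } else {
        serverPanic("Unknown sorted set encoding");
    }
    return 0; /* Never reached. */
}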
#zset definition, server.h
typedef struct zset {
dict *dict;
zskiplist *zsl;
} zset;
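To take this one step further: a skiplist is fast to search but memory hungry, so why not implement sorted sets with a btree instead? I found the author's answer to this. Put simply, a btree does save storage space, but it is also troublesome to maintain, and he does not need to pay that price to save such a small amount of memory.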
There are a few reasons:
1) They are not very memory intensive. It's up to you basically. Changing parameters about the probability of a node to have a given number of levels will make then less memory intensive than btrees.
2) A sorted set is often target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kind of balanced trees.
3) They are simpler to implement, debug, and so forth. For instance thanks to the skip list simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required little changes to the code.
About the Append Only durability & speed, I don't think it is a good idea to optimize Redis at cost of more code and more complexity for a use case that IMHO should be rare for the Redis target (fsync() at every command). Almost no one is using this feature even with ACID SQL databases, as the performance hint is big anyway.
About threads: our experience shows that Redis is mostly I/O bound. I'm using threads to serve things from Virtual Memory. The long term solution to exploit all the cores, assuming your link is so fast that you can saturate a single core, is running multiple instances of Redis (no locks, almost fully scalable linearly with number of cores), and using the "Redis Cluster" solution that I plan to develop in the future.
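Point 1) in the quote is about the probability that a node gets an extra level. A minimal sketch of that knob, assuming a hypothetical cap of 32 levels and made-up toy_* names: each extra level is granted with probability p, so a node carries 1 / (1 - p) forward pointers on average. As far as I can tell, Redis's own zslRandomLevel follows the same idea with p = 0.25.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_SKIPLIST_MAXLEVEL 32   /* assumed cap for this sketch */

/* Draw a node level: start at 1 and keep adding levels while a coin
 * with success probability p (0 < p < 1) keeps coming up heads. */
int toy_random_level(double p) {
    assert(p > 0.0 && p < 1.0);
    int level = 1;
    while (level < TOY_SKIPLIST_MAXLEVEL &&
           (double)rand() / ((double)RAND_MAX + 1.0) < p)
        level++;
    return level;
}

/* Average level over n draws, i.e. the average number of forward
 * pointers a node would carry. */
double toy_average_level(double p, int n) {
    assert(n > 0);
    long total = 0;
    for (int i = 0; i < n; i++) total += toy_random_level(p);
    return (double)total / n;
}
```

With p = 0.25 the average is about 1.33 pointers per node, and with p = 0.5 it is about 2, which is the memory trade-off the quote refers to.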
/* Stream item ID: a 128 bit number composed of a milliseconds time and
* a sequence counter. IDs generated in the same millisecond (or in a past
* millisecond if the clock jumped backward) will use the millisecond time
* of the latest generated ID and an incremented sequence. */
typedef struct streamID {
uint64_t ms; /* Unix time in milliseconds. */
uint64_t seq; /* Sequence number. */
} streamID;
typedef struct stream {
rax *rax; /* The radix tree holding the stream. */
uint64_t length; /* Number of elements inside this stream. */
streamID last_id; /* Zero if there are yet no items. */
rax *cgroups; /* Consumer groups dictionary: name -> streamCG */
} stream;
/* We define an iterator to iterate stream items in an abstract way, without
* caring about the radix tree + listpack representation. Technically speaking
* the iterator is only used inside streamReplyWithRange(), so could just
* be implemented inside the function, but practically there is the AOF
* rewriting code that also needs to iterate the stream to emit the XADD
* commands. */
typedef struct streamIterator {
stream *stream; /* The stream we are iterating. */
streamID master_id; /* ID of the master entry at listpack head. */
uint64_t master_fields_count; /* Master entries # of fields. */
unsigned char *master_fields_start; /* Master entries start in listpack. */
unsigned char *master_fields_ptr; /* Master field to emit next. */
int entry_flags; /* Flags of entry we are emitting. */
int rev; /* True if iterating end to start (reverse). */
uint64_t start_key[2]; /* Start key as 128 bit big endian. */
uint64_t end_key[2]; /* End key as 128 bit big endian. */
raxIterator ri; /* Rax iterator. */
unsigned char *lp; /* Current listpack. */
unsigned char *lp_ele; /* Current listpack cursor. */
unsigned char *lp_flags; /* Current entry flags pointer. */
/* Buffers used to hold the string of lpGet() when the element is
* integer encoded, so that there is no string representation of the
* element inside the listpack itself. */
unsigned char field_buf[LP_INTBUF_SIZE];
unsigned char value_buf[LP_INTBUF_SIZE];
} streamIterator;
/* Consumer group. */
typedef struct streamCG {
streamID last_id; /* Last delivered (not acknowledged) ID for this
group. Consumers that will just ask for more
messages will served with IDs > than this. */
rax *pel; /* Pending entries list. This is a radix tree that
has every message delivered to consumers (without
the NOACK option) that was yet not acknowledged
as processed. The key of the radix tree is the
ID as a 64 bit big endian number, while the
associated value is a streamNACK structure.*/
rax *consumers; /* A radix tree representing the consumers by name
and their associated representation in the form
of streamConsumer structures. */
} streamCG;
/* A specific consumer in a consumer group. */
typedef struct streamConsumer {
mstime_t seen_time; /* Last time this consumer was active. */
sds name; /* Consumer name. This is how the consumer
will be identified in the consumer group
protocol. Case sensitive. */
rax *pel; /* Consumer specific pending entries list: all
the pending messages delivered to this
consumer not yet acknowledged. Keys are
big endian message IDs, while values are
the same streamNACK structure referenced
in the "pel" of the conumser group structure
itself, so the value is shared. */
} streamConsumer;
/* Pending (yet not acknowledged) message in a consumer group. */
typedef struct streamNACK {
mstime_t delivery_time; /* Last time this message was delivered. */
uint64_t delivery_count; /* Number of times this message was delivered.*/
streamConsumer *consumer; /* The consumer this message was delivered to
in the last delivery. */
} streamNACK;
/* Stream propagation informations, passed to functions in order to propagate
* XCLAIM commands to AOF and slaves. */
typedef struct sreamPropInfo { // is this name a joke? (the tag is spelled "sream", not "stream")
robj *keyname;
robj *groupname;
} streamPropInfo;
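Suppose we want to store the three keys "foo", "foobar" and "footer". Let's first build the simplest possible radix tree. As you can see, every node of the tree stores the smallest possible symbol, a single character: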
/*
 *              (f) ""
 *                \
 *                (o) "f"
 *                  \
 *                  (o) "fo"
 *                    \
 *                  [t   b] "foo"
 *                  /     \
 *         "foot" (e)     (a) "foob"
 *                /         \
 *      "foote" (r)         (r) "fooba"
 *              /             \
 *    "footer" []             [] "foobar"
 */
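The uncompressed tree for "foo", "foobar" and "footer" can be sketched as a plain trie with one node per character. This is only an illustration under my own assumptions (lowercase ASCII keys, a 26-slot child array, hypothetical toy_* names); the rax implementation in Redis is laid out very differently.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct toy_trie_node {
    int iskey;                           /* 1 if the path to this node is a stored key */
    struct toy_trie_node *child[26];     /* lowercase 'a'..'z' only */
} toy_trie_node;

toy_trie_node *toy_trie_new(void) {
    toy_trie_node *n = calloc(1, sizeof(*n));
    assert(n != NULL);
    return n;
}

/* Insert key and return how many new nodes had to be allocated. */
int toy_trie_insert(toy_trie_node *root, const char *key) {
    int created = 0;
    toy_trie_node *n = root;
    for (; *key; key++) {
        assert(*key >= 'a' && *key <= 'z');
        int i = *key - 'a';
        if (!n->child[i]) {
            n->child[i] = toy_trie_new();
            created++;
        }
        n = n->child[i];
    }
    n->iskey = 1;
    return created;
}

/* Returns 1 only if key itself was inserted, not merely a prefix of one. */
int toy_trie_contains(const toy_trie_node *root, const char *key) {
    const toy_trie_node *n = root;
    for (; *key && n; key++) {
        if (*key < 'a' || *key > 'z') return 0;
        n = n->child[*key - 'a'];
    }
    return n != NULL && n->iskey;
}
```

Inserting the three keys allocates three nodes each, because "foobar" and "footer" share the "foo" prefix. Together with the root that gives the ten nodes of the diagram, and it also shows how many nodes exist only to carry a single character.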
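Redis's rax does not store the tree like that. Chains of nodes that have a single child are compressed into one node holding a string, so the same three keys become:
/*
 *                  ["foo"] ""
 *                     |
 *                  [t   b] "foo"
 *                  /     \
 *        "foot" ("er")    ("ar") "foob"
 *                 /          \
 *       "footer" []          [] "foobar"
 */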
If the key "first" is then inserted, the compressed "foo" node has to be split again:
/*
 *                    (f) ""
 *                    /
 *                 (i o) "f"
 *                /   \
 *    "firs"  ("rst")  (o) "fo"
 *              /        \
 *    "first" []       [t   b] "foo"
 *                     /     \
 *           "foot" ("er")    ("ar") "foob"
 *                    /          \
 *          "footer" []          [] "foobar"
 */