glusterfs文件系统是一个分布式的文件系统,但是与很多分布式文件系统不一样,它没有元数服务器,听说swift上也是应用了这个技术的。glusterfs中每个xlator的配置信息都是用dict进行管理的。dict这玩意儿,说白了就是一个hash表,是一个key/value的内存数据库。今天花了点时间慢慢研究了glusterfs中的设计,觉得还是挺有意思的。
上篇博客介绍了glusterfs文件系统的内存池的设计,而glusterfs的内存池正应用在这项技术上。首先,glusterfsd在程序初始化时,就建立了三个池dict_pool、dict_pair_pool、dict_data_pool。接下来看看它是怎么玩这三个内存池的呢!
1、在使用dict之前,首先是建立dict对象,这点是面向对象的思想吧。
1 dict_t * 2 get_new_dict (void) 3 { 4 return get_new_dict_full (1); 5 }
glusterfs调用get_new_dict来建立一个dict对象,接下来看看get_new_dict又做了什么呢?
1 dict_t * 2 get_new_dict_full (int size_hint) 3 { 4 dict_t *dict = mem_get0 (THIS->ctx->dict_pool); 5 6 if (!dict) { 7 return NULL; 8 } 9 10 dict->hash_size = size_hint; 11 if (size_hint == 1) { 12 /* 13 * This is the only case we ever see currently. If we ever 14 * need to support resizing the hash table, the resize function 15 * will have to take into account the possibility that 16 * "members" is not separately allocated (i.e. don't just call 17 * realloc() blindly. 18 */ 19 dict->members = &dict->members_internal; 20 } 21 else { 22 /* 23 * We actually need to allocate space for size_hint *pointers* 24 * but we actually allocate space for one *structure*. Since 25 * a data_pair_t consists of five pointers, we're wasting four 26 * pointers' worth for N=1, and will overrun what we allocated 27 * for N>5. If anybody ever starts using size_hint, we'll need 28 * to fix this. 29 */ 30 GF_ASSERT (size_hint <= 31 (sizeof(data_pair_t) / sizeof(data_pair_t *))); 32 dict->members = mem_get0 (THIS->ctx->dict_pair_pool); 33 if (!dict->members) { 34 mem_put (dict); 35 return NULL; 36 } 37 } 38 39 LOCK_INIT (&dict->lock); 40 41 return dict; 42 }
size_hint是要分配的字典的大小。当 size_hint为1时,字典内的数据将是一个链表(用链表解决HASH冲突问题)。
接下来看看程序又将是如何向字典中添加一项数据的呢?首先还是来看看dict_t 的数据结构吧:
1 struct _dict { 2 unsigned char is_static:1; 3 int32_t hash_size; 4 int32_t count; 5 int32_t refcount; 6 data_pair_t **members; 7 data_pair_t *members_list; 8 char *extra_free; 9 char *extra_stdfree; 10 gf_lock_t lock; 11 data_pair_t *members_internal; 12 data_pair_t free_pair; 13 gf_boolean_t free_pair_in_use; 14 };
在dict_t中有一个lock子成员,每次操作dict_t对象时,首先要对它进行加锁:
int32_t dict_add (dict_t *this, char *key, data_t *value) { int32_t ret; if (!this || !value) { gf_log_callingfn ("dict", GF_LOG_WARNING, "!this || !value for key=%s", key); return -1; } LOCK (&this->lock); ret = _dict_set (this, key, value, 0); UNLOCK (&this->lock); return ret; }
不得不说glusterfs的编码风格还是挺漂亮的,它把一些细节与核心点分的很清楚,代码看上去那个爽啊!!看上面的代码:打日志与加锁放一在一块,核心的处理将在_dict_set中处理。
1 static int32_t 2 _dict_set (dict_t *this, char *key, data_t *value, gf_boolean_t replace) 3 { 4 int hashval; 5 data_pair_t *pair; 6 char key_free = 0; 7 int tmp = 0; 8 int ret = 0; 9 10 if (!key) { 11 ret = gf_asprintf (&key, "ref:%p", value); 12 if (-1 == ret) { 13 gf_log ("dict", GF_LOG_WARNING, "asprintf failed %s", key); 14 return -1; 15 } 16 key_free = 1; 17 } 18 19 tmp = SuperFastHash (key, strlen (key)); 20 hashval = (tmp % this->hash_size); 21 22 /* Search for a existing key if 'replace' is asked for */ 23 if (replace) { 24 pair = _dict_lookup (this, key); 25 26 if (pair) { 27 data_t *unref_data = pair->value; 28 pair->value = data_ref (value); 29 data_unref (unref_data); 30 if (key_free) 31 GF_FREE (key); 32 /* Indicates duplicate key */ 33 return 0; 34 } 35 } 36 37 if (this->free_pair_in_use) { 38 pair = mem_get0 (THIS->ctx->dict_pair_pool); 39 if (!pair) { 40 if (key_free) 41 GF_FREE (key); 42 return -1; 43 } 44 } 45 else { 46 pair = &this->free_pair; 47 this->free_pair_in_use = _gf_true; 48 } 49 50 if (key_free) { 51 /* It's ours. Use it. */ 52 pair->key = key; 53 key_free = 0; 54 } 55 else { 56 pair->key = (char *) GF_CALLOC (1, strlen (key) + 1, 57 gf_common_mt_char); 58 if (!pair->key) { 59 if (pair == &this->free_pair) { 60 this->free_pair_in_use = _gf_false; 61 } 62 else { 63 mem_put (pair); 64 } 65 return -1; 66 } 67 strcpy (pair->key, key); 68 } 69 pair->value = data_ref (value); 70 71 pair->hash_next = this->members[hashval]; 72 this->members[hashval] = pair; 73 74 pair->next = this->members_list; 75 pair->prev = NULL; 76 if (this->members_list) 77 this->members_list->prev = pair; 78 this->members_list = pair; 79 this->count++; 80 81 if (key_free) 82 GF_FREE (key); 83 return 0; 84 }
19行利用HASH算法计算HASH值,20行缩小HASH值的范围,23行到了35行为替换处理。37-48行是让我最难受的代码,这个地方不知道是不是设计的问题。55行之后是插入新的HASH键值的操作。
再看看查询的操作吧。
1 data_t * 2 dict_get (dict_t *this, char *key) 3 { 4 data_pair_t *pair; 5 6 if (!this || !key) { 7 gf_log_callingfn ("dict", GF_LOG_INFO, 8 "!this || key=%s", (key) ? key : "()"); 9 return NULL; 10 } 11 12 LOCK (&this->lock); 13 14 pair = _dict_lookup (this, key); 15 16 UNLOCK (&this->lock); 17 18 if (pair) 19 return pair->value; 20 21 return NULL; 22 }
同样是先处理锁之类的杂项操作,_dict_lookup才是真正的始作俑者。
1 static data_pair_t * 2 _dict_lookup (dict_t *this, char *key) 3 { 4 if (!this || !key) { 5 gf_log_callingfn ("dict", GF_LOG_WARNING, 6 "!this || !key (%s)", key); 7 return NULL; 8 } 9 10 int hashval = SuperFastHash (key, strlen (key)) % this->hash_size; 11 data_pair_t *pair; 12 13 for (pair = this->members[hashval]; pair != NULL; pair = pair->hash_next) { 14 if (pair->key && !strcmp (pair->key, key)) 15 return pair; 16 } 17 18 return NULL; 19 }
查询的代码相当的简单吧,计算一个哈希值,再查询一个链表就OK了。
查看了glusterfs中的所有代码,glusterfs_new_dict_full调用时几乎都是传入参数1,只有dict_copy接口比较特别:
1 dict_t * 2 dict_copy (dict_t *dict, 3 dict_t *new) 4 { 5 if (!dict) { 6 gf_log_callingfn ("dict", GF_LOG_WARNING, "dict is NULL"); 7 return NULL; 8 } 9 10 if (!new) 11 new = get_new_dict_full (dict->hash_size); 12 13 dict_foreach (dict, _copy, new); 14 15 return new; 16 }
从代码上看,只有此处才发挥了HASH表的作用,其它的都只是把dict_t当成链表来使用。而且这个地方也并不是用HASH表的思想,只是把一个链表转换成了HASH表。这个是我在glusterfs中见到的一处最不明智的地方。