Python dictionary implementation

This post describes how dictionaries are implemented in the Python language.

Dictionaries are indexed by keys and can be seen as associative arrays. Let's add three key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

The values can be accessed this way:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

The key 'd' does not exist, so a KeyError exception is raised.

Hash tables

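Python dictionaries are implemented using hash tables. A hash table is an array whose indexes are obtained by applying a hash function to the keys. The goal of a hash function is to distribute the keys evenly in the array. A good hash function minimizes the number of collisions, i.e. different keys having the same hash. Python does not have this kind of hash function. Its most important hash functions (for strings and integers) are very regular in common cases, as this Python 2 session shows: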

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

We are going to assume that we are using strings as keys for the rest of this post. The hash function for strings in Python is defined as:

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash('a') in Python, it will execute string_hash() and return 12416037344. Here we assume we are using a 64-bit machine.
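As a rough check of the pseudocode above, here is a minimal Python 3 sketch that emulates it with 64-bit arithmetic. The function name mirrors the pseudocode, while the masking, the Latin-1 encoding and the handling of the empty string are assumptions made for this illustration, not code taken from CPython.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumption: a C long is 64 bits wide


def string_hash(s):
    """Emulate the string_hash pseudocode for a 64-bit machine (sketch only)."""
    if not s:
        return 0  # assumption: the empty string hashes to 0
    data = s.encode("latin-1")  # one byte per character, like a Python 2 str
    x = (data[0] << 7) & MASK64
    for byte in data:
        # multiply, xor with the current character, wrap to 64 bits
        x = ((1000003 * x) ^ byte) & MASK64
    x ^= len(data)
    # reinterpret the 64-bit pattern as a signed value, like a C long
    if x >= 1 << 63:
        x -= 1 << 64
    return x


print(string_hash("a"))  # 12416037344
```

The caching step of the pseudocode is left out because this sketch has no string object to store the hash on.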

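If an array of size x is used to store the key/value pairs, then we use a mask equal to x-1 to calculate the slot index of the pair in the array. This makes the computation of the slot index fast. The probability of finding an empty slot is high because of the resizing mechanism described below, so a simple computation makes sense in most cases. If the size of the array is 8, the index for 'a' is hash('a') & 7 = 0. The index for 'b' is 3, the index for 'c' is 2, and the index for 'z' is 3, which is the same as 'b': here we have a collision.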
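The mask computation can be tried directly. This small sketch uses the 64-bit hash values quoted in this post; the helper name slot_index is made up for the illustration.

```python
# Hash values for single-character keys, as quoted in this post (64-bit machine).
hashes = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}


def slot_index(hash_value, table_size):
    """Slot index for a table whose size is a power of two (hypothetical helper)."""
    mask = table_size - 1
    return hash_value & mask


slots = {key: slot_index(h, 8) for key, h in hashes.items()}
print(slots)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}

# For a power-of-two size, the mask gives the same result as the modulo.
assert all(h & 7 == h % 8 for h in hashes.values())
```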

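The Python hash function does a good job when the keys are consecutive, which is useful because this kind of data is very common. However, once we add the key 'z', there is a collision because it is not consecutive enough with the other keys.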

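We could use a linked list to store the pairs having the same hash, but it would increase the lookup time, which would no longer be O(1) on average. The next section describes the collision resolution method used in Python dictionaries.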

Open addressing

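Open addressing is a method of collision resolution where probing is used. In the case of 'z', the slot index 3 is already used in the array, so we need to probe for a different index to find one that is not already used. Adding a key/value pair is O(1) on average, and so is the lookup operation.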

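A probing sequence is used to find a free slot. It is often described as quadratic probing, although the recurrence below is really a pseudo-random walk over the table indexes. The code is the following: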

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

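Recurring on 5*j+1 quickly magnifies small differences in the bits that did not affect the initial index. The variable perturb brings the other bits of the hash code into play.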
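The recurrence can be sketched as a Python generator. This is an illustration under a few assumptions: the hash is non-negative, perturb starts as the full hash value, PERTURB_SHIFT is 5, and the function name probe_slots is invented for this example.

```python
from itertools import islice

PERTURB_SHIFT = 5  # assumption: the value used by dictobject.c


def probe_slots(hash_value, table_size):
    """Yield the slot indexes visited for a non-negative hash (sketch only)."""
    mask = table_size - 1
    j = hash_value & mask
    perturb = hash_value
    yield j
    while True:
        j = 5 * j + 1 + perturb
        perturb >>= PERTURB_SHIFT
        yield j & mask  # same as j % 2**i for a table of size 2**i


# Hash of 'z' as quoted in this post, table of size 8.
print(list(islice(probe_slots(15616046971, 8), 4)))  # [3, 3, 3, 5]
```

With this hash the probe lands on slot 3 several times before reaching slot 5, because the low bits of perturb only change as it is shifted.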

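Just out of curiosity, let's look at the probing sequence when the table size is 32 and j = 3 (the exact sequence also depends on the hash value through perturb).
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…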

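You can learn more about this probing sequence by looking at the source code of [dictobject.c](http://svn.python.org/projects/python/trunk/Objects/dictobject.c). A detailed explanation of the probing mechanism can be found at the top of the file.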

Now, let's look at the Python internal code along with an example.

Dictionary C structures

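The following C structure is used to store a dictionary entry: a key/value pair. The hash, key and value are stored. PyObject is the base class of the Python objects.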

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

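The following structure represents a dictionary.

ma_fill is the number of used slots + dummy slots. A slot is marked dummy when a key/value pair is removed.

ma_used is the number of used (active) slots.

ma_mask is equal to the array's size minus 1 and is used to calculate the slot index.

ma_table is the array.

ma_smalltable is the initial array of size 8.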

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

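When you first create a dictionary, the function PyDict_New() is called. I removed some of the lines and converted the C code to pseudocode to concentrate on the key concepts.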

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object
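To make the initial state concrete, here is a small Python sketch of what the pseudocode sets up. The class name DictSketch is hypothetical; the attribute names are borrowed from the C structure above, and a plain list stands in for ma_smalltable.

```python
PyDict_MINSIZE = 8  # size of the initial table, as in ma_smalltable


class DictSketch:
    """Hypothetical Python mirror of a freshly created dictionary object."""

    def __init__(self):
        self.ma_fill = 0                          # used slots + dummy slots
        self.ma_used = 0                          # active slots
        self.ma_mask = PyDict_MINSIZE - 1         # 7
        self.ma_table = [None] * PyDict_MINSIZE   # cleared table


d = DictSketch()
print(d.ma_mask, len(d.ma_table))  # 7 8
```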

Adding items

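When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair. It checks whether the key is a string and calculates the hash, or reuses the cached one if it exists. insertdict() is called to add the new key/value pair, and the dictionary is then resized if the number of used slots + dummy slots is greater than 2/3 of the array's size.
Why 2/3? It is to make sure the probing sequence can find a free slot fast enough. We will look at the resizing function later.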

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table
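The "capacity over 2/3" test in the pseudocode can be written with integer arithmetic only. The helper name over_two_thirds is an assumption made for this sketch.

```python
def over_two_thirds(ma_fill, table_size):
    """True when used + dummy slots exceed 2/3 of the table (hypothetical helper)."""
    # Comparing 3 * fill with 2 * size avoids floating point division.
    return ma_fill * 3 > table_size * 2


for fill in (5, 6):
    print(fill, over_two_thirds(fill, 8))
# 5 False
# 6 True
```

With a table of size 8, the fifth insertion does not trigger a resize but the sixth one does.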

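insertdict() uses the lookup function lookdict_string() to find a free slot. This is the same function that is used to find a key. lookdict_string() calculates the slot index using the hash and the mask values. If it cannot find the key at slot index = hash & mask, it starts probing using the loop described above until it finds a free slot. If it meets a dummy slot along the way, it remembers the first one and returns it once the probe reaches a slot whose key is null. This gives priority to reusing previously deleted slots.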

We want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. This is what happens:

A dictionary structure is allocated with an internal table of size 8.

PyDict_SetItem: key = 'a', value = 1
- hash = hash('a') = 12416037344
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12416037344 & 7 = 0
    - slot 0 is not used so return it
  - init entry at index 0 with key, value and hash
  - ma_used = 1, ma_fill = 1
PyDict_SetItem: key = 'b', value = 2
- hash = hash('b') = 12544037731
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12544037731 & 7 = 3
    - slot 3 is not used so return it
  - init entry at index 3 with key, value and hash
  - ma_used = 2, ma_fill = 2
PyDict_SetItem: key = 'z', value = 26
- hash = hash('z') = 15616046971
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15616046971 & 7 = 3
    - slot 3 is used so probe for a different slot: 5 is free
  - init entry at index 5 with key, value and hash
  - ma_used = 3, ma_fill = 3
PyDict_SetItem: key = 'y', value = 25
- hash = hash('y') = 15488046584
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15488046584 & 7 = 0
    - slot 0 is used so probe for a different slot: 1 is free
  - init entry at index 1 with key, value and hash
  - ma_used = 4, ma_fill = 4
PyDict_SetItem: key = 'c', value = 3
- hash = hash('c') = 12672038114
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12672038114 & 7 = 2
    - slot 2 is free so return it
  - init entry at index 2 with key, value and hash
  - ma_used = 5, ma_fill = 5
PyDict_SetItem: key = 'x', value = 24
- hash = hash('x') = 15360046201
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15360046201 & 7 = 1
    - slot 1 is used so probe for a different slot: 7 is free
  - init entry at index 7 with key, value and hash
  - ma_used = 6, ma_fill = 6
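The walk-through above can be replayed with a short Python simulation. It is a simplified sketch: the helper find_slot is hypothetical, dummy slots are ignored, hashes are the 64-bit values quoted above, and the probing loop assumes perturb starts as the full hash with a shift of 5.

```python
PERTURB_SHIFT = 5  # assumption: the value used by dictobject.c


def find_slot(table, key, hash_value):
    """Return the slot holding key, or the first free slot on its probe path."""
    mask = len(table) - 1
    j = hash_value & mask
    perturb = hash_value
    while table[j & mask] is not None and table[j & mask][1] != key:
        j = 5 * j + 1 + perturb
        perturb >>= PERTURB_SHIFT
    return j & mask


# (key, hash, value) in insertion order, taken from the walk-through.
items = [
    ("a", 12416037344, 1),
    ("b", 12544037731, 2),
    ("z", 15616046971, 26),
    ("y", 15488046584, 25),
    ("c", 12672038114, 3),
    ("x", 15360046201, 24),
]

table = [None] * 8
for key, hash_value, value in items:
    slot = find_slot(table, key, hash_value)
    table[slot] = (hash_value, key, value)  # same fields as PyDictEntry

layout = [entry[1] if entry else None for entry in table]
print(layout)  # ['a', 'y', 'c', 'b', None, 'z', None, 'x']
```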

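6 slots out of 8 are now used, so we are over 2/3 of the array's capacity. dictresize() is called to allocate a larger array. This function also takes care of copying the old table entries to the new table.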

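dictresize() is called with minused = 24 in our case, which is 4 * ma_used. 2 * ma_used is used instead when the number of used slots is very large (greater than 50000). Why 4 times the number of used slots? It reduces the number of resize steps and it increases sparseness.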

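The new table size needs to be greater than 24, and it is calculated by shifting the current size 1 bit to the left until it is greater than 24. It ends up being 32: 8 -> 16 -> 32.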
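The size computation described above can be sketched in a few lines. The function name new_table_size and the start_size parameter are assumptions for this illustration; the walk-through starts shifting from the current size, 8.

```python
def new_table_size(ma_used, start_size=8):
    """Table size after a resize, following the description above (sketch only)."""
    # 4 * ma_used normally, 2 * ma_used for very large dictionaries
    minused = (2 if ma_used > 50000 else 4) * ma_used
    newsize = start_size
    while newsize <= minused:
        newsize <<= 1  # shift left by 1 bit until strictly greater than minused
    return newsize


print(new_table_size(6))  # 32
```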

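This is what happens with our table during resizing: a new table of size 32 is allocated, and the old table entries are inserted into the new table using the new mask value, which is 31.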
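A possible way to picture the reinsertion is the sketch below. It starts from the size-8 layout reached in the walk-through and reinserts every entry with the mask 31; the function name resize and the probing details (perturb starting as the full hash, shift of 5) are assumptions for this example.

```python
def resize(old_table, new_size):
    """Reinsert every active entry of old_table into a fresh table (sketch only)."""
    new_table = [None] * new_size
    mask = new_size - 1
    for entry in old_table:
        if entry is None:
            continue  # free slots are not copied
        hash_value = entry[0]
        j = hash_value & mask
        perturb = hash_value
        while new_table[j & mask] is not None:
            j = 5 * j + 1 + perturb
            perturb >>= 5
        new_table[j & mask] = entry
    return new_table


# Size-8 table from the walk-through: (hash, key, value) or None.
old_table = [
    (12416037344, "a", 1),
    (15488046584, "y", 25),
    (12672038114, "c", 3),
    (12544037731, "b", 2),
    None,
    (15616046971, "z", 26),
    None,
    (15360046201, "x", 24),
]

new_table = resize(old_table, 32)
positions = {entry[1]: i for i, entry in enumerate(new_table) if entry}
print(positions)  # {'a': 0, 'c': 2, 'b': 3, 'y': 24, 'x': 25, 'z': 27}
```

With 32 slots, none of the six keys collide any more, so every entry sits at hash & 31.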

Removing items

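PyDict_DelItem() is called to remove an entry. The hash for this key is calculated and the lookup function is called to return the entry. The slot is now a dummy slot: ma_used is decremented, while ma_fill stays the same.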

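Suppose we want to remove the key 'c' from our dictionary. Its slot in the array becomes a dummy slot, while the other entries stay where they are.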

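Note that the delete operation does not trigger an array resize, even if the number of used slots is much smaller than the total number of slots. However, when a key/value pair is added, the need for a resize is based on the number of used slots + dummy slots, and the new size is computed from the number of used slots, so the array can also shrink at that point.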
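Deletion with a dummy marker can be sketched as follows. This is an illustration only: DUMMY is a sentinel invented for the example that replaces the whole entry, the state is kept in a plain dict, and the probing loop assumes perturb starts as the full hash with a shift of 5.

```python
DUMMY = ("<dummy>",)  # hypothetical sentinel marking a deleted slot


def delete(state, key, hash_value):
    """Mark the slot holding key as dummy and update the counters (sketch only)."""
    table = state["table"]
    mask = len(table) - 1
    j = hash_value & mask
    perturb = hash_value
    while True:
        entry = table[j & mask]
        if entry is None:
            raise KeyError(key)  # reached a free slot: the key is absent
        if entry is not DUMMY and entry[1] == key:
            break
        j = 5 * j + 1 + perturb
        perturb >>= 5
    table[j & mask] = DUMMY
    state["ma_used"] -= 1  # ma_fill is unchanged: it also counts dummy slots


# Size-32 table after the resize: each key sits at hash & 31.
table = [None] * 32
for hash_value, key, value in [
    (12416037344, "a", 1), (12544037731, "b", 2), (12672038114, "c", 3),
    (15360046201, "x", 24), (15488046584, "y", 25), (15616046971, "z", 26),
]:
    table[hash_value & 31] = (hash_value, key, value)

state = {"table": table, "ma_used": 6, "ma_fill": 6}
delete(state, "c", 12672038114)
print(state["ma_used"], state["ma_fill"])  # 5 6
```

Keeping the dummy marker in place matters for lookups: a probe for another key must be able to walk past the deleted slot instead of stopping there.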
