Author: William Pugh
Paper: skiplists.pdf
Wikipedia: Skip list
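I first ran into the skip list (SkipList, 跳跃表 in Chinese) while reading the leveldb source code, and it was love at first sight: it is so simple and so efficient. An animated diagram of the structure (from Wikipedia) is shown below.
A skip list is built from multiple levels of sorted singly linked lists. Search, insertion and deletion all take O(log n) time on average. So amazing! Let's study it together by following Pugh's paper (if you spot any mistakes, please point them out, thank you!).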
We search for an element by traversing forward pointers that do not overshoot the node containing the element being searched for. When no more progress can be made at the current level of forward pointers, the search moves down to the next level. When we can make no more progress at level 1, we must be immediately in front of the node that contains the desired element (if it is in the list).
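In other words, we follow forward pointers as long as they do not overshoot the node containing the element we are looking for. When no further progress can be made at the current level, the search drops down one level. Once no progress can be made at level 1, we are immediately in front of the node holding the desired element, if it is in the list. [That is, when level 1 is exhausted, the node right after the current one is the position we want.]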
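The search rule above can be sketched in Python. This is a minimal illustration rather than Pugh's or leveldb's code: the names `Node`, `search` and `build_example` are made up for this note, and the example list is wired by hand with fixed levels so that no randomness is involved.

```python
class Node:
    """A skip-list node: a key plus one forward pointer per level."""
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] is the level-(i+1) pointer


def search(header, key):
    """Return the node holding `key`, or None if it is absent."""
    x = header
    # Start at the header's top level and work down to level 1.
    for i in reversed(range(len(header.forward))):
        # Advance while the next node at this level does not overshoot `key`.
        while x.forward[i] is not None and x.forward[i].key < key:
            x = x.forward[i]
    # x is now immediately in front of where `key` would sit at level 1.
    x = x.forward[0]
    return x if x is not None and x.key == key else None


def build_example():
    """Hand-built list 3 -> 6 -> 7 -> 9 with fixed levels (hypothetical data)."""
    header = Node(None, 3)
    n3, n6, n7, n9 = Node(3, 1), Node(6, 3), Node(7, 1), Node(9, 2)
    # Level 1 links every node; higher levels skip over shorter nodes.
    header.forward[0], n3.forward[0], n6.forward[0], n7.forward[0] = n3, n6, n7, n9
    header.forward[1], n6.forward[1] = n6, n9
    header.forward[2] = n6
    return header
```

Searching for 7 in this list jumps from the header to node 6 at level 3, finds nothing useful at level 2 (the next node is 9, which overshoots), and then steps to 7 at level 1.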
Initially, we discussed a probability distribution where half of the nodes that have level i pointers also have level i+1 pointers. To get away from magic constants, we say that a fraction p of the nodes with level i pointers also have level i+1 pointers. (for our original discussion, p = 1/2).
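That is, a node with a level-i pointer also has a level-(i+1) pointer with probability 1/2. To avoid a magic constant, this fraction is written as p; here p = 1/2.
It follows that a node has a level-1 pointer with probability $p^0$ (= 1), a level-2 pointer with probability $p^1$, and a level-L pointer with probability $p^{L-1}$.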
randomLevel()
    lvl := 1
    -- random() that returns a random value in [0...1)
    while random() < p and lvl < MaxLevel do
        lvl := lvl + 1
    return lvl
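Randomly assigning each node a level spreads the nodes across different levels, which is what speeds up search. MaxLevel is the maximum level allowed.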
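For readers who prefer runnable code, here is a direct Python transcription of the randomLevel() pseudocode. The parameter names `p`, `max_level` and `rng` are choices made for this sketch; `rng` only needs a `random()` method returning a value in [0, 1).

```python
import random


def random_level(p=0.5, max_level=16, rng=random):
    """Draw a level in 1..max_level.

    A node reaches level L or higher with probability p**(L - 1),
    as long as L does not exceed max_level.
    """
    lvl = 1
    while rng.random() < p and lvl < max_level:
        lvl += 1
    return lvl
```

Passing a seeded `random.Random` instance as `rng` makes the sequence of levels reproducible, which is convenient when debugging.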
In a skip list of 16 elements generated with p = 1/2, we might happen to have 9 elements of level 1, 3 elements of level 2, 3 elements of level 3 and 1 element of level 14 (this would be very unlikely, but it could happen). How should we handle this? If we use the standard algorithm and start our search at level 14, we will do a lot of useless work.
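That is, a 16-element skip list generated with p = 1/2 could, with very small probability, end up with 9 elements of level 1, 3 of level 2, 3 of level 3 and 1 of level 14. Starting every search at level 14 with the standard algorithm would then waste a lot of work. [So how should MaxLevel be chosen?]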
Where should we start the search? Our analysis suggests that ideally we would start a search at the level L where we expect 1/p nodes. This happens when $L = \log_{1/p} n$. Since we will be referring frequently to this formula, we will use L(n) to denote $\log_{1/p} n$.
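So where should a search start? The analysis says that ideally we start at the level L where we expect 1/p nodes, which happens when $L = \log_{1/p} n$. Because this formula comes up often, $L(n)$ denotes $\log_{1/p} n$.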
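Why? Let's work through it:
Suppose the skip list holds n elements and p = 1/2. At level L we want the expected number of elements to be 1/p (= 2);
each element appears at level L with probability $p^{L-1}$, so the expected number of elements at level L is $n \cdot p^{L-1}$;
therefore $1/p = n \cdot p^{L-1}$.
The derivation is as follows:
$1/p = n \cdot p^{L-1}$
$n = (1/p)^L$
$L = \log_{1/p} n$
We define: $L(n) = \log_{1/p} n$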
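A quick numeric check of this derivation, as a small sketch. The function names `start_level` and `expected_nodes_at_level` are invented here and do not come from the paper.

```python
import math


def start_level(n, p=0.5):
    """L(n) = log base 1/p of n, the level where about 1/p nodes are expected."""
    return math.log(n) / math.log(1.0 / p)


def expected_nodes_at_level(n, p, level):
    """Expected number of nodes that reach `level`: n * p**(level - 1)."""
    return n * p ** (level - 1)
```

For n = 16 and p = 1/2 this gives L(n) = 4, and the expected number of nodes at level 4 is 16 · (1/2)^3 = 2, which equals 1/p as required.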
Since we can safely cap levels at L(n), we should choose MaxLevel = L(N) (where N is an upper bound on the number of elements in a skip list). If p = 1/2, using MaxLevel = 16 is appropriate for data structures containing up to $2^{16}$ elements.
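Since levels can safely be capped at L(n), we should choose MaxLevel = L(N), where N is an upper bound on the number of elements in the skip list. With p = 1/2, MaxLevel = 16 suits structures holding up to $2^{16}$ (65,536) elements.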
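One way to turn MaxLevel = L(N) into code is sketched below. The helper name `max_level_for` is hypothetical, and rounding L(N) up to the next whole level when N is not an exact power of 1/p is an assumption of this sketch, not something the paper spells out. A loop is used instead of a floating-point logarithm to avoid rounding surprises.

```python
def max_level_for(n_max, p=0.5):
    """Smallest level L such that (1/p)**L >= n_max, i.e. L(N) rounded up."""
    level = 1
    capacity = 1.0 / p  # number of elements that `level` levels are sized for
    while capacity < n_max:
        capacity /= p
        level += 1
    return level
```

With p = 1/2, `max_level_for(2 ** 16)` reproduces the paper's value of 16. With p = 1/4 the same capacity needs only 8 levels.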
We analyze the search path backwards, travelling up and to the left. Although the levels of nodes in the list are known and fixed when the search is performed, we act as if the level of a node is being determined only when it is observed while backtracking the search path.
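We analyze the search path in reverse, moving up and to the left. The levels of all nodes are already known and fixed when the search runs, but for the analysis we pretend that a node's level is decided only at the moment we observe it while retracing the path.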
At any particular point in the climb, we are at a situation similar to situation a in Figure 6 – we are at the ith forward pointer of a node x and we have no knowledge about the levels of nodes to the left of x or about the level of x, other than that the level of x must be at least i. Assume that x is not the header (this is equivalent to assuming the list extends infinitely to the left). If the level of x is equal to i, then we are in situation b. If the level of x is greater than i, then we are in situation c. The probability that we are in situation c is p. Each time we are in situation c, we climb up a level. Let C(k) = the expected cost (i.e., length) of a search path that climbs up k levels in an infinite list:
C(0) = 0
C(k) = (1–p) (cost in situation b) + p (cost in situation c)
By substituting and simplifying, we get:
C(k) = (1–p) (1 + C(k)) + p (1 + C(k–1))
C(k) = 1/p + C(k–1)
C(k) = k/p
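At any point in the climb we are in something like situation a of Figure 6: we are at the i-th forward pointer of node x, and we know nothing about the levels of the nodes to the left of x, or about the level of x itself except that it is at least i. Assume x is not the header (equivalent to assuming the list extends infinitely to the left). If the level of x equals i, we are in situation b; if it is greater than i, we are in situation c, which happens with probability p. Each time we are in situation c we climb one level. C(k) is the expected cost (i.e., length) of a search path that climbs k levels in an infinite list.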
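(cost in situation b) = 1 + C(k): as the figure shows, it takes 1 step to move between x and the node on its left, and all k levels still remain to be climbed. Similarly, (cost in situation c) = 1 + C(k–1), because one step has been spent and one level gained. The rest of the derivation is straightforward and is omitted here.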
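For readers who want the skipped algebra spelled out, the recurrence can be solved as follows, using only the two costs above and C(0) = 0:

```latex
\begin{align*}
C(k) &= (1-p)\,\bigl(1 + C(k)\bigr) + p\,\bigl(1 + C(k-1)\bigr) \\
C(k) - (1-p)\,C(k) &= (1-p) + p + p\,C(k-1) \\
p\,C(k) &= 1 + p\,C(k-1) \\
C(k) &= \frac{1}{p} + C(k-1) \\
C(k) &= \frac{k}{p} \qquad \text{(unrolling $k$ times down to } C(0) = 0\text{)}
\end{align*}
```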
Our assumption that the list is infinite is a pessimistic assumption. When we bump into the header in our backwards climb, we simply climb up it, without performing any leftward movements. This gives us an upper bound of (L(n)–1)/p on the expected length of the path that climbs from level 1 to level L(n) in a list of n elements.
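Assuming the list is infinite is pessimistic. When the backward climb bumps into the header, we simply climb straight up it without any further leftward moves. This yields an upper bound of (L(n)–1)/p on the expected length of the path that climbs from level 1 to level L(n) in a list of n elements.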
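Since MaxLevel = L(n) and C(k) = k/p, the climb from level 1 to level L(n) covers k = L(n) – 1 levels, so its expected length is at most (L(n) – 1)/p.
Substituting $L(n) = \log_{1/p} n$ gives $(\log_{1/p} n - 1)/p$.
Substituting p = 1/2 gives $2 \log_2 n - 2$, which is O(log n).
The expected time complexity of search is therefore O(log n).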
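The result C(k) = k/p can also be checked empirically. The following Monte Carlo sketch simulates the backward climb in an infinite list: at each step we are in situation c with probability p (climb one level) and in situation b otherwise (move one node left). The function names, trial count and seed are arbitrary choices for this illustration.

```python
import random


def simulate_climb(k, p=0.5, rng=random):
    """Count the steps of one backward climb of k levels in an infinite list."""
    steps = 0
    climbed = 0
    while climbed < k:
        steps += 1
        if rng.random() < p:
            climbed += 1  # situation c: the node is taller, move up one level
        # otherwise situation b: move one node to the left, same level
    return steps


def average_climb_cost(k, p=0.5, trials=20000, seed=0):
    """Average step count over many simulated climbs; should approach k / p."""
    rng = random.Random(seed)
    return sum(simulate_climb(k, p, rng) for _ in range(trials)) / trials
```

For n = 65,536 and p = 1/2 the climb covers k = L(n) – 1 = 15 levels, so the average should come out close to 15 / (1/2) = 30, matching 2 log2 n – 2.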
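That gives us the time complexity of search. Insertion and deletion follow the same analysis, so we will not repeat it here. With the paper plus this short analysis, we should now have a rough idea of why the skip list is so amazing; if you are interested, Pugh's paper is well worth reading further.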
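To show why insertion and deletion cost the same as search, here is a compact sketch in which both operations begin with the same top-down walk and then splice pointers. It follows the general shape of the algorithms in Pugh's paper, but the class layout, method names and the unique-keys policy are assumptions made for this note, not leveldb's implementation.

```python
import random


class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one forward pointer per level


class SkipList:
    def __init__(self, p=0.5, max_level=16, seed=None):
        self.p = p
        self.max_level = max_level
        self.level = 1  # highest level currently in use
        self.header = Node(None, max_level)
        self.rng = random.Random(seed)

    def _random_level(self):
        lvl = 1
        while self.rng.random() < self.p and lvl < self.max_level:
            lvl += 1
        return lvl

    def _find_predecessors(self, key):
        """update[i] is the rightmost node at level i+1 whose key is < key."""
        update = [self.header] * self.max_level
        x = self.header
        for i in reversed(range(self.level)):
            while x.forward[i] is not None and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        return update

    def contains(self, key):
        x = self._find_predecessors(key)[0].forward[0]
        return x is not None and x.key == key

    def insert(self, key):
        update = self._find_predecessors(key)
        x = update[0].forward[0]
        if x is not None and x.key == key:
            return False  # already present; this sketch keeps keys unique
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        node = Node(key, lvl)
        for i in range(lvl):  # splice the new node in at each of its levels
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node
        return True

    def delete(self, key):
        update = self._find_predecessors(key)
        x = update[0].forward[0]
        if x is None or x.key != key:
            return False
        for i in range(len(x.forward)):  # unlink x at each of its levels
            update[i].forward[i] = x.forward[i]
        # Lower the list level while the top levels are empty.
        while self.level > 1 and self.header.forward[self.level - 1] is None:
            self.level -= 1
        return True

    def keys(self):
        out, x = [], self.header.forward[0]
        while x is not None:
            out.append(x.key)
            x = x.forward[0]
        return out
```

Apart from the search walk in `_find_predecessors`, each operation only touches one pointer per level of the affected node, so the search cost dominates.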
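Summary:
1) A node with a level-i pointer also has a level-(i+1) pointer with probability p; here p = 1/2.
2) Ideally, a search starts at the level L where we expect 1/p nodes.
3) Each element appears at level L with probability $p^{L-1}$, so the expected number of elements at level L is $n \cdot p^{L-1}$.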