Here is what the visualization looks like when it is first initialized.
At the beginning of the algorithm, we start with our two storage areas: the directory and the buckets.
The directory has a property associated with it called the directory depth (abbreviated as dd) and each bucket has a property associated with it called the bucket depth (abbreviated as bd).
The directory is an array of pointers to buckets. The color-coding scheme indicates which bucket each directory element points to.
Three primary operations are performed on a hash table: adding an element, removing an element, and searching for an element. Regardless of which operation is being performed, the first steps are always the same. First, the element is sent to a hash function that returns a bit string. From this bit string, we look at the most significant dd bits to obtain the directory element. Once we have the directory element, we load into memory the bucket that it references.
In this example we are performing an operation on 125. When we send the value 125 to the hash function it returns 01111101. Since the directory depth is 3, we look at the most significant 3 bits to obtain the directory element 011. Directory element 011 points to bucket B3. This is the bucket that needs to be loaded into memory.
- With the bucket loaded into memory, it is searched for the element we are looking for. If the element is found in the bucket, then it exists in the database; otherwise it does not.
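The lookup path above can be sketched in a few lines (the function and variable names are my own, buckets are modelled as plain sets, and the 8-bit hash width matches the worked example below; this is an illustration, not the original code):

```python
def ext_hash_lookup(directory, dd, hash_value, hash_bits=8):
    """Return the bucket for a hash value in extendible hashing:
    the directory index is the dd most significant bits of the bit string."""
    index = hash_value >> (hash_bits - dd)  # top dd bits
    return directory[index]                 # follow the directory pointer
```

With dd = 3 and the 8-bit string 01111101 (the hash of 125 in the example), the index is 0b011 = 3, matching the worked example.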
Adding an Element
- If the bucket is not full, then the element being added is placed in an available slot.
- If the bucket is full, then a new bucket is created and a Bucket Split is performed.
- A new bucket depth is given to both of these buckets. It equals the number of most significant bits that all of the elements, including the element being added, share in common, plus 1.
- The elements in the full bucket are redistributed between the current bucket and the newly created bucket. If an element's (bd)th most significant bit is 0, it stays where it is; if that bit is 1, it is moved to the new bucket.
- If the new bd is larger than the directory depth, then the directory depth must be raised to the bucket depth of this bucket, and the directory must be expanded to 2 raised to the power of the new directory depth.
- The pointers in the directory then need to be adjusted. In my code I adjusted the pointers in a manner similar to the one demonstrated in a video on Google Videos.
Removing an Element
- After the element is removed, its bucket may be merged with another bucket; the pointers that used to point to it are then repointed to the bucket it has merged into.
- The bucket depth of the merged bucket is decreased until the directory elements that point to it are exactly those that agree on its first bd address bits.
- If all of the bucket depths are now smaller than the directory depth, then the directory depth must be reduced to the largest bucket depth present, and the directory must be shrunk to 2 raised to the power of the new directory depth.
- The pointers in the directory then need to be adjusted. Again, I used the same readjustment method as demonstrated in the Google video.
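The insert path above can be sketched as follows. All names, the 8-bit hash width, and the bucket capacity of 2 are my own choices; for brevity this sketch raises bd by exactly one per split rather than applying the common-prefix rule, and it does not re-split if all elements land on the same side.

```python
HASH_BITS = 8   # width of the bit string returned by the hash function
CAPACITY = 2    # elements per bucket (illustrative)

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # bd
        self.items = []

def ext_hash_insert(directory, dd, h):
    """Insert hash value h; returns the (possibly new) directory and dd."""
    bucket = directory[h >> (HASH_BITS - dd)]
    if len(bucket.items) < CAPACITY:
        bucket.items.append(h)
        return directory, dd
    # Bucket Split: raise bd and create the new bucket.
    bucket.depth += 1
    new_bucket = Bucket(bucket.depth)
    if bucket.depth > dd:           # new bd exceeds dd: double the directory
        dd += 1
        directory = [directory[i >> 1] for i in range(2 ** dd)]
    # Redistribute on the (bd)th most significant bit of each element.
    items, bucket.items = bucket.items + [h], []
    for v in items:
        bit = (v >> (HASH_BITS - bucket.depth)) & 1
        (new_bucket if bit else bucket).items.append(v)
    # Repoint the directory entries whose (bd)th index bit is 1.
    for i in range(2 ** dd):
        if directory[i] is bucket and (i >> (dd - bucket.depth)) & 1:
            directory[i] = new_bucket
    return directory, dd
```

Starting from dd = 1 and inserting 00000000, 01000000, 00100000, the third insert overflows the first bucket, doubles the directory to dd = 2, and moves 01000000 into the new bucket.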
Dynamic Hashing, Method Two
Linear hashing: another common approach to dynamic hashing is linear hashing. It adjusts the number of hash buckets as data is inserted and deleted. Compared with extendible hashing, linear hashing needs no dedicated directory of bucket pointers, handles full buckets more naturally, and allows more flexible choices of when a bucket splits; consequently it is more complex to implement than the previous two methods.
To understand linear hashing, we need the idea of "round-robin splitting": the buckets take turns splitting, and when one round of splits completes, the next round begins again from the first bucket. Let Level denote the current round, starting from 0. Assume the hash table initially has N buckets (N must be a power of 2); then log2(N) is the minimum number of binary digits needed to address N buckets, denoted d0, i.e. d0 = log2(N).
As noted above, Level denotes the current round, so each round starts with N * 2^Level buckets (2^Level means 2 to the power Level). For example, in round 0, Level = 0 and the initial bucket count is N * 2^0 = N. Buckets split one at a time, in increasing order of bucket number; we use Next to point to the bucket that will split next.
The condition that triggers a split can be chosen flexibly. For example, we can set a fill factor r (r < 1) for the buckets and split when a bucket's record count reaches it, or we can split only when a bucket is full.
Note that the bucket that splits is always the one indicated by Next, regardless of whether the bucket the current value is being inserted into is full or overflowing. Overflow itself is handled by introducing overflow pages. Let's start with an illustration:
Assume the initial data distribution is as shown above, with hash function h(x), bucket count N = 4, round Level = 0, and Next at position 0; "split on overflow" is used as the trigger condition. Then d = log2(N) = 2, i.e. two binary digits suffice to number all the buckets.
A brief explanation of why 32*, 25*, and 18* sit in the first, second, and third buckets: h(x) = 32 = 100000, and its last two binary digits 00 give bucket number 00; h(y) = 25 = 11001, and its last two digits 01 give bucket 01; h(z) = 18 = 10010, and its last two digits give bucket 10.
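The bucket-number rule used here is simply "the last d binary digits of h(x)", i.e. h mod 2^d. A one-line sketch (the function name is mine), checked against the three worked values:

```python
def bucket_number(h, d):
    """Linear hashing addresses a bucket by the d least significant bits."""
    return h & ((1 << d) - 1)   # equivalent to h % 2**d

# With d = 2: 32 = 100000 -> 00, 25 = 11001 -> 01, 18 = 10010 -> 10
```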
Next, insert two new entries, h(x1) = 43 and h(x2) = 37, into the table above; the result is shown in the figure below:
Let's analyze this. When h(x1) = 43 = 101011 is inserted, d = 2, so we take the last two binary digits and the entry belongs in bucket 11. Since that bucket is full, an overflow page is added and 43* is placed on it. Because this triggers a split, the bucket at Next = 0 splits (note: not bucket 11), producing an image bucket of bucket 00. The image bucket's number is computed as N + Next = 4 + 0 = 100; all elements of the original bucket are redistributed between the two buckets, and Next advances to the next bucket.
When h(x2) = 37 = 100101 is inserted, d is still 2; the last two digits place it in bucket 01, which has free space, so it is inserted directly.
By this point the reader should have a basic grasp of how linear hashing splits. Notice that the buckets split strictly in turn, and each newly produced image bucket always sits immediately after the previously produced one.
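The split step at Next can be sketched as below. The function name is mine, buckets are plain lists, overflow pages are not modelled, and the bucket contents in the example (other than 32, 44, 25, 37, 18, and 43, which appear in the text) are made up:

```python
def split_at_next(buckets, n, level, next_bucket):
    """Split the bucket at Next: append its image bucket at position
    n * 2**level + next_bucket and redistribute its entries using one
    extra hash digit. Returns the updated (level, next_bucket)."""
    d = (n * 2 ** level).bit_length() - 1   # digit count for this round
    old, stay, move = buckets[next_bucket], [], []
    for h in old:
        (move if (h >> d) & 1 else stay).append(h)
    buckets[next_bucket] = stay
    buckets.append(move)                    # the image bucket
    next_bucket += 1
    if next_bucket == n * 2 ** level:       # this round is finished
        level, next_bucket = level + 1, 0
    return level, next_bucket
```

With N = 4, Level = 0, Next = 0 and bucket 00 holding 32* and 44*, the split keeps 32 (digit 3 is 0) in bucket 000 and moves 44 (digit 3 is 1) to the image bucket 100, as in the analysis above.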
The reader may continue by inserting h(x) = 29, 22, 66, 34, 50; the result is shown in the figure below and is not analyzed in detail here.
For lookups in linear hashing, suppose we want to find h(x) = 18, 32, 44, and that at query time the table state is N = 4, Level = 0, Next = 1, so d = 2.
(1) To find h(x) = 18 = 10010, take the last two digits, 10. Since 10 lies between Next = 1 and N = 4, the corresponding bucket has not yet split, so 10 is used directly as the bucket number and that bucket is searched.
(2) To find h(x) = 32 = 10000, take the last two digits, 00. Since 00 does not lie between Next = 1 and N = 4, that bucket has already split, so we take one more digit, giving bucket number 000, and search that bucket.
(3) To find h(x) = 44 = 101100, take the last two digits, 00. Since 00 does not lie between Next = 1 and N = 4, that bucket has already split, so we take one more digit, giving bucket number 100, and search that bucket.
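The three lookups follow one rule: take the last d digits; if that index falls below Next, the bucket has already split this round, so take d + 1 digits instead. A sketch (the function name is mine):

```python
def find_bucket(h, n, level, next_bucket):
    """Return the bucket number to search for hash value h."""
    d = (n * 2 ** level).bit_length() - 1  # digit count for this round
    idx = h & ((1 << d) - 1)               # last d binary digits
    if idx < next_bucket:                  # already split: one more digit
        idx = h & ((1 << (d + 1)) - 1)
    return idx
```

With N = 4, Level = 0, Next = 1 this reproduces all three cases: 18 maps to bucket 10, 32 to bucket 000, and 44 to bucket 100.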
Deletion in linear hashing is the inverse of insertion; once an overflow block becomes empty, it can be freed. If a deletion leaves a bucket empty, Next moves back to the previous bucket. When Next drops to 0 and the last bucket is also empty, Next moves to position N/2 - 1 and Level is decreased by 1.
Linear hashing is more flexible than extendible dynamic hashing and saves space by not needing a dedicated directory of bucket pointers; however, if the hashed data is distributed unevenly, the resulting problems can be even worse than with extendible hashing.
This concludes the introduction to the three dynamic hashing schemes.
Appendix: for multi-hash-table schemes and extendible dynamic hashing, the inside of a bucket can be organized either (1) as a chain, linking the elements one by one, in which case the 4 in the earlier example means at most 4 such elements can be chained; or (2) in blocks, each holding several elements, with the blocks linked together, in which case the 4 means at most 4 such blocks can be chained.
If reposting, please credit the original source: http://hi.baidu.com/calrincalrin/blog/item/b51b1910c7629265cb80c413.html
Reference: Advanced Database Systems and Their Applications, edited by Xie Xingsheng, Tsinghua University Press (Beijing), January 2010.
Linear Hash
1. Before introducing linear hash, the concepts of dynamic hashing and static hashing need a brief explanation:
Static hashing: the number of buckets is fixed when the hash table is initialized. To insert an element, the hash function yields the corresponding bucket number and the element is placed there. Whatever collision-resolution method is used, as more and more elements are inserted, lookups in this hash table become less and less efficient.
Dynamic hashing: the number of buckets is not fixed but grows and shrinks dynamically with the number of elements. When there are many elements, buckets are added, which solves static hashing's lookup-efficiency problem; when there are few, buckets are removed, reducing wasted space. Linear hash is one kind of dynamic hashing.
2. Linear hash implementation.
A hash table mainly supports three operations: insert, find, and erase. Below is a short analysis of each for linear_hash:
1. Find.
For the find operation, the input is a key and the goal is to return the associated satellite data; this works the same way as in any other hash table.
2. Insert.
For the insert operation, the client supplies a key and its satellite data. The operation increments the element count numElements; when the load of the linear hash (numElements/numBuckets) exceeds a threshold, the bucket count must grow dynamically: one bucket is added, and part of the elements of one particular bucket are then redistributed into this new bucket.
3. Erase.
For the erase operation, the client supplies a key and the container removes the elements that match it. numElements is decremented accordingly; when the load (numElements/numBuckets) falls below a threshold, the bucket count must shrink dynamically: one bucket is removed, and all elements in the removed bucket are returned to the original bucket (oldBucket) they came from.
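The grow-on-load behavior described for insert can be sketched as below. The class name, the threshold value, and the decision to store integer hash values directly are my own illustrative choices, not the original code; erase-side shrinking is omitted for brevity.

```python
class LinearHash:
    """Toy linear hash that grows one bucket at a time on high load."""
    MAX_LOAD = 2.0   # average elements per bucket before growing (illustrative)

    def __init__(self, n0=4):
        self.n0, self.level, self.next = n0, 0, 0
        self.buckets = [[] for _ in range(n0)]
        self.num_elements = 0

    def _index(self, h):
        d = (self.n0 * 2 ** self.level).bit_length() - 1
        idx = h & ((1 << d) - 1)
        if idx < self.next:                    # bucket already split this round
            idx = h & ((1 << (d + 1)) - 1)
        return idx

    def find(self, h):
        return h in self.buckets[self._index(h)]

    def insert(self, h):
        self.buckets[self._index(h)].append(h)
        self.num_elements += 1
        if self.num_elements / len(self.buckets) > self.MAX_LOAD:
            self._grow()                       # load too high: add one bucket

    def _grow(self):
        d = (self.n0 * 2 ** self.level).bit_length() - 1
        old = self.buckets[self.next]
        self.buckets.append([h for h in old if (h >> d) & 1])
        self.buckets[self.next] = [h for h in old if not (h >> d) & 1]
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:
            self.level, self.next = self.level + 1, 0
```

Starting with 4 buckets, the ninth insert pushes the load over 2.0, so exactly one bucket is added and the bucket at Next is split; all previously inserted values remain findable.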
3. A performance comparison against STL::map and STLEXT::hash_map.
The test results are as follows:
             insert time   find time   erase time
Linearhash           160          50           70
map                   80          40          110
hash_map             150          40         2130