1 Overview
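Open addressing and chaining are two different strategies for resolving hash collisions. When several distinct keys map to the same slot, chaining stores all of the colliding entries in a linked list attached to that slot. Open addressing instead searches other slots of the table, starting near the original one, until it finds either the matching key or a free slot; this search is called probing. Common probing strategies are linear probing, quadratic probing and double hashing.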
2 Chaining
2.1 Chaining in java.util.HashMap
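Before analyzing open addressing, here is a brief look at the chaining strategy used by most of the Java core collection classes, for comparison. In JDK versions before Java 8, java.util.HashMap has a member variable table of type Entry[]; each non-null element of it is the head of a singly linked list.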
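The main drawback of chaining is that every mapping needs an Entry object holding the key, the value and a reference to the next node in the list (HashMap.Entry has four fields: key, value, next and hash). This costs extra memory, and the overhead is proportionally large when the keys and values themselves are small. Linked lists are also unfriendly to the CPU cache.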
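To make the per-entry overhead concrete, here is a minimal sketch of a chained table. It is not the java.util.HashMap source: the class name ChainedMapSketch, the fixed capacity and the absence of resizing are all assumptions made for illustration. Each Node carries the four pieces of data mentioned above.

```java
import java.util.Objects;

/** Illustrative only: a tiny fixed-size chained hash map (hypothetical class). */
final class ChainedMapSketch<K, V> {
    /** One node per mapping: hash, key, value and the link to the next node. */
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] table;

    @SuppressWarnings("unchecked")
    ChainedMapSketch(int capacity) {
        table = (Node<K, V>[]) new Node[capacity];
    }

    /** Stores the mapping and returns the previous value, or null if there was none. */
    V put(K key, V value) {
        int hash = Objects.hashCode(key) & 0x7fffffff;
        int slot = hash % table.length;
        for (Node<K, V> n = table[slot]; n != null; n = n.next) {
            if (n.hash == hash && Objects.equals(n.key, key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        // Collision or empty slot: the new node becomes the head of the list.
        table[slot] = new Node<>(hash, key, value, table[slot]);
        return null;
    }

    /** Walks the list in the key's slot; returns null if the key is absent. */
    V get(Object key) {
        int hash = Objects.hashCode(key) & 0x7fffffff;
        for (Node<K, V> n = table[hash % table.length]; n != null; n = n.next) {
            if (n.hash == hash && Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }
}
```

With a capacity of 10, the Integer keys 15 and 25 land in the same slot and end up in one two-node list.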
3 Open Addressing
3.1 Probing
3.1.1 Linear probing
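The interval between two successive probes is a fixed value: each probe adds a constant (usually 1) to the previous slot position, for example P = (P + 1) mod SLOT_LENGTH. Its main advantages are that it is cheap to compute and friendly to the CPU cache. Its drawback is just as obvious:
Suppose key1, key2 and key3 all have the same hash code and key1 is mapped to slot(p). Placing key2 then requires examining slot(p) and slot(p+1), and placing key3 requires examining slot(p), slot(p+1) and slot(p+2). In other words, every colliding key re-examines the positions that were already examined for earlier keys, and the occupied slots build up into a contiguous run. This phenomenon is called clustering.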
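The clustering effect can be sketched in a few lines. LinearProbingSketch is a hypothetical class; it assumes non-negative int keys hashed with key mod capacity, and it assumes the table never fills up (a full table would make the loop spin forever).

```java
/** Illustrative only: linear probing with a step of 1 (hypothetical class). */
final class LinearProbingSketch {
    private final Integer[] slots;   // null marks a free slot

    LinearProbingSketch(int capacity) {
        slots = new Integer[capacity];
    }

    /**
     * Inserts the key and returns how many slots were examined.
     * Assumes key >= 0 and that at least one slot is still free.
     */
    int insert(int key) {
        int p = key % slots.length;
        int examined = 1;
        while (slots[p] != null && slots[p] != key) {
            p = (p + 1) % slots.length;   // P = (P + 1) mod SLOT_LENGTH
            examined++;
        }
        slots[p] = key;
        return examined;
    }
}
```

With a capacity of 10, inserting 15, 25 and 35 examines 1, 2 and 3 slots respectively. Inserting 16 afterwards also examines 3 slots, even though its hash differs from the others, because it lands inside the run the earlier keys created.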
3.1.2 Quadratic probing
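The interval between two successive probes grows linearly, for example P(i) = (P + c1*i + c2*i*i) mod SLOT_LENGTH, where c1 and c2 are constants and c2 is non-zero (if c2 is 0, this degenerates into linear probing). In most respects the performance of quadratic probing lies between that of linear probing and that of double hashing.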
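A small sketch of the formula may help. The class and method names are made up for illustration, and the constants in the usage note (c1 = 0, c2 = 1, a table of 11 slots) are arbitrary choices, not values taken from any particular library.

```java
/** Illustrative only: evaluates P(i) = (P + c1*i + c2*i*i) mod SLOT_LENGTH. */
final class QuadraticProbingSketch {
    /** Returns the first `count` probe positions, starting at slot `start`. */
    static int[] probeSequence(int start, int c1, int c2, int slotLength, int count) {
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            seq[i] = Math.floorMod(start + c1 * i + c2 * i * i, slotLength);
        }
        return seq;
    }
}
```

probeSequence(5, 0, 1, 11, 5) yields 5, 6, 9, 3, 10. Before the modulo is applied, the gaps between successive probes are 1, 3, 5 and 7, which is the linear growth described above.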
3.1.3 Double hashing
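The interval between two successive probes is fixed for a given key, but it is computed by a second hash function, for example P = (P + INCREMENT(key)) mod SLOT_LENGTH, where INCREMENT is the second hash function. Here is a simple example: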
H(key) = key mod 10
INCREMENT(key) = 1 + (key mod 7)
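P(15): H(15) = 5;
P(35): H(35) = 5, which collides with P(15), so a probe is needed; the position is (5 + INCREMENT(35)) mod 10 = 6
P(25): H(25) = 5, which collides with P(15), so a probe is needed; the position is (5 + INCREMENT(25)) mod 10 = 0
P(75): H(75) = 5, which collides with P(15), so a probe is needed; the position is (5 + INCREMENT(75)) mod 10 = 1
As the example shows, double hashing re-examines far fewer slots than linear probing does: each of the three colliding keys finds a free slot on its first probe.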
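The worked example can be replayed with a short sketch. It uses the same H and INCREMENT as above; the class name, the Integer[] representation of the table and the bounded retry loop are assumptions added for illustration.

```java
/** Illustrative only: the double hashing example with H(key) = key mod 10. */
final class DoubleHashingSketch {
    static final int SLOT_LENGTH = 10;

    static int h(int key) {
        return key % SLOT_LENGTH;
    }

    static int increment(int key) {
        return 1 + (key % 7);
    }

    /**
     * Stores the key in the first free slot on its probe path and returns
     * that slot. Assumes key >= 0; null marks a free slot.
     */
    static int insert(Integer[] slots, int key) {
        int p = h(key);
        int step = increment(key);
        for (int attempts = 0; attempts < SLOT_LENGTH; attempts++) {
            if (slots[p] == null) {
                slots[p] = key;
                return p;
            }
            p = (p + step) % SLOT_LENGTH;   // P = (P + INCREMENT(key)) mod SLOT_LENGTH
        }
        throw new IllegalStateException("no free slot found for key " + key);
    }
}
```

Inserting 15, 35, 25 and 75 in that order into an empty 10-slot table places them in slots 5, 6, 0 and 1, matching the values computed by hand. Note that a table length of 10 is convenient for mental arithmetic but not ideal in practice: a step such as 5 shares a factor with 10 and can reach only two slots, which is why the loop above is bounded.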
3.2 Load Factor
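The performance of a hash table based on open addressing is very sensitive to its load factor. Once the load factor exceeds about 0.7, performance degrades noticeably (the default load factor of Trove maps and sets is 0.5). The number of probes caused by hash collisions is proportional to loadFactor / (1 - loadFactor). As loadFactor approaches 1, very few free slots remain in the table, and the number of probes can become very large.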
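To get a feel for how quickly that ratio grows, the sketch below simply evaluates loadFactor / (1 - loadFactor) for a few values. The class and method names are hypothetical, and the result is only the proportionality factor quoted above, not a measured probe count.

```java
/** Illustrative only: evaluates loadFactor / (1 - loadFactor). */
final class LoadFactorSketch {
    /** Assumes 0 <= loadFactor < 1. */
    static double probeFactor(double loadFactor) {
        return loadFactor / (1.0 - loadFactor);
    }

    public static void main(String[] args) {
        double[] samples = {0.5, 0.7, 0.9, 0.99};
        for (double lf : samples) {
            System.out.printf("loadFactor=%.2f -> factor=%.2f%n", lf, probeFactor(lf));
        }
    }
}
```

The factor is 1 at a load factor of 0.5, a little over 2.3 at 0.7, about 9 at 0.9 and about 99 at 0.99, which is why a default as low as 0.5 is a reasonable trade of memory for speed.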
3.3 Open addressing in gnu.trove.THashMap
GNU Trove (http://trove4j.sourceforge.net/) is a Java collections library. In some scenarios the Trove collections perform better and use less memory. The following Trove features are related to open addressing:
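Through the TObjectHashingStrategy interface, Trove supports custom hash algorithms (for example, when you do not want the default hash algorithm of String or of arrays). Unlike java.util.HashMap, gnu.trove.THashMap has no member variable like Entry[] table; it stores keys and values directly in two arrays, Object[] _set and V[] _values. Logically, each element of Object[] _set is in one of three states, which appear in the code below as FREE, FULL and REMOVED.
The transitions between these three states are as follows: [Figure: state transition diagram]
Here is a simple example of the state transitions (:= denotes assignment, H(key) = key mod 11): [Figure: state transition example]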
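As a rough stand-in for a worked example, the following sketch models the three slot states with an 11-slot table and H(key) = key mod 11. It is not Trove code: the class SlotStateSketch and its methods are hypothetical, and the transitions it shows (FREE to FULL on insert, FULL to REMOVED on removal, REMOVED to FULL on reuse) are my reading of the Trove excerpts quoted in this article. The probe step reuses the formula 1 + (hash % (length - 2)) from those excerpts.

```java
import java.util.Arrays;

/** Illustrative only: FREE / FULL / REMOVED slot states (hypothetical class). */
final class SlotStateSketch {
    static final Object FREE = new Object();      // never used
    static final Object REMOVED = new Object();   // used once, then deleted

    final Object[] set = new Object[11];          // any other value means FULL

    SlotStateSketch() {
        Arrays.fill(set, FREE);
    }

    private int probeStep(int hash) {
        return 1 + (hash % (set.length - 2));
    }

    /** Returns the slot holding the key, or -1 if it is absent. */
    int indexOf(int key) {
        int hash = key & 0x7fffffff;
        int index = hash % set.length;
        int step = probeStep(hash);
        for (int i = 0; i < set.length; i++) {
            Object cur = set[index];
            if (cur == FREE) return -1;           // the probe path ends here
            if (cur != REMOVED && cur.equals(key)) return index;
            index -= step;                        // REMOVED or other key: go on
            if (index < 0) index += set.length;
        }
        return -1;
    }

    /** FREE -> FULL or REMOVED -> FULL. Returns the slot used. */
    int add(int key) {
        int existing = indexOf(key);
        if (existing >= 0) return existing;
        int hash = key & 0x7fffffff;
        int index = hash % set.length;
        int step = probeStep(hash);
        for (int i = 0; i < set.length; i++) {
            if (set[index] == FREE || set[index] == REMOVED) {
                set[index] = key;
                return index;
            }
            index -= step;
            if (index < 0) index += set.length;
        }
        throw new IllegalStateException("table is full");
    }

    /** FULL -> REMOVED. Returns false if the key was not present. */
    boolean remove(int key) {
        int index = indexOf(key);
        if (index < 0) return false;
        set[index] = REMOVED;
        return true;
    }
}
```

Adding 5 and then 16 puts them in slots 5 and 8, because 16 collides with 5 and steps back by 8. After remove(5), slot 5 is REMOVED rather than FREE, so indexOf(16) still walks past it and finds slot 8; marking it FREE would have cut the probe path and made 16 unreachable. A later add(27) reuses slot 5.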
The following code fragment is related to the get() method:
public V get(Object key) {
    int index = index((K) key);
    return index < 0 ? null : _values[index];
}

protected int index(T obj) {
    final TObjectHashingStrategy hashing_strategy = _hashingStrategy;
    final Object[] set = _set;
    final int length = set.length;
    final int hash = hashing_strategy.computeHashCode(obj) & 0x7fffffff;
    int index = hash % length;
    Object cur = set[index];

    if ( cur == FREE ) return -1;

    // NOTE: here it has to be REMOVED or FULL (some user-given value)
    if ( cur == REMOVED || ! hashing_strategy.equals((T) cur, obj)) {
        // see Knuth, p. 529
        final int probe = 1 + (hash % (length - 2));

        do {
            index -= probe;
            if (index < 0) {
                index += length;
            }
            cur = set[index];
        } while (cur != FREE
                 && (cur == REMOVED || ! _hashingStrategy.equals((T) cur, obj)));
    }

    return cur == FREE ? -1 : index;
}
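As the code above shows, get() proceeds as follows: it uses the key's hash value to locate the corresponding element of _set and checks whether there is a hash collision.
If there is no collision, the element is in one of two states. If it is FREE, the key is absent: index() returns -1 and get() returns null. If it is FULL and holds an equal key, index() returns that slot and get() returns _values[index]. Otherwise (the element is REMOVED, or FULL with a different key) index() keeps stepping back by probe until it reaches either a FREE slot or an equal key.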
The following code fragment is related to the put() method:
public V put(K key, V value) {
    int index = insertionIndex(key);
    return doPut(key, value, index);
}

private V doPut(K key, V value, int index) {
    V previous = null;
    Object oldKey;
    boolean isNewMapping = true;
    if (index < 0) {
        index = -index -1;
        previous = _values[index];
        isNewMapping = false;
    }
    oldKey = _set[index];
    _set[index] = key;
    _values[index] = value;
    if (isNewMapping) {
        postInsertHook(oldKey == FREE);
    }
    return previous;
}

protected int insertionIndex(T obj) {
    final TObjectHashingStrategy hashing_strategy = _hashingStrategy;
    final Object[] set = _set;
    final int length = set.length;
    final int hash = hashing_strategy.computeHashCode(obj) & 0x7fffffff;
    int index = hash % length;
    Object cur = set[index];

    if (cur == FREE) {
        return index;       // empty, all done
    } else if (cur != REMOVED && hashing_strategy.equals((T) cur, obj)) {
        return -index -1;   // already stored
    } else {                // already FULL or REMOVED, must probe
        // compute the double hash
        final int probe = 1 + (hash % (length - 2));

        // if the slot we landed on is FULL (but not removed), probe
        // until we find an empty slot, a REMOVED slot, or an element
        // equal to the one we are trying to insert.
        // finding an empty slot means that the value is not present
        // and that we should use that slot as the insertion point;
        // finding a REMOVED slot means that we need to keep searching,
        // however we want to remember the offset of that REMOVED slot
        // so we can reuse it in case a "new" insertion (i.e. not an update)
        // is possible.
        // finding a matching value means that we've found that our desired
        // key is already in the table
        if (cur != REMOVED) {
            // starting at the natural offset, probe until we find an
            // offset that isn't full.
            do {
                index -= probe;
                if (index < 0) {
                    index += length;
                }
                cur = set[index];
            } while (cur != FREE
                     && cur != REMOVED
                     && ! hashing_strategy.equals((T) cur, obj));
        }

        // if the index we found was removed: continue probing until we
        // locate a free location or an element which equal()s the
        // one we have.
        if (cur == REMOVED) {
            int firstRemoved = index;
            while (cur != FREE
                   && (cur == REMOVED || ! hashing_strategy.equals((T) cur, obj))) {
                index -= probe;
                if (index < 0) {
                    index += length;
                }
                cur = set[index];
            }
            // NOTE: cur cannot == REMOVED in this block
            return (cur != FREE) ? -index -1 : firstRemoved;
        }
        // if it's full, the key is already stored
        // NOTE: cur cannot equal REMOVED here (would have returned already, see above)
        return (cur != FREE) ? -index -1 : index;
    }
}
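As the code above shows, THashMap uses double hashing. The increment is computed by final int probe = 1 + (hash % (length - 2)); If insertionIndex() returns a non-negative value, that value is the slot available for the new key. If it returns a negative value, the key is already stored, and (-index - 1), where index is the returned value, is the slot it occupies.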
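Two details of insertionIndex() are easy to check in isolation: the -index - 1 encoding of "already stored", and the descending probe path produced by the increment formula. The sketch below is not Trove code; InsertionIndexSketch and its helper names are hypothetical, and the table length of 11 in the usage note is an arbitrary prime chosen for illustration.

```java
/** Illustrative only: return-value encoding and probe path (hypothetical class). */
final class InsertionIndexSketch {
    /** Encodes "key already stored at this slot" as a negative number. */
    static int encodeExisting(int index) {
        return -index - 1;
    }

    /** Decodes either kind of return value back to a slot position. */
    static int slotOf(int returned) {
        return returned < 0 ? -returned - 1 : returned;
    }

    /** Lists the slots visited for a hash code, in probing order. */
    static int[] probeSequence(int hashCode, int length) {
        int hash = hashCode & 0x7fffffff;
        int probe = 1 + (hash % (length - 2));
        int index = hash % length;
        int[] seq = new int[length];
        for (int i = 0; i < length; i++) {
            seq[i] = index;
            index -= probe;              // step backwards, wrapping around
            if (index < 0) {
                index += length;
            }
        }
        return seq;
    }
}
```

The encoding subtracts 1 so that slot 0 can also be reported as "already stored" (it becomes -1), which is why the test for a free slot has to be "non-negative" rather than "positive". For hash code 16 and a length of 11, the probe step is 8 and the path is 5, 8, 0, 3, 6, 9, 1, 4, 7, 10, 2: every slot is visited exactly once, which holds whenever the step and the table length share no common factor.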
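put() proceeds as follows: it uses the key's hash value to locate the corresponding element of _set and checks whether there is a hash collision.
If there is no collision, the element is in one of two states. If it is FREE, insertionIndex() returns that slot and doPut() stores a new mapping there. If it is FULL and holds an equal key, insertionIndex() returns -index - 1, and doPut() overwrites the value and returns the previous one. Otherwise (the element is REMOVED, or FULL with a different key) insertionIndex() probes further, remembering the first REMOVED slot it passes so that a new mapping can reuse it.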