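HashMap is one of the collection classes Java developers use most. Today Anan walks through the JDK 7 source code to summarize HashMap and take stock of the key ideas in its design. Before reading the source, let's meet two renowned authors of the HashMap source code.
Josh Bloch
Founder of the Java Collections Framework. Joshua Bloch led the design and implementation of many Java platform features, including the JDK 5.0 language enhancements and the award-winning Java Collections Framework. He left Sun in June 2004 to become Chief Java Architect at Google. He also won the Jolt Award for his book Effective Java.
Doug Lea
Professor of computer science at the State University of New York at Oswego, where he specializes in concurrent programming and the design of concurrent data structures. He served on the Executive Committee of the JCP (Java Community Process) and chaired Java Specification Request 166 (JSR 166), which added concurrency utilities to Java (see Java concurrency). He designed the util.concurrent package.
Impressive, aren't they? Even more famous, perhaps, is the father of Java, James Gosling. Every Java engineer has heard his name, so he needs no introduction here.
Below is the beginning of the HashMap source code in JDK 7. Four authors contributed to this class; the first two have been introduced above. The important member fields should be familiar, and if they are not, the comments explain them, so they are not covered in detail here.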
* This class is a member of the
* Java Collections Framework.
*
* @param <K> the type of keys maintained by this map
* @param <V> the type of mapped values
*
* @author Doug Lea
* @author Josh Bloch
* @author Arthur van Hoff
* @author Neal Gafter
* @see Object#hashCode()
* @see Collection
* @see Map
* @see TreeMap
* @see Hashtable
* @since 1.2
*/
public class HashMap<K,V>
extends AbstractMap<K,V>
implements Map<K,V>, Cloneable, Serializable
{
/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 16;
/**
* The maximum capacity, used if a higher value is implicitly specified
* by either of the constructors with arguments.
* MUST be a power of two <= 1<<30.
*/
static final int MAXIMUM_CAPACITY = 1 << 30;
/**
* The load factor used when none specified in constructor.
*/
static final float DEFAULT_LOAD_FACTOR = 0.75f;
/**
* The table, resized as necessary. Length MUST Always be a power of two.
*/
transient Entry<K,V>[] table;
/**
* The number of key-value mappings contained in this map.
*/
transient int size;
/**
* The next size value at which to resize (capacity * load factor).
* @serial
*/
int threshold;
/**
* The load factor for the hash table.
*
* @serial
*/
final float loadFactor;
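Now let's take stock of what makes HashMap clever.
To find the value for a key quickly in Java, you need an array, with entries stored by hashing (located and looked up by hash value). A hashed structure is prone to collisions, though: what happens when two keys hash to the same slot? HashMap solves this by adding linked lists on top of the array. Each array slot holds the head of a singly linked list, and entries that collide are chained in that list. This is hashing with separate chaining. Picture the overall structure as an array in which every slot heads its own list: the array gives fast access, and the lists resolve hash collisions.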
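The layout described above can be sketched in a few lines. This is a minimal illustration, not the JDK code: the class name TinyChainedMap, its Node type, and the fixed table of 16 buckets are assumptions made for the example, and it leaves out resizing and the extra hash step covered later in the article.

```java
import java.util.Objects;

// Hypothetical teaching class; not the JDK implementation.
final class TinyChainedMap<K, V> {
    // One node of a bucket's singly linked list.
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed number of buckets (a power of two); this sketch never resizes.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];
    private int size;

    // The low bits of the hash code select the bucket.
    private int indexFor(Object key) {
        return Objects.hashCode(key) & (table.length - 1);
    }

    // Replaces the value if the key exists, otherwise inserts at the list head.
    V put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[i] = new Node<>(key, value, table[i]);
        size++;
        return null;
    }

    // Walks the bucket's list until an equal key is found.
    V get(Object key) {
        for (Node<K, V> e = table[indexFor(key)]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        TinyChainedMap<Integer, String> map = new TinyChainedMap<>();
        map.put(1, "one");
        map.put(17, "seventeen"); // 1 and 17 collide: both select bucket 1
        System.out.println(map.get(1) + " " + map.get(17) + " " + map.size());
    }
}
```

The keys 1 and 17 land in the same bucket, yet both remain retrievable because the bucket holds a list rather than a single entry.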
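A word of caution from Anan at this point. This design still has a weakness. If a programmer overrides hashCode() on the key class badly, so that many keys share the same hash value, a very long singly linked list builds up at one array position. Lookups then slow down because every entry in that list has to be traversed, and in the extreme case the HashMap degenerates into a single linked list. JDK 8 improved on this: once a bucket's list grows past a small threshold (TREEIFY_THRESHOLD, 8) in a table of at least 64 slots, the list is converted into a red-black tree. The JDK 8 source is more complex, so this article does not follow it, but the general principle is much the same.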
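To make the problem concrete, here is a small sketch with a deliberately poor key class. BadKey and the helper methods are hypothetical names for this example, and the bucket index is computed as hashCode() & (length - 1), which simplifies what HashMap actually does. The map still returns correct results; what is lost is speed, because every key selects the same bucket.

```java
import java.util.HashMap;
import java.util.Map;

final class BadHashDemo {
    // Hypothetical key class whose hashCode() is the same for every instance.
    static final class BadKey {
        final int id;

        BadKey(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 42; // legal, but every key collides
        }
    }

    // The map stays correct: n distinct keys give n mappings.
    static int sizeAfterInserting(int n) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey(i), i);
        }
        return map.size();
    }

    // Checks whether all n keys select the same bucket in a table of the given length.
    static boolean allInSameBucket(int n, int length) {
        int first = new BadKey(0).hashCode() & (length - 1);
        for (int i = 1; i < n; i++) {
            if ((new BadKey(i).hashCode() & (length - 1)) != first) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterInserting(1000));
        System.out.println(allInSameBucket(1000, 16));
    }
}
```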
Let's look at HashMap's constructors:
/**
* Constructs an empty HashMap with the specified initial
* capacity and load factor.
*
* @param initialCapacity the initial capacity
* @param loadFactor the load factor
* @throws IllegalArgumentException if the initial capacity is negative
* or the load factor is nonpositive
*/
public HashMap(int initialCapacity, float loadFactor) {
if (initialCapacity < 0)
throw new IllegalArgumentException("Illegal initial capacity: " +
initialCapacity);
if (initialCapacity > MAXIMUM_CAPACITY)
initialCapacity = MAXIMUM_CAPACITY;
if (loadFactor <= 0 || Float.isNaN(loadFactor))
throw new IllegalArgumentException("Illegal load factor: " +
loadFactor);
// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
capacity <<= 1;
this.loadFactor = loadFactor;
threshold = (int)Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
table = new Entry[capacity];
useAltHashing = sun.misc.VM.isBooted() &&
(capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD);
init();
}
/**
* Constructs an empty HashMap with the specified initial
* capacity and the default load factor (0.75).
*
* @param initialCapacity the initial capacity.
* @throws IllegalArgumentException if the initial capacity is negative.
*/
public HashMap(int initialCapacity) {
this(initialCapacity, DEFAULT_LOAD_FACTOR);
}
/**
* Constructs an empty HashMap with the default initial capacity
* (16) and the default load factor (0.75).
*/
public HashMap() {
this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
}
/**
* Constructs a new HashMap with the same mappings as the
* specified Map. The HashMap is created with
* default load factor (0.75) and an initial capacity sufficient to
* hold the mappings in the specified Map.
*
* @param m the map whose mappings are to be placed in this map
* @throws NullPointerException if the specified map is null
*/
public HashMap(Map<? extends K, ? extends V> m) {
this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1,
DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
putAllForCreate(m);
}
// internal utilities
/**
* Initialization hook for subclasses. This method is called
* in all constructors and pseudo-constructors (clone, readObject)
* after HashMap has been initialized but before any entries have
* been inserted. (In the absence of this method, readObject would
* require explicit knowledge of subclasses.)
*/
void init() {
}
As you can see, whichever constructor is called, it ends up delegating to the following one:
public HashMap(int initialCapacity, float loadFactor) {
So let's focus on this ultimate constructor:
/**
* Constructs an empty HashMap with the specified initial
* capacity and load factor.
*
* @param initialCapacity the initial capacity
* @param loadFactor the load factor
* @throws IllegalArgumentException if the initial capacity is negative
* or the load factor is nonpositive
*/
public HashMap(int initialCapacity, float loadFactor) {
// The three if-checks below validate the capacity and the load factor.
if (initialCapacity < 0)
throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
if (initialCapacity > MAXIMUM_CAPACITY)
initialCapacity = MAXIMUM_CAPACITY;
if (loadFactor <= 0 || Float.isNaN(loadFactor))
throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
// Note: the while loop below is one of the highlights Anan covers today. Ignore it for now; it is explained later.
// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
capacity <<= 1;
this.loadFactor = loadFactor;
// threshold is the entry count at which the HashMap resizes. Whether to resize is checked when an entry is added.
threshold = (int)Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
// Key point: an array of length capacity is created here.
table = new Entry[capacity];
// Whether to use alternative hashing, which treats String keys specially to reduce hash collisions.
useAltHashing = sun.misc.VM.isBooted() &&(capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD);
init();
}
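Note that the constructor creates an array of type Entry. This array stores the entries, and an entry's position in the array is computed from the hash of its key (explained later). What happens when the hashes of two keys collide? As mentioned above, each array slot is designed as a singly linked list. So let's look at the structure of Entry and check whether it really is a singly linked list node.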
static class Entry<K,V> implements Map.Entry<K,V> {
final K key;
V value;
Entry<K,V> next;
int hash;
/**
* Creates new entry.
*/
Entry(int h, K k, V v, Entry<K,V> n) {
value = v;
next = n;
key = k;
hash = h;
}
public final K getKey() {
return key;
}
public final V getValue() {
return value;
}
public final V setValue(V newValue) {
V oldValue = value;
value = newValue;
return oldValue;
}
public final boolean equals(Object o) {
if (!(o instanceof Map.Entry))
return false;
Map.Entry e = (Map.Entry)o;
Object k1 = getKey();
Object k2 = e.getKey();
if (k1 == k2 || (k1 != null && k1.equals(k2))) {
Object v1 = getValue();
Object v2 = e.getValue();
if (v1 == v2 || (v1 != null && v1.equals(v2)))
return true;
}
return false;
}
public final int hashCode() {
return (key==null ? 0 : key.hashCode()) ^
(value==null ? 0 : value.hashCode());
}
public final String toString() {
return getKey() + "=" + getValue();
}
/**
* This method is invoked whenever the value in an entry is
* overwritten by an invocation of put(k,v) for a key k that's already
* in the HashMap.
*/
void recordAccess(HashMap<K,V> m) {
}
/**
* This method is invoked whenever the entry is
* removed from the table.
*/
void recordRemoval(HashMap<K,V> m) {
}
}
As you can see, the next field of type Entry holds the following node. Now let's look at HashMap's put method for further confirmation.
/**
* Associates the specified value with the specified key in this map.
* If the map previously contained a mapping for the key, the old
* value is replaced.
*
* @param key key with which the specified value is to be associated
* @param value value to be associated with the specified key
* @return the previous value associated with key, or
* null if there was no mapping for key.
* (A null return can also indicate that the map
* previously associated null with key.)
*/
public V put(K key, V value) {
// A null key is handled separately and stored at array index 0; see putForNullKey below.
if (key == null)
return putForNullKey(value);
// Recompute the hash. This is another highlight Anan covers later; ignore it for now.
int hash = hash(key);
// Map the recomputed hash to array index i. Ignore this for now; it is explained in detail later.
int i = indexFor(hash, table.length);
// Walk the singly linked list looking for an entry with an equal key; if one is found, replace its value.
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
modCount++;
// No entry with an equal key was found, so add a new entry at the head of the list. The source of addEntry is shown below.
addEntry(hash, key, value, i);
return null;
}
/**
* Offloaded version of put for null keys
*/
private V putForNullKey(V value) {
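// This part is easy to follow: scan the list at index 0 for an entry whose key is null and replace its value; if there is none, insert a new entry there. The null check is needed because entries with non-null keys can also land at index 0, while an entry with a null key is always stored at index 0.
for (Entry<K,V> e = table[0]; e != null; e = e.next) {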
if (e.key == null) {
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
modCount++;
addEntry(0, null, value, 0);
return null;
}
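The null-key handling in putForNullKey can be observed through the public API. The sketch below is only a usage illustration; the class name NullKeyDemo and the describe() helper are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class NullKeyDemo {
    // Puts two values under the null key and reports what put and get return.
    static String describe() {
        Map<String, String> map = new HashMap<>();
        String first = map.put(null, "first");   // no previous mapping, so null is returned
        String second = map.put(null, "second"); // the old value is replaced and returned
        return first + "," + second + "," + map.get(null) + "," + map.size();
    }

    public static void main(String[] args) {
        System.out.println(describe()); // prints "null,first,second,1"
    }
}
```

Only one mapping exists for the null key at any time: the second put overwrites the first instead of adding another entry.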
As you can see, put accepts an entry whose key is null and stores it at array index 0. Next, let's look at the source of addEntry.
/**
* Adds a new entry with the specified key, value and hash code to
* the specified bucket. It is the responsibility of this
* method to resize the table if appropriate.
*
* Subclass overrides this to alter the behavior of put method.
*/
void addEntry(int hash, K key, V value, int bucketIndex) {
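// When the entry count has reached the threshold and the target bucket is already occupied, the table is resized to twice its current length.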
if ((size >= threshold) && (null != table[bucketIndex])) {
resize(2 * table.length);
hash = (null != key) ? hash(key) : 0;
bucketIndex = indexFor(hash, table.length);
}
// Create the new entry.
createEntry(hash, key, value, bucketIndex);
}
/**
* Like addEntry except that this version is used when creating entries
* as part of Map construction or "pseudo-construction" (cloning,
* deserialization). This version needn't worry about resizing the table.
*
* Subclass overrides this to alter the behavior of HashMap(Map),
* clone, and readObject.
*/
void createEntry(int hash, K key, V value, int bucketIndex) {
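// The classic head insertion for a singly linked list: remember the current head, then make the new entry the head with the old head as its next.
Entry<K,V> e = table[bucketIndex];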
table[bucketIndex] = new Entry<>(hash, key, value, e);
size++;
}
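When an entry is added, the size is checked, and the table grows once the threshold has been reached and the target bucket is occupied. Resizing allocates a new array twice as long and transfers the existing entries into it, recomputing each entry's index for the new length. At this point the overall structure of HashMap should be clear. Interested readers can study the lookup code; Anan does not comment on it here.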
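The growth rule just described can be sketched with three small helpers. They mirror expressions from the listings above (the threshold formula in the constructor, the condition at the top of addEntry, and indexFor), but the class ResizeSketch and the method names thresholdFor and shouldResize are assumptions made for this illustration.

```java
final class ResizeSketch {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Same expression as the threshold assignment in the constructor.
    static int thresholdFor(int capacity, float loadFactor) {
        return (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }

    // Same condition as the if statement at the top of addEntry.
    static boolean shouldResize(int size, int threshold, boolean bucketOccupied) {
        return size >= threshold && bucketOccupied;
    }

    // Same body as HashMap.indexFor.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        System.out.println(thresholdFor(16, 0.75f));    // 12
        System.out.println(thresholdFor(32, 0.75f));    // 24
        System.out.println(shouldResize(12, 12, true)); // true
        // The same hash can select a different bucket once the table has doubled.
        System.out.println(indexFor(17, 16) + " -> " + indexFor(17, 32)); // 1 -> 17
    }
}
```

With the defaults (capacity 16, load factor 0.75) the threshold is 12, and after doubling to 32 it becomes 24. The last line shows why entries must be re-indexed during a resize rather than copied slot for slot.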
The source code for looking up an entry in HashMap:
/**
* Returns the value to which the specified key is mapped,
* or {@code null} if this map contains no mapping for the key.
*
* More formally, if this map contains a mapping from a key
* {@code k} to a value {@code v} such that {@code (key==null ? k==null :
* key.equals(k))}, then this method returns {@code v}; otherwise
* it returns {@code null}. (There can be at most one such mapping.)
*
*
* <p>A return value of {@code null} does not necessarily
* indicate that the map contains no mapping for the key; it's also
* possible that the map explicitly maps the key to {@code null}.
* The {@link #containsKey containsKey} operation may be used to
* distinguish these two cases.
*
* @see #put(Object, Object)
*/
public V get(Object key) {
if (key == null)
return getForNullKey();
Entry<K,V> entry = getEntry(key);
return null == entry ? null : entry.getValue();
}
/**
* Offloaded version of get() to look up null keys. Null keys map
* to index 0. This null case is split out into separate methods
* for the sake of performance in the two most commonly used
* operations (get and put), but incorporated with conditionals in
* others.
*/
private V getForNullKey() {
for (Entry<K,V> e = table[0]; e != null; e = e.next) {
if (e.key == null)
return e.value;
}
return null;
}
/**
* Returns true if this map contains a mapping for the
* specified key.
*
* @param key The key whose presence in this map is to be tested
* @return true if this map contains a mapping for the specified
* key.
*/
public boolean containsKey(Object key) {
return getEntry(key) != null;
}
/**
* Returns the entry associated with the specified key in the
* HashMap. Returns null if the HashMap contains no mapping
* for the key.
*/
final Entry<K,V> getEntry(Object key) {
int hash = (key == null) ? 0 : hash(key);
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k))))
return e;
}
return null;
}
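The Javadoc of get above points out that a null return value is ambiguous and that containsKey tells the two cases apart. A short usage sketch shows the difference; the class name NullValueDemo and the describe() helper are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class NullValueDemo {
    // Compares a key that is mapped to null with a key that is absent.
    static String describe() {
        Map<String, String> map = new HashMap<>();
        map.put("present", null); // the key exists, but its value is null
        return map.get("present") + "," + map.containsKey("present") + ","
                + map.get("absent") + "," + map.containsKey("absent");
    }

    public static void main(String[] args) {
        System.out.println(describe()); // prints "null,true,null,false"
    }
}
```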
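By now you have an overall picture of HashMap's structure: an array of linked lists (separate chaining). Next, Anan looks more closely at the beauty in HashMap's details.
In the constructor, the requested initialCapacity is recomputed instead of being used directly as the array length. Recall the constructor:
// Note: the while loop below is the second highlight Anan covers today.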
// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
capacity <<= 1;
this.loadFactor = loadFactor;
// threshold is the entry count at which the HashMap resizes. Whether to resize is checked when an entry is added.
threshold = (int)Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
// Key point: the array is created with the newly computed capacity.
table = new Entry[capacity];
Look closely at this while loop:
int capacity = 1;
while (capacity < initialCapacity)
capacity <<= 1;
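capacity starts at 1. While it is smaller than the requested capacity it is shifted left by one bit, which doubles it. The loop therefore ends with capacity equal to the smallest power of two that is greater than or equal to initialCapacity. Powers of two have a very regular binary form: a single 1 bit, with every other bit 0. For example:
2 in binary: 0000 0000 0000 0000 0000 0000 0000 0010
4 in binary: 0000 0000 0000 0000 0000 0000 0000 0100
8 in binary: 0000 0000 0000 0000 0000 0000 0000 1000
16 in binary: 0000 0000 0000 0000 0000 0000 0001 0000
The computed capacity is the array length, length. In binary it is a single 1 bit followed only by 0s. Even better, length-1 has all of those lower bits set to 1 and every higher bit 0. For example:
1 in binary: 0000 0000 0000 0000 0000 0000 0000 0001
3 in binary: 0000 0000 0000 0000 0000 0000 0000 0011
7 in binary: 0000 0000 0000 0000 0000 0000 0000 0111
15 in binary: 0000 0000 0000 0000 0000 0000 0000 1111
This is why the capacity is recomputed: it guarantees that the capacity is always a power of two, which the array index calculation described later relies on.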
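The rounding loop is easy to try in isolation. The sketch below copies the loop from the constructor into a standalone method; the class PowerOfTwoDemo and the helper names roundUpToPowerOfTwo and bits are assumptions made for this illustration.

```java
final class PowerOfTwoDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Same checks and loop as the constructor: smallest power of two >= initialCapacity.
    static int roundUpToPowerOfTwo(int initialCapacity) {
        if (initialCapacity < 0) {
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        }
        if (initialCapacity > MAXIMUM_CAPACITY) {
            initialCapacity = MAXIMUM_CAPACITY;
        }
        int capacity = 1;
        while (capacity < initialCapacity) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Formats an int as 32 binary digits with leading zeros.
    static String bits(int value) {
        return String.format("%32s", Integer.toBinaryString(value)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int[] requests = {0, 1, 10, 16, 17, 1000};
        for (int requested : requests) {
            int length = roundUpToPowerOfTwo(requested);
            System.out.println(requested + " -> " + length
                    + "  length: " + bits(length)
                    + "  length-1: " + bits(length - 1));
        }
    }
}
```

A request of 10 yields 16, a request of 16 stays 16, and a request of 17 yields 32, so the result is not simply the nearest power of two.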
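In the put method of the HashMap source, the key's hash value is recomputed (the source is above) to produce a new value, hash, and indexFor then turns it into the entry's index in the array:
// Recompute the hash; this is the third highlight Anan covers.
int hash = hash(key);
// Map the recomputed hash to array index i.
int i = indexFor(hash, table.length);
Why compute the index this way? There is a more direct way to turn a key's hash value into an array index: the modulo operation. The key's hashCode() returns an int, which may well exceed the array bounds, so why not just take the remainder? That is, index = key.hashCode() % table.length (after clearing the sign, since hashCode() can be negative), and the result always falls within the array bounds. But HashMap does not compute it that way. Let's look at how the source implements hash(key) and indexFor: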
/**
* Retrieve object hash code and applies a supplemental hash function to the
* result hash, which defends against poor quality hash functions. This is
* critical because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
final int hash(Object k) {
int h = 0;
if (useAltHashing) {
if (k instanceof String) {
return sun.misc.Hashing.stringHash32((String) k);
}
h = hashSeed;
}
h ^= k.hashCode();
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
/**
* Returns index for hash code h.
*/
static int indexFor(int h, int length) {
return h & (length-1);
}
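As you can see, hash(key) computes a new hash value through a series of bit operations. They all serve one purpose: reducing collisions between the hash values of keys.
For example, suppose that after h ^= k.hashCode() the value of h is 0x7FFFFFFF, whose binary form is all 1s except for the sign bit. The bit operations above then proceed as follows:
h: 0111 1111 1111 1111 1111 1111 1111 1111
h >>> 20: 0000 0000 0000 0000 0000 0111 1111 1111
h >>> 12: 0000 0000 0000 0111 1111 1111 1111 1111
h ^= (h >>> 20) ^ (h >>> 12): 0111 1111 1111 1000 0000 0111 1111 1111
h >>> 7: 0000 0000 1111 1111 1111 0000 0000 1111
h >>> 4: 0000 0111 1111 1111 1000 0000 0111 1111
h ^ (h >>> 7) ^ (h >>> 4): 0111 1000 1111 1000 0111 0111 1000 1111
The value finally returned (0x78F8778F) is no longer a run of consecutive 1s but a mix of 0s and 1s. This is the essence of hash(key): it folds the high bits into the low bits, so regular patterns such as a long run of 1s get broken up. Why scramble the bits at all? That is explained below.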
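indexFor is easy to understand: it ANDs the hash value with length-1. The constructor has already forced the array length to a power of two (the second highlight above), so length-1 has all of its low bits set to 1 and its high bits 0. What does ANDing length-1 with the hash produce? It keeps just the low bits of the hash, and the result can never exceed length-1, so the index never goes out of bounds. Why use a bitwise operation rather than modulo? Simply because a bitwise AND is cheaper than a modulo operation, and unlike % it also yields a non-negative index when the hash is negative.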
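The relationship between the mask and the modulo operation can be checked directly. In the sketch below, indexFor has the same body as the JDK method, while moduloIndex and the class name MaskVsModuloDemo are hypothetical helpers added for comparison.

```java
final class MaskVsModuloDemo {
    // Same body as HashMap.indexFor; length must be a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // The naive alternative: the remainder operator keeps the sign of h.
    static int moduloIndex(int h, int length) {
        return h % length;
    }

    public static void main(String[] args) {
        // For non-negative hashes and a power-of-two length, both agree.
        System.out.println(indexFor(37, 16) + " " + moduloIndex(37, 16));
        // For a negative hash, % produces a negative (invalid) index; the mask does not.
        System.out.println(indexFor(-7, 16) + " " + moduloIndex(-7, 16));
        // Math.floorMod (Java 8+) matches the mask for power-of-two lengths.
        System.out.println(Math.floorMod(-7, 16));
    }
}
```

The equivalence only holds because length is a power of two; with a length such as 10, h & 9 would skip some slots entirely.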
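So what is hash(Object k) for? Why does it scramble the bits of the key's hashCode? Because if many keys have hash codes whose low bits are identical and whose high bits differ, taking only the low bits gives the same result for all of them. They all get the same array index, the singly linked list at that position grows too long, and HashMap lookups slow down.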
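The effect is easy to reproduce. The sketch below copies the two shift-and-XOR lines from hash(Object k) into a method named spread (a name chosen for this example) and leaves out the useAltHashing and hashSeed branch, so it only approximates the JDK 7 method for the default case. The three sample hash codes differ only in their high bits.

```java
final class SpreadDemo {
    // The supplemental hash from the listing, without the useAltHashing/hashSeed branch.
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Same body as HashMap.indexFor.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        // Hash codes whose low 16 bits are all zero.
        int[] hashCodes = {0x10000, 0x20000, 0x30000};
        for (int hashCode : hashCodes) {
            System.out.println(String.format("%08X", hashCode)
                    + "  raw index: " + indexFor(hashCode, 16)
                    + "  spread index: " + indexFor(spread(hashCode), 16));
        }
        // The worked example from the text.
        System.out.println(String.format("%08X", spread(0x7FFFFFFF)));
    }
}
```

Without the extra step all three hash codes select index 0 in a table of length 16; after spreading they select indexes 1, 2, and 3.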
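That covers the essentials of HashMap, and hopefully it is clearer now. If you have questions, leave a comment below and Anan will reply as soon as possible.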