开发中常常会用到这样一种数据结构,根据一个关键字,找到所需的信息。这个过程有点像查字典,拿到一个key,去字典表中查找对应的value。Java1.0版本提供了这样的类java.util.Dictionary(抽象类),基本上支持字典表的操作。后来引入了Map接口,更好的描述的这种数据结构。
一个Map中会包含多个K-V对儿,Key不能重复。一个Key最多只能映射到一个Value,但多个Key可以映射到同一个Value。Map还提供了3中集合视图:Key集合、Value集合和Key-Value集合。
public interface Map<K,V> {
int size();
boolean isEmpty();
boolean containsKey(Object key);
boolean containsValue(Object value);
V get(Object key);
V put(K key, V value);
V remove(Object key);
void putAll(Map<? extends K, ? extends V> m);
void clear();
Set<K> keySet();
Collection<V> values();
Set<Map.Entry<K, V>> entrySet();
boolean equals(Object o);
int hashCode();
interface Entry<K,V> {
K getKey();
V getValue();
V setValue(V value);
boolean equals(Object o);
int hashCode();
}
}
大概看一下Map接口包含的操作。可以看到Map接口中还定义了表示一个K-V的接口Entry。
我们经常使用的一种实现就是java.util.HashMap。顾名思义,这种实现就是散列表。那什么是散列表呢?首先肯定是一种表,之前总结过表可以由数组或链表实现。假设我们先把数组看成散列表的“表”,而且现在数组中有一些数据。我要用某个关键字在这个数组中直接定位到一个值。可以想象,这个关键字可能是一个编号,一个姓名。。。总之对我们是有意义的标识。但数组怎么能一下定位到某个数据呢?只能靠下标!这就出现了一个问题,我们怎么把有意义的标识转化成下标呢?可以写一个函数来完成这个转化过程,这个函数就可以认为是散列函数了。
简单起见,先假设key都是int型的整数。假如有一些key:{3,6,7},一个数组int[10]。有了这些前提,先来一个最简单的散列函数:
public int hash(int k){
return k;
}
(我靠!这也太儿戏了吧。。。行不行啊)对于上面的条件,当然可以了。可以看到,3分配到了下标为3的地方;6分配到下标为6的地方。。。看来是可以的。如果修改一下上面的条件,把key里面3变成3333,肯定不行了,数组太小了。那怎么办?把数组长度改成3333 + 1。这倒是能存了,可是我们为了存3个数,用了这么大一个数组,太多空间浪费掉了。好吧,换一个散列函数:
public int hash(int k){
return k % table.length;
}
这样的话,又可以继续使用int[10]了。看看还有什么问题,如果再改一下,key变成{3,6,16,7}。又会有问题了,因为6和16会散列到同一个位置。其实,这个问题就是散列表实现过程会常遇到的问题,称为冲突。要消除冲突,一般的方法有分离链表法,开放地址法,再散列法等。(当然,如果上面int数组的长度取一个素数会更好一些。)先分析到这儿吧。
总之,散列函数的目的是将不同的关键字均匀的散列到表中,散列的越均匀,性能越好。性能可以用对散列表进行插入、删除、查找的时间复杂度太衡量。
那么java.util.HashMap内部的散列函数是怎样的呢?HashMap内部怎么消除散列冲突问题呢?HashMap内部的表会不会扩展呢?带着这些问题,来看看HashMap的源码。
public class HashMap<K,V>
extends AbstractMap<K,V>
implements Map<K,V>, Cloneable, Serializable
{
可以看到,HashMap是
可克隆的,可序列化的。实现了Map接口。AbstractMap中实现了一些骨架方法,很容易理解,就不看了,继续往下。
/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 16;
/**
* The maximum capacity, used if a higher value is implicitly specified
* by either of the constructors with arguments.
* MUST be a power of two <= 1<<30.
*/
static final int MAXIMUM_CAPACITY = 1 << 30;
/**
* The load factor used when none specified in constructor.
*/
static final float DEFAULT_LOAD_FACTOR = 0.75f;
/**
* The table, resized as necessary. Length MUST Always be a power of two.
*/
transient Entry[] table;
/**
* The number of key-value mappings contained in this map.
*/
transient int size;
/**
* The next size value at which to resize (capacity * load factor).
* @serial
*/
int threshold;
/**
* The load factor for the hash table.
*
* @serial
*/
final float loadFactor;
/**
* The number of times this HashMap has been structurally modified
* Structural modifications are those that change the number of mappings in
* the HashMap or otherwise modify its internal structure (e.g.,
* rehash). This field is used to make iterators on Collection-views of
* the HashMap fail-fast. (See ConcurrentModificationException).
*/
transient volatile int modCount;
首先看到,有默认初始化容量和最大容量,默认初始化容量为16,最大容量为2的30次方,而且注释上写了“MUST be a power of two”,不禁让我想起了“a & (b-1)”;默认的加载因子是0.75f(加载因子是干嘛滴?);内部表是数组实现的,类型为java.util.HashMap.Entry。
static class Entry<K,V> implements Map.Entry<K,V> {
final K key;
V value;
Entry<K,V> next;
final int hash;
看来是单向链表实现的;size表示表中实际数据的数量(并非表的长度);threshold和resize有关(看来内部表是会扩展的);一看到modCount就想起了fail-fast吧。
/**
* Constructs an empty <tt>HashMap</tt> with the specified initial
* capacity and load factor.
*
* @param initialCapacity the initial capacity
* @param loadFactor the load factor
* @throws IllegalArgumentException if the initial capacity is negative
* or the load factor is nonpositive
*/
public HashMap(int initialCapacity, float loadFactor) {
if (initialCapacity < 0)
throw new IllegalArgumentException("Illegal initial capacity: " +
initialCapacity);
if (initialCapacity > MAXIMUM_CAPACITY)
initialCapacity = MAXIMUM_CAPACITY;
if (loadFactor <= 0 || Float.isNaN(loadFactor))
throw new IllegalArgumentException("Illegal load factor: " +
loadFactor);
// Find a power of 2 >= initialCapacity
int capacity = 1;
while (capacity < initialCapacity)
capacity <<= 1;
this.loadFactor = loadFactor;
threshold = (int)(capacity * loadFactor);
table = new Entry[capacity];
init();
}
/**
* Constructs an empty <tt>HashMap</tt> with the specified initial
* capacity and the default load factor (0.75).
*
* @param initialCapacity the initial capacity.
* @throws IllegalArgumentException if the initial capacity is negative.
*/
public HashMap(int initialCapacity) {
this(initialCapacity, DEFAULT_LOAD_FACTOR);
}
/**
* Constructs an empty <tt>HashMap</tt> with the default initial capacity
* (16) and the default load factor (0.75).
*/
public HashMap() {
this.loadFactor = DEFAULT_LOAD_FACTOR;
threshold = (int)(DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
table = new Entry[DEFAULT_INITIAL_CAPACITY];
init();
}
/**
* Constructs a new <tt>HashMap</tt> with the same mappings as the
* specified <tt>Map</tt>. The <tt>HashMap</tt> is created with
* default load factor (0.75) and an initial capacity sufficient to
* hold the mappings in the specified <tt>Map</tt>.
*
* @param m the map whose mappings are to be placed in this map
* @throws NullPointerException if the specified map is null
*/
public HashMap(Map<? extends K, ? extends V> m) {
this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1,
DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
putAllForCreate(m);
}
重点看一下构造方法HashMap(int initialCapacity, float loadFactor),这里的一个关键点就是把内部表的初始容量设置成一个大于等于initialCapacity的2的幂。而且在这个构造方法和无参构造方法的最后都调用了一个init()方法,这个方法是提供给子类的钩子方法。
// internal utilities
/**
* Initialization hook for subclasses. This method is called
* in all constructors and pseudo-constructors (clone, readObject)
* after HashMap has been initialized but before any entries have
* been inserted. (In the absence of this method, readObject would
* require explicit knowledge of subclasses.)
*/
void init() {
}
/**
* Applies a supplemental hash function to a given hashCode, which
* defends against poor quality hash functions. This is critical
* because HashMap uses power-of-two length hash tables, that
* otherwise encounter collisions for hashCodes that do not differ
* in lower bits. Note: Null keys always map to hash 0, thus index 0.
*/
static int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
/**
* Returns index for hash code h.
*/
static int indexFor(int h, int length) {
return h & (length-1);
}
先看下indexFor方法,h & (length-1)很熟悉吧,就是h mod length,前提是length是2的幂。hash方法内部很奇怪,又是移位又是异或的,在搞什么灰机?我们可以想一下h mod length这样散列有什么问题,以二进制的角度看,一个数 & (2的n次方-1),就相当于取这个二进制数的低n位。
int型 32=2的5次方
789 = 00000000000000000000001100010101
789 & (32-1)
= 00000000000000000000000000010101
-614 & (32-1)
-614 = 11111111111111111111110110011010
= 00000000000000000000000000011010
假设一些数的低位相同,那么发生散列冲突的概率会很大,为了降低这种冲突的概率,hash方法里做了一些调整,给整数h的各个位上加入了一些随机性。
这里有点资料可做参考。
散列表会提供添加、删除、查找操作常数平均时间的性能。我们先看下添加的代码。
/**
* Associates the specified value with the specified key in this map.
* If the map previously contained a mapping for the key, the old
* value is replaced.
*
* @param key key with which the specified value is to be associated
* @param value value to be associated with the specified key
* @return the previous value associated with <tt>key</tt>, or
* <tt>null</tt> if there was no mapping for <tt>key</tt>.
* (A <tt>null</tt> return can also indicate that the map
* previously associated <tt>null</tt> with <tt>key</tt>.)
*/
public V put(K key, V value) {
if (key == null)
return putForNullKey(value);
int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);
for (Entry<K,V> e = table[i]; e != null; e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
modCount++;
addEntry(hash, key, value, i);
return null;
}
逐行分析,首先当key为null时,调用putForNullKey方法。
/**
* Offloaded version of put for null keys
*/
private V putForNullKey(V value) {
for (Entry<K,V> e = table[0]; e != null; e = e.next) {
if (e.key == null) {
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
modCount++;
addEntry(0, null, value, 0);
return null;
}
Null Key映射到内部表下标为0的位置上,因为每个位置上是一个单向链表(如果有数据的话),所以这里有一个循环。如果找到了Null Key,将其值覆盖,返回旧值。覆盖过程中还顺便调了一下java.util.HashMap.Entry的recordAccess方法(这个方法也可以看成是钩子方法,当然在HashMap里什么都没做)。如果没找到Null Key,会新添加一个Entry(K-V对)。
/**
* Adds a new entry with the specified key, value and hash code to
* the specified bucket. It is the responsibility of this
* method to resize the table if appropriate.
*
* Subclass overrides this to alter the behavior of put method.
*/
void addEntry(int hash, K key, V value, int bucketIndex) {
Entry<K,V> e = table[bucketIndex];
table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
if (size++ >= threshold)
resize(2 * table.length);
}
根据给定的参数,新建了一个Entry,放到了对应下标(可以称这样的位置是一个桶)的表头(之所以放在表头是因为最新被添加到HashMap中的数据可能被访问的几率会大一些),然后做判断,看看有没有必要给表扩容。
继续看put方法,下一行是int hash = hash(key.hashCode())。噢!!原来这里调用了key的hashCode方法。Java的Object类中提供了hashCode方法,所以不管key是什么类型,都会有一个int型的hashCode了。记得之前面试被人问过Object的hashCode的作用是干嘛滴,顿时脑子一片空白。原来是这个作用哦!这样看来,给类提供一个好的hashCode方法是很重要的,假设提供一个巨弱的hashCode方法,比如直接返回1。那么把大量这样的对象作为key,放到HashMap里结果就是,所有的数据都放到了一个桶里,形成了一个单向链表,严重降低了删除和查找操作的性能。继续往下看,int i = indexFor(hash, table.length),用上一步得到的hash值来确定表的下标。后面的操作和putForNullKey里面一样,遍历相应的单链表,找到的话就覆盖,找不到就创建一个Entry。
有了put方法的分析,get方法也很容易看懂了。
/**
* Returns the value to which the specified key is mapped,
* or {@code null} if this map contains no mapping for the key.
*
* <p>More formally, if this map contains a mapping from a key
* {@code k} to a value {@code v} such that {@code (key==null ? k==null :
* key.equals(k))}, then this method returns {@code v}; otherwise
* it returns {@code null}. (There can be at most one such mapping.)
*
* <p>A return value of {@code null} does not <i>necessarily</i>
* indicate that the map contains no mapping for the key; it's also
* possible that the map explicitly maps the key to {@code null}.
* The {@link #containsKey containsKey} operation may be used to
* distinguish these two cases.
*
* @see #put(Object, Object)
*/
public V get(Object key) {
if (key == null)
return getForNullKey();
int hash = hash(key.hashCode());
for (Entry<K,V> e = table[indexFor(hash, table.length)];
e != null;
e = e.next) {
Object k;
if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
return e.value;
}
return null;
}
/**
* Offloaded version of get() to look up null keys. Null keys map
* to index 0. This null case is split out into separate methods
* for the sake of performance in the two most commonly used
* operations (get and put), but incorporated with conditionals in
* others.
*/
private V getForNullKey() {
for (Entry<K,V> e = table[0]; e != null; e = e.next) {
if (e.key == null)
return e.value;
}
return null;
}
看下remove方法。
/**
* Removes the mapping for the specified key from this map if present.
*
* @param key key whose mapping is to be removed from the map
* @return the previous value associated with <tt>key</tt>, or
* <tt>null</tt> if there was no mapping for <tt>key</tt>.
* (A <tt>null</tt> return can also indicate that the map
* previously associated <tt>null</tt> with <tt>key</tt>.)
*/
public V remove(Object key) {
Entry<K,V> e = removeEntryForKey(key);
return (e == null ? null : e.value);
}
/**
* Removes and returns the entry associated with the specified key
* in the HashMap. Returns null if the HashMap contains no mapping
* for this key.
*/
final Entry<K,V> removeEntryForKey(Object key) {
int hash = (key == null) ? 0 : hash(key.hashCode());
int i = indexFor(hash, table.length);
Entry<K,V> prev = table[i];
Entry<K,V> e = prev;
while (e != null) {
Entry<K,V> next = e.next;
Object k;
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k)))) {
modCount++;
size--;
if (prev == e)
table[i] = next;
else
prev.next = next;
e.recordRemoval(this);
return e;
}
prev = e;
e = next;
}
return e;
}
remove方法里,删除时需要先在相应的单向链表里找到要删除的数据,然后通过调整链表的“环”的指针来达到删除效果。删除过程中还顺便调了一下e.recordRemoval方法,这也是个钩子。
要注意一个费时的操作containsValue。
/**
* Returns <tt>true</tt> if this map maps one or more keys to the
* specified value.
*
* @param value value whose presence in this map is to be tested
* @return <tt>true</tt> if this map maps one or more keys to the
* specified value
*/
public boolean containsValue(Object value) {
if (value == null)
return containsNullValue();
Entry[] tab = table;
for (int i = 0; i < tab.length ; i++)
for (Entry e = tab[i] ; e != null ; e = e.next)
if (value.equals(e.value))
return true;
return false;
}
/**
* Special-case code for containsValue with null argument
*/
private boolean containsNullValue() {
Entry[] tab = table;
for (int i = 0; i < tab.length ; i++)
for (Entry e = tab[i] ; e != null ; e = e.next)
if (e.value == null)
return true;
return false;
}
这个操作需要对表进行遍历,而且还要循环链表,所以平均执行时间应该是θ(n)。
再看一下HashMap内部表是怎么扩容的。在添加的时候会将当前表中数据和threshold进行比较,决定是否进行扩容。
void addEntry(int hash, K key, V value, int bucketIndex) {
Entry<K,V> e = table[bucketIndex];
table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
if (size++ >= threshold)
resize(2 * table.length);
}
/**
* Rehashes the contents of this map into a new array with a
* larger capacity. This method is called automatically when the
* number of keys in this map reaches its threshold.
*
* If current capacity is MAXIMUM_CAPACITY, this method does not
* resize the map, but sets threshold to Integer.MAX_VALUE.
* This has the effect of preventing future calls.
*
* @param newCapacity the new capacity, MUST be a power of two;
* must be greater than current capacity unless current
* capacity is MAXIMUM_CAPACITY (in which case value
* is irrelevant).
*/
void resize(int newCapacity) {
Entry[] oldTable = table;
int oldCapacity = oldTable.length;
if (oldCapacity == MAXIMUM_CAPACITY) {
threshold = Integer.MAX_VALUE;
return;
}
Entry[] newTable = new Entry[newCapacity];
transfer(newTable);
table = newTable;
threshold = (int)(newCapacity * loadFactor);
}
/**
* Transfers all entries from current table to newTable.
*/
void transfer(Entry[] newTable) {
Entry[] src = table;
int newCapacity = newTable.length;
for (int j = 0; j < src.length; j++) {
Entry<K,V> e = src[j];
if (e != null) {
src[j] = null;
do {
Entry<K,V> next = e.next;
int i = indexFor(e.hash, newCapacity);
e.next = newTable[i];
newTable[i] = e;
e = next;
} while (e != null);
}
}
}
注意数据从旧表到新表时要根据Entry中保存的hash重新定位在新表中的位置。还要注意threshold=capacity * loadFactor,容量和加载因子共同决定了内部表扩容时机。
注意加载因子对内部表的影响:假设有一个较小的加载因子0.5f,那么当内部表中数据的个数达到表容量一半的时候,便会扩容。这无疑浪费了很多空间,但换来的却是查询和删除的性能。因为散列表的空间大,那么每个桶里的链表的平均长度就会缩短,而一个查找操作做耗费的时间,最坏情况下是计算散列值花费的常数时间加上遍历桶内链表所花费的时间。反过来,一个大的加载因子,会减少一定的空间浪费,但是会增加查找和删除的平均操作时间。所以如果要自己设置加载因子和容量,需要根据实际上下文环境来选择合理的大小。
有了以上的分析,HashMap内部的迭代器HashIterator也很容易看懂,这里就不分析了。
最后小总结一下,HashMap是基于散列表实现的一种表示键值对儿集合的集合工具类,对于按照key进行的添加、查找和删除操作,它都提供了平均常数时间的性能。最后注意一点,它是线程不安全的。