HashMap Source Code Analysis

Contents

1 Background

2 Class Javadoc

3 Implementation Notes

4 Method Implementations

4.1 Fields

4.2 Hash Computation

4.3 Capacity Computation


1 Background

HashMap comes up constantly in everyday work and study, and it packs in a lot of detail, so this post analyzes it in depth from the source-code angle. All source code in this article is taken from JDK 8.

2 Class Javadoc

/**
 * Hash table based implementation of the Map interface.  This
 * implementation provides all of the optional map operations, and permits
 * null values and the null key.  (The HashMap
 * class is roughly equivalent to Hashtable, except that it is
 * unsynchronized and permits nulls.)  This class makes no guarantees as to
 * the order of the map; in particular, it does not guarantee that the order
 * will remain constant over time.

Not synchronized; both the null key and null values are permitted; no ordering guarantee.

Q: Why do Hashtable and ConcurrentHashMap not support null?

A: Hashtable and ConcurrentHashMap reject null keys and values because a null returned from get() would be ambiguous: it could mean the key is absent, or that the key is mapped to null, and under concurrent access a follow-up check cannot settle the question (the map may change between the two calls). HashMap is not a concurrency utility, so the two cases can be told apart with map.containsKey(key).
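A minimal single-threaded sketch of that ambiguity and how containsKey resolves it (class and variable names are mine, not from the source):

    import java.util.HashMap;
    import java.util.Map;

    public class NullKeyDemo {
        public static void main(String[] args) {
            Map<String, String> map = new HashMap<>();
            map.put(null, "value-for-null-key"); // null key is allowed
            map.put("k", null);                  // null value is allowed

            // A null from get() is ambiguous on its own:
            System.out.println(map.get("k"));       // null (mapped to null)
            System.out.println(map.get("missing")); // null (key absent)

            // In single-threaded code, containsKey() disambiguates:
            System.out.println(map.containsKey("k"));       // true
            System.out.println(map.containsKey("missing")); // false
        }
    }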


This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.

Assuming a reasonable hash function, get and put run in constant O(1) time. If iteration performance matters, do not set the initial capacity too high or the load factor too low.


An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

Once the number of entries reaches load factor * capacity, a rehash occurs and the table doubles in size.

Q: Why grow by a factor of two?

A: 1. It limits hash collisions: on a resize, each entry either stays at its old index or moves by exactly oldCap, so the entries in every bin are split evenly across the new table. 2. A power-of-two capacity lets the bucket index be computed with a bitwise AND, (n - 1) & hash, instead of an expensive modulo. A small sketch of both points follows.
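The values below are chosen arbitrarily for illustration:

    public class PowerOfTwoIndexDemo {
        public static void main(String[] args) {
            int n = 16;             // table capacity, always a power of two
            int hash = 123456789;   // an arbitrary non-negative hash

            // For power-of-two n, (n - 1) is an all-ones low-bit mask,
            // so the bitwise AND is equivalent to (and cheaper than) %:
            System.out.println(hash % n);       // 5
            System.out.println(hash & (n - 1)); // 5

            // After doubling, an entry either keeps its index or moves
            // by exactly oldCap, decided by one extra hash bit:
            int oldIndex = hash & (n - 1);
            int newIndex = hash & (2 * n - 1);
            System.out.println(newIndex == oldIndex || newIndex == oldIndex + n); // true
        }
    }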


As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

In short: the default load factor (0.75) is a good tradeoff between time and space. Higher values reduce space overhead but increase lookup cost, which shows up in most HashMap operations, including get and put. Set the initial capacity with the expected number of entries and the load factor in mind so as to minimize rehashing; if the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash will ever occur.

Q: How should the initial capacity be set?

A: expected number of mappings / load factor + 1.
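A sketch of that rule of thumb; the helper name presize is hypothetical (Guava's Maps.newHashMapWithExpectedSize applies essentially the same formula):

    import java.util.HashMap;
    import java.util.Map;

    public class PresizeDemo {
        // Hypothetical helper: a capacity large enough that `expected`
        // entries never trigger a resize at the default 0.75 load factor.
        static <K, V> Map<K, V> presize(int expected) {
            return new HashMap<>((int) (expected / 0.75f) + 1);
        }

        public static void main(String[] args) {
            // For 1000 expected entries: 1000 / 0.75 + 1 = 1334. HashMap
            // rounds that up to the next power of two (2048), whose
            // threshold 2048 * 0.75 = 1536 comfortably exceeds 1000.
            Map<String, Integer> m = presize(1000);
            for (int i = 0; i < 1000; i++) {
                m.put("key" + i, i);
            }
        }
    }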


If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. Note that using many keys with the same {@code hashCode()} is a sure way to slow down performance of any hash table. To ameliorate impact, when keys are {@link Comparable}, this class may use comparison order among keys to help break ties.

In short: when many mappings are to be stored, creating the map with a sufficiently large capacity up front is more efficient than letting it grow through repeated rehashing. Many keys sharing the same hashCode() will slow down any hash table; to soften the impact, when such keys are Comparable, this class uses their comparison order to break ties.


Note that this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key that an instance already contains is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the {@link Collections#synchronizedMap Collections.synchronizedMap} method. This is best done at creation time, to prevent accidental unsynchronized access to the map:

    Map m = Collections.synchronizedMap(new HashMap(...));

Under concurrent access, if any thread modifies the map structurally, synchronization must be applied externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with a key the instance already contains is not one.) This is usually done by synchronizing on some object that naturally encapsulates the map; if no such object exists, wrap the map with Collections#synchronizedMap, preferably at creation time to prevent accidental unsynchronized access: Map m = Collections.synchronizedMap(new HashMap(...));
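A sketch of the wrapping idiom; note that iteration still needs manual synchronization on the wrapper, as the Collections.synchronizedMap javadoc requires:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class SynchronizedWrapDemo {
        public static void main(String[] args) {
            // Wrap at creation time so no unsynchronized reference escapes.
            Map<String, Integer> m = Collections.synchronizedMap(new HashMap<>());
            m.put("a", 1);
            m.put("b", 2);

            // Individual calls are synchronized, but iteration is a
            // sequence of calls and must be guarded on the wrapper itself.
            synchronized (m) {
                for (Map.Entry<String, Integer> e : m.entrySet()) {
                    System.out.println(e.getKey() + "=" + e.getValue());
                }
            }
        }
    }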


The iterators returned by all of this class's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove method, the iterator will throw a {@link ConcurrentModificationException}. Thus, in the face of concurrent modification, the iterator fails quickly and cleanly, rather than risking arbitrary, non-deterministic behavior at an undetermined time in the future.

Note that the fail-fast behavior of an iterator cannot be guaranteed as it is, generally speaking, impossible to make any hard guarantees in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. Therefore, it would be wrong to write a program that depended on this exception for its correctness: the fail-fast behavior of iterators should be used only to detect bugs.

The iterators returned by this class's "collection view methods" are fail-fast: if the map is structurally modified at any time after the iterator is created, by any means other than the iterator's own remove method, the iterator throws a ConcurrentModificationException. Faced with concurrent modification, the iterator thus fails quickly and cleanly instead of risking arbitrary, non-deterministic behavior at some undetermined time in the future.

Note that fail-fast behavior cannot be guaranteed: in the presence of unsynchronized concurrent modification no hard guarantee is possible, so fail-fast iterators throw ConcurrentModificationException on a best-effort basis. It would be wrong to write a program whose correctness depends on this exception; fail-fast behavior should be used only to detect bugs.
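A sketch that triggers fail-fast in a single thread, then shows the supported iterator.remove path:

    import java.util.ConcurrentModificationException;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    public class FailFastDemo {
        public static void main(String[] args) {
            Map<String, Integer> m = new HashMap<>();
            m.put("a", 1);
            m.put("b", 2);

            // Structural modification during iteration trips the
            // modCount check on the iterator's next step.
            try {
                for (String k : m.keySet()) {
                    m.remove(k);
                }
            } catch (ConcurrentModificationException e) {
                System.out.println("fail-fast triggered");
            }

            // The iterator's own remove() keeps expectedModCount in
            // sync and is the supported way to delete while iterating.
            for (Iterator<String> it = m.keySet().iterator(); it.hasNext(); ) {
                it.next();
                it.remove();
            }
            System.out.println(m.isEmpty()); // true
        }
    }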

3 Implementation Notes

     * This map usually acts as a binned (bucketed) hash table, but
     * when bins get too large, they are transformed into bins of
     * TreeNodes, each structured similarly to those in
     * java.util.TreeMap. Most methods try to use normal bins, but
     * relay to TreeNode methods when applicable (simply by checking
     * instanceof a node).  Bins of TreeNodes may be traversed and
     * used like any others, but additionally support faster lookup
     * when overpopulated. However, since the vast majority of bins in
     * normal use are not overpopulated, checking for existence of
     * tree bins may be delayed in the course of table methods.

When a bin accumulates too many elements, it is converted into a bin of TreeNodes. Tree bins can be traversed and used like any other bin, but additionally support faster lookup when overpopulated. Since the vast majority of bins in normal use are never overpopulated, the check for tree bins can be deferred in the course of table methods.

Q: What does it mean that checking for tree bins may be delayed?

A: Since almost no bin ever treeifies, the table methods are written for the common linked-list case and discover a tree bin lazily, by checking whether the first node of a bin is an instanceof TreeNode only when that bin is actually reached, rather than paying for the check up front.

     * Tree bins (i.e., bins whose elements are all TreeNodes) are
     * ordered primarily by hashCode, but in the case of ties, if two
     * elements are of the same "class C implements Comparable<C>",
     * type then their compareTo method is used for ordering. (We
     * conservatively check generic types via reflection to validate
     * this -- see method comparableClassFor).  The added complexity
     * of tree bins is worthwhile in providing worst-case O(log n)
     * operations when keys either have distinct hashes or are
     * orderable, Thus, performance degrades gracefully under
     * accidental or malicious usages in which hashCode() methods
     * return values that are poorly distributed, as well as those in
     * which many keys share a hashCode, so long as they are also
     * Comparable. (If neither of these apply, we may waste about a
     * factor of two in time and space compared to taking no
     * precautions. But the only known cases stem from poor user
     * programming practices that are already so slow that this makes
     * little difference.)

Tree bins are ordered primarily by hashCode; in case of ties, if two elements are of the same "class C implements Comparable<C>" type, their compareTo method is used for ordering (the generic type is conservatively validated via reflection, see comparableClassFor). The added complexity of tree bins is worthwhile because it provides worst-case O(log n) operations whenever keys have distinct hashes or are orderable. Performance therefore degrades gracefully under accidental or malicious usage where hashCode() returns poorly distributed values, or where many keys share a hashCode, as long as those keys are also Comparable. (If neither applies, we may waste about a factor of two in time and space compared with taking no precautions, but the only known such cases stem from user programming practices already so slow that it makes little difference.)
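A runnable sketch of that degradation path; BadKey is a hypothetical class whose hashCode always collides but which implements Comparable, so tree bins can still order it:

    import java.util.HashMap;
    import java.util.Map;

    public class TreeBinDemo {
        // Hypothetical worst-case key: every instance shares one
        // hashCode, but compareTo gives HashMap a total order to use
        // inside tree bins.
        static final class BadKey implements Comparable<BadKey> {
            final int id;
            BadKey(int id) { this.id = id; }
            @Override public int hashCode() { return 42; } // all collide
            @Override public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }
            @Override public int compareTo(BadKey o) {
                return Integer.compare(id, o.id);
            }
        }

        public static void main(String[] args) {
            Map<BadKey, Integer> m = new HashMap<>();
            // All 1000 keys land in one bin; once it passes
            // TREEIFY_THRESHOLD (and the table reaches
            // MIN_TREEIFY_CAPACITY), the bin becomes a red-black tree
            // and lookups drop from O(n) to O(log n).
            for (int i = 0; i < 1000; i++) {
                m.put(new BadKey(i), i);
            }
            System.out.println(m.get(new BadKey(500))); // 500
        }
    }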

     * Because TreeNodes are about twice the size of regular nodes, we
     * use them only when bins contain enough nodes to warrant use
     * (see TREEIFY_THRESHOLD). And when they become too small (due to
     * removal or resizing) they are converted back to plain bins.  In
     * usages with well-distributed user hashCodes, tree bins are
     * rarely used.  Ideally, under random hashCodes, the frequency of
     * nodes in bins follows a Poisson distribution
     * (http://en.wikipedia.org/wiki/Poisson_distribution) with a
     * parameter of about 0.5 on average for the default resizing
     * threshold of 0.75, although with a large variance because of
     * resizing granularity. Ignoring variance, the expected
     * occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
     * factorial(k)). The first values are:
     *
     * 0:    0.60653066
     * 1:    0.30326533
     * 2:    0.07581633
     * 3:    0.01263606
     * 4:    0.00157952
     * 5:    0.00015795
     * 6:    0.00001316
     * 7:    0.00000094
     * 8:    0.00000006
     * more: less than 1 in ten million

A TreeNode is roughly twice the size of a regular node, so a bin is converted to a tree only once its list grows past TREEIFY_THRESHOLD, and converted back to a plain bin when it becomes small again (through removal or resizing). Ideally, under random hashCodes, the number of nodes per bin follows a Poisson distribution with a parameter of about 0.5 for the default resize threshold of 0.75, albeit with a large variance because of resizing granularity. Ignoring variance, the expected frequency of a list of size k is exp(-0.5) * pow(0.5, k) / factorial(k).

Q: Why is the treeify threshold 8?

A: Per the table above: under random hashCodes the probability of any single bin holding 8 nodes is about 0.00000006, less than one in ten million, so treeification essentially never happens for well-distributed keys, and the doubled space cost of TreeNodes is paid only in pathological cases.
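The table can be reproduced directly from the formula in the comment; a quick sketch:

    public class PoissonCheck {
        public static void main(String[] args) {
            // P(k) = exp(-0.5) * 0.5^k / k!  -- the expected frequency
            // of a bin holding k nodes, per the source comment.
            double factorial = 1.0;
            for (int k = 0; k <= 8; k++) {
                if (k > 0) factorial *= k;
                double p = Math.exp(-0.5) * Math.pow(0.5, k) / factorial;
                System.out.printf("%d: %.8f%n", k, p);
            }
            // Prints 0: 0.60653066 ... 8: 0.00000006, matching the table.
        }
    }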

     * The root of a tree bin is normally its first node.  However,
     * sometimes (currently only upon Iterator.remove), the root might
     * be elsewhere, but can be recovered following parent links
     * (method TreeNode.root()).
     *
     * When bin lists are treeified, split, or untreeified, we keep
     * them in the same relative access/traversal order (i.e., field
     * Node.next) to better preserve locality, and to slightly
     * simplify handling of splits and traversals that invoke
     * iterator.remove. When using comparators on insertion, to keep a
     * total ordering (or as close as is required here) across
     * rebalancings, we compare classes and identityHashCodes as
     * tie-breakers.
     *
     * The use and transitions among plain vs tree modes is
     * complicated by the existence of subclass LinkedHashMap. See
     * below for hook methods defined to be invoked upon insertion,
     * removal and access that allow LinkedHashMap internals to
     * otherwise remain independent of these mechanics. (This also
     * requires that a map instance be passed to some utility methods
     * that may create new nodes.)
     *
     * The concurrent-programming-like SSA-based coding style helps
     * avoid aliasing errors amid all of the twisty pointer operations.

Normally the root of a tree bin is its first node. Occasionally (currently only after Iterator.remove) the root may be elsewhere, but it can be recovered by following parent links (method TreeNode.root()).

When bin lists are treeified, split, or untreeified, they are kept in the same relative access/traversal order (the Node.next field) to better preserve locality and to slightly simplify the handling of splits and traversals that invoke iterator.remove. When comparators are used on insertion, classes and identityHashCodes serve as tie-breakers to keep a total ordering (or as close as required here) across rebalancings.

The existence of the subclass LinkedHashMap complicates the use of, and transitions between, plain and tree modes. Hook methods are defined to be invoked on insertion, removal, and access, which let LinkedHashMap's internals otherwise remain independent of these mechanics. (This also requires passing a map instance to some utility methods that may create new nodes.)

The concurrent-programming-like, SSA-based coding style helps avoid aliasing errors amid all the twisty pointer operations.

4 Method Implementations

4.1 Fields

    /**
     * The default initial capacity - MUST be a power of two.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin.  Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

    /**
     * The smallest table capacity for which bins may be treeified.
     * (Otherwise the table is resized if too many nodes in a bin.)
     * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
     * between resizing and treeification thresholds.
     */
    static final int MIN_TREEIFY_CAPACITY = 64;



    /**
     * The number of times this HashMap has been structurally modified
     * Structural modifications are those that change the number of mappings in
     * the HashMap or otherwise modify its internal structure (e.g.,
     * rehash).  This field is used to make iterators on Collection-views of
     * the HashMap fail-fast.  (See ConcurrentModificationException).
     */
    transient int modCount;

    /**
     * The next size value at which to resize (capacity * load factor).
     *
     * @serial
     */
    // (The javadoc description is true upon serialization.
    // Additionally, if the table array has not been allocated, this
    // field holds the initial array capacity, or zero signifying
    // DEFAULT_INITIAL_CAPACITY.)
    int threshold;

4.2 Hash Computation

    /**
     * Computes key.hashCode() and spreads (XORs) higher bits of hash
     * to lower.  Because the table uses power-of-two masking, sets of
     * hashes that vary only in bits above the current mask will
     * always collide. (Among known examples are sets of Float keys
     * holding consecutive whole numbers in small tables.)  So we
     * apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed (so don't benefit from
     * spreading), and because we use trees to handle large sets of
     * collisions in bins, we just XOR some shifted bits in the
     * cheapest possible way to reduce systematic lossage, as well as
     * to incorporate impact of the highest bits that would otherwise
     * never be used in index calculations because of table bounds.
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

The high 16 bits stay unchanged; the low 16 bits become (high 16 bits XOR low 16 bits). Why XOR? It reduces hash collisions: whenever any single bit of the hashCode changes, the spread result changes, and unlike AND (biased toward 0) or OR (biased toward 1), XOR preserves the bit distribution.
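A sketch of the collision the comment describes: two hash codes that differ only above the index mask collide until the high bits are folded in (values chosen for illustration):

    public class HashSpreadDemo {
        // Same spreading as HashMap.hash(): fold the high 16 bits
        // into the low 16 with XOR.
        static int spread(int h) { return h ^ (h >>> 16); }

        public static void main(String[] args) {
            int n = 16; // small table: the index uses only the low 4 bits
            // Two hash codes that differ only in bits 16-17:
            int h1 = 0x00010001;
            int h2 = 0x00020001;

            // Without spreading, masking drops the differing bits:
            System.out.println((h1 & (n - 1)) + " vs " + (h2 & (n - 1))); // 1 vs 1

            // With spreading, the high bits influence the index:
            System.out.println((spread(h1) & (n - 1)) + " vs "
                    + (spread(h2) & (n - 1))); // 0 vs 3
        }
    }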

4.3 Capacity Computation

    /**
     * Returns a power of two size for the given target capacity.
     * i.e., the smallest power of two >= cap
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }
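A worked illustration of the bit-smearing; the class below is a standalone copy for demonstration:

    public class TableSizeForDemo {
        static final int MAXIMUM_CAPACITY = 1 << 30;

        // Verbatim copy of JDK 8's HashMap.tableSizeFor.
        static int tableSizeFor(int cap) {
            int n = cap - 1;
            n |= n >>> 1;
            n |= n >>> 2;
            n |= n >>> 4;
            n |= n >>> 8;
            n |= n >>> 16;
            return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
        }

        public static void main(String[] args) {
            // cap = 17: n = 16 = 10000b; the shifts OR-copy the highest
            // set bit over every lower bit, giving 11111b = 31; +1 = 32.
            System.out.println(tableSizeFor(17)); // 32
            // The initial cap - 1 makes exact powers of two map to
            // themselves rather than the next power up:
            System.out.println(tableSizeFor(16)); // 16
        }
    }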

4.4 
