Java Concurrency: JUC (Part 1)

Introduction

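This article shows how to use the concurrency library that ships with the JDK (JUC) and then dissects its implementation from the top down: starting with the framework directly beneath it, AbstractQueuedSynchronizer (commonly known as AQS), moving on to the CAS, Wait, and Park primitives it relies on, and ending with Mutex and Condition at the operating-system level. I hope it gives you a clear and complete picture of Java concurrency; once these pieces are strung together, you will find that they all rest on the same ideas.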

JUC

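The java.util.concurrent package (JUC) contains a wide range of concurrency-control tools. This section briefly introduces some of the commonly used ones and how to use them.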

Atomic Classes

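There are many Atomic classes, all in the java.util.concurrent.atomic package. Most of them are implemented with CAS (compare-and-swap), and the concrete implementation of CAS depends on instructions provided by the hardware architecture.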

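We only look at a few examples here rather than covering every Atomic class. Start with AtomicInteger: it lets us modify an int value without locks while guaranteeing that the modification is atomic.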

public static class LockTest {
    private AtomicInteger sum = new AtomicInteger(0);

    public void increase() {
        sum.incrementAndGet();
    }
}

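It is a good fit for counting page views, for example. Because no lock is taken, threads are not blocked around the update, so the context-switch cost that a contended lock can incur is avoided and the operation is fast.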
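To make the CAS idea concrete, here is a minimal sketch of how an increment can be written as an explicit compare-and-set retry loop. The class CasCounterSketch and its methods are names I made up for illustration; the JDK's own incrementAndGet may be implemented differently (for example with a hardware fetch-and-add), so treat this as a model of the technique rather than the library's code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class, not part of the JDK.
class CasCounterSketch {
    // Increment by reading the current value and retrying the CAS until it wins.
    static int casIncrement(AtomicInteger value) {
        for (;;) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread changed the value in between: loop and retry.
        }
    }

    // Runs `threads` workers that each increment a shared counter `perThread` times.
    static int countWithThreads(int threads, int perThread) {
        AtomicInteger sum = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    casIncrement(sum);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sum.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10_000));
    }
}
```

No update is lost even though no lock is taken: a failed CAS simply means another thread got there first, and the loop tries again with the fresh value.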

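If you need to change more than one primitive value at a time, you can wrap the values in an object and use AtomicReference to replace the whole object atomically. The example below updates all of an object's fields in one step.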

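public static class LockTest {
    // The reference starts out holding null
    private AtomicReference<LockTest> reference = new AtomicReference<>();
    private int test1;
    private int test2;

    public void changeObject() {
        reference.getAndUpdate(new UnaryOperator<LockTest>() {
            @Override public LockTest apply(LockTest test) {
                // Build a fresh object instead of mutating the current one,
                // so that all fields change together when the reference is swapped
                LockTest newItem = new LockTest();
                if (test != null) { // guard against the initial null value
                    newItem.test1 = test.test1 + 1;
                    newItem.test2 = test.test2 - 1;
                }
                return newItem;
            }
        });
    }
}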
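The same pattern reads more easily with an immutable value class and a lambda. The following is only a sketch under my own assumptions: PairUpdateSketch and Pair are hypothetical names, and the point is merely that both fields always change together, because each update swaps in a new object.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration class.
class PairUpdateSketch {
    // Immutable snapshot of two values that must change together.
    static final class Pair {
        final int up;
        final int down;

        Pair(int up, int down) {
            this.up = up;
            this.down = down;
        }
    }

    // Each worker atomically replaces the pair `perThread` times.
    static Pair runUpdates(int threads, int perThread) {
        AtomicReference<Pair> ref = new AtomicReference<>(new Pair(0, 0));
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    // updateAndGet retries the function if another thread won the CAS
                    ref.updateAndGet(p -> new Pair(p.up + 1, p.down - 1));
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ref.get();
    }
}
```

Because the update function may run more than once under contention, it should be free of side effects.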

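Atomic classes are fast, but they are exposed to the ABA problem: if a value changes from A to B and then back to A, it has in fact been modified, yet a CAS sees A and wrongly concludes that nothing changed. To handle this case you can use AtomicStampedReference, which tracks changes with a version stamp. As long as you follow the convention of incrementing the stamp on every modification, the ABA problem is avoided.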

public static class LockTest {
    private AtomicStampedReference<LockTest> reference = new AtomicStampedReference<>(null, 0);
    private int test1;

    public void changeObject() {
        LockTest newObject = new LockTest();
        for (; ; ) {
            int previousStamp = reference.getStamp();
            LockTest previousObject = reference.getReference();
            if (reference.compareAndSet(previousObject, newObject, previousStamp, previousStamp + 1)) {
                break;
            }
        }

    }

}
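The ABA scenario can be reproduced deterministically on a single thread. In this sketch (AbaSketch and its method names are hypothetical), a stale observer tries a CAS after the value has gone from A to B and back to A: the plain AtomicReference accepts it, while the AtomicStampedReference rejects it because the stamp has moved on.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicStampedReference;

// Hypothetical illustration class.
class AbaSketch {
    static final String A = "A";
    static final String B = "B";

    // A plain CAS cannot tell that A -> B -> A happened in between.
    static boolean plainCasAfterAba() {
        AtomicReference<String> ref = new AtomicReference<>(A);
        String observed = ref.get();          // the observer reads A
        ref.compareAndSet(A, B);              // someone else: A -> B
        ref.compareAndSet(B, A);              // someone else: B -> A
        return ref.compareAndSet(observed, B); // succeeds despite the changes
    }

    // With a stamp, the stale observer's CAS fails.
    static boolean stampedCasAfterAba() {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>(A, 0);
        int[] stampHolder = new int[1];
        String observed = ref.get(stampHolder); // the observer reads A with stamp 0
        int observedStamp = stampHolder[0];
        ref.compareAndSet(A, B, 0, 1);          // someone else: A -> B, stamp 1
        ref.compareAndSet(B, A, 1, 2);          // someone else: B -> A, stamp 2
        return ref.compareAndSet(observed, B, observedStamp, observedStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println("plain CAS succeeded: " + plainCasAfterAba());
        System.out.println("stamped CAS succeeded: " + stampedCasAfterAba());
    }
}
```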

Semaphore

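Like synchronized, a Semaphore controls whether a thread may enter a guarded region of code. The difference is that synchronized admits only one thread at a time, whereas a Semaphore can allow a specified number of threads to access a resource concurrently.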

public static class LockTest {

    public static void main(String[] args) {
        ExecutorService threadPool = Executors.newFixedThreadPool(300);
        Semaphore semaphore = new Semaphore(5);
        for (int i = 0; i < 100; i++) {
            int finali = i;
            threadPool.execute(() -> {
                try {
                    semaphore.acquire();
                    try {
                        System.out.println("Index:" + finali);
                        Thread.sleep(2000);
                    } finally {
                        // Always return the permit, even if the sleep is interrupted
                        semaphore.release();
                    }
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            });
        }
        threadPool.shutdown();
    }

}

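In the example above we created the Semaphore with 5 permits, so at most 5 threads are inside the guarded region at any moment. Each call to acquire consumes a permit and each call to release returns one; when no permit is available, acquire makes the calling thread wait in a queue. Note that Semaphore has two modes, fair and non-fair: in fair mode permits are granted in FIFO order, while in non-fair mode no ordering is guaranteed.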
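The permit accounting described above can be checked without any timing dependence. This sketch (SemaphorePermitSketch is a hypothetical name) builds a fair Semaphore with the two-argument constructor, uses up its permits, and shows that the non-blocking tryAcquire fails until a permit is released.

```java
import java.util.concurrent.Semaphore;

// Hypothetical illustration class.
class SemaphorePermitSketch {
    // Returns {permits left after two acquires, tryAcquire result (1 or 0), permits after release}.
    static int[] permitTrace() {
        // The second constructor argument selects fair (FIFO) mode.
        Semaphore semaphore = new Semaphore(2, true);
        int[] trace = new int[3];
        semaphore.acquireUninterruptibly();
        semaphore.acquireUninterruptibly();
        trace[0] = semaphore.availablePermits();
        // tryAcquire never blocks; with no permits left it returns false.
        trace[1] = semaphore.tryAcquire() ? 1 : 0;
        semaphore.release();
        trace[2] = semaphore.availablePermits();
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(permitTrace()));
    }
}
```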

CountDownLatch

CountDownLatch is a synchronization utility that lets one or more threads wait until operations in other threads have finished. Here is an example:

public static class LockTest {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService threadPool = Executors.newFixedThreadPool(1);
        CountDownLatch countDownLatch = new CountDownLatch(5);
        for (int i = 0; i < 5; i++) {
            int finali = i;
            threadPool.submit(() -> {
                try {
                    System.out.println("Index:" + finali);
                    Thread.sleep(2000);
                    countDownLatch.countDown();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            });
        }
        countDownLatch.await();
        System.out.println("Finish");
        // Shut the pool down so that the JVM can exit
        threadPool.shutdown();
    }
}

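First we create the CountDownLatch with a count of 5, one for each task we want to wait for. Every task calls countDown when it finishes, which decrements the latch's internal counter by 1, and CountDownLatch::await returns only once the counter reaches 0. I mostly use it to implement the Future interface. Note that a CountDownLatch is one-shot: the counter can be initialized only once, in the constructor, and there is no mechanism to set it again, so a latch that has been used up cannot be reused.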
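As a rough idea of what "implementing a Future with a latch" can look like, here is a minimal result holder built on a CountDownLatch with a count of 1. LatchResultSketch is a hypothetical class; it deliberately leaves out cancellation, timeouts, and exception propagation, all of which a real Future implementation needs.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration class: a one-shot result holder.
class LatchResultSketch<T> {
    private final CountDownLatch done = new CountDownLatch(1);
    // No volatile needed: writes before countDown() happen-before a successful await().
    private T value;

    // Publishes the result and releases all waiting threads.
    void complete(T result) {
        value = result;
        done.countDown();
    }

    // Blocks until complete() has been called, then returns the result.
    T get() {
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return value;
    }

    // Completes the holder from another thread and waits for the value.
    static String demo() {
        LatchResultSketch<String> result = new LatchResultSketch<>();
        new Thread(() -> result.complete("ready")).start();
        return result.get();
    }
}
```

Because the latch is one-shot, every later call to get returns immediately, which is exactly the behavior a completed Future should have.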

CyclicBarrier

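CyclicBarrier is very similar to CountDownLatch in that it also lets threads wait on a count, but it is more complex and more powerful. It can hold a group of threads until all of them have finished the first round of work and then release them together to start the next round.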

public static class LockTest {
    public static void main(String[] args) {
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        CyclicBarrier cyclicBarrier = new CyclicBarrier(5, () -> System.out.println("barrierAction merge data"));
        for (int i = 0; i < 5; i++) {
            int finali = i;
            threadPool.submit(() -> {
                try {
                    System.out.println("Task 1 Begin Index:" + finali);
                    Thread.sleep(ThreadLocalRandom.current().nextInt(2000));
                    System.out.println("Task 1 Finished Index:" + finali);
                    cyclicBarrier.await();
                    System.out.println("Task 2 Begin Index:" + finali);
                    Thread.sleep(ThreadLocalRandom.current().nextInt(2000));
                } catch (InterruptedException | BrokenBarrierException e) {
                    e.printStackTrace();
                }
            });
        }
    }
}

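CyclicBarrier suits workloads that process data in groups and in which each round depends on the results of the previous one. For example, when a large task is split into many small tasks, the barrierAction can merge the results of a round once all of its small tasks have completed, and only then does the next round begin.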

The difference between CyclicBarrier and CountDownLatch, in the words of their Javadoc:
CountDownLatch: A synchronization aid that allows one or more threads to wait until a set of operations being performed in other threads completes.
CyclicBarrier: A synchronization aid that allows a set of threads to all wait for each other to reach a common barrier point.
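What sets the barrier apart in practice is that it resets itself after every trip. In the sketch below (BarrierReuseSketch is a hypothetical name), the same CyclicBarrier instance is awaited once per round by every worker, and the barrier action runs exactly once per round.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class.
class BarrierReuseSketch {
    // Returns how many times the barrier action ran.
    static int runRounds(int parties, int rounds) {
        AtomicInteger barrierActions = new AtomicInteger();
        // The action runs once each time all parties have arrived.
        CyclicBarrier barrier = new CyclicBarrier(parties, barrierActions::incrementAndGet);
        Thread[] workers = new Thread[parties];
        for (int t = 0; t < parties; t++) {
            workers[t] = new Thread(() -> {
                try {
                    for (int r = 0; r < rounds; r++) {
                        // The same barrier is reused for every round.
                        barrier.await();
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return barrierActions.get();
    }
}
```

A CountDownLatch cannot be used this way: once its count has reached zero it stays open, so a new latch would be needed for every round.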

ThreadLocal

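Normally, a variable we create can be read and modified by any thread. The JDK also offers a way to give each thread its own private copy of a variable: ThreadLocal. Because every thread has its own copy, threads do not need locks when working with a ThreadLocal variable.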

public static class LockTest {
    public static void main(String[] args) throws InterruptedException {
        ThreadLocal<Integer> threadLocal = new ThreadLocal<>();
        new Thread(new Runnable() {
            @Override public void run() {
                try {
                    System.out.println("Thread 1 Current Value:" + threadLocal.get());
                    threadLocal.set(10);
                    Thread.sleep(500);
                    System.out.println("Thread 1 Current Value:" + threadLocal.get());
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }).start();
        Thread.sleep(100);
        new Thread(new Runnable() {
            @Override public void run() {
                try {
                    System.out.println("Thread 2 Current Value:" + threadLocal.get());
                    threadLocal.set(5);
                    Thread.sleep(1000);
                    System.out.println("Thread 2 Current Value:" + threadLocal.get());
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }).start();
    }
}

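The initial value of threadLocal is null. Thread 1 starts first and sets the value to 10. Thread 2 starts 100 ms later and still sees null, then sets the value to 5. About 400 ms after that, thread 1 wakes from its sleep and still sees 10. The values the two threads observe are therefore independent of each other.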
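The same isolation can be shown without sleeps by joining the second thread. This sketch (ThreadLocalIsolationSketch is a hypothetical name) also uses ThreadLocal.withInitial so that a thread which has not set anything sees 0 instead of null.

```java
// Hypothetical illustration class.
class ThreadLocalIsolationSketch {
    // Returns {value seen by the calling thread, value first seen by the other thread}.
    static int[] observe() {
        ThreadLocal<Integer> local = ThreadLocal.withInitial(() -> 0);
        int[] seen = new int[2];
        local.set(10);
        Thread other = new Thread(() -> {
            // This thread has its own copy and starts from the initial value.
            seen[1] = local.get();
            local.set(5);
        });
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // The other thread's set(5) did not touch the caller's copy.
        seen[0] = local.get();
        local.remove();
        return seen;
    }
}
```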

ThreadLocal Implementation

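Here is a brief look at how ThreadLocal is implemented. Every Thread object holds a ThreadLocalMap whose keys are ThreadLocal objects and whose values are the corresponding thread-local values. Each Entry in the ThreadLocalMap holds its key through a weak reference to help GC: once a ThreadLocal object is no longer strongly referenced anywhere else, it can be collected, which leaves entries with a null key in the ThreadLocalMap. The value, however, is a strong reference, so if entries with null keys were never cleaned up they would leak memory. The ThreadLocalMap implementation accounts for this: calls to set(), get(), and remove() clean up the null-key entries they come across. Even so, it is best to call remove() explicitly when you are done with a ThreadLocal, to help GC.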

public class Thread implements Runnable {
    /* ThreadLocal values pertaining to this thread. This map is maintained
     * by the ThreadLocal class. */
    ThreadLocal.ThreadLocalMap threadLocals = null;
}

static class ThreadLocalMap {

    /**
     * The entries in this hash map extend WeakReference, using
     * its main ref field as the key (which is always a
     * ThreadLocal object).  Note that null keys (i.e. entry.get()
     * == null) mean that the key is no longer referenced, so the
     * entry can be expunged from table.  Such entries are referred to
     * as "stale entries" in the code that follows.
     */
    static class Entry extends WeakReference<ThreadLocal<?>> {
        /** The value associated with this ThreadLocal. */
        Object value;

        Entry(ThreadLocal<?> k, Object v) {
            super(k);
            value = v;
        }
    }
}
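The advice to call remove() matters most with thread pools, where a worker thread outlives the task that set the value. The following sketch (ThreadLocalCleanupSketch, CONTEXT, and the "request-1" string are all made up for illustration) runs two tasks on a single-thread executor and reports what the second task finds in the ThreadLocal.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration class.
class ThreadLocalCleanupSketch {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Runs two tasks on the same pooled thread and returns what the second one sees.
    static String valueSeenByNextTask(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable first = () -> {
                CONTEXT.set("request-1");
                try {
                    // ... work that reads CONTEXT would go here ...
                } finally {
                    if (cleanUp) {
                        CONTEXT.remove();
                    }
                }
            };
            pool.submit(first).get();
            // The second task runs on the same worker thread as the first.
            Callable<String> second = CONTEXT::get;
            return pool.submit(second).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Without the cleanup, the second task inherits the first task's value, which is both a memory leak and a correctness hazard; with remove() in a finally block it sees null.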

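An object that is only weakly reachable is reclaimed at the next GC. The difference from a soft reference is this: an object that is only softly reachable is cleared only when memory is running low (before an OutOfMemoryError would be thrown), whereas an object that is only weakly reachable is cleared at the next GC regardless of memory pressure.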

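When we read a ThreadLocal value, the lookup goes to the current Thread's ThreadLocalMap; if nothing is found, the initial value is stored in the map and returned. Writing a value goes through set, shown below.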

/**
 * Sets the current thread's copy of this thread-local variable
 * to the specified value.  Most subclasses will have no need to
 * override this method, relying solely on the {@link #initialValue}
 * method to set the values of thread-locals.
 *
 * @param value the value to be stored in the current thread's copy of
 *        this thread-local.
 */
public void set(T value) {
    Thread t = Thread.currentThread();
    ThreadLocalMap map = getMap(t);
    if (map != null)
        map.set(this, value);
    else
        createMap(t, value);
}

/**
 * Set the value associated with key.
 *
 * @param key the thread local object
 * @param value the value to be set
 */
private void set(ThreadLocal<?> key, Object value) {

    // We don't use a fast path as with get() because it is at
    // least as common to use set() to create new entries as
    // it is to replace existing ones, in which case, a fast
    // path would fail more often than not.

    Entry[] tab = table;
    int len = tab.length;
    // Compute the slot that this hash value maps to
    int i = key.threadLocalHashCode & (len-1);
    for (Entry e = tab[i];
         e != null;
         e = tab[i = nextIndex(i, len)]) {
        ThreadLocal<?> k = e.get();
        // The slot already holds this ThreadLocal: overwrite the value and return
        if (k == key) {
            e.value = value;
            return;
        }
        // The key has been garbage-collected (a stale entry): reuse this slot
        if (k == null) {
            replaceStaleEntry(key, value, i);
            return;
        }
    }

    tab[i] = new Entry(key, value);
    int sz = ++size;
    // Clean out some slots whose key is null; if nothing was removed and the size has reached the threshold (2/3 of the table length), rehash
    if (!cleanSomeSlots(i, sz) && sz >= threshold)
        rehash();
}

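ThreadLocalMap stores its data in an array, much like HashMap. Each ThreadLocal is assigned a threadLocalHashCode when it is created, and its slot is that hash code masked with the array length minus one (equivalent to a modulo, because the length is a power of two). Hash collisions can still occur. HashMap resolves them with linked lists plus red-black trees, whereas ThreadLocalMap simply probes forward with nextIndex: if it reaches the slot already used by the same ThreadLocal it updates that entry and returns, otherwise it takes the first slot that is stale or empty. When a thread holds many ThreadLocal variables and collisions pile up, this linear probing becomes inefficient. Reading a value works similarly: the hash locates a slot first, and if that slot does not hold the ThreadLocal we are looking for, the following slots are probed.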

private Entry getEntry(ThreadLocal<?> key) {
    int i = key.threadLocalHashCode & (table.length - 1);
    Entry e = table[i];
    // The hashed slot holds the target entry: return it directly
    if (e != null && e.get() == key)
        return e;
    else
        // Otherwise probe the following slots
        return getEntryAfterMiss(key, i, e);
}

private Entry getEntryAfterMiss(ThreadLocal<?> key, int i, Entry e) {
    Entry[] tab = table;
    int len = tab.length;

    while (e != null) {
        ThreadLocal<?> k = e.get();
        if (k == key)
            return e;
        if (k == null)
            // Stale entry (its key was collected): expunge it and rehash the entries that follow, to keep lookups efficient
            expungeStaleEntry(i);
        else
            i = nextIndex(i, len);
        e = tab[i];
    }
    return null;
}

private int expungeStaleEntry(int staleSlot) {
    Entry[] tab = table;
    int len = tab.length;

    // expunge entry at staleSlot
    tab[staleSlot].value = null;
    tab[staleSlot] = null;
    size--;

    // Rehash until we encounter null
    Entry e;
    int i;
    for (i = nextIndex(staleSlot, len);
         (e = tab[i]) != null;
         i = nextIndex(i, len)) {
        ThreadLocal<?> k = e.get();
        // Another entry whose key is null: clear it as well to help GC
        if (k == null) {
            e.value = null;
            tab[i] = null;
            size--;
        } else {
            // If an entry is not stored in the slot its hash maps to, move it: start from the hashed slot, find the first null slot, and store the entry there
            int h = k.threadLocalHashCode & (len - 1);
            if (h != i) {
                tab[i] = null;
                // Unlike Knuth 6.4 Algorithm R, we must scan until
                // null because multiple entries could have been stale.
                while (tab[h] != null)
                    h = nextIndex(h, len);
                tab[h] = e;
            }
        }
    }
    return i;
}
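The probing scheme walked through above can be reduced to a few lines. The sketch below is my own simplification (LinearProbeSketch is a hypothetical class): it has no weak references, no stale-entry cleanup, and no resizing, and it assumes the table never fills up. The helper distinctSlots steps a hash code by 0x61c88647, the increment the JDK's ThreadLocal uses for threadLocalHashCode, to show how consecutive hash codes spread over a power-of-two table.

```java
// Hypothetical illustration class: open addressing with linear probing.
class LinearProbeSketch {
    private final Object[] keys;
    private final Object[] values;

    // The capacity must be a power of two so that masking works as a modulo.
    LinearProbeSketch(int powerOfTwoCapacity) {
        keys = new Object[powerOfTwoCapacity];
        values = new Object[powerOfTwoCapacity];
    }

    private int nextIndex(int i) {
        return (i + 1) & (keys.length - 1); // wraps around at the end
    }

    // Stores the entry and returns the slot that was finally used.
    int put(Object key, int hash, Object value) {
        int i = hash & (keys.length - 1);
        while (keys[i] != null && keys[i] != key) {
            i = nextIndex(i); // collision: try the next slot
        }
        keys[i] = key;
        values[i] = value;
        return i;
    }

    // Probes from the hashed slot until the key or an empty slot is found.
    Object get(Object key, int hash) {
        int i = hash & (keys.length - 1);
        while (keys[i] != null) {
            if (keys[i] == key) {
                return values[i];
            }
            i = nextIndex(i);
        }
        return null;
    }

    // Counts how many different slots the first `capacity` generated hash codes hit.
    static int distinctSlots(int powerOfTwoCapacity) {
        boolean[] used = new boolean[powerOfTwoCapacity];
        int distinct = 0;
        int hash = 0;
        for (int n = 0; n < powerOfTwoCapacity; n++) {
            int slot = hash & (powerOfTwoCapacity - 1);
            if (!used[slot]) {
                used[slot] = true;
                distinct++;
            }
            hash += 0x61c88647;
        }
        return distinct;
    }
}
```

Because the increment is odd, consecutive hash codes land in different slots of a power-of-two table until the table has been covered once, so collisions between ThreadLocal instances are rarer than the worst case suggests.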

FastThreadLocal

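Precisely because the JDK's ThreadLocal has this performance problem, Netty provides its own variant together with supporting classes: FastThreadLocal, FastThreadLocalThread, and FastThreadLocalRunnable. FastThreadLocalRunnable is the simplest of the three: it is a decorator around the original Runnable that clears all FastThreadLocal variables automatically once the task has finished.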

final class FastThreadLocalRunnable implements Runnable {
    private final Runnable runnable;

    private FastThreadLocalRunnable(Runnable runnable) {
        this.runnable = ObjectUtil.checkNotNull(runnable, "runnable");
    }

    @Override
    public void run() {
        try {
            runnable.run();
        } finally {
            FastThreadLocal.removeAll();
        }
    }

    static Runnable wrap(Runnable runnable) {
        return runnable instanceof FastThreadLocalRunnable ? runnable : new FastThreadLocalRunnable(runnable);
    }
}
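The decorator idea is independent of Netty and easy to try with plain JDK types. In this sketch, CleanupRunnableSketch is a hypothetical class and the cleanup action is passed in as a Runnable, whereas Netty's class calls FastThreadLocal.removeAll() directly. It shows that the cleanup runs even when the wrapped task throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class: run a task, then always run a cleanup action.
class CleanupRunnableSketch implements Runnable {
    private final Runnable task;
    private final Runnable cleanup;

    CleanupRunnableSketch(Runnable task, Runnable cleanup) {
        this.task = task;
        this.cleanup = cleanup;
    }

    @Override
    public void run() {
        try {
            task.run();
        } finally {
            // Runs on normal completion and on exceptions alike.
            cleanup.run();
        }
    }

    // Wraps a task that throws and returns how often the cleanup ran.
    static int cleanupsAfterFailingTask() {
        AtomicInteger cleanups = new AtomicInteger();
        Runnable wrapped = new CleanupRunnableSketch(
                () -> { throw new IllegalStateException("task failed"); },
                cleanups::incrementAndGet);
        try {
            wrapped.run();
        } catch (IllegalStateException expected) {
            // The task's exception still propagates to the caller.
        }
        return cleanups.get();
    }
}
```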

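FastThreadLocal must be used together with FastThreadLocalThread to deliver better performance; otherwise it may even be slower than the JDK's own ThreadLocal. Why can FastThreadLocal be faster than the JDK implementation? Because it does not keep thread-local values in a hash table but in a plain array. Let us look at how that is done.

All FastThreadLocal data lives in an InternalThreadLocalMap. For now it is enough to know that this is a container for the data; how it stores the data is covered later. So where is the InternalThreadLocalMap itself kept? Is it stored in the Thread object, the way the JDK's ThreadLocal stores its container (a hash table)? Exactly: if we use a FastThreadLocalThread, the InternalThreadLocalMap is a member field of that FastThreadLocalThread.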

public class FastThreadLocalThread extends Thread {
    // This will be set to true if we have a chance to wrap the Runnable.
    private final boolean cleanupFastThreadLocals;
    // Holds all FastThreadLocal values of this thread
    private InternalThreadLocalMap threadLocalMap;
    // ...
}

Shown below is UnpaddedInternalThreadLocalMap, the parent class of InternalThreadLocalMap; most of the important data is kept there.

class UnpaddedInternalThreadLocalMap {
    // Fallback storage, backed by a JDK ThreadLocal, for threads that are not a FastThreadLocalThread
    static final ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = new ThreadLocal<InternalThreadLocalMap>();
    static final AtomicInteger nextIndex = new AtomicInteger();

    /** Used by {@link FastThreadLocal} */
    Object[] indexedVariables;
    // ...
}

UnpaddedInternalThreadLocalMap uses a JDK ThreadLocal to hold an InternalThreadLocalMap for the case where the thread is not a FastThreadLocalThread. As a result, whenever FastThreadLocal needs the InternalThreadLocalMap, it checks whether the current thread is a FastThreadLocalThread to decide where to fetch the map from.

// InternalThreadLocalMap.java
public static InternalThreadLocalMap get() {
    Thread thread = Thread.currentThread();
    if (thread instanceof FastThreadLocalThread) {
        // The current thread is a FastThreadLocalThread
        return fastGet((FastThreadLocalThread) thread);
    } else {
        // The current thread is a plain Thread
        return slowGet();
    }
}
// With a FastThreadLocalThread, read the map straight from the member field
private static InternalThreadLocalMap fastGet(FastThreadLocalThread thread) {
    InternalThreadLocalMap threadLocalMap = thread.threadLocalMap();
    if (threadLocalMap == null) {
        thread.setThreadLocalMap(threadLocalMap = new InternalThreadLocalMap());
    }
    return threadLocalMap;
}
// With a plain Thread, fetch the map through the JDK ThreadLocal held by UnpaddedInternalThreadLocalMap
private static InternalThreadLocalMap slowGet() {
    ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = UnpaddedInternalThreadLocalMap.slowThreadLocalMap;
    InternalThreadLocalMap ret = slowThreadLocalMap.get();
    if (ret == null) {
        ret = new InternalThreadLocalMap();
        slowThreadLocalMap.set(ret);
    }
    return ret;
}

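Now that we know where the InternalThreadLocalMap container lives, let us look at how it stores data. Back in UnpaddedInternalThreadLocalMap there is an indexedVariables array, which is where all values are kept. Using an array requires a well-defined mapping from each variable to an array index, and the nextIndex atomic counter is the key to maintaining that mapping.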

class UnpaddedInternalThreadLocalMap {
    static final AtomicInteger nextIndex = new AtomicInteger();

    /** Used by {@link FastThreadLocal} */
    Object[] indexedVariables;
    // ...
    public static int nextVariableIndex() {
        int index = nextIndex.getAndIncrement();
        if (index < 0) {
            nextIndex.decrementAndGet();
            throw new IllegalStateException("too many thread-local indexed variables");
        }
        return index;
    }
    // ...
}

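Back in FastThreadLocal, there is a static field variablesToRemoveIndex whose value is obtained from the static counter nextIndex in UnpaddedInternalThreadLocalMap. Because variablesToRemoveIndex is static and, in the code shown here, is the only place that draws from nextIndex during static initialization, its value is always 0. Every FastThreadLocal object then requests a new index slot in its constructor.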

public class FastThreadLocal<V> {

    private static final int variablesToRemoveIndex = InternalThreadLocalMap.nextVariableIndex();
    private final int index;
    public FastThreadLocal() {
        index = InternalThreadLocalMap.nextVariableIndex();
    }
    // ...
}

When we use a FastThreadLocal, for example by calling set, it first obtains the InternalThreadLocalMap and then writes directly into the indexedVariables array at the index assigned to this FastThreadLocal.

// FastThreadLocal.java
/**
 * Set the value for the current thread.
 */
public final void set(V value) {
    if (value != InternalThreadLocalMap.UNSET) {
        // Obtain the map for the current thread
        InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.get();
        setKnownNotUnset(threadLocalMap, value);
    } else {
        remove();
    }
}

/**
 * @return see {@link InternalThreadLocalMap#setIndexedVariable(int, Object)}.
 */
private void setKnownNotUnset(InternalThreadLocalMap threadLocalMap, V value) {
    if (threadLocalMap.setIndexedVariable(index, value)) {
        // A new variable was created: record this FastThreadLocal so it can be cleaned up later
        addToVariablesToRemove(threadLocalMap, this);
    }
}

// InternalThreadLocalMap.java
/**
 * @return {@code true} if and only if a new thread-local variable has been created
 */
public boolean setIndexedVariable(int index, Object value) {
    Object[] lookup = indexedVariables;
    // If the array is long enough, write directly at index
    if (index < lookup.length) {
        Object oldValue = lookup[index];
        lookup[index] = value;
        return oldValue == UNSET;
    } else {
        // Otherwise grow the array; the new size is the smallest power of two greater than index
        expandIndexedVariableTableAndSet(index, value);
        return true;
    }
}
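The comment about growing to "the smallest power of two greater than index" corresponds to a classic bit trick. The sketch below shows one way to compute it by smearing the highest set bit to the right; CapacitySketch and capacityFor are hypothetical names, overflow for very large indexes is ignored, and the exact code in Netty may differ between versions.

```java
// Hypothetical illustration class.
class CapacitySketch {
    // Returns the smallest power of two strictly greater than index (for 0 <= index < 2^30).
    static int capacityFor(int index) {
        int n = index;
        // Copy the highest set bit into every lower position: 0b100000 -> 0b111111
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        // Adding one carries into the next bit, which yields a power of two.
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(31) + " " + capacityFor(32) + " " + capacityFor(33));
    }
}
```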

After a successful set, the FastThreadLocal is also recorded in a set so that later cleanup can find and remove it quickly.

// FastThreadLocal.java
@SuppressWarnings("unchecked")
private static void addToVariablesToRemove(InternalThreadLocalMap threadLocalMap, FastThreadLocal<?> variable) {
    Object v = threadLocalMap.indexedVariable(variablesToRemoveIndex);
    Set<FastThreadLocal<?>> variablesToRemove;
    // The set of FastThreadLocal objects is stored at array index 0, because variablesToRemoveIndex is always 0
    if (v == InternalThreadLocalMap.UNSET || v == null) {
        variablesToRemove = Collections.newSetFromMap(new IdentityHashMap<FastThreadLocal<?>, Boolean>());
        threadLocalMap.setIndexedVariable(variablesToRemoveIndex, variablesToRemove);
    } else {
        variablesToRemove = (Set<FastThreadLocal<?>>) v;
    }

    variablesToRemove.add(variable);
}

/**
 * Removes all {@link FastThreadLocal} variables bound to the current thread.  This operation is useful when you
 * are in a container environment, and you don't want to leave the thread local variables in the threads you do not
 * manage.
 */
public static void removeAll() {
    // If there is no InternalThreadLocalMap, FastThreadLocal was never used on this thread
    InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.getIfSet();
    if (threadLocalMap == null) {
        return;
    }

    try {
        Object v = threadLocalMap.indexedVariable(variablesToRemoveIndex);
        // If index 0 holds no set, no FastThreadLocal has been set on this thread
        if (v != null && v != InternalThreadLocalMap.UNSET) {
            @SuppressWarnings("unchecked")
            Set<FastThreadLocal<?>> variablesToRemove = (Set<FastThreadLocal<?>>) v;
            FastThreadLocal<?>[] variablesToRemoveArray =
                    variablesToRemove.toArray(new FastThreadLocal[0]);
            // Remove every FastThreadLocal in the set one by one; each removal invokes that FastThreadLocal's onRemoval callback
            for (FastThreadLocal<?> tlv: variablesToRemoveArray) {
                tlv.remove(threadLocalMap);
            }
        }
    } finally {
        // Finally drop the whole InternalThreadLocalMap
        InternalThreadLocalMap.remove();
    }
}

/**
 * Sets the value to uninitialized for the specified thread local map;
 * a proceeding call to get() will trigger a call to initialValue().
 * The specified thread local map must be for the current thread.
 */
@SuppressWarnings("unchecked")
public final void remove(InternalThreadLocalMap threadLocalMap) {
    if (threadLocalMap == null) {
        return;
    }
    // Reset the slot to UNSET and return the previous value
    Object v = threadLocalMap.removeIndexedVariable(index);
    // Remove this FastThreadLocal from the to-remove set (the set stored at index 0)
    removeFromVariablesToRemove(threadLocalMap, this);
    // If a value had actually been set, invoke the onRemoval callback
    if (v != InternalThreadLocalMap.UNSET) {
        try {
            onRemoval((V) v);
        } catch (Exception e) {
            PlatformDependent.throwException(e);
        }
    }
}
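Pulling the pieces together, the index-based design can be modeled in a few dozen lines of plain Java. This is only a sketch under loud assumptions: IndexedLocalSketch is a hypothetical class, it keeps the per-thread array in a JDK ThreadLocal (so it demonstrates the data layout, not the speed-up, which in Netty comes from the field on FastThreadLocalThread), and it omits the to-remove set and onRemoval callbacks.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class, not Netty's API: a thread-local addressed by array index.
class IndexedLocalSketch<V> {
    private static final Object UNSET = new Object();
    // Shared by all instances, so every variable gets one fixed, unique index.
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();
    // Stand-in for the map field on FastThreadLocalThread (an assumption of this sketch).
    private static final ThreadLocal<Object[]> SLOTS = ThreadLocal.withInitial(() -> {
        Object[] slots = new Object[4];
        Arrays.fill(slots, UNSET);
        return slots;
    });

    private final int index = NEXT_INDEX.getAndIncrement();

    int index() {
        return index;
    }

    void set(V value) {
        Object[] slots = SLOTS.get();
        if (index >= slots.length) {
            int oldLength = slots.length;
            // Grow to the smallest power of two greater than index.
            slots = Arrays.copyOf(slots, Integer.highestOneBit(index) << 1);
            Arrays.fill(slots, oldLength, slots.length, UNSET);
            SLOTS.set(slots);
        }
        slots[index] = value; // direct array write, no hashing or probing
    }

    @SuppressWarnings("unchecked")
    V get() {
        Object[] slots = SLOTS.get();
        if (index >= slots.length || slots[index] == UNSET) {
            return null;
        }
        return (V) slots[index];
    }

    // Length of the calling thread's array, including slots it never uses.
    static int slotCountForCurrentThread() {
        return SLOTS.get().length;
    }
}
```

A thread that sets only the variable with the highest index still allocates an array covering all lower indexes, which is the kind of hole discussed next.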

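That covers the core of FastThreadLocal. One thing puzzles me: why is nextIndex, the index allocator of InternalThreadLocalMap, a static variable rather than an instance variable? With an instance variable no memory would be wasted. Perhaps the current design is slightly faster; a likely reason is that a static allocator gives every FastThreadLocal one fixed index that is valid in every thread's array, which is what allows the direct array access. The price is that the array inside each InternalThreadLocalMap may well contain holes whenever the current thread does not use every FastThreadLocal.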

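One last point: to squeeze out a little more performance, InternalThreadLocalMap adds several padding fields to its member variables. They are there to prevent false sharing.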

// InternalThreadLocalMap.java
// Cache line padding (must be public)
// With CompressedOops enabled, an instance of this class should occupy at least 128 bytes.
public long rp1, rp2, rp3, rp4, rp5, rp6, rp7, rp8, rp9;

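CPU cache lines are typically 64 or 128 bytes. To keep different InternalThreadLocalMap instances from being loaded into the same cache line, extra fields are added so that each instance is larger than a cache line.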

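The figure below shows the basic structure of a computer. L1, L2, and L3 denote the level-1, level-2, and level-3 caches; the closer a cache is to the CPU, the faster and smaller it is. L1 is small but very fast and sits right next to the core that uses it. L2 is larger and slower and is still used by a single core only. L3 is larger and slower again and is shared by all cores on one socket. Finally there is main memory, shared by all cores on all sockets.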

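When the CPU performs a computation, it looks for the data in L1 first, then L2, then L3, and if none of the caches has it, the data must be fetched from main memory. The farther it has to go, the longer the operation takes, so if you do something very frequently you want that data to stay in L1. In addition, in this simplified model, when threads share a piece of data, one thread has to write it back to main memory and the other thread has to read it from there.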

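A cache consists of many cache lines. Each cache line is usually 64 bytes and maps to a block of addresses in main memory. Whenever the CPU pulls data from main memory, the neighbouring data is brought into the same cache line as well. So when Thread1 uses FastThreadLocal1, it may load two objects that are adjacent in memory (FastThreadLocal1 and FastThreadLocal2) into its cache line in one go. If another thread, Thread2, then modifies only FastThreadLocal2, Thread1 must still reload the line from main memory before it can use FastThreadLocal1 again: both objects share one cache line, and once any part of a cache line is modified, other threads must fetch the latest version of the whole line before using it. This situation, in which the cache line cannot be used effectively, is called false sharing.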
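To show what manual padding looks like in user code, here is a minimal sketch in which two threads each update their own counter. PaddedCounterSketch and PaddedLong are hypothetical names, and the padding rests on an assumption: the JVM is free to reorder or lay out fields as it sees fit, so seven extra long fields do not guarantee separate cache lines (the JDK-internal @Contended annotation exists for that purpose). The sketch checks only that the counts are correct and makes no claim about speed.

```java
// Hypothetical illustration class.
class PaddedCounterSketch {
    // A counter followed by padding fields meant to push neighbours onto other cache lines.
    static class PaddedLong {
        volatile long value;
        // Padding only; never read. Actual field layout is up to the JVM (assumption).
        long p1, p2, p3, p4, p5, p6, p7;
    }

    // Two threads increment two separate counters; returns both final values.
    static long[] runTwoWriters(int iterations) {
        PaddedLong[] counters = { new PaddedLong(), new PaddedLong() };
        Thread[] writers = new Thread[2];
        for (int t = 0; t < 2; t++) {
            PaddedLong mine = counters[t];
            writers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    // Safe without CAS: each counter has exactly one writer.
                    mine.value++;
                }
            });
            writers[t].start();
        }
        for (Thread writer : writers) {
            try {
                writer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new long[] { counters[0].value, counters[1].value };
    }
}
```

To see the effect of false sharing, one would time this against a variant without the padding fields using a benchmark harness; a naive stopwatch loop is usually too noisy to trust.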