netty中的定时机制HashedWheelTimer

前言

好久没写文章了，最近没事儿看了下Redisson里面的分布式锁的写法，进而看到了它使用了netty中的HashedWheelTimer，大致扫了一下，觉得有点意思，花了点时间看了下代码，把自己的一些感想写出来，供大家参考一下。

一图胜千言

netty中的HashedWheelTimer基于这篇论文，首先我们确定，HashedWheelTimer提供的是一个定时任务的一个优化实现方案，在netty中主要用于异步IO的定时规划触发（A timer optimized for approximated I/O timeout scheduling）。为了方便大家理解，可以先来看看我画的这个图

HashedWheelTimer

这个图基本上就涵盖了HashedWheelTimer的所有的概念要素：

workerThread 单线程用于处理所有的定时任务，它会在每个tick执行一个bucket中所有的定时任务，以及一些其他的操作。意味着定时任务不能有较大的阻塞和耗时，不然就会影响定时任务执行的准时性和有效性。
wheel 一个时间轮，其实就是一个环形数组，数组中的每个元素代表的就是未来的某些时间片段上需要执行的定时任务的集合。
这里需要注意的就是不是某个时间而是某些时间。因为比方说我时间轮上的大小是10，时间间隔是1s，那么我1s和11s的要执行的定时任务都会在index为1的格子上。
tick 工作线程当前运行的tick数，每一个tick代表worker线程当前的一次工作时间
hash 在时间轮上的hash函数。默认是tick%bucket的数量，即将某个时间节点映射到了时间轮上的某个唯一的格子上。
bucket 时间轮上的一个格子，它维护的是一个Timeout的双向链表，保存的是这个哈希到这个格子上的所有Timeout任务。
timeout 代表一个定时任务，其中记录了自己的deadline，运行逻辑以及在bucket中需要呆满的圈数，比方说之前1s和11s的例子，他们对应的timeout中圈数就应该是0和1。这样当遍历一个bucket中所有的timeout的时候，只要圈数为0说明就应该被执行，而其他情况就把圈数-1就好。

除此之外，netty的HashedWheelTimer实现还有两个东西值得关注，分别是pending-timeouts队列和cancelled-timeouts队列。这两个队列分别记录新添加的定时任务和要取消的定时任务，当workerThread每次循环运行时，它会先将pending-timeouts队列中一定数量的任务移动到它们对应的bucket，并取消掉cancelled-timeouts中所有的任务。由于添加和取消任务可以由任意线程发起，而相应的处理只会在workerThread里，所以为了进一步提高性能，这两个队列都是用了JCTools里面的MPSC(multiple-producer-single-consumer)队列。

撸码

看完了图，咱们就来撸码了。在撸码之前，咱们再来看个图

HashedWheelTimer UML

这个图是为我根据netty HashedWheelTimer代码结构整理的一个UML结构图，大家看代码的时候可以以这个图作为一个参考借鉴。比较简单的父接口咱们就不去了，只需要知道几点：

TimerTask是一个定时任务的实现接口，其中run方法包装了定时任务的逻辑
Timeout是一个定时任务提交到Timer之后返回的句柄，通过这个句柄外部可以取消这个定时任务，并对定时任务的状态进行一些基本的判断
Timer是HashedWheelTimer实现的父接口，仅定义了如何提交定时任务和如何停止整个定时机制

接下来咱们就去HashedWheelTimer去走一遭

HashedWheelTimer

我们首先来看看构造函数（下文中所有注释均为本人书写）

    public HashedWheelTimer(
            ThreadFactory threadFactory,
            long tickDuration, TimeUnit unit, int ticksPerWheel, boolean leakDetection,
            long maxPendingTimeouts) {
        // 线程工厂，用于创建我们的worker线程
        if (threadFactory == null) {
            throw new NullPointerException("threadFactory");
        }
        // 一个tick的时间单位
        if (unit == null) {
            throw new NullPointerException("unit");
        }
        // 一个tick的时间间隔
        if (tickDuration <= 0) {
            throw new IllegalArgumentException("tickDuration must be greater than 0: " + tickDuration);
        }
        // 时间轮上一轮有多少个tick/bucket
        if (ticksPerWheel <= 0) {
            throw new IllegalArgumentException("ticksPerWheel must be greater than 0: " + ticksPerWheel);
        }

        // 将时间轮的大小规范化到2的n次方，这样可以用位运算来处理mod操作，提高效率
        wheel = createWheel(ticksPerWheel);
        // 计算位运算需要的掩码
        mask = wheel.length - 1;

        // 转换时间间隔到纳秒
        long duration = unit.toNanos(tickDuration);

        // 防止溢出
        if (duration >= Long.MAX_VALUE / wheel.length) {
            throw new IllegalArgumentException(String.format(
                    "tickDuration: %d (expected: 0 < tickDuration in nanos < %d",
                    tickDuration, Long.MAX_VALUE / wheel.length));
        }
        // 时间间隔至少要1ms
        if (duration < MILLISECOND_NANOS) {
            if (logger.isWarnEnabled()) {
                logger.warn("Configured tickDuration %d smaller then %d, using 1ms.",
                            tickDuration, MILLISECOND_NANOS);
            }
            this.tickDuration = MILLISECOND_NANOS;
        } else {
            this.tickDuration = duration;
        }
        // 创建worker线程
        workerThread = threadFactory.newThread(worker);
        
        // 处理泄露监控（本篇文章不涉及）
        leak = leakDetection || !workerThread.isDaemon() ? leakDetector.track(this) : null;

        // 设置最大等待任务数
        this.maxPendingTimeouts = maxPendingTimeouts;

        // 限制timer的实例数，避免过多的timer线程反而影响性能
        if (INSTANCE_COUNTER.incrementAndGet() > INSTANCE_COUNT_LIMIT &&
            WARNED_TOO_MANY_INSTANCES.compareAndSet(false, true)) {
            reportTooManyInstances();
        }
    }

构造函数的调理比较清晰，这其中我们需要关注的可能就只有createWheel方法：


    private static HashedWheelBucket[] createWheel(int ticksPerWheel) {
        // 处理时间轮太小或者太大造成的异常
        if (ticksPerWheel <= 0) {
            throw new IllegalArgumentException(
                    "ticksPerWheel must be greater than 0: " + ticksPerWheel);
        }
        if (ticksPerWheel > 1073741824) {
            throw new IllegalArgumentException(
                    "ticksPerWheel may not be greater than 2^30: " + ticksPerWheel);
        }
        // 规范化到2的n次方
        ticksPerWheel = normalizeTicksPerWheel(ticksPerWheel);
        // 创建每个bucket
        HashedWheelBucket[] wheel = new HashedWheelBucket[ticksPerWheel];
        for (int i = 0; i < wheel.length; i ++) {
            wheel[i] = new HashedWheelBucket();
        }
        return wheel;
    }

    private static int normalizeTicksPerWheel(int ticksPerWheel) {
        int normalizedTicksPerWheel = 1;
        // 不断地左移位直到找到大于等于时间轮大小的2的n次方出现
        while (normalizedTicksPerWheel < ticksPerWheel) {
            normalizedTicksPerWheel <<= 1;
        }
        return normalizedTicksPerWheel;
    }

然后我们再来看看start方法，它是如何启动定时器的：

    public void start() {
        // 针对worker的状态进行switch
        switch (WORKER_STATE_UPDATER.get(this)) {
            // 如果是初始化
            case WORKER_STATE_INIT:
                // 如果能cas更新到开始状态
                if (WORKER_STATE_UPDATER.compareAndSet(this, WORKER_STATE_INIT, WORKER_STATE_STARTED)) {
                    // 那就启动worker线程
                    workerThread.start();
                }
                break;
            // 如果已经处于启动，自然什么都不用做
            case WORKER_STATE_STARTED:
                break;
            // 如果已经shutdown， 那也就进入了非法状态
            case WORKER_STATE_SHUTDOWN:
                throw new IllegalStateException("cannot be started once stopped");
            default:
                throw new Error("Invalid WorkerState");
        }

        // 这里需要同步等待worker线程启动并完成startTime初始化的工作
        while (startTime == 0) {
            try {
                startTimeInitialized.await();
            } catch (InterruptedException ignore) {
                // Ignore - it will be ready very soon.
            }
        }
    }

从这里我们看到，其实真正的逻辑全部到了worker里面，这里只需要做一些基础状态判断并等待work启动完毕即可。
接下来就来到最后，HashedWheelTimer实现的Timer接口的两个函数了：

    @Override
    public Set stop() {
        // 判断当前线程是否是worker线程，stop方法不能由TimerTask触发，
        // 否则后面的同步等待join操作就无法完成
        if (Thread.currentThread() == workerThread) {
            throw new IllegalStateException(
                    HashedWheelTimer.class.getSimpleName() +
                            ".stop() cannot be called from " +
                            TimerTask.class.getSimpleName());
        }
        // 同样的cas更新状态
        if (!WORKER_STATE_UPDATER.compareAndSet(this, WORKER_STATE_STARTED, WORKER_STATE_SHUTDOWN)) {
            // workerState can be 0 or 2 at this moment - let it always be 2.
            if (WORKER_STATE_UPDATER.getAndSet(this, WORKER_STATE_SHUTDOWN) != WORKER_STATE_SHUTDOWN) {
                INSTANCE_COUNTER.decrementAndGet();
                if (leak != null) {
                    boolean closed = leak.close(this);
                    assert closed;
                }
            }

            return Collections.emptySet();
        }
        // 如果来到了这里，说明之前cas更新一切顺利
        try {
            boolean interrupted = false;
            // while循环持续中断worker线程直到它醒悟它该结束了(有可能被一些耗时的操作耽误了)
            // 通过isAlive判断worker线程是否已经结束
            while (workerThread.isAlive()) {
                workerThread.interrupt();
                try {
                    workerThread.join(100);
                } catch (InterruptedException ignored) {
                    interrupted = true;
                }
            }
            // 如果当前线程被interrupt，就设置标志位，常规操作
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        } finally {
            // 减少实例数
            INSTANCE_COUNTER.decrementAndGet();
            if (leak != null) {
                boolean closed = leak.close(this);
                assert closed;
            }
        }
        // 返回还没执行的定时任务
        return worker.unprocessedTimeouts();
    }


    @Override
    public Timeout newTimeout(TimerTask task, long delay, TimeUnit unit) {
        // 基本的异常判断
        if (task == null) {
            throw new NullPointerException("task");
        }
        if (unit == null) {
            throw new NullPointerException("unit");
        }
        // 增加等待执行的定时任务数
        long pendingTimeoutsCount = pendingTimeouts.incrementAndGet();
        
        // 如果超过最大就gg
        if (maxPendingTimeouts > 0 && pendingTimeoutsCount > maxPendingTimeouts) {
            pendingTimeouts.decrementAndGet();
            throw new RejectedExecutionException("Number of pending timeouts ("
                + pendingTimeoutsCount + ") is greater than or equal to maximum allowed pending "
                + "timeouts (" + maxPendingTimeouts + ")");
        }
        // 开启定时器，还记得么？
        // 它的状态判断能够很好的处理多次启动
        // 并且还能帮我们做一下状态判断
        start();

        // 计算这个任务的deadline，startTime是worker线程开始的时间戳
        long deadline = System.nanoTime() + unit.toNanos(delay) - startTime;

        // 溢出保护
        if (delay > 0 && deadline < 0) {
            deadline = Long.MAX_VALUE;
        }
        // 直接创建一个HashedWheelTimeout句柄
        HashedWheelTimeout timeout = new HashedWheelTimeout(this, task, deadline);
        // 添加到等待执行的任务队列中（大家还记得上一节我们说这是个MPSC队列么）
        timeouts.add(timeout);
        // 返回这个句柄
        return timeout;
    }

这两个方法很好的实现了本来应该有的语义，这里注释已经把每个语句的含义解释清楚了，这里就不在多解释。通过这里我们大概了解，其实大部分逻辑都在worker线程内部，HashedWheelTimer只是做了一个比较好的开启、结束和插入任务的机制。接下来我们再到worker里面去看看。

worker

既然worker封装的是工作线程的实现逻辑，那么我们肯定就的先来看看run方法了：

        @Override
        public void run() {
            // 初始化startTime
            startTime = System.nanoTime();
            if (startTime == 0) {
                // We use 0 as an indicator for the uninitialized value here, so make sure it's not 0 when initialized.
                startTime = 1;
            }

            // 通知还等待在start方法上的线程，我worker初始化好了
            startTimeInitialized.countDown();

            do {
                // 等待下一个tick
                final long deadline = waitForNextTick();
                if (deadline > 0) {
                    // 获取下一个bucket的index，即当前tick mod bucket数量
                    int idx = (int) (tick & mask);
                    // 处理掉已取消的任务
                    processCancelledTasks();
                    // 获取当前要处理的bucket
                    HashedWheelBucket bucket =
                            wheel[idx];
                    // 将待处理的任务移动到它该去的bucket去
                    transferTimeoutsToBuckets();
                    // 处理掉当前bucket的所有到期定时任务
                    bucket.expireTimeouts(deadline);
                    // 递增tick
                    tick++;
                }
                // 除非当前的状态不是started，否则循环进行下一个tick
            } while (WORKER_STATE_UPDATER.get(HashedWheelTimer.this) == WORKER_STATE_STARTED);

            // 到了这里，说明worker的状态已经不是started了，到了shutdown
            // 那就得善后了，遍历所有的bucket，将还没来得及处理的任务全部清理到unprocessedTimeouts中
            for (HashedWheelBucket bucket: wheel) {
                bucket.clearTimeouts(unprocessedTimeouts);
            }
            // 遍历所有待处理并且还没取消的任务，添加到unprocessedTimeouts中
            for (;;) {
                HashedWheelTimeout timeout = timeouts.poll();
                if (timeout == null) {
                    break;
                }
                if (!timeout.isCancelled()) {
                    unprocessedTimeouts.add(timeout);
                }
            }
            // 处理所有的已取消的task，防止内存泄漏
            // 其实我任务应该在结束的时候先处理已经取消的任务，这样似乎好理解一些
            // 不过我理解如果我这样做有可能出现的问题是bucket里面会有任务残留
            // 特别是这个间隙其他线程还在不断cancel任务，这里就不过多展开了
            processCancelledTasks();
        }

从run方法中我们可以看出，这个方法定义了在每一个tick循环中需要执行的任务以及他们的先后顺序，还有在退出的时候worker线程需要做的善后工作。
那我们先来看waitForNextTick:


        private long waitForNextTick() {
            // 计算下一个tick的deadline
            long deadline = tickDuration * (tick + 1);
            // 循环直到当前时间来到了下一个tick
            for (;;) {
                // 计算当前时间
                final long currentTime = System.nanoTime() - startTime;
                // 计算需要sleep的毫秒数
                long sleepTimeMs = (deadline - currentTime + 999999) / 1000000;
                // 如果sleep的毫秒数小于等于0
                if (sleepTimeMs <= 0) {
                    // 特殊判断 这里说实话我没咋看懂
                    if (currentTime == Long.MIN_VALUE) {
                        return -Long.MAX_VALUE;
                    } else {
                        // 无需等待，直接返回
                        return currentTime;
                    }
                }

                // Check if we run on windows, as if thats the case we will need
                // to round the sleepTime as workaround for a bug that only affect
                // the JVM if it runs on windows.
                //
                // See https://github.com/netty/netty/issues/356
                // 处理一些windows才有的jvm bug
                if (PlatformDependent.isWindows()) {
                    sleepTimeMs = sleepTimeMs / 10 * 10;
                }
                // 尝试sleep到下个tick的deadline
                try {
                    Thread.sleep(sleepTimeMs);
                } catch (InterruptedException ignored) {
                    // 如果发现已经被shutdown了，则返回MIN_VALUE，可以让run快速处理
                    if (WORKER_STATE_UPDATER.get(HashedWheelTimer.this) == WORKER_STATE_SHUTDOWN) {
                        return Long.MIN_VALUE;
                    }
                }
            }
        }

这个方法的核心目标其实就是让worker线程能够等够足够多的时间到下一个tick，并做了一些特殊处理。接下来我们看看worker线程如何分发任务到bucket的：


        private void transferTimeoutsToBuckets() {
            // 最多一次转移100000个待分发定时任务到它们对应的bucket内
            // 不然如果有一个线程一直添加定时任务就能让工作线程活生生饿死
            for (int i = 0; i < 100000; i++) {
                // 获取一个定时任务
                HashedWheelTimeout timeout = timeouts.poll();
                if (timeout == null) {
                    // all processed
                    break;
                }
                // 如果已经取消就不管了
                if (timeout.state() == HashedWheelTimeout.ST_CANCELLED) {
                    // Was cancelled in the meantime.
                    continue;
                }
                // 计算从worker线程开始运行算起要经过多少个tick才能到这个任务
                long calculated = timeout.deadline / tickDuration;
                // 计算这个任务要经过多少圈
                timeout.remainingRounds = (calculated - tick) / wheel.length;
                // 如果这个任务我们取晚了，那就让他在这个tick执行
                final long ticks = Math.max(calculated, tick); // Ensure we don't schedule for past.
                // 计算他应该到的bucket的index
                int stopIndex = (int) (ticks & mask);
                // 指派过去即可
                HashedWheelBucket bucket = wheel[stopIndex];
                bucket.addTimeout(timeout);
            }
        }

这里代码也比较简单，我就不过多解释了，需要注意的就是一次最多只会分发100000个任务，防止worker没法做其他的事情。接着我们再来看看worker如何处理已取消的任务的：


        private void processCancelledTasks() {
            // 毕竟取消的任务占极少数，所以这里就没有个数限制了
            for (;;) {
                // 取出一个取消的任务
                HashedWheelTimeout timeout = cancelledTimeouts.poll();
                if (timeout == null) {
                    // all processed
                    break;
                }
                try {
                    // 这里实际上是将它从它所属的bucket的双向链表中删除，这里后面会看到
                    timeout.remove();
                } catch (Throwable t) {
                    if (logger.isWarnEnabled()) {
                        logger.warn("An exception was thrown while process a cancellation task", t);
                    }
                }
            }
        }

同样也比较简单，worker就分析完了。接着我们再去HashedWheelBucket，看看时间轮上每一个单元做了什么，因为worker线程run方法中最终处理定时任务实际上是调用了bucket.expireTimeouts(deadline);

HashedWheelBucket

public void expireTimeouts(long deadline) {
            // 获取双向链表的头
            HashedWheelTimeout timeout = head;

            // 从头到尾处理所有的任务
            while (timeout != null) {
                HashedWheelTimeout next = timeout.next;
                // 如果剩余轮数小于0 说明需要马上执行
                if (timeout.remainingRounds <= 0) {
                    // 将它从当前链表中移除
                    next = remove(timeout);
                    if (timeout.deadline <= deadline) {
                        // 执行timeout的逻辑
                        timeout.expire();
                    } else {
                        // The timeout was placed into a wrong slot. This should never happen.
                        throw new IllegalStateException(String.format(
                                "timeout.deadline (%d) > deadline (%d)", timeout.deadline, deadline));
                    }
                } else if (timeout.isCancelled()) {
                    // 如果已经被取消，同样remove掉
                    next = remove(timeout);
                } else {
                    // 否则将他们的轮数减一
                    timeout.remainingRounds --;
                }
                // 继续链表的遍历
                timeout = next;
            }
        }

这里实际上就是遍历整个链表，将应该执行的执行掉，然后其余的维护他们的状态，保证后续的执行。
HashedWheelBucket中其他的方法基本上都是关于双向链表的操作，这里就不在赘述了，需要注意的是，由于所有的工作都在worker线程中进行，因此基本不需要任何线程安全机制，保证了高性能。
最后，我们还是来到了HashedWheelTimeout，看看一个定时任务内部需要有哪些处理。

HashedWheelTimeout

        public void expire() {
            if (!compareAndSetState(ST_INIT, ST_EXPIRED)) {
                return;
            }

            try {
                task.run(this);
            } catch (Throwable t) {
                if (logger.isWarnEnabled()) {
                    logger.warn("An exception was thrown by " + TimerTask.class.getSimpleName() + '.', t);
                }
            }
        }

大家应该还记得之前bucket.expireTimeouts(deadline);里面就是调用了timeout.expire(); 来执行具体的业务逻辑，而这里我们也看到，除了一些基本的状态判断，就是直接执行了task中的run方法。所以这里需要大家注意的就是task的run方法一定慎重放置任何有耗时可能的操作，不然就会导致HashedWheelTimer中worker线程被长时间占用，其他任务得不到执行或者无法准时执行，最终导致性能和正确性下降。
其他的方法也太过简单，这里就不提及，有兴趣的同学可以直接查看源码。

结语

这篇文章针对netty中的定时任务处理机制HashedWheelTimer的实现以及背后原理进行了一定程度的阐述和说明，希望能对大家理解有所帮助。

netty中的定时机制HashedWheelTimer

前言

一图胜千言

撸码

HashedWheelTimer

worker

HashedWheelBucket

HashedWheelTimeout

结语

你可能感兴趣的:(netty中的定时机制HashedWheelTimer)