3-netty源码分析之Reactor

首先这里用一个图来简单描述下netty的线程模型

image.png

其实这里的NioEventLoop就是主要讲的是reactor线程模型,如上图所示，该线程在一个无线死循环里主要做了三件事：

1.select轮训出就绪的IO事件
2.处理轮训出的IO事件
3.处理task

那么下面详细记录这三个动作是怎么处理的，当然这里bossGroup与workGroup动作一样，但做的事情前面介绍过。

看看核心代码【NioEventLoop#run】

/**
 * Netty 的事件循环机制
 * 当 taskQueue 中没有任务时, 那么 Netty 可以阻塞地等待 IO 就绪事件;
 * 而当 taskQueue 中有任务时, 我们自然地希望所提交的任务可以尽快地执行, 因此 Netty 会调用非阻塞的 selectNow() 方法, 以保证 taskQueue 中的任务尽快可以执行.
 *
 * 1.轮询IO事件
 * 2.处理轮询到的事件
 * 3.执行任务队列中的任务
 * */
@Override
protected void run() {
    for (;;) {
        try {
            /** 如果任务队列没有任务,则进行一次selectNow() */
            switch (selectStrategy.calculateStrategy(selectNowSupplier, hasTasks())) {
                case SelectStrategy.CONTINUE:
                    continue;
                case SelectStrategy.SELECT:

                    /**
                     * 轮询出注册在selector上面的IO事件
                     *
                     * wakenUp 表示是否应该唤醒正在阻塞的select操作，
                     * 可以看到netty在进行一次新的loop之前，都会将wakeUp 被设置成false，标志新的一轮loop的开始
                     */
                    select(wakenUp.getAndSet(false));

                    if (wakenUp.get()) {
                        selector.wakeup();
                    }
                    // fall through
                default:
            }

            cancelledKeys = 0;

            /**
             * 第一步是通过 select/selectNow 调用查询当前是否有就绪的 IO 事件. 那么当有 IO 事件就绪时, 第二步自然就是处理这些 IO 事件啦.
             */
            needsToSelectAgain = false;

            /**
             * 此线程分配给 IO 操作所占的时间比
             * 即运行 processSelectedKeys 耗时在整个循环中所占用的时间
             */
            final int ioRatio = this.ioRatio;

            /** 当 ioRatio 为 100 时, Netty 就不考虑 IO 耗时的占比, 而是分别调用 processSelectedKeys()、runAllTasks();   */
            if (ioRatio == 100) {
                try {

                    /** 查询就绪的 IO 事件后 进行处理 */
                    processSelectedKeys();
                } finally {
                    // Ensure we always run tasks.
                    /** 运行 taskQueue 中的任务. */
                    runAllTasks();
                }
            }

            /**
             * ioRatio 不为 100时, 则执行到 else 分支, 在这个分支中, 首先记录下 processSelectedKeys() 所执行的时间(即 IO 操作的耗时),
             * 然后根据公式, 计算出执行 task 所占用的时间, 然后以此为参数, 调用 runAllTasks().
             */
            else {
                final long ioStartTime = System.nanoTime();
                try {
                    processSelectedKeys();
                } finally {
                    // Ensure we always run tasks.
                    final long ioTime = System.nanoTime() - ioStartTime;
                    runAllTasks(ioTime * (100 - ioRatio) / ioRatio);
                }
            }
        } catch (Throwable t) {
            handleLoopException(t);
        }
        // Always handle shutdown even if the loop processing threw an exception.
        try {
            if (isShuttingDown()) {
                closeAll();
                if (confirmShutdown()) {
                    return;
                }
            }
        } catch (Throwable t) {
            handleLoopException(t);
        }
    }
}

这个方法怎么触发的可以直接看前面的文章，理解了NioEventLoop是一个线程，当提交任务时会进行start，就会触发这里的run循环了。
那么详细分解途中的三个主要功能。

1.select轮训就绪的IO事件

可以看到switch条件体:当前任务队列中没有任务时，进行一次selectNow操作，selectNow操作是无阻塞操作，会立刻返回，而select(time)是阻塞操作。

@Override
public int calculateStrategy(IntSupplier selectSupplier, boolean hasTasks) throws Exception {
    return hasTasks ? selectSupplier.get() : SelectStrategy.SELECT;
}
private final IntSupplier selectNowSupplier = new IntSupplier() {
    @Override
    public int get() throws Exception {
        return selectNow();
    }
};

可以看到代码中select不会获取返回值，那么这里netty怎么去知道已经有就绪的IO事件呢？既然不获取返回值，那么必然是select操作中操作了某些标识IO事件就绪的变量，这里暂时留个疑问，先看处理就绪的IO事件后再回来一探究竟填充了什么用以标识IO事件已就绪。这里暂时只要知道select操作的就是轮训就绪的IO事件就好。

既然select是轮训，那么我们详细看下轮训逻辑

/**
 * 如果发现当前的定时任务队列中有任务的截止事件快到了(<=0.5ms)，就跳出循环。
 * 此外，跳出之前如果发现目前为止还没有进行过select操作（if (selectCnt == 0)），那么就调用一次selectNow()，该方法会立即返回，不会阻塞
 */
private void select(boolean oldWakenUp) throws IOException {
    Selector selector = this.selector;
    try {
        int selectCnt = 0;
        long currentTimeNanos = System.nanoTime();

        /**
         * netty里面定时任务队列是按照延迟时间从小到大进行排序，
         * delayNanos(currentTimeNanos)方法即取出第一个定时任务的延迟时间
         */
        long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
        for (;;) {
            /**
             * 1.定时任务截至事时间快到了，中断本次轮询
             */
            long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
            if (timeoutMillis <= 0) {
                if (selectCnt == 0) {
                    selector.selectNow();
                    selectCnt = 1;
                }
                break;
            }

        
            /**
             * 2.轮询过程中发现有任务加入，中断本次轮询
             */
            if (hasTasks() && wakenUp.compareAndSet(false, true)) {
                selector.selectNow();
                selectCnt = 1;
                break;
            }

            /**
             *  3.阻塞式select操作
             *  为了保证任务队列能够及时执行，在进行阻塞select操作的时候会判断任务队列是否为空，如果不为空，就执行一次非阻塞select操作，跳出循环
             *
             * 到来这里, 我们可以看到, 当 hasTasks() 为真时, 调用的的 selectNow() 方法是不会阻塞当前线程的, 而当 hasTasks() 为假时, 调用的 select(oldWakenUp) 是会阻塞当前线程的.
             * 执行到这一步，说明netty任务队列里面队列为空，并且所有定时任务延迟时间还未到(大于0.5ms)，于是，在这里进行一次阻塞select操作，截止到第一个定时任务的截止时间
             */
            int selectedKeys = selector.select(timeoutMillis);
            selectCnt ++;

            /**
             * 阻塞select操作结束之后，netty又做了一系列的状态判断来决定是否中断本次轮询，中断本次轮询的条件有
             */
            if (selectedKeys != 0 || oldWakenUp || wakenUp.get() || hasTasks() || hasScheduledTasks()) {
                break;
            }
            if (Thread.interrupted()) {
            
                if (logger.isDebugEnabled()) {
                    logger.debug("Selector.select() returned prematurely because " +
                            "Thread.currentThread().interrupt() was called. Use " +
                            "NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
                }
                selectCnt = 1;
                break;
            }

            /**
             * 4.解决jdk的nio bug
             * 记录select空转的次数，定义一个阀值，这个阀值默认是512，可以在应用层通过设置系统属性io.netty.selectorAutoRebuildThreshold传入，
             * 当空转的次数超过了这个阀值，重新构建新Selector，将老Selector上注册的Channel转移到新建的Selector上，关闭老Selector，用新的Selector代替老Selector，详细实现可以查看NioEventLoop中的selector和rebuildSelector方法：
             */
            long time = System.nanoTime();
            if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
                logger.warn("Selector.select() returned prematurely {} times in a row; rebuilding Selector {}.", selectCnt, selector);

                rebuildSelector();
                selector = this.selector;

                // Select again to populate selectedKeys.
                selector.selectNow();
                selectCnt = 1;
                break;
            }

            currentTimeNanos = time;
        }

        if (selectCnt > MIN_PREMATURE_SELECTOR_RETURNS) {
            if (logger.isDebugEnabled()) {
                logger.debug("Selector.select() returned prematurely {} times in a row for Selector {}.", selectCnt - 1, selector);
            }
        }
    } catch (CancelledKeyException e) {
        if (logger.isDebugEnabled()) {
            logger.debug(CancelledKeyException.class.getSimpleName() + " raised by a Selector {} - JDK bug?", selector, e);
        }
        // Harmless exception - log anyway
    }
}

核心逻辑就是调用selector的select逻辑去轮训。
那么我们知道这里轮训跟后面的处理事件是一个线程，总不能一直在这死循环吧，那么什么时候会中断轮训呢?

其实上面代码已经很好的表达了中断逻辑：

1.发现当前的定时任务队列中有任务的截止事件快到了(<=0.5ms)
2.selectedKeys != 0
3.oldWakenUp==true
4.hasTasks()
5.hasScheduledTasks()
6.wakenUp.get()

以上任何一个条件满足就会中断轮训，但只有第一个条件满足时，还会无阻塞selectNow
一次，可以看到这个轮训的处理优越之处。

其实这里还有个核心点:

long time = System.nanoTime();
if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
    // timeoutMillis elapsed without anything selected.
    selectCnt = 1;
} else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 && selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
    // The selector returned prematurely many times in a row.
    // Rebuild the selector to work around the problem.
    logger.warn("Selector.select() returned prematurely {} times in a row; rebuilding Selector {}.", selectCnt, selector);

    rebuildSelector();
    selector = this.selector;

    // Select again to populate selectedKeys.
    selector.selectNow();
    selectCnt = 1;
    break;
}

这里利用重新构建Selector的方式解决了jdk 的nio 空轮训的bug。那么什么时候重建呢? 就是设置netty的最大空轮训次数，默认是512，达到这个数后，就会重建Selector了。

2.processSelectedKeys处理就绪的IO事件

从代码中可以看出，select后紧接着就是处理了，因为上面select没有返回值，那么这里处理必然有两种情况：

1.没有就绪的IO事件，直接返回。
2.有就绪的IO事件，处理。

那么看看究竟是否如此：

private void processSelectedKeys() {
    if (selectedKeys != null) {
        /** 处理优化过的selectedKeys */
        processSelectedKeysOptimized();
    } else {
        /** 正常的处理 */
        processSelectedKeysPlain(selector.selectedKeys());
    }
}

/**
 *  迭代 selectedKeys 获取就绪的 IO 事件, 然后为每个事件都调用 processSelectedKey 来处理它.
 *
 *  1.对于boss NioEventLoop来说，轮询到的是基本上就是连接事件，后续的事情就通过他的pipeline将连接扔给一个worker NioEventLoop处理
 *  2.对于worker NioEventLoop来说，轮询到的基本上都是io读写事件，后续的事情就是通过他的pipeline将读取到的字节流传递给每个channelHandler来处理
 */
private void processSelectedKeysOptimized() {
    for (int i = 0; i < selectedKeys.size; ++i) {

        /** 1.取出IO事件以及对应的channel */
        final SelectionKey k = selectedKeys.keys[i];
        // null out entry in the array to allow to have it GC'ed once the Channel close
        // See https://github.com/netty/netty/issues/2363
        /** 取出后将数组置为空 */
        selectedKeys.keys[i] = null;

        final Object a = k.attachment();

        /** 处理该channel */
        if (a instanceof AbstractNioChannel) {
            processSelectedKey(k, (AbstractNioChannel) a);
        } else {
            @SuppressWarnings("unchecked")
            NioTask task = (NioTask) a;
            processSelectedKey(k, task);
        }

        /**
         * 判断是否该再来次轮询
         * 也就是说，对于每个NioEventLoop而言，每隔256个channel从selector上移除的时候，就标记 needsToSelectAgain 为true，我们还是跳回到上面这段代码
         */
        if (needsToSelectAgain) {
            // null out entries in the array to allow to have it GC'ed once the Channel close
            // See https://github.com/netty/netty/issues/2363
            /** 将selectedKeys的内部数组全部清空 */
            selectedKeys.reset(i + 1);

            /** 重新调用selectAgain重新填装一下 selectionKey */
            selectAgain();
            i = -1;
        }
    }
}

上面两端代码直接就是处理keys了，可以看到有两个入口，一个是优化的，一个是正常的。那么这里介绍优化的处理；其实这里优化的是什么呢？

可以看到第二段代码中，for循环做的就是将selectedKeys中的遍历进行处理，那么验证前面的猜想，select是不是就是填充了这个selectedKeys数组？先继续这个处理逻辑，后面回头进行验证这一点。

其实这里已经验证了一个猜想，当没有就绪的IO事件时直接返回。

其次，当有就绪的IO事件时,直接调用下面逻辑处理：

if (a instanceof AbstractNioChannel) {
    processSelectedKey(k, (AbstractNioChannel) a);
}

有个疑问哈，那么这里的attachment怎么就是AbstractNioChannel呢？
在server端启动时，我们看这段逻辑:

@Override
protected void doRegister() throws Exception {
    boolean selected = false;
    for (;;) {
        try {
            /** 将 Channel 对应的 Java NIO SockerChannel 注册到一个 eventLoop 的 Selector 中, 并且将当前 Channel 作为 attachment. */
            selectionKey = javaChannel().register(eventLoop().unwrappedSelector(), 0, this);
            return;
        } catch (CancelledKeyException e) {
           
        }
    }
}

继续跟踪到java nio中的逻辑：

public final SelectionKey register(Selector sel, int ops,Object att)
    throws ClosedChannelException
{
    synchronized (regLock) {
       ...
        if (k != null) {
            k.interestOps(ops);
            k.attach(att);
        }
       ...        
       return k;
    }
}

这里就会发现，在server启动注册时，就会将this 即 AbstractNioChannel 作为attachment保存在selectionKey中，这样轮训出事件后，就可以直接取出对应的channel使用了。

那么继续回到processSelectedKeys这个核心方法：

/**
 * processSelectedKey 中处理了三个事件, 分别是:

 * 1.OP_READ, 可读事件, 即 Channel 中收到了新数据可供上层读取.
 * 2.OP_WRITE, 可写事件, 即上层可以向 Channel 写入数据.
 * 3.OP_CONNECT, 连接建立事件, 即 TCP 连接已经建立, Channel 处于 active 状态.
 * 4.OP_ACCEPT,请求连接事件
 */
private void processSelectedKey(SelectionKey k, AbstractNioChannel ch) {
    final NioUnsafe unsafe = ch.unsafe();
    if (!k.isValid()) {
        final EventLoop eventLoop;
        try {
            eventLoop = ch.eventLoop();
        } catch (Throwable ignored) {
            // If the channel implementation throws an exception because there is no event loop, we ignore this
            // because we are only trying to determine if ch is registered to this event loop and thus has authority
            // to close ch.
            return;
        }
        // Only close ch if ch is still registered to this EventLoop. ch could have deregistered from the event loop
        // and thus the SelectionKey could be cancelled as part of the deregistration process, but the channel is
        // still healthy and should not be closed.
        // See https://github.com/netty/netty/issues/5125
        if (eventLoop != this || eventLoop == null) {
            return;
        }
        // close the channel if the key is not valid anymore
        unsafe.close(unsafe.voidPromise());
        return;
    }

    try {
        int readyOps = k.readyOps();
        // We first need to call finishConnect() before try to trigger a read(...) or write(...) as otherwise
        // the NIO JDK channel implementation may throw a NotYetConnectedException.
        /**
         * 事件是 OP_CONNECT, 即 TCP 连接已建立事件.
         * 1.我们需要将 OP_CONNECT 从就绪事件集中清除, 不然会一直有 OP_CONNECT 事件.
         * 2.调用 unsafe.finishConnect() 通知上层连接已建立
         *  */
        if ((readyOps & SelectionKey.OP_CONNECT) != 0) {
            // remove OP_CONNECT as otherwise Selector.select(..) will always return without blocking
            // See https://github.com/netty/netty/issues/924
            int ops = k.interestOps();
            ops &= ~SelectionKey.OP_CONNECT;
            k.interestOps(ops);

            /**
             * unsafe.finishConnect() 调用最后会调用到 pipeline().
             * fireChannelActive(), 产生一个 inbound 事件, 通知 pipeline 中的各个 handler TCP 通道已建立(即 ChannelInboundHandler.channelActive 方法会被调用)
             */
            unsafe.finishConnect();
        }

        // Process OP_WRITE first as we may be able to write some queued buffers and so free memory.
        if ((readyOps & SelectionKey.OP_WRITE) != 0) {
            // Call forceFlush which will also take care of clear the OP_WRITE once there is nothing left to write
            ch.unsafe().forceFlush();
        }

        // Also check for readOps of 0 to workaround possible JDK bug which may otherwise lead
        // to a spin loop
        /** boos reactor处理新的连接   或者 worker reactor 处理 已存在的连接有数据可读 */
        if ((readyOps & (SelectionKey.OP_READ | SelectionKey.OP_ACCEPT)) != 0 || readyOps == 0) {

            /** AbstractNioByteChannel中实现,重点 */
            unsafe.read();
        }
    } catch (CancelledKeyException ignored) {
        unsafe.close(unsafe.voidPromise());
    }
}

其实已经出现了想要的东西了，IO事件分为以下几种，

1.OP_READ, 可读事件, 即 Channel 中收到了新数据可供上层读取.
2.OP_WRITE, 可写事件, 即上层可以向 Channel 写入数据.
3.OP_CONNECT, 连接建立事件, 即 TCP 连接已经建立, Channel 处于 active 状态.
4.OP_ACCEPT,请求连接事件

那么上一段代码就是从k.readyOps();获取对应的就绪事件，然后进行对应的处理；其实前面就已经指出，netty最终的读写及连接等底层相关的操作都是交由UnSafe去处理的，那么这里也就是最终也是如此。这里拿连接请求举例：

SelectionKey.OP_ACCEPT事件对应的处理逻辑就是

unsafe.read();

netty 与bossGroup对应的UnSafe就是NioMessageUnsafe这个了，专门处理连接相关的IO事件，从上面的read进去就会到一下逻辑：

/**
 * 服务端需要的UnSafe对象
 * 将新的连接注册到worker线程组
 *
 * netty将一个新连接的建立也当作一个io操作来处理，这里的Message的含义我们可以当作是一个SelectableChannel，读的意思就是accept一个SelectableChannel，写的意思是针对一些无连接的协议，比如UDP来操作的
 */
private final class NioMessageUnsafe extends AbstractNioUnsafe {

    private final List

3-netty源码分析之Reactor