A bug in Java NIO: epoll empty polling

Introduction to IO and NIO

Reading with IO

Reading with NIO

How epoll empty polling shows up in NIO

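import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.Set;

public class EpollSpinDemo {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        System.out.println(selector.isOpen());
        ServerSocketChannel socketChannel = ServerSocketChannel.open();
        InetSocketAddress inetSocketAddress = new InetSocketAddress("localhost", 8080);
        socketChannel.bind(inetSocketAddress);
        socketChannel.configureBlocking(false);
        // For a ServerSocketChannel, validOps() is just OP_ACCEPT.
        int ops = socketChannel.validOps();
        socketChannel.register(selector, ops, null);
        Set<SelectionKey> selectedKeys = selector.selectedKeys();
        for (;;) {
            System.out.println("waiting...");
            /*
             * select() normally blocks until a channel is ready.
             * With the epoll empty-polling bug, when a previously
             * connected socket is abruptly disconnected, select()
             * is woken up although no data has arrived, and it
             * returns 0 (noOfKeys == 0) instead of staying blocked.
             * The while (itr.hasNext()) loop therefore never runs
             * and control goes straight back to the top of for (;;).
             * Normally the program blocks and prints "waiting..."
             * once; here it spins, printing "waiting..." endlessly,
             * and CPU usage quickly climbs to 100%.
             */
            int noOfKeys = selector.select();
            System.out.println("selected keys:" + noOfKeys);
            Iterator<SelectionKey> itr = selectedKeys.iterator();
            while (itr.hasNext()) {
                SelectionKey key = itr.next();
                if (key.isAcceptable()) {
                    SocketChannel client = socketChannel.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                    System.out.println("The new connection is accepted from the client: " + client);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(256);
                    int n = client.read(buffer);
                    if (n < 0) {
                        // The peer closed the connection; closing the channel also cancels its key.
                        client.close();
                    } else {
                        // Decode only the bytes that were actually read.
                        String output = new String(buffer.array(), 0, n, StandardCharsets.UTF_8).trim();
                        System.out.println("Message read from client: " + output);
                        if (output.equals("Bye Bye")) {
                            client.close();
                            System.out.println("The Client messages are complete; close the session.");
                        }
                    }
                }
                itr.remove();
            }
        }
    }
}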

Cause of the bug

The JDK bug database has two related reports:

  1. JDK-6670302 : (se) NIO selector wakes up with 0 selected keys infinitely
  2. JDK-6403933 : (se) Selector doesn't block on Selector.select(timeout) (lnx)

The JDK-6403933 report states the underlying cause:

This is an issue with poll (and epoll) on Linux. If a file descriptor for a connected socket is polled with a request event mask of 0, and if the connection is abruptly terminated (RST) then the poll wakes up with the POLLHUP (and maybe POLLERR) bit set in the returned event set. The implication of this behaviour is that Selector will wakeup and as the interest set for the SocketChannel is 0 it means there aren't any selected events and the select method returns 0.

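In other words: on some Linux 2.6 kernels, when a connected socket is abruptly terminated, poll and epoll set POLLHUP (and possibly POLLERR) in the returned event set. Because the event set has changed, the Selector may be woken up, yet with an interest set of 0 there are no selected events, so select() returns 0.

This is rooted in operating-system behavior. The JDK is only a software layer that has to work across operating-system platforms, but regrettably the problem was not solved in the initial releases of JDK 5 and JDK 6 (strictly speaking, some later JDK versions were affected as well), and the blame was passed to the operating system instead. That is why the bug was not finally fixed until 2013, by which time its impact had become very wide.

Solutions
An incomplete workaround
The Grizzly committers were the first to make a change, and extensive testing showed that it greatly reduced how often the JDK NIO problem occurred.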

// the key you registered on the temporary selector
if (selectionKey != null) {
    // cancel the SelectionKey that was registered with the temporary selector
    selectionKey.cancel();
    // flush the cancelled key
    temporarySelector.selectNow();
}

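However, this change is still not reliable, for two reasons:

  1. When SelectionKeys are cancelled from multiple threads, a cancel may well run concurrently with the Selector.selectNow() call that follows it; if it does, the cancel may have no effect.
  2. Contrary to the claim that this change greatly reduces how often NIO spins, test reports from the Jetty server found that reusing the Selector and flushing the cancelled SelectionKey may well have no effect at all.

A complete solution

The ultimate fix is to create a new Selector: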

Trash wasted Selector, creates a new one.
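As a rough illustration of what "create a new one" involves, the following is a minimal sketch and not the code of any particular server. The class name SelectorRebuilder and the method rebuild are made up for this example. It assumes the call is made from the thread that owns the old Selector, so nothing else is selecting on it at the same time.

```java
import java.io.IOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper for illustration; not taken from Jetty, Netty or Grizzly.
final class SelectorRebuilder {
    private SelectorRebuilder() {}

    /**
     * Opens a fresh Selector, registers every still-valid channel of
     * oldSelector on it with the same interest ops and attachment,
     * then closes oldSelector and returns the new one.
     */
    static Selector rebuild(Selector oldSelector) throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            try {
                if (!key.isValid()) {
                    continue; // channel closed or key cancelled: nothing to migrate
                }
                int ops = key.interestOps();
                Object attachment = key.attachment();
                SelectableChannel channel = key.channel();
                // A channel may be registered with several selectors at once,
                // so it can join the new one before the old one is closed.
                channel.register(newSelector, ops, attachment);
            } catch (CancelledKeyException e) {
                // The key was cancelled between isValid() and interestOps(); skip it.
            }
        }
        // Closing the old selector invalidates its remaining keys.
        oldSelector.close();
        return newSelector;
    }
}
```

A caller would replace its field with the result, for example selector = SelectorRebuilder.rebuild(selector), once it has decided that the old Selector is spinning.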

How specific projects handle it
Jetty

Jetty first defines two -D parameters:

  • JVMBUG_THRESHHOLD

org.mortbay.io.nio.JVMBUG_THRESHHOLD, defaults to 512 and is the number of zero select returns that must be exceeded in a period.

  • MONITOR_PERIOD

org.mortbay.io.nio.MONITOR_PERIOD defaults to 1000 and is the period over which the threshhold applies.

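The first parameter is the number of select() calls that return 0, and the second is a time period. Together they mean: if Selector.select() keeps returning 0 more than the threshold number of times within that period, the JVM bug is assumed to have been triggered.

The approach is:

  • Count the number of times select() returns 0 (call this the jvmBug count)
  • If the jvmBug count exceeds JVMBUG_THRESHHOLD within MONITOR_PERIOD, create a new selector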
long before = now;
int selected = selector.select(wait);
now = System.currentTimeMillis();
_idleTimeout.setNow(now);
_timeout.setNow(now);

/*
 * A select() that returned 0 keys, with wait above _JVMBUG_THRESHHOLD,
 * in less than half of the requested wait time is counted as a
 * suspected occurrence of the bug.
 */
if (_JVMBUG_THRESHHOLD > 0 && selected == 0
        && wait > _JVMBUG_THRESHHOLD
        && (now - before) < (wait/2)) {
    _jvmBug++;
    // check whether the jvmBug count has reached the configured limit
    if (_jvmBug >= (_JVMBUG_THRESHHOLD2)) {
        // epoll empty polling is assumed to have occurred: switch to a new selector
        synchronized (this) {
            _lastJvmBug = now;
            final Selector new_selector = Selector.open();
            // queue every valid registration of the old selector for the new one;
            // keys() is used because selectedKeys() is empty when select() returned 0
            for (SelectionKey k:selector.keys()) {
                if (!k.isValid() || k.interestOps() == 0) {
                    continue;
                }
                final SelectableChannel channel = k.channel();
                final Object attachment = k.attachment();
                if (attachment == null) {
                    addChange(channel);
                } else {
                    addChange(channel, attachment);
                }
            }
            // close the old selector
            _selector.close();
            // switch to the new selector
            _selector = new_selector;
            // reset the bug count to 0
            _jvmBug = 0;
            return;
        }
    }
}
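The counting rule described for the two -D parameters can be isolated from the rest of the select loop. The sketch below is only an illustration under stated assumptions: the class ZeroSelectMonitor and its methods are invented for this example and are not Jetty's API. The property names and defaults are the ones quoted above. The caller is assumed to pass in only results that returned early, since a select(timeout) that legitimately timed out also returns 0.

```java
// Hypothetical helper for illustration; not Jetty's actual class.
final class ZeroSelectMonitor {
    private final int threshold;      // zero returns that must be exceeded
    private final long periodMillis;  // window over which the threshold applies
    private long windowStart;
    private int zeroSelects;

    ZeroSelectMonitor(int threshold, long periodMillis, long nowMillis) {
        this.threshold = threshold;
        this.periodMillis = periodMillis;
        this.windowStart = nowMillis;
    }

    /** Builds a monitor from the -D parameters quoted in the text. */
    static ZeroSelectMonitor fromSystemProperties(long nowMillis) {
        int threshold = Integer.getInteger("org.mortbay.io.nio.JVMBUG_THRESHHOLD", 512);
        long period = Long.getLong("org.mortbay.io.nio.MONITOR_PERIOD", 1000L);
        return new ZeroSelectMonitor(threshold, period, nowMillis);
    }

    /**
     * Records the result of one select() call.
     * Returns true when the selector should be replaced.
     */
    boolean record(int selected, long nowMillis) {
        if (nowMillis - windowStart >= periodMillis) {
            // A new monitoring period starts: forget the old count.
            windowStart = nowMillis;
            zeroSelects = 0;
        }
        if (selected != 0) {
            return false; // a productive select is never counted
        }
        zeroSelects++;
        if (zeroSelects > threshold) {
            // Threshold exceeded within the period: start over after the rebuild.
            zeroSelects = 0;
            windowStart = nowMillis;
            return true;
        }
        return false;
    }
}
```

With the defaults, more than 512 zero returns inside one 1000 ms window would make record return true, at which point the caller would open a new selector as in the Jetty code above.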

Netty

The approach is almost identical to Jetty's; Netty simply extracts the Selector rebuild into a method of its own.

long currentTimeNanos = System.nanoTime();
for (;;) {
    // 1. a scheduled task's deadline is about to be reached: stop this round of polling
    ...
    // 2. a task was added while polling: stop this round of polling
    ...
    // 3. blocking select
    selector.select(timeoutMillis);
    // 4. work around the JDK NIO bug
    long time = System.nanoTime();
    if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
        // select() blocked for the full timeout: a valid poll, reset the counter
        selectCnt = 1;
    } else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 &&
            selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
        // too many premature returns: rebuild the selector
        rebuildSelector();
        selector = this.selector;
        selector.selectNow();
        selectCnt = 1;
        break;
    }
    currentTimeNanos = time;
    ...
}

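Before each selector.select(timeoutMillis) call, Netty records the start time currentTimeNanos, and after select returns it records the end time. It then checks whether the select lasted at least timeoutMillis milliseconds (the condition time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos is perhaps easier to read when rewritten as time - currentTimeNanos >= TimeUnit.MILLISECONDS.toNanos(timeoutMillis)).
If the elapsed time is at least timeoutMillis, the poll was a valid one and selectCnt is reset. Otherwise the blocking call did not block for that long and may have hit the JDK empty-polling bug. Once the number of such empty polls reaches a threshold (512 by default), Netty rebuilds the selector.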
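The timing check can be tried in isolation with a small counter. This is a simplified sketch rather than Netty's implementation: the class PrematureSelectCounter and its method afterSelect are hypothetical names, and the counter here tracks only premature returns and resets to 0, whereas the excerpt above keeps selectCnt and resets it to 1. It uses the elapsed-time form of the comparison suggested in the text.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not Netty's actual class.
final class PrematureSelectCounter {
    private final int rebuildThreshold; // 0 or less disables the check
    private int prematureReturns;

    PrematureSelectCounter(int rebuildThreshold) {
        this.rebuildThreshold = rebuildThreshold;
    }

    /**
     * Call after each select(timeoutMillis) with System.nanoTime() values
     * taken before and after the call.
     * Returns true when the selector should be rebuilt.
     */
    boolean afterSelect(long startNanos, long endNanos, long timeoutMillis) {
        long elapsedNanos = endNanos - startNanos;
        if (elapsedNanos >= TimeUnit.MILLISECONDS.toNanos(timeoutMillis)) {
            // select() blocked for the whole timeout: a valid poll.
            prematureReturns = 0;
            return false;
        }
        // select() came back early; this may be the empty-polling bug.
        prematureReturns++;
        if (rebuildThreshold > 0 && prematureReturns >= rebuildThreshold) {
            prematureReturns = 0;
            return true;
        }
        return false;
    }
}
```

A real loop would also have to exclude early returns that are legitimate, such as a select that found ready keys or was woken up on purpose; those cases are hidden behind the elided lines in the excerpt above and are not modelled here.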
