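Handler mechanism: turning a thread into a "perpetual motion machine"

Before a Handler can do any work, the thread behind it has to become a "perpetual motion machine": it must keep looping instead of exiting, so that it can go on processing whatever tasks arrive. This section shows how a thread is turned into such a machine.

How to turn a thread into a "perpetual motion machine"

The following code does it: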

public class LooperThread extends Thread {

    @Override
    public void run() {
        Looper.prepare();

        Looper.loop();
    }
}

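As the code shows, calling Looper.prepare() and then Looper.loop(), in that order, is all it takes to turn a thread into a "perpetual motion machine". Simple, isn't it? Let's dig into the source code to see what actually happens.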

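Turning a thread into a "perpetual motion machine": source code analysis

Background knowledge

fd: Linux treats everything as a file. When a process opens an existing file or creates a new one, the kernel returns a file descriptor to the process. A file descriptor is an index the kernel uses to manage open files efficiently; it refers to the opened file, and every system call that performs I/O goes through one. What a file descriptor refers to is not necessarily a regular file; it can also be, for example, a block of anonymous memory.
epoll: an I/O multiplexing facility, i.e. the ability of a single thread or process to monitor many file descriptors and learn which of them are ready for I/O. Handler relies on it; details follow below.
eventfd: similar in spirit to a pipe; it can be used for event notification between threads.
ThreadLocal: a class whose main purpose is to hold data that belongs to the current thread only.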

1. Looper.prepare()

    public static void prepare() {
        // Delegates to prepare(boolean) with quitAllowed = true, meaning the loop may be quit
        prepare(true);
    }

    private static void prepare(boolean quitAllowed) {
        // The current thread already has a Looper, so a second one must not be created: throw
        if (sThreadLocal.get() != null) {
            throw new RuntimeException("Only one Looper may be created per thread");
        }
        // sThreadLocal is a static field of type ThreadLocal that stores the Looper instance
        // see [1.1]
        sThreadLocal.set(new Looper(quitAllowed));
    }

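The line sThreadLocal.set(new Looper(quitAllowed)) stores the Looper instance in a ThreadLocal. The main point of storing it there is to make the Looper that belongs to the current Thread easy to retrieve: anywhere in the code, a call to Looper.myLooper() returns the Looper of the calling thread.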
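
To make the role of the ThreadLocal concrete, here is a minimal C++ sketch of the same "one instance per thread" idea. It is not the Android implementation: MiniLooper, prepare, myLooper and otherThreadGetsOwnLooper are hypothetical names, and the C++ thread_local keyword stands in for Java's ThreadLocal.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for android.os.Looper; only the per-thread storage matters here.
struct MiniLooper {
    std::thread::id owner = std::this_thread::get_id();
};

// One slot per thread, playing the role of sThreadLocal.
static thread_local std::unique_ptr<MiniLooper> tLooper;

// Mirrors Looper.prepare(): a second call on the same thread is an error.
void prepare() {
    if (tLooper) {
        throw std::runtime_error("Only one Looper may be created per thread");
    }
    tLooper = std::make_unique<MiniLooper>();
}

// Mirrors Looper.myLooper(): the calling thread's instance, or nullptr if none exists.
MiniLooper* myLooper() { return tLooper.get(); }

// Returns true if a new thread starts without a looper and then gets one
// of its own that is distinct from `other`.
bool otherThreadGetsOwnLooper(const MiniLooper* other) {
    bool ok = false;
    std::thread t([&] {
        bool emptyAtStart = (myLooper() == nullptr);
        prepare();
        ok = emptyAtStart && myLooper() != nullptr && myLooper() != other;
    });
    t.join();
    return ok;
}
```

Each thread sees only its own slot, which is why myLooper() needs no argument and no lock.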

1.1 Looper#Looper()

    // The constructor is private; a Looper can only be created through Looper's static methods
    private Looper(boolean quitAllowed) {
        // Create the MessageQueue
        // see [1.2]
        mQueue = new MessageQueue(quitAllowed);
        // Keep a reference to the current thread
        mThread = Thread.currentThread();
    }

1.2 MessageQueue#MessageQueue()

    MessageQueue(boolean quitAllowed) {
        // quitAllowed is true in this walkthrough
        mQuitAllowed = quitAllowed;
        // nativeInit is a JNI method; it returns a pointer (as a long), which is saved in mPtr
        // see [1.3]
        mPtr = nativeInit();
    }

1.3 android_os_MessageQueue.cpp#android_os_MessageQueue_nativeInit

    static jlong android_os_MessageQueue_nativeInit(JNIEnv* env, jclass clazz) {
        // Allocate a NativeMessageQueue
        // see [1.4]
        NativeMessageQueue* nativeMessageQueue = new NativeMessageQueue();
        if (!nativeMessageQueue) {
            jniThrowRuntimeException(env, "Unable to allocate native queue");
            return 0;
        }

        // Increment its strong reference count
        nativeMessageQueue->incStrong(env);
        // Cast the nativeMessageQueue pointer to jlong and return it to the Java layer
        return reinterpret_cast<jlong>(nativeMessageQueue);
    }

1.4 android_os_MessageQueue.cpp#NativeMessageQueue

    NativeMessageQueue::NativeMessageQueue() :
            mPollEnv(NULL), mPollObj(NULL), mExceptionObj(NULL) {
        // Look up the native Looper of the current thread and create one if it does not exist; the native layer has its own Looper class
        mLooper = Looper::getForThread();
        if (mLooper == NULL) {
            // Create a new native Looper
            // see [1.5]
            mLooper = new Looper(false);
            Looper::setForThread(mLooper);
        }
    }

1.5 Looper.cpp#Looper

    Looper::Looper(bool allowNonCallbacks)
        : mAllowNonCallbacks(allowNonCallbacks),
          mSendingMessage(false),
          mPolling(false),
          mEpollRebuildRequired(false),
          mNextRequestSeq(WAKE_EVENT_FD_SEQ + 1),
          mResponseIndex(0),
          mNextMessageUptime(LLONG_MAX) {
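        // eventfd() returns an fd. EFD_NONBLOCK makes read/write on it non-blocking. EFD_CLOEXEC closes the fd automatically when the process calls exec() (for example in a child after fork), so the new program does not inherit it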
        mWakeEventFd.reset(eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC));
        LOG_ALWAYS_FATAL_IF(mWakeEventFd.get() < 0, "Could not make wake event fd: %s", strerror(errno));

        AutoMutex _l(mLock);
        rebuildEpollLocked();
    }

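A few key points in this constructor deserve explanation:
eventfd: it enables communication between threads or between processes. It resembles a pipe but is more efficient: it needs one fd instead of the two a pipe uses, which saves resources, and its buffer management is far simpler because the entire "buffer" is a single 8-byte (64-bit) counter. For the same reason eventfd cannot carry large payloads.
The following pseudocode shows how eventfd lets threads communicate: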

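    // Holds the fd that eventfd() returns
    int savedEventfd;

    // Without EFD_NONBLOCK, the fd returned by eventfd() is a blocking one
    savedEventfd = eventfd(0, EFD_CLOEXEC);

    // Runs in thread A: read() fetches the 64-bit counter from savedEventfd. Because the fd is blocking, thread A sleeps inside read() until the counter becomes non-zero
    uint64_t counter;
    read(savedEventfd, &counter, sizeof(uint64_t));

    // Runs in thread B: write() adds a 64-bit value to the counter of savedEventfd. Now that data is available, thread A wakes up and reads the value that was just written
    uint64_t inc = 1;
    write(savedEventfd, &inc, sizeof(uint64_t));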

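mWakeEventFd: holds the fd returned by eventfd(). Note that eventfd() is called with EFD_NONBLOCK, so the fd is non-blocking: read and write on it never block. The pseudocode above relied on blocking, so how can a non-blocking fd still be used to communicate between threads or processes? The answer is to combine it with epoll.
The main job of mWakeEventFd is to implement wake-up and wait for the Java-layer MessageQueue; both are built on writing data to and reading data from mWakeEventFd.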
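
The blocking variant from the pseudocode can be turned into a small runnable sketch. This is only an illustration under my own assumptions: blockingWakeDemo is a hypothetical helper, std::thread plays the roles of thread A and thread B, and the 20 ms sleep merely makes it likely that the reader is already asleep when the write happens.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Thread A blocks in read() until thread B writes; returns the value A read.
uint64_t blockingWakeDemo() {
    // No EFD_NONBLOCK here: read() sleeps while the counter is 0.
    int fd = eventfd(0, EFD_CLOEXEC);
    assert(fd >= 0);

    uint64_t seen = 0;
    std::thread reader([&] {
        // Thread A: returns only once the counter is non-zero.
        ssize_t n = read(fd, &seen, sizeof(seen));
        assert(n == static_cast<ssize_t>(sizeof(seen)));
    });

    std::thread writer([&] {
        // Thread B: adds 1 to the counter, which wakes thread A.
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        uint64_t inc = 1;
        ssize_t n = write(fd, &inc, sizeof(inc));
        assert(n == static_cast<ssize_t>(sizeof(inc)));
    });

    reader.join();
    writer.join();
    close(fd);
    return seen;
}
```

Note that the reader needs a thread of its own just to wait on one fd, which is exactly the cost epoll removes.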

1.6 Looper.cpp#rebuildEpollLocked

    void Looper::rebuildEpollLocked() {
        // Close old epoll instance if we have one.
        // If mEpollFd already exists, release it before rebuilding
        if (mEpollFd >= 0) {
    #if DEBUG_CALLBACKS
            ALOGD("%p ~ rebuildEpollLocked - rebuilding epoll set", this);
    #endif
            mEpollFd.reset();
        }

        // Allocate the new epoll instance and register the wake pipe.
        // Call epoll_create1 to create a fresh instance; the fd it returns is stored in mEpollFd
        mEpollFd.reset(epoll_create1(EPOLL_CLOEXEC));
        LOG_ALWAYS_FATAL_IF(mEpollFd < 0, "Could not create epoll instance: %s", strerror(errno));

        struct epoll_event eventItem;
        memset(& eventItem, 0, sizeof(epoll_event)); // zero out unused members of data field union
        eventItem.events = EPOLLIN;
        eventItem.data.fd = mWakeEventFd.get();
        // Register the wake event with epoll_ctl
        int result = epoll_ctl(mEpollFd.get(), EPOLL_CTL_ADD, mWakeEventFd.get(), &eventItem);
        LOG_ALWAYS_FATAL_IF(result != 0, "Could not add wake event fd to epoll instance: %s",
                            strerror(errno));

        // If mRequests is not empty, register each request with epoll_ctl as well
        for (size_t i = 0; i < mRequests.size(); i++) {
            const Request& request = mRequests.valueAt(i);
            struct epoll_event eventItem;
            request.initEventItem(&eventItem);

            int epollResult = epoll_ctl(mEpollFd.get(), EPOLL_CTL_ADD, request.fd, &eventItem);
            if (epollResult < 0) {
                ALOGE("Error adding epoll events for fd %d while rebuilding epoll set: %s",
                      request.fd, strerror(errno));
            }
        }
    }

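Before explaining the method above, let's first look at pipes, FIFOs, fd pipes and eventfd, and then at epoll.

Pipes, FIFOs, fd pipes, eventfd
Isn't this section about the Handler mechanism? Why bring up pipes? The main reason is that the machinery underneath Handler is no longer only about communication between threads: the native Looper can also watch file descriptors that carry communication between processes.

Pipe: half-duplex (data flows in one direction only) and usable only between two processes that share a common ancestor.
FIFO: also a kind of pipe, but without the common-ancestor restriction.
fd pipe: also known as a UNIX domain socket. It is full-duplex (either end can both read and write) and places no restriction on which processes use it. The vsync mechanism uses an fd pipe.
eventfd: introduced above. Its advantages are efficiency and low resource usage; its drawback is that the only data it can carry is a 64-bit integer counter.

All four can be used for communication between threads or processes, and in all four cases read and write operate on an fd (file descriptor). Used as in the eventfd pseudocode above, the fd is left in blocking mode (the default), where read and write both block, so the read call in particular has to live on its own thread. If many pipes are created for inter-thread communication, does that mean many reader threads too? The number of reader threads would grow with the number of pipes, which is clearly not a good solution. One technique that solves this problem is I/O multiplexing.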

epoll
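I/O multiplexing: as I understand it, a blocking pipe needs a dedicated reader thread that watches the read end for data, so n blocking pipes need n reader threads. With I/O multiplexing, a single reader thread watches the read ends of all n pipes.

The advantage of I/O multiplexing is obvious. It has several implementations: select, poll and epoll. On Linux, epoll is generally the most efficient and most capable of the three. The following pseudocode shows how it is used: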

    // 1. Create the epoll instance with epoll_create1, which returns an fd
    int epollFd = epoll_create1(EPOLL_CLOEXEC);

    // 2. Register an event with epoll_ctl, whose first argument is epollFd. This binds the event to epollFd, so epollFd can now monitor data on the event's fd
    struct epoll_event eventItem;
    // Event type
    eventItem.events = EPOLLIN;
    // fd the event belongs to
    eventItem.data.fd = pipeFd;
    int result = epoll_ctl(epollFd, EPOLL_CTL_ADD, pipeFd, &eventItem);

    // 3. Call epoll_wait to wait for data on all registered events. While nothing is ready the caller blocks and gives up the CPU. timeoutMillis is the wait time: 0 returns immediately, -1 waits until an event arrives, and a value > 0 waits at most that many milliseconds
    struct epoll_event eventItems[EPOLL_MAX_EVENTS];
    int eventCount = epoll_wait(epollFd, eventItems, EPOLL_MAX_EVENTS, timeoutMillis);

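To summarize how epoll is used:

- Initialize: call epoll_create1, which creates the epoll instance and returns its fd.
- Register events: call epoll_ctl.
- Wait for data: call epoll_wait. Its timeoutMillis argument is the wait time: 0 returns immediately, -1 waits until an event arrives, and a value > 0 waits at most that many milliseconds. While no data has arrived, the caller blocks and gives up the CPU.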
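
The three steps can be sketched as a small runnable program. It also answers the earlier question of how a non-blocking eventfd can still wake a thread: the blocking happens in epoll_wait, not in read. WaitResult and epollWaitDemo are hypothetical names introduced for this illustration, and the 20 ms delay before the write is arbitrary.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

struct WaitResult {
    int idleCount;        // epoll_wait result with timeout 0 while nothing is pending
    int readyCount;       // epoll_wait result with timeout -1 after another thread writes
    bool reportedWakeFd;  // true if the ready event carries the eventfd we registered
    uint64_t value;       // counter value drained from the eventfd
};

WaitResult epollWaitDemo() {
    int wakeFd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    int epollFd = epoll_create1(EPOLL_CLOEXEC);             // step 1: create
    assert(wakeFd >= 0 && epollFd >= 0);

    epoll_event item{};                                      // step 2: register
    item.events = EPOLLIN;
    item.data.fd = wakeFd;
    int rc = epoll_ctl(epollFd, EPOLL_CTL_ADD, wakeFd, &item);
    assert(rc == 0);

    epoll_event events[4];
    WaitResult r{};
    r.idleCount = epoll_wait(epollFd, events, 4, 0);         // timeout 0: return at once

    std::thread waker([wakeFd] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        uint64_t inc = 1;
        ssize_t n = write(wakeFd, &inc, sizeof(inc));
        assert(n == static_cast<ssize_t>(sizeof(inc)));
    });

    r.readyCount = epoll_wait(epollFd, events, 4, -1);       // step 3: sleep until an event
    r.reportedWakeFd = (r.readyCount == 1 && events[0].data.fd == wakeFd);
    ssize_t n = read(wakeFd, &r.value, sizeof(r.value));     // drain the counter
    assert(n == static_cast<ssize_t>(sizeof(r.value)));

    waker.join();
    close(epollFd);
    close(wakeFd);
    return r;
}
```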

rebuildEpollLocked
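With that background, here is what rebuildEpollLocked does:

- Calls epoll_create1 and stores the returned fd in mEpollFd.
- Builds an epoll_event for mWakeEventFd and registers it with epoll_ctl. This associates mEpollFd with mWakeEventFd, so data on mWakeEventFd can be monitored.
- Registers every request in mRequests through epoll_ctl as well. mRequests holds the other fds that need monitoring. For example, vsync is delivered over an fd pipe: one fd lives in the surfaceflinger process and its peer lives in the app process. The app-side fd adds itself to mRequests through Looper.cpp#addFd, and epoll can then detect when data arrives on it.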
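
The registration order described above can be imitated in a few lines. This is a sketch under loud assumptions, not the AOSP code: buildEpoll and socketEventDemo are hypothetical helpers, a plain std::vector<int> replaces mRequests, both ends of the socketpair live in one process (the real vsync channel crosses processes), and the "vsync" payload is just a label.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Registers the wake fd first and then every extra fd, the same order
// rebuildEpollLocked follows. Returns the new epoll fd.
int buildEpoll(int wakeFd, const std::vector<int>& requestFds) {
    int epollFd = epoll_create1(EPOLL_CLOEXEC);
    assert(epollFd >= 0);

    epoll_event item{};
    item.events = EPOLLIN;
    item.data.fd = wakeFd;
    int rc = epoll_ctl(epollFd, EPOLL_CTL_ADD, wakeFd, &item);
    assert(rc == 0);

    for (int fd : requestFds) {
        epoll_event req{};
        req.events = EPOLLIN;
        req.data.fd = fd;
        rc = epoll_ctl(epollFd, EPOLL_CTL_ADD, fd, &req);
        assert(rc == 0);
    }
    return epollFd;
}

// Writes to one end of an fd pipe and returns what the watching side reads
// once epoll reports that its end is readable (empty string otherwise).
std::string socketEventDemo() {
    int wakeFd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    int sv[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, sv);
    assert(wakeFd >= 0 && rc == 0);

    int epollFd = buildEpoll(wakeFd, {sv[0]});

    const char msg[] = "vsync";
    ssize_t n = write(sv[1], msg, sizeof(msg) - 1);  // the "remote" end sends data
    assert(n == 5);

    epoll_event events[4];
    int count = epoll_wait(epollFd, events, 4, 1000);
    std::string out;
    if (count == 1 && events[0].data.fd == sv[0]) {  // only the socket is ready, not the wake fd
        char buf[16] = {0};
        n = read(sv[0], buf, sizeof(buf) - 1);
        if (n > 0) out.assign(buf, static_cast<size_t>(n));
    }

    close(epollFd);
    close(sv[0]);
    close(sv[1]);
    close(wakeFd);
    return out;
}
```

One epoll instance serves both kinds of fd, and data.fd tells the waiting thread which of them fired.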

1.7 Summary

That completes the analysis of Looper.prepare(). The whole call flow is shown in the sequence diagram below.

[Figure: handler-looper-prepare sequence diagram]

2. Looper.loop()

Next, let's walk through Looper.loop().

    public static void loop() {
        // Get the Looper bound to the current thread
        final Looper me = myLooper();
        if (me == null) {
            throw new RuntimeException("No Looper; Looper.prepare() wasn't called on this thread.");
        }
        if (me.mInLoop) {
            Slog.w(TAG, "Loop again would have the queued messages be executed"
                    + " before this one completed.");
        }

        me.mInLoop = true;

        // ... code omitted ...

        // Start an endless loop
        for (;;) {
            // see [2.1]
            if (!loopOnce(me, ident, thresholdOverride)) {
                return;
            }
        }
    }

    public static @Nullable Looper myLooper() {
        // Fetch the Looper from the ThreadLocal
        return sThreadLocal.get();
    }

2.1 Looper.loopOnce

    private static boolean loopOnce(final Looper me,
            final long ident, final int thresholdOverride) {
        // Take the next Message from the MessageQueue; blocks if no Message is ready to run
        // see [2.2]
        Message msg = me.mQueue.next(); // might block
        if (msg == null) {
            // No message indicates that the message queue is quitting.
            return false;
        }

        // ... code omitted ...

        try {
            msg.target.dispatchMessage(msg);
            if (observer != null) {
                observer.messageDispatched(token, msg);
            }
            dispatchEnd = needEndTime ? SystemClock.uptimeMillis() : 0;
        } catch (Exception exception) {
            if (observer != null) {
                observer.dispatchingThrewException(token, msg, exception);
            }
            throw exception;
        } finally {
            ThreadLocalWorkSource.restore(origWorkSource);
            if (traceTag != 0) {
                Trace.traceEnd(traceTag);
            }
        }
        
        // ... code omitted ...

        msg.recycleUnchecked();

        return true;
    }

me.mQueue.next() calls MessageQueue.next() to fetch a Message. Let's step into that method.

2.2 MessageQueue.next

    Message next() {
        // Return here if the message loop has already quit and been disposed.
        // This can happen if the application tries to restart a looper after quit
        // which is not supported.
        final long ptr = mPtr;
        if (ptr == 0) {
            return null;
        }

        int pendingIdleHandlerCount = -1; // -1 only during first iteration
        int nextPollTimeoutMillis = 0;

        // Another endless loop
        for (;;) {
            if (nextPollTimeoutMillis != 0) {
                Binder.flushPendingCommands();
            }

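            // Enter the JNI method. nextPollTimeoutMillis is 0 on the first pass; when the queue
            // turns out to be empty the omitted code sets it to -1, which is the case followed below
            // see [2.3]
            nativePollOnce(ptr, nextPollTimeoutMillis);

            // ... code that retrieves the Message omitted (covered in detail in the next section) ...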
        }
    }

nativePollOnce ends up in the JNI function android_os_MessageQueue_nativePollOnce.

2.3 android_os_MessageQueue.cpp#android_os_MessageQueue_nativePollOnce

    // obj: the Java-layer MessageQueue; ptr: pointer to the NativeMessageQueue; timeoutMillis is -1 here
    static void android_os_MessageQueue_nativePollOnce(JNIEnv* env, jobject obj,
            jlong ptr, jint timeoutMillis) {
        // Cast ptr back to a NativeMessageQueue pointer
        NativeMessageQueue* nativeMessageQueue = reinterpret_cast<NativeMessageQueue*>(ptr);
        // see [2.4]
        nativeMessageQueue->pollOnce(env, obj, timeoutMillis);
    }

2.4 android_os_MessageQueue.cpp#NativeMessageQueue#pollOnce

    void NativeMessageQueue::pollOnce(JNIEnv* env, jobject pollObj, int timeoutMillis) {
        mPollEnv = env;
        mPollObj = pollObj;
        // Call the native Looper's pollOnce; timeoutMillis is -1
        // see [2.5]
        mLooper->pollOnce(timeoutMillis);
        mPollObj = NULL;
        mPollEnv = NULL;

        if (mExceptionObj) {
            env->Throw(mExceptionObj);
            env->DeleteLocalRef(mExceptionObj);
            mExceptionObj = NULL;
        }
    }

2.5 Looper.cpp#pollOnce

    // The declarations below are in system/core/libutils/include/utils/Looper.h
    int pollOnce(int timeoutMillis, int* outFd, int* outEvents, void** outData);
    inline int pollOnce(int timeoutMillis) {
        // Calls the pollOnce overload with outFd, outEvents and outData all set to nullptr
        return pollOnce(timeoutMillis, nullptr, nullptr, nullptr);
    }


    // The definition below is in system/core/libutils/Looper.cpp
    // timeoutMillis is -1
    int Looper::pollOnce(int timeoutMillis, int* outFd, int* outEvents, void** outData) {
        int result = 0;
        // Yet another endless loop
        for (;;) {
            // If unprocessed responses remain, return the first one that has an identifier (ident >= 0)
            while (mResponseIndex < mResponses.size()) {
                const Response& response = mResponses.itemAt(mResponseIndex++);
                int ident = response.request.ident;
                if (ident >= 0) {
                    int fd = response.request.fd;
                    int events = response.events;
                    void* data = response.request.data;
    #if DEBUG_POLL_AND_WAKE
                    ALOGD("%p ~ pollOnce - returning signalled identifier %d: "
                            "fd=%d, events=0x%x, data=%p",
                            this, ident, fd, events, data);
    #endif
                    if (outFd != nullptr) *outFd = fd;
                    if (outEvents != nullptr) *outEvents = events;
                    if (outData != nullptr) *outData = data;
                    return ident;
                }
            }

            // Return if result is non-zero; on first entry into this method result is 0
            if (result != 0) {
    #if DEBUG_POLL_AND_WAKE
                ALOGD("%p ~ pollOnce - returning result %d", this, result);
    #endif
                if (outFd != nullptr) *outFd = 0;
                if (outEvents != nullptr) *outEvents = 0;
                if (outData != nullptr) *outData = nullptr;
                return result;
            }
            // Enter pollInner
            // see [2.6]
            result = pollInner(timeoutMillis);
        }
    }

2.6 Looper.cpp#pollInner

    // timeoutMillis is -1
    int Looper::pollInner(int timeoutMillis) {
    #if DEBUG_POLL_AND_WAKE
        ALOGD("%p ~ pollOnce - waiting: timeoutMillis=%d", this, timeoutMillis);
    #endif

        // Adjust the timeout based on when the next message is due.
        // timeoutMillis is -1 here and mNextMessageUptime was initialized to LLONG_MAX in the Looper constructor, so the timeout adjustment below is skipped
        if (timeoutMillis != 0 && mNextMessageUptime != LLONG_MAX) {
            nsecs_t now = systemTime(SYSTEM_TIME_MONOTONIC);
            int messageTimeoutMillis = toMillisecondTimeoutDelay(now, mNextMessageUptime);
            if (messageTimeoutMillis >= 0
                    && (timeoutMillis < 0 || messageTimeoutMillis < timeoutMillis)) {
                timeoutMillis = messageTimeoutMillis;
            }
    #if DEBUG_POLL_AND_WAKE
            ALOGD("%p ~ pollOnce - next message in %" PRId64 "ns, adjusted timeout: timeoutMillis=%d",
                    this, mNextMessageUptime - now, timeoutMillis);
    #endif
        }

        // Poll.
        int result = POLL_WAKE;
        mResponses.clear();
        mResponseIndex = 0;

        // We are about to idle.
        mPolling = true;

        // eventItems receives the events that fired. epoll_wait starts waiting for events; because timeoutMillis is -1 it blocks until an event arrives and gives up the CPU meanwhile
        struct epoll_event eventItems[EPOLL_MAX_EVENTS];
        int eventCount = epoll_wait(mEpollFd.get(), eventItems, EPOLL_MAX_EVENTS, timeoutMillis);

        // ... code handling the wake fd and mRequests omitted (covered in detail in later sections) ...

        return result;
    }

This method calls epoll_wait to wait for events. Because timeoutMillis is -1 at this point, the current thread blocks and gives up the CPU.
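
The timeout adjustment at the top of pollInner is skipped in this walkthrough, but its logic is easy to isolate as a pure function. The sketch below copies the condition from the listing; millisUntil is my own simplified stand-in for toMillisecondTimeoutDelay (assumed here to round up to whole milliseconds and to return 0 for a time that is already due), and adjustTimeout is a hypothetical name.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

using nsecs_t = int64_t;  // nanoseconds, as in libutils

// Simplified stand-in for toMillisecondTimeoutDelay (an assumption, not the
// library code): remaining time in ms, rounded up; 0 if already due;
// -1 if the delay does not fit in an int.
int millisUntil(nsecs_t now, nsecs_t due) {
    if (due <= now) return 0;
    nsecs_t delta = due - now;
    if (delta > (INT_MAX - 1) * 1000000LL) return -1;
    return static_cast<int>((delta + 999999) / 1000000);
}

// Same condition as in pollInner: shorten the wait when a native message
// is due sooner than the requested timeout.
int adjustTimeout(int timeoutMillis, nsecs_t now, nsecs_t nextMessageUptime) {
    if (timeoutMillis != 0 && nextMessageUptime != LLONG_MAX) {
        int messageTimeoutMillis = millisUntil(now, nextMessageUptime);
        if (messageTimeoutMillis >= 0
                && (timeoutMillis < 0 || messageTimeoutMillis < timeoutMillis)) {
            timeoutMillis = messageTimeoutMillis;
        }
    }
    return timeoutMillis;
}
```

With nextMessageUptime equal to LLONG_MAX the function returns -1 unchanged, which is the path followed in this section; with a message due in 5 ms, an infinite wait is shortened to 5.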

2.7 Summary

That completes the analysis of Looper.loop(). The sequence diagram below summarizes it:

[Figure: handler-looper-loop sequence diagram]

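Summary

Turning a thread into a "perpetual motion machine" takes two steps:

Preparation: Looper.prepare() does nothing but groundwork:
- It creates the Looper object and puts it into a ThreadLocal, so that the Looper "bound" to the current thread can be retrieved conveniently from any code running on that thread.
- It creates the MessageQueue object, which has a one-to-one relationship with the Looper created above.
- It creates the native-layer Looper object, which calls eventfd to create an fd, calls epoll_create1 to create the epoll fd, and then uses epoll_ctl to register mWakeEventFd and every entry of mRequests as events. From then on epoll can report whenever something happens on any of the registered fds.
Running: once the thread calls Looper.loop(), the work begins:
- Looper.loop() starts an endless loop, which is what really makes the thread a "perpetual motion machine".
- Inside the loop, MessageQueue.next() is called to fetch a Message. next() calls nativePollOnce, which eventually reaches pollInner in Looper.cpp. Since the epoll and eventfd setup was finished during preparation, pollInner calls epoll_wait to wait for events. With an empty queue the wait time (timeoutMillis) is -1, so it waits until an event occurs, and the current thread blocks and gives up the CPU in the meantime.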
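
Both steps fit into one miniature. The sketch below is a hypothetical TinyLooper, not the AOSP classes: its constructor does the preparation (eventfd, epoll_create1, epoll_ctl), loop() sleeps in epoll_wait with a timeout of -1, and post() wakes it by writing to the eventfd. The std::function task queue and runTinyLooperDemo are my own simplifications standing in for Message and MessageQueue.

```cpp
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

class TinyLooper {
public:
    TinyLooper()
        : mWakeFd(eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC)),
          mEpollFd(epoll_create1(EPOLL_CLOEXEC)) {
        assert(mWakeFd >= 0 && mEpollFd >= 0);
        epoll_event item{};
        item.events = EPOLLIN;
        item.data.fd = mWakeFd;
        int rc = epoll_ctl(mEpollFd, EPOLL_CTL_ADD, mWakeFd, &item);
        assert(rc == 0);
    }
    ~TinyLooper() { close(mEpollFd); close(mWakeFd); }

    // Callable from any thread: enqueue work, then wake the looping thread.
    void post(std::function<void()> task) {
        { std::lock_guard<std::mutex> lock(mLock); mTasks.push_back(std::move(task)); }
        wake();
    }

    // Asks loop() to return once the queue has been drained.
    void quit() {
        { std::lock_guard<std::mutex> lock(mLock); mQuit = true; }
        wake();
    }

    // Runs on the looping thread until quit() has been called.
    void loop() {
        for (;;) {
            epoll_event ev;
            epoll_wait(mEpollFd, &ev, 1, -1);  // sleep until the wake fd is written
            uint64_t counter = 0;
            ssize_t n = read(mWakeFd, &counter, sizeof(counter));  // reset the counter
            (void)n;
            for (;;) {  // run everything that is queued
                std::function<void()> task;
                {
                    std::lock_guard<std::mutex> lock(mLock);
                    if (mTasks.empty()) {
                        if (mQuit) return;
                        break;
                    }
                    task = std::move(mTasks.front());
                    mTasks.pop_front();
                }
                task();
            }
        }
    }

private:
    void wake() {
        uint64_t inc = 1;
        ssize_t n = write(mWakeFd, &inc, sizeof(inc));
        (void)n;
    }

    int mWakeFd;
    int mEpollFd;
    std::mutex mLock;
    std::deque<std::function<void()>> mTasks;
    bool mQuit = false;
};

// Posts three tasks from the calling thread and returns their combined result.
int runTinyLooperDemo() {
    TinyLooper looper;
    int sum = 0;
    std::thread worker([&] { looper.loop(); });
    for (int i = 1; i <= 3; ++i) looper.post([&sum, i] { sum += i; });
    looper.quit();
    worker.join();
    return sum;
}
```

The worker thread never spins: between tasks it is parked inside epoll_wait, just as the real thread is parked inside pollInner.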

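Because the mechanism is built on epoll, the same loop that handles communication between threads can also react to communication between processes, as long as it arrives on a registered fd.

At this point the thread has made all its preparations and is simply waiting for "events" of every kind to arrive.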

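Questions to think about

Why doesn't MessageQueue.next() use wait/notify to implement waiting and wake-up when fetching messages?
The source analysis above showed that MessageQueue.next() ultimately blocks the thread through epoll_wait. Why not use wait/notify instead? Very early versions of the Android source did in fact implement the wait/wake-up mechanism with wait/notify.

In my view the main reason is that a wait/wake-up mechanism built on epoll has the following advantages: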

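Powerful: it covers communication between processes as well as between threads. vsync uses an fd pipe for inter-process communication: epoll only needs to watch the fd at one end of the pipe, and when the surfaceflinger process writes to the peer fd, the current thread sees the data through epoll.
High performance: epoll can monitor a large number of fds, and the cost of a wait depends on how many fds are ready rather than on how many are registered.
Extensible: calling addFd in Looper.cpp is all it takes to monitor another fd on the current thread.
Serves both the Java layer and the native layer: the native Looper.cpp offers its own counterpart of the message-queue functionality that the Java-layer MessageQueue provides.

Inter-process communication would be hard to build on wait/notify, which only coordinates threads inside one process. Even if it were done, it would probably involve extra thread switches, and performance would suffer badly.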

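Why use eventfd?
When a Message is put into the MessageQueue, all that is needed is the simplest possible notification or signal to the blocked MessageQueue, telling it that a message has arrived. Since wait/notify is not used for waiting and wake-up, something pipe-like is required, but a full pipe is overkill for this job. eventfd fits best: it uses very little memory and only one fd, and sending a notification is just writing a 64-bit integer. That is why eventfd is used to tell the blocked MessageQueue that a message has arrived when one is enqueued.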
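
A short sketch shows why the 8-byte counter is enough for this kind of notification. With a non-blocking eventfd created without EFD_SEMAPHORE, several writes simply add up and a single read drains them all, so repeated wake-ups coalesce into one. CoalesceResult and coalesceDemo are hypothetical names used only for this illustration.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstdint>

struct CoalesceResult {
    bool emptyReadWouldBlock;  // read on a zero counter fails with EAGAIN instead of blocking
    uint64_t drained;          // value returned by the single read after several writes
    bool emptyAgain;           // the counter is back to zero after that read
};

CoalesceResult coalesceDemo(int wakeups) {
    int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    assert(fd >= 0);
    CoalesceResult r{};
    uint64_t value = 0;

    // Nothing written yet: a non-blocking read reports EAGAIN.
    r.emptyReadWouldBlock = (read(fd, &value, sizeof(value)) == -1 && errno == EAGAIN);

    // Each "wake-up" adds 1 to the kernel-side counter.
    for (int i = 0; i < wakeups; ++i) {
        uint64_t inc = 1;
        ssize_t n = write(fd, &inc, sizeof(inc));
        assert(n == static_cast<ssize_t>(sizeof(inc)));
    }

    // One read returns the accumulated count and resets the counter to 0.
    ssize_t n = read(fd, &value, sizeof(value));
    r.drained = (n == static_cast<ssize_t>(sizeof(value))) ? value : 0;

    r.emptyAgain = (read(fd, &value, sizeof(value)) == -1 && errno == EAGAIN);
    close(fd);
    return r;
}
```

A pipe would need two fds and a kernel buffer to deliver the same "something happened" signal.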