一).Epoll 介绍
Epoll 可是当前在 Linux 下开发大规模并发网络程序的热门人选, Epoll 在 Linux2.6 内核中正式引入,和 select 相似,其实都 I/O 多路复用技术而已 ,并没有什么神秘的。其实在 Linux 下设计并发网络程序,向来不缺少方法,比如典型的 Apache 模型( Process Per Connection ,简称 PPC ), TPC ( Thread Per Connection )模型,以及 select 模型和 poll 模型,那为何还要再引入 Epoll 呢?那还是有得说说的 …
二). 常用模型的缺点
如果不摆出来其他模型的缺点,怎么能对比出 Epoll 的优点呢。
① PPC/TPC 模型
这两种模型思想类似,就是让每一个到来的连接一边自己做事去,别再来烦我 。只是 PPC 是为它开了一个进程,而 TPC 开了一个线程。可是别烦我是有代价的,它要时间和空间啊,连接多了之后,那么多的进程 / 线程切换,这开销就上来了;因此这类模型能接受的最大连接数都不会高,一般在几百个左右。
② select 模型
1. 最大并发数限制,因为一个进程所打开的 FD (文件描述符)是有限制的,由 FD_SETSIZE 设置,默认值是 1024/2048 ,因此 Select 模型的最大并发数就被相应限制了。自己改改这个 FD_SETSIZE ?想法虽好,可是先看看下面吧 …
2. 效率问题, select 每次调用都会线性扫描全部的 FD 集合,这样效率就会呈现线性下降,把 FD_SETSIZE 改大的后果就是,大家都慢慢来,什么?都超时了。
3. 内核 / 用户空间 内存拷贝问题,如何让内核把 FD 消息通知给用户空间呢?在这个问题上 select 采取了内存拷贝方法。
总结为:1.连接数受限 2.查找配对速度慢 3.数据由内核拷贝到用户态
③ poll 模型
基本上效率和 select 是相同的, select 缺点的 2 和 3 它都没有改掉。
三). Epoll 的提升
把其他模型逐个批判了一下,再来看看 Epoll 的改进之处吧,其实把 select 的缺点反过来那就是 Epoll 的优点了。
①. Epoll 没有最大并发连接的限制,上限是最大可以打开文件的数目,这个数字一般远大于 2048, 一般来说这个数目和系统内存关系很大 ,具体数目可以 cat /proc/sys/fs/file-max 察看。
②. 效率提升, Epoll 最大的优点就在于它只管你“活跃”的连接 ,而跟连接总数无关,因此在实际的网络环境中, Epoll 的效率就会远远高于 select 和 poll 。
③. 内存拷贝, Epoll 在这点上使用了“共享内存 ”,这个内存拷贝也省略了。
四). Epoll 为什么高效
Epoll 的高效和其数据结构的设计是密不可分的,这个下面就会提到。
首先回忆一下 select 模型,当有 I/O 事件到来时, select 通知应用程序有事件到了快去处理,而应用程序必须轮询所有的 FD 集合,测试每个 FD 是否有事件发生,并处理事件;代码像下面这样:
int res = select(maxfd+1, &readfds, NULL, NULL, 120); if (res > 0) { for (int i = 0; i < MAX_CONNECTION; i++) { if (FD_ISSET(allConnection[i], &readfds)) { handleEvent(allConnection[i]); } } } // if(res == 0) handle timeout, res < 0 handle error
int res = epoll_wait(epfd, events, 20, 120); for (int i = 0; i < res;i++) { handleEvent(events[n]); }
前面提到 Epoll 速度快和其数据结构密不可分,其关键数据结构就是:
struct epoll_event { __uint32_t events; // Epoll events epoll_data_t data; // User data variable }; typedef union epoll_data { void *ptr; int fd; __uint32_t u32; __uint64_t u64; } epoll_data_t;
events可以是以下几个宏的集合:
EPOLLIN : 表示对应的文件描述符可以读(包括对端SOCKET正常关闭);
EPOLLOUT: 表示对应的文件描述符可以写;
EPOLLPRI: 表示对应的文件描述符有紧急的数据可读(这里应该表示有带外数据到来);
EPOLLERR: 表示对应的文件描述符发生错误;
EPOLLHUP: 表示对应的文件描述符被挂断;
EPOLLET: 将EPOLL设为边缘触发(Edge Triggered)模式,这是相对于水平触发(Level Triggered)来说的。
EPOLLONESHOT: 只监听一次事件,当监听完这次事件之后,如果还需要继续监听这个socket的话,需要再次把这个socket加入到EPOLL队列里
六). 使用 Epoll
既然 Epoll 相比 select 这么好,那么用起来如何呢?会不会很繁琐啊 … 先看看下面的三个函数吧,就知道 Epoll 的易用了。
int epoll_create( int size);生成一个 Epoll 专用的文件描述符,其实是申请一个内核空间,用来存放你想关注的 socket fd 上是否发生以及发生了什么事件。 size 是用户期待监控的最大的文件描述符数目,系统将该值作为一个参考进行内存分配,在需要的时候,内存可以同步增大,也就是说epoll可以监控的最大文件描述符数目是可以无限大的,只要系统硬件满足。在最新版本的epoll中,epoll甚至不以size作为参考,自行动态增长,由于要兼容老版本,size参数必须提供,只需大于0即可。
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event );
控制某个 Epoll 文件描述符上的事件:注册、修改、删除。其中参数 epfd 是 epoll_create() 创建 Epoll 专用的文件描述符。相对于 select 模型中的 FD_SET 和 FD_CLR 宏。
int epoll_wait(int epfd,struct epoll_event * events,int maxevents,int timeout);
等待 I/O 事件的发生;参数说明:
epfd: 由 epoll_create() 生成的 Epoll 专用的文件描述符;
epoll_event: 用于回传代处理事件的数组;
maxevents: 每次能处理的事件数;
timeout: 等待 I/O 事件发生的超时值;
返回发生事件数。
相对于 select 模型中的 select 函数
相关参数的英文说明:
epfd为epoll_create返回的描述符。op表示对描述符fd采取的操作,取值如下:
EPOLL_CTL_ADD
Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd, per the events defined in event.
EPOLL_CTL_DEL
Remove a monitor on the file associated with the file descriptor fd from the epollinstance associated with epfd.
EPOLL_CTL_MOD
Modify an existing monitor of fd with the updated events specified by event.
在Linux的帮助文档中,关于epoll帮助内容如下:
man epoll EPOLL(7) Linux Programmer's Manual EPOLL(7) NAME epoll - I/O event notification facility SYNOPSIS #include <sys/epoll.h> DESCRIPTION The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors. The following system calls are provided to create and manage an epoll instance: * epoll_create(2) creates an epoll instance and returns a file descriptor referring to that instance. (The more recent epoll_cre鈥? ate1(2) extends the functionality of epoll_create(2).) * Interest in particular file descriptors is then registered via epoll_ctl(2). The set of file descriptors currently registered on an epoll instance is sometimes called an epoll set. * epoll_wait(2) waits for I/O events, blocking the calling thread if no events are currently available. Level-triggered and edge-triggered The epoll event distribution interface is able to behave both as edge-triggered (ET) and as level-triggered (LT). The difference between the two mechanisms can be described as follows. Suppose that this scenario happens: 1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance. 2. A pipe writer writes 2 kB of data on the write side of the pipe. 3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor. 4. The pipe reader reads 1 kB of data from rfd. 5. A call to epoll_wait(2) is done. If the rfd file descriptor has been added to the epoll interface using the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in step 5 will probably hang despite the available data still present in the file input buffer; meanwhile the remote peer might be expecting a response based on the data it already sent. The reason for this is that edge-triggered mode delivers events only when changes occur on the monitored file descriptor. So, in step 5 the caller might end up waiting for some data that is already present inside the input buffer. In the above example, an event on rfd will be generated because of the write done in 2 and the event is consumed in 3. Since the read operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block indefinitely. An application that employs the EPOLLET flag should use nonblocking file descriptors to avoid having a blocking read or write starve a task that is handling multiple file descriptors. The suggested way to use epoll as an edge-triggered (EPOLLET) interface is as follows: i with nonblocking file descriptors; and ii by waiting for an event only after read(2) or write(2) return EAGAIN. By contrast, when used as a level-triggered interface (the default, when EPOLLET is not specified), epoll is simply a faster poll(2), and can be used wherever the latter is used since it shares the same semantics. Since even with edge-triggered epoll, multiple events can be generated upon receipt of multiple chunks of data, the caller has the option to specify the EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with epoll_wait(2). When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file descriptor using epoll_ctl(2) with EPOLL_CTL_MOD. /proc interfaces The following interfaces can be used to limit the amount of kernel memory consumed by epoll: /proc/sys/fs/epoll/max_user_watches (since Linux 2.6.28) This specifies a limit on the total number of file descriptors that a user can register across all epoll instances on the system. The limit is per real user ID. Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel. Currently, the default value for max_user_watches is 1/25 (4%) of the available low memory, divided by the registration cost in bytes. Example for suggested usage While the usage of epoll when employed as a level-triggered interface does have the same semantics as poll(2), the edge-triggered usage requires more clarification to avoid stalls in the application event loop. In this example, listener is a nonblocking socket on which listen(2) has been called. The function do_use_fd() uses the new ready file descriptor until EAGAIN is returned by either read(2) or write(2). An event-driven state machine application should, after having received EAGAIN, record its current state so that at the next call to do_use_fd() it will continue to read(2) or write(2) from where it stopped before. #define MAX_EVENTS 10 struct epoll_event ev, events[MAX_EVENTS]; int listen_sock, conn_sock, nfds, epollfd; /* Set up listening socket, 'listen_sock' (socket(), bind(), listen()) */ epollfd = epoll_create(10); if (epollfd == -1) { perror("epoll_create"); exit(EXIT_FAILURE); } ev.events = EPOLLIN; ev.data.fd = listen_sock; if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) { perror("epoll_ctl: listen_sock"); exit(EXIT_FAILURE); } for (;;) { nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1); if (nfds == -1) { perror("epoll_pwait"); exit(EXIT_FAILURE); } for (n = 0; n < nfds; ++n) { if (events[n].data.fd == listen_sock) { conn_sock = accept(listen_sock, (struct sockaddr *) &local, &addrlen); if (conn_sock == -1) { perror("accept"); exit(EXIT_FAILURE); } setnonblocking(conn_sock); ev.events = EPOLLIN | EPOLLET; ev.data.fd = conn_sock; if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, &ev) == -1) { perror("epoll_ctl: conn_sock"); exit(EXIT_FAILURE); } } else { do_use_fd(events[n].data.fd); } } } When used as an edge-triggered interface, for performance reasons, it is possible to add the file descriptor inside the epoll interface (EPOLL_CTL_ADD) once by specifying (EPOLLIN|EPOLLOUT). This allows you to avoid continuously switching between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD. Questions and answers Q0 What is the key used to distinguish the file descriptors registered in an epoll set? A0 The key is the combination of the file descriptor number and the open file description (also known as an "open file handle", the kernel's internal representation of an open file). Q1 What happens if you register the same file descriptor on an epoll instance twice? A1 You will probably get EEXIST. However, it is possible to add a duplicate (dup(2), dup2(2), fcntl(2) F_DUPFD) descriptor to the same epoll instance. This can be a useful technique for filtering events, if the duplicate file descriptors are registered with different events masks. Q2 Can two epoll instances wait for the same file descriptor? If so, are events reported to both epoll file descriptors? A2 Yes, and events would be reported to both. However, careful programming may be needed to do this correctly. Q3 Is the epoll file descriptor itself poll/epoll/selectable? A3 Yes. If an epoll file descriptor has events waiting then it will indicate as being readable. Q4 What happens if one attempts to put an epoll file descriptor into its own file descriptor set? A4 The epoll_ctl(2) call will fail (EINVAL). However, you can add an epoll file descriptor inside another epoll file descriptor set. Q5 Can I send an epoll file descriptor over a UNIX domain socket to another process? A5 Yes, but it does not make sense to do this, since the receiving process would not have copies of the file descriptors in the epoll set. Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically? A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). When鈥? ever a descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open. Q7 If more than one event occurs between epoll_wait(2) calls, are they combined or reported separately? A7 They will be combined. Q8 Does an operation on a file descriptor affect the already collected but not yet reported events? A8 You can do two operations on an existing file descriptor. Remove would be meaningless for this case. Modify will reread available I/O. Q9 Do I need to continuously read/write a file descriptor until EAGAIN when using the EPOLLET flag (edge-triggered behavior) ? A9 Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is ready for the requested I/O operation. Manual page epoll(7) line 147 (press h for help or q to quit) You must consider it ready until the next (nonblocking) read/write yields EAGAIN. When and how you will use the file descrip鈥? tor is entirely up to you. For packet/token-oriented files (e.g., datagram socket, terminal in canonical mode), the only way to detect the end of the read/write I/O space is to continue to read/write until EAGAIN. For stream-oriented files (e.g., pipe, FIFO, stream socket), the condition that the read/write I/O space is exhausted can also be detected by checking the amount of data read from / written to the target file descriptor. For example, if you call read(2) by asking to read a certain amount of data and read(2) returns a lower number of bytes, you can be sure of having exhausted the read I/O space for the file descriptor. The same is true when writing using write(2). (Avoid this latter technique if you cannot guarantee that the monitored file descriptor always refers to a stream-oriented file.) Possible pitfalls and ways to avoid them o Starvation (edge-triggered) If there is a large amount of I/O space, it is possible that by trying to drain it the other files will not get processed causing starvation. (This problem is not specific to epoll.) The solution is to maintain a ready list and mark the file descriptor as ready in its associated data structure, thereby allowing the application to remember which files need to be processed but still round robin amongst all the ready files. This also supports ignoring subsequent events you receive for file descriptors that are already ready. o If using an event cache... If you use an event cache or store all the file descriptors returned from epoll_wait(2), then make sure to provide a way to mark its closure dynamically (i.e., caused by a previous event's processing). Suppose you receive 100 events from epoll_wait(2), and in event #47 a condition causes event #13 to be closed. If you remove the structure and close(2) the file descriptor for event #13, then your event cache might still say there are events waiting for that file descriptor causing confusion. One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete file descriptor 13 and close(2), then mark its associated data structure as removed and link it to a cleanup list. If you find another event for file descriptor 13 in your batch processing, you will discover the file descriptor had been previously removed and there will be no con鈥? fusion. VERSIONS The epoll API was introduced in Linux kernel 2.5.44. Support was added to glibc in version 2.3.2. Manual page epoll(7) line 183 (press h for help or q to quit) CONFORMING TO The epoll API is Linux-specific. Some other systems provide similar mechanisms, for example, FreeBSD has kqueue, and Solaris has /dev/poll. SEE ALSO epoll_create(2), epoll_create1(2), epoll_ctl(2), epoll_wait(2) COLOPHON This page is part of release 3.54 of the Linux man-pages project. A description of the project, and information about reporting bugs, can be found at http://www.kernel.org/doc/man-pages/. Linux 2012-04-17 EPOLL(7) Manual page epoll(7) line 196/233 (END) (press h for help or q to quit)
Edge Triggered (ET) 边缘触发只有数据到来,才触发,不管缓存区中是否还有数据。
Level Triggered(LT) 水平触发只要有数据都会触发。
假如有这样一个例子:
1. 我们已经把一个用来从管道中读取数据的文件句柄(RFD)添加到epoll描述符
2. 这个时候从管道的另一端被写入了2KB的数据
3. 调用epoll_wait(2),并且它会返回RFD,说明它已经准备好读取操作
4. 然后我们读取了1KB的数据
5. 调用epoll_wait(2)......
如果我们在第1步将RFD添加到epoll描述符的时候使用了EPOLLET标志,那么在第5步调用epoll_wait(2)之后将有可能会挂起,因为剩余的数据还存在于文件的输入缓冲区内,而且数据发出端还在等待一个针对已经发出数据的反馈信息。只有在监视的文件句柄上发生了某个事件的时候 ET 工作模式才会汇报事件。因此在第5步的时候,调用者可能会放弃等待仍在存在于文件输入缓冲区内的剩余数据。在上面的例子中,会有一个事件产生在RFD句柄上,因为在第2步执行了一个写操作,然后,事件将会在第3步被销毁。因为第4步的读取操作没有读空文件输入缓冲区内的数据,因此我们在第5步调用 epoll_wait(2)完成后,是否挂起是不确定的。epoll工作在ET模式的时候,必须使用非阻塞套接口,以避免由于一个文件句柄的阻塞读/阻塞写操作把处理多个文件描述符的任务饿死。最好以下面的方式调用ET模式的epoll接口,在后面会介绍避免可能的缺陷。
i 基于非阻塞文件句柄
ii 只有当read(2)或者write(2)返回EAGAIN时才需要挂起,等待。但这并不是说每次read()时都需要循环读,直到读到产生一个EAGAIN才认为此次事件处理完成,当read()返回的读到的数据长度小于请求的数据长度时,就可以确定此时缓冲中已没有数据了,也就可以认为此事读事件已处理完成。
相反的,以LT方式调用epoll接口的时候,它就相当于一个速度比较快的poll(2),并且无论后面的数据是否被使用,因此他们具有同样的职能。因为即使使用ET模式的epoll,在收到多个chunk的数据的时候仍然会产生多个事件。调用者可以设定EPOLLONESHOT标志,在 epoll_wait(2)收到事件后epoll会与事件关联的文件句柄从epoll描述符中禁止掉。因此当EPOLLONESHOT设定后,使用带有 EPOLL_CTL_MOD标志的epoll_ctl(2)处理文件句柄就成为调用者必须作的事情。
然后详细解释ET, LT:
LT(level triggered)是缺省的工作方式,并且同时支持block和no-block socket.在这种做法中,内核告诉你一个文件描述符是否就绪了,然后你可以对这个就绪的fd进行IO操作。如果你不作任何操作,内核还是会继续通知你的,所以,这种模式编程出错误可能性要小一点。传统的select/poll都是这种模型的代表.
ET(edge-triggered)是高速工作方式,只支持no-block socket。在这种模式下,当描述符从未就绪变为就绪时,内核通过epoll告诉你。然后它会假设你知道文件描述符已经就绪,并且不会再为那个文件描述符发送更多的就绪通知,直到你做了某些操作导致那个文件描述符不再为就绪状态了(比如,你在发送,接收或者接收请求,或者发送接收的数据少于一定量时导致了一个EWOULDBLOCK 错误)。但是请注意,如果一直不对这个fd作IO操作(从而导致它再次变成未就绪),内核不会发送更多的通知(only once),不过在TCP协议中,ET模式的加速效用仍需要更多的benchmark确认(这句话不理解)。
在许多测试中我们会看到如果没有大量的idle-connection或者dead-connection,epoll的效率并不会比select/poll高很多,但是当我们遇到大量的idle- connection(例如WAN环境中存在大量的慢速连接),就会发现epoll的效率大大高于select/poll。
epoll的典型示例代码:
#define MAX_EVENTS 10 struct epoll_event ev, events[MAX_EVENTS]; int listen_sock, conn_sock, nfds, epollfd; /* Set up listening socket, 'listen_sock' (socket(), bind(), listen()) */ epollfd = epoll_create(10); if (epollfd == -1) { perror("epoll_create"); exit(EXIT_FAILURE); } ev.events = EPOLLIN; ev.data.fd = listen_sock; if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == -1) { perror("epoll_ctl: listen_sock"); exit(EXIT_FAILURE); } for (;;) { nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);//有多少个描述符可以处理 if (nfds == -1) { perror("epoll_pwait"); exit(EXIT_FAILURE); } for (n = 0; n < nfds; ++n) {//可读的描述符直接位于events的前nfds个元素中,因而只需遍历前nfds个元素就可以处理 if (events[n].data.fd == listen_sock) { conn_sock = accept(listen_sock, (struct sockaddr *) &local, &addrlen); if (conn_sock == -1) { perror("accept"); exit(EXIT_FAILURE); } setnonblocking(conn_sock); ev.events = EPOLLIN | EPOLLET; ev.data.fd = conn_sock; if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, &ev) == -1) { perror("epoll_ctl: conn_sock"); exit(EXIT_FAILURE); } } else { do_use_fd(events[n].data.fd); } } }
参考:
1. http://blog.chinaunix.net/uid-20583479-id-1920065.html
2.http://blog.csdn.net/tianmohust/article/details/6677985