I came across a good article, "The method to epoll's madness" by Cindy Sridharan (link to the original).
An excerpt follows:
epoll stands for event poll and is a Linux-specific construct. It allows a process to monitor multiple file descriptors and get notifications when I/O is possible on them. It supports both edge-triggered and level-triggered notifications. Before we look into the bowels of epoll, let’s first explore the syntax.
Unlike poll, epoll itself is not a system call. It's a kernel data structure that allows a process to multiplex I/O on multiple file descriptors.
This data structure can be created, modified and deleted by three system calls.
The epoll instance is created by means of the epoll_create system call, which returns a file descriptor to the epoll instance. The signature of epoll_create is as follows:
#include <sys/epoll.h>
int epoll_create(int size);
The size argument is an indication to the kernel about the number of file descriptors a process wants to monitor, which helps the kernel to decide the size of the epoll instance. Since Linux 2.6.8, this argument is ignored (although it must still be greater than zero) because the epoll data structure dynamically resizes as file descriptors are added or removed from it.
The epoll_create system call returns a file descriptor to the newly created epoll kernel data structure. The calling process can then use this file descriptor to add, remove or modify other file descriptors it wants to monitor for I/O to the epoll instance.
There is another system call epoll_create1 which is defined as follows:
int epoll_create1(int flags);
The flags argument can either be 0 or EPOLL_CLOEXEC.
When set to 0, epoll_create1 behaves the same way as epoll_create.
When the EPOLL_CLOEXEC flag is set, the descriptor is marked close-on-exec: if the current process, or any child process it forks, later calls exec, the kernel closes the epoll descriptor at that point, so the newly executed program won’t have access to the epoll instance.
It’s important to note that the file descriptor associated with the epoll instance needs to be released with a close() system call. Multiple processes might hold a descriptor to the same epoll instance since, for example, a fork duplicates the descriptor to the epoll instance in the child process (and, without the EPOLL_CLOEXEC flag, the descriptor also survives a subsequent exec). When all of these processes have relinquished their descriptor to the epoll instance (by either calling close() or by exiting), the kernel destroys the epoll instance.
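The lifecycle described above (create, optionally mark close-on-exec, release with close()) can be sketched as follows. The helper names create_epoll_instance, has_cloexec and destroy_epoll_instance are hypothetical and exist only for this illustration; the underlying calls are epoll_create1, fcntl and close.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance whose descriptor is
 * closed automatically on exec. Returns the epoll file descriptor,
 * or -1 with errno set. */
int create_epoll_instance(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Hypothetical helper: report whether fd carries the close-on-exec flag.
 * Returns 1 if set, 0 if not set, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical helper: drop this process's descriptor to the epoll
 * instance. The kernel frees the instance once no descriptor to it
 * remains open in any process. */
int destroy_epoll_instance(int epfd)
{
    return close(epfd);
}
```

A descriptor obtained from epoll_create1(0) would report 0 from has_cloexec, while one obtained from create_epoll_instance reports 1.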
A process can add file descriptors it wants monitored to the epoll instance by calling epoll_ctl. All the file descriptors registered with an epoll instance are collectively called an epoll set or the interest list.
In the above diagram, process 483 has registered file descriptors fd1, fd2, fd3, fd4 and fd5 with the epoll instance. This is the interest list or the epoll set of that particular epoll instance. Subsequently, when any of the file descriptors registered become ready for I/O, then they are considered to be in the ready list.
The ready list is a subset of the interest list.
The signature of the epoll_ctl syscall is as follows:
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
epfd — is the file descriptor returned by epoll_create which identifies the epoll instance in the kernel.
fd — is the file descriptor we want to add to the epoll list/interest list.
op — refers to the operation to be performed on the file descriptor fd. In general, three operations are supported:
— Register fd with the epoll instance (EPOLL_CTL_ADD) and get notified about events that occur on fd
— Delete/deregister fd from the epoll instance (EPOLL_CTL_DEL). This would mean that the process would no longer get any notifications about events on that file descriptor. If a file descriptor has been added to multiple epoll instances, then closing it (once no duplicates of the descriptor remain open) will remove it from all of the epoll interest lists to which it was added.
— Modify the events fd is monitoring (EPOLL_CTL_MOD)
event — is a pointer to a structure called epoll_event which stores the event we actually want to monitor fd for.
The first field events of the epoll_event structure is a bitmask that indicates which events fd is being monitored for.
For example, if fd is a socket, we might want to monitor it for the arrival of new data on the socket buffer (EPOLLIN). We might also want to monitor fd for edge-triggered notifications, which is done by OR-ing EPOLLET with EPOLLIN. We might also want to be notified of a registered event on fd only once and stop monitoring fd for subsequent occurrences of that event. This can be accomplished by OR-ing the other flags (EPOLLET, EPOLLIN) we want to set for descriptor fd with the flag for only-once notification delivery, EPOLLONESHOT. All possible flags can be found in the man page.
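The second field of the epoll_event struct, data, is a union that holds user data (for example the file descriptor itself or a pointer); the kernel hands it back unchanged when epoll_wait reports an event for fd.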
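The three epoll_ctl operations and the two fields of epoll_event can be sketched as below. The helper names interest_add, interest_modify and interest_remove are hypothetical; storing fd in data.fd is one common choice for the union, not the only one.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical helper: register fd with the epoll instance. */
int interest_add(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = events;   /* first field: bitmask, e.g. EPOLLIN | EPOLLET */
    ev.data.fd = fd;      /* second field: user data returned by epoll_wait */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical helper: replace the event mask of an already registered fd. */
int interest_modify(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev;
    memset(&ev, 0, sizeof ev);
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Hypothetical helper: deregister fd from the epoll instance. */
int interest_remove(int epfd, int fd)
{
    /* The event argument is ignored for EPOLL_CTL_DEL. A dummy struct is
     * passed because kernels before 2.6.9 rejected a NULL pointer here. */
    struct epoll_event unused;
    memset(&unused, 0, sizeof unused);
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &unused);
}
```

Each helper returns 0 on success and -1 with errno set on failure. Registering the same fd twice fails with EEXIST, and modifying or removing an fd that is not in the interest list fails with ENOENT.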
A thread can be notified of events that happened on the epoll set/interest set of an epoll instance by calling the epoll_wait system call, which blocks until any of the descriptors being monitored becomes ready for I/O.
The signature of epoll_wait is as follows:
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);
epfd — is the file descriptor returned by epoll_create which identifies the epoll instance in the kernel.
evlist — is an array of epoll_event structures. evlist is allocated by the calling process and when epoll_wait returns, this array is modified to indicate information about the subset of file descriptors in the interest list that are in the ready state (this is called the ready list).
maxevents — is the length of the evlist array
timeout — this argument behaves much like the timeout of poll or select. Its value, in milliseconds, specifies how long the epoll_wait system call will block:
— when the timeout is set to 0, epoll_wait does not block but returns immediately after checking which file descriptors in the interest list for epfd are ready
— when timeout is set to -1, epoll_wait will block “forever”. When epoll_wait blocks, the kernel can put the process to sleep until epoll_wait returns. epoll_wait will block until 1) one or more descriptors specified in the interest list for epfd become ready or 2) the call is interrupted by a signal handler
— when timeout is set to a positive value, then epoll_wait will block until 1) one or more descriptors specified in the interest list for epfd become ready or 2) the call is interrupted by a signal handler or 3) the amount of time specified by timeout milliseconds has expired
The return values of epoll_wait are the following:
— if an error (EBADF, EINTR, EFAULT or EINVAL) occurred, then the return code is -1 and errno is set to indicate the error
— if the call timed out before any file descriptor in the interest list became ready, then the return code is 0
— if one or more file descriptors in the interest list became ready, then the return code is a positive integer which indicates the number of ready file descriptors placed in the evlist array. The evlist is then examined to determine which events occurred on which file descriptors.
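The three kinds of return value can be handled along these lines. This is a minimal sketch: collect_ready_fds and MAX_BATCH are hypothetical names, the code assumes data.fd was filled in when each descriptor was registered, and it retries on EINTR with the full timeout again, which is a simplification.

```c
#include <errno.h>
#include <sys/epoll.h>

#define MAX_BATCH 16   /* hypothetical upper bound on events per call */

/* Hypothetical helper: wait for events and copy the ready descriptors
 * into ready_fds. Returns the number of ready descriptors, 0 on timeout,
 * or -1 with errno set on error. */
int collect_ready_fds(int epfd, int *ready_fds, int max_ready, int timeout_ms)
{
    struct epoll_event evlist[MAX_BATCH];
    int n;

    if (max_ready > MAX_BATCH)
        max_ready = MAX_BATCH;

    /* A signal handler may interrupt the call; simply try again. */
    do {
        n = epoll_wait(epfd, evlist, max_ready, timeout_ms);
    } while (n == -1 && errno == EINTR);

    /* n > 0: the first n entries of evlist describe the ready list.
     * n == 0 (timeout) or n == -1 (error): the loop body never runs. */
    for (int i = 0; i < n; i++)
        ready_fds[i] = evlist[i].data.fd;

    return n;
}
```

If two pipes are registered and only one of them has data to read, the helper returns 1 and reports only that pipe's read end, which illustrates that the ready list is a subset of the interest list.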