Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
Welcome to my github: https://github.com/gaoxiangnumber1
63.1 Overview
- Traditional blocking I/O model: A process performs I/O on one file descriptor at a time, and each I/O system call blocks until the data is transferred.
- Disk files are a special case. The kernel employs the buffer cache to speed disk I/O requests.
- A write() to a disk returns as soon as the requested data has been transferred to the kernel buffer cache, rather than waiting until the data is written to disk(unless O_SYNC flag was specified when opening the file).
- A read() transfers data from the buffer cache to a user buffer, and if the required data is not in the buffer cache, then the kernel puts the process to sleep while a disk read is performed.
- Some applications need to be able to do one or both of the following:
- Check whether I/O is possible on a file descriptor without blocking if it is not possible.
- Monitor multiple file descriptors to see if I/O is possible on any of them.
- Three techniques partially address these needs: nonblocking I/O, and the use of multiple processes or threads.
- If we place a file descriptor in nonblocking mode by enabling the O_NONBLOCK open file status flag, then an I/O system call that can’t be immediately completed returns an error instead of blocking. Nonblocking I/O can be employed with pipes, FIFOs, sockets, terminals, pseudo-terminals, and some other types of devices. Nonblocking I/O allows us to periodically check(“poll”) whether I/O is possible on a file descriptor.
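- For example, a minimal sketch of enabling O_NONBLOCK on an already-open descriptor and then polling it with read(). Here fd is assumed to be already open and buf an existing char array; error checking is abbreviated:
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
/* Inside some function, with fd already open and buf an existing char array: */
int flags = fcntl(fd, F_GETFL);             /* Fetch current open file status flags */
fcntl(fd, F_SETFL, flags | O_NONBLOCK);     /* Enable nonblocking mode */
ssize_t numRead = read(fd, buf, sizeof(buf));
if(numRead == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
{
    /* No input available yet; do other work and poll again later */
}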
- If we don’t want a process to block when performing I/O on a file descriptor, we can create a new process to perform the I/O. The parent process can then carry on to perform other tasks, while the child process blocks until the I/O is complete. If we need to handle I/O on multiple file descriptors, we can create one child for each descriptor. The problems are expense and complexity. Creating and maintaining processes places a load on the system, and the child processes will need to use some form of IPC to inform the parent about the status of I/O operations.
- Using multiple threads instead of processes is less demanding of resources, but the threads will probably still need to communicate information to one another about the status of I/O operations, and the programming can be complex, especially if we are using thread pools to minimize the number of threads used to handle large numbers of simultaneous clients.(One place where threads can be useful is if the application needs to call a third-party library that performs blocking I/O. An application can avoid blocking in this case by making the library call in a separate thread.)
- Because of the limitations of both nonblocking I/O and the use of multiple threads or processes, one of the following alternatives is preferable:
- I/O multiplexing allows a process to simultaneously monitor multiple file descriptors to find out whether I/O is possible on any of them. select() and poll() perform I/O multiplexing.
- Signal-driven I/O is a technique whereby a process requests that the kernel send it a signal when input is available or data can be written on a specified file descriptor. The process can then carry on performing other activities, and is notified when I/O becomes possible via receipt of the signal. When monitoring large numbers of file descriptors, signal-driven I/O provides better performance than select() and poll().
- epoll is a Linux-specific feature.
Like the I/O multiplexing APIs, epoll allows a process to monitor multiple file descriptors to see if I/O is possible on any of them.
Like signal-driven I/O, epoll provides better performance when monitoring large numbers of file descriptors.
- I/O multiplexing, signal-driven I/O, and epoll are all methods of achieving the same result: monitoring one or several file descriptors simultaneously to see if they are ready to perform I/O(to be precise, to see whether an I/O system call could be performed without blocking). The transition of a file descriptor into a ready state is triggered by some type of I/O event(the arrival of input, the completion of a socket connection and so on). None of these techniques performs I/O. They merely tell us that a file descriptor is ready.
- One I/O model that we don’t describe in this chapter is POSIX asynchronous I/O(AIO). POSIX AIO allows a process to queue an I/O operation to a file and then later be notified when the operation is complete.
Advantage: The initial I/O call returns immediately, so that the process is not tied up waiting for data to be transferred to the kernel or for the operation to complete. This allows the process to perform other tasks in parallel with the I/O(which may include queuing further I/O requests).
Which technique?
- select() and poll() are standard interfaces that have been present on UNIX for many years.
- Advantage: portability.
- Disadvantage: they don’t scale well when monitoring large numbers(hundreds or thousands) of file descriptors.
- Advantage of epoll: it allows an application to efficiently monitor large numbers of file descriptors.
Disadvantage: it is Linux-specific.
- Signal-driven I/O allows an application to efficiently monitor large numbers of file descriptors. But epoll provides advantages over signal-driven I/O:
- Avoid the complexities of dealing with signals.
- Ability to specify the kind of monitoring that we want to perform(e.g., ready for reading/writing).
- Ability to select either level-triggered or edge-triggered notification(Section 63.1.1).
- select() and poll() are more portable, while signal-driven I/O and epoll deliver better performance. For some applications, it is worthwhile writing an abstract software layer for monitoring file descriptor events. With such a layer, portable programs can employ epoll on Linux, and fall back to the use of select() or poll() on other systems.
- libevent is a software layer that provides an abstraction for monitoring file descriptor events. It can employ any of the techniques: select(), poll(), signal-driven I/O, or epoll, as well as the Solaris specific /dev/poll interface or the BSD kqueue interface.
63.1.1 Level-Triggered and Edge-Triggered Notification
- Level-triggered notification: A file descriptor is considered to be ready if it is possible to perform an I/O system call without blocking.
- Edge-triggered notification: Notification is provided if there is I/O activity(e.g., new input) on a file descriptor since it was last monitored.
- epoll can employ both level-triggered notification(the default) and edge-triggered notification.
How does the notification model affect the way we design a program?
- When we employ level-triggered notification, we can check the readiness of a file descriptor at any time. This means that when we determine that a file descriptor is ready(e.g., it has input available), we can perform I/O on the descriptor, and then repeat the monitoring operation to check if the descriptor is still ready(e.g., it still has more input available), in which case we can perform more I/O, and so on.
Because the level-triggered model allows us to repeat the I/O monitoring operation at any time, it is not necessary to perform as much I/O as possible(e.g., read as many bytes as possible) on the file descriptor(or even perform any I/O at all) each time we are notified that a file descriptor is ready.
- When we employ edge-triggered notification, we receive notification only when an I/O event occurs. We don’t receive any further notification until another I/O event occurs. Furthermore, when an I/O event is notified for a file descriptor, we usually don’t know how much I/O is possible(e.g., how many bytes are available for reading). Therefore, programs that employ edge-triggered notification are usually designed according to the following rules:
- After notification of an I/O event, the program should(at some point) perform as much I/O as possible(e.g., read as many bytes as possible) on the corresponding file descriptor. If the program fails to do this, then it might miss the opportunity to perform some I/O, because it would not be aware of the need to operate on the file descriptor until another I/O event occurred. This could lead to spurious data loss or blockages in a program.
We said “at some point” because sometimes it may not be desirable to perform all of the I/O immediately after we determine that the file descriptor is ready. The problem is that we may starve other file descriptors of attention if we perform a large amount of I/O on one file descriptor(Section 63.4.6).
- If the program employs a loop to perform as much I/O as possible on the file descriptor, and the descriptor is marked as blocking, then eventually an I/O system call will block when no more I/O is possible. For this reason, each monitored file descriptor is normally placed in nonblocking mode, and after notification of an I/O event, I/O operations are performed repeatedly until the relevant system call(e.g., read() or write()) fails with the error EAGAIN or EWOULDBLOCK.
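- A sketch of the resulting canonical edge-triggered read loop (fd is assumed to be an open descriptor already placed in nonblocking mode):
#include <errno.h>
#include <unistd.h>
/* Inside some function, after a notification that fd is ready for reading: */
char buf[4096];
for(;;)
{
    ssize_t numRead = read(fd, buf, sizeof(buf));
    if(numRead > 0)
    {
        /* Process the numRead bytes just read, then loop to read more */
        continue;
    }
    if(numRead == 0)
    {
        break;      /* End-of-file */
    }
    if(errno == EAGAIN || errno == EWOULDBLOCK)
    {
        break;      /* Input drained; wait for the next notification */
    }
    break;          /* Some other error; handle it appropriately */
}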
63.1.2 Employing Nonblocking I/O with Alternative I/O Models
- Nonblocking I/O(the O_NONBLOCK flag) is often used in conjunction with the I/O models described in this chapter. Examples of why this can be useful are:
- As explained in the previous section, nonblocking I/O is usually employed in conjunction with I/O models that provide edge-triggered notification of I/O events.
- If multiple processes (or threads) are performing I/O on the same open file description, then, from a particular process’s point of view, a descriptor’s readiness may change between the time the descriptor was notified as being ready and the time of the subsequent I/O call. Consequently, a blocking I/O call could block, thus preventing the process from monitoring other file descriptors. (This can occur for all of the I/O models that we describe in this chapter, regardless of whether they employ level-triggered or edge-triggered notification.)
- Even after a level-triggered API such as select() or poll() informs us that a file descriptor for a stream socket is ready for writing, if we write a large enough block of data in a single write() or send(), then the call will nevertheless block.
- In rare cases, level-triggered APIs such as select() and poll() can return spurious readiness notifications—they can falsely inform us that a file descriptor is ready. This could be caused by a kernel bug or be expected behavior in an uncommon scenario.
- Section 16.6 of UNP describes one example of spurious readiness notifications on BSD systems for a listening socket. If a client connects to a server’s listening socket and then resets the connection, a select() performed by the server between these two events will indicate the listening socket as being readable, but a subsequent accept() that is performed after the client’s reset will block.
63.2 I/O Multiplexing
- I/O multiplexing allows us to simultaneously monitor multiple file descriptors to see if I/O is possible on any of them. We can perform I/O multiplexing using select() or poll() to monitor file descriptors for regular files, terminals, pseudo-terminals, pipes, FIFOs, sockets, and some types of character devices.
63.2.1 The select() System Call
#include <sys/time.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
Return: number of ready file descriptors, 0 on timeout, -1 on error
- nfds, readfds, writefds, and exceptfds arguments specify the file descriptors that select() is to monitor.
- timeout can be used to set an upper limit on the time for which select() will block.
File descriptor sets
- readfds, writefds, and exceptfds are pointers to file descriptor sets that use the data type fd_set. These arguments are used as follows:
- readfds is the set of file descriptors to be tested to see if input is possible;
- writefds is the set of file descriptors to be tested to see if output is possible;
- exceptfds is the set of file descriptors to be tested to see if an exceptional condition has occurred. An exceptional condition occurs in just two circumstances on Linux:
-1- A state change occurs on a pseudo-terminal slave connected to a master that is in packet mode(Section 64.5).
-2- Out-of-band data is received on a stream socket(Section 61.13.1).
- fd_set data type is implemented as a bit mask. All manipulation of file descriptor sets is done via four macros: FD_ZERO(), FD_SET(), FD_CLR(), and FD_ISSET().
#include <sys/time.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>
void FD_ZERO(fd_set *fdset);
void FD_SET(int fd, fd_set *fdset);
void FD_CLR(int fd, fd_set *fdset);
int FD_ISSET(int fd, fd_set *fdset);
Return: true(1) if fd is in fdset, or false(0) otherwise
- FD_ZERO() initializes the set pointed to by fdset to be empty.
FD_SET() adds the file descriptor fd to the set pointed to by fdset.
FD_CLR() removes the file descriptor fd from the set pointed to by fdset.
FD_ISSET() returns true if the file descriptor fd is a member of the set pointed to by fdset.
- A file descriptor set has a maximum size FD_SETSIZE, which is 1024 on Linux. If we want to change this limit, we must modify the definition in the glibc header files. If we need to monitor large numbers of descriptors, then using epoll is preferable to the use of select().
- readfds, writefds, and exceptfds are all value-result. Before the call to select(), the fd_set structures pointed to by these arguments must be initialized(using FD_ZERO() and FD_SET()) to contain the set of file descriptors of interest. select() modifies each of these structures and on return, they contain the set of file descriptors that are ready. The structures can then be examined using FD_ISSET().
- If we are not interested in a particular class of events, then the corresponding fd_set argument can be specified as NULL.
- nfds is set one greater than the highest file descriptor number included in any of the three file descriptor sets. This argument allows select() to be efficient since the kernel knows not to check whether file descriptor numbers higher than this value are part of each file descriptor set.
The timeout argument
- timeout can be specified as
- NULL: select() blocks indefinitely;
- A pointer to a timeval structure.
struct timeval
{
    time_t tv_sec;          /* Seconds */
    suseconds_t tv_usec;    /* Microseconds */
};
- If both fields of timeout are 0, then select() doesn’t block; it polls the specified file descriptors to see which ones are ready and returns immediately. Otherwise, timeout specifies an upper limit on the time for which select() is to wait.
- Although the timeval structure affords microsecond precision, the accuracy of the call is limited by the granularity of the software clock(Section 10.6).
- When timeout is NULL, or points to a structure containing nonzero fields, select() blocks until one of the following occurs:
- at least one of the file descriptors specified in readfds, writefds, or exceptfds becomes ready;
- the call is interrupted by a signal handler;
- the amount of time specified by timeout has passed.
- On Linux, if select() returns because one or more file descriptors became ready, and if timeout was non-NULL, then select() updates the structure to which timeout points to indicate how much time remained until the call would have timed out. Most other UNIX systems don’t modify this structure. Portable applications that employ select() within a loop should always ensure that the structure pointed to by timeout is initialized before each select(), and should ignore the information returned in the structure after the call.
- On Linux, if select() is interrupted by a signal handler(so that it fails with the error EINTR), then the structure is modified to indicate the time remaining until a timeout would have occurred.
- If we use the Linux-specific personality() system call to set a personality that includes the STICKY_TIMEOUTS personality bit, then select() doesn’t modify the structure pointed to by timeout.
Return value from select()
- -1 indicates that an error occurred. Possible errors include EBADF and EINTR. EBADF indicates that one of the file descriptors in readfds, writefds, or exceptfds is invalid(e.g., not currently open).
EINTR indicates that the call was interrupted by a signal handler.(select() is never automatically restarted if interrupted by a signal handler.)
- 0 means that the call timed out before any file descriptor became ready. In this case, each of the returned file descriptor sets will be empty.
- A positive return value indicates that one or more file descriptors is ready. The return value is the number of ready descriptors. In this case, each of the returned file descriptor sets must be examined(using FD_ISSET()) in order to find out which I/O events occurred. If the same file descriptor is specified in more than one of readfds, writefds, and exceptfds, it is counted multiple times if it is ready for more than one event.
Example program
- The first command-line argument specifies the timeout for select(), in seconds. If a hyphen(-) is specified here, then select() is called with a timeout of NULL, meaning block indefinitely. Each of the remaining command-line arguments specifies the number of a file descriptor to be monitored, followed by letters indicating the operations for which the descriptor is to be checked. The letters we can specify here are r(ready for read) and w(ready for write).
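- The t_select listing itself is not reproduced in these notes; its core is a select() call along the following lines (a sketch, monitoring a single already-open descriptor fd for reading with a 10-second timeout):
/* Inside some function, with fd already open: */
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(fd, &readfds);
struct timeval timeout;
timeout.tv_sec = 10;        /* Seconds */
timeout.tv_usec = 0;        /* Microseconds */
int ready = select(fd + 1, &readfds, NULL, NULL, &timeout);
if(ready == -1)
{
    /* Handle error */
}
else if(ready == 0)
{
    /* Timed out; no descriptor became ready */
}
else if(FD_ISSET(fd, &readfds))
{
    /* fd is ready for reading */
}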
- First example: make a request to monitor file descriptor 0 for input with a 10-second timeout:
$ ./t_select 10 0r
#Press Enter, so that a line of input is available on file descriptor 0
ready = 1
0: r
timeout after select(): 8.003
$ #Next shell prompt is displayed
- The output shows us that select() determined that file descriptor 0 was ready for reading and the timeout was modified. The final line of output, consisting of just the shell $ prompt, appeared because the t_select program didn’t read the newline character that made file descriptor 0 ready, and so that character was read by the shell, which responded by printing another prompt.
- Next example: monitor file descriptor 0 for input with a timeout of 0 seconds:
$ ./t_select 0 0r
ready = 0
timeout after select(): 0.000
- The select() call returned immediately, and found no file descriptor was ready.
- Next example: monitor two file descriptors: descriptor 0, to see if input is available, and descriptor 1, to see if output is possible. In this case, we specify the timeout as NULL(the first command-line argument is a hyphen):
$ ./t_select - 0r 1w
ready = 1
0:
1: w
- The select() call returned immediately, informing us that output was possible on file descriptor 1.
63.2.2 The poll() System Call
- The difference between select() and poll() lies in how we specify the file descriptors to be monitored.
- select(): provide three sets, each marked to indicate the file descriptors of interest.
- poll(): provide a list of file descriptors, each marked with the set of events of interest.
#include <poll.h>
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
Returns number of ready file descriptors, 0 on timeout, or -1 on error
The pollfd array
- fds lists the file descriptors to be monitored by poll(). This argument is an array of pollfd structures, defined as follows:
struct pollfd
{
    int fd;                 /* File descriptor */
    short events;           /* Requested events bit mask */
    short revents;          /* Returned events bit mask */
};
- nfds specifies the number of items in the fds array. The nfds_t data type is an unsigned integer type.
- The events and revents fields of the pollfd structure are bit masks. The caller initializes events to specify the events to be monitored for the file descriptor fd. Upon return from poll(), revents is set to indicate which of those events actually occurred for this file descriptor.
- Table 63-2 lists the bits that may appear in the events and revents fields.
- The first five bits (POLLIN, POLLRDNORM, POLLRDBAND, POLLPRI, and POLLRDHUP) are concerned with input events.
- The next three bits (POLLOUT, POLLWRNORM, and POLLWRBAND) are concerned with output events.
- The next three bits (POLLERR, POLLHUP, and POLLNVAL) are set in the revents field to return additional information about the file descriptor. If specified in the events field, these three bits are ignored.
- The final bit is unused by poll() on Linux. On systems providing STREAMS devices, POLLMSG indicates that a message containing a SIGPOLL signal has reached the head of the stream.
- It is permissible to specify events as 0 if we are not interested in events on a particular file descriptor. Furthermore, specifying a negative value for the fd field (e.g., negating its value, if it is nonzero) causes the corresponding events field to be ignored and the revents field always to be returned as 0. Either of these techniques can be used to disable monitoring of a single file descriptor, without needing to rebuild the entire fds list.
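- For example, a sketch of disabling and later reenabling entry i by negating its fd field (this works only for nonzero descriptor numbers, so descriptor 0 needs different handling, e.g., setting events to 0):
fds[i].fd = -fds[i].fd;     /* Disable: poll() now ignores this entry */
/* ... intervening calls to poll() skip entry i ... */
fds[i].fd = -fds[i].fd;     /* Reenable the entry */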
Points to note about the Linux implementation of poll()
- Synonymous: POLLIN = POLLRDNORM; POLLOUT = POLLWRNORM
- POLLRDBAND is generally unused: it is ignored in the events field and not set in revents. The only place where it is set is in code implementing the obsolete DECnet networking protocol. There are no circumstances in which POLLWRBAND is set when POLLOUT and POLLWRNORM are not also set.
- POLLRDBAND and POLLWRBAND are meaningful on implementations that provide System V STREAMS(which Linux does not). Under STREAMS, a message can be assigned a nonzero priority, and such messages are queued to the receiver in decreasing order of priority, in a band ahead of normal(priority 0) messages.
- The _XOPEN_SOURCE feature test macro must be defined in order to obtain the definitions of the constants POLLRDNORM, POLLRDBAND, POLLWRNORM, and POLLWRBAND from <poll.h>.
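- The transcript below is from the book’s poll_pipes program (Listing 63-2, not reproduced here), which creates the number of pipes given by its first command-line argument, writes a byte to a randomly chosen pipe the number of times given by its second argument, and then uses poll() to discover which read ends are readable. Its core poll() usage looks roughly like this (a sketch; pfds, pipeFds, and numPipes are assumed to be set up by the caller):
#include <stdio.h>
#include <poll.h>
/* Sketch: monitor the read ends of numPipes pipes with poll().
   pipeFds[j][0] is the read end of pipe j. */
for(int j = 0; j < numPipes; j++)
{
    pfds[j].fd = pipeFds[j][0];
    pfds[j].events = POLLIN;
}
int ready = poll(pfds, numPipes, 0);        /* Timeout of 0: nonblocking check */
printf("poll() returned: %d\n", ready);
for(int j = 0; j < numPipes; j++)
{
    if(pfds[j].revents & POLLIN)
    {
        printf("Readable: %d\n", pfds[j].fd);
    }
}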
$ ./poll_pipes 10 3
Writing to fd: 4 (read fd: 3)
Writing to fd: 14 (read fd: 13)
Writing to fd: 14 (read fd: 13)
poll() returned: 2
Readable: 3
Readable: 13
63.2.3 When Is a File Descriptor Ready?
- SUSv3 says that a file descriptor(with O_NONBLOCK clear) is considered to be ready if a call to an I/O function would not block, regardless of whether the function would actually transfer data. select() and poll() tell us whether an I/O operation would not block, rather than whether it would successfully transfer data.
- We show this information in tables containing two columns:
- select() column indicates whether a file descriptor is marked as readable(r), writable(w), or having an exceptional condition(x).
- poll() column indicates the bit(s) returned in the revents field. In these tables, we omit mention of POLLRDNORM, POLLWRNORM, POLLRDBAND, and POLLWRBAND because they convey no useful information beyond that provided by POLLIN, POLLOUT, POLLHUP, and POLLERR.
Regular files
- File descriptors that refer to regular files are always marked as readable and writable by select(), and returned with POLLIN and POLLOUT set in revents for poll(), for the following reasons:
- A read() will always immediately return data, end-of-file, or an error(e.g., the file was not opened for reading).
- A write() will always immediately transfer data or fail with some error.
Terminals and pseudo-terminals
- When one half of a pseudo-terminal pair is closed, the revents setting returned by poll() for the other half of the pair depends on the implementation. On Linux, at least the POLLHUP flag is set.
Pipes and FIFOs
- Table 63-4 summarizes the details for the read end of a pipe or FIFO. The “Data in pipe?” column indicates whether the pipe has at least 1 byte of data available for reading. In this table, we assume that POLLIN was specified in the events field for poll().
- On some other UNIX implementations, if the write end of a pipe is closed, instead of returning with POLLHUP set, poll() returns with the POLLIN bit set(since a read() will return immediately with end-of-file). Portable applications should check to see if either bit is set in order to know if a read() will block.
- Table 63-5 summarizes the details for the write end of a pipe. In this table, we assume that POLLOUT was specified in the events field for poll().
- The “Space for PIPE_BUF bytes?” column indicates whether the pipe has room to atomically write PIPE_BUF bytes without blocking. This is the criterion on which Linux considers a pipe ready for writing. Some other UNIX implementations use the same criterion; others consider a pipe writable if even a single byte can be written. (In Linux 2.6.10 and earlier, the capacity of a pipe is the same as PIPE_BUF, which means that a pipe is considered unwritable if it contains even a single byte of data.)
- On some other UNIX implementations, if the read end of a pipe is closed, instead of returning with POLLERR set, poll() returns with either the POLLOUT bit or the POLLHUP bit set. Portable applications need to check to see if any of these bits is set to determine if a write() will block.
Sockets
- Table 63-6 summarizes the behavior of select() and poll() for sockets. This table covers just the common cases, not all possible scenarios.
For the poll() column, we assume that events was specified as (POLLIN | POLLOUT | POLLPRI).
For the select() column, we assume that the file descriptor is being tested to see if input is possible, output is possible, or an exceptional condition occurred(i.e., the file descriptor is specified in all three sets passed to select()).
- The Linux poll() behavior for UNIX domain sockets after a peer close() differs from that shown in Table 63-6. poll() additionally returns POLLHUP in revents.
- The Linux-specific POLLRDHUP flag: This flag(actually in the form of EPOLLRDHUP) is designed primarily for use with the edge-triggered mode of epoll(Section 63.4). It is returned when the remote end of a stream socket connection has shut down the writing half of the connection. The use of this flag allows an application that uses the epoll edge-triggered interface to employ simpler code to recognize a remote shutdown.(The alternative is for the application to note that the POLLIN flag is set and then perform a read(), which indicates the remote shutdown with a return of 0.)
63.2.4 Comparison of select() and poll()
Implementation details
- Within the Linux kernel, select() and poll() both employ the same set of kernel-internal poll routines. These poll routines are distinct from the poll() system call itself. Each routine returns information about the readiness of a single file descriptor. This readiness information takes the form of a bit mask whose values correspond to the bits returned in the revents field by the poll() system call(Table 63-2).
- The implementation of the poll() system call involves calling the kernel poll routine for each file descriptor and placing the resulting information in the corresponding revents field.
- To implement select(), a set of macros is used to convert the information returned by the kernel poll routines into the corresponding event types returned by select():
#define POLLIN_SET (POLLRDNORM | POLLRDBAND | POLLIN | POLLHUP | POLLERR)
#define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR)
#define POLLEX_SET (POLLPRI)
- These macro definitions reveal the semantic correspondence between the information returned by select() and poll(). (If we look at the select() and poll() columns in the tables in Section 63.2.3, we see that the indications provided by each system call are consistent with the above macros.) The only additional information we need to complete the picture is that poll() returns POLLNVAL in the revents field if one of the monitored file descriptors was closed at the time of the call, while select() returns -1 with errno set to EBADF.
API differences
- The use of the fd_set data type places an upper limit(FD_SETSIZE) on the range of file descriptors that can be monitored by select(). By default, this limit is 1024 on Linux, and changing it requires recompiling the application. By contrast, poll() places no intrinsic limit on the range of file descriptors that can be monitored.
- Because the fd_set arguments of select() are value-result, we must reinitialize them if making repeated select() calls from within a loop. By using separate events(input) and revents(output) fields, poll() avoids this requirement.
- The timeout precision afforded by select()(microseconds) is greater than that afforded by poll()(milliseconds).(The accuracy of the timeouts of both of these system calls is nevertheless limited by the software clock granularity.)
- If one of the file descriptors being monitored was closed, then poll() informs us exactly which one, via the POLLNVAL bit in the corresponding revents field. By contrast, select() merely returns -1 with errno set to EBADF, leaving us to determine which file descriptor is closed by checking for an error when performing an I/O system call on the descriptor. However, this is typically not an important difference, since an application can usually keep track of which file descriptors it has closed.
Portability
- Both interfaces are standardized by SUSv3 and available on contemporary implementations.
- However, there is some variation in the behavior of poll() across implementations, as noted in Section 63.2.3.
Performance
- The performance of poll() and select() is similar if either of the following is true:
- The range of file descriptors to be monitored is small(i.e., the maximum file descriptor number is low).
- A large number of file descriptors are being monitored, but they are densely packed(i.e., most or all of the file descriptors from 0 up to some limit are being monitored).
- The performance of select() and poll() can differ noticeably if the set of file descriptors to be monitored is sparse; that is, the maximum file descriptor number, N, is large, but only one or a few descriptors in the range 0 to N are being monitored. In this case, poll() can perform better than select().
- We can understand the reasons for this by considering the arguments passed to the two system calls.
- With select(), we pass one or more file descriptor sets and an integer, nfds, which is one greater than the maximum file descriptor to be examined in each set. The nfds argument has the same value, regardless of whether we are monitoring all file descriptors in the range 0 to(nfds - 1) or only the descriptor(nfds - 1). In both cases, the kernel must examine nfds elements in each set in order to check exactly which file descriptors are to be monitored.
- When using poll(), we specify only the file descriptors of interest to us, and the kernel checks only those descriptors.
63.2.5 Problems with select() and poll()
- select() and poll() suffer problems when monitoring a large number of file descriptors:
- On each call to select() or poll(), the kernel must check all of the specified file descriptors to see if they are ready. When monitoring a large number of file descriptors that are in a densely packed range, the time required for this operation greatly outweighs the time required for the next two operations.
- In each call to select() or poll(), the program must pass a data structure to the kernel describing all of the file descriptors to be monitored, and, after checking the descriptors, the kernel returns a modified version of this data structure to the program.(Furthermore, for select(), we must initialize the data structure before each call.)
For poll(), the size of the data structure increases with the number of file descriptors being monitored, and the task of copying it from user to kernel space and back again consumes a noticeable amount of CPU time when monitoring many file descriptors.
For select(), the size of the data structure is fixed by FD_SETSIZE, regardless of the number of file descriptors being monitored.
- After the call to select() or poll(), the program must inspect every element of the returned data structure to see which file descriptors are ready.
- The consequence of the above points is that the CPU time required by select() and poll() increases with the number of file descriptors being monitored(Section 63.4.5). This creates problems for programs that monitor large numbers of file descriptors.
- The poor scaling performance of select() and poll() stems from a simple limitation of these APIs: typically, a program makes repeated calls to monitor the same set of file descriptors; however, the kernel doesn’t remember the list of file descriptors to be monitored between successive calls.
- Signal-driven I/O and epoll are both mechanisms that allow the kernel to record a persistent list of file descriptors in which a process is interested. Doing this eliminates the performance scaling problems of select() and poll(), yielding solutions that scale according to the number of I/O events that occur, rather than according to the number of file descriptors being monitored. So, signal-driven I/O and epoll provide superior performance when monitoring large numbers of file descriptors.
63.3 Signal-Driven I/O
- With I/O multiplexing, a process makes a system call(select() or poll()) in order to check whether I/O is possible on a file descriptor.
With signal-driven I/O, a process requests that the kernel send it a signal when I/O is possible on a file descriptor. The process can then perform any other activity until I/O is possible, at which time the signal is delivered to the process.
- To use signal-driven I/O, a program performs the following steps:
- Establish a handler for the signal delivered by the signal-driven I/O mechanism. By default, this notification signal is SIGIO.
- Set the owner of the file descriptor—that is, the process or process group that is to receive signals when I/O is possible on the file descriptor. Typically, we make the calling process the owner. The owner is set using an fcntl() F_SETOWN operation of the following form:
fcntl(fd, F_SETOWN, pid);
- Enable nonblocking I/O by setting the O_NONBLOCK open file status flag.
- Enable signal-driven I/O by turning on the O_ASYNC open file status flag. This can be combined with the previous step, since they both require the use of the fcntl() F_SETFL operation(Section 5.3), as in the following example:
flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);
- The calling process can now perform other tasks. When I/O becomes possible, the kernel generates a signal for the process and invokes the signal handler established in step 1.
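- Putting the steps together, a minimal sketch of enabling signal-driven I/O on standard input (the handler only sets a flag, which the main program then polls, as in the demo program described below):
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>

static volatile sig_atomic_t gotSigio = 0;

static void sigioHandler(int sig)
{
    gotSigio = 1;   /* Async-signal-safe: just record that I/O is possible */
}

static void enableSigio(void)
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sa.sa_handler = sigioHandler;
    sigaction(SIGIO, &sa, NULL);                /* Step 1: establish handler first */
    fcntl(STDIN_FILENO, F_SETOWN, getpid());    /* Step 2: set the owner */
    int flags = fcntl(STDIN_FILENO, F_GETFL);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_ASYNC | O_NONBLOCK);     /* Steps 3 and 4 */
}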
- Signal-driven I/O provides edge-triggered notification(Section 63.1.1). This means that once the process has been notified that I/O is possible, it should perform as much I/O(e.g., read as many bytes) as possible. Assuming a nonblocking file descriptor, this means executing a loop that performs I/O system calls until a call fails with the error EAGAIN or EWOULDBLOCK.
- Signal-driven I/O can be employed with file descriptors for sockets, terminals, pseudo-terminals, pipes, FIFOs, inotify file descriptors and certain other types of devices.
- The program in Listing 63-3 performs the steps described above for enabling signal-driven I/O on standard input, and then places the terminal in cbreak mode (Section 62.6.3), so that input is available a character at a time. The program then enters an infinite loop, performing the “work” of incrementing a variable, cnt, while waiting for input to become available. Whenever input becomes available, the SIGIO handler sets a flag, gotSigio, that is monitored by the main program. When the main program sees that this flag is set, it reads all available input characters and prints them along with the current value of cnt.
- If a hash character (#) is read in the input, the program terminates. Example output when we type the x character a number of times, followed by a hash (#) character:
$ ./demo_sigio
cnt=37; read x
cnt=100; read x
cnt=159; read x
cnt=223; read x
cnt=288; read x
cnt=333; read #
Establish the signal handler before enabling signal-driven I/O
- Because the default action of SIGIO is to terminate the process, we should enable the handler for SIGIO before enabling signal-driven I/O on a file descriptor. If we enable signal-driven I/O before establishing the SIGIO handler, then there is a time window during which, if I/O becomes possible, delivery of SIGIO will terminate the process.
Setting the file descriptor owner
- We set the file descriptor owner using an fcntl() operation of the following form:
fcntl(fd, F_SETOWN, pid);
- We may specify that either a single process or all of the processes in a process group are to be signaled when I/O is possible on the file descriptor. If pid is positive, it is interpreted as a process ID. If pid is negative, its absolute value specifies a process group ID.
- Typically, pid is specified as the process ID of the calling process(so that the signal is sent to the process that has the file descriptor open). It is possible to specify another process or a process group(e.g., the caller’s process group), and signals will be sent to that target, subject to the permission checks described in Section 20.5, where the sending process is considered to be the process that does the F_SETOWN.
- The fcntl() F_GETOWN operation returns the ID of the process or process group that is to receive signals when I/O is possible on a specified file descriptor:
id = fcntl(fd, F_GETOWN);
if(id == -1)
Exit("fcntl");
- A process group ID is returned as a negative number by this call.
- A limitation in the system call convention employed on some Linux architectures (e.g., x86) means that if a file descriptor is owned by a process group ID less than 4096, then, instead of returning that ID as a negative function result from the fcntl() F_GETOWN operation, glibc misinterprets it as a system call error. Consequently, the fcntl() wrapper function returns -1, and errno contains the (positive) process group ID. This is a consequence of the fact that the kernel system call interface indicates errors by returning a negative errno value as a function result, and there are a few cases where it is necessary to distinguish such results from a successful call that returns a valid negative value.
- To make this distinction, glibc interprets negative system call returns in the range -1 to -4095 as indicating an error, copies this(absolute) value into errno, and returns -1 as the function result for the application program. This technique is generally sufficient for dealing with the few system call service routines that can return a valid negative result; the fcntl() F_GETOWN operation is the only practical case where it fails. This limitation means that an application that uses process groups to receive “I/O possible” signals(which is unusual) can’t reliably use F_GETOWN to discover which process group owns a file descriptor.
- Since glibc version 2.11, the fcntl() wrapper function fixes the problem of F_GETOWN with process group IDs less than 4096. It does this by implementing F_GETOWN in user space using the F_GETOWN_EX operation(Section 63.3.2), which is provided by Linux 2.6.32 and later.
63.3.1 When Is “I/O Possible” Signaled?
Terminals and pseudo-terminals
- For terminals and pseudo-terminals, a signal is generated whenever new input becomes available, even if previous input has not yet been read.
- “Input possible” is also signaled if an end-of-file condition occurs on a terminal(but not on a pseudo-terminal).
- There is no “output possible” signaling for terminals. A terminal disconnect is also not signaled. Linux provides “output possible” signaling for the slave side of a pseudo-terminal. This signal is generated whenever input is consumed on the master side of the pseudo-terminal.
Pipes and FIFOs
- For the read end of a pipe or FIFO, a signal is generated in these circumstances:
- Data is written to the pipe(even if there was already unread input available).
- The write end of the pipe is closed.
- For the write end of a pipe or FIFO, a signal is generated in these circumstances:
- A read from the pipe increases the amount of free space in the pipe so that it is now possible to write PIPE_BUF bytes without blocking.
- The read end of the pipe is closed.
Sockets
- Signal-driven I/O works for datagram sockets in both the UNIX and the Internet domains. A signal is generated in the following circumstances:
- An input datagram arrives on the socket(even if there were already unread datagrams waiting to be read).
- An asynchronous error occurs on the socket.
- Signal-driven I/O works for stream sockets in both the UNIX and the Internet domains. A signal is generated in the following circumstances:
- A new connection is received on a listening socket.
- A TCP connect() request completes; that is, the active end of a TCP connection entered the ESTABLISHED state. The analogous condition is not signaled for UNIX domain sockets.
- New input is received on the socket(even if there was already unread input available).
- The peer closes its writing half of the connection using shutdown(), or closes its socket altogether using close().
- Output is possible on the socket(e.g., space has become available in the socket send buffer).
- An asynchronous error occurs on the socket.
inotify file descriptors
- A signal is generated when the inotify file descriptor becomes readable, that is, when an event occurs for one of the files monitored by the inotify file descriptor.
63.3.2 Refining the Use of Signal-Driven I/O
- In applications that need to simultaneously monitor large numbers(i.e., thousands) of file descriptors, signal-driven I/O can provide better performance by comparison with select() and poll(). The kernel “remembers” the list of file descriptors to be monitored, and signals the program only when I/O events actually occur on those descriptors. So, the performance of a program employing signal-driven I/O scales according to the number of I/O events that occur, rather than the number of file descriptors being monitored.
- To take advantage of signal-driven I/O, we must perform two steps:
- Employ a Linux-specific fcntl() operation, F_SETSIG, to specify a real-time signal that should be delivered instead of SIGIO when I/O is possible on a file descriptor.
- Specify the SA_SIGINFO flag when using sigaction() to establish the handler for the real-time signal employed in the previous step(Section 21.4).
- The fcntl() F_SETSIG operation specifies an alternative signal that should be delivered instead of SIGIO when I/O is possible on a file descriptor:
if(fcntl(fd, F_SETSIG, sig) == -1)
Exit("fcntl");
- The F_GETSIG operation performs the converse of F_SETSIG, retrieving the signal currently set for a file descriptor:
sig = fcntl(fd, F_GETSIG);
if(sig == -1)
Exit("fcntl");
- In order to obtain the definitions of the F_SETSIG and F_GETSIG constants from <fcntl.h>, we must define the _GNU_SOURCE feature test macro.
- The Linux-specific fcntl() F_SETOWN_EX operation (available since Linux 2.6.32) is like F_SETOWN, but additionally allows the target of “I/O possible” signals to be specified as a thread, rather than just a process or process group. Its third argument is a pointer to a structure of the following form:
struct f_owner_ex
{
int type;
pid_t pid;
};
- The type field defines the meaning of the pid field, and has one of the following values:
- F_OWNER_PGRP
The pid field specifies the ID of a process group that is to be the target of “I/O possible” signals. Unlike with F_SETOWN, a process group ID is specified as a positive value.
- F_OWNER_PID
The pid field specifies the ID of a process that is to be the target of “I/O possible” signals.
- F_OWNER_TID
The pid field specifies the ID of a thread that is to be the target of “I/O possible” signals. The ID specified in pid is a value returned by clone() or gettid().
- The F_GETOWN_EX operation is the converse of the F_SETOWN_EX operation. It uses the f_owner_ex structure pointed to by the third argument of fcntl() to return the settings defined by a previous F_SETOWN_EX operation.
- Because the F_SETOWN_EX and F_GETOWN_EX operations represent process group IDs as positive values, F_GETOWN_EX doesn’t suffer the problem described earlier for F_GETOWN when using process group IDs less than 4096.
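- A sketch of using F_SETOWN_EX to direct “I/O possible” signals to just the calling thread. Here fd is assumed to be already open; _GNU_SOURCE is required for the F_SETOWN_EX and F_OWNER_TID definitions, and the gettid() wrapper is provided by glibc only since version 2.30 (on older glibc, use syscall(SYS_gettid)):
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
/* Inside some function, with fd already open: */
struct f_owner_ex owner;
owner.type = F_OWNER_TID;
owner.pid = gettid();       /* Target only the calling thread */
if(fcntl(fd, F_SETOWN_EX, &owner) == -1)
{
    /* Handle error */
}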
63.4 The epoll API
- The primary advantages of epoll:
- The performance of epoll scales better than select() and poll() when monitoring large numbers of file descriptors.
- epoll permits either level-triggered or edge-triggered notification. select() and poll() provide only level-triggered notification, and signal-driven I/O provides only edge-triggered notification.
- The performance of epoll and signal-driven I/O is similar. But epoll has advantages over signal-driven I/O:
- We avoid the complexities of signal handling(e.g., signal-queue overflow).
- We have greater flexibility in specifying what kind of monitoring we want to perform (e.g., checking a descriptor for readiness for reading, writing, or both).
- The central data structure of epoll is an epoll instance, which is referred to via an open file descriptor. This file descriptor is not used for I/O; instead, it is a handle for kernel data structures that serve two purposes:
- recording a list of file descriptors that this process has declared an interest in monitoring—the interest list; and
- maintaining a list of file descriptors that are ready for I/O—the ready list. The ready list is a subset of the interest list.
- For each file descriptor monitored by epoll, we can specify a bit mask indicating events that we are interested in knowing about. These bit masks correspond to the bit masks used with poll().
- epoll consists of three system calls:
- epoll_create(): Create an epoll instance and returns a file descriptor referring to the instance.
- epoll_ctl(): Manipulate the interest list associated with an epoll instance. Using epoll_ctl(), we can add a new file descriptor to the list, remove an existing descriptor from the list, and modify the mask that determines which events are to be monitored for a descriptor.
- epoll_wait(): Return items from the ready list associated with an epoll instance.
63.4.1 Creating an epoll Instance: epoll_create()
#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);
Returns file descriptor on success, or -1 on error
- epoll_create() creates a new epoll instance whose interest list is initially empty.
- size specifies the number of file descriptors that we expect to monitor via the epoll instance. This argument is not an upper limit, but rather a hint to the kernel about how to initially dimension internal data structures.
- Since Linux 2.6.8, size is ignored: the kernel dynamically sizes the required data structures without needing the hint. However, size must still be greater than zero, in order to ensure backward compatibility when new epoll applications are run on older kernels.
- epoll_create() returns a file descriptor referring to the new epoll instance. This file descriptor is used to refer to the epoll instance in other epoll system calls. When the file descriptor is no longer required, it should be closed using close().
- Multiple file descriptors may refer to the same epoll instance as a consequence of calls to fork() or descriptor duplication using dup() or similar. When all file descriptors referring to an epoll instance are closed, the instance is destroyed and its associated resources are released back to the system.
- epoll_create1() performs the same task as epoll_create(), but drops the obsolete size argument and adds a flags argument that can be used to modify the behavior of the system call.
- One flag is supported: EPOLL_CLOEXEC, which causes the kernel to enable the close-on-exec flag(FD_CLOEXEC) for the new file descriptor.
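- For example, creating an epoll instance with the close-on-exec flag set atomically (avoiding the race inherent in a separate fcntl() FD_CLOEXEC call in a multithreaded program):
int epfd = epoll_create1(EPOLL_CLOEXEC);    /* Close-on-exec set at creation */
if(epfd == -1)
{
    /* Handle error */
}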
- Linux 3.7 added a new epoll_ctl() operation, EPOLL_CTL_DISABLE, intended to let multithreaded applications safely disable monitoring of a file descriptor. (This operation was later removed, and is not present in current mainline kernels.)
63.4.2 Modifying the epoll Interest List: epoll_ctl()
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *ev);
Returns 0 on success, or -1 on error
- epoll_ctl() modifies the interest list of the epoll instance referred to by the file descriptor epfd.
- fd identifies which of the file descriptors in the interest list is to have its settings modified. It can be a file descriptor for a pipe, FIFO, socket, POSIX message queue, inotify instance, terminal, device, or another epoll descriptor (i.e., we can build a hierarchy of monitored descriptors). But fd can’t be a file descriptor for a regular file or a directory (the error EPERM results).
- op specifies the operation to be performed, and has one of the following values:
- EPOLL_CTL_ADD
Add fd to the interest list for epfd. The set of events that we are interested in monitoring for fd is specified in the buffer pointed to by ev. If we attempt to add a file descriptor that is already in the interest list, epoll_ctl() fails with the error EEXIST.
- EPOLL_CTL_MOD
Modify the events setting for the file descriptor fd, using the information specified in the buffer pointed to by ev. If we attempt to modify the settings of a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT.
- EPOLL_CTL_DEL
Remove fd from the interest list for epfd. ev is ignored for this operation. If we attempt to remove a file descriptor that is not in the interest list for epfd, epoll_ctl() fails with the error ENOENT. Closing a file descriptor automatically removes it from all of the epoll interest lists of which it is a member.
- ev is a pointer to a structure of type epoll_event, defined as follows:
struct epoll_event
{
    uint32_t events;        /* epoll events bit mask */
    epoll_data_t data;      /* User data */
};
- The data field of the epoll_event structure is typed as follows:
typedef union epoll_data
{
    void *ptr;              /* Pointer to user-defined data */
    int fd;                 /* File descriptor number */
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;
- ev specifies settings for the file descriptor fd as follows:
- events is a bit mask specifying the set of events that we are interested in monitoring for fd(next section).
- data is a union, one of whose members can be used to specify information that is passed back to the calling process(via epoll_wait()) if fd later becomes ready.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>
void Exit(char *string)
{
    printf("%s\n", string);
    exit(1);
}
int main()
{
    int epfd = epoll_create(1);
    if(epfd == -1)
    {
        Exit("epoll_create error");
    }
    struct epoll_event ev;
    int fd = STDIN_FILENO;  /* Monitor standard input for readability */
    ev.data.fd = fd;        /* Passed back by epoll_wait() when fd is ready */
    ev.events = EPOLLIN;
    if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
    {
        Exit("epoll_ctl error");
    }
    exit(0);
}
The max_user_watches limit
- Because each file descriptor registered in an epoll interest list requires a small amount of non-swappable kernel memory, the kernel provides an interface that defines a limit on the total number of file descriptors that each user can register in all epoll interest lists. The value of this limit can be viewed and modified via max_user_watches, a Linux-specific file in the /proc/sys/fs/epoll directory. The default value of this limit is calculated based on available system memory(see the epoll(7) manual page).
63.4.3 Waiting for Events: epoll_wait()
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);
Returns number of ready file descriptors, 0 on timeout, or -1 on error
- epoll_wait() returns information about ready file descriptors from the epoll instance referred to by the file descriptor epfd. A single epoll_wait() call can return information about multiple ready file descriptors.
- The information about ready file descriptors is returned in the array of epoll_event structures pointed to by evlist. The evlist array is allocated by the caller, and the number of elements it contains is specified in maxevents.
- Each item in the array evlist returns information about a single ready file descriptor. The events field returns a mask of the events that have occurred on this descriptor. The data field returns whatever value was specified in ev.data when we registered interest in this file descriptor using epoll_ctl().
- The data field provides the only mechanism for finding out the number of the file descriptor associated with this event. When we make the epoll_ctl() call that places a file descriptor in the interest list, we should either set ev.data.fd to the file descriptor number (as in Listing 63-4) or set ev.data.ptr to point to a structure that contains the file descriptor number.
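- A sketch of the ev.data.ptr alternative: register a pointer to a caller-defined per-descriptor structure (connection_t is a hypothetical name here), and recover it from the returned event:
#include <stdlib.h>
#include <sys/epoll.h>

typedef struct
{
    int fd;                 /* The descriptor itself */
    /* ... other per-descriptor state ... */
} connection_t;

/* Inside some function, with epfd and fd already set up: */
connection_t *conn = malloc(sizeof(connection_t));
conn->fd = fd;
struct epoll_event ev;
ev.events = EPOLLIN;
ev.data.ptr = conn;         /* The kernel returns this pointer verbatim */
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

/* Later, after epoll_wait() has filled evlist[index]: */
connection_t *readyConn = evlist[index].data.ptr;
/* readyConn->fd is the descriptor associated with this event */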
- If timeout:
- = -1, block until an event occurs for one of the file descriptors in the interest list for epfd or until a signal is caught.
- = 0, perform a nonblocking check to see which events are currently available on the file descriptors in the interest list for epfd.
- > 0, block for up to timeout milliseconds, until an event occurs on one of the file descriptors in the interest list for epfd, or until a signal is caught.
- On success, epoll_wait() returns the number of items that have been placed in the array evlist, or 0 if no file descriptors were ready within the interval specified by timeout.
On error, epoll_wait() returns -1, with errno set to indicate the error.
- In a multithreaded program, it is possible for one thread to use epoll_ctl() to add file descriptors to the interest list of an epoll instance that is already being monitored by epoll_wait() in another thread. These changes to the interest list will be taken into account immediately, and the epoll_wait() call will return readiness information about the newly added file descriptors.
epoll events
- The bit values that can be specified in ev.events when we call epoll_ctl() and that are placed in the evlist[].events fields returned by epoll_wait() are shown in Table 63-8: EPOLLIN, EPOLLPRI, EPOLLRDHUP, and EPOLLOUT parallel the poll() bits of the same names; EPOLLET and EPOLLONESHOT modify how notification is performed; and EPOLLERR and EPOLLHUP are returned only via epoll_wait().
- When specified as input to epoll_ctl() or returned as output via epoll_wait(), these bits convey exactly the same meaning as the corresponding poll() event bits.
The EPOLLONESHOT flag
- By default, once a file descriptor is added to an epoll interest list using the epoll_ctl() EPOLL_CTL_ADD operation, it remains active(i.e., subsequent calls to epoll_wait() will inform us whenever the file descriptor is ready) until we explicitly remove it from the list using the epoll_ctl() EPOLL_CTL_DEL operation.
- If we want to be notified only once about a particular file descriptor, then we can specify the EPOLLONESHOT flag in the ev.events value passed in epoll_ctl(). If this flag is specified, after the next epoll_wait() call that informs us that the corresponding file descriptor is ready, the file descriptor is marked inactive in the interest list, and we won’t be informed about its state by future epoll_wait() calls. If desired, we can subsequently reenable monitoring of this file descriptor using the epoll_ctl() EPOLL_CTL_MOD operation. (We can’t use EPOLL_CTL_ADD operation for this purpose, because the inactive file descriptor is still part of the epoll interest list.)
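- A sketch of the one-shot pattern (epfd and fd are assumed to be set up as before): the descriptor goes quiet after one notification, and EPOLL_CTL_MOD rearms it once we have finished handling that event:
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLONESHOT;
ev.data.fd = fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);    /* Initial registration */

/* epoll_wait() reports fd ready at most once; handle the I/O, then: */

ev.events = EPOLLIN | EPOLLONESHOT;
epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);    /* Rearm with MOD, not ADD */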
Example program
- As command-line arguments, this program expects the pathnames of one or more terminals or FIFOs. The program performs the following steps:
- Create an epoll instance.
- Open each of the files named on the command line for input and add the resulting file descriptor to the interest list of the epoll instance, specifying the set of events to be monitored as EPOLLIN.
- Execute a loop that calls epoll_wait() to monitor the interest list of the epoll instance and handles the returned events from each call. Note the following points about this loop:
- After the epoll_wait() call, the program checks for an EINTR return, which may occur if the program was stopped by a signal in the middle of the epoll_wait() call and then resumed by SIGCONT. If this occurs, the program restarts the epoll_wait() call.
- If the epoll_wait() call was successful, the program uses a further loop to check each of the ready items in evlist. For each item in evlist, the program checks the events field for the presence of not just EPOLLIN, but also EPOLLHUP and EPOLLERR. These latter events can occur if the other end of a FIFO was closed or a terminal hangup occurred. If EPOLLIN was returned, then the program reads some input from the corresponding file descriptor and displays it on standard output. Otherwise, if either EPOLLHUP or EPOLLERR occurred, the program closes the corresponding file descriptor and decrements the counter of open files (numOpenFds).
- The loop terminates when all open file descriptors have been closed(i.e., when numOpenFds equals 0).
- The following shell session logs demonstrate the use of the program in Listing 63-5. We use two terminal windows. In one window, we use the program in Listing 63-5 to monitor two FIFOs for input.(Each open of a FIFO for reading by this program will complete only after another process has opened the FIFO for writing, as described in Section 44.7.) In the other window, we run instances of cat(1) that write data to these FIFOs.
- (The initial part of the session, in which the monitoring program opens the two FIFOs and is then suspended, is omitted here.) With the monitoring program suspended, we generate input on both FIFOs, and close the write end of one of them:
qqq
Type Control-D to terminate “cat > q”
$ fg %1
cat >p
ppp
- Now we resume our monitoring program by bringing it into the foreground, at which point epoll_wait() returns two events:
$ fg
./epoll_input p q
About to epoll_wait()
Ready: 2
fd=4; events: EPOLLIN
read 4 bytes: ppp

fd=5; events: EPOLLIN EPOLLHUP
read 4 bytes: qqq

closing fd 5
About to epoll_wait()
- The two blank lines in the above output are the newlines that were read by the instances of cat, written to the FIFOs, and then read and echoed by our monitoring program.
- Now we type Control-D in the second terminal window in order to terminate the remaining instance of cat, which causes epoll_wait() to once more return, this time with a single event:
Type Control-D to terminate “cat >p”
Ready: 1
fd=4; events: EPOLLHUP
closing fd 4
All file descriptors closed; bye
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/epoll.h>

#define MAX_BUF 1000 /* Maximum bytes fetched by a single read() */
#define MAX_EVENTS 5 /* Maximum events returned by one epoll_wait() call */

void Exit(char *string) /* Print an error message and terminate */
{
    fprintf(stderr, "%s\n", string);
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
    int epfd, ready, fd, nread, index, numOpenFds;
    struct epoll_event ev;
    struct epoll_event evlist[MAX_EVENTS];
    char buf[MAX_BUF];

    if(argc < 2 || strcmp(argv[1], "--help") == 0)
    {
        printf("%s file...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    epfd = epoll_create(1); /* The size argument is only a hint; it must be greater than 0 */
    if(epfd == -1)
    {
        Exit("epoll_create error");
    }

    /* Open each file named on the command line and add the resulting
       descriptor to the interest list of the epoll instance */
    for(index = 1; index < argc; ++index)
    {
        fd = open(argv[index], O_RDONLY);
        if(fd == -1)
        {
            Exit("open error");
        }
        printf("Opened \"%s\" on fd %d\n", argv[index], fd);

        ev.events = EPOLLIN; /* Only interested in input events */
        ev.data.fd = fd; /* Recorded so we can identify the fd when epoll_wait() returns */
        if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        {
            Exit("epoll_ctl error");
        }
    }

    numOpenFds = argc - 1;
    while(numOpenFds > 0) /* Loop until all file descriptors are closed */
    {
        printf("About to epoll_wait()\n");
        ready = epoll_wait(epfd, evlist, MAX_EVENTS, -1);
        if(ready == -1)
        {
            if(errno == EINTR)
            {
                continue; /* Interrupted by a signal; restart epoll_wait() */
            }
            else
            {
                Exit("epoll_wait error");
            }
        }
        printf("Ready: %d\n", ready);

        /* Deal with the returned list of events */
        for(index = 0; index < ready; ++index)
        {
            printf(" fd=%d; events: %s%s%s\n", evlist[index].data.fd,
                (evlist[index].events & EPOLLIN) ? "EPOLLIN " : "",
                (evlist[index].events & EPOLLHUP) ? "EPOLLHUP " : "",
                (evlist[index].events & EPOLLERR) ? "EPOLLERR " : "");
            if(evlist[index].events & EPOLLIN)
            {
                nread = read(evlist[index].data.fd, buf, MAX_BUF);
                if(nread == -1)
                {
                    Exit("read error");
                }
                printf("read %d bytes: %.*s\n", nread, nread, buf);
            }
            else if(evlist[index].events & (EPOLLHUP | EPOLLERR))
            {
                /* We reach here only if EPOLLIN was not set, so any
                   outstanding input has already been consumed by
                   earlier loop iterations; now close the descriptor */
                printf(" closing fd %d\n", evlist[index].data.fd);
                if(close(evlist[index].data.fd) == -1)
                {
                    Exit("close error");
                }
                numOpenFds--;
            }
        }
    }
    printf("All file descriptors closed; bye\n");
    exit(EXIT_SUCCESS);
}
63.4.4 A Closer Look at epoll Semantics
- Figure 5-2(page 95) shows the relationship between file descriptors, open file descriptions, and the system-wide file i-node table.
- When we create an epoll instance using epoll_create(), the kernel creates a new in-memory i-node and open file description, and allocates a new file descriptor in the calling process that refers to the open file description. The interest list for an epoll instance is associated with the open file description, not with the epoll file descriptor.
- This has the following consequences:
- If we duplicate an epoll file descriptor using dup()(or similar), then the duplicated descriptor refers to the same epoll interest and ready lists as the original descriptor. We may modify the interest list by specifying either file descriptor as the epfd argument in a call to epoll_ctl(). Similarly, we can retrieve items from the ready list by specifying either file descriptor as the epfd argument in a call to epoll_wait().
- The preceding point also applies after a call to fork(). The child inherits a duplicate of the parent’s epoll file descriptor, and this duplicate descriptor refers to the same epoll data structures.
- When we perform an epoll_ctl() EPOLL_CTL_ADD operation, the kernel adds an item to the epoll interest list that records both the number of the monitored file descriptor and a reference to the corresponding open file description. For the purpose of epoll_wait() calls, the kernel monitors the open file description. This means that we must refine our earlier statement that when a file descriptor is closed, it is automatically removed from any epoll interest lists of which it is a member.
- The refinement is this: an open file description is removed from the epoll interest list once all file descriptors that refer to it have been closed. This means that if we create duplicate descriptors referring to an open file(using dup()(or similar) or fork()), then the open file will be removed only after the original descriptor and all of the duplicates have been closed.
- Suppose we execute the code shown in Listing 63-6(sketched below, since the listing is not reproduced in these notes). The epoll_wait() call in this code will tell us that the file descriptor fd1 is ready(i.e., evlist[0].data.fd == fd1), even though fd1 has been closed. This is because there is still one open file descriptor, fd2, referring to the open file description contained in the epoll interest list.
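- A sketch of that scenario(my own fragment; it assumes an epoll instance epfd, an open file descriptor fd1, and the MAX_EVENTS and Exit() definitions from the example program above):
struct epoll_event ev;
struct epoll_event evlist[MAX_EVENTS];
int fd2, ready;

ev.events = EPOLLIN;
ev.data.fd = fd1;
if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd1, &ev) == -1)
{
    Exit("epoll_ctl error");
}
/* Suppose that fd1 now becomes ready for input */
fd2 = dup(fd1); /* fd2 refers to the same open file description as fd1 */
close(fd1); /* The open file description stays in the interest list, via fd2 */
ready = epoll_wait(epfd, evlist, MAX_EVENTS, -1);
if(ready == -1)
{
    Exit("epoll_wait error");
}
/* Here evlist[0].data.fd == fd1, even though fd1 has been closed */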
- A similar scenario occurs when two processes hold duplicate descriptors for the same open file description(typically after fork()), and the process performing the epoll_wait() has closed its file descriptor, but the other process still holds the duplicate descriptor open.
63.4.5 Performance of epoll Versus I/O Multiplexing
- From Table 63-9 in the book(not reproduced here), which compares the times that poll(), select(), and epoll require to monitor N file descriptors, we see that as the number of file descriptors to be monitored grows large, poll() and select() perform poorly. The performance of epoll hardly declines as N grows large.
- The reasons why select() and poll() perform poorly when monitoring large numbers of file descriptors are given in Section 63.2.5. Now we look at the reasons why epoll performs better:
- On each call to select() or poll(), the kernel must check all of the file descriptors specified in the call.
When we mark a descriptor to be monitored with epoll_ctl(), the kernel records this fact in a list associated with the underlying open file description, and whenever an I/O operation that makes the file descriptor ready is performed, the kernel adds an item to the ready list for the epoll descriptor. (An I/O event on a single open file description may cause multiple file descriptors associated with that description to become ready.) Subsequent epoll_wait() calls simply fetch items from the ready list.
- Each time we call select() or poll(), we pass a data structure to the kernel that identifies all of the file descriptors that are to be monitored, and, on return, the kernel passes back a data structure describing the readiness of all of these descriptors.
With epoll, we use epoll_ctl() to build up a data structure in kernel space that lists the set of file descriptors to be monitored. Once this data structure has been built, each later call to epoll_wait() doesn’t need to pass any information about file descriptors to the kernel, and the call returns information about only those descriptors that are ready.
- Additionally, for select(), we must initialize the input data structure prior to each call; for both select() and poll(), we must inspect the returned data structure to find out which of the N file descriptors are ready.
- Very roughly, we can say that for large values of N(the number of file descriptors being monitored), the performance of select() and poll() scales linearly with N. epoll scales(linearly) according to the number of I/O events that occur.
- epoll is thus efficient in a scenario that is common in servers that handle many simultaneous clients: of the many file descriptors being monitored, most are idle; only a few descriptors are ready.
63.4.6 Edge-Triggered Notification
- By default, epoll provides level-triggered notification. That is, epoll tells us whether an I/O operation can be performed on a file descriptor without blocking. This is the same type of notification as is provided by poll() and select().
- epoll also allows for edge-triggered notification: a call to epoll_wait() tells us if there has been I/O activity on a file descriptor since the previous call to epoll_wait()(or since the descriptor was opened, if there was no previous call).
- Using epoll with edge-triggered notification is similar to signal-driven I/O, except that if multiple I/O events occur, epoll coalesces them into a single notification returned via epoll_wait(); with signal-driven I/O, multiple signals may be generated.
- To employ edge-triggered notification, we specify the EPOLLET flag in ev.events when calling epoll_ctl():
struct epoll_event ev;
ev.data.fd = fd;
ev.events = EPOLLIN | EPOLLET; /* Request edge-triggered notification */
if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
{
    Exit("epoll_ctl error");
}
- Suppose we use epoll to monitor a socket for input(EPOLLIN). The following steps occur:
- Input arrives on the socket.
- We perform an epoll_wait(). This call will tell us that the socket is ready, regardless of whether we are employing level-triggered or edge-triggered notification.
- We perform a second call to epoll_wait().
- If we employ level-triggered notification: the second epoll_wait() call will inform us that the socket is ready.
If we employ edge-triggered notification: the second epoll_wait() call will block, because no new input has arrived since the previous call to epoll_wait().
- As noted in Section 63.1.1, edge-triggered notification is usually employed in conjunction with nonblocking file descriptors. The general framework for using edge-triggered epoll notification is as follows:
- Make all file descriptors that are to be monitored nonblocking.
- Build the epoll interest list using epoll_ctl().
- Handle I/O events using the following loop:
a) Retrieve a list of ready descriptors using epoll_wait().
b) For each file descriptor that is ready, process I/O until the relevant system call(e.g., read(), write(), recv(), send(), or accept()) returns with the error EAGAIN or EWOULDBLOCK. (A sketch of this step follows the list.)
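- A minimal sketch of step b)(my own fragment; it assumes a nonblocking descriptor fd that epoll_wait() has just reported ready, plus the MAX_BUF and Exit() definitions from the example program above; processInput() is a hypothetical application function):
char buf[MAX_BUF];
ssize_t nread;

/* With edge-triggered notification we must consume all available input now,
   since no further notification will arrive until new input occurs */
for(;;)
{
    nread = read(fd, buf, MAX_BUF);
    if(nread > 0)
    {
        /* processInput(buf, nread); hypothetical application processing */
        continue;
    }
    if(nread == 0) /* End of file: the peer closed the connection */
    {
        break;
    }
    if(errno == EAGAIN || errno == EWOULDBLOCK)
    {
        break; /* All currently available input has been consumed */
    }
    Exit("read error"); /* Some other error */
}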
Preventing file-descriptor starvation when using edge-triggered notification
- Suppose that we are monitoring multiple file descriptors using edge-triggered notification, and that a ready file descriptor has a large amount(perhaps an endless stream) of input available. If, after detecting that this file descriptor is ready, we attempt to consume all of the input using nonblocking reads, then we risk starving the other file descriptors of attention(i.e., it may be a long time before we again check them for readiness and perform I/O on them).
- One solution is for the application to maintain a list of file descriptors that have been notified as being ready, and execute a loop(sketched after this list) that continuously performs the following actions:
- Monitor the file descriptors using epoll_wait() and add ready descriptors to the application list. If any file descriptors are already registered as being ready in the application list, then the timeout for this monitoring step should be small or 0, so that if no new file descriptors are ready, the application can quickly proceed to the next step and service any file descriptors that are already known to be ready.
- Perform a limited amount of I/O on those file descriptors registered as being ready in the application list(perhaps cycling through them in round-robin fashion, rather than always starting from the beginning of the list after each call to epoll_wait()). A file descriptor can be removed from the application list when the relevant nonblocking I/O system call fails with the EAGAIN or EWOULDBLOCK error.
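- The following fragment sketches that loop(my own sketch; epfd, MAX_EVENTS, and Exit() are as in the example program above, while readyList and the helpers listIsEmpty(), addReady(), and serviceNextRoundRobin() are hypothetical application-level functions, not library calls):
struct epoll_event evlist[MAX_EVENTS];
int nready, i, timeout;

for(;;)
{
    /* Step 1: block only if no descriptors are already known to be ready */
    timeout = listIsEmpty(readyList) ? -1 : 0;
    nready = epoll_wait(epfd, evlist, MAX_EVENTS, timeout);
    if(nready == -1 && errno != EINTR)
    {
        Exit("epoll_wait error");
    }
    for(i = 0; i < nready; ++i)
    {
        addReady(readyList, evlist[i].data.fd); /* Remember ready descriptors */
    }

    /* Step 2: perform a limited amount of I/O on the next ready descriptor,
       cycling through readyList in round-robin fashion; the helper removes a
       descriptor once nonblocking I/O fails with EAGAIN or EWOULDBLOCK */
    serviceNextRoundRobin(readyList);
}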
- This approach offers other benefits in addition to preventing file-descriptor starvation. For example, we can include other steps in the above loop, such as handling timers and accepting signals with sigwaitinfo()(or similar).
- Starvation considerations can also apply when using signal-driven I/O, since it also presents an edge-triggered notification mechanism. By contrast, starvation considerations don’t necessarily apply in applications employing a level-triggered notification mechanism. This is because we can employ blocking file descriptors with level-triggered notification and use a loop that continuously checks descriptors for readiness, and then performs some I/O on the ready descriptors before once more checking for ready file descriptors.
63.5 Waiting on Signals and File Descriptors
63.5.1 The pselect() System Call
63.5.2 The Self-Pipe Trick
63.6 Summary
- select() and poll() I/O multiplexing calls simultaneously monitor multiple file descriptors to see if I/O is possible on any of the descriptors. With both system calls, we pass a complete list of to-be-checked file descriptors to the kernel on each system call, and the kernel returns a modified list indicating which descriptors are ready. The fact that complete file descriptor lists are passed and checked on each call means that select() and poll() perform poorly when monitoring large numbers of file descriptors.
- Signal-driven I/O allows a process to receive a signal when I/O is possible on a file descriptor. Linux allows us to change the signal used for notification, and if we instead employ a real-time signal, then multiple notifications can be queued, and the signal handler can use its siginfo_t argument to determine the file descriptor and event type that generated the signal.
- The performance advantage of epoll(and signal-driven I/O) derives from the fact that the kernel “remembers” the list of file descriptors that a process is monitoring(by contrast with select() and poll(), where each system call must again tell the kernel which file descriptors to check).
epoll has notable advantages over the use of signal-driven I/O: we avoid the complexities of dealing with signals and can specify which types of I/O events(e.g., input or output) are to be monitored.
- With a level-triggered notification model, we are informed whether I/O is currently possible on a file descriptor.
Edge-triggered notification informs us whether I/O activity has occurred on a file descriptor since it was last monitored. Edge-triggered notification is usually employed in conjunction with nonblocking I/O.
Further information
- A particularly interesting online resource is at http://www.kegel.com/c10k.html. This web page explores the issues facing developers of web servers designed to simultaneously serve tens of thousands of clients.
Exercises(Redo)