Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
Welcome to my github: https://github.com/gaoxiangnumber1
14.1 Introduction
14.2 Nonblocking I/O
- Section 10.5: System calls are divided into two categories: the slow ones and all the others. The slow system calls are those that can block forever. They include
• Reads that can block the caller forever if data isn’t present with certain file types(pipes, terminal devices, and network devices)
• Writes that can block the caller forever if the data can’t be accepted immediately by these same file types(e.g., no room in the pipe, network flow control)
• Opens that block until some condition occurs on certain file types(such as an open of a FIFO for writing only, when no other process has the FIFO open for reading)
• Reads and writes of files that have mandatory record locking enabled
• Certain ioctl operations
• Some of the interprocess communication functions(Chapter 15)
- System calls related to disk I/O are not considered slow, even though the read or write of a disk file can block the caller temporarily.
- Nonblocking I/O lets us issue an I/O operation and not have it block forever. If the operation cannot be completed, the call returns immediately with an error noting that the operation would have blocked.
- Two ways to specify nonblocking I/O for a given descriptor.
- Call open to get the descriptor specifying the O_NONBLOCK flag(Section 3.3).
- For a descriptor that is already open, call fcntl to turn on the O_NONBLOCK file status flag(Section 3.14). Figure 3.12 shows a function that we can call to turn on any of the file status flags for a descriptor.
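A minimal version of that helper (the name set_fl follows Figure 3.12; the error handling is simplified here) might look like:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Turn on one or more file status flags for a descriptor
   (modeled on Figure 3.12; error handling simplified). */
void set_fl(int fd, int flags)
{
    int val;

    if ((val = fcntl(fd, F_GETFL, 0)) < 0) {   /* fetch current file status flags */
        perror("fcntl F_GETFL");
        exit(1);
    }
    val |= flags;                              /* turn flags on */
    if (fcntl(fd, F_SETFL, val) < 0) {
        perror("fcntl F_SETFL");
        exit(1);
    }
}
```

For example, set_fl(STDOUT_FILENO, O_NONBLOCK) makes the standard output nonblocking; a matching clr_fl would AND with the complement of the flags instead.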
- The program in Figure 14.1 reads up to 500000 bytes from the standard input and attempts to write them to the standard output. The standard output is first set to be nonblocking. The output is done in a loop, with the result of each write being printed on the standard error.
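The heart of that program can be sketched as a loop over a descriptor that has already been set nonblocking (a simplified sketch modeled on Figure 14.1, not the book's exact code; the logging format matches the sample output below):

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Write len bytes from buf to a nonblocking descriptor, logging every
   attempt to stderr; returns the number of write calls issued. */
long write_all_nonblocking(int fd, const char *buf, long len)
{
    long ncalls = 0;
    const char *ptr = buf;

    while (len > 0) {
        errno = 0;
        ssize_t nwrite = write(fd, ptr, len);
        fprintf(stderr, "nwrite = %zd, errno = %d\n", nwrite, errno);
        ncalls++;
        if (nwrite > 0) {              /* full or partial write accepted */
            ptr += nwrite;
            len -= nwrite;
        } else if (nwrite < 0 && errno != EAGAIN) {
            break;                     /* real error, not "would block" */
        }                              /* EAGAIN: poll and try again */
    }
    return ncalls;
}
```

Against a regular file this loop runs once; against a terminal it spins on EAGAIN, which is exactly the polling waste discussed below.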
- If the standard output is a regular file, we expect the write to be executed once:
$ ls -l /etc/services #print file size
-rw-r--r-- 1 root 677959 Jun 23 2009 /etc/services
$ ./a.out < /etc/services > temp.file #try a regular file first
read 500000 bytes
nwrite = 500000, errno = 0 #a single write
$ ls -l temp.file #verify size of output file
-rw-rw-r-- 1 sar 500000 Apr 1 13:03 temp.file
- But if the standard output is a terminal, we expect the write to return a partial count sometimes and an error at other times.
$ ./a.out < /etc/services 2>stderr.out #output to terminal
#lots of output to terminal ...
$ cat stderr.out
read 500000 bytes
nwrite = 999, errno = 0
nwrite = -1, errno = 35
nwrite = -1, errno = 35
nwrite = -1, errno = 35
nwrite = -1, errno = 35
nwrite = 1001, errno = 0
nwrite = -1, errno = 35
nwrite = 1002, errno = 0
nwrite = 1004, errno = 0
nwrite = 1003, errno = 0
nwrite = 1003, errno = 0
nwrite = 1005, errno = 0
nwrite = -1, errno = 35
... #61 of these errors
nwrite = 1006, errno = 0
nwrite = 1004, errno = 0
nwrite = 1005, errno = 0
nwrite = 1006, errno = 0
nwrite = -1, errno = 35
... #108 of these errors
nwrite = 1006, errno = 0
nwrite = 1005, errno = 0
nwrite = 1005, errno = 0
nwrite = -1, errno = 35
... #681 of these errors and so on ...
nwrite = 347, errno = 0
- The errno of 35 is EAGAIN(Resource temporarily unavailable). The amount of data accepted by the terminal driver varies from system to system. The results also vary depending on how you are logged in to the system: on the system console, on a hard-wired terminal, on a network connection using a pseudo terminal. If you are running a windowing system on your terminal, you are also going through a pseudo terminal device.
- In this example, the program issues more than 9000 write calls, even though only 500 are needed to output the data. The rest just return an error. This type of loop, called polling, is a waste of CPU time on a multiuser system.
- We can avoid using nonblocking I/O by using multiple threads(Chapter 11). We can allow individual threads to block in I/O calls if we can continue to make progress in other threads.
14.3 Record Locking
- Record locking is the term used to describe the ability of a process to prevent other processes from modifying a region of a file while the first process is reading or modifying that portion of the file.
- Figure 14.2 shows the forms of record locking provided by various systems.
fcntl Record Locking
#include <unistd.h>
#include <fcntl.h>
int fcntl(int fd, int cmd, ... );
Returns: depends on cmd if OK(see following), -1 on error
- For record locking, cmd is F_GETLK, F_SETLK, or F_SETLKW.
- The third argument, flockptr, is a pointer to an flock structure.
struct flock
{
short l_type; /* F_RDLCK, F_WRLCK, or F_UNLCK */
short l_whence; /* SEEK_SET, SEEK_CUR, or SEEK_END */
off_t l_start; /* offset in bytes, relative to l_whence */
off_t l_len; /* length in bytes; 0 means lock to EOF */
pid_t l_pid; /* returned with F_GETLK */
};
- This structure describes
• The type of lock desired: F_RDLCK(a shared read lock), F_WRLCK(an exclusive write lock), or F_UNLCK(unlocking a region)
• The starting byte offset of the region being locked or unlocked(l_start and l_whence)
• The size of the region in bytes(l_len)
• The ID(l_pid) of the process holding the lock that can block the current process(returned with F_GETLK only)
- Numerous rules apply to the specification of the region to be locked or unlocked.
• The l_whence member is specified as SEEK_SET, SEEK_CUR, or SEEK_END.
• Locks can start and extend beyond the current end of file, but cannot start or extend before the beginning of the file.
• l_len = 0: the lock extends to the largest possible offset of the file(until EOF). This allows us to lock a region starting anywhere in the file, up through and including any data that is appended to the file.
• To lock the entire file, we set l_start as 0 and l_whence as SEEK_SET to point to the beginning of the file and specify a length(l_len) of 0.
- Figure 14.3: The compatibility rule applies to lock requests made from different processes, not multiple lock requests made by a single process. If a process has an existing lock on a range of a file, a subsequent attempt to place a lock on the same range by the same process will replace the existing lock with the new one.
- To obtain a read lock, the descriptor must be open for reading; to obtain a write lock, the descriptor must be open for writing. Three commands for the fcntl function:
- F_GETLK: Determine whether the lock described by flockptr is blocked by some other lock.
-1- If a lock exists that would prevent our lock from being created, the information on that existing lock overwrites the information pointed to by flockptr.
-2- If no lock exists that would prevent our lock from being created, the structure pointed to by flockptr is left unchanged except for the l_type member(set to F_UNLCK).
- F_SETLK: Set the lock described by flockptr.
If we try to obtain a read lock(l_type = F_RDLCK) or a write lock(l_type = F_WRLCK) and the compatibility rule prevents the system from giving us the lock, fcntl returns with errno = EAGAIN. This command is also used to clear the lock described by flockptr(l_type = F_UNLCK).
- F_SETLKW: This command is a blocking version of F_SETLK.
If the requested read lock or write lock cannot be granted because another process currently has some part of the requested region locked, the calling process is put to sleep. The process wakes up either when the lock becomes available or when interrupted by a signal.
- Testing a lock with F_GETLK and then trying to obtain that lock with F_SETLK or F_SETLKW is not an atomic operation. Between the two fcntl calls, some other process may come in and obtain the same lock. If we don’t want to block while waiting for a lock to become available to us, we must handle the error returns from F_SETLK.
- POSIX.1 doesn’t specify what happens when one process read locks a range of a file, a second process blocks while trying to get a write lock on the same range, and a third process then attempts to get another read lock on the range. If the third process is allowed to place a read lock on the range, then it can starve processes with pending write locks.
- When setting or releasing a lock on a file, the system combines or splits adjacent areas as required. E.g., if we lock bytes [100, 199] and then unlock byte 150, the kernel maintains the locks on bytes [100, 149] and [151, 199].
- If we then lock byte 150, the system would coalesce the adjacent locked regions into a single region [100, 199].
- We normally use one of the following five macros:
#define read_lock(fd, offset, whence, len) \
lock_reg((fd), F_SETLK, F_RDLCK,(offset),(whence),(len))
#define readw_lock(fd, offset, whence, len) \
lock_reg((fd), F_SETLKW, F_RDLCK,(offset),(whence),(len))
#define write_lock(fd, offset, whence, len) \
lock_reg((fd), F_SETLK, F_WRLCK,(offset),(whence),(len))
#define writew_lock(fd, offset, whence, len) \
lock_reg((fd), F_SETLKW, F_WRLCK,(offset),(whence),(len))
#define un_lock(fd, offset, whence, len) \
lock_reg((fd), F_SETLK, F_UNLCK,(offset),(whence),(len))
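These macros assume the lock_reg function (Figure 14.5 in apue.h); a minimal sketch:

```c
#include <fcntl.h>

/* Fill in an flock structure and issue the record-locking fcntl. */
int lock_reg(int fd, int cmd, int type, off_t offset, int whence, off_t len)
{
    struct flock lock;

    lock.l_type = type;      /* F_RDLCK, F_WRLCK, F_UNLCK */
    lock.l_start = offset;   /* byte offset, relative to l_whence */
    lock.l_whence = whence;  /* SEEK_SET, SEEK_CUR, SEEK_END */
    lock.l_len = len;        /* number of bytes (0 means to EOF) */

    return fcntl(fd, cmd, &lock);
}
```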
- Figure 14.6 defines lock_test() to test for a lock.
- If a lock exists that would block the request specified by the arguments, this function returns the process ID of the process holding the lock. Otherwise, the function returns 0. We normally call this function from the following two macros(defined in apue.h):
#define is_read_lockable(fd, offset, whence, len) \
(lock_test((fd), F_RDLCK,(offset),(whence),(len)) == 0)
#define is_write_lockable(fd, offset, whence, len) \
(lock_test((fd), F_WRLCK,(offset),(whence),(len)) == 0)
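lock_test itself (after Figure 14.6) can be sketched as:

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Return 0 if the region is not locked by another process,
   otherwise the PID of a process holding a conflicting lock. */
pid_t lock_test(int fd, int type, off_t offset, int whence, off_t len)
{
    struct flock lock;

    lock.l_type = type;      /* F_RDLCK or F_WRLCK */
    lock.l_start = offset;
    lock.l_whence = whence;
    lock.l_len = len;

    if (fcntl(fd, F_GETLK, &lock) < 0) {
        perror("fcntl F_GETLK");
        exit(1);
    }
    if (lock.l_type == F_UNLCK)
        return 0;            /* region isn’t locked by another process */
    return lock.l_pid;       /* PID of a lock owner blocking us */
}
```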
- The lock_test function can’t be used by a process to see whether it is currently holding a portion of a file locked. The definition of F_GETLK command states that the information returned applies to an existing lock that would prevent us from creating our own lock. Since the F_SETLK and F_SETLKW commands always replace a process’s existing lock if it exists, we can never block on our own lock; thus, the F_GETLK command will never report our own lock.
- Deadlock occurs when two processes are each waiting for a resource that the other has locked. Figure 14.7 shows an example of deadlock.
- The child locks byte 0 and the parent locks byte 1. Each then tries to lock the other’s already locked byte. We use TELL_xxx and WAIT_xxx from Section 8.9 so that each process can wait for the other to obtain its lock.
$ ./a.out
parent: got the lock, byte 1
child: got the lock, byte 0
parent: writew_lock error: Resource deadlock avoided
child: got the lock, byte 1
- When a deadlock is detected, the kernel chooses one process(parent or child) to receive the error return.
Implied Inheritance and Release of Locks
- Three rules govern the automatic inheritance and release of record locks.
- Locks are associated with a process and a file.
-1- When a process terminates, all its locks are released.
-2- Whenever a descriptor is closed, any locks on the file referenced by that descriptor for that process are released. This means that if we make the calls
fd1 = open(pathname, ...);
read_lock(fd1, ...);
fd2 = dup(fd1);
close(fd2);
then after the close(fd2), the lock obtained on fd1 is released. The same thing happens if we replace the dup with an open of the same file on another descriptor:
fd1 = open(pathname, ...);
read_lock(fd1, ...);
fd2 = open(pathname, ...);
close(fd2);
- Locks are never inherited by the child across a fork.
The child has to call fcntl to obtain its own locks on any descriptors that were inherited across the fork.
- Locks are inherited by a new program across an exec.
If the close-on-exec flag is set for a file descriptor, all locks for the underlying file are released when the descriptor is closed as part of an exec.
FreeBSD Implementation
- Consider a process that executes the following statements(ignoring error returns):
fd1 = open(pathname, ...);
write_lock(fd1, 0, SEEK_SET, 1);
if((pid = fork()) > 0)
{
fd2 = dup(fd1);
fd3 = open(pathname, ...);
}
else if(pid == 0)
{
read_lock(fd1, 1, SEEK_SET, 1);
}
pause();
- Figure 14.8 shows the resulting data structures after both the parent and the child have paused.
- The lockf structures are linked together from the i-node structure. Each lockf structure describes one locked region(defined by an offset and length) for a given process. We show two of these structures: one for the parent’s call to write_lock and one for the child’s call to read_lock. Each structure contains the corresponding process ID.
- In the parent, closing any one of fd1, fd2, or fd3 causes the parent’s lock to be released. When any one of these three file descriptors is closed, the kernel goes through the linked list of locks for the corresponding i-node and releases the locks held by the calling process. The kernel can’t tell which descriptor of the three was used by the parent to obtain the lock.
- Figure 13.6: A daemon can use a lock on a file to ensure that only one copy of the daemon is running. Figure 14.9 shows the lockfile function used by the daemon to place a write lock on a file.
- We can define the lockfile function in terms of the write_lock function:
#define lockfile(fd) write_lock((fd), 0, SEEK_SET, 0)
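Written out as a function (after Figure 14.9), lockfile is just a whole-file write lock attempted with the nonblocking F_SETLK command:

```c
#include <fcntl.h>
#include <unistd.h>

/* Try to write-lock the entire file; returns 0 on success, or -1
   (errno = EAGAIN or EACCES) if another process holds the lock. */
int lockfile(int fd)
{
    struct flock fl;

    fl.l_type = F_WRLCK;     /* exclusive write lock */
    fl.l_start = 0;
    fl.l_whence = SEEK_SET;  /* from the beginning of the file... */
    fl.l_len = 0;            /* ...through any data appended later */

    return fcntl(fd, F_SETLK, &fl);
}
```

A second copy of the daemon opening the same file and calling lockfile gets -1 back immediately and can exit, which is exactly the single-instance guarantee described above.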
Locks at End of File
- Most implementations convert an l_whence value of SEEK_CUR or SEEK_END into an absolute file offset(use l_start and the file’s current position or current length). We often need to specify a lock relative to the file’s current length, but we can’t call fstat to obtain the current file size, since we don’t have a lock on the file. There’s a chance that another process could change the file’s length between the call to fstat and the lock call.
- Consider the following sequence of steps:
writew_lock(fd, 0, SEEK_END, 0);
write(fd, buf, 1);
un_lock(fd, 0, SEEK_END, 0);
write(fd, buf, 1);
- This sequence of code obtains a write lock from the current end of the file onward, covering any future data we might append to the file. Assuming that we are at end of file when we perform the first write, this operation will extend the file by one byte, and that byte will be locked. The unlock operation that follows removes the locks for future writes that append data to the file, but it leaves a lock on the last byte in the file. When the second write occurs, the end of file is extended by one byte, but this byte is not locked. The state of the file locks is shown in Figure 14.10.
- When a portion of a file is locked, the kernel converts the offset specified into an absolute file offset. In addition to specifying an absolute file offset(SEEK_SET), fcntl allows us to specify this offset relative to a point in the file: current(SEEK_CUR) or end of file(SEEK_END). The kernel needs to remember the locks independent of the current file offset or end of file, because the current offset and end of file can change, and changes to these attributes shouldn’t affect the state of existing locks.
- If we want to remove the lock covering the byte we wrote in the first write, we can specify the length as -1. Negative length values represent the bytes before the specified offset.
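That unlock-with-negative-length step can be wrapped as follows (a sketch; negative l_len values are an XSI extension and not supported on every system):

```c
#include <fcntl.h>
#include <unistd.h>

/* Unlock the nbytes bytes immediately preceding the current file
   offset, i.e. the region [offset - nbytes, offset - 1]. */
int unlock_preceding(int fd, off_t nbytes)
{
    struct flock fl;

    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_CUR;   /* relative to the current offset */
    fl.l_start = 0;
    fl.l_len = -nbytes;       /* negative length: bytes before l_start */

    return fcntl(fd, F_SETLK, &fl);
}
```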
Advisory versus Mandatory Locking
- Mandatory locking causes the kernel to check every open, read, and write to verify that the calling process isn’t violating a lock on the file being accessed. Mandatory locking is also called enforcement-mode locking.
- On Linux, if you want mandatory locking, you need to enable it on a per-file-system basis by using the -o mand option to the mount command.
- Mandatory locking is enabled for a particular file by turning on the set-group-ID bit and turning off the group-execute bit(Figure 4.12).
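In code, that bit combination can be set with fchmod (a sketch; on Linux the file system must additionally be mounted with -o mand before the locks are actually enforced):

```c
#include <sys/stat.h>

/* Mark a file for mandatory locking:
   set-group-ID on, group-execute off (Figure 4.12). */
int enable_mandatory_locking(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return fchmod(fd, (st.st_mode & ~S_IXGRP) | S_ISGID);
}
```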
- When a process tries to read or write a file that has mandatory locking enabled and the part of the file it is trying to access is currently locked by another process, the result depends on the type of operation(read/write), the type of lock held by the other process(read/write lock), and whether the descriptor for the read or write is nonblocking. Figure 14.11 shows the eight possibilities.
- Normally, the open function succeeds even if the file being opened has outstanding mandatory record locks. The next read or write follows the rules in Figure 14.11. But if the file being opened has outstanding mandatory record locks(read or write locks), and if the flags in open() specify O_TRUNC or O_CREAT(although Linux allows O_CREAT in this case), then open returns an error of EAGAIN immediately, regardless of whether O_NONBLOCK is specified.
- This handling of locking conflicts with open can lead to surprising results. A test program was run that opened a file(whose mode specified mandatory locking), established a read lock on an entire file, and then went to sleep for a while. During sleep, the following behavior was seen in other programs:
• The same file could be edited by some editors, and the results written back to disk! Using strace(1), we can see that the editor wrote the new contents to a temporary file, removed the original file, and then renamed the temporary file to be the original file. Mandatory record locking has no effect on the unlink function, which allowed this to happen.
• The vi editor was never able to edit the file. It could read the file’s contents, but whenever we tried to write new data to the file, EAGAIN was returned. If we tried to append new data to the file, the write blocked. This behavior from vi is what we expect.
• Using Korn shell’s > and >> operators to overwrite or append to the file resulted in the error “cannot create”.
• Using the same two operators with Bourne shell resulted in an error for >, but >> operator just blocked until the mandatory lock was removed, and then proceeded.
- The difference in handling the append operator occurs because the Korn shell opens the file with O_CREAT and O_APPEND, and specifying O_CREAT generates an error. The Bourne shell doesn’t specify O_CREAT if the file already exists, so the open succeeds but the next write blocks.
- The results depend on the version of the operating system you are using.
- We can run the program in Figure 14.12 to determine whether our system supports mandatory locking.
- This program creates a file and enables mandatory locking for the file. The program then splits into parent and child, with the parent obtaining a write lock on the entire file. The child first sets its descriptor to be nonblocking and then attempts to obtain a read lock on the file, expecting to get an error. This lets us see whether the system returns EACCES or EAGAIN. Next, the child rewinds the file and tries to read from the file. If mandatory locking is provided, the read should return EACCES or EAGAIN(since the descriptor is nonblocking). Otherwise, the read returns the data that it read.
#If the system supports mandatory locking:
$ ./a.out temp.lock
read_lock of already-locked region returns 11
read failed(mandatory locking works): Resource temporarily unavailable
#If the system does not support mandatory locking:
$ ./a.out temp.lock
read_lock of already-locked region returns 35
read OK(no mandatory locking), buf = ab
What happens when two people edit the same file at the same time?
- The normal UNIX text editors do not use record locking, so the final result of the file corresponds to the last process that wrote the file.
- We can provide this locking ourselves: we write a program that is a front end to vi. This program immediately calls fork, and the parent just waits for the child to complete. The child opens the file specified on the command line, enables mandatory locking, obtains a write lock on the entire file, and then executes vi. While vi is running, the file is write locked, so other users can’t modify it. When vi terminates, the parent’s wait returns and our front end terminates.
- But a front-end program of this type doesn’t work. The problem is that it is common practice for editors to read their input file and then close it. A lock is released on a file whenever a descriptor that references that file is closed. So, when the editor closes the file after reading its contents, the lock is gone. There is no way to prevent this from happening in the front-end program.
14.4 I/O Multiplexing(UNP: ch6 & TLPI: ch63)
14.5 Asynchronous I/O
- Signals provide an asynchronous form of notification that something has happened. All systems derived from BSD and System V provide asynchronous I/O, using a signal(SIGPOLL in System V; SIGIO in BSD) to notify the process that something of interest has happened on a descriptor.
- These forms of asynchronous I/O are limited: they don’t work with all file types and they allow the use of only one signal. If we enable more than one descriptor for asynchronous I/O, we cannot tell which descriptor the signal corresponds to when the signal is delivered.
- The conventional I/O function calls are not really synchronous with respect to the I/O: they are synchronous only with respect to the program flow. We call a write synchronous only if the data we write is persistent when we return from the call to the write function.
- When we decide to use asynchronous I/O, we complicate the design of our application by choosing to juggle multiple concurrent operations. A simpler approach may be to use multiple threads, which would allow us to write the program using a synchronous model and let the threads run asynchronously with respect to each other.
- We incur additional complexity when we use the POSIX asynchronous I/O interfaces:
• We have to worry about three sources of errors for every asynchronous operation: one associated with the submission of the operation, one associated with the result of the operation itself, and one associated with the functions used to determine the status of the asynchronous operations.
• The interfaces involve a lot of extra setup and processing rules compared to their conventional counterparts.
• Recovering from errors can be difficult. E.g., if we submit multiple asynchronous writes and one fails, how should we proceed? If the writes are related, we might have to undo the ones that succeeded.
14.5.1 System V Asynchronous I/O
14.5.2 BSD Asynchronous I/O
14.5.3 POSIX Asynchronous I/O
- The POSIX asynchronous I/O interfaces give us a consistent way to perform asynchronous I/O, regardless of the type of file.
- The asynchronous I/O interfaces use AIO control blocks to describe I/O operations. The aiocb structure defines an AIO control block. It contains at least the fields shown in the following structure:
struct aiocb
{
int aio_fildes;
off_t aio_offset;
volatile void* aio_buf;
size_t aio_nbytes;
int aio_reqprio;
struct sigevent aio_sigevent;
int aio_lio_opcode;
};
- aio_fildes: the file descriptor open for the file to be read or written.
- aio_offset: the offset that read or write starts at.
- aio_buf:
- For read, data is copied to the buffer that begins at the address specified by aio_buf.
- For write, data is copied from this buffer.
- aio_nbytes: contains the number of bytes to read or write.
- We have to provide an explicit offset when we perform asynchronous I/O. The asynchronous I/O interfaces don’t affect the file offset maintained by the operating system. This won’t be a problem as long as we never mix asynchronous I/O functions with conventional I/O functions on the same file in a process. If we write to a file opened in append mode(O_APPEND) using an asynchronous interface, the aio_offset field in the AIO control block is ignored by the system.
- aio_reqprio: a hint that gives applications a way to suggest an ordering for the asynchronous I/O requests. The system has little control over the exact ordering, so there is no guarantee that the hint will be honored.
- aio_lio_opcode: used only with list-based asynchronous I/O.
- aio_sigevent: controls how the application is notified about the completion of the I/O event. It is described by a sigevent structure.
struct sigevent
{
int sigev_notify;
int sigev_signo;
union sigval sigev_value;
void(*sigev_notify_function)(union sigval);
pthread_attr_t* sigev_notify_attributes;
};
- sigev_notify controls the type of notification. It can take on one of three values.
- SIGEV_NONE
The process is not notified when the asynchronous I/O request completes.
- SIGEV_SIGNAL
The signal specified by sigev_signo is generated when the asynchronous I/O request completes. If the application has elected to catch the signal and has specified the SA_SIGINFO flag when establishing the signal handler, the signal is queued(if system supports queued signals). The signal handler is passed a siginfo structure whose si_value field is set to sigev_value(if SA_SIGINFO is used).
- SIGEV_THREAD
The function specified by sigev_notify_function is called when the asynchronous I/O request completes. It is passed sigev_value as its only argument. The function is executed in a separate thread in a detached state, unless sigev_notify_attributes is set to the address of a pthread attribute structure specifying alternative attributes for the thread.
- To perform asynchronous I/O, we need to initialize an AIO control block and call either aio_read to make an asynchronous read or aio_write to make an asynchronous write.
#include <aio.h>
int aio_read(struct aiocb *aiocb);
int aio_write(struct aiocb *aiocb);
Both return: 0 if OK, -1 on error
- When these functions return success, the asynchronous I/O request has been queued for processing by the operating system. The return value has no relation with the result of the actual I/O operation. While the I/O operation is pending, we have to be careful to ensure that the AIO control block and data buffer remain stable; their underlying memory must remain valid and we can’t reuse them until the I/O operation completes.
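A minimal round trip through these interfaces, queuing a read, polling completion with aio_error, and harvesting the count with aio_return, might look like this (a sketch; link with -lrt on older glibc):

```c
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Read nbytes at offset asynchronously and wait (by polling) for
   completion; returns the byte count, or -1 with errno set. */
ssize_t read_async(int fd, void *buf, size_t nbytes, off_t offset)
{
    struct aiocb cb;
    int err;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = nbytes;
    cb.aio_offset = offset;                     /* explicit offset required */
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no completion signal */

    if (aio_read(&cb) < 0)                      /* queue the request */
        return -1;

    /* cb and buf must stay valid until the operation completes */
    while ((err = aio_error(&cb)) == EINPROGRESS)
        ;                                       /* a real program would do work here */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return aio_return(&cb);                     /* call exactly once */
}
```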
- To force all pending asynchronous writes to persistent storage without waiting, we can set up an AIO control block and call the aio_fsync function.
#include <aio.h>
int aio_fsync(int op, struct aiocb *aiocb);
Returns: 0 if OK, -1 on error
- aio_fildes in the AIO control block indicates the file whose asynchronous writes are synched.
- If op =
- O_DSYNC: the operation behaves like a call to fdatasync.
- O_SYNC: the operation behaves like a call to fsync.
- The aio_fsync operation returns when the synch is scheduled. The data won’t be persistent until the asynchronous synch completes. The AIO control block controls how we are notified.
- To determine the completion status of an asynchronous read, write, or synch operation, we need to call the aio_error function.
#include <aio.h>
int aio_error(const struct aiocb *aiocb);
Returns:(see following)
- The return value is one of the following:
- 0. The asynchronous operation completed successfully. We need to call the aio_return function to obtain the return value from the operation.
- -1. The call to aio_error failed. In this case, errno tells us why.
- EINPROGRESS. The asynchronous read, write, or synch is still pending.
- Anything else. Any other return value gives us the error code corresponding to the failed asynchronous operation.
- If the asynchronous operation succeeded, we can call the aio_return function to get the asynchronous operation’s return value.
#include <aio.h>
ssize_t aio_return(const struct aiocb *aiocb);
Returns:(see following)
- We should not call the aio_return function until the asynchronous operation completes. The results are undefined until the operation completes. We can call aio_return only once per asynchronous I/O operation. Once we call this function, the operating system is free to deallocate the record containing the I/O operation’s return value.
- The aio_return function will return -1 and set errno if aio_return itself fails. Otherwise, it will return the results of the asynchronous operation. In this case, it will return whatever read, write, or fsync would have returned on success if one of those functions had been called.
- We use asynchronous I/O when we have other processing to do and we don’t want to block while performing the I/O operation. When we have completed the processing and we still have asynchronous operations outstanding, we can call the aio_suspend function to block until an operation completes.
#include <aio.h>
int aio_suspend(const struct aiocb *const list[], int nent, const struct timespec *timeout);
Returns: 0 if OK, -1 on error
- One of three things can cause aio_suspend to return.
- If we are interrupted by a signal, it returns -1 with errno set to EINTR.
- If the time limit specified by the timeout argument expires without any of the I/O operations completing, it returns -1 with errno set to EAGAIN(when timeout = NULL: block forever).
- If any of the I/O operations complete, aio_suspend returns 0. If all asynchronous I/O operations are complete when we call aio_suspend, then aio_suspend will return without blocking.
- The list argument is a pointer to an array of AIO control blocks and the nent argument indicates the number of entries in the array. Null pointers in the array are skipped; the other entries must point to AIO control blocks that have been used to initiate asynchronous I/O operations.
- When we have pending asynchronous I/O operations that we no longer want to complete, we can attempt to cancel them with the aio_cancel function.
#include <aio.h>
int aio_cancel(int fd, struct aiocb *aiocb);
Returns:(see following)
- The fd argument specifies the file descriptor with the outstanding asynchronous I/O operations.
- If aiocb = NULL: the system attempts to cancel all outstanding asynchronous I/O operations on the file. Otherwise, the system attempts to cancel the single asynchronous I/O operation described by the AIO control block. We say that the system “attempts” to cancel the operations because there is no guarantee that the system will be able to cancel any operations that are in progress.
- The aio_cancel function can return one of four values:
- AIO_ALLDONE. All of the operations completed before the attempt to cancel them.
- AIO_CANCELED. All of the requested operations have been canceled.
- AIO_NOTCANCELED. At least one of the requested operations could not be canceled.
- -1. The call to aio_cancel failed. The error code will be stored in errno.
- If an asynchronous I/O operation is successfully canceled, calling the aio_error function on the corresponding AIO control block will return the error ECANCELED. If the operation can’t be canceled, then the corresponding AIO control block is unchanged by the call to aio_cancel.
- The lio_listio function submits a set of I/O requests described by a list of AIO control blocks.
#include <aio.h>
int lio_listio(int mode, struct aiocb *restrict const list[restrict], int nent, struct sigevent *restrict sigev);
Returns: 0 if OK, −1 on error
- mode determines whether the I/O is truly asynchronous. When it =
- LIO_WAIT: this function won’t return until all the I/O operations specified by the list are complete. In this case, the sigev argument is ignored.
- LIO_NOWAIT: this function returns as soon as the I/O requests are queued. The process is notified asynchronously when all of the I/O operations complete, as specified by the sigev argument. If we don’t want to be notified, we can set sigev to NULL. The individual AIO control blocks may enable asynchronous notification when an individual operation completes. The asynchronous notification specified by the sigev argument is in addition to these, and is sent only when all of the I/O operations complete.
- list points to a list of AIO control blocks specifying the I/O operations to perform. nent specifies the number of elements in the array. The list of AIO control blocks can contain NULL pointers; these entries are ignored.
- In each AIO control block, the aio_lio_opcode field specifies whether the operation is a read(LIO_READ), a write(LIO_WRITE), or a no-op(LIO_NOP, which is ignored). A read is treated as if the corresponding AIO control block had been passed to the aio_read function. A write is treated as if the AIO control block had been passed to aio_write.
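A batch of two reads submitted synchronously with LIO_WAIT can be sketched as (link with -lrt on older glibc):

```c
#include <aio.h>
#include <string.h>

/* Submit two reads from the same file as one batch and wait
   for both to complete (LIO_WAIT, so sigev is ignored). */
int read_pair(int fd, char *b1, size_t n1, off_t off1,
              char *b2, size_t n2, off_t off2)
{
    struct aiocb cb1, cb2;
    struct aiocb *list[2] = { &cb1, &cb2 };

    memset(&cb1, 0, sizeof(cb1));
    cb1.aio_fildes = fd;
    cb1.aio_buf = b1;
    cb1.aio_nbytes = n1;
    cb1.aio_offset = off1;
    cb1.aio_lio_opcode = LIO_READ;   /* as if passed to aio_read */

    cb2 = cb1;                       /* same file, different region */
    cb2.aio_buf = b2;
    cb2.aio_nbytes = n2;
    cb2.aio_offset = off2;

    return lio_listio(LIO_WAIT, list, 2, NULL);
}
```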
- Implementations can limit the number of asynchronous I/O operations we are allowed to have outstanding. The limits are runtime invariants, and are summarized in Figure 14.19.
- The program in Figure 14.20 translates a file using the ROT-13 algorithm: rotates the characters ’a’ to ’z’ and ’A’ to ’Z’ by 13 positions, but leaves all other characters unchanged.
- The I/O portion of the program: read a block from the input file, translate it, and then write the block to the output file. We repeat this until we hit the end of file and read returns zero. The program in Figure 14.21 shows how to perform the same task using the equivalent asynchronous I/O functions.
- Since we use eight buffers, we can have up to eight asynchronous I/O requests pending. But this might reduce performance: if the reads are presented to the file system out of order, it can defeat the operating system’s read-ahead algorithm.
- Before we can check the return value of an operation, we need to make sure the operation has completed. When aio_error returns a value other than EINPROGRESS or -1, we know the operation is complete. Excluding these values, if the return value is anything other than 0, then we know the operation failed. Once we’ve checked these conditions, it is safe to call aio_return to get the return value of the I/O operation.
- As long as we have work to do, we can submit asynchronous I/O operations. When we have an unused AIO control block, we can submit an asynchronous read request. When a read completes, we translate the buffer contents and then submit an asynchronous write request. When all AIO control blocks are in use, we wait for an operation to complete by calling aio_suspend.
- When we write a block to the output file, we retain the same offset at which we read the data from the input file. Consequently, the order of the writes doesn’t matter. This strategy works only because each character in the input file has a corresponding character in the output file at the same offset; we neither add nor delete characters in the output file.
- We don’t use asynchronous notification in this example because it is easier to use a synchronous programming model. If we had something else to do while the I/O operations were in progress, then the additional work could be folded into the for loop. If we needed to prevent this additional work from delaying the task of translating the file, we might have to structure the code to use some form of asynchronous notification. With multiple tasks, we need to prioritize the tasks before deciding how the program should be structured.
14.6 readv and writev Functions
- The readv and writev functions let us read into and write from multiple non-contiguous buffers in a single function call. These operations are called scatter read and gather write.
#include <sys/uio.h>
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
Both return: number of bytes read or written, -1 on error
- iov is a pointer to an array of iovec structures:
struct iovec {
    void   *iov_base;   /* starting address of buffer */
    size_t  iov_len;    /* size of buffer */
};
- iovcnt is the number of elements in the iov array. It is limited to IOV_MAX(Figure 2.11). Figure 14.22 shows a diagram relating the arguments to these two functions and the iovec structure.
- The readv function scatters the data into the buffers in order, always filling one buffer before proceeding to the next. readv returns the total number of bytes that were read. A count of 0 is returned if there is no more data and the end of file is encountered.
- The writev function gathers the output data from the buffers in order: iov[0], iov[1], through iov[iovcnt-1]; writev returns the total number of bytes output, which should normally equal the sum of all the buffer lengths.
- In Section 20.8, in the function _db_writeidx, we need to write two buffers consecutively to a file. The second buffer to output is an argument passed by the caller, and the first buffer is one we create, containing the length of the second buffer and a file offset of other information in the file. There are three ways we can do this.
- Call write twice, once for each buffer.
- Allocate a buffer of our own that is large enough to contain both buffers, and copy both into the new buffer. We then call write once for this new buffer.
- Call writev to output both buffers.
- The solution we use in Section 20.8 is to use writev, but it’s instructive to compare it to the other two solutions. Figure 14.23 shows the results from the three methods just described.
- The test program that we measured output a 100-byte header followed by 200 bytes of data. This was done 1,048,576 times, generating a 300-megabyte file. The test program has three separate cases: one for each of the techniques measured in Figure 14.23. We used times(Section 8.17) to obtain the user CPU time, system CPU time, and wall clock time before and after the writes. All three times are shown in seconds.
- As we expect, the system time increases when we call write twice, compared to calling either write or writev once. This correlates with the results in Figure 3.6.
- Next, note that the sum of the CPU times(user plus system) is slightly less when we do a buffer copy followed by a single write compared to a single call to writev. With the single write, we copy the buffers to a staging buffer at user level, and then the kernel will copy the data to its internal buffers when we call write. With writev, we should do less copying, because the kernel only needs to copy the data directly into its staging buffers. The fixed cost of using writev for such small amounts of data, however, is greater than the benefit. As the amount of data to transfer increases, copying the buffers in our own program becomes more expensive, and the writev alternative becomes more attractive.
- In summary, we should always try to use the fewest system calls necessary to get the job done. If we are writing small amounts of data, we will find it less expensive to copy the data ourselves and use a single write instead of using writev. We might find, however, that the performance benefits aren’t worth the extra complexity cost needed to manage our own staging buffers.
14.7 readn and writen Functions
- Pipes, FIFOs, and some devices(notably terminals and networks) have the following two properties.
- A read operation may return less than asked for, even though we have not encountered the end of file. This is not an error, and we should continue reading from the device.
- A write operation can return less than we specified. It’s not an error, and we should continue writing the remainder of the data.
- We’ll never see this happen when reading or writing a disk file, except when the file system runs out of space or we hit our quota limit and we can’t write all that we requested.
- The readn and writen functions call read or write as many times as required to read or write the entire N bytes of data.
#include "apue.h"
ssize_t readn(int fd, void *buf, size_t nbytes);
ssize_t writen(int fd, const void *buf, size_t nbytes);
Both return: number of bytes read or written, -1 on error
- We call writen whenever we’re writing to one of the file types that we mentioned, but we call readn only when we know ahead of time that we will be receiving a certain number of bytes. Figure 14.24 shows implementations of readn and writen.
- If we encounter an error and have previously read or written any data, we return the amount of data transferred instead of the error. If we reach the end of file while reading, we return the number of bytes copied to the caller’s buffer if we already read some data successfully and have not yet satisfied the amount requested.
14.8 Memory-Mapped I/O
- Memory-mapped I/O lets us map a file on disk into a buffer in memory, so that:
- When we fetch bytes from the buffer, the corresponding bytes of the file are read.
- When we store data in the buffer, the corresponding bytes are automatically written to the file.
- We use the mmap function to tell the kernel to map a given file to a region in memory.
#include <sys/mman.h>
void *mmap(void *addr, size_t len, int prot, int flag, int fd, off_t off);
Returns: starting address of mapped region if OK, MAP_FAILED on error
- addr lets us specify the address where we want the mapped region to start. We normally set this value to 0 to allow the system to choose the starting address. The return value is the starting address of the mapped area.
- fd is the file descriptor specifying the file that is to be mapped. We have to open this file before we can map it into the address space.
- len is the number of bytes to map, and off is the starting offset in the file of the bytes to map.
- prot specifies the protection of the mapped region. Figure 14.25
- We can specify the protection as either PROT_NONE or the bitwise OR of any combination of PROT_READ, PROT_WRITE, and PROT_EXEC. The protection specified for a region can’t allow more access than the open mode of the file. E.g., we can’t specify PROT_WRITE if the file was opened read-only.
- Figure 14.26 shows a memory-mapped file. start addr is the return value from mmap. We show the mapped memory between the heap and the stack: this is an implementation detail and may differ from one implementation to the next.
- The flag argument affects various attributes of the mapped region. It can include:
- MAP_FIXED The return value must equal addr. Use of this flag is discouraged since it hinders portability. If this flag is not specified and if addr is nonzero, then the kernel uses addr as a hint of where to place the mapped region, but there is no guarantee that the requested address will be used. Maximum portability is obtained by specifying addr as 0.
- MAP_SHARED This flag specifies that store operations modify the mapped file: a store operation is equivalent to a write to the file. Either this flag or the next (MAP_PRIVATE), but not both, must be specified.
- MAP_PRIVATE This flag says that store operations into the mapped region cause a private copy of the mapped file to be created. All successive references to the mapped region then reference the copy. One use of this flag is for a debugger that maps the text portion of a program file but allows the user to modify the instructions. Any modifications affect the copy, not the original program file.
- The value of off and the value of addr(if MAP_FIXED is specified) are usually required to be multiples of the system’s virtual memory page size. This value can be obtained from the sysconf function(Section 2.5.4) with an argument of _SC_PAGESIZE or _SC_PAGE_SIZE. Since off and addr are often specified as 0, this requirement is not a big deal.
- Since the starting offset of the mapped file is tied to the system’s virtual memory page size, what happens if the length of the mapped region isn’t a multiple of the page size? Assume the file size is 12 bytes and the system’s page size is 512 bytes. In this case, the system normally provides a mapped region of 512 bytes, and the final 500 bytes of this region are set to 0. We can modify the final 500 bytes, but any changes we make to them are not reflected in the file. Thus we cannot use mmap to append to a file; we must first grow the file(Figure 14.27).
- Two signals are normally used with mapped regions.
- SIGSEGV is used to indicate that we have tried to access memory that is not available to us. This signal can also be generated if we try to store into a mapped region that we specified to mmap as read-only.
- SIGBUS can be generated if we access a portion of the mapped region that does not make sense at the time of the access. E.g., assume that we map a file using the file’s size, but before we reference the mapped region, the file’s size is truncated by some other process. If we then try to access the memory-mapped region corresponding to the end portion of the file that was truncated, we’ll receive SIGBUS.
- A memory-mapped region is inherited by a child across a fork(since it’s part of the parent’s address space), but not inherited by the new program across an exec. We can change the permissions on an existing mapping by calling mprotect.
#include <sys/mman.h>
int mprotect(void *addr, size_t len, int prot);
Returns: 0 if OK, −1 on error
- The legal values for prot are the same as those for mmap(Figure 14.25). Systems may require the address argument to be an integral multiple of the system’s page size.
- When we modify pages that we’ve mapped into our address space using the MAP_SHARED flag, the changes aren’t written back to the file immediately. The kernel daemons decide when dirty pages are written back based on (a) system load and (b) configuration parameters meant to limit data loss in the event of a system failure. When the changes are written back, they are written in units of pages. If we modify only one byte in a page, when the change is written back to the file, the entire page will be written.
- If the pages in a shared mapping have been modified, we can call msync to flush the changes to the file that backs the mapping.
#include <sys/mman.h>
int msync(void *addr, size_t len, int flags);
Returns: 0 if OK, -1 on error
- If the mapping is private, the file mapped is not modified. The address must be aligned on a page boundary.
- The flags argument controls how the memory is flushed. Either MS_ASYNC or MS_SYNC must be specified.
- MS_ASYNC: schedule the pages to be written and return immediately.
- MS_SYNC: wait for the writes to complete before returning.
- An optional flag, MS_INVALIDATE, lets us tell the operating system to discard any pages that are out of sync with the underlying storage. Some implementations will discard all pages in the specified range when we use this flag, but this behavior is not required.
- A memory-mapped region is automatically unmapped when the process terminates, or we can unmap a region directly by calling the munmap function. Closing the file descriptor used when we mapped the region does not unmap the region.
#include <sys/mman.h>
int munmap(void *addr, size_t len);
Returns: 0 if OK, -1 on error
- The call to munmap does not cause the contents of the mapped region to be written to the disk file.
- The updating of the disk file for a MAP_SHARED region happens automatically by the kernel’s virtual memory algorithm sometime after we store into the memory-mapped region.
- Modifications to memory in a MAP_PRIVATE region are discarded when the region is unmapped.
- The program in Figure 14.27 copies a file using memory-mapped I/O.
- We first open both files and then call fstat to obtain the size of the input file. We need this size for the call to mmap for the input file, and we also need to set the size of the output file. We call ftruncate to set the size of the output file. If we don’t set the output file’s size, the call to mmap for the output file is successful, but the first reference to the associated memory region generates a SIGBUS signal.
- We then call mmap for each file, to map the file into memory, and finally call memcpy to copy data from the input buffer to the output buffer. We copy at most 1 GB of data at a time to limit the amount of memory we use(it might not be possible to map the entire contents of a very large file if the system doesn’t have enough memory). Before mapping the next sections of the files, we unmap the previous sections.
- As the bytes of data are fetched from the input buffer(src), the input file is automatically read by the kernel; as the data is stored in the output buffer(dst), the data is automatically written to the output file.
- Exactly when the data is written to the file depends on the system’s page management algorithms. Some systems have daemons that write dirty pages to disk slowly over time. If we want to ensure that the data is safely written to the file, we need to call msync with the MS_SYNC flag before exiting.
- Let’s compare this memory-mapped file copy to a copy that is done by calling read and write(with a buffer size of 8,192). Figure 14.28 shows the results. The times are given in seconds, and the size of the file copied was 300 MB. Note that we don’t sync the data to disk before exiting.
- For both Linux 3.2.0 and Solaris 10, the total CPU time(user + system) is almost the same for both approaches. On Solaris, copying using mmap and memcpy takes more user time but less system time than copying using read and write. On Linux, the results are similar for the user time, but the system time for using read and write is slightly better than using mmap and memcpy. The two versions do the same work, but they go about it differently.
- The major difference is that with read and write, we execute a lot more system calls and do more copying than with mmap and memcpy. With read and write, we copy the data from the kernel’s buffer to the application’s buffer(read), and then copy the data from the application’s buffer to the kernel’s buffer(write). With mmap and memcpy, we copy the data directly from one kernel buffer mapped into our address space into another kernel buffer mapped into our address space. This copying occurs as a result of page fault handling when we reference memory pages that don’t yet exist (there is one fault per page read and one fault per page written). If the overhead for the system call and extra copying differs from the page fault overhead, then one approach will perform better than the other.
- On Linux 3.2.0, as far as elapsed time is concerned, the two versions of the program show a large difference in clock time: the version using read and write completes four times faster than the version using mmap and memcpy. However, on Solaris 10, the version with mmap and memcpy is faster than the version with read and write. If the CPU times are almost the same, then why would the clock times differ? One possibility is that we might have to wait longer for I/O to complete in one version. This wait time is not counted as CPU processing time. Another possibility is that some system processing might not be counted against our program: e.g., the processing done by system daemons to write pages to disk. As we need to allocate pages for reading and writing, these system daemons will help make pages available. If the page writes are random instead of sequential, then it will take longer to write them out to disk, so we will need to wait longer before the pages become available for us to reuse.
- Depending on the system, memory-mapped I/O can be more efficient when copying one regular file to another. There are limitations. We can’t use this technique to copy between certain devices(such as a network device or a terminal device), and we have to be careful if the size of the underlying file could change after we map it. Nevertheless, some applications can benefit from memory-mapped I/O, as it can often simplify the algorithms, since we manipulate memory instead of reading and writing a file. One example is the manipulation of a frame buffer device that references a bit-mapped display.
14.9 Summary
Exercises(Redo)