Asynchronous vectored I/O operations

The English original is https://lwn.net/Articles/170954/, which describes vectored operations. Analysis of the corresponding kernel code has been inserted into the middle of the original text here.

The file_operations structure contains pointers to the basic I/O operations exported by filesystems and char device drivers. This structure currently contains three different methods for performing a read operation:

    ssize_t (*read) (struct file *filp, char __user *buffer, size_t size, 
                     loff_t *pos);
    ssize_t (*readv) (struct file *filp, const struct iovec *iov, 
                      unsigned long niov, loff_t *pos);
    ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer, 
                         size_t size, loff_t pos);

Normal read operations end up with a call to the read() method, which reads a single segment from the source into the supplied buffer. The readv() method implements the system call by the same name; it will read one segment and scatter it into several user buffers, each of which is described by an iovec structure. Finally, aio_read() is invoked in response to asynchronous I/O requests; it reads a single segment into the supplied buffer, possibly returning before the operation is complete. There is a similar set of three methods for write operations.

========================= Added ============================

For the operation described above, which reads a single segment into a single user buffer, the opcode is IO_CMD_PREAD. For the vectored read operation the corresponding opcode is IO_CMD_PREADV; it also reads a single segment (of a disk partition or a file), but the data read is scattered into multiple user buffers, each of which is described by a struct iovec:

struct iovec iov = { .iov_base = buf /* the user buffer that receives the data */, .iov_len = count /* how many bytes of the read go into this buffer, i.e. the buffer's length */ };

As can also be seen from io_prep_preadv() below, only one segment is read: starting at offset in the file represented by fd, enough data to fill the iovcnt buffers is read, and it is stored into the multiple user buffers pointed to by struct iovec *iov (completing the I/O actually means having DMA move the data from the backing device into the kernel pages that the user buffers map to). The number of user buffers is given by the iovcnt parameter:

[Figure 1: io_prep_preadv()]
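
The screenshot showed the libaio helper itself; for reference, this is roughly what io_prep_preadv() looks like in libaio's libaio.h (paraphrased from memory, so treat the installed header as authoritative):

    /* Fill in an iocb describing one vectored read of iovcnt user buffers,
     * starting at byte offset 'offset' of the file referred to by fd. */
    static inline void io_prep_preadv(struct iocb *iocb, int fd,
                                      const struct iovec *iov, int iovcnt,
                                      long long offset)
    {
        memset(iocb, 0, sizeof(*iocb));
        iocb->aio_fildes = fd;
        iocb->aio_lio_opcode = IO_CMD_PREADV;  /* vectored read opcode */
        iocb->aio_reqprio = 0;
        iocb->u.v.vec = iov;         /* array describing the user buffers */
        iocb->u.v.nr = iovcnt;       /* number of buffers */
        iocb->u.v.offset = offset;   /* single starting offset in the file */
    }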

The function aio_setup_rw() is called on the aio path; for vectored operations it calls import_iovec() to build an iterator over the multiple user-buffer/length pairs, which is then used for the subsequent I/O:

[Figure 2: aio_setup_rw() kernel code]

[Figure 3: import_iovec() kernel code]
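
The screenshots showed the kernel side of this; as a hedged sketch, aio_setup_rw() in a recent kernel's fs/aio.c looks approximately like the following (compat handling omitted; details vary between kernel versions):

    /* Paraphrased sketch of aio_setup_rw() from fs/aio.c. For a vectored
     * opcode it calls import_iovec() to copy the user's iovec array into
     * the kernel and set up an iov_iter over it; for a scalar opcode it
     * builds a one-segment iterator around the single buffer. */
    static ssize_t aio_setup_rw(int rw, const struct iocb *iocb,
                                struct iovec **iovec, bool vectored,
                                struct iov_iter *iter)
    {
        void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
        size_t len = iocb->aio_nbytes;

        if (!vectored) {
            /* single buffer: wrap it in a one-segment iterator */
            ssize_t ret = import_single_range(rw, buf, len, *iovec, iter);
            *iovec = NULL;
            return ret;
        }
        /* vectored: aio_buf points to the user's iovec array and
         * aio_nbytes is the number of entries in it */
        return import_iovec(rw, buf, len, UIO_FASTIOV, iovec, iter);
    }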

For a vectored read, the total amount of data to read is the sum of what all of the user buffers in the array can hold, i.e. the sum of the iov_len fields of the struct iovec entries that describe the buffers:

struct iovec iov = { .iov_base = buf /* the user buffer that receives the data */, .iov_len = count /* how many bytes of the read go into this buffer */ };

The rw_copy_check_uvector() function:

[Figure 4: rw_copy_check_uvector() kernel code]
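
The screenshot of rw_copy_check_uvector() is not reproduced; the part relevant here, reduced to a simplified user-space paraphrase (the real kernel function also copies the iovec array in from user space and checks access permissions), is the loop that sums iov_len over all segments:

    #include <errno.h>
    #include <limits.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Simplified paraphrase of the length accounting in
     * rw_copy_check_uvector(): walk the iovec array, reject lengths that
     * would overflow, and return the total number of bytes described by
     * all of the segments. */
    static ssize_t total_iov_length(const struct iovec *iov, unsigned long nr_segs)
    {
        ssize_t total = 0;
        unsigned long seg;

        for (seg = 0; seg < nr_segs; seg++) {
            ssize_t len = (ssize_t)iov[seg].iov_len;

            if (len < 0 || len > SSIZE_MAX - total)
                return -EINVAL;   /* total would overflow */
            total += len;
        }
        return total;
    }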

Purpose and advantages of vectored I/O:
Vectored I/O (also called scatter/gather I/O) operates on a vector of buffers (made up of multiple user buffers) and allows a single system call to read data from, or write data to, disk using several buffers at once.
When a vectored read is performed, bytes are first read from the source into the first buffer. Then bytes of the source, starting at an offset equal to the first buffer's length and running for the second buffer's length, are read into the second buffer, and so on, as if the source filled the buffers one after another. A vectored write works in a similar way: the buffers are written as if they had been concatenated before the write.
This approach makes it possible to read in smaller chunks, and therefore avoids allocating a large contiguous memory region for one big block, while at the same time reducing the number of system calls needed to fill all of those buffers with data from disk. Another advantage is that both reads and writes are atomic: the kernel prevents other processes from performing I/O on the same descriptor during the read or write, which guarantees data integrity.
From a developer's point of view, if the data is laid out in the file in a known way, for example split into a fixed-size header and several fixed-size blocks, a single call can fill the separate buffers assigned to those parts: with vectored I/O's multiple user buffers, the data of the different blocks is read into the corresponding buffers. The amount each buffer holds is whatever its struct iovec specifies; it only has to match the data's layout in the file (a code sketch follows below).
This sounds useful, but somehow only a few databases use vectored I/O. That is probably because a general-purpose database handles a large number of files at the same time, trying to keep every running operation live and its latency low, so data is accessed and cached block by block. Vectored I/O is more useful for analytical workloads and/or columnar databases, where data is stored contiguously on disk and can be processed in parallel over sparse blocks. One example is Apache Arrow.
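
As a minimal user-space sketch of the header-plus-fixed-blocks scenario above, using the synchronous readv() system call (the file name "data.bin" and the record sizes are made up for illustration):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Hypothetical on-disk layout: a 16-byte header followed by two
     * 512-byte blocks. One readv() call scatters the data directly into
     * three separate buffers; the total requested length is the sum of
     * the iov_len fields (16 + 512 + 512 = 1040 bytes). */
    int main(void)
    {
        char header[16], block1[512], block2[512];
        struct iovec iov[3] = {
            { .iov_base = header, .iov_len = sizeof(header) },
            { .iov_base = block1, .iov_len = sizeof(block1) },
            { .iov_base = block2, .iov_len = sizeof(block2) },
        };
        int fd = open("data.bin", O_RDONLY);
        ssize_t n;

        if (fd < 0)
            return 1;
        n = readv(fd, iov, 3);    /* fills header, then block1, then block2 */
        printf("read %zd bytes in one system call\n", n);
        close(fd);
        return 0;
    }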

Differences between single I/O and vectored I/O when calling the aio interface (https://blog.csdn.net/weixin_34008784/article/details/89821367):

[Figure 5: single I/O vs. vectored I/O through the aio interface]

The single-I/O model is easy to understand. For vectored I/O, however, there is in my view one point that neither the man page nor the header file makes clear: the meaning of the last parameter, offset, of io_prep_pwritev()/io_prep_preadv(). It is the physical offset on the disk or in the file at which the earliest I/O in the I/O vector starts; the next I/O then starts reading or writing at that offset plus the length of the I/O that has just completed. The total length of the operation is therefore the sum of the iov_len fields of all the members of the I/O vector.
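
To make the calling difference concrete, here is a hedged libaio sketch that submits one single-buffer read (IO_CMD_PREAD) and one vectored read (IO_CMD_PREADV); the file name, buffer sizes and offsets are placeholders, and error handling is trimmed:

    #include <fcntl.h>
    #include <libaio.h>
    #include <string.h>
    #include <sys/uio.h>

    /* Submit two AIO requests: a scalar read of 4096 bytes at offset 0,
     * and a vectored read starting at offset 8192 whose total length is
     * the sum of the two iov_len fields (1024 + 3072 bytes). */
    int submit_examples(void)
    {
        io_context_t ctx;
        struct iocb scalar, vectored, *iocbs[2];
        static char buf[4096], part1[1024], part2[3072];
        struct iovec iov[2] = {
            { .iov_base = part1, .iov_len = sizeof(part1) },
            { .iov_base = part2, .iov_len = sizeof(part2) },
        };
        int fd = open("data.bin", O_RDONLY);

        memset(&ctx, 0, sizeof(ctx));
        io_setup(8, &ctx);

        io_prep_pread(&scalar, fd, buf, sizeof(buf), 0);   /* single I/O */
        io_prep_preadv(&vectored, fd, iov, 2, 8192);       /* vectored I/O */

        iocbs[0] = &scalar;
        iocbs[1] = &vectored;
        return io_submit(ctx, 2, iocbs);
    }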

======================================================= 

Back in November, Zach Brown posted a vectored AIO patch intended to provide a combination of the vectored (readv()/writev()) operations and asynchronous I/O. To that end, it defined a couple of new AIO operations for user space, and added two more file_operations methods: aio_readv() and aio_writev(). There was some resistance to the idea of creating yet another pair of operations, and a feeling that there was a better way. The result, after work by Christoph Hellwig and Badari Pulavarty, is a new vectored AIO patch with a much simpler interface - at the cost of a significant API change.

The observation was made that a number of subsystems use vectored I/O operations internally in all cases, even in the case of a "scalar" read() or write() call. For example, the read() function in the current mainline pipe driver is:

    static ssize_t
    pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
    {
	struct iovec iov = { .iov_base = buf, .iov_len = count };
	return pipe_readv(filp, &iov, 1, ppos);
    }

Here, the read() method is essentially superfluous; it is provided simply because the API requires it. So, it was asked, rather than adding more vectored I/O operations, why not just "vectorize" the standard API? The resulting patch set brings about that change in a couple of steps.

The first of those is to change the prototypes for the asynchronous I/O methods to:

    ssize_t (*aio_read) (struct kiocb *iocb, const struct iovec *iov, 
             unsigned long niov, loff_t pos);
    ssize_t (*aio_write) (struct kiocb *iocb, const struct iovec *iov,  
             unsigned long niov, loff_t pos);

Thus, the single buffer has been replaced with an array of iovec structures, each describing one segment of the I/O operation. For the current single-buffer AIO read and write commands, the new code creates a single-entry iovec array and passes it to the new methods. (It's worth noting that, as the code is currently written, that iovec array is no longer valid after aio_read() or aio_write() returns; that array will need to be copied for any operation which remains outstanding when those functions finish).
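
To illustrate that wrapping, here is a hypothetical sketch (the names my_scalar_read() and my_aio_read() are illustrative, not taken from the patch) of a scalar read routed through the new vectorized aio_read() prototype with a one-entry iovec array:

    /* Hypothetical: funnel an ordinary read() through a driver's
     * vectorized aio_read() method by wrapping the single user buffer
     * in a one-entry iovec array. */
    static ssize_t my_scalar_read(struct file *filp, char __user *buf,
                                  size_t count, loff_t *ppos)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = count };
        struct kiocb kiocb;
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);   /* synchronous IOCB */
        kiocb.ki_pos = *ppos;
        ret = my_aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
        if (ret > 0)
            *ppos += ret;
        return ret;
    }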

The prototypes of a couple of VFS helper functions (generic_file_aio_read() and generic_file_aio_write()) have been changed in a similar manner. These changes ripple through every driver and filesystem providing AIO methods, making the patch reasonably large. A second patch then adds two new AIO operations (IOCB_CMD_PREADV and IOCB_CMD_PWRITEV) to the user-space interface, making vectored asynchronous I/O available to applications.

The patch set then goes one step further by eliminating the readv() and writev() methods altogether. With this patch in place, any filesystem or driver which wishes to provide vectored I/O operations must do so via aio_read() and aio_write() instead. Note that this change does not imply that asynchronous operations themselves must be supported - it is entirely permissible (if suboptimal) for aio_read() and aio_write() to operate synchronously at all times. But this patch does make it necessary for modules wishing to provide vectored operations to, at a minimum, provide the file_operations methods for asynchronous I/O. If the AIO methods are not available for a given device or filesystem, a call to readv() or writev() will be emulated through multiple calls to read() or write(), as usual.

Finally, with this patch in place, it is possible for a driver or filesystem to omit the read() and write() methods altogether if the asynchronous versions are provided. If, for example, only aio_read() is provided, all read() and readv() system calls will be handled by the aio_read() method. If, someday, all code implements the AIO methods, the regular read() and write() methods could be removed altogether. That would result in an interface which contained only one method for all read operations (and one more for writes). This change would also realize the vision expressed at the 2003 Kernel Summit that all I/O paths inside the kernel would, in the end, be made asynchronous.

There has been little discussion of the current patch set, so it is hard to predict what may ultimately become of it. Given that it simplifies a core kernel API while simultaneously making it more powerful, however, chances are that some version of this patch will find its way into the kernel eventually.

(For more information on the AIO interface, see this Driver Porting Series article or chapter 15 of LDD3). 

In addition, regarding "Driver porting: supporting asynchronous I/O":

https://lwn.net/Articles/24366/

One of the key "enterprise" features added to the 2.6 kernel is asynchronous I/O (AIO). The AIO facility allows user processes to initiate multiple I/O operations without waiting for any of them to complete; the status of the operations can then be retrieved at some later time. Block and network drivers are already fully asynchronous, and thus there is nothing special that needs to be done to them to support the new asynchronous operations. Character drivers, however, have a synchronous API, and will not support AIO without some additional work. For most char drivers, there is little benefit to be gained from AIO support. In a few rare cases, however, it may be beneficial to make AIO available to your users.

AIO file operations

The first step in supporting AIO (beyond including <linux/aio.h>) is the implementation of three new methods which have been added to the file_operations structure:

    ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer, 
			 size_t count, loff_t pos);
    ssize_t (*aio_write) (struct kiocb *iocb, const char __user *buffer, 
			  size_t count, loff_t pos);
    int (*aio_fsync) (struct kiocb *, int datasync);

For most drivers, the real work will be in the implementation of aio_read() and aio_write(). These functions are analogous to the standard read() and write() methods, with a couple of changes: the file parameter has been replaced with an I/O control block (iocb), and they (usually) need not complete the requested operations immediately. The iocb argument can usually be treated as an opaque cookie used by the AIO subsystem; if you need the struct file pointer for this file descriptor, however, you can find it as iocb->ki_filp.

The aio_ operations can be synchronous. One obvious example is when the requested operation can be completed without blocking. If the operation is complete before aio_read() or aio_write() returns, the return value should be the usual status or error code. So, the following aio_read() method, while being pointless, is entirely correct:

    ssize_t my_aio_read(struct kiocb *iocb, char __user *buffer, 
                        size_t count, loff_t pos)
    {
	return my_read(iocb->ki_filp, buffer, count, &pos);
    }

In some cases, synchronous behavior may actually be required. The so-called "synchronous iocb's" allow the AIO subsystem to be used synchronously when need be. The macro:
    is_sync_kiocb(struct kiocb *iocb)

will return a true value if the request must be handled synchronously.

In most cases, though, it is assumed that the I/O request will not be satisfied immediately by aio_read() or aio_write(). In this case, those functions should do whatever is required to get the operation started, then return -EIOCBQUEUED. Note that any work that must be done within the user process's context must be done before returning; you will not have access to that context later. In order to access the user buffer, you will probably need to either set up a DMA mapping or turn the buffer pointer into a series of struct page pointers before returning. Bear in mind also that there can be multiple asynchronous I/O requests active at any given time. A driver which implements AIO will have to include proper locking (and, probably queueing) to keep these requests from interfering with each other.
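
A minimal skeleton of such an asynchronous aio_read() is shown below; struct my_request, my_map_user_buffer() and my_hw_queue() are assumptions of this sketch rather than kernel APIs, and locking is omitted:

    /* Hypothetical skeleton of an asynchronous aio_read(): do the work
     * that needs the user process's context up front, queue the request
     * for the hardware, and return -EIOCBQUEUED. my_read() is the
     * driver's ordinary synchronous read() method from the earlier
     * example. */
    static ssize_t my_aio_read(struct kiocb *iocb, char __user *buffer,
                               size_t count, loff_t pos)
    {
        struct my_request *req;

        if (is_sync_kiocb(iocb))          /* must behave synchronously */
            return my_read(iocb->ki_filp, buffer, count, &pos);

        req = kmalloc(sizeof(*req), GFP_KERNEL);
        if (!req)
            return -ENOMEM;
        req->iocb = iocb;
        req->count = count;
        req->pos = pos;

        /* Pin the user pages / set up the DMA mapping while we still
         * have access to the user process's context. */
        if (my_map_user_buffer(req, buffer, count)) {
            kfree(req);
            return -EFAULT;
        }

        my_hw_queue(req);                 /* start the hardware on it */
        return -EIOCBQUEUED;              /* completion reported later */
    }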

When the I/O operation completes, you must inform the AIO subsystem of the fact by calling aio_complete():

    int aio_complete(struct kiocb *iocb, long res, long res2);

Here, iocb is, of course, the IOCB you were given when the request was initiated. res is the usual result of an I/O operation: the number of bytes transferred, or a negative error code. res2 is a second status value which will be returned to the user; currently (2.6.0-test9), callers of aio_complete() within the kernel always set res2 to zero. aio_complete() can be safely called in an interrupt handler. Once you have called aio_complete(), you no longer own the IOCB or the user buffer, and should not touch them again.
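
Continuing the sketch above, the completion side, called for example from the driver's interrupt handler once the transfer has finished, might look like this (my_unmap_user_buffer() is again an assumed helper):

    /* Hypothetical completion path for the request queued above; nbytes
     * is the number of bytes transferred or a negative error code. */
    static void my_request_done(struct my_request *req, long nbytes)
    {
        my_unmap_user_buffer(req);        /* undo the DMA mapping */
        aio_complete(req->iocb, nbytes, 0);
        kfree(req);   /* the IOCB and user buffer are no longer ours */
    }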

The aio_fsync() method serves the same purpose as the fsync() method; its purpose is to ensure that all pending data are written to disk. As a general rule, device drivers will not need to implement aio_fsync().

Cancellation

The design of the AIO subsystem includes the ability to cancel outstanding operations. Cancellation may occur as the result of a specific user-mode request, or during the cleanup of a process which has exited. It is worth noting that, as of 2.6.0-test9, no code in the kernel actually performs cancellation. So cancellation may not work properly, and the interface could change in the process of making it work. That said, here is how the interface looks today.

A driver which implements cancellation needs to implement a function for that purpose:

    int my_aio_cancel(struct kiocb *iocb, struct io_event *event);

A pointer to this function can be stored into any IOCB which can be cancelled:

    iocb->ki_cancel = my_aio_cancel;

Should the operation be cancelled, your cancellation function will be called with pointers to the IOCB and an io_event structure. If it is possible to cancel (or successfully complete) the operation prior to returning from the cancellation function, the result of the operation should be stored into the res and res2 fields of the io_event structure, and the function should return zero. A non-zero return value from the cancellation function indicates that cancellation was not possible.
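
For the hypothetical driver sketched earlier, a cancellation function matching this interface might look as follows (my_find_request() and my_dequeue_request() are assumed helpers that look a request up in, and remove it from, the driver's own queue):

    /* Hypothetical cancellation method: succeed only if the request can
     * be pulled back before the hardware has started on it, and report
     * the result through the caller's io_event. */
    static int my_aio_cancel(struct kiocb *iocb, struct io_event *event)
    {
        struct my_request *req = my_find_request(iocb);

        if (!req || !my_dequeue_request(req))
            return -EAGAIN;               /* already in flight */

        event->res = -ECANCELED;
        event->res2 = 0;
        my_unmap_user_buffer(req);
        kfree(req);
        return 0;
    }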
