The iov_iter interface

By Jonathan Corbet
December 9, 2014
One of the most common tasks in the kernel is processing a buffer of data supplied by user space, possibly in several chunks. Perhaps unsurprisingly, this is a task that kernel code often gets wrong, leading to bugs and, possibly, security problems. The kernel contains a primitive (called "iov_iter") meant to make this task simpler. While iov_iter use is mostly confined to the memory-management and filesystem layers currently, it is slowly spreading out into other parts of the kernel. This interface is currently undocumented, a situation this article will attempt to remedy.

The iov_iter concept is not new; it was first added by Nick Piggin for the 2.6.24 kernel in 2007. But there has been an effort over the last year to expand this API and use it in more parts of the kernel; the 3.19 merge window should see it making its first steps into the networking subsystem, for example.

An iov_iter structure is essentially an iterator for working through an iovec structure, defined in <uapi/linux/uio.h> as:

    struct iovec
    {
	void __user *iov_base;
	__kernel_size_t iov_len;
    };

This structure matches the user-space iovec structure defined by POSIX and used with system calls like readv(). As the "vec" portion of the name would suggest, iovec structures tend to come in arrays; as a whole, an iovec array describes a buffer that may be scattered in both physical and virtual memory.
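
To make the "array of chunks" idea concrete, here is a minimal user-space sketch that hands a two-element iovec array to writev(), the gathering counterpart of readv(). It is only an illustration of the user-space side of the structure; the helper name gather_demo() and the use of a pipe are assumptions made for this example.

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather two separate buffers into a single pipe write, then read the
 * result back. Returns the number of bytes read into out, or -1 on error.
 * gather_demo() is a hypothetical helper used only for this sketch.
 */
ssize_t gather_demo(char *out, size_t outlen)
{
	char first[] = "hello, ";
	char second[] = "world";
	struct iovec iov[2];
	int fds[2];
	ssize_t written, got;

	/* Each iovec names one chunk; the chunks need not be adjacent. */
	iov[0].iov_base = first;
	iov[0].iov_len = strlen(first);
	iov[1].iov_base = second;
	iov[1].iov_len = strlen(second);

	if (pipe(fds) < 0)
		return -1;

	/* One system call consumes the whole array, in order. */
	written = writev(fds[1], iov, 2);
	close(fds[1]);
	if (written < 0) {
		close(fds[0]);
		return -1;
	}

	got = read(fds[0], out, outlen);
	close(fds[0]);
	return got;
}
```

On the kernel side, a call like this one arrives as exactly the kind of multi-chunk user-space buffer that the rest of this article is concerned with.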

The actual iov_iter structure is defined in <linux/uio.h>:

    struct iov_iter {
	int type;
	size_t iov_offset;
	size_t count;
	const struct iovec *iov; /* SIMPLIFIED - see below */
	unsigned long nr_segs;
    };

The type field describes the type of the iterator. It is a bitmask containing, among other things, either READ or WRITE depending on whether data is being read into the iterator or written from it. The data direction, thus, refers not to the iterator itself, but to the other part of the data transaction; an iov_iter created with a type of READ will be written to.

Beyond that, iov_offset contains the offset to the first byte of interesting data in the first iovec pointed to by iov. The total amount of data pointed to by the iovec array is stored in count, while the number of iovec structures is stored in nr_segs. Note that most of these fields will change as code "iterates" through the buffer. They describe a cursor into the buffer, rather than the buffer as a whole.

Working with struct iov_iter

Before use, an iov_iter must be initialized to contain an (already populated) iovec with:

    void iov_iter_init(struct iov_iter *i, int direction,
		       const struct iovec *iov, unsigned long nr_segs,
		       size_t count);

Then, for example, data can be moved between the iterator and user space with either of:

    size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
    size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);

The naming here can be a little confusing until one gets the hang of it. A call to copy_to_iter() will copy the number of bytes given by bytes from the buffer at addr to the user-space buffer indicated by the iterator. So copy_to_iter() can be thought of as being like a variant of copy_to_user() that takes an iterator rather than a single buffer. Similarly, copy_from_iter() will copy the data from the user-space buffer to addr. The similarity with copy_to_user() does not extend to the return value, though: copy_to_user() returns the number of bytes not copied, while these functions return the number of bytes that were actually copied.

Note that these calls will "advance" the iterator through the buffer to correspond to the amount of data transferred. In other words, the iov_offset, count, nr_segs, and iov fields of the iterator will all be changed as needed. So two calls to copy_from_iter() will copy two successive areas from user space. Among other things, this means that the code owning the iterator must remember the base address for the iovec array, since the iov value in the iov_iter structure may change.
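
The cursor behavior can be modeled in user space. The sketch below is a toy, not the kernel's implementation: struct toy_cursor and toy_copy_from_iter() are hypothetical names, plain memcpy() stands in for the user-space access, and the toy simply returns the number of bytes it copied.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Toy user-space stand-in for struct iov_iter; not the kernel's code. */
struct toy_cursor {
	size_t iov_offset;		/* offset into the current segment */
	size_t count;			/* bytes remaining overall */
	const struct iovec *iov;	/* current segment */
	unsigned long nr_segs;		/* segments remaining */
};

/* Copy up to bytes from the cursor to addr; return the bytes copied. */
size_t toy_copy_from_iter(void *addr, size_t bytes, struct toy_cursor *i)
{
	char *dst = addr;
	size_t copied = 0;

	if (bytes > i->count)
		bytes = i->count;

	while (copied < bytes && i->nr_segs > 0) {
		const char *src = (const char *)i->iov->iov_base + i->iov_offset;
		size_t avail = i->iov->iov_len - i->iov_offset;
		size_t chunk = bytes - copied < avail ? bytes - copied : avail;

		memcpy(dst + copied, src, chunk);
		copied += chunk;
		i->iov_offset += chunk;
		i->count -= chunk;

		/* Segment used up: step the cursor to the next one. */
		if (i->iov_offset == i->iov->iov_len) {
			i->iov++;
			i->nr_segs--;
			i->iov_offset = 0;
		}
	}
	return copied;
}
```

With two four-byte segments holding "abcd" and "efgh", two successive three-byte calls yield "abc" and then "def"; after the second call the cursor's iov field points at the second segment, which is why the owner of the iterator has to keep its own pointer to the start of the array.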

Various other functions exist. To move data referenced by a page structure into or out of an iterator, use:

    size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
			     struct iov_iter *i);
    size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
			       struct iov_iter *i);

Only the single page provided will be copied to or from, so these functions should not be asked to copy data that would cross the page boundary.

Code running in atomic context can attempt to obtain data from user space with:

    size_t iov_iter_copy_from_user_atomic(struct page *page, struct iov_iter *i,
					  unsigned long offset, size_t bytes);

Since this copy will be done in atomic mode, it will only succeed if the data is already resident in RAM; callers must thus be prepared for a higher-than-normal chance of failure.

If it is necessary to map the user-space buffer into the kernel, one of these calls can be used:

    ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
                               size_t maxsize, unsigned maxpages, size_t *start);
    ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
                                     size_t maxsize, size_t *start);

Either function turns into a call to get_user_pages_fast(), causing (hopefully) the pages to be brought in and their locations stored in the pages array. The difference between them is that iov_iter_get_pages() expects the pages array to be allocated by the caller, while iov_iter_get_pages_alloc() will do the allocation itself. In that case, the array returned in pages must eventually be freed with kvfree(), since it might have been allocated with either kmalloc() or vmalloc().

Advancing through the iterator without moving any data can be done with:

    void iov_iter_advance(struct iov_iter *i, size_t size);

The buffer referred to by an iterator (or a portion thereof) can be cleared with:

    size_t iov_iter_zero(size_t bytes, struct iov_iter *i);

Information about the iterator is available from a number of helper functions:

    size_t iov_iter_single_seg_count(const struct iov_iter *i);
    int iov_iter_npages(const struct iov_iter *i, int maxpages);
    size_t iov_length(const struct iovec *iov, unsigned long nr_segs);

A call to iov_iter_single_seg_count() returns the length of the data in the first segment of the buffer. iov_iter_npages() reports the number of pages occupied by the buffer in the iterator, while iov_length() returns the total data length. The latter function must be used with care, since it trusts the iov_len field in the iovec structures. If that data comes from user space, it could cause integer overflows in the kernel.
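
The overflow risk comes from adding up lengths that user space controls. As a rough user-space illustration of one way to guard such a sum (this is a sketch, not the kernel's own checking code, and checked_iov_length() is a hypothetical name), each addition can be tested against SIZE_MAX before it is performed:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Sum the iov_len fields of an iovec array, refusing to wrap around.
 * Returns 0 and stores the sum in *total on success; returns -1 and
 * leaves *total untouched if the sum would overflow a size_t.
 */
int checked_iov_length(const struct iovec *iov, unsigned long nr_segs,
		       size_t *total)
{
	size_t sum = 0;
	unsigned long seg;

	for (seg = 0; seg < nr_segs; seg++) {
		/* Would sum + iov_len exceed SIZE_MAX? */
		if (iov[seg].iov_len > SIZE_MAX - sum)
			return -1;
		sum += iov[seg].iov_len;
	}
	*total = sum;
	return 0;
}
```

An unchecked sum over segments of length SIZE_MAX and 1 would wrap around to zero, which is the sort of result that leads to undersized allocations or skipped bounds checks later on.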

Not just iovecs

The definition of struct iov_iter shown above does not quite match what is actually found in the kernel. Instead of a single field for the iov array, the real structure has (in 3.18):

    union {
	const struct iovec *iov;
	const struct bio_vec *bvec;
    };

In other words, the iov_iter structure is also set up to work with the BIO structures used by the block layer. Such iterators are marked by having ITER_BVEC included in the type field bitmask. Once such an iterator is created, all of the above calls will work with it as if it were an "ordinary" iterator using iovec structures. Currently, the use of BIO-based iterators in the kernel is minimal; they can only be found in the swap and splice() code.
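
The pattern at work is a union whose active member is selected by a flag in the type bitmask. The following user-space sketch shows that pattern only; the flag value TOY_ITER_BVEC, struct toy_bvec, and struct toy_tagged_iter are all made up for illustration and do not reproduce the kernel's definitions.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical flag value and segment type, for illustration only. */
#define TOY_ITER_BVEC 0x4

struct toy_bvec {
	const char *data;
	unsigned int len;
};

struct toy_tagged_iter {
	int type;		/* bitmask; TOY_ITER_BVEC selects bvec */
	unsigned long nr_segs;
	union {
		const struct iovec *iov;
		const struct toy_bvec *bvec;
	};
};

/* Total length of all segments, whichever backing array is in use. */
size_t toy_iter_length(const struct toy_tagged_iter *i)
{
	size_t total = 0;
	unsigned long seg;

	for (seg = 0; seg < i->nr_segs; seg++) {
		if (i->type & TOY_ITER_BVEC)
			total += i->bvec[seg].len;
		else
			total += i->iov[seg].iov_len;
	}
	return total;
}
```

Callers of a helper like toy_iter_length() never need to know which kind of array sits behind the iterator, which is the property that lets the calls described above work unchanged on both kinds.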

Coming in 3.19

The 3.19 kernel is likely to see a substantial rewrite of the iov_iter code aimed at reducing the vast amount of boilerplate code needed to implement all of the above-mentioned functions. The code is indeed shorter afterward, but at the cost of introducing a fair amount of mildly frightening preprocessor macro magic to generate the needed boilerplate on demand.

The iov_iter code already works if the "user-space" buffer is actually located in kernel space. In 3.19, things will be formalized and optimized a bit. Such an iterator will be created with:

    void iov_iter_kvec(struct iov_iter *i, int direction,
		       const struct kvec *iov, unsigned long nr_segs,
		       size_t count);

There will also be a new kvec field added to the union shown above for this case.

Finally, some functions have been added to help with the networking case; it will be possible, for example, to copy a buffer and generate a checksum in the process.

The end result is that the iov_iter interface is slowly becoming the standard way of hiding many of the complexities associated with handling user-space buffers. We can expect to see its use encouraged in more places in the future. It only took seven years or so, but iov_iter appears to be reaching a point of being an interface that most kernel developers will want to be aware of.
