File reads are page-based: the kernel always transfers a few whole data pages at a time, so reading mainly amounts to locating where the data lives. When a process issues a read() system call for bytes that are not yet in RAM, the kernel allocates a new page frame, fills it with the appropriate portion of the file, adds the page to the page cache, and finally copies the requested bytes into the process address space (in effect, data is copied from a whole cached page into the process address space).
mm/filemap.c: do_generic_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc)
Function analysis:
static void do_generic_file_read(struct file *filp, loff_t *ppos,
read_descriptor_t *desc)
{
/* 1. Get the address_space object of the file being read; its address is stored in filp->f_mapping. */
struct address_space *mapping = filp->f_mapping;
/* 2. Get the owner of the address space, i.e. the inode object that owns the pages caching the file's data; its address is stored in the host field of the address_space object. */
struct inode *inode = mapping->host;
struct file_ra_state *ra = &filp->f_ra;
pgoff_t index;
pgoff_t last_index;
pgoff_t prev_index;
unsigned long offset; /* offset into pagecache page */
unsigned int prev_offset;
int error;
/* 3. Treat the file as subdivided into data pages and derive from the file pointer *ppos the logical number of the page holding the first requested byte, i.e. the page index within the address space; store it in the local variable index. */
index = *ppos >> PAGE_CACHE_SHIFT;
prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
/* Store the offset of the first requested byte within its page in the local variable offset. */
offset = *ppos & ~PAGE_CACHE_MASK;
/* 4. Start a loop to read in all pages containing the requested bytes; the number of bytes to read is stored in the count field of the read_descriptor_t descriptor desc. In each iteration the function transfers one page of data by performing the steps below. */
for (;;) {
struct page *page;
pgoff_t end_index;
loff_t isize;
unsigned long nr, ret;
/* a. Call cond_resched() to check the TIF_NEED_RESCHED flag of the current process; if it is set, schedule() is invoked to yield the processor. */
cond_resched();
find_page:
/* b. Call find_get_page(), passing the address_space pointer and the index as parameters; it searches the page cache for the descriptor of the page containing the requested data. */
page = find_get_page(mapping, index);
/* c. If find_get_page() returns a NULL pointer, the requested page is not in the page cache (a cache miss); in that case the following steps are performed: */
if (!page) {
/* (1) Force a synchronous readahead operation. */
page_cache_sync_readahead(mapping,
ra, filp,
index, last_index - index);
/* (2) Call find_get_page() again to search the page cache for the descriptor of the page containing the requested data. */
page = find_get_page(mapping, index);
/* (3) If the page descriptor is still not found in the page cache, jump to the label no_cached_page. */
if (unlikely(page == NULL))
goto no_cached_page;
}
/* d. If the page is marked as a readahead trigger, call page_cache_async_readahead() to start asynchronous readahead of the following pages. */
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
index, last_index - index);
}
/* e. At this point the page is in the page cache; check whether its data is up to date, and if not jump to page_not_up_to_date. */
if (!PageUptodate(page)) {
/* f. The data in the page is stale and must be read from disk; the function tries to get exclusive access to the page with trylock_page(). */
if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
!mapping->a_ops->is_partially_uptodate)
goto page_not_up_to_date;
if (!trylock_page(page))
goto page_not_up_to_date;
/* Did it get truncated before we got the lock? */
if (!page->mapping)
goto page_not_up_to_date_locked;
if (!mapping->a_ops->is_partially_uptodate(page,
desc, offset))
goto page_not_up_to_date_locked;
unlock_page(page);
}
page_ok:
/*
* i_size must be checked after we know the page is Uptodate.
*
* Checking i_size after the check allows us to calculate
* the correct value for "nr", which means the zero-filled
* part of the page is not copied back to userspace (unless
* another truncate extends the file - this is desired though).
*/
isize = i_size_read(inode);
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
/* g. If index exceeds the number of pages in the file (obtained by dividing the i_size field of the inode object by 4096), decrease the page's reference counter and break out of the loop to out. This happens when the file being read by this process is concurrently being truncated by another process. */
if (unlikely(!isize || index > end_index)) {
page_cache_release(page);
goto out;
}
/* nr is the maximum number of bytes to copy from this page */
/* h. Store in the local variable nr the number of bytes in the page to be copied into the user-mode buffer; this equals the page size unless offset is non-zero (which happens only for the first or last page of the request) or the requested data does not fall entirely within the file. */
nr = PAGE_CACHE_SIZE;
if (index == end_index) {
nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
if (nr <= offset) {
page_cache_release(page);
goto out;
}
}
nr = nr - offset;
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
/*
* When a sequential read accesses a page several times,
* only mark it as accessed the first time.
*/
/* i. Call mark_page_accessed() to set the PG_referenced or PG_active flag, indicating that the page is being accessed and should not be swapped out. If the same page is read several times across successive executions of do_generic_file_read(), this step is performed only on the first read. */
if (prev_index != index || offset != prev_offset)
mark_page_accessed(page);
prev_index = index;
/*
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
*
* The file_read_actor routine returns how many bytes were
* actually used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
/**************************************************
 * j. The page is present and its data is up to date, so it is time to copy the data into the user buffer; for this, file_read_actor() is called, which performs the following steps:
 * (1) Call kmap() to set up a permanent kernel mapping for the page if it lies in high memory.
 * (2) Call __copy_to_user() to copy the page data into the user-mode address space; this operation can block the process on page faults while accessing user-mode addresses.
 * (3) Call kunmap() to release any permanent kernel mapping of the page.
 * (4) Update the count, written, and buf fields of the read_descriptor_t descriptor.
*
* int file_read_actor(read_descriptor_t *desc, struct page *page,
* unsigned long offset, unsigned long size)
* {
* char *kaddr;
* unsigned long left, count = desc->count;
*
* if (size > count)
* size = count;
*
* if (!fault_in_pages_writeable(desc->arg.buf, size)) {
* kaddr = kmap_atomic(page);
* left = __copy_to_user_inatomic(desc->arg.buf,
* kaddr + offset, size);
* kunmap_atomic(kaddr);
* if (left == 0)
* goto success;
* }
*
* kaddr = kmap(page);
* left = __copy_to_user(desc->arg.buf, kaddr + offset, size);
* kunmap(page);
* if (left) {
* size -= left;
* desc->error = -EFAULT;
* }
* success:
* desc->count = count - size;
* desc->written += size;
* desc->arg.buf += size;
* return size;
* }
*********************************************/
ret = file_read_actor(desc, page, offset, nr);
/* k. Update the local variables index and offset according to the number of bytes actually transferred into the user-mode buffer. Normally, if the last byte of the page has been copied, index is incremented and offset is cleared; otherwise index is unchanged and offset is set to the number of bytes already copied into the user-mode buffer. */
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
prev_offset = offset;
/* l. Decrease the page descriptor's reference counter; if the count field of the read_descriptor_t descriptor is not zero, there is more file data to read, so jump back to the top of the loop to fetch the next page; otherwise leave the loop. */
page_cache_release(page);
if (ret == nr && desc->count)
continue;
goto out;
page_not_up_to_date:
/* Get exclusive access to the page ... */
error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
/* m. The page is now locked by this process. However, another process may have removed it from the page cache before the previous step, so check whether the mapping field of the page descriptor is NULL; in that case call unlock_page() to unlock the page, decrease its reference counter, and go back to the top of the loop to read another page. */
page_not_up_to_date_locked:
/* Did it get truncated before we got the lock? */
if (!page->mapping) {
unlock_page(page);
page_cache_release(page);
continue;
}
/* Did somebody else fill it already? */
if (PageUptodate(page)) {
unlock_page(page);
goto page_ok;
}
readpage:
/*
* A previous I/O error may have been due to temporary
* failures, eg. multipath errors.
* PG_error will be set again if readpage fails.
*/
ClearPageError(page);
/* Start the actual read. The read will unlock the page. */
/* n. The actual I/O can now start: invoke the readpage method of the file's address_space object; the corresponding function triggers the I/O data transfer from disk to the page. */
error = mapping->a_ops->readpage(filp, page);
if (unlikely(error)) {
if (error == AOP_TRUNCATED_PAGE) {
page_cache_release(page);
goto find_page;
}
goto readpage_error;
}
/* o. If the PG_uptodate flag is still not set, wait via lock_page_killable() until the page has been read in: the page was locked for the disk read and is unlocked once the read completes, so the current process sleeps until the I/O data transfer finishes. */
if (!PageUptodate(page)) {
error = lock_page_killable(page);
if (unlikely(error))
goto readpage_error;
if (!PageUptodate(page)) {
if (page->mapping == NULL) {
/*
* invalidate_mapping_pages got it
*/
unlock_page(page);
page_cache_release(page);
goto find_page;
}
unlock_page(page);
shrink_readahead_size_eio(filp, ra);
error = -EIO;
goto readpage_error;
}
unlock_page(page);
}
goto page_ok;
readpage_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
page_cache_release(page);
goto out;
/* p. If the requested page is not in the page cache, the following steps are performed: */
no_cached_page:
/*
* Ok, it wasn't cached, so we need to create a new
* page..
*/
/* (1) Allocate a new page and insert its descriptor into the page cache. */
page = page_cache_alloc_cold(mapping);
if (!page) {
desc->error = -ENOMEM;
goto out;
}
/* Insert the new page descriptor into the LRU list as well. */
error = add_to_page_cache_lru(page, mapping,
index, GFP_KERNEL);
if (error) {
page_cache_release(page);
if (error == -EEXIST)
goto find_page;
desc->error = error;
goto out;
}
goto readpage;
}
/* 5. All requested (or all readable) data has been read. The function updates the readahead data structure filp->f_ra to record that the data has been read sequentially from the file. */
out:
ra->prev_pos = prev_index;
ra->prev_pos <<= PAGE_CACHE_SHIFT;
ra->prev_pos |= prev_offset;
/* Assign (index << PAGE_CACHE_SHIFT) + offset to *ppos, saving the position for later sequential-access calls to read() and write(). */
*ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
/* Call file_accessed() to store the current time in the i_atime field of the file's inode object and mark the inode as dirty, then return. */
file_accessed(filp);
}
The write() system call involves moving data from the user-mode address space of the calling process into kernel data structures, and from there to disk. Every write method is essentially a procedure that identifies the disk blocks involved in the write, copies the data from the user-mode address space into some pages of the page cache, and then marks the buffers in those pages as dirty.
/**
* __generic_file_aio_write - write data to a file
* @iocb: IO state structure (file, offset, etc.)
* @iov: vector with data to write
* @nr_segs: number of segments in the vector
* @ppos: position where to write
*
* This function does all the work needed for actually writing data to a
* file. It does all basic checks, removes SUID from the file, updates
* modification times and calls proper subroutines depending on whether we
* do direct IO or a standard buffered write.
*
* It expects i_mutex to be grabbed unless we work on a block device or similar
* object which does not need locking at all.
*
* This function does *not* take care of syncing data in case of O_SYNC write.
* A caller has to handle it. This is mainly due to the fact that we want to
* avoid syncing under i_mutex.
*/
ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)
{
struct file *file = iocb->ki_filp;
struct address_space * mapping = file->f_mapping;
size_t ocount; /* original count */
size_t count; /* after file limit checks */
struct inode *inode = mapping->host;
loff_t pos;
ssize_t written;
ssize_t err;
ocount = 0;
/* Call generic_segment_checks() to verify that the user-mode buffers described by the iovec descriptors are valid; return an error if any parameter is invalid. */
err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
if (err)
return err;
count = ocount;
pos = *ppos;
/* Set current->backing_dev_info to the backing_dev_info of file->f_mapping. This allows the current process to write back the dirty pages owned by file->f_mapping even when the corresponding request queue is congested. */
/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
written = 0;
/* If the O_APPEND flag in file->f_flags is set and the file is a regular file, set *ppos to the end of the file so that all new data is appended to it. */
/* Perform several checks on the file size. */
err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
if (err)
goto out;
if (count == 0)
goto out;
/* If set, clear the suid flag of the file, and also clear the sgid flag if the file is executable. */
err = file_remove_suid(file);
if (err)
goto out;
/* Store the current time in inode->mtime and mark the inode object as dirty. */
err = file_update_time(file);
if (err)
goto out;
/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
if (unlikely(file->f_flags & O_DIRECT)) {
loff_t endbyte;
ssize_t written_buffered;
written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
ppos, count, ocount);
if (written < 0 || written == count)
goto out;
/*
* direct-io write to a hole: fall through to buffered I/O
* for completing the rest of the request.
*/
pos += written;
count -= written;
written_buffered = generic_file_buffered_write(iocb, iov,
nr_segs, pos, ppos, count,
written);
/*
* If generic_file_buffered_write() retuned a synchronous error
* then we want to return the number of bytes which were
* direct-written, or the error code if that was zero. Note
* that this differs from normal direct-io semantics, which
* will return -EFOO even if some bytes were written.
*/
if (written_buffered < 0) {
err = written_buffered;
goto out;
}
/*
* We need to ensure that the page cache pages are written to
* disk and invalidated to preserve the expected O_DIRECT
* semantics.
*/
endbyte = pos + written_buffered - written - 1;
err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
if (err == 0) {
written = written_buffered;
invalidate_mapping_pages(mapping,
pos >> PAGE_CACHE_SHIFT,
endbyte >> PAGE_CACHE_SHIFT);
} else {
/*
* We don't know how much we wrote, so just return
* the number of bytes which were direct-written
*/
}
} else {
written = generic_file_buffered_write(iocb, iov, nr_segs,
pos, ppos, count, written);
}
out:
current->backing_dev_info = NULL;
return written ? written : err;
}
A memory region can be associated with some portion of a regular file in a disk-based filesystem, or with a block device file. This means the kernel translates an access to a byte within a page of the memory region into an operation on the corresponding byte of the file. This is memory mapping, of which there are two kinds:
- Shared: any write to a page of the memory region changes the file on disk; moreover, if a process writes to a page of a shared mapping, the change is visible to every other process that maps the same file.
- Private: used when the process creates the mapping only to read the file, not to write it.
A process can issue an mmap() system call to create a new memory mapping.
Pages of shared memory mappings are always included in the page cache; pages of private memory mappings are included in the page cache as long as they have not been modified. When a process tries to modify a page of a private memory mapping, the kernel duplicates the page frame and replaces the original frame with the copy in the process page table. The original frame still remains in the page cache, but it no longer belongs to the memory mapping, having been replaced by the copy; the copied frame, in turn, is not inserted into the page cache, because its contents no longer reflect the valid data of the file on disk.
In fact, a newly created memory mapping is just a memory region that contains no pages. When the process references an address inside the region, a page fault occurs, and the page-fault handler checks whether the nopage method of the memory region is defined. If nopage is not defined, the region does not map a file on disk; otherwise the mapping is filled in, and this method reads the requested page by accessing the block device.
For reasons of efficiency, page frames are not assigned to a memory mapping right after it is created, but at the last possible moment: only when the process attempts to address one of its pages, thus causing a "page fault" exception.
A process can use the msync() system call to flush the dirty pages belonging to a shared memory mapping to disk.
Because the pages of a non-linear memory mapping are stored in the page cache according to their page index relative to the beginning of the file, rather than their index relative to the beginning of the memory region, non-linear mappings are flushed to disk in the same way as linear ones.
Direct I/O transfers bypass the page cache: in each direct I/O transfer, the kernel programs the disk controller to move data directly between the disk and pages in the user-mode address space of a self-caching application.
The standard defines a set of library functions for asynchronous access. "Asynchronous" essentially means: when a user process calls a library function to read or write a file, the function returns as soon as the operation is enqueued, possibly before the actual I/O data transfer has even begun, so the calling process can continue running while the data is being transferred.
Below is a diagram of the call relationships among the kernel virtual filesystem and the filesystem data structures:
The image is not very clear; for the PDF version, see the link: filesystem structure diagram.