Recently, machines running bcache in production started hitting IO hangs; the affected machines account for roughly 10% of the bcache gray-release fleet. One of the logs looks like this:
<3>[12041.639169@1] INFO: task qierouterproxy:22202 blocked for more than 1200 seconds.
<3>[12041.641004@1] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>[12041.649007@1] qierouterproxy D c09f286c 0 22202 22005 0x00000001
<4>[12041.652663@1] [] (__schedule+0x6a4/0x7e8) from [] (io_schedule+0xa0/0xf8)
<4>[12041.661190@1] [] (io_schedule+0xa0/0xf8) from [] (sleep_on_page+0x8/0x10)
<4>[12041.669606@1] [] (sleep_on_page+0x8/0x10) from [] (__wait_on_bit+0x54/0xbc)
<4>[12041.678200@1] [] (__wait_on_bit+0x54/0xbc) from [] (wait_on_page_bit+0xbc/0xc4)
<4>[12041.687208@1] [] (wait_on_page_bit+0xbc/0xc4) from [] (filemap_fdatawait_range+0x78/0x118)
<4>[12041.697608@1] [] (filemap_fdatawait_range+0x78/0x118) from [] (filemap_write_and_wait_range+0x54/0x78)
<4>[12041.711136@1] [] (filemap_write_and_wait_range+0x54/0x78) from [] (ext4_sync_file+0xb8/0x388)
<4>[12041.718222@1] [] (ext4_sync_file+0xb8/0x388) from [] (vfs_fsync+0x3c/0x4c)
<4>[12041.727098@1] [] (vfs_fsync+0x3c/0x4c) from [] (do_fsync+0x28/0x50)
<4>[12041.734928@1] [] (do_fsync+0x28/0x50) from [] (ret_fast_syscall+0x0/0x30)
The log shows a typical D-state timeout. We had already raised the timeout from the kernel default of 120s to 1200s, yet the warning still fired, so something was clearly wrong. The stack shows that the qierouterproxy process issued an fsync system call, went to sleep in D state on a page's wait queue waiting to be woken up, and never was.
(1) How to reproduce
The stack above came from production. Later we found that writing small files in the bcache device's mount directory was fine, but writing a large file followed by sync or fsync reproduced exactly the same problem. A quick bisection showed that writing 32 pages, i.e. 128KB, in one go triggers it: anything smaller than 128KB is fine, while 128KB or more hangs the system.
The reproduction code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#define LEN 4096*32
#define NUM 1
int main(int argc, char *argv[])
{
	int fd = -1;
	char *buf = NULL;
	int ret = -1;
	int i = 0;

	buf = malloc(LEN);
	if (!buf)
	{
		printf("buf alloc failed!\n");
		return -1;
	}

	fd = open("test", O_RDWR|O_CREAT, 0644);
	if (-1 == fd)
	{
		printf("open err\n");
		free(buf);
		return -1;
	}
	else
	{
		printf("open success!\n");
	}

	/* write NUM chunks of LEN bytes (32 pages = 128KB each), then fsync */
	for (i = 0; i < NUM; i++)
	{
		ret = write(fd, buf, LEN);
		if (ret < 0)
			printf("write err!\n");
	}

	fsync(fd);
	close(fd);
	free(buf);
	return 0;
}
(2) IO path analysis
The production environment is fairly complex. Users' HDDs are usually formatted with NTFS. We create a loop device on top of NTFS, then a linear-mapped LVM device on top of the loop device. On machines with both an SSD and an HDD we also use bcache to accelerate the HDD, with the LVM device as the backing device and the SSD as the cache device, and finally an ext4 filesystem is deployed on the bcache device. In short, the write path is roughly: ext4 filesystem, bcache device, LVM device, loop device, NTFS, HDD. A very long journey.
Here we mainly focus on ext4 and the bcache device. From top to bottom the write path looks roughly as follows.
The ext4 write flow consists of two parts: the VFS write and the bdi writeback.
First half, the VFS write path:
SyS_write
vfs_write
do_sync_write
ext4_file_write
generic_file_aio_write
__generic_file_aio_write
generic_file_buffered_write
generic_perform_write
The key function here is generic_perform_write; its source is as follows:
static ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
{
struct address_space *mapping = file->f_mapping;
const struct address_space_operations *a_ops = mapping->a_ops;
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
/*
* Copies from kernel address space cannot fail (NFSD is a big user).
*/
if (segment_eq(get_fs(), KERNEL_DS))
flags |= AOP_FLAG_UNINTERRUPTIBLE;
do {
struct page *page;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
offset = (pos & (PAGE_CACHE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_count(i));
again:
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
* same page as we're writing to, without it being marked
* up-to-date.
*
* Not only is this an optimisation, but it is also required
* to check that the address is actually valid, when atomic
* usercopies are used, below.
*/
if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
status = -EFAULT;
break;
}
status = a_ops->write_begin(file, mapping, pos, bytes, flags,
&page, &fsdata);
if (unlikely(status))
break;
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
pagefault_disable();
copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
pagefault_enable();
flush_dcache_page(page);
mark_page_accessed(page);
blk_associate_page(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
if (unlikely(status < 0))
break;
copied = status;
cond_resched();
iov_iter_advance(i, copied);
if (unlikely(copied == 0)) {
/*
* If we were unable to copy any data at all, we must
* fall back to a single segment length write.
*
* If we didn't fallback here, we could livelock
* because not all segments in the iov can be copied at
* once without a pagefault.
*/
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_single_seg_count(i));
goto again;
}
pos += copied;
written += copied;
balance_dirty_pages_ratelimited(mapping);
if (fatal_signal_pending(current)) {
status = -EINTR;
break;
}
} while (iov_iter_count(i));
return written ? written : status;
}
This function is the core of the first half of the write path. Its main job is to copy data from the user buffer into the file's page cache. Concretely (a small userspace sketch of this page-by-page copy loop follows the three steps below):
(1) a_ops->write_begin. On ext4 this is ext4_da_write_begin. It checks whether the corresponding page already exists in the file's page cache; if not, it allocates one, inserts it into the inode's address_space radix tree, and adds it to the system's inactive file LRU list. For each page to be written it calls ext4_da_get_block_prep to check whether the physical block backing the file's logical block has been allocated. If not, ext4's delayed allocation policy means no physical block is allocated at this point; it merely reserves quota for the file and records the delayed-allocation range in the inode's extent status tree. A delayed-allocation range is treated as mapped (see the map_bh call inside ext4_da_get_block_prep).
(2) iov_iter_copy_from_user_atomic copies the data from the user buffer into the corresponding page of the page cache.
(3) a_ops->write_end. On ext4 this is ext4_da_write_end. For an append write, i_size_write updates the file's i_size, and mark_inode_dirty prepares the ext4 on-disk inode metadata for writeback.
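To make the chunking concrete, here is a minimal userspace sketch (not kernel code; perform_write_sketch is a made-up name) of how generic_perform_write chops a write into page-sized copies, each bounded by the next page boundary:
/*
 * Userspace sketch of the chunking loop in generic_perform_write:
 * each iteration copies at most up to the next page boundary, which is
 * why a buffered write touches its page cache pages one by one.
 */
#include <stdio.h>

#define PAGE_SIZE 4096UL

static void perform_write_sketch(size_t count, unsigned long long pos)
{
	while (count) {
		size_t offset = pos & (PAGE_SIZE - 1);	/* offset inside the page cache page */
		size_t bytes = PAGE_SIZE - offset;	/* bytes up to the end of this page */
		if (bytes > count)
			bytes = count;
		/* write_begin: find/create the page; copy user data; write_end: dirty it */
		printf("page %llu: copy %zu bytes at offset %zu\n",
		       pos / PAGE_SIZE, bytes, offset);
		pos += bytes;
		count -= bytes;
	}
}

int main(void)
{
	/* a write that starts mid-page and spans three pages */
	perform_write_sketch(2 * PAGE_SIZE + 100, 1000);
	return 0;
}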
Second half: bdi writeback:
kthread
worker_thread
process_one_work
bdi_writeback_workfn
wb_do_writeback
__writeback_inodes_wb
writeback_sb_inodes
__writeback_single_inode
ext4_da_writepages
Here we focus on the implementation of ext4_da_writepages; its source is as follows:
static int ext4_da_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
pgoff_t index;
int range_whole = 0;
handle_t *handle = NULL;
struct mpage_da_data mpd;
struct inode *inode = mapping->host;
int pages_written = 0;
unsigned int max_pages;
int range_cyclic, cycled = 1, io_done = 0;
int needed_blocks, ret = 0;
long desired_nr_to_write, nr_to_writebump = 0;
loff_t range_start = wbc->range_start;
struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
pgoff_t done_index = 0;
pgoff_t end;
struct blk_plug plug;
trace_ext4_da_writepages(inode, wbc);
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
* because that could violate lock ordering on umount
*/
if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
return 0;
/*
* If the filesystem has aborted, it is read-only, so return
* right away instead of dumping stack traces later on that
* will obscure the real source of the problem. We test
* EXT4_MF_FS_ABORTED instead of sb->s_flag's MS_RDONLY because
* the latter could be true if the filesystem is mounted
* read-only, and in that case, ext4_da_writepages should
* *never* be called, so if that ever happens, we would want
* the stack trace.
*/
if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
return -EROFS;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
range_cyclic = wbc->range_cyclic;
if (wbc->range_cyclic) {
index = mapping->writeback_index;
if (index)
cycled = 0;
wbc->range_start = index << PAGE_CACHE_SHIFT;
wbc->range_end = LLONG_MAX;
wbc->range_cyclic = 0;
end = -1;
} else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
}
/*
* This works around two forms of stupidity. The first is in
* the writeback code, which caps the maximum number of pages
* written to be 1024 pages. This is wrong on multiple
* levels; different architectues have a different page size,
* which changes the maximum amount of data which gets
* written. Secondly, 4 megabytes is way too small. XFS
* forces this value to be 16 megabytes by multiplying
* nr_to_write parameter by four, and then relies on its
* allocator to allocate larger extents to make them
* contiguous. Unfortunately this brings us to the second
* stupidity, which is that ext4's mballoc code only allocates
* at most 2048 blocks. So we force contiguous writes up to
* the number of dirty blocks in the inode, or
* sbi->max_writeback_mb_bump whichever is smaller.
*/
max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
if (!range_cyclic && range_whole) {
if (wbc->nr_to_write == LONG_MAX)
desired_nr_to_write = wbc->nr_to_write;
else
desired_nr_to_write = wbc->nr_to_write * 8;
} else
desired_nr_to_write = ext4_num_dirty_pages(inode, index,
max_pages);
if (desired_nr_to_write > max_pages)
desired_nr_to_write = max_pages;
if (wbc->nr_to_write < desired_nr_to_write) {
nr_to_writebump = desired_nr_to_write - wbc->nr_to_write;
wbc->nr_to_write = desired_nr_to_write;
}
retry:
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, index, end);
blk_start_plug(&plug);
while (!ret && wbc->nr_to_write > 0) {
/*
* we insert one extent at a time. So we need
* credit needed for single extent allocation.
* journalled mode is currently not supported
* by delalloc
*/
BUG_ON(ext4_should_journal_data(inode));
needed_blocks = ext4_da_writepages_trans_blocks(inode);
/* start a new transaction*/
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
needed_blocks);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
"%ld pages, ino %lu; err %d", __func__,
wbc->nr_to_write, inode->i_ino, ret);
blk_finish_plug(&plug);
goto out_writepages;
}
/*
* Now call write_cache_pages_da() to find the next
* contiguous region of logical blocks that need
* blocks to be allocated by ext4 and submit them.
*/
ret = write_cache_pages_da(handle, mapping,
wbc, &mpd, &done_index);
/*
* If we have a contiguous extent of pages and we
* haven't done the I/O yet, map the blocks and submit
* them for I/O.
*/
if (!mpd.io_done && mpd.next_page != mpd.first_page) {
mpage_da_map_and_submit(&mpd);
ret = MPAGE_DA_EXTENT_TAIL;
}
trace_ext4_da_write_pages(inode, &mpd);
wbc->nr_to_write -= mpd.pages_written;
ext4_journal_stop(handle);
if ((mpd.retval == -ENOSPC) && sbi->s_journal) {
/* commit the transaction which would
* free blocks released in the transaction
* and try again
*/
jbd2_journal_force_commit_nested(sbi->s_journal);
ret = 0;
} else if (ret == MPAGE_DA_EXTENT_TAIL) {
/*
* Got one extent now try with rest of the pages.
* If mpd.retval is set -EIO, journal is aborted.
* So we don't need to write any more.
*/
pages_written += mpd.pages_written;
ret = mpd.retval;
io_done = 1;
} else if (wbc->nr_to_write)
/*
* There is no more writeout needed
* or we requested for a noblocking writeout
* and we found the device congested
*/
break;
}
blk_finish_plug(&plug);
if (!io_done && !cycled) {
cycled = 1;
index = 0;
wbc->range_start = index << PAGE_CACHE_SHIFT;
wbc->range_end = mapping->writeback_index - 1;
goto retry;
}
/* Update index */
wbc->range_cyclic = range_cyclic;
if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
/*
* set the writeback_index so that range_cyclic
* mode will write it back later
*/
mapping->writeback_index = done_index;
out_writepages:
wbc->nr_to_write -= nr_to_writebump;
wbc->range_start = range_start;
trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
return ret;
}
Before analyzing this function, let's first look at the key structure it uses:
struct mpage_da_data {
struct inode *inode;
sector_t b_blocknr; /* start block number of extent */
size_t b_size; /* size of extent */
unsigned long b_state; /* state of the extent */
unsigned long first_page, next_page; /* extent of pages */
struct writeback_control *wbc;
int io_done;
int pages_written;
int retval;
};
struct mpage_da_data mainly records the extent waiting to be mapped, where:
first_page, next_page: the range to be written back in this round, in units of file page offsets.
b_blocknr: the starting logical block number of the extent to be mapped.
b_size: the size of the extent to be mapped.
b_state: the flags of the bh corresponding to the extent.
io_done: set to 1 once the region [first_page, next_page-1] has been written back.
Now for the main flow of ext4_da_writepages:
The body of the function is a while loop that writes back the region between wbc->range_start and wbc->range_end, handling the region [mpd->first_page, mpd->next_page-1] in each iteration. Where do mpd->first_page and mpd->next_page come from?
For example, suppose wbc->range_start to wbc->range_end covers pages 1..9 of the file, pages 1-4 are dirty, page 5 is clean, and pages 6-9 are dirty. Then ext4_da_writepages needs two loop iterations: the first handles [mpd->first_page = 1, mpd->next_page-1 = 4], the second handles [mpd->first_page = 6, mpd->next_page-1 = 9]. In other words, mpd->first_page and mpd->next_page delimit a maximal run of contiguous dirty pages (a small sketch of this run scanning appears after the numbered steps below).
(1) write_cache_pages_da records the current run of contiguous dirty pages (in mpd->first_page and mpd->next_page). It also calls mpage_add_bh_to_extent to accumulate the delayed or unwritten ranges within that run, recording them in mpd->b_blocknr and mpd->b_size.
(2) mpage_da_map_and_submit performs the block allocation for the extent recorded in mpd. It calls ext4_map_blocks to allocate space for the extent: delayed extents are allocated directly, while unwritten extents are simply converted to written.
(3) mpage_da_submit_io submits the bios belonging to [mpd->first_page, mpd->next_page-1]. After mpage_da_map_and_submit has allocated the blocks, all the bhs are genuinely mapped. Note that ext4_bio_write_page calls io_submit_add_bh to accumulate contiguous bios into an ext4_io_submit; when it encounters a discontiguous bio, it calls ext4_io_submit to submit what has been accumulated so far and starts accumulating again. At the very end, if io_done is not 1, some IO has not been submitted yet, and ext4_io_submit is called once more to submit the remaining bios.
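As promised above, a minimal userspace sketch (illustrative only, not kernel code) of grouping dirty pages into maximal contiguous runs, matching the example of pages 1-4 and 6-9:
#include <stdio.h>

int main(void)
{
	/* 1 = dirty, 0 = clean; mirrors the example above:
	 * pages 1-4 dirty, page 5 clean, pages 6-9 dirty */
	int dirty[10] = {0, 1, 1, 1, 1, 0, 1, 1, 1, 1};
	int n = 10;
	int i = 0;

	while (i < n) {
		while (i < n && !dirty[i])
			i++;			/* skip clean pages */
		if (i >= n)
			break;
		int first_page = i;
		while (i < n && dirty[i])
			i++;			/* extend the contiguous dirty run */
		int next_page = i;		/* one past the last dirty page in the run */
		printf("write back pages [%d, %d]\n", first_page, next_page - 1);
	}
	return 0;
}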
That covers the VFS side of the write. Now the bcache write path:
As mentioned above, ext4_da_writepages ultimately calls ext4_io_submit to submit the IO. ext4_io_submit calls submit_bio, which enters the generic bio submission path. submit_bio calls generic_make_request, which eventually invokes q->make_request_fn, i.e. the bcache device's cached_dev_make_request, and from here we enter bcache's processing. The flow is:
(1) Enter the bcache device's q->make_request_fn, namely cached_dev_make_request. This function allocates a struct search; the struct closure inside it tracks, by reference counting, the completion of the bio within bcache. The bio is bound to the struct search and is also cloned. The reason for cloning is that logical devices such as bcache and LVM may need to modify certain fields of the bio or split it while processing it, yet when all of the bio's requests complete at the bottom of the stack a series of callbacks still has to run with the bio as their argument, so the original bio must be preserved and a cloned bio is submitted instead. Finally request_write is called.
(2) request_write. Our production bcache runs in writearound mode, so the write goes straight to the backing device. bch_generic_make_request is called to submit the bio; what is submitted here is the cloned bio.
(3) bch_generic_make_request mainly handles bio splitting. If the bio is larger than bch_bio_max_sectors, it has to be split into pieces of at most bch_bio_max_sectors.
(4) bch_generic_make_request_hack calls generic_make_request to submit the bio to the LVM device's queue, entering the LVM device's processing.
(3) Bug analysis and localization
From the hung stack, the qierouterproxy process called fsync, was switched out in sleep_on_page, and was never scheduled back.
Let's walk through the fsync system call. do_fsync calls vfs_fsync, vfs_fsync calls vfs_fsync_range, and vfs_fsync_range calls ext4's ext4_sync_file. Note the arguments, vfs_fsync_range(file, 0, LLONG_MAX, datasync): it covers all of the file's data. ext4_sync_file calls filemap_write_and_wait_range; let's look at that function:
int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
int err = 0;
if (mapping->nrpages) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
WB_SYNC_ALL);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = filemap_fdatawait_range(mapping,
lstart, lend);
if (!err)
err = err2;
}
} else {
err = filemap_check_errors(mapping);
}
return err;
}
As the name filemap_write_and_wait_range suggests, it does two things: write and wait. __filemap_fdatawrite_range writes out all the dirty pages of the address_space by calling ext4_da_writepages, and filemap_fdatawait_range then waits, page by page, for all the writeback to complete.
Now look at filemap_fdatawait_range:
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
{
pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
struct pagevec pvec;
int nr_pages;
int ret2, ret = 0;
if (end_byte < start_byte)
goto out;
pagevec_init(&pvec, 0);
while ((index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_WRITEBACK,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
unsigned i;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
/* until radix tree lookup accepts end_index */
if (page->index > end)
continue;
wait_on_page_writeback(page);
if (TestClearPageError(page))
ret = -EIO;
}
pagevec_release(&pvec);
cond_resched();
}
out:
ret2 = filemap_check_errors(mapping);
if (!ret)
ret = ret2;
return ret;
}
This function iterates over the pages of the inode's address_space that are tagged writeback and calls wait_on_page_writeback on each one to wait for its writeback to finish. If a page is still in writeback, the process is switched out.
To debug, add a printk before and after the wait_on_page_writeback call, as shown below:
int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
{
pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
struct pagevec pvec;
int nr_pages;
int ret2, ret = 0;
if (end_byte < start_byte)
goto out;
pagevec_init(&pvec, 0);
while ((index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_WRITEBACK,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
unsigned i;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
/* until radix tree lookup accepts end_index */
if (page->index > end)
continue;
if((page->mapping)&&(!((unsigned long)page->mapping&0x1))&&(page->mapping->host->i_ino == 13))
printk("Before wait page index %ld!\n",page->index);
wait_on_page_writeback(page);
if((page->mapping)&&(!((unsigned long)page->mapping&0x1))&&(page->mapping->host->i_ino == 13))
printk("After wait page index %ld!\n",page->index);
if (TestClearPageError(page))
ret = -EIO;
}
pagevec_release(&pvec);
cond_resched();
}
out:
ret2 = filemap_check_errors(mapping);
if (!ret)
ret = ret2;
return ret;
}
Here 13 is the inode number of the file being written.
Rebuild and flash the kernel, then run the reproduction program test_32k in the bcache mount directory. The system hangs, and the serial console prints:
[ 191.456617] Before wait page index 0!
[ 191.461711] After wait page index 0!
[ 191.461757] Before wait page index 1!
[ 191.464117] After wait page index 1!
[ 191.467853] Before wait page index 2!
[ 191.472090] After wait page index 2!
[ 191.475609] Before wait page index 3!
[ 191.479335] After wait page index 3!
[ 191.483291] Before wait page index 4!
[ 191.487035] After wait page index 4!
[ 191.490799] Before wait page index 5!
[ 191.494462] After wait page index 5!
[ 191.498253] Before wait page index 6!
[ 191.502042] After wait page index 6!
[ 191.505843] Before wait page index 7!
[ 191.509621] After wait page index 7!
[ 191.513391] Before wait page index 8!
[ 191.517184] After wait page index 8!
[ 191.521016] Before wait page index 9!
[ 191.524775] After wait page index 9!
[ 191.528587] Before wait page index 10!
[ 191.532452] After wait page index 10!
[ 191.536670] Before wait page index 11!
[ 191.540250] After wait page index 11!
[ 191.544103] Before wait page index 12!
[ 191.547983] After wait page index 12!
[ 191.551893] Before wait page index 13!
[ 191.555748] After wait page index 13!
[ 191.559674] Before wait page index 31!
The page with index 31 has only a "Before" line and no "After" line. So the fsync system call is stuck on the page with index 31, i.e. the last of the 32 pages written. Not all of pages [0,31] show up in the output because some of them had already completed their IO before fsync was called.
Why did the page with index 31 never finish writeback, leaving the process waiting on it stuck in D state? bio->bi_end_io is responsible for waking up the processes waiting for IO completion on a page and for clearing the page's writeback flag. On ext4, bio->bi_end_io is ext4_end_bio. The code is as follows:
static void ext4_end_bio(struct bio *bio, int error)
{
ext4_io_end_t *io_end = bio->bi_private;
struct inode *inode;
int i;
int blocksize;
sector_t bi_sector = bio->bi_sector;
struct page *page1 = NULL;
int need_printk = 0;
if(bio_has_data(bio)&&!PageSwapBacked(page1=bio->bi_io_vec[0].bv_page)&&(page1->mapping)&&(!((unsigned long)page1->mapping&0x1))&&(page1->mapping->host->i_ino == 13))
need_printk = 1;
BUG_ON(!io_end);
inode = io_end->inode;
blocksize = 1 << inode->i_blkbits;
bio->bi_private = NULL;
bio->bi_end_io = NULL;
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
error = 0;
for (i = 0; i < bio->bi_vcnt; i++) {
struct bio_vec *bvec = &bio->bi_io_vec[i];
struct page *page = bvec->bv_page;
struct buffer_head *bh, *head;
unsigned bio_start = bvec->bv_offset;
unsigned bio_end = bio_start + bvec->bv_len;
unsigned under_io = 0;
unsigned long flags;
if (!page)
continue;
if(need_printk)
printk("page->index %ld start %d end %d!\n",page->index,bio_start,bio_end);
if (error) {
SetPageError(page);
set_bit(AS_EIO, &page->mapping->flags);
}
bh = head = page_buffers(page);
/*
* We check all buffers in the page under BH_Uptodate_Lock
* to avoid races with other end io clearing async_write flags
*/
local_irq_save(flags);
bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
do {
if (bh_offset(bh) < bio_start ||
bh_offset(bh) + blocksize > bio_end) {
if (buffer_async_write(bh))
under_io++;
continue;
}
clear_buffer_async_write(bh);
if (error)
buffer_io_error(bh);
} while ((bh = bh->b_this_page) != head);
bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
local_irq_restore(flags);
if (!under_io){
end_page_writeback(page);
if(need_printk)
printk("inode blocksize %d page index %ld!\n",1<<(inode->i_blkbits),page->index);
}
}
bio_put(bio);
if (error) {
io_end->flag |= EXT4_IO_END_ERROR;
ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
"(offset %llu size %ld starting block %llu)",
inode->i_ino,
(unsigned long long) io_end->offset,
(long) io_end->size,
(unsigned long long)
bi_sector >> (inode->i_blkbits - 9));
}
if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
ext4_free_io_end(io_end);
return;
}
ext4_add_complete_io(io_end);
}
This function walks all the pages of the bio and inspects each page's bhs; based on how the bhs have been written back it decides what to do with the page. If all of a page's bhs have completed writeback, end_page_writeback is called to wake up the processes waiting on the page and clear its writeback flag. The need_printk test and the printk after end_page_writeback were added here to print only the pages of inode 13 that have completed writeback.
Rebuild and flash the kernel again and run the test_32 reproduction program on the bcache device; dmesg shows:
[ 214.340375] inode blocksize 4096 page index 0!
[ 214.340424] inode blocksize 4096 page index 1!
[ 214.343944] inode blocksize 4096 page index 2!
[ 214.348572] inode blocksize 4096 page index 3!
[ 214.353120] inode blocksize 4096 page index 4!
[ 214.357661] inode blocksize 4096 page index 5!
[ 214.362261] inode blocksize 4096 page index 6!
[ 214.366924] inode blocksize 4096 page index 7!
[ 214.371436] inode blocksize 4096 page index 8!
[ 214.375958] inode blocksize 4096 page index 9!
[ 214.380575] inode blocksize 4096 page index 10!
[ 214.385187] inode blocksize 4096 page index 11!
[ 214.390170] inode blocksize 4096 page index 12!
[ 214.394488] inode blocksize 4096 page index 13!
[ 214.399165] inode blocksize 4096 page index 14!
[ 214.403803] inode blocksize 4096 page index 15!
[ 214.408478] inode blocksize 4096 page index 16!
[ 214.413117] inode blocksize 4096 page index 17!
[ 214.417773] inode blocksize 4096 page index 18!
[ 214.422536] inode blocksize 4096 page index 19!
[ 214.427092] inode blocksize 4096 page index 20!
[ 214.431816] inode blocksize 4096 page index 21!
[ 214.436418] inode blocksize 4096 page index 22!
[ 214.441101] inode blocksize 4096 page index 23!
[ 214.445736] inode blocksize 4096 page index 24!
[ 214.450415] inode blocksize 4096 page index 25!
[ 214.455042] inode blocksize 4096 page index 26!
[ 214.459744] inode blocksize 4096 page index 27!
[ 214.464369] inode blocksize 4096 page index 28!
[ 214.469098] inode blocksize 4096 page index 29!
[ 214.473705] inode blocksize 4096 page index 30!
test_32 writes 32 pages, but the output shows that only pages [0,30] completed their IO; the last page, index 31, never did, so the process waiting on it was never woken. This matches the filemap_fdatawait_range output above. The output also shows that ext4's blocksize is 4096, i.e. each page has exactly one bh.
Clearly end_page_writeback was never called for the last page. The loop in ext4_end_bio walks all the bhs of the page (here just one); if a bh falls within the bvec's range, clear_buffer_async_write clears its BH_Async_Write flag. If any bh lies outside the bvec's range and still has BH_Async_Write set, under_io is incremented, meaning the page still has bhs whose writeback has not completed, and end_page_writeback must not be called for the page. BH_Async_Write is set in ext4_bio_write_page, after the extent is mapped and before the accumulated bio is submitted.
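The following userspace sketch (an illustration under the assumptions above, not the kernel function itself) shows the decision ext4_end_bio makes for our case of one 4K bh per page: a bh is only cleared if it lies entirely within [bio_start, bio_end) of the bvec, otherwise under_io stays nonzero and end_page_writeback() is skipped:
#include <stdio.h>

#define BLOCKSIZE 4096u		/* one bh per 4K page, as the logs show */

static void end_one_bvec(unsigned bh_offset, unsigned bio_start, unsigned bio_end)
{
	unsigned under_io = 0;

	if (bh_offset < bio_start || bh_offset + BLOCKSIZE > bio_end)
		under_io++;	/* bh not covered by this bvec: still under async write */
	/* else: clear_buffer_async_write(bh) would run here */

	if (!under_io)
		printf("bvec [%u,%u): end_page_writeback() is called\n", bio_start, bio_end);
	else
		printf("bvec [%u,%u): page stays in writeback forever\n", bio_start, bio_end);
}

int main(void)
{
	end_one_bvec(0, 0, 4096);	/* a normal full-page bvec */
	end_one_bvec(0, 3584, 4096);	/* the truncated last bvec seen in the logs */
	return 0;
}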
Add more prints in ext4_end_bio, this time printing the range covered by each bvec. The code with the added prints (the need_printk test and the printk of page->index, start and end) is as follows:
static void ext4_end_bio(struct bio *bio, int error)
{
ext4_io_end_t *io_end = bio->bi_private;
struct inode *inode;
int i;
int blocksize;
sector_t bi_sector = bio->bi_sector;
struct page *page1 = NULL;
int need_printk = 0;
if(bio_has_data(bio)&&!PageSwapBacked(page1=bio->bi_io_vec[0].bv_page)&&(page1->mapping)&&(!((unsigned long)page1->mapping&0x1))&&(page1->mapping->host->i_ino == 13))
need_printk = 1;
BUG_ON(!io_end);
inode = io_end->inode;
blocksize = 1 << inode->i_blkbits;
bio->bi_private = NULL;
bio->bi_end_io = NULL;
if (test_bit(BIO_UPTODATE, &bio->bi_flags))
error = 0;
for (i = 0; i < bio->bi_vcnt; i++) {
struct bio_vec *bvec = &bio->bi_io_vec[i];
struct page *page = bvec->bv_page;
struct buffer_head *bh, *head;
unsigned bio_start = bvec->bv_offset;
unsigned bio_end = bio_start + bvec->bv_len;
unsigned under_io = 0;
unsigned long flags;
if (!page)
continue;
if(need_printk)
printk("page->index %ld start %d end %d!\n",page->index,bio_start,bio_end);
if (error) {
SetPageError(page);
set_bit(AS_EIO, &page->mapping->flags);
}
bh = head = page_buffers(page);
/*
* We check all buffers in the page under BH_Uptodate_Lock
* to avoid races with other end io clearing async_write flags
*/
local_irq_save(flags);
bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
do {
if (bh_offset(bh) < bio_start ||
bh_offset(bh) + blocksize > bio_end) {
if (buffer_async_write(bh))
under_io++;
continue;
}
clear_buffer_async_write(bh);
if (error)
buffer_io_error(bh);
} while ((bh = bh->b_this_page) != head);
bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
local_irq_restore(flags);
if (!under_io){
end_page_writeback(page);
//if(need_printk)
// printk("inode blocksize %d page index %ld!\n",1<<(inode->i_blkbits),page->index);
}
}
bio_put(bio);
if (error) {
io_end->flag |= EXT4_IO_END_ERROR;
ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
"(offset %llu size %ld starting block %llu)",
inode->i_ino,
(unsigned long long) io_end->offset,
(long) io_end->size,
(unsigned long long)
bi_sector >> (inode->i_blkbits - 9));
}
if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
ext4_free_io_end(io_end);
return;
}
ext4_add_complete_io(io_end);
}
Rebuild and flash the kernel, run ./test_32 in the bcache directory, and dmesg shows:
[ 175.681627] page->index 0 start 0 end 4096!
[ 175.681686] page->index 1 start 0 end 4096!
[ 175.684686] page->index 2 start 0 end 4096!
[ 175.689027] page->index 3 start 0 end 4096!
[ 175.693330] page->index 4 start 0 end 4096!
[ 175.697622] page->index 5 start 0 end 4096!
[ 175.701956] page->index 6 start 0 end 4096!
[ 175.706245] page->index 7 start 0 end 4096!
[ 175.710627] page->index 8 start 0 end 4096!
[ 175.714875] page->index 9 start 0 end 4096!
[ 175.719203] page->index 10 start 0 end 4096!
[ 175.723581] page->index 11 start 0 end 4096!
[ 175.727991] page->index 12 start 0 end 4096!
[ 175.732449] page->index 13 start 0 end 4096!
[ 175.736791] page->index 14 start 0 end 4096!
[ 175.741231] page->index 15 start 0 end 4096!
[ 175.745588] page->index 16 start 0 end 4096!
[ 175.750204] page->index 17 start 0 end 4096!
[ 175.754370] page->index 18 start 0 end 4096!
[ 175.758787] page->index 19 start 0 end 4096!
[ 175.763168] page->index 20 start 0 end 4096!
[ 175.767566] page->index 21 start 0 end 4096!
[ 175.772026] page->index 22 start 0 end 4096!
[ 175.776384] page->index 23 start 0 end 4096!
[ 175.780831] page->index 24 start 0 end 4096!
[ 175.785183] page->index 25 start 0 end 4096!
[ 175.789629] page->index 26 start 0 end 4096!
[ 175.793983] page->index 27 start 0 end 4096!
[ 175.798418] page->index 28 start 0 end 4096!
[ 175.802776] page->index 29 start 0 end 4096!
[ 175.807161] page->index 30 start 0 end 4096!
[ 175.811627] page->index 31 start 3584 end 4096!
Why is the last bvec [3584,4096], only 512 bytes? Shouldn't it be [0,4096] as well? From the earlier analysis of the ext4 write path, writing 32 pages should get contiguous physical blocks on disk, and since the physical space is contiguous a single bio should cover it, so all 32 bvecs should be [0,4096]. A modified bvec made me think of bio splitting: if a bio is split at a point that is not a bvec boundary, the bvec's range has to be re-delimited. In bcache the split function is bch_bio_split; we will come back to it. First, let's confirm whether a bio split actually happens.
To verify this, I printed the bio size in bcache's entry function and in bcache's exit function (which is LVM's entry function), filtering with the following condition:
if(bio_has_data(bio)&&!PageSwapBacked(page1=bio->bi_io_vec[0].bv_page)&&(page1->mapping)&&(!((unsigned long)page1->mapping&0x1))&&(page1->mapping->host->i_ino == 13))
so that only bios issued for the file with inode number 13 are printed, to avoid noise.
bcache's entry function is the bcache device queue's make_request_fn, namely cached_dev_make_request. LVM's entry function is the LVM device's make_request_fn, namely dm_request. The output is:
[ 184.801757] func cached_dev_make_request bio sectors 256!
[ 184.805082] func dm_request bio sectors 255!
[ 184.809459] func dm_request bio sectors 1!
Clearly, a 256-sector bio enters bcache and comes out split into a 255-sector bio and a 1-sector bio.
Let's go through the bio's path inside bcache once more:
The bio enters bcache's make_request_fn, i.e. cached_dev_make_request. Since it is a write bio, it goes to request_write, which calls bch_generic_make_request. The code of bch_generic_make_request is:
void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
{
struct bio_split_hook *s;
if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
goto submit;
if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
goto submit;
s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
s->bio = bio;
s->p = p;
s->bi_end_io = bio->bi_end_io;
s->bi_private = bio->bi_private;
bio_get(bio);
closure_call(&s->cl, __bch_bio_submit_split, NULL, NULL);
return;
submit:
bch_generic_make_request_hack(bio);
}
If the bio is no larger than bch_bio_max_sectors (255 sectors, 512 bytes each), bch_generic_make_request_hack is called directly to submit the bio to the LVM device below. If the bio exceeds 255 sectors, __bch_bio_submit_split is called; it uses bch_bio_split to carve the bio into pieces of at most 255 sectors, and each resulting bio is then submitted to the LVM device via bch_generic_make_request_hack.
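A tiny userspace sketch of this size-based carving (illustrative only; MAX_SECTORS stands in for bch_bio_max_sectors) shows why the observed 256-sector bio becomes a 255-sector piece plus a 1-sector piece:
#include <stdio.h>

#define MAX_SECTORS 255u	/* stands in for bch_bio_max_sectors here */

int main(void)
{
	unsigned sectors = 256;	/* the 128KB bio seen in the logs */
	unsigned offset = 0;

	while (sectors) {
		unsigned chunk = sectors > MAX_SECTORS ? MAX_SECTORS : sectors;
		printf("submit split bio: sectors [%u, %u), %u sectors\n",
		       offset, offset + chunk, chunk);
		offset += chunk;
		sectors -= chunk;
	}
	return 0;
}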
The relationship between these bios: the original bio (256 sectors), the cloned bio (256 sectors), and the cloned bio split into a 255-sector bio and a 1-sector bio. Looking at bch_bio_split, when the cloned bio is split, a new 255-sector bio is allocated and the clone itself is shrunk down to a 1-sector bio.
So in the end there are three bios: the original bio (256 sectors), the newly allocated 255-sector bio created by the split, and the clone, now reduced to 1 sector.
The original bio's completion callback is ext4_end_bio; the newly allocated 255-sector bio's completion callback is bch_bio_submit_split_endio, and so is the 1-sector clone's. The clone's bi_private was closure1 (the search's closure, set in do_bio_hook); after the split, both the 255-sector bio and the clone have their bi_private pointed at closure2 (the split hook's closure). Each time one of the 255-sector bio and the 1-sector clone completes, bch_bio_submit_split_endio runs and calls closure_put(bio->bi_private), decrementing closure2's count; when the count drops to 0 (i.e. both the 255-sector bio and the clone are done), closure2->fn is invoked, namely bch_bio_submit_split_done.
Now look at bch_bio_submit_split_done:
static void bch_bio_submit_split_done(struct closure *cl)
{
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
s->bio->bi_end_io = s->bi_end_io;
s->bio->bi_private = s->bi_private;
bio_endio(s->bio, 0);
closure_debug_destroy(&s->cl);
mempool_free(s, s->p->bio_split_hook);
}
Here s->bio is the 1-sector clone bio. The function restores s->bi_end_io (request_endio) as the clone's completion callback and sets the clone's bi_private back to closure1 (it had been closure2), then calls bio_endio. That runs request_endio, which calls closure_put(closure1), decrementing closure1's count; when that count reaches 0, closure1->fn is invoked, namely cached_dev_write_complete, which ultimately calls the original bio's completion callback ext4_end_bio. Quite a detour.
In one sentence: for a bio handed down to it, bcache never submits that bio itself; it submits a clone, and if the clone exceeds bch_bio_max_sectors it is further split into several pieces. Only after every one of those pieces has completed its IO does bcache's callback machinery guarantee that the completion callback of the last piece ends up invoking the original bio's callback, in this case ext4_end_bio.
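As a simplified userspace analogue of this refcount-and-callback chain (not the real bcache closure API; the names here are illustrative), the last closure_put on the split closure fires the split-done callback, after which dropping the last reference on the search closure finally runs the original bio's completion:
#include <stdio.h>

struct closure {
	int remaining;
	void (*fn)(struct closure *);
};

static void closure_put(struct closure *cl)
{
	if (--cl->remaining == 0 && cl->fn)
		cl->fn(cl);
}

static void search_done(struct closure *cl)
{
	(void)cl;
	printf("closure1 drops to 0: original bio's ext4_end_bio-style callback runs\n");
}

static void split_done(struct closure *cl)
{
	(void)cl;
	printf("closure2 drops to 0: all split pieces done, hand completion back up\n");
}

int main(void)
{
	struct closure closure1 = { .remaining = 1, .fn = search_done };
	struct closure closure2 = { .remaining = 2, .fn = split_done };

	closure_put(&closure2);		/* the 255-sector bio completes */
	closure_put(&closure2);		/* the 1-sector clone completes -> split_done */
	closure_put(&closure1);		/* request_endio-style put on the search closure */
	return 0;
}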
Back to the root cause of the IO hang: why was the process waiting on the last page never woken and the page's writeback flag never cleared? Look at how the original bio is cloned, in search_alloc:
static struct search *search_alloc(struct bio *bio, struct bcache_device *d)
{
struct bio_vec *bv;
struct search *s = mempool_alloc(d->c->search, GFP_NOIO);
memset(s, 0, offsetof(struct search, op.keys));
__closure_init(&s->cl, NULL);
s->op.inode = d->id;
s->op.c = d->c;
s->d = d;
s->op.lock = -1;
s->task = current;
s->orig_bio = bio;
s->write = (bio->bi_rw & REQ_WRITE) != 0;
s->op.flush_journal = (bio->bi_rw & (REQ_FLUSH|REQ_FUA)) != 0;
s->op.skip = (bio->bi_rw & REQ_DISCARD) != 0;
s->recoverable = 1;
s->start_time = jiffies;
do_bio_hook(s);
if (bio->bi_size != bio_segments(bio) * PAGE_SIZE) {
bv = mempool_alloc(d->unaligned_bvec, GFP_NOIO);
memcpy(bv, bio_iovec(bio),
sizeof(struct bio_vec) * bio_segments(bio));
s->bio.bio.bi_io_vec = bv;
s->unaligned_bvec = 1;
}
return s;
}
The bio parameter is the original bio and is saved in s->orig_bio. Inside do_bio_hook, the clone bio is s->bio.bio, a complete copy of the original bio. do_bio_hook looks like this:
static void do_bio_hook(struct search *s)
{
struct bio *bio = &s->bio.bio;
memcpy(bio, s->orig_bio, sizeof(struct bio));
bio->bi_end_io = request_endio;
bio->bi_private = &s->cl;
atomic_set(&bio->bi_cnt, 3);
}
As the memcpy in do_bio_hook shows, the clone bio is a full structure copy of the original bio, so at this point the original bio and the clone share the same bvec array. This is the root of the problem.
Back in search_alloc, if not every bvec of the original bio is a full page (bio->bi_size != bio_segments(bio) * PAGE_SIZE), a new bvec array is allocated for the clone and the original bio's bvec contents are copied into it. In our case the original bio is 256 sectors and every bvec is a full page, so no new array is allocated and the clone keeps sharing the original bio's bvec array.
The problem is that when the clone bio is split, the 256 sectors are carved at a 255-sector boundary into a 255-sector bio and a 1-sector bio, the latter being the clone itself. During the split the bvec range is adjusted: the clone's last bvec changes from {bv_len = 4096, bv_offset = 0} to {bv_len = 512, bv_offset = 3584}. Because the clone and the original bio share the bvec array, the last bvec of the original bio is modified as well. Finally, once all the IO has completed and the original bio's completion callback runs, it skips end_page_writeback() for the last page, and the process waiting on that page hangs forever.
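The aliasing can be demonstrated with a few lines of userspace C (a sketch of the mechanism, not the kernel structures): a struct copy of the bio duplicates only the pointer to the bvec array, so shrinking the clone's last bvec also changes what the original bio sees:
#include <stdio.h>

struct bvec { unsigned bv_offset, bv_len; };

struct bio_sketch {
	struct bvec *bi_io_vec;		/* a pointer, not a private copy */
	int bi_vcnt;
};

int main(void)
{
	struct bvec vecs[32];
	int i;

	for (i = 0; i < 32; i++) {
		vecs[i].bv_offset = 0;
		vecs[i].bv_len = 4096;
	}

	struct bio_sketch orig = { vecs, 32 };
	struct bio_sketch clone = orig;		/* struct copy, like the memcpy in do_bio_hook */

	/* the split shrinks the clone's last bvec to cover only the final sector */
	clone.bi_io_vec[31].bv_offset = 3584;
	clone.bi_io_vec[31].bv_len = 512;

	/* ...and the original bio's last bvec has been modified as well */
	printf("orig last bvec: offset %u len %u\n",
	       orig.bi_io_vec[31].bv_offset, orig.bi_io_vec[31].bv_len);
	return 0;
}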
(4) Fix and conclusion
Now that we know the root cause, namely that the original bio and the clone bio share the bvec array, search_alloc can be changed as follows:
static struct search *search_alloc(struct bio *bio, struct bcache_device *d)
{
struct bio_vec *bv;
struct search *s = mempool_alloc(d->c->search, GFP_NOIO);
memset(s, 0, offsetof(struct search, op.keys));
__closure_init(&s->cl, NULL);
s->op.inode = d->id;
s->op.c = d->c;
s->d = d;
s->op.lock = -1;
s->task = current;
s->orig_bio = bio;
s->write = (bio->bi_rw & REQ_WRITE) != 0;
s->op.flush_journal = (bio->bi_rw & (REQ_FLUSH|REQ_FUA)) != 0;
s->op.skip = (bio->bi_rw & REQ_DISCARD) != 0;
s->recoverable = 1;
s->start_time = jiffies;
do_bio_hook(s);
// if (bio->bi_size != bio_segments(bio) * PAGE_SIZE) {
bv = mempool_alloc(d->unaligned_bvec, GFP_NOIO);
memcpy(bv, bio_iovec(bio),
sizeof(struct bio_vec) * bio_segments(bio));
s->bio.bio.bi_io_vec = bv;
s->unaligned_bvec = 1;
//}
return s;
}
Comment out the alignment check (the if and its closing brace) so that, aligned or not, a separate bvec array is always allocated for the clone bio. This guarantees the clone never shares a bvec array with the original bio, so the original cannot be modified behind its back.
For logical devices such as bcache and LVM, which may redirect or split bios, the original bio is never submitted directly, precisely so that its fields are not modified; once the whole IO range covered by the original bio has completed, the original bio's completion callback is invoked through the callback chain. Throughout this processing the original bio must not be modified.