Tracking down an I/O hang caused by bcache writes

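  Recently, machines running bcache in production started reporting I/O hangs. Roughly 10% of the machines in the bcache gray-release (canary) group were affected. One of the logs looks like this: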

<3>[12041.639169@1] INFO: task qierouterproxy:22202 blocked for more than 1200 seconds.
<3>[12041.641004@1] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] [aml_sdhc_data_thread] SDHC_ESTA=0x0
<4>[ 61.976705@1] 
<6>[12041.649007@1] qierouterproxy D c09f286c 0 22202 22005 0x00000001
<4>[12041.652663@1] [] (__schedule+0x6a4/0x7e8) from [] (io_schedule+0xa0/0xf8)
<4>[12041.661190@1] [] (io_schedule+0xa0/0xf8) from [] (sleep_on_page+0x8/0x10)
<4>[12041.669606@1] [] (sleep_on_page+0x8/0x10) from [] (__wait_on_bit+0x54/0xbc)
<4>[12041.678200@1] [] (__wait_on_bit+0x54/0xbc) from [] (wait_on_page_bit+0xbc/0xc4)
<4>[12041.687208@1] [] (wait_on_page_bit+0xbc/0xc4) from [] (filemap_fdatawait_range+0x78/0x118)
<4>[12041.697608@1] [] (filemap_fdatawait_range+0x78/0x118) from [] (filemap_write_and_wait_range+0x54/0x78)
<4>[12041.711136@1] [] (filemap_write_and_wait_range+0x54/0x78) from [] (ext4_sync_file+0xb8/0x388)
<4>[12041.718222@1] [] (ext4_sync_file+0xb8/0x388) from [] (vfs_fsync+0x3c/0x4c)
<4>[12041.727098@1] [] (vfs_fsync+0x3c/0x4c) from [] (do_fsync+0x28/0x50)
<4>[12041.734928@1] [] (do_fsync+0x28/0x50) from [] (ret_fast_syscall+0x0/0x30)

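  The log shows a classic D-state (uninterruptible sleep) timeout. We had already raised the hung-task timeout from the kernel default of 120 s to 1200 s, and the stack dump still fired, so something is clearly wrong. The stack shows that the qierouterproxy process called the fsync system call and went to sleep on a page's wait queue, waiting for a wakeup that never came.

(I) Reproducing the problem

  The stack above was captured in production. Later testing showed that writing small files in a directory on the bcache device is fine, while writing a large file followed by sync or fsync triggers the same problem. A quick bisection on the write size showed that a single write of 32 pages (128 KiB) reproduces it: anything smaller than 128 KiB is fine, while 128 KiB or more hangs the system.

  The reproduction code is as follows: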

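#include <stdio.h>      /* NOTE: the header names were lost in the source; */
#include <stdlib.h>     /* these seven are the ones the code below needs.  */
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>


#define LEN (4096 * 32)   /* 32 pages = 128 KiB */
#define NUM 1

int main(int argc, char *argv[])
{
    int fd = -1;
    char *buf = NULL;
    int ret = -1;
    int i = 0;

    buf = malloc(LEN);
    if (!buf)
    {
        printf("buf alloc failed!\n");
        return -1;
    }

    /* O_CREAT requires a mode argument */
    fd = open("test", O_RDWR|O_CREAT, 0644);
    if (-1 == fd)
    {
        printf("open err\n");
        free(buf);
        return -1;
    }
    else
    {
        printf("open success!\n");
    }

    /*
     * The original listing is cut off right after "for(i = 0;i".
     * Everything from here on is a reconstruction based on the
     * description above (write LEN bytes NUM times, then fsync).
     */
    memset(buf, 'a', LEN);
    for (i = 0; i < NUM; i++)
    {
        ret = write(fd, buf, LEN);
        if (ret != LEN)
        {
            printf("write err\n");
            break;
        }
    }

    fsync(fd);
    close(fd);
    free(buf);
    return 0;
}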

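(II) I/O path analysis

  The production environment is fairly complex. The user's HDD is usually formatted with NTFS. We create a loop device on top of NTFS and then build an LVM device with a linear mapping on top of the loop device. On machines with both an SSD and an HDD, bcache is used to accelerate the HDD, with the LVM device as the bcache backing device and the SSD as the cache device. Finally, an ext4 file system is deployed on the bcache device. In short, a write travels through ext4, the bcache device, the LVM device, the loop device, NTFS and finally the HDD, which is a very long path.

  Here we focus on ext4 and the bcache device. From top to bottom, the write flow is roughly as follows.

  The ext4 write flow has two parts: the VFS write and the bdi writeback.

  The first half, the VFS write, follows this call chain: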

SyS_write
   vfs_write
      do_sync_write
         ext4_file_write       
            generic_file_aio_write
              __generic_file_aio_write
                   generic_file_buffered_write
                      generic_perform_write

  The function to focus on is generic_perform_write. Its source is:

static ssize_t generic_perform_write(struct file *file,
				struct iov_iter *i, loff_t pos)
{
	struct address_space *mapping = file->f_mapping;
	const struct address_space_operations *a_ops = mapping->a_ops;
	long status = 0;
	ssize_t written = 0;
	unsigned int flags = 0;

	/*
	 * Copies from kernel address space cannot fail (NFSD is a big user).
	 */
	if (segment_eq(get_fs(), KERNEL_DS))
		flags |= AOP_FLAG_UNINTERRUPTIBLE;

	do {
		struct page *page;
		unsigned long offset;	/* Offset into pagecache page */
		unsigned long bytes;	/* Bytes to write to page */
		size_t copied;		/* Bytes copied from user */
		void *fsdata;

		offset = (pos & (PAGE_CACHE_SIZE - 1));
		bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
						iov_iter_count(i));

again:
		/*
		 * Bring in the user page that we will copy from _first_.
		 * Otherwise there's a nasty deadlock on copying from the
		 * same page as we're writing to, without it being marked
		 * up-to-date.
		 *
		 * Not only is this an optimisation, but it is also required
		 * to check that the address is actually valid, when atomic
		 * usercopies are used, below.
		 */
		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
			status = -EFAULT;
			break;
		}

		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
						&page, &fsdata);
		if (unlikely(status))
			break;

		if (mapping_writably_mapped(mapping))
			flush_dcache_page(page);

		pagefault_disable();
		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		pagefault_enable();
		flush_dcache_page(page);
		mark_page_accessed(page);
		blk_associate_page(page);
		status = a_ops->write_end(file, mapping, pos, bytes, copied,
						page, fsdata);
		if (unlikely(status < 0))
			break;
		copied = status;

		cond_resched();

		iov_iter_advance(i, copied);
		if (unlikely(copied == 0)) {
			/*
			 * If we were unable to copy any data at all, we must
			 * fall back to a single segment length write.
			 *
			 * If we didn't fallback here, we could livelock
			 * because not all segments in the iov can be copied at
			 * once without a pagefault.
			 */
			bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
						iov_iter_single_seg_count(i));
			goto again;
		}
		pos += copied;
		written += copied;

		balance_dirty_pages_ratelimited(mapping);
		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}
	} while (iov_iter_count(i));

	return written ? written : status;
}

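  This is the core function of the first half of the write path. Its main job is to copy data from the user buffer into the file's page cache. In detail:

  (1) a_ops->write_begin. On ext4 this is ext4_da_write_begin. It checks whether the target page already exists in the file's page cache; if not, it allocates one, inserts it into the inode's address_space radix tree, and adds it to the system's inactive file LRU list. For each page to be written, it calls ext4_da_get_block_prep to determine whether the logical block behind the page already has a physical disk block. If not, because ext4 uses delayed allocation (delalloc), no physical block is allocated at this point; ext4 only reserves the corresponding quota for the file and records the delayed-allocation range in the inode's extent status tree. A delayed-allocation range is treated as mapped (see the map_bh call in ext4_da_get_block_prep).

  (2) iov_iter_copy_from_user_atomic copies the data from the user buffer into the corresponding page of the page cache.

  (3) a_ops->write_end. On ext4 this is ext4_da_write_end. For an appending write, i_size_write updates the file's i_size, and mark_inode_dirty prepares the on-disk ext4 inode metadata for writeback.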
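  To make the per-page stepping of generic_perform_write concrete, here is a minimal userspace sketch of its offset/bytes arithmetic. The name sketch_count_copy_steps and the fixed 4096-byte page size are assumptions made for illustration; nothing here is kernel code.

```c
/* Assumed page size, matching the 4 KiB pages used throughout this article. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Count how many page-sized copy steps a buffered write of `count` bytes
 * starting at file offset `pos` needs, using the same offset/bytes
 * arithmetic as the loop in generic_perform_write().
 */
unsigned long sketch_count_copy_steps(unsigned long pos, unsigned long count)
{
    unsigned long steps = 0;

    while (count > 0) {
        /* Offset inside the current page cache page. */
        unsigned long offset = pos & (SKETCH_PAGE_SIZE - 1);
        /* Bytes that still fit in this page, capped by what is left. */
        unsigned long bytes = SKETCH_PAGE_SIZE - offset;

        if (bytes > count)
            bytes = count;

        pos += bytes;
        count -= bytes;
        steps++;
    }
    return steps;
}
```

  With pos = 0 and count = 4096 * 32, as in the reproduction program, the loop runs 32 times, one full page per step. An unaligned start such as pos = 100 makes the first step shorter and adds one extra step at the end.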

  The second half, the bdi writeback:

kthread
   worker_thread
      process_one_work
          bdi_writeback_workfn
             wb_do_writeback
                __writeback_inodes_wb
                      writeback_sb_inodes
                          writeback_sb_inodes 
                              __writeback_single_inode
                                   ext4_da_writepages  

  Here the function to focus on is ext4_da_writepages. Its source is:

static int ext4_da_writepages(struct address_space *mapping,
			      struct writeback_control *wbc)
{
	pgoff_t	index;
	int range_whole = 0;
	handle_t *handle = NULL;
	struct mpage_da_data mpd;
	struct inode *inode = mapping->host;
	int pages_written = 0;
	unsigned int max_pages;
	int range_cyclic, cycled = 1, io_done = 0;
	int needed_blocks, ret = 0;
	long desired_nr_to_write, nr_to_writebump = 0;
	loff_t range_start = wbc->range_start;
	struct ext4_sb_info *sbi = EXT4_SB(mapping->host->i_sb);
	pgoff_t done_index = 0;
	pgoff_t end;
	struct blk_plug plug;

	trace_ext4_da_writepages(inode, wbc);

	/*
	 * No pages to write? This is mainly a kludge to avoid starting
	 * a transaction for special inodes like journal inode on last iput()
	 * because that could violate lock ordering on umount
	 */
	if (!mapping->nrpages || !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
		return 0;

	/*
	 * If the filesystem has aborted, it is read-only, so return
	 * right away instead of dumping stack traces later on that
	 * will obscure the real source of the problem.  We test
	 * EXT4_MF_FS_ABORTED instead of sb->s_flag's MS_RDONLY because
	 * the latter could be true if the filesystem is mounted
	 * read-only, and in that case, ext4_da_writepages should
	 * *never* be called, so if that ever happens, we would want
	 * the stack trace.
	 */
	if (unlikely(sbi->s_mount_flags & EXT4_MF_FS_ABORTED))
		return -EROFS;

	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
		range_whole = 1;

	range_cyclic = wbc->range_cyclic;
	if (wbc->range_cyclic) {
		index = mapping->writeback_index;
		if (index)
			cycled = 0;
		wbc->range_start = index << PAGE_CACHE_SHIFT;
		wbc->range_end  = LLONG_MAX;
		wbc->range_cyclic = 0;
		end = -1;
	} else {
		index = wbc->range_start >> PAGE_CACHE_SHIFT;
		end = wbc->range_end >> PAGE_CACHE_SHIFT;
	}

	/*
	 * This works around two forms of stupidity.  The first is in
	 * the writeback code, which caps the maximum number of pages
	 * written to be 1024 pages.  This is wrong on multiple
	 * levels; different architectues have a different page size,
	 * which changes the maximum amount of data which gets
	 * written.  Secondly, 4 megabytes is way too small.  XFS
	 * forces this value to be 16 megabytes by multiplying
	 * nr_to_write parameter by four, and then relies on its
	 * allocator to allocate larger extents to make them
	 * contiguous.  Unfortunately this brings us to the second
	 * stupidity, which is that ext4's mballoc code only allocates
	 * at most 2048 blocks.  So we force contiguous writes up to
	 * the number of dirty blocks in the inode, or
	 * sbi->max_writeback_mb_bump whichever is smaller.
	 */
	max_pages = sbi->s_max_writeback_mb_bump << (20 - PAGE_CACHE_SHIFT);
	if (!range_cyclic && range_whole) {
		if (wbc->nr_to_write == LONG_MAX)
			desired_nr_to_write = wbc->nr_to_write;
		else
			desired_nr_to_write = wbc->nr_to_write * 8;
	} else
		desired_nr_to_write = ext4_num_dirty_pages(inode, index,
							   max_pages);
	if (desired_nr_to_write > max_pages)
		desired_nr_to_write = max_pages;

	if (wbc->nr_to_write < desired_nr_to_write) {
		nr_to_writebump = desired_nr_to_write - wbc->nr_to_write;
		wbc->nr_to_write = desired_nr_to_write;
	}

retry:
	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag_pages_for_writeback(mapping, index, end);

	blk_start_plug(&plug);
	while (!ret && wbc->nr_to_write > 0) {

		/*
		 * we  insert one extent at a time. So we need
		 * credit needed for single extent allocation.
		 * journalled mode is currently not supported
		 * by delalloc
		 */
		BUG_ON(ext4_should_journal_data(inode));
		needed_blocks = ext4_da_writepages_trans_blocks(inode);

		/* start a new transaction*/
		handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
					    needed_blocks);
		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			ext4_msg(inode->i_sb, KERN_CRIT, "%s: jbd2_start: "
			       "%ld pages, ino %lu; err %d", __func__,
				wbc->nr_to_write, inode->i_ino, ret);
			blk_finish_plug(&plug);
			goto out_writepages;
		}

		/*
		 * Now call write_cache_pages_da() to find the next
		 * contiguous region of logical blocks that need
		 * blocks to be allocated by ext4 and submit them.
		 */
		ret = write_cache_pages_da(handle, mapping,
					   wbc, &mpd, &done_index);
		/*
		 * If we have a contiguous extent of pages and we
		 * haven't done the I/O yet, map the blocks and submit
		 * them for I/O.
		 */
		if (!mpd.io_done && mpd.next_page != mpd.first_page) {
			mpage_da_map_and_submit(&mpd);
			ret = MPAGE_DA_EXTENT_TAIL;
		}
		trace_ext4_da_write_pages(inode, &mpd);
		wbc->nr_to_write -= mpd.pages_written;

		ext4_journal_stop(handle);

		if ((mpd.retval == -ENOSPC) && sbi->s_journal) {
			/* commit the transaction which would
			 * free blocks released in the transaction
			 * and try again
			 */
			jbd2_journal_force_commit_nested(sbi->s_journal);
			ret = 0;
		} else if (ret == MPAGE_DA_EXTENT_TAIL) {
			/*
			 * Got one extent now try with rest of the pages.
			 * If mpd.retval is set -EIO, journal is aborted.
			 * So we don't need to write any more.
			 */
			pages_written += mpd.pages_written;
			ret = mpd.retval;
			io_done = 1;
		} else if (wbc->nr_to_write)
			/*
			 * There is no more writeout needed
			 * or we requested for a noblocking writeout
			 * and we found the device congested
			 */
			break;
	}
	blk_finish_plug(&plug);
	if (!io_done && !cycled) {
		cycled = 1;
		index = 0;
		wbc->range_start = index << PAGE_CACHE_SHIFT;
		wbc->range_end  = mapping->writeback_index - 1;
		goto retry;
	}

	/* Update index */
	wbc->range_cyclic = range_cyclic;
	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
		/*
		 * set the writeback_index so that range_cyclic
		 * mode will write it back later
		 */
		mapping->writeback_index = done_index;

out_writepages:
	wbc->nr_to_write -= nr_to_writebump;
	wbc->range_start = range_start;
	trace_ext4_da_writepages_result(inode, wbc, ret, pages_written);
	return ret;
}

  Before analyzing this function, let us first look at a key structure:

struct mpage_da_data {
	struct inode *inode;
	sector_t b_blocknr;		/* start block number of extent */
	size_t b_size;			/* size of extent */
	unsigned long b_state;		/* state of the extent */
	unsigned long first_page, next_page;	/* extent of pages */
	struct writeback_control *wbc;
	int io_done;
	int pages_written;
	int retval;
};

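struct mpage_da_data mainly records the extent waiting to be mapped, where:

first_page, next_page: the range to write back in this pass, expressed as page offsets within the file.

b_blocknr: the starting logical block number of the extent to be mapped.

b_size: the size of the extent to be mapped.

b_state: the flags of the buffer heads (bh) that belong to the extent.

io_done: set to 1 once the range [first_page, next_page-1] has been written back.

  Now for the main flow of ext4_da_writepages:

  The body of the function is a while loop that writes back the region between wbc->range_start and wbc->range_end, one [mpd->first_page, mpd->next_page-1] region per iteration. Where do mpd->first_page and mpd->next_page come from?

  For example, suppose the range from wbc->range_start to wbc->range_end covers pages 1 to 9 of the file, where pages 1, 2, 3, 4 are dirty, page 5 is clean, and pages 6, 7, 8, 9 are dirty. ext4_da_writepages then needs two iterations: the first with mpd->first_page = 1 and mpd->next_page-1 = 4, the second with mpd->first_page = 6 and mpd->next_page-1 = 9. In other words, mpd->first_page and mpd->next_page delimit a maximal run of contiguous dirty pages.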
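  The way the dirty pages fall into maximal contiguous runs can be sketched in a few lines of userspace C. struct sketch_run and sketch_find_dirty_runs are hypothetical names used only for this illustration; the real kernel code walks the page cache radix tree instead of an array of flags.

```c
#include <stddef.h>

/* Hypothetical helper type: one maximal run of contiguous dirty pages. */
struct sketch_run {
    size_t first_page;  /* plays the role of mpd->first_page */
    size_t next_page;   /* one past the last dirty page, like mpd->next_page */
};

/*
 * Scan dirty[0..npages) and record each maximal run of dirty pages.
 * Returns the number of runs found; at most max_runs of them are stored.
 */
size_t sketch_find_dirty_runs(const int *dirty, size_t npages,
                              struct sketch_run *runs, size_t max_runs)
{
    size_t nruns = 0;
    size_t i = 0;

    while (i < npages) {
        size_t start;

        if (!dirty[i]) {        /* clean page: it ends any previous run */
            i++;
            continue;
        }
        start = i;
        while (i < npages && dirty[i])
            i++;                /* extend the run over contiguous dirty pages */
        if (nruns < max_runs) {
            runs[nruns].first_page = start;
            runs[nruns].next_page = i;
        }
        nruns++;
    }
    return nruns;
}
```

  For the example above (pages 1 to 4 dirty, page 5 clean, pages 6 to 9 dirty) the function finds two runs, [1, 5) and [6, 10), which correspond to the two loop iterations of ext4_da_writepages.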

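  (1) write_cache_pages_da accumulates the current run of contiguous dirty pages (recorded in mpd->first_page and mpd->next_page). It also calls mpage_add_bh_to_extent to accumulate the delayed or unwritten blocks in that region, recording them in mpd->b_blocknr and mpd->b_size.

  (2) mpage_da_map_and_submit allocates blocks for the extent recorded in mpd. It calls ext4_map_blocks to allocate space for the extent: a delayed extent gets blocks allocated directly, while an unwritten extent only needs to be converted to written.

  (3) mpage_da_submit_io submits the bios that belong to [mpd->first_page, mpd->next_page-1]. After mpage_da_map_and_submit has allocated blocks, every bh is genuinely mapped. Note that ext4_bio_write_page calls io_submit_add_bh to accumulate contiguous blocks into the bio held in ext4_io_submit.

  When a discontiguous block is encountered, ext4_io_submit is called to submit the bio accumulated so far and accumulation starts over. At the very end, if io_done is not 1, there is still I/O that has not been submitted, so ext4_io_submit is called once more to submit the remaining bio.

  That covers the ext4 side of the write path. Now for the bcache write flow:

  As mentioned above, ext4_da_writepages ends up calling ext4_io_submit to submit the I/O. ext4_io_submit calls submit_bio to enter the generic bio submission path. submit_bio calls generic_make_request, which eventually calls q->make_request_fn, that is, the bcache device's cached_dev_make_request function. From here on we are inside bcache. The detailed steps are:

  (1) Entering the bcache device's q->make_request_fn, cached_dev_make_request. This function allocates a struct search, whose embedded struct closure uses a reference count to track the completion of the bio inside bcache, and binds the bio to the struct search. It also clones the bio. The clone is needed because logical devices such as bcache and LVM have to modify some fields of the bio, or split it, while processing it; and when all of the bio's requests have completed at the lower layers, a series of callbacks that take the bio as their argument still has to run. The original bio therefore has to be preserved, and a cloned bio is submitted instead. Finally, request_write is called.

  (2) request_write. The production bcache runs in writearound mode, so the write goes directly to the backing device. bch_generic_make_request is called to submit the bio, and the bio submitted here is the clone.

  (3) bch_generic_make_request mainly handles bio splitting. If the bio is larger than bch_bio_max_sectors, it is split into pieces of at most bch_bio_max_sectors.

  (4) bch_generic_make_request_hack calls generic_make_request to submit the bio to the LVM device's queue, entering the LVM device's processing path.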

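(III) Locating the bug

  Judging from the hung call stack, the qierouterproxy process called fsync, was switched out in sleep_on_page, and was never scheduled back in.

  Let us walk through the fsync system call. do_fsync calls vfs_fsync, vfs_fsync calls vfs_fsync_range, and vfs_fsync_range calls ext4's ext4_sync_file. Note the arguments in vfs_fsync_range(file, 0, LLONG_MAX, datasync): it waits for all of the file's data. ext4_sync_file calls filemap_write_and_wait_range, shown here: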

int filemap_write_and_wait_range(struct address_space *mapping,
				 loff_t lstart, loff_t lend)
{
	int err = 0;

	if (mapping->nrpages) {
		err = __filemap_fdatawrite_range(mapping, lstart, lend,
						 WB_SYNC_ALL);
		/* See comment of filemap_write_and_wait() */
		if (err != -EIO) {
			int err2 = filemap_fdatawait_range(mapping,
						lstart, lend);
			if (!err)
				err = err2;
		}
	} else {
		err = filemap_check_errors(mapping);
	}
	return err;
}

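  As the name filemap_write_and_wait_range suggests, it does two things: write and wait. The call to __filemap_fdatawrite_range writes out all dirty pages in the address space via ext4_da_writepages, and the call to filemap_fdatawait_range then waits, page by page, for writeback to complete.

  Here is filemap_fdatawait_range: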

int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
			    loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
	struct pagevec pvec;
	int nr_pages;
	int ret2, ret = 0;

	if (end_byte < start_byte)
		goto out;

	pagevec_init(&pvec, 0);
	while ((index <= end) &&
			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
		unsigned i;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/* until radix tree lookup accepts end_index */
			if (page->index > end)
				continue;
			wait_on_page_writeback(page);
			if (TestClearPageError(page))
				ret = -EIO;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
out:
	ret2 = filemap_check_errors(mapping);
	if (!ret)
		ret = ret2;

	return ret;
}

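  This function walks the pages of the inode's address space that are tagged as under writeback and calls wait_on_page_writeback on each one to wait for its writeback to complete. If a page is still under writeback, the process is switched out.

  Add a printk before and after wait_on_page_writeback (the two if/printk pairs around the call), as follows: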

int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
			    loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_CACHE_SHIFT;
	pgoff_t end = end_byte >> PAGE_CACHE_SHIFT;
	struct pagevec pvec;
	int nr_pages;
	int ret2, ret = 0;

	if (end_byte < start_byte)
		goto out;

	pagevec_init(&pvec, 0);
	while ((index <= end) &&
			(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
			PAGECACHE_TAG_WRITEBACK,
			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
		unsigned i;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = pvec.pages[i];

			/* until radix tree lookup accepts end_index */
			if (page->index > end)
				continue;
			if((page->mapping)&&(!((unsigned long)page->mapping&0x1))&&(page->mapping->host->i_ino == 13))
				printk("Before wait page index %ld!\n",page->index);
			wait_on_page_writeback(page);
			if((page->mapping)&&(!((unsigned long)page->mapping&0x1))&&(page->mapping->host->i_ino == 13))
				printk("After  wait page index %ld!\n",page->index);
			if (TestClearPageError(page))
				ret = -EIO;
		}
		pagevec_release(&pvec);
		cond_resched();
	}
out:
	ret2 = filemap_check_errors(mapping);
	if (!ret)
		ret = ret2;

	return ret;
}

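  Here 13 is the inode number of the file being written.

  After rebuilding and flashing the kernel, running the reproduction program test_32 from earlier in the bcache directory hangs the system, and the serial console prints: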

[  191.456617] Before wait page index 0!
[  191.461711] After  wait page index 0!
[  191.461757] Before wait page index 1!
[  191.464117] After  wait page index 1!
[  191.467853] Before wait page index 2!
[  191.472090] After  wait page index 2!
[  191.475609] Before wait page index 3!
[  191.479335] After  wait page index 3!
[  191.483291] Before wait page index 4!
[  191.487035] After  wait page index 4!
[  191.490799] Before wait page index 5!
[  191.494462] After  wait page index 5!
[  191.498253] Before wait page index 6!
[  191.502042] After  wait page index 6!
[  191.505843] Before wait page index 7!
[  191.509621] After  wait page index 7!
[  191.513391] Before wait page index 8!
[  191.517184] After  wait page index 8!
[  191.521016] Before wait page index 9!
[  191.524775] After  wait page index 9!
[  191.528587] Before wait page index 10!
[  191.532452] After  wait page index 10!
[  191.536670] Before wait page index 11!
[  191.540250] After  wait page index 11!
[  191.544103] Before wait page index 12!
[  191.547983] After  wait page index 12!
[  191.551893] Before wait page index 13!
[  191.555748] After  wait page index 13!
[  191.559674] Before wait page index 31!

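  The page with index 31 has a Before line but no After line. So the fsync system call hangs on the page with index 31, the last of the 32 pages written. Not every page in [0, 31] appears in the output because some pages had already completed their I/O earlier and were no longer under writeback.

  Why did the page with index 31 never finish writeback, leaving the process waiting on it stuck in D state? bio->bi_end_io is responsible for clearing the page's writeback flag and waking up the processes that wait on the page for I/O completion. On ext4, bio->bi_end_io is ext4_end_bio. The code is: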

static void ext4_end_bio(struct bio *bio, int error)
{
	ext4_io_end_t *io_end = bio->bi_private;
	struct inode *inode;
	int i;
	int blocksize;
	sector_t bi_sector = bio->bi_sector;
	struct page *page1 = NULL;
	int need_printk = 0;
	
	if(bio_has_data(bio)&&!PageSwapBacked(page1=bio->bi_io_vec[0].bv_page)&&(page1->mapping)&&(!((unsigned long)page1->mapping&0x1))&&(page1->mapping->host->i_ino == 13))
		need_printk = 1;
	BUG_ON(!io_end);
	inode = io_end->inode;
	blocksize = 1 << inode->i_blkbits;
	bio->bi_private = NULL;
	bio->bi_end_io = NULL;
	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
		error = 0;
	for (i = 0; i < bio->bi_vcnt; i++) {
		struct bio_vec *bvec = &bio->bi_io_vec[i];
		struct page *page = bvec->bv_page;
		struct buffer_head *bh, *head;
		unsigned bio_start = bvec->bv_offset;
		unsigned bio_end = bio_start + bvec->bv_len;
		unsigned under_io = 0;
		unsigned long flags;

		if (!page)
			continue;
		if(need_printk)
			printk("page->index %ld start %d end %d!\n",page->index,bio_start,bio_end);
		if (error) {
			SetPageError(page);
			set_bit(AS_EIO, &page->mapping->flags);
		}
		bh = head = page_buffers(page);
		/*
		 * We check all buffers in the page under BH_Uptodate_Lock
		 * to avoid races with other end io clearing async_write flags
		 */
		local_irq_save(flags);
		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
		do {
			if (bh_offset(bh) < bio_start ||
			    bh_offset(bh) + blocksize > bio_end) {
				if (buffer_async_write(bh))
					under_io++;
				continue;
			}
			clear_buffer_async_write(bh);
			if (error)
				buffer_io_error(bh);
		} while ((bh = bh->b_this_page) != head);
		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
		local_irq_restore(flags);
		if (!under_io){
			end_page_writeback(page);
			if(need_printk)
				printk("inode blocksize %d page index %ld!\n",1<<(inode->i_blkbits),page->index);
		}
	}
	bio_put(bio);

	if (error) {
		io_end->flag |= EXT4_IO_END_ERROR;
		ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
			     "(offset %llu size %ld starting block %llu)",
			     inode->i_ino,
			     (unsigned long long) io_end->offset,
			     (long) io_end->size,
			     (unsigned long long)
			     bi_sector >> (inode->i_blkbits - 9));
	}

	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
		ext4_free_io_end(io_end);
		return;
	}

	ext4_add_complete_io(io_end);
}

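  This function walks all pages of the bio, looks at the buffer heads that belong to each page, and handles the page according to their writeback state. If all buffer heads of the page have finished writeback, it calls end_page_writeback to clear the writeback flag and wake up the processes waiting on the page. The need_printk check near the top and the printk after end_page_writeback are the added instrumentation, which prints only the pages of inode number 13 that have finished writeback.

  After rebuilding and flashing the kernel, running the test_32 reproduction program on the bcache device gives the following dmesg output: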

[  214.340375] inode blocksize 4096 page index 0!
[  214.340424] inode blocksize 4096 page index 1!
[  214.343944] inode blocksize 4096 page index 2!
[  214.348572] inode blocksize 4096 page index 3!
[  214.353120] inode blocksize 4096 page index 4!
[  214.357661] inode blocksize 4096 page index 5!
[  214.362261] inode blocksize 4096 page index 6!
[  214.366924] inode blocksize 4096 page index 7!
[  214.371436] inode blocksize 4096 page index 8!
[  214.375958] inode blocksize 4096 page index 9!
[  214.380575] inode blocksize 4096 page index 10!
[  214.385187] inode blocksize 4096 page index 11!
[  214.390170] inode blocksize 4096 page index 12!
[  214.394488] inode blocksize 4096 page index 13!
[  214.399165] inode blocksize 4096 page index 14!
[  214.403803] inode blocksize 4096 page index 15!
[  214.408478] inode blocksize 4096 page index 16!
[  214.413117] inode blocksize 4096 page index 17!
[  214.417773] inode blocksize 4096 page index 18!
[  214.422536] inode blocksize 4096 page index 19!
[  214.427092] inode blocksize 4096 page index 20!
[  214.431816] inode blocksize 4096 page index 21!
[  214.436418] inode blocksize 4096 page index 22!
[  214.441101] inode blocksize 4096 page index 23!
[  214.445736] inode blocksize 4096 page index 24!
[  214.450415] inode blocksize 4096 page index 25!
[  214.455042] inode blocksize 4096 page index 26!
[  214.459744] inode blocksize 4096 page index 27!
[  214.464369] inode blocksize 4096 page index 28!
[  214.469098] inode blocksize 4096 page index 29!
[  214.473705] inode blocksize 4096 page index 30!

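  test_32 writes 32 pages. According to the output, only pages [0, 30] completed their I/O; the last page, index 31, did not, so the process waiting on it was never woken. This matches the output from filemap_fdatawait_range above. The output also shows that the ext4 block size is 4096, so each page has exactly one bh.

  Clearly, end_page_writeback was never called for the last page. The do/while loop in ext4_end_bio walks all buffer heads of the page (only one here). If a bh falls within the bvec's range, clear_buffer_async_write clears its BH_Async_Write flag. If a bh lies outside the bvec's range and still has BH_Async_Write set, under_io is incremented, meaning the page still has a bh whose writeback has not finished, and end_page_writeback must not be called for the page. The BH_Async_Write flag is set in ext4_bio_write_page, after the extent has been mapped and before the bh is added to the bio being accumulated for submission.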
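  The range check that decides whether a page leaves writeback can be isolated in a small userspace sketch. sketch_page_writeback_ends is a hypothetical helper, and it assumes that every buffer head on the page had BH_Async_Write set when the bio was submitted; it only mirrors the comparison inside the do/while loop of ext4_end_bio.

```c
/*
 * Returns 1 if a page whose blocks are `blocksize` bytes each would reach
 * end_page_writeback() for a bio_vec covering
 * [bv_offset, bv_offset + bv_len), and 0 if at least one block would be
 * counted as still under I/O.
 *
 * Assumption: every block on the page had its async-write flag set.
 */
int sketch_page_writeback_ends(unsigned page_size, unsigned blocksize,
                               unsigned bv_offset, unsigned bv_len)
{
    unsigned bio_start = bv_offset;
    unsigned bio_end = bv_offset + bv_len;
    unsigned under_io = 0;
    unsigned bh_off;

    if (blocksize == 0)
        return 0;               /* guard against an endless loop */

    for (bh_off = 0; bh_off < page_size; bh_off += blocksize) {
        /* Same test as in ext4_end_bio: is the block outside the bvec? */
        if (bh_off < bio_start || bh_off + blocksize > bio_end) {
            under_io++;         /* its async-write flag is left set */
            continue;
        }
        /* Block fully covered: its async-write flag would be cleared. */
    }
    return under_io == 0;
}
```

  With a 4096-byte block there is a single buffer head per page, so the page only leaves writeback when the bvec covers the whole page. Any bvec that covers less than the full page leaves the single bh "under I/O" and end_page_writeback is skipped.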

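  Next, add more instrumentation to ext4_end_bio to print the range covered by each bvec. The code with the added printk (the need_printk check near the top and the printk at the start of the per-bvec loop) is: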

static void ext4_end_bio(struct bio *bio, int error)
{
	ext4_io_end_t *io_end = bio->bi_private;
	struct inode *inode;
	int i;
	int blocksize;
	sector_t bi_sector = bio->bi_sector;
	struct page *page1 = NULL;
	int need_printk = 0;
	
	if(bio_has_data(bio)&&!PageSwapBacked(page1=bio->bi_io_vec[0].bv_page)&&(page1->mapping)&&(!((unsigned long)page1->mapping&0x1))&&(page1->mapping->host->i_ino == 13))
		need_printk = 1;
	BUG_ON(!io_end);
	inode = io_end->inode;
	blocksize = 1 << inode->i_blkbits;
	bio->bi_private = NULL;
	bio->bi_end_io = NULL;
	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
		error = 0;
	for (i = 0; i < bio->bi_vcnt; i++) {
		struct bio_vec *bvec = &bio->bi_io_vec[i];
		struct page *page = bvec->bv_page;
		struct buffer_head *bh, *head;
		unsigned bio_start = bvec->bv_offset;
		unsigned bio_end = bio_start + bvec->bv_len;
		unsigned under_io = 0;
		unsigned long flags;

		if (!page)
			continue;
		if(need_printk)
			printk("page->index %ld start %d end %d!\n",page->index,bio_start,bio_end);
		if (error) {
			SetPageError(page);
			set_bit(AS_EIO, &page->mapping->flags);
		}
		bh = head = page_buffers(page);
		/*
		 * We check all buffers in the page under BH_Uptodate_Lock
		 * to avoid races with other end io clearing async_write flags
		 */
		local_irq_save(flags);
		bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
		do {
			if (bh_offset(bh) < bio_start ||
			    bh_offset(bh) + blocksize > bio_end) {
				if (buffer_async_write(bh))
					under_io++;
				continue;
			}
			clear_buffer_async_write(bh);
			if (error)
				buffer_io_error(bh);
		} while ((bh = bh->b_this_page) != head);
		bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
		local_irq_restore(flags);
		if (!under_io){
			end_page_writeback(page);
			//if(need_printk)
			//	printk("inode blocksize %d page index %ld!\n",1<<(inode->i_blkbits),page->index);
		}
	}
	bio_put(bio);

	if (error) {
		io_end->flag |= EXT4_IO_END_ERROR;
		ext4_warning(inode->i_sb, "I/O error writing to inode %lu "
			     "(offset %llu size %ld starting block %llu)",
			     inode->i_ino,
			     (unsigned long long) io_end->offset,
			     (long) io_end->size,
			     (unsigned long long)
			     bi_sector >> (inode->i_blkbits - 9));
	}

	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
		ext4_free_io_end(io_end);
		return;
	}

	ext4_add_complete_io(io_end);
}

  After rebuilding and flashing the kernel, running ./test_32 in the bcache directory gives the following dmesg output:

[  175.681627] page->index 0 start 0 end 4096!
[  175.681686] page->index 1 start 0 end 4096!
[  175.684686] page->index 2 start 0 end 4096!
[  175.689027] page->index 3 start 0 end 4096!
[  175.693330] page->index 4 start 0 end 4096!
[  175.697622] page->index 5 start 0 end 4096!
[  175.701956] page->index 6 start 0 end 4096!
[  175.706245] page->index 7 start 0 end 4096!
[  175.710627] page->index 8 start 0 end 4096!
[  175.714875] page->index 9 start 0 end 4096!
[  175.719203] page->index 10 start 0 end 4096!
[  175.723581] page->index 11 start 0 end 4096!
[  175.727991] page->index 12 start 0 end 4096!
[  175.732449] page->index 13 start 0 end 4096!
[  175.736791] page->index 14 start 0 end 4096!
[  175.741231] page->index 15 start 0 end 4096!
[  175.745588] page->index 16 start 0 end 4096!
[  175.750204] page->index 17 start 0 end 4096!
[  175.754370] page->index 18 start 0 end 4096!
[  175.758787] page->index 19 start 0 end 4096!
[  175.763168] page->index 20 start 0 end 4096!
[  175.767566] page->index 21 start 0 end 4096!
[  175.772026] page->index 22 start 0 end 4096!
[  175.776384] page->index 23 start 0 end 4096!
[  175.780831] page->index 24 start 0 end 4096!
[  175.785183] page->index 25 start 0 end 4096!
[  175.789629] page->index 26 start 0 end 4096!
[  175.793983] page->index 27 start 0 end 4096!
[  175.798418] page->index 28 start 0 end 4096!
[  175.802776] page->index 29 start 0 end 4096!
[  175.807161] page->index 30 start 0 end 4096!
[  175.811627] page->index 31 start 3584 end 4096!

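  Why is the last bvec [3584, 4096], only 512 bytes, when it should also be [0, 4096]? From the earlier analysis of the ext4 write flow, writing 32 pages should allocate contiguous physical blocks on disk, and because the physical space is contiguous a single bio is enough. All 32 pages should therefore show [0, 4096]. Since the bio's bvec has been modified, bio splitting comes to mind: if a bio is split at a point that is not on a bvec boundary, the range of the bvec at the split point has to be adjusted. In bcache the split function is bch_bio_split, which we will look at later. First, let us confirm whether a bio split actually happens.

  To verify whether the bio is split, I printed the bio size at the entry to bcache and at the exit from bcache (that is, the entry to LVM), filtered with the following condition: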

   if(bio_has_data(bio)&&!PageSwapBacked(page1=bio->bi_io_vec[0].bv_page)&&(page1->mapping)&&(!((unsigned long)page1->mapping&0x1))&&(page1->mapping->host->i_ino == 13))

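  This way only bios issued for the file with inode number 13 are printed, which filters out the noise.

  The entry to bcache is the make_request_fn of the bcache device's queue, cached_dev_make_request. The entry to LVM is the LVM device's make_request_fn, dm_request. The output is: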

[  184.801757] func cached_dev_make_request bio sectors 256!
[  184.805082] func dm_request bio sectors 255!
[  184.809459] func dm_request bio sectors 1!

  Clearly, a 256-sector bio enters bcache and comes out split into a 255-sector bio and a 1-sector bio.
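  The 255 + 1 split lines up exactly with the odd bvec seen in the dmesg output. As a quick check, the following sketch computes where a cut after a given number of 512-byte sectors lands in a bio made of full 4096-byte pages. struct sketch_bvec and sketch_tail_bvec_after_split are names made up for this illustration.

```c
#define SKETCH_SECTOR_BYTES 512u
#define SKETCH_PAGE_BYTES   4096u

/* Simplified stand-in for struct bio_vec, plus the page it refers to. */
struct sketch_bvec {
    unsigned page_index;  /* which page of the bio the segment belongs to */
    unsigned bv_offset;   /* byte offset inside that page */
    unsigned bv_len;      /* byte length of the segment */
};

/*
 * For a bio made of full, page-aligned pages, compute the first segment
 * of the part that remains after cutting `split_sectors` off the front.
 */
struct sketch_bvec sketch_tail_bvec_after_split(unsigned split_sectors)
{
    /* Byte position of the cut, counted from the start of the bio. */
    unsigned cut = split_sectors * SKETCH_SECTOR_BYTES;
    struct sketch_bvec bv;

    bv.page_index = cut / SKETCH_PAGE_BYTES;
    bv.bv_offset  = cut % SKETCH_PAGE_BYTES;
    bv.bv_len     = SKETCH_PAGE_BYTES - bv.bv_offset;
    return bv;
}
```

  For a cut after 255 sectors the byte position is 255 * 512 = 130560, which is page 31 at offset 3584, leaving 512 bytes. That is the "page->index 31 start 3584 end 4096" line in the log.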

  Let us go through the bio's path inside bcache once more:

  The bio enters bcache's make_request_fn, cached_dev_make_request. Because it is a write bio, it goes to request_write, which calls bch_generic_make_request. The code of bch_generic_make_request is:

 void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
 {
     struct bio_split_hook *s;

     if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
         goto submit;

     if (bio_sectors(bio) <= bch_bio_max_sectors(bio))
         goto submit;

     s = mempool_alloc(p->bio_split_hook, GFP_NOIO);

     s->bio      = bio;
     s->p        = p;
     s->bi_end_io    = bio->bi_end_io;
     s->bi_private   = bio->bi_private;
     bio_get(bio);

     closure_call(&s->cl, __bch_bio_submit_split, NULL, NULL);
     return;
 submit:
     bch_generic_make_request_hack(bio);
 }

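  If the bio is no larger than bch_bio_max_sectors (255 sectors here, 512 bytes each), bch_generic_make_request_hack is called directly to submit the bio to the underlying LVM device. If the bio exceeds 255 sectors, __bch_bio_submit_split is called. It uses bch_bio_split to cut the bio into pieces of at most 255 sectors, and each piece is then submitted to the underlying LVM device through bch_generic_make_request_hack.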
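  The size rule itself is simple enough to sketch in userspace. sketch_split_sectors is a hypothetical helper that only models the piece sizes; it does not model how bch_bio_split adjusts the bvec array.

```c
#include <stddef.h>

/*
 * Split a request of `total_sectors` into pieces of at most `max_sectors`,
 * front first. Piece sizes are written to out[] (up to out_cap entries).
 * Returns the number of pieces.
 */
size_t sketch_split_sectors(unsigned total_sectors, unsigned max_sectors,
                            unsigned *out, size_t out_cap)
{
    size_t n = 0;

    if (max_sectors == 0)
        return 0;               /* nothing sensible to do with a zero limit */

    while (total_sectors > 0) {
        /* Take a full-sized piece, or whatever is left if that is less. */
        unsigned piece = total_sectors <= max_sectors ? total_sectors
                                                      : max_sectors;

        if (n < out_cap)
            out[n] = piece;
        n++;
        total_sectors -= piece;
    }
    return n;
}
```

  A 256-sector request with a 255-sector limit yields two pieces, 255 and 1, matching the two dm_request lines in the log. A 255-sector request yields a single piece, which is why writes just below 128 KiB never hit the split path.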

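  The relationship between these bios: the original bio (256 sectors), the cloned bio (256 sectors), and the two bios of 255 sectors and 1 sector that the clone is split into. Looking at bch_bio_split, when the clone is split, a new 255-sector bio is allocated and the clone itself is shrunk to a 1-sector bio.

  So in the end there are three bios: the original bio (256 sectors), the newly allocated 255-sector bio produced by the split, and the clone bio, now shrunk to 1 sector.

  The end-I/O callback of the original bio is ext4_end_bio. The end-I/O callback of both the new 255-sector bio and the 1-sector clone bio is bch_bio_submit_split_endio. Before the split, the clone bio's bi_private is closure1 (set in do_bio_hook); for the split, the bi_private of the 255-sector bio and of the clone bio is closure2. Each time the 255-sector bio or the 1-sector clone bio completes, bch_bio_submit_split_endio runs and calls closure_put(bio->bi_private), decrementing closure2's count by 1. When the count reaches 0 (both the 255-sector bio and the clone bio have completed), closure->fn is called, which is bch_bio_submit_split_done.

  Here is bch_bio_submit_split_done: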

static void bch_bio_submit_split_done(struct closure *cl)
{
    struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);

    s->bio->bi_end_io = s->bi_end_io;
    s->bio->bi_private = s->bi_private;
    bio_endio(s->bio, 0);

    closure_debug_destroy(&s->cl);
    mempool_free(s, s->p->bio_split_hook);
}

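  s->bio is the 1-sector clone bio. The function restores s->bi_end_io (request_endio) as the clone bio's end-I/O callback and sets the clone bio's bi_private back to closure1 (it had been closure2). It then calls bio_endio, which runs request_endio; that calls closure_put(closure1) and decrements closure1's count by 1. When that count reaches 0, closure1->fn is called, which is cached_dev_write_complete. That function finally invokes the original bio's end-I/O callback, ext4_end_bio. Quite a detour.

  In a nutshell: bcache does not submit the bio it is given but a clone of it. If the clone exceeds bch_bio_max_sectors, it is further split into several bios. Only after all of these bios have completed their I/O does bcache's callback machinery ensure that the completion of the last one runs the original bio's callback, which in this case is ext4_end_bio.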
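  The "last child to finish runs the parent's completion" idea can be modelled with a plain counter and a callback. This is only a single-threaded sketch with made-up names (struct sketch_closure, sketch_closure_put); bcache's real struct closure uses atomic counters and has a richer API.

```c
/* Hypothetical stand-in for a closure: a counter plus a callback. */
struct sketch_closure {
    int remaining;                          /* outstanding child I/Os */
    void (*fn)(struct sketch_closure *cl);  /* runs once when remaining hits 0 */
    int parent_completions;                 /* how many times fn has run */
};

/* Stand-in for the step that eventually reaches the original end_io. */
void sketch_parent_done(struct sketch_closure *cl)
{
    cl->parent_completions++;
}

void sketch_closure_init(struct sketch_closure *cl, int children)
{
    cl->remaining = children;
    cl->fn = sketch_parent_done;
    cl->parent_completions = 0;
}

/* Called from each child's completion handler, in the spirit of closure_put(). */
void sketch_closure_put(struct sketch_closure *cl)
{
    if (--cl->remaining == 0)
        cl->fn(cl);             /* last child finished: complete the parent */
}
```

  With two children (the 255-sector bio and the 1-sector bio), the first sketch_closure_put does nothing visible and the second one runs the parent callback exactly once, regardless of which child finishes first.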

  Back to the root cause of the I/O hang: why did the last page neither wake the process waiting on it nor clear its writeback flag? Look at how the original bio is cloned, in search_alloc:

static struct search *search_alloc(struct bio *bio, struct bcache_device *d)
{
    struct bio_vec *bv;
    struct search *s = mempool_alloc(d->c->search, GFP_NOIO);
    memset(s, 0, offsetof(struct search, op.keys));

    __closure_init(&s->cl, NULL);

    s->op.inode     = d->id;
    s->op.c         = d->c;
    s->d            = d;
    s->op.lock      = -1;
    s->task         = current;
    s->orig_bio     = bio;
    s->write        = (bio->bi_rw & REQ_WRITE) != 0;
    s->op.flush_journal = (bio->bi_rw & (REQ_FLUSH|REQ_FUA)) != 0;
    s->op.skip      = (bio->bi_rw & REQ_DISCARD) != 0;
    s->recoverable      = 1;
    s->start_time       = jiffies;
    do_bio_hook(s);

    if (bio->bi_size != bio_segments(bio) * PAGE_SIZE) {
        bv = mempool_alloc(d->unaligned_bvec, GFP_NOIO);
        memcpy(bv, bio_iovec(bio),
               sizeof(struct bio_vec) * bio_segments(bio));

        s->bio.bio.bi_io_vec    = bv;
        s->unaligned_bvec   = 1;
    }

    return s;
}

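  The bio parameter is the original bio, which is assigned to s->orig_bio. In do_bio_hook, the clone bio is s->bio.bio, and it is a complete copy of the original bio. do_bio_hook is as follows: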

static void do_bio_hook(struct search *s)
{
    struct bio *bio = &s->bio.bio;
    memcpy(bio, s->orig_bio, sizeof(struct bio));

    bio->bi_end_io      = request_endio;
    bio->bi_private     = &s->cl;
    atomic_set(&bio->bi_cnt, 3);
}

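  As the memcpy in do_bio_hook shows, the clone bio is a byte-for-byte copy of the original struct bio, so at this point the original bio and the clone bio share the same bvec array. This is the root of the problem.

  In search_alloc, the check bio->bi_size != bio_segments(bio) * PAGE_SIZE means that a new bvec array is allocated for the clone bio, and the original array's contents copied into it, only if not every bvec of the original bio is a full page. In our case the original bio is 256 sectors and every bvec is a full page, so no new bvec array is allocated and the clone bio shares the bvec array with the original bio.

  The problem is that when the clone bio is split, the 256 sectors are cut at 255 sectors into a 255-sector bio and a 1-sector bio, and the clone bio becomes the 1-sector bio. The split modifies the bvec range: the clone bio's last bvec changes from {bv_len = 4096, bv_offset = 0} to {bv_len = 512, bv_offset = 3584}. Because the clone bio and the original bio share the bvec array, the last bvec of the original bio is modified as well. Finally, after all the I/O has completed, the original bio's callback runs, but for the last page the bh no longer falls within the bvec's range, so end_page_writeback() is never called and the process waiting on that page hangs.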
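  The aliasing can be reproduced in a few lines of userspace C. The types below are simplified stand-ins (struct sketch_bio has only the two fields that matter here), and sketch_trim_last_segment is a hypothetical helper that mimics what a split inside the last segment does to it.

```c
#include <string.h>

struct sketch_bvec {
    unsigned bv_offset;
    unsigned bv_len;
};

struct sketch_bio {
    struct sketch_bvec *vec;   /* pointer to the segment array, like bi_io_vec */
    unsigned vcnt;             /* number of segments */
};

/* Shallow clone: copies the struct only, so both bios point at one array. */
void sketch_clone_shallow(struct sketch_bio *dst, const struct sketch_bio *src)
{
    memcpy(dst, src, sizeof(*dst));
}

/*
 * Shrink the last segment of `bio` to its final `keep` bytes, as a split
 * that does not land on a segment boundary would.
 */
void sketch_trim_last_segment(struct sketch_bio *bio, unsigned keep)
{
    struct sketch_bvec *last = &bio->vec[bio->vcnt - 1];

    last->bv_offset += last->bv_len - keep;
    last->bv_len = keep;
}
```

  After a shallow clone, trimming the clone's last segment to 512 bytes also changes what the original bio sees: its last segment now reads {bv_offset = 3584, bv_len = 512} even though nobody touched the original struct.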

(IV) Fix and conclusion

  Now that we know the root cause, the original bio and the clone bio sharing a bvec array, search_alloc can be modified as follows:

static struct search *search_alloc(struct bio *bio, struct bcache_device *d)
{
    struct bio_vec *bv;
    struct search *s = mempool_alloc(d->c->search, GFP_NOIO);
    memset(s, 0, offsetof(struct search, op.keys));

    __closure_init(&s->cl, NULL);

    s->op.inode     = d->id;
    s->op.c         = d->c;
    s->d            = d;
    s->op.lock      = -1;
    s->task         = current;
    s->orig_bio     = bio;
    s->write        = (bio->bi_rw & REQ_WRITE) != 0;
    s->op.flush_journal = (bio->bi_rw & (REQ_FLUSH|REQ_FUA)) != 0;
    s->op.skip      = (bio->bi_rw & REQ_DISCARD) != 0;
    s->recoverable      = 1;
    s->start_time       = jiffies;
    do_bio_hook(s);

  //  if (bio->bi_size != bio_segments(bio) * PAGE_SIZE) {
        bv = mempool_alloc(d->unaligned_bvec, GFP_NOIO);
        memcpy(bv, bio_iovec(bio),
               sizeof(struct bio_vec) * bio_segments(bio));

        s->bio.bio.bi_io_vec    = bv;
        s->unaligned_bvec   = 1;
    //}

    return s;
}

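  The if check and its closing brace are commented out, so that a separate bvec array is allocated for the clone bio whether or not the bio is page-aligned. This guarantees that the clone never shares the bvec array with the original bio, so the original can no longer be modified.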
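  The effect of giving the clone its own segment array can be checked with a userspace sketch. It uses malloc in place of the kernel mempool, and the simplified types and the name sketch_clone_private are assumptions made for this illustration.

```c
#include <stdlib.h>
#include <string.h>

struct sketch_bvec {
    unsigned bv_offset;
    unsigned bv_len;
};

struct sketch_bio {
    struct sketch_bvec *vec;   /* segment array, like bi_io_vec */
    unsigned vcnt;             /* number of segments, assumed > 0 */
};

/*
 * Clone `src` into `dst` with a private copy of the segment array.
 * Returns 0 on success, -1 if the allocation fails.
 * The caller is responsible for free(dst->vec).
 */
int sketch_clone_private(struct sketch_bio *dst, const struct sketch_bio *src)
{
    struct sketch_bvec *copy = malloc(sizeof(*copy) * src->vcnt);

    if (!copy)
        return -1;

    /* Copy the segment descriptors so later edits stay local to the clone. */
    memcpy(copy, src->vec, sizeof(*copy) * src->vcnt);
    dst->vec = copy;
    dst->vcnt = src->vcnt;
    return 0;
}
```

  With a private copy, rewriting the clone's last segment to {bv_offset = 3584, bv_len = 512} leaves the original bio's segments at {0, 4096}, so the completion check on the original bio still sees full pages. The cost is one extra allocation and copy per request.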

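  Logical devices such as bcache and LVM redirect or split bios, so they never submit the original bio, which keeps the original bio's parameters from being changed. Once the whole I/O range covered by the original bio has completed, the original bio's end-I/O callback is invoked through the callback mechanism. Throughout this process, the original bio must not be modified.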
