Linux Cache Writeback Mechanism

Original: http://oenhan.com/linux-cache-writeback

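While building process-safety monitoring, we picked a rule off the top of our heads: if a process stays in the D state, i.e. TASK_UNINTERRUPTIBLE (uninterruptible sleep), for more than 8 minutes, panic the system. It so happened that the DB team, when writing logs, buffered the entire log in memory and flushed it to disk only at the end. As a result a process sat in the D state for a long time and the system duly panicked. The incident touches on how Linux writes cached data back to disk and how that can be tuned, so here is a summary.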

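Under the current mechanism, dirty pages are generally written back to disk in the following situations:

  1. The dirty page cache takes up too much memory and free memory runs short;
  2. Pages have been dirty for a long time and have reached the age limit, so they must be flushed promptly to keep memory and disk consistent;
  3. An external command forces dirty pages to be flushed to disk;
  4. A write to disk checks the dirty state and triggers a flush.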

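The kernel uses pdflush threads to flush dirty pages to disk. The number of pdflush threads stays between 2 and 8 and can be read directly from /proc/sys/vm/nr_pdflush_threads. For the exact policy, see the function __pdflush in the kernel source.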
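The tunables mentioned in this article are all one-line integer files under /proc/sys/vm, so they can be inspected from user space. The following is a minimal sketch of that; read_vm_tunable is a hypothetical helper name chosen for illustration, and some of the files (nr_pdflush_threads in particular) may be absent on kernels that no longer use pdflush, which is why the helper reports failure instead of assuming the file exists.

```c
#include <stdio.h>

/*
 * Read a single integer from a one-line file such as
 * /proc/sys/vm/dirty_background_ratio.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 * read_vm_tunable is a hypothetical helper, not a kernel or libc API.
 */
int read_vm_tunable(const char *path, long *value)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;      /* e.g. the tunable does not exist on this kernel */
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

A caller would pass a path such as "/proc/sys/vm/dirty_background_ratio" and check the return value before using the result.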

I. Forced flushing by other kernel modules

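Start with the first and third cases: when memory runs short or a flush is forced from outside, dirty pages are flushed by calling wakeup_pdflush. Its callers are do_sync, free_more_memory, and try_to_free_pages. wakeup_pdflush does its work through the function background_writeout: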

static void background_writeout(unsigned long _min_pages)
{
    long min_pages = _min_pages;
    struct writeback_control wbc = {
        .bdi = NULL,
        .sync_mode = WB_SYNC_NONE,
        .older_than_this = NULL,
        .nr_to_write = 0,
        .nonblocking = 1,
    };
    for ( ; ; ) {
        struct writeback_state wbs;
        long background_thresh;
        long dirty_thresh;
        get_dirty_limits(&wbs, &background_thresh, &dirty_thresh, NULL);
        if (wbs.nr_dirty + wbs.nr_unstable < background_thresh
                && min_pages <= 0)
            break;
        wbc.encountered_congestion = 0;
        wbc.nr_to_write = MAX_WRITEBACK_PAGES;
        wbc.pages_skipped = 0;
        writeback_inodes(&wbc);
        min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
        if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
            /* Wrote less than expected */
            blk_congestion_wait(WRITE, HZ/10);
            if (!wbc.encountered_congestion)
                break;
        }
    }
}

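background_writeout enters an endless loop. It calls get_dirty_limits to obtain background_thresh, the threshold at which dirty-page flushing starts; this is dirty_background_ratio applied as a percentage to the total number of memory pages. It can be adjusted through the proc interface /proc/sys/vm/dirty_background_ratio and usually defaults to 10. While dirty pages are at or above the threshold (or the requested min_pages have not yet been written), it calls writeback_inodes to write up to MAX_WRITEBACK_PAGES (1024) pages per pass, until the dirty-page count falls below the threshold.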
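To make the threshold arithmetic concrete, here is a small user-space sketch. It is a simplified model, not kernel code: ratio_to_pages and passes_needed are hypothetical names, the model assumes every pass writes a full chunk and that no new pages are dirtied in the meantime, and it ignores the min_pages condition.

```c
/* Page count that corresponds to a percentage of total pages (integer math). */
long ratio_to_pages(long total_pages, long ratio_percent)
{
    return total_pages * ratio_percent / 100;
}

/*
 * Number of passes of `chunk` pages needed before the dirty count drops
 * below `thresh`. Mirrors the loop's exit test (dirty < background_thresh)
 * under the simplifying assumptions stated in the text above.
 */
long passes_needed(long dirty_pages, long thresh, long chunk)
{
    long passes = 0;

    while (dirty_pages >= thresh) {
        dirty_pages -= chunk;
        passes++;
    }
    return passes;
}
```

For example, with 1048576 total pages (4 GiB of 4 KiB pages) and a ratio of 10, ratio_to_pages gives 104857 pages; starting from 110000 dirty pages, six passes of 1024 pages are needed in this model to get below that threshold.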

II. Flushing started by the kernel timer

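At boot, page_writeback_init initializes the wb_timer timer. Its timeout is dirty_writeback_centisecs, in units of 0.01 second, and can be adjusted through /proc/sys/vm/dirty_writeback_centisecs. The timer's handler is wb_timer_fn, which ultimately does the work through wb_kupdate.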

static void wb_kupdate(unsigned long arg)
{
    sync_supers();
    get_writeback_state(&wbs);
    oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
    start_jif = jiffies;
    next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100;
    nr_to_write = wbs.nr_dirty + wbs.nr_unstable +
            (inodes_stat.nr_inodes - inodes_stat.nr_unused);
    while (nr_to_write > 0) {
        wbc.encountered_congestion = 0;
        wbc.nr_to_write = MAX_WRITEBACK_PAGES;
        writeback_inodes(&wbc);
        if (wbc.nr_to_write > 0) {
            if (wbc.encountered_congestion)
                blk_congestion_wait(WRITE, HZ/10);
            else
                break; /* All the old data is written */
        }
        nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
    }
    if (time_before(next_jif, jiffies + HZ))
        next_jif = jiffies + HZ;
    if (dirty_writeback_centisecs)
        mod_timer(&wb_timer, next_jif);
}

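The code above is not copied in full (the variable declarations and the wbc setup are omitted). The kernel first writes superblock information to the file system, then computes oldest_jif and passes it in wbc so that only pages that have been dirty for longer than dirty_expire_centisecs are flushed. dirty_expire_centisecs can be adjusted through /proc/sys/vm/dirty_expire_centisecs.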
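The age cutoff can be illustrated with a short user-space sketch of the same arithmetic. The function names are hypothetical, HZ is passed in as a parameter because its value depends on the kernel configuration, and the signed-difference comparison is an assumption modeled on how jiffies comparisons are usually made wraparound-safe; it is not copied from the excerpt.

```c
/* Convert centiseconds to jiffies, as in (dirty_expire_centisecs * HZ) / 100. */
unsigned long centisecs_to_jiffies(unsigned long centisecs, unsigned long hz)
{
    return centisecs * hz / 100;
}

/*
 * Return 1 if data dirtied at `dirtied_when` is old enough to be flushed
 * at time `now`, i.e. it was dirtied at or before
 * oldest_jif = now - expire interval. All times are in jiffies.
 */
int is_expired(unsigned long dirtied_when, unsigned long now,
               unsigned long expire_centisecs, unsigned long hz)
{
    unsigned long oldest_jif = now - centisecs_to_jiffies(expire_centisecs, hz);

    /* Signed difference keeps the comparison valid across counter wraparound. */
    return (long)(dirtied_when - oldest_jif) <= 0;
}
```

With an assumed HZ of 1000 and an expiry of 3000 centiseconds (30 seconds), the cutoff lies 30000 jiffies in the past: at now = 100000, data dirtied at jiffy 60000 qualifies for writeback, while data dirtied at jiffy 90000 does not.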

III. Flushing the cache when writing a file with write()

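When user space writes a file with write(), dirty pages may also need to be flushed. After marking the written pages dirty, generic_file_buffered_write may flush to disk, depending on conditions, to balance the current dirty-page ratio; see balance_dirty_pages_ratelimited: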

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
    static DEFINE_PER_CPU(int, ratelimits) = 0;
    long ratelimit;
    ratelimit = ratelimit_pages;
    if (dirty_exceeded)
        ratelimit = 8;
    /*
     * Check the rate limiting. Also, we do not want to throttle real-time
     * tasks in balance_dirty_pages(). Period.
     */
    if (get_cpu_var(ratelimits)++ >= ratelimit) {
        __get_cpu_var(ratelimits) = 0;
        put_cpu_var(ratelimits);
        balance_dirty_pages(mapping);
        return;
    }
    put_cpu_var(ratelimits);
}

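balance_dirty_pages_ratelimited uses ratelimit_pages to limit how often flushing (a call to balance_dirty_pages) happens: on each CPU, roughly one call in every ratelimit_pages reaches balance_dirty_pages, and once dirty_exceeded is set the limit drops to 8. For the flushing itself, see balance_dirty_pages: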
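The counting logic can be modeled in user space as follows. This is only a sketch: struct write_ratelimit stands in for the per-CPU ratelimits counter, and both function names are hypothetical. It keeps the same post-increment comparison as the kernel excerpt, which means a hit actually occurs on every (ratelimit + 1)-th call.

```c
/* Stand-in for the per-CPU `ratelimits` counter in the kernel excerpt. */
struct write_ratelimit {
    int count;
};

/*
 * Return 1 when the caller should go on to balance dirty pages,
 * 0 otherwise. Same test as the excerpt: compare, then increment,
 * and reset the counter on a hit.
 */
int ratelimit_hit(struct write_ratelimit *rl, int ratelimit)
{
    if (rl->count++ >= ratelimit) {
        rl->count = 0;
        return 1;
    }
    return 0;
}

/* Count how many of `writes` consecutive calls would trigger balancing. */
int count_balance_calls(int writes, int ratelimit)
{
    struct write_ratelimit rl = { 0 };
    int hits = 0;
    int i;

    for (i = 0; i < writes; i++)
        hits += ratelimit_hit(&rl, ratelimit);
    return hits;
}
```

With a limit of 8, the 9th, 18th, and 27th calls trigger balancing, so 27 calls produce 3 hits in this model.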

static void balance_dirty_pages(struct address_space *mapping)
{
    struct writeback_state wbs;
    long nr_reclaimable;
    long background_thresh;
    long dirty_thresh;
    unsigned long pages_written = 0;
    unsigned long write_chunk = sync_writeback_pages();
    struct backing_dev_info *bdi = mapping->backing_dev_info;
    for (;;) {
        struct writeback_control wbc = {
            .bdi = bdi,
            .sync_mode = WB_SYNC_NONE,
            .older_than_this = NULL,
            .nr_to_write = write_chunk,
        };
        get_dirty_limits(&wbs, &background_thresh,
                &dirty_thresh, mapping);
        nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
        if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
            break;
        if (!dirty_exceeded)
            dirty_exceeded = 1;
        /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
         * Unstable writes are a feature of certain networked
         * filesystems (i.e. NFS) in which data may have been
         * written to the server's write cache, but has not yet
         * been flushed to permanent storage.
         */
        if (nr_reclaimable) {
            writeback_inodes(&wbc);
            get_dirty_limits(&wbs, &background_thresh,
                    &dirty_thresh, mapping);
            nr_reclaimable = wbs.nr_dirty + wbs.nr_unstable;
            if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh)
                break;
            pages_written += write_chunk - wbc.nr_to_write;
            if (pages_written >= write_chunk)
                break; /* We've done our duty */
        }
        blk_congestion_wait(WRITE, HZ/10);
    }
    if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
        dirty_exceeded = 0;
    if (writeback_in_progress(bdi))
        return; /* pdflush is already working this queue */
    /*
     * In laptop mode, we wait until hitting the higher threshold before
     * starting background writeout, and then write out all the way down
     * to the lower threshold. So slow writers cause minimal disk activity.
     *
     * In normal mode, we start background writeout at the lower
     * background_thresh, to keep the amount of dirty memory low.
     */
    if ((laptop_mode && pages_written) ||
            (!laptop_mode && (nr_reclaimable > background_thresh)))
        pdflush_operation(background_writeout, 0);
}

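The function enters an endless loop and calls get_dirty_limits to obtain the page counts that correspond to dirty_background_ratio and dirty_ratio. At the check nr_reclaimable + wbs.nr_writeback <= dirty_thresh near the top of the loop, if dirty pages exceed dirty_thresh it calls writeback_inodes to start flushing the cache to disk. If one pass does not bring the dirty-page ratio below dirty_ratio, it blocks the writer with blk_congestion_wait and loops again, until the ratio drops to dirty_ratio or the writer has itself written write_chunk pages. Once the ratio is below dirty_ratio but still above dirty_background_ratio, it uses pdflush_operation to start background_writeout. pdflush_operation is non-blocking: it wakes pdflush and returns immediately, and background_writeout is then run by pdflush.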

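In short: on write(), if the dirty cache exceeds dirty_ratio, the write is blocked while dirty pages are flushed, until the cache falls below dirty_ratio; if the cache exceeds dirty_background_ratio, the write wakes the pdflush thread to flush dirty pages without blocking the write.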
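The two-threshold behavior can be summarized as a small decision function. This is a deliberately simplified sketch: the enum and function names are hypothetical, both thresholds are given directly in pages, and details of the real code such as laptop_mode, pages under writeback, and the write_chunk early exit are left out.

```c
/* What happens to a writer, in this simplified model. */
enum write_action {
    WRITE_PROCEEDS,         /* below both thresholds: nothing extra happens */
    WRITE_WAKES_PDFLUSH,    /* above the background threshold: non-blocking */
    WRITE_THROTTLED         /* above the dirty threshold: writer is blocked */
};

/*
 * Classify a write given the current dirty page count and the page
 * counts derived from dirty_background_ratio and dirty_ratio.
 */
enum write_action classify_write(long dirty_pages,
                                 long background_thresh,
                                 long dirty_thresh)
{
    if (dirty_pages > dirty_thresh)
        return WRITE_THROTTLED;
    if (dirty_pages > background_thresh)
        return WRITE_WAKES_PDFLUSH;
    return WRITE_PROCEEDS;
}
```

For instance, with a background threshold of 100 pages and a dirty threshold of 400 pages, a writer that sees 250 dirty pages only wakes pdflush, while one that sees 500 dirty pages is throttled until the count comes back down.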

IV. Summary

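Processes stuck in the D state are mostly caused by cases 3 and 4: under heavy writes, with the cache managed by Linux, once dirty pages accumulate past a certain level, both further writes and an fsync flush will leave the process stuck in the D state.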

