posix_fadvise: a common misconception about dropping the page cache, and how to fix it

Original article; when reposting, please credit the source: 系统技术非业余研究

Permalink: posix_fadvise清除缓存的误解和改进措施

On a typical I/O-intensive database server such as MySQL, a large volume of file reads and writes takes place. These files are normally accessed through buffered I/O, so as to take full advantage of the Linux page cache.

Buffered I/O works like this: on a read, the kernel first checks whether the page cache already holds the data; if not, it reads the data from the device, returns it to the user, and keeps a copy in the cache. On a write, the data goes straight into the cache, and background flusher threads write it out to disk periodically. The mechanism looks very elegant, and in practice it works well.

But if your I/O is very intensive, problems appear. First, because the page size is 4 KB, memory utilization is relatively poor. Second, the cache eviction policy is simple and entirely up to the kernel; the user has little opportunity to influence it. When the volume of dirty pages exceeds a certain fraction of system memory, the background reclaim threads (kswapd) come out to reclaim pages, and once reclaim runs slower than the incoming writes, unpredictable behavior follows.

The biggest issue is this: as long as your memory usage, cache included, has not crossed the kernel's thresholds, the kernel chooses to do nothing and lets you use the cache freely; from its point of view that is the most efficient policy. But it is exactly this policy that causes trouble in practice.

Take a MySQL server as an example. We can route the data files through direct I/O, but the logs still go through buffered I/O, because direct I/O requires the offset and size of every write to be sector-aligned, which is far too cumbersome for a logging subsystem. Since MySQL is transactional, it generates a huge number of log operations: frequent writes, each followed by an fsync. Once a log page has been written to disk, its buffered page is useless, yet it stays in memory until the memory threshold is reached, at which point the kernel suddenly reclaims pages in bulk, causing I/O stalls, swapping, and other undesirable behavior.

Now that we know where the trap lies, we can actively avoid it. There are two approaches:
1. Route the logs through direct I/O as well. This requires fairly large-scale changes to the MySQL code; Percona has done exactly that and provides a patch.
2. Keep the logs on buffered I/O, but periodically drop the useless page-cache pages.

The first approach is not what we want to discuss here; let's focus on how to do the second:

Our program knows the file descriptor, so can't we simply use:

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.

to solve the problem? That is, call something like posix_fadvise(fd, 0, len_of_file, POSIX_FADV_DONTNEED); to drop the file's cached pages.

The vmtouch tool introduced earlier has exactly this capability: evicting a given file from the cache. Try vmtouch -ve logfile and you will find that memory usage does not drop at all. Why not?

Let's look at the source to see how posix_fadvise actually works. See mm/fadvise.c:

/*
 * Posix_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
 * deactivate the pages and clear PG_Referenced.
 */
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
...
    case POSIX_FADV_DONTNEED:
        if (!bdi_write_congested(mapping->backing_dev_info))
            filemap_flush(mapping);
 
        /* First and last FULL page! */
        start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
        end_index = (endbyte >> PAGE_CACHE_SHIFT);
 
        if (end_index >= start_index)
            invalidate_mapping_pages(mapping, start_index,
                        end_index);
        break;
...
}

We can see that if the backing device is not congested, the kernel first calls filemap_flush(mapping) to flush the dirty pages, and then calls invalidate_mapping_pages to drop the pages. First, how the flushing works:
mm/filemap.c

/**                                                                                                                                                        
 * filemap_flush - mostly a non-blocking flush                                                                                                             
 * @mapping:    target address_space                                                                                                                       
 *                                                                                                                                                         
 * This is a mostly non-blocking flush.  Not suitable for data-integrity                                                                                   
 * purposes - I/O may not be started against all dirty pages.                                                                                              
 */
int filemap_flush(struct address_space *mapping)
{
        return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
}
/**                                                                                                                                                        
 * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range                                                                            
 * @mapping:    address space structure to write                                                                                                           
 * @start:      offset in bytes where the range starts                                                                                                     
 * @end:        offset in bytes where the range ends (inclusive)                                                                                           
 * @sync_mode:  enable synchronous operation                                                                                                               
 *                                                                                                                                                         
 * Start writeback against all of a mapping's dirty pages that lie                                                                                         
 * within the byte offsets <start, end> inclusive.                                                                                                         
 *                                                                                                                                                         
 * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as                                                                               
 * opposed to a regular memory cleansing writeback.  The difference between                                                                                
 * these two operations is that if a dirty page/buffer is encountered, it must                                                                             
 * be waited upon, and not just skipped over.                                                                                                              
 */
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                                loff_t end, int sync_mode)
{
        int ret;
        struct writeback_control wbc = {
                .sync_mode = sync_mode,
                .nr_to_write = LONG_MAX,
                .range_start = start,
                .range_end = end,
        };

        if (!mapping_cap_writeback_dirty(mapping))
                return 0;

        ret = do_writepages(mapping, &wbc);
        return ret;
}
 
int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                                loff_t end)
{
        return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}

Note that the flush is done with WB_SYNC_NONE, i.e. it does not wait synchronously for the pages to finish writing back.
By contrast, fsync and fdatasync eventually call filemap_fdatawrite_range with WB_SYNC_ALL, which does wait for completion before returning.
Let's confirm this in mm/page-writeback.c:

int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
...
        if (mapping->a_ops->writepages)
                ret = mapping->a_ops->writepages(mapping, wbc);
        else
                ret = generic_writepages(mapping, wbc);
        return ret;
}
 
int generic_writepages(struct address_space *mapping,
                       struct writeback_control *wbc)
{
        /* deal with chardevs and other special file */
        if (!mapping->a_ops->writepage)
                return 0;
 
        return write_cache_pages(mapping, wbc, __writepage, mapping);
}
 
int write_cache_pages(struct address_space *mapping,
                      struct writeback_control *wbc, writepage_t writepage,
                      void *data)
{
...
                        /*                                                                                                                                 
                         * We stop writing back only if we are not doing                                                                                   
                         * integrity sync. In case of integrity sync we have to                                                                            
                         * keep going until we have written all the pages                                                                                  
                         * we tagged for writeback prior to entering this loop.                                                                            
                         */
                        if (--wbc->nr_to_write <= 0 &&
                            wbc->sync_mode == WB_SYNC_NONE) {
                                done = 1;
                                break;
                        }
                }
                pagevec_release(&pvec);
                cond_resched();
 
...
}

The code and the comment confirm that in WB_SYNC_NONE mode the kernel merely submits the dirty pages for writeback and then returns; it really does not wait for the writeback to complete.
That settles how the dirty pages get flushed; now for the second step, dropping the pages from memory.
See the implementation in mm/truncate.c:

/**
 * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
 * @mapping: the address_space which holds the pages to invalidate
 * @start: the offset 'from' which to invalidate
 * @end: the offset 'to' which to invalidate (inclusive)
 *
 * This function only removes the unlocked pages, if you want to
 * remove all the pages of one inode, you must call truncate_inode_pages.
 *
 * invalidate_mapping_pages() will not block on IO activity. It will not
 * invalidate pages which are dirty, locked, under writeback or mapped into
 * pagetables.
 */
unsigned long invalidate_mapping_pages(struct address_space *mapping,
                       pgoff_t start, pgoff_t end)
{
    ...
    pagevec_init(&pvec, 0);
    while (next <= end &&
            pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
        mem_cgroup_uncharge_start();
        for (i = 0; i < pagevec_count(&pvec); i++) {
            struct page *page = pvec.pages[i];
            ...
            ret += invalidate_inode_page(page);
                       ...
        }
        pagevec_release(&pvec);
        mem_cgroup_uncharge_end();
        cond_resched();
    }
    return ret;
}
 
/*
 * Safely invalidate one page from its pagecache mapping.
 * It only drops clean, unused pages. The page must be locked.
 *
 * Returns 1 if the page is successfully invalidated, otherwise 0.
 */
int invalidate_inode_page(struct page *page)
{
    struct address_space *mapping = page_mapping(page);
    if (!mapping)
        return 0;
    if (PageDirty(page) || PageWriteback(page))
        return 0;
    if (page_mapped(page))
        return 0;
    return invalidate_complete_page(mapping, page);
}

From the code and comments above we can see that a page is only dropped if it meets two conditions: 1. it is neither dirty nor under writeback; 2. it is not in use, i.e. not mapped into any page tables.
When both conditions hold, it calls invalidate_complete_page to continue:

/*
 * This Is for invalidate_mapping_pages().  That function can be called at
 * any time, and is not supposed to throw away dirty pages.  But pages can
 * be marked dirty at any time too, so use remove_mapping which safely
 * discards clean, unused pages.
 *
 * Returns non-zero if the page was successfully invalidated.
 */
static int
invalidate_complete_page(struct address_space *mapping, struct page *page)
{
    int ret;
 
    if (page->mapping != mapping)
        return 0;
 
    if (page_has_private(page) && !try_to_release_page(page, 0))
        return 0;
 
    clear_page_mlock(page);
    ret = remove_mapping(mapping, page);
 
    return ret;
}

As we can see, when a few more conditions are satisfied, invalidate_complete_page goes on to call remove_mapping:

/*
 * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
 * someone else has a ref on the page, abort and return 0.  If it was
 * successfully detached, return 1.  Assumes the caller has a single ref on
 * this page.
 */
int remove_mapping(struct address_space *mapping, struct page *page)
{
    if (__remove_mapping(mapping, page)) {
        /*
         * Unfreezing the refcount with 1 rather than 2 effectively
         * drops the pagecache ref for us without requiring another
         * atomic operation.
         */
        page_unfreeze_refs(page, 1);
        return 1;
    }
    return 0;
}
/*
 * Same as remove_mapping, but if the page is removed from the mapping, it
 * gets returned with a refcount of 0.
 */
static int __remove_mapping(struct address_space *mapping, struct page *page)
{
    BUG_ON(!PageLocked(page));
    BUG_ON(mapping != page_mapping(page));
 
    spin_lock_irq(&mapping->tree_lock);
    /*
     * The non racy check for a busy page.
     *
     * Must be careful with the order of the tests. When someone has
     * a ref to the page, it may be possible that they dirty it then
     * drop the reference. So if PageDirty is tested before page_count
     * here, then the following race may occur:
     *
     * get_user_pages(&page);
     * [user mapping goes away]
     * write_to(page);
     *                !PageDirty(page)    [good]
     * SetPageDirty(page);
     * put_page(page);
     *                !page_count(page)   [good, discard it]
     *
     * [oops, our write_to data is lost]
     *
     * Reversing the order of the tests ensures such a situation cannot
     * escape unnoticed. The smp_rmb is needed to ensure the page->flags
     * load is not satisfied before that of page->_count.
     */
...
}