目录
1 系统内存layout
2 内存管理器
2.3 SLAB
2.3.1 概述
2.3.2 Slab Freelist
2.3.3 Slab Cache
2.3.4 Slab Shrink
2.4 PER-CPU
2.4.1 pcpu_chunk概述
2.4.2 pcpu_chunk的映射
2.4.3 pcpu_chunk的balance
2.4.4 percpu变量的访问
3 地址空间
4 内存回收
4.1 Watermark
4.1.1 如何计算watermark
4.1.2 watermark如何工作
4.1.3 watermark是绝对的吗?
4.1.4 kswapd启停的条件
4.2 node reclaim
4.3 GFP flags
4.4 active和inactive
4.4.1 page的lru状态
4.4.2 active list
4.4.3 inactive list
4.4.4 active与inactive list的大小
4.5 vmscan中的几个数量
4.5.1 回收多少?
4.5.2 扫描多少?
4.6 OOM
4.6.1 何时OOM
4.6.2 任务选择
4.7 Shrinker
4.7.1 工作方式
4.7.2 Per-Cgroup
5 Writeback
5.1 writeback Basis
5.1.1 Dirty
5.1.2 bdi_writeback
5.1.3 writeback inode
5.1.4 bdev inode
5.1.5 dirty time
5.2 发起Writeback
5.2.1 Background
5.2.2 Periodic
5.2.3 Sync
5.3 Dirty Throttle
5.3.1 dirty thresh
5.3.2 pos_ratio
5.3.3 writeback bw
5.3.4 dirty rate
5.3.5 dirty ratelimit
5.3.6 balance_dirty_pages
6 Mem Cgroup
6.1 Mem Page Counter
6.2 Page and Mem Cgroup
7 专题
7.1 内存和Numa
7.1.1 硬件相关
7.1.2 申请内存
7.1.3 memory policy
7.1.4 memory numa balance
7.1.5 sched numa balance
(待更新)
Slabs manage frequently required data structures to ensure that memory managed page-by-page by the buddy system is used more efficiently and that instances of the data types can be allocated quickly and easily as a result of caching.
关于slab分配器,有以下几个关键字:
对于目前的Linux Kernel中的slab分配器,可以参考链接What to choose between Slab and Slub Allocator in Linux Kernel?
Slub is the next-generation replacement memory allocator, which has been the default in the Linux kernel since 2.6.23. It continues to employ the basic "slab" model, but fixes several deficiencies in Slab's design, particularly around systems with large numbers of processors. Slub is simpler than Slab.
以下几个小节,我们将主要基于SLUB的实现。
每个slab内的空闲object会组成一个freelist,类似如下结构,
slab
obj
+-----------------------------------------------+
| | | | | | | | | | | | |
+-----------------------------------------------+
\__^\__^\__^\__^\__^\__^\__^\__^\__^\__^\__^
freelist的创建可以参考以下函数:
allocate_slab()
---
for_each_object_idx(p, idx, s, start, page->objects) {
setup_object(s, page, p);
if (likely(idx < page->objects))
set_freepointer(s, p, p + s->size);
else
set_freepointer(s, p, NULL);
}
---
注:random freelist是将object的顺序随机化,这并不是出于对性能的考虑,而是为了安全,参考如下链接:https://mxatone.medium.com/randomizing-the-linux-kernel-heap-freelists-b899bb99c767
SLUB使用cmpxchg操作freelist,实现object的申请和释放,操作大致如下:
get_object()
{
    while (1) {
        object = c->freelist;
        next = *object;        /* freelist的next指针保存在空闲object内部 */
        if (cmpxchg(&c->freelist, object, next) == object)
            break;
    }
    return object;
}

put_object(object)
{
    while (1) {
        prev = c->freelist;
        *object = prev;        /* 被释放的object重新指向当前的链表头 */
        if (cmpxchg(&c->freelist, prev, object) == prev)
            break;
    }
}
然而,以上代码并不够,如下:
Round 1 - CPU0: Slab [ obj0 -> obj1 -> obj2 -> obj3 ]
cmpxchg freelist (old = obj0, new = obj1) START
Round 2 - CPU1: Slab [ obj0 -> obj1 -> obj2 -> obj3 ]
Get obj0
Round 3 - CPU1: Slab [ obj1 -> obj2 -> obj3 ]
Get obj1
Round 4 - CPU1: Slab [ obj2 -> obj3 ]
Free obj0
Round 5 - CPU0: Slab [ obj0 -> obj2 -> obj3 ]
cmpxchg freelist (old = obj0, new = obj1) SUCCESS
Right Now: Slab [ obj1 -> obj2 -> obj3 ]
已经被申请走的obj1又被放回了freelist
所以,为了避免这个问题,SLUB又引入了一个transaction id,并使用cmpxchg_double来完成这个操作,参考代码:
slab_alloc_node()
---
void *next_object = get_freepointer_safe(s, object);
if (unlikely(!this_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
object, tid,
next_object, next_tid(tid)))) {
goto redo;
}
---
这个cmpxchg_double的操作就是SLUB的核心技术。
slab分配器的核心数据结构叫kmem_cache,这也说明cache这个功能在slab中占的比重很大;slab是一个object的集合,每个slab可以是若干个连续的page;为避免频繁从page分配器申请和释放的开销,这些slab在完全空闲之后,并不会立即释放回系统,而是暂存在kmem_cache中,即所谓的slab cache。
注:Buddy子系统在核数比较多的情况下,会在zone->lock这个自旋锁上产生很严重的竞争,使得系统出现高达40%+的sys cpu利用率;类似的问题可以通过修改/proc/sys/vm/percpu_pagelist_fraction来调整pcp list的high和batch缓解。
slab cache有以下几个部分:
那么,slab如何在这些位置之间转移呢?
shrink_slab()其实回收的并不是slab分配器自身缓存的空闲slab,在Wolfgang Mauerer的《Professional Linux® Kernel Architecture》中有下面的描述:
想要清除kmem_cache中的slab cache,有两种方式,
在实践中,曾出现过slab cache累积到50G+导致系统级OOM的例子。问题的原因是,系统会为每个memory cgroup创建kmem_cache的子kmem_cache,如果这些memory cgroup不删除,这些子kmem_cache连同其中的slab cache会一直占据系统的内存;由于这部分slab cache无法被回收,最终累积到了50G+。社区曾经尝试支持对slab cache的shrink,但是由于牵扯性能,目前仍没有进展,参考链接:https://www.spinics.net/lists/linux-mm/msg242443.html
对于per-cpu变量,我们可以将其简单的理解为一个数组,比如:
int example[NR_CPUS];
cpu0 -> example[0] cpu1 -> example[1] cpu2 -> example[2] cpu3 -> example[3]
由于每个cpu都有自己的存储空间,因此,在更新的时候,不需要考虑同步;
per-cpu使用的一个经典场景是统计数据;比如,统计系统中某个事件发生的次数;最终的结果,需要遍历per-cpu变量的每个cpu的值,做sum;但是,这个遍历的过程并不是原子的,所以,这种方式缺乏准确性,这也就导致,per-cpu变量没法做计数,尤其是,需要考虑计数为某个特定的值时做某个特定的事,比如类似struct kref的功能;内核中,有一个percpu-refcount的变量,它有per-cpu和atomic两种模式,正常工作是per-cpu模式,当需要做最终清零的时候,会转换到atomic模式。
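下面给出一个统计计数场景的最小sketch(假设处于内核模块/内核代码的上下文中,demo_*均为虚构的示例名,仅用于说明用法):
---
#include <linux/percpu.h>
#include <linux/cpumask.h>

/* 静态定义一个per-cpu计数器,每个cpu都有自己独立的副本 */
static DEFINE_PER_CPU(unsigned long, demo_event_count);

/* 热路径:只更新本cpu的副本,不需要锁,也不会产生cacheline bounce */
static inline void demo_account_event(void)
{
	this_cpu_inc(demo_event_count);
}

/* 读取路径:遍历所有cpu求和;遍历过程不是原子的,结果只是近似值 */
static unsigned long demo_event_sum(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(demo_event_count, cpu);
	return sum;
}
---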
内核的per-cpu变量,是定义了一种特殊的数据类型,并提供了一套访问API。它有两种模式,即静态定义(DEFINE_PER_CPU)和动态申请(alloc_percpu)。
pcpu_chunk是percpu变量的'slab';它的核心数据包括以下:
另外,有以下全局变量作为申请空间的基准:
这是从一个64核的intel机器上获取的,
pcpu_nr_units : 64 (64 cores)
pcpu_unit_pages : 64
pcpu_nr_groups : 2 (2 nodes)
pcpu_group_sizes: 8M/8M (16M, 64 * 64 * 4K)
pcpu_chunk按照剩余空间的多少被保存在pcpu-slot[]上,
pcpu_chunk_relocate()
-> pcpu_chunk_slot()
---
if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0)
return 0;
return pcpu_size_to_slot(chunk->free_bytes);
---
static int pcpu_size_to_slot(int size)
{
if (size == pcpu_unit_size)
return pcpu_nr_slots - 1;
return max(fls(size) - PCPU_SLOT_BASE_SHIFT + 2, 1);
}
在pcpu-slot[pcpu_nr_slots - 1]里面连接的pcpu_chunk都是全空闲的。
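按照上面的公式做一个示例计算(这里假设PCPU_SLOT_BASE_SHIFT为5,具体以所用内核版本中的定义为准):若某个chunk的free_bytes为12KB,则fls(12288) = 14,slot = max(14 - 5 + 2, 1) = 11;若free_bytes等于pcpu_unit_size(即完全空闲),则直接放入最后一个slot,也就是上面说的pcpu-slot[pcpu_nr_slots - 1]。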
pcpu_chunk在从pcpu_create_chunk()中创建出来时,只包含了pcpu_chunk结构本身和一段虚拟地址,并没有真实的内存对应,参考代码:
pcpu_create_chunk()
---
chunk = pcpu_alloc_chunk(gfp);
vms = pcpu_get_vm_areas(pcpu_group_offsets, pcpu_group_sizes,
pcpu_nr_groups, pcpu_atom_size);
chunk->data = vms;
chunk->base_addr = vms[0]->addr - pcpu_group_offsets[0];
---
这段虚拟地址来自vmalloc区,在pcpu_get_vm_areas()中构建vm_struct时,使用的caller直接赋值为pcpu_get_vm_areas,参考代码:
pcpu_get_vm_areas()
---
/* insert all vm's */
for (area = 0; area < nr_vms; area++)
setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
pcpu_get_vm_areas);
---
所以,我们会在/proc/vmallocinfo里发现如下信息:
真实的物理内存是在这段区域被pcpu_alloc()实际分配出去的时候才申请的,
pcpu_alloc()
---
page_start = PFN_DOWN(off);
page_end = PFN_UP(off + size);
pcpu_for_each_unpop_region(chunk->populated, rs, re,
page_start, page_end) {
ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
spin_lock_irqsave(&pcpu_lock, flags);
pcpu_chunk_populated(chunk, rs, re, true);
spin_unlock_irqrestore(&pcpu_lock, flags);
}
---
参考函数pcpu_balance_workfn(),balance主要做以下两件事:
在补充populate page时,如果没有存在unpopulated page的pcpu_chunk,还会申请新的pcpu_chunk。
注:这里的page是包括所有cpu的page
首先我们看pcpu_alloc()返回的地址是什么?
pcpu_alloc()
---
/* clear the areas and return address relative to base address */
for_each_possible_cpu(cpu)
memset((void *)pcpu_chunk_addr(chunk, cpu, 0) + off, 0, size);
ptr = __addr_to_pcpu_ptr(chunk->base_addr + off);
return ptr;
---
#define __addr_to_pcpu_ptr(addr) \
(void __percpu *)((unsigned long)(addr) - \
(unsigned long)pcpu_base_addr + \
(unsigned long)__per_cpu_start)
/* the address of the first chunk which starts with the kernel static area */
void *pcpu_base_addr __ro_after_init;
返回给用户的地址,并不是原有的基于pcpu_chunk.base_addr的地址,而是做了特殊的处理;
注:这应该是为了保证对percpu变量的访问只能使用percpu提供的API
对percpu变量的指针的访问,参考代码
#define this_cpu_ptr(ptr) \
({ \
__verify_pcpu_ptr(ptr); \
SHIFT_PERCPU_PTR(ptr, my_cpu_offset); \
})
#define my_cpu_offset per_cpu_offset(smp_processor_id())
#define per_cpu_offset(x) (__per_cpu_offset[x])
setup_per_cpu_areas()
---
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu)
__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
---
pcpu_unit_offsets[]对应的是各个cpu的unit在整个chunk中的偏移,比如在一个8核的cpu上,
0 1 2 3 4 5 6 7
|----+----+----+----+----+----+----+----|
\___________________ ___________________/
v
chunk (8 * unit)
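把上面几段代码串起来,可以得到percpu指针到真实地址的换算关系(以下推导基于上面引用的代码):
ptr                   = chunk->base_addr + off - pcpu_base_addr + __per_cpu_start
__per_cpu_offset[cpu] = pcpu_base_addr - __per_cpu_start + pcpu_unit_offsets[cpu]
per_cpu_ptr(ptr, cpu) = ptr + __per_cpu_offset[cpu]
                      = chunk->base_addr + pcpu_unit_offsets[cpu] + off
也就是说,最终访问到的是该chunk中对应cpu的unit内偏移为off的位置。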
(待更新)
high和low水线决定的是kswapd的工作,即free pages低于low时开始,达到high时停止;min决定的是direct reclaim,即普通的申请操作在free pages低于min时,必须进行direct reclaim,只有特别的操作,比如内存回收过程中的内存申请才可以使用min水线以下的内存。
注:在64bit系统上,不需要highmem,所以本小节不考虑highmem相关的代码
查看当前系统各个zone的水线可以通过/proc/zoneinfo;watermark的计算,参考如下函数
__setup_per_zone_wmarks()
---
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10)
...
for_each_zone(zone) {
u64 tmp;
tmp = (u64)pages_min * zone->managed_pages;
do_div(tmp, lowmem_pages);
zone->watermark[WMARK_MIN] = tmp;
tmp = max_t(u64, tmp >> 2,
mult_frac(zone->managed_pages,
watermark_scale_factor, 10000));
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
}
---
其中有两个关键的因子:min_free_kbytes(/proc/sys/vm/min_free_kbytes)和watermark_scale_factor(/proc/sys/vm/watermark_scale_factor)。
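按照上面的代码做一个简化计算(数值均为假设,且假设系统只有一个zone):min_free_kbytes = 65536(64MB),则 pages_min = 65536 >> 2 = 16384 pages,即 WMARK_MIN = 16384;若 zone->managed_pages = 4,194,304(16GB)、watermark_scale_factor = 10(默认值),则 tmp = max(16384 >> 2, 4194304 * 10 / 10000) = max(4096, 4194) = 4194,于是 WMARK_LOW = 16384 + 4194 = 20578,WMARK_HIGH = 16384 + 2 * 4194 = 24772。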
参考函数__alloc_pages_nodemask(),进入时,alloc_flags被赋值为ALLOC_WMARK_LOW,然后进入get_page_from_freelist(),
get_page_from_freelist()
---
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_fast(zone, order, mark,
ac_classzone_idx(ac), alloc_flags)) {
...
}
---
此时,使用的watermark为low;如果申请内存失败的话,会进入__alloc_pages_slowpath(),alloc_flags会被gfp_to_alloc_flags()重新赋值,其中包括ALLOC_WMARK_MIN,同时,唤醒kswapd,
__alloc_pages_slowpath()
---
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, gfp_mask, ac);
---
__GFP_KSWAPD_RECLAIM是比较普遍的flag,即使比较严格的GFP_NOWAIT、GFP_ATOMIC也都带有这个flag。
在min水线依然申请不到内存时,有些申请操作可以击穿min水线,参考函数
static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
return 0;
if (gfp_mask & __GFP_MEMALLOC)
return ALLOC_NO_WATERMARKS;
if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
return ALLOC_NO_WATERMARKS;
if (!in_interrupt()) {
if (current->flags & PF_MEMALLOC)
return ALLOC_NO_WATERMARKS;
else if (oom_reserves_allowed(current))
return ALLOC_OOM;
}
return 0;
}
其中有两个关键的flag,即__GFP_MEMALLOC和进程的PF_MEMALLOC,
%__GFP_MEMALLOC allows access to all memory. This should only be used when
the caller guarantees the allocation will allow more memory to be freed
very shortly e.g. process exiting or swapping. Users either should
be the MM or co-ordinating closely with the VM (e.g. swap over NFS).
注:内存回收有关的路径都需要PF_MEMALLOC;比较特殊的场景是ceph-nbd,其在client端包括一个内核态的nbd设备驱动还有一个用户态的处理程序,用来和ceph后端通信;在系统需要回收脏页的时候,数据需要通过nbd用户态程序发给后端,但是此时nbd程序因为无法申请到内存而进入回收,于是造成了死锁,一种比较hack的解决的方式即是给这些nbd程序赋予PF_MEMALLOC
申请内存时会绝对按照watermark规定的数量操作吗?答案是否定的,有以下几个场景,
__zone_watermark_ok()
---
const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
if (likely(!alloc_harder)) {
...
} else {
if (alloc_flags & ALLOC_OOM)
min -= min / 2;
else
min -= min / 4;
}
---
ALLOC_HIGH来自__GFP_HIGH,ALLOC_HARDER则来自__GFP_ATOMIC,参考函数
gfp_to_alloc_flags()
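按照__zone_watermark_ok()中的逻辑做一个简化计算(数值为假设):若min水线为1000个page,带__GFP_HIGH(ALLOC_HIGH)的申请实际按 1000 - 500 = 500 检查;若同时带__GFP_ATOMIC(ALLOC_HARDER,且非ALLOC_OOM),则再减1/4,即 500 - 125 = 375;若是ALLOC_OOM,则再减1/2,即 500 - 250 = 250。也就是说,这类申请可以使用min水线以下的一部分预留内存。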
我们已经知道了,watermark low和high决定了kswapd的启停,但事实上,决定kswapd启停的不仅仅是watermark,还有其他条件;参考函数pgdat_balanced(),
pgdat_balanced()
---
for (i = 0; i <= classzone_idx; i++) {
zone = pgdat->node_zones + i;
mark = high_wmark_pages(zone);
if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
return true;
}
---
我们看到,除了high watermark,还有一个order参数,而__zone_watermark_ok()除了检查watermark,还要检查是不是有对应order的page。order来自pgdat->kswapd_order,wakeup_kswapd()会设置这个值,不过kswapd每次执行完回收之后,会把kswapd_order清空,所以,也不至于在碎片化的系统上一直执行kswapd回收。
另外,kswapd在high watermark满足之后, 会进入“浅睡眠”模式,参考当初提交patch的comment,
After kswapd balances all zones in a pgdat, it goes to sleep. In the
event of no IO congestion, kswapd can go to sleep very shortly after the
high watermark was reached. If there are a constant stream of allocations
from parallel processes, it can mean that kswapd went to sleep too quickly
and the high watermark is not being maintained for sufficient length time.
This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work. Otherwise it goes fully to sleep.
这种睡眠机制的目的是尽量将内存水线维持在high之上,进而可以避免direct reclaim(直接回收)的发生。
参考链接
NUMA (Non-Uniform Memory Access): An Overview - ACM Queue
The impact of reclaim on the system can therefore vary. In a NUMA system multiple types of memory will be allocated on each node. The amount of free space on each node will vary. So if there is a request for memory and using memory on the local node would require reclaim but another node has enough memory to satisfy the request without reclaim, the kernel has two choices:
• Run a reclaim pass on the local node (causing kernel processing overhead) and then allocate node- local memory to the process.
• Just allocate from another node that does not need a reclaim pass. Memory will not be node local, but we avoid frequent reclaim passes. Reclaim will be performed when all zones are low on free memory. This approach reduces the frequency of reclaim and allows more of the reclaim work to be done in a single pass.
For small NUMA systems (such as the typical two-node servers) the kernel defaults to the second approach. For larger NUMA systems (four or more nodes) the kernel will perform a reclaim in order to get node-local memory whenever possible because the latencies have higher impacts on process performance.
There is a knob in the kernel that determines how the situation is to be treated in /proc/sys/vm/zone_reclaim. A value of 0 means that no local reclaim should take place. A value of 1 tells the kernel that a reclaim pass should be run in order to avoid allocations from the other node. On boot- up a mode is chosen based on the largest NUMA distance in the system.
本小节主要看下几个常见的gfp flags的作用及其如何在代码中发挥作用;
申请内存时基本都是以上flag的组合,需要特别说明的是:
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
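下面是一个选择gfp flags的最小sketch(demo_alloc为虚构的示例函数,仅用于说明不同上下文下flag的选择):
---
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static int demo_alloc(void)
{
	/* 进程上下文:可以睡眠,允许direct reclaim,并会唤醒kswapd */
	void *buf = kmalloc(4096, GFP_KERNEL);

	/* 中断或持有自旋锁的上下文:不能睡眠;GFP_ATOMIC带有__GFP_HIGH,
	 * 可以动用min水线以下的部分预留内存 */
	void *urgent = kmalloc(256, GFP_ATOMIC);

	if (!buf || !urgent) {
		kfree(buf);	/* kfree(NULL)是安全的 */
		kfree(urgent);
		return -ENOMEM;
	}

	kfree(urgent);
	kfree(buf);
	return 0;
}
---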
        (2)                  (1)
         |                    |
         v        (3)         v         (4)
   |-----------|  --->  |-----------|  --->  evict
      active               inactive
         ^                    |
         |________(2)_________|
(1) 进入inactive list
(2) page 升级进入 active list
(3) page 降级进入inactive list
(4) page 继续降级被回收
本小节主要针对file page cache,anon page和swap将单辟一节讲解。
lru中的page有三种状态:无标记、Referenced和Active,每次访问都会推进page的状态,到Active时,则意味着可以进入active list了;换句话说,需要两次访问才能让一个page进入active list。
page cache主要有两种访问方式:一种是通过read/write等系统调用访问,由mark_page_accessed()推进page的状态;另一种是mmap之后直接访问内存,这种访问需要通过rmap扫描页表项的Accessed位来感知,参考page_referenced()。
page_check_references()
---
referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
&vm_flags);
referenced_page = TestClearPageReferenced(page);
if (referenced_ptes) {
SetPageReferenced(page);
if (referenced_page || referenced_ptes > 1)
return PAGEREF_ACTIVATE;
/*
* Activate file-backed executable pages after first usage.
*/
if (vm_flags & VM_EXEC)
return PAGEREF_ACTIVATE;
}
return PAGEREF_KEEP;
---
升入active list的条件是苛刻的,主要有以下三种情况:(1) mapped page被多个pte引用,或者在已有Referenced标记的情况下再次被引用;(2) 可执行文件(VM_EXEC)的file page在第一次被引用后即activate;(3) 被evict之后又refault回来、且通过workingset机制判定为热的page。
workingset中的年龄,即lruvec->inactive_age,其计数了inactive list中activate和evict两个事件,以此作为"时间"基准;同时,将相关信息保存进xarray中原page所在的位置。参考mm/workingset.c中的comment:
1. The sum of evictions and activations between any two points in
time indicate the minimum number of inactive pages accessed in
between.
2. Moving one inactive page N page slots towards the tail of the
list requires at least N inactive page accesses.
判断是否热的基准,参考函数
workingset_refault()
---
refault = atomic_long_read(&lruvec->inactive_age);
active_file = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);
refault_distance = (refault - eviction) & EVICTION_MASK;
if (refault_distance <= active_file) {
return true;
}
return false;
---
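举一个简化的例子(数值为假设):某file page被evict时记录的eviction = 1000(即当时的inactive_age);之后该lruvec又发生了300次activation/eviction,refault发生时 inactive_age = 1300,于是 refault_distance = 300;若此时 active_file = 500 pages,因为 300 <= 500,该page被判定为热,refault后直接进入active list。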
从active list降级却相对简单,
shrink_active_list()
---
if (page_referenced(page, 0, sc->target_mem_cgroup,
&vm_flags)) {
nr_rotated += hpage_nr_pages(page);
if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
list_add(&page->lru, &l_active);
continue;
}
}
ClearPageActive(page); /* we are de-activating */
list_add(&page->lru, &l_inactive);
---
除了依旧被优待的可执行文件的page cache之外,其他一律无条件降级;之后,在inactive list里或升级或者Evict,则需要各自的"奋斗"了。
在inactive list中page的去留,参考函数page_check_references(),总结起来就是:
对于可以被回收的page,主要分成以下情况:
Dirty,针对dirty page的回收策略,除了没有Referenced标记之外,还有以下策略,参考代码comment,
来自shrink_page_list()注释:
Only kswapd can writeback filesystem pages to avoid risk of stack overflow.
But avoid injecting inefficient single-page IO into flusher writeback as
much as possible: only write pages when we've encountered many dirty pages,
and when we've already scanned the rest of the LRU for clean pages and see
the same dirty pages again (PageReclaim).
总结起来就是:
if (page_is_file_cache(page) &&
(!current_is_kswapd() || !PageReclaim(page) ||
!test_bit(PGDAT_DIRTY, &pgdat->flags))) {
inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
SetPageReclaim(page);
goto activate_locked;
}
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
goto keep_locked;
if (!sc->may_writepage)
goto keep_locked;
kswapd执行pageout()的条件,总结起来就是:对于file page,只有kswapd、且该page已带有Reclaim标记(即已被扫描过一轮)、且node已被标记PGDAT_DIRTY时才会执行pageout();此外还要求该page不是PAGEREF_RECLAIM_CLEAN,且may_enter_fs和sc->may_writepage均为真。
Writeback,Writeback标记,代表的是这个page已经被回写,但是还没有完成;shrink_page_list()对writeback page的处理有一段长长的注释,但归根结底,shrink_page_list()处理page的首要目的还是回收,既然是被shrink_page_list()scan到,说明这个page已经是冷的;不论是kswapd还是直接回收,都会给page设置Reclaim标记,PageReclaim()标记在end_page_writeback()中被处理,如下
end_page_writeback()
---
if (PageReclaim(page)) {
ClearPageReclaim(page);
rotate_reclaimable_page(page);
}
if (!test_clear_page_writeback(page))
BUG();
---
被标记Reclaim的page会被移动到inactive_list的队尾,在下一次scan时会被回收掉;但是,这种方法是异步的并不能满足即时需求,因为page的writeback还取决于存储设备的处理速度;一旦跟不上前端dirty page的速度,有可能造成dirty page堆积最终OOM;shrink_page_list()的处理方法是:
Mapped,对于这种类型的page需要借助rmap解除page的mapping,这是个开销相对较大的操作,其通过scan_control.may_unmap控制,不过,除了node_reclaim()中做了开关处理,其他回收操作都没有特殊处理。
注:是否有必要像may_writepage那样用scan_control.priority控制may_unmap,尤其是对于可执行文件或者共享库的page cache的回收,进一步收紧;这有利于降低延时敏感型业务的长尾延迟。
我们知道active list中的page会降级到inactive list中,但是,这个事情发生在什么时候?参考函数shrink_list(),
shrink_list()
---
if (is_active_lru(lru)) {
if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
return 0;
}
return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
---
inactive_list_is_low()负责控制什么时候从active_list向inactive_list输送page;方法是控制inactive list和active list的大小比例,其算法为:
inactive_list_is_low()
---
/*
* When refaults are being observed, it means a new workingset
* is being established. Disable active list protection to get
* rid of the stale workingset quickly.
*/
refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
if (file && lruvec->refaults != refaults) {
inactive_ratio = 0;
} else {
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
inactive_ratio = int_sqrt(10 * gb);
else
inactive_ratio = 1;
}
return inactive * inactive_ratio < active;
---
workingset,直译为工作集;LRU的基本设定是cache的时间局部性,我们可以把workingset理解为,为完成一个任务而具有时间局部性一组page cache,它们在LRU list中很可能是连续的;所以,我们可以把active和inactive list看做是一个个workingset,另外,refaults的部分虽然已经不在LRU list中,但是也应该被考虑在内。
active inactive refaults
| [wset 0] - [wset 1] | -> | [wset 2] [wset 3] | -> | [wset 4] [wset 5] |
\________________________ _______________________/ \_________ _________/
v v
In memory Out of memory
如上,wset 0 ~ 5会依次被访问,当wset5被重新访问的时候, inactive_list_is_low()会清空active list以重新装填,此时inactive_ratio为0;
在其他情况下,如果cache size小于1G,则要求inactive至少和active一样大;cache越大,允许的active:inactive比例也越大,比如1G左右时active最多可以是inactive的3倍(即inactive至少占约1/4)。不过,这种检查只在kswapd或者direct reclaim发生的时候进行,如果没有内存压力,并不会主动维持这种比例。
scan_control.nr_to_reclaim设定的是每次执行回收的目标量,参考几个常见的回收函数:
kswapd_shrink_node()
---
sc->nr_to_reclaim = 0;
for (z = 0; z <= sc->reclaim_idx; z++) {
zone = pgdat->node_zones + z;
if (!managed_zone(zone))
continue;
sc->nr_to_reclaim += max(high_wmark_pages(zone), SWAP_CLUSTER_MAX);
}
---
有两个问题需要解决:一是回收的范围,即需要遍历哪些memcg的LRU;二是扫描的数量,即每个LRU list要扫描多少page。
对于第一个问题,
我们可以参考函数shrink_node(),其中包含一个使用mem_cgroup_iter()的循环,即
shrink_node()
---
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
node_lru_pages += lru_pages;
shrink_slab(sc->gfp_mask, pgdat->node_id,
memcg, sc->priority);
if (!global_reclaim(sc) &&
sc->nr_reclaimed >= sc->nr_to_reclaim) {
mem_cgroup_iter_break(root, memcg);
break;
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
---
其中mem_cgroup_iter(),会遍历memcg的层级结构,其使用了css_next_descendant_pre(),遍历路径如下:
css_next_descendant_pre()的基本原则是,先沿着树枝的左边,达到叶子节点,然后,遍历同级。
不过,在遍历memcg的时候,因为可能同时存在多个reclaimer,为了保证各个memcg的公平性,避免一个memcg被重复回收,mem_cgroup_iter()维护了per-root & per-priority的iterator,然后通过cmpxchg()使多个reclaimer可以共享一个iterator。
综上,在全局的范围内,多个reclaimer会遍历所有的memcg一遍。
对于第二个问题,
执行回收时,我们默认从inactive list回收一些page;当inactive list的数量不足时,也需要从active list降级补充。每个LRU list扫描多少,由scan_control.nr_to_scan决定,其值来自get_scan_count(),在不考虑swap的情况下,
get_scan_count()
---
for_each_evictable_lru(lru) {
int file = is_file_lru(lru);
size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
scan = size >> sc->priority;
...
}
---
scan_control.priority的处理比较简单,参考代码:
do_try_to_free_pages()
---
do {
sc->nr_scanned = 0;
shrink_zones(zonelist, sc);
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
break;
if (sc->priority < DEF_PRIORITY - 2)
sc->may_writepage = 1;
} while (--sc->priority >= 0);
---
当scan_control.priority的值为0时,vmscan会遍历Memcg LRU上所有的page。
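举一个简化的例子(数值为假设):某个LRU list上有1,048,576个page(4GB),priority = DEF_PRIORITY = 12 时,单轮扫描 1048576 >> 12 = 256 个page;若回收量仍未达标,priority降为11,则扫描512个,依此类推;priority降到0时,则扫描整个list。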
OOM用来回收用户态进程的堆栈内存;OOM分为全局OOM和Memory cgroup OOM,本小节,我们主要关注全局OOM。
关于OOM的条件,我们可以关注should_reclaim_retry(),该函数主要用于决定,是否放弃回收,即:
那么,大块内存的申请会导致OOM吗?参考__alloc_pages_may_oom()
__alloc_pages_may_oom()
---
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER) // 3
goto out;
---
也就是说,大于32K(即order > PAGE_ALLOC_COSTLY_ORDER = 3)的内存申请不会触发OOM。
OOM score的计算在函数oom_badness(),
oom_badness()
---
adj = (long)p->signal->oom_score_adj;
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
mm_pgtables_bytes(p->mm) / PAGE_SIZE;
/* Normalize to oom_score_adj units */
adj *= totalpages / 1000;
points += adj;
return points > 0 ? points : 1;
---
其中oom_score_adj的范围是[-1000, 1000];可以通过/proc/<pid>/oom_score_adj调整。
例如,上图中的stress占内存非常高,但是最终OOM选择了dfget,因为它的oom_score_adj是999。
注:oom dump打印的是oom_score_adj而不是计算出的结果。参考函数dump_tasks()
另外,从oom_badness()我们也可以看到,其统计任务所占内存的时候,主要考虑了rss、swap和pagetable,这些也是杀掉任务所能获得的内存数量。
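做一个简化的计算(数值均为假设):任务A的rss为1,048,576 pages(4GB),swap为0,页表占2,048 pages,oom_score_adj = 300,系统totalpages = 4,194,304(16GB),则 adj = 300 * 4194304 / 1000 ≈ 1,258,291,points ≈ 1,050,624 + 1,258,291 ≈ 2,308,915;另一个任务B的rss是A的两倍(2,097,152 pages)但oom_score_adj = 0,points ≈ 2,099,200,反而更小,所以OOM会优先选择A,这也是上文中dfget被选中的原因。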
注:任务的代码段等都属于page cache
系统中,除了page cache和用户态的堆栈,还有一些内核态的可回收的内存对象,比如inode cache和dentry cache;模块可以通过注册shrinker的方式,将自己占用的可回收的内存,在系统需要的时候,归还给系统。
shrinker在注册时,需要提供两个回调函数,即count_objects和scan_objects:前者返回当前可回收对象的数量(freeable),后者执行实际的扫描和回收。
count_objects返回freeable之后,do_shrink_slab()会根据当前的内存紧张程度(即priority)决定扫描回收多少:
do_shrink_slab()
---
if (shrinker->seeks) {
delta = freeable >> priority;
delta *= 4;
do_div(delta, shrinker->seeks);
} else {
/*
* These objects don't require any IO to create. Trim
* them aggressively under memory pressure to keep
* them from causing refetches in the IO caches.
*/
delta = freeable / 2;
}
---
其中,shrinker->seeks是shrinker的注册方提供的,表示的是,其中被回收对象对IO的依赖程度;值越大,标识对象被回收之后需要的IO越多;比如:raid5 stripe cache的seeks为DEFAULT_SEEKS * conf->raid_disks * 4。
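按照上面的公式做一个简化计算(数值为假设):freeable = 100,000,shrinker->seeks = DEFAULT_SEEKS = 2;priority = 12(即DEF_PRIORITY,压力较小)时,delta = (100000 >> 12) * 4 / 2 = 48 个对象;priority = 2(压力很大)时,delta = (100000 >> 2) * 4 / 2 = 50,000 个对象;seeks越大,同样条件下delta越小,对象被保留得越多。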
shrinker中最著名的是文件系统的inode cache和dentry cache;具体参考代码alloc_super()
s->s_shrink.seeks = DEFAULT_SEEKS;
s->s_shrink.scan_objects = super_cache_scan;
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
super_cache_count()
---
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb, sc);
total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
---
super_cache_scan()
---
if (sb->s_op->nr_cached_objects)
fs_objects = sb->s_op->nr_cached_objects(sb, sc);
inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects = dentries + inodes + fs_objects + 1;
if (!total_objects)
total_objects = 1;
/* proportion the scan between the caches */
dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
/*
* prune the dcache first as the icache is pinned by it, then
* prune the icache, followed by the filesystem specific caches
*
* Ensure that we always scan at least one object - memcg kmem
* accounting uses this to fully empty the caches.
*/
sc->nr_to_scan = dentries + 1;
freed = prune_dcache_sb(sb, sc);
sc->nr_to_scan = inodes + 1;
freed += prune_icache_sb(sb, sc);
if (fs_objects) {
sc->nr_to_scan = fs_objects + 1;
freed += sb->s_op->free_cached_objects(sb, sc);
}
---
shrinker如何实现per-cgroup呢?有两个基础组件:一是per-memcg的shrinker bitmap(配合shrinker_idr,记录哪些shrinker在该memcg中有可回收对象),二是memcg aware的list_lru,参考代码:
shrink_slab_memcg()
---
for_each_set_bit(i, info->map, shrinker_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
.memcg = memcg,
};
shrinker = idr_find(&shrinker_idr, i);
...
ret = do_shrink_slab(&sc, shrinker, priority);
---
static inline struct list_lru_one *
list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
{
if (list_lru_memcg_aware(lru) && idx >= 0) {
struct list_lru_memcg *mlru = xa_load(&lru->xa, idx);
return mlru ? &mlru->node[nid] : NULL;
}
return &lru->node[nid].lru;
}
如果APP向一个page中写入数据,这个page之后会发生什么?我们可以参考代码
iomap_set_page_dirty()
---
lock_page_memcg(page);
newly_dirty = !TestSetPageDirty(page);
if (newly_dirty)
__set_page_dirty(page, mapping, 0);
unlock_page_memcg(page);
if (newly_dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
---
其中有两个核心函数:__set_page_dirty()负责在address_space中给page打上PAGECACHE_TAG_DIRTY并更新相关统计(account_page_dirtied()),__mark_inode_dirty()则负责把inode挂到bdi_writeback的dirty list上,等待后续writeback。
注:
- __mark_inode_dirty()还会调用文件系统dirty_inode方法,ext4的dirty_inode会给inode的修改记日志;
- 对于mmap的page的写,需要触发page fault来感知其dirty状态,具体的操作,可以参考clear_page_dirty_for_io()中的page_mkclean()
inode的writeback由__writeback_single_inode()完成,其依次做三件事:调用do_writepages()写出dirty page、处理inode自身的dirty状态(见下面的代码)、必要时调用write_inode()写出inode元数据。
__writeback_single_inode()
---
/*
* Some filesystems may redirty the inode during the writeback
* due to delalloc, clear dirty metadata flags right before
* write_inode()
*/
spin_lock(&inode->i_lock);
dirty = inode->i_state & I_DIRTY; //I_DIRTY_INODE | I_DIRTY_PAGES
inode->i_state &= ~dirty;
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state |= I_DIRTY_PAGES;
spin_unlock(&inode->i_lock);
---
在__writeback_single_inode()执行完之后,inode依然有可能处于dirty状态,参考requeue_inode(),除了有新的操作导致inode处于dirty状态之外,还有一种可能是
Filesystems can dirty the inode during writeback operations, such as delayed allocation during submission or metadata updates after data IO completion.
wbc->nr_to_write耗尽;nr_to_write来自writeback_chunk_size(),
writeback_chunk_size()
---
if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
pages = LONG_MAX;
else {
pages = min(wb->avg_write_bandwidth / 2,
global_wb_domain.dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
MIN_WRITEBACK_PAGES);
}
---
bdi_writeback是writeback的总控结构;
抛开cgroup writeback,bdi_writeback是per request queue的,而非per filesystem;这是因为通过分区,同一个块设备上可以同时有多个文件系统;一个bdi_writeback可以避免多个文件系统刷脏页时的HDD seek。
bdi_writeback中用来保存inode的list,有b_dirty、b_io、b_more_io和b_dirty_time四个,为什么需要这么多?
b_dirty、b_io和b_more_io中b_more_io的起源有明确的commit,参考
commit 0e0f4fc22ece8e593167eccbb1a4154565c11faa
Author: Ken Chen
Date: Tue Oct 16 23:30:38 2007 -0700
writeback: fix periodic superblock dirty inode flushing
Current -mm tree has bucketful of bug fixes in periodic writeback path.
However, we still hit a glitch where dirty pages on a given inode aren't
completely flushed to the disk, and system will accumulate large amount of
dirty pages beyond what dirty_expire_interval is designed for.
The problem is __sync_single_inode() will move an inode to sb->s_dirty list
even when there are more pending dirty pages on that inode. If there is
another inode with a small number of dirty pages, we hit a case where the loop
iteration in wb_kupdate() terminates prematurely because wbc.nr_to_write > 0.
Thus leaving the inode that has large amount of dirty pages behind and it has
to wait for another dirty_writeback_interval before we flush it again. We
effectively only write out MAX_WRITEBACK_PAGES every dirty_writeback_interval.
If the rate of dirtying is sufficiently high, the system will start
accumulate a large number of dirty pages.
So fix it by having another sb->s_more_io list on which to park the inode
while we iterate through sb->s_io and to allow each dirty inode which resides
on that sb to have an equal chance of flushing some amount of dirty pages.
在2007年,b_dirty和b_io还在superblock里,名为s_dirty和s_io;
Before
newly dirtied s_dirty s_io
=============> gf edcBA
After
newly dirtied s_dirty s_io
=============> BAg fedc
|
+--> dequeue for IO
引入s_more_io之前,BA会被redirty,因为时间戳被更新,BA可能需要等到下一个kupdate周期;引入s_more_io的目的是为了避免BA从头排队;结合当前的代码(LTS v4.19.154),三个list的关系如下
Before
newly dirtied b_dirty b_io b_more_io
=============> gf edc BA
After
newly dirtied b_dirty b_io b_more_io
=============> g fBAedc
|
+--> dequeue for IO
将inode加入b_more_io的一个主要条件是wbc->nr_to_write <= 0,即分配的slice(参考writeback_chunk_size(),slice ~= avg_write_bandwidth/2)耗尽;slice机制是出于公平性的考虑,让所有的dirty inode都有机会写出一定的脏数据。但是,从上面的结果看,其事实上主要保证了处于b_io上的inode的公平性。
那么为什么不直接将slice用完的inode放到b_io的队尾?还是出于公平性的考虑;
A b_io refill will setup a _fixed_ work set with all currently eligible
inodes and start a new round of walk through b_io. The "fixed" work set
means no new inodes will be added to the work set during the walk.
Only when a complete walk over b_io is done, new inodes that are
eligible at the time will be enqueued and the walk be started over.
This procedure provides fairness among the inodes because it guarantees
each inode to be synced once and only once at each round. So all inodes
will be free from starvations.
from commit 424b351fe1901fc909fd0ca4f21dab58f24c1aac
当b_io中的inode被sync一遍之后,就会退出,然后查看是否有已经符合标准的新的inode,即新的一轮的b_io是fBAedc,这里引入了f,而不是来回处理edcBA;b_io这个独立list的意义也应该就在于此。
此处的inode是指文件系统的inode元数据本身;在__mark_inode_dirty()中我们可以看到调用dirty_inode,在__writeback_single_inode()中有write_inode,他们分别是做什么的呢?
只有__mark_inode_dirty()带有I_DIRTY_INODE flag时,才会调用dirty_inode方法;这种调用场景很少,典型的例如generic_file_direct_write()和__generic_write_end(),当文件size变化后,会调用mark_inode_dirty(I_DIRTY);
我们分别取ext2、ext4和xfs举例分析;
注:WB_SYNC_ALL的注释为Wait on every mapping,在代码中也主要用于标识需要等待write IO完成
ext2与ext4&xfs的关键差别是,ext4和xfs是日志文件系统,它们有元数据更新事务性的保证;每一笔元数据更新,都及时送入了日志系统,日志commit之后,自动更新到相关on-disk inode;
注:我们之所以说"送入",是因为无论是jbd2还是xfslog,都是delayed log,日志会在系统中短暂停留,然后延时落盘;当然,元数据更新的事务性在这个过程中是得到保证的。之后,日志系统在log commit之后,会将元数据的更新落盘。
所以,ext2_write_inode()需要自己调用mark_buffer_dirty()将元数据交给writeback子系统,而ext4_write_inode()中,只需要等待日志commit即可。
ext4的dirty_inode中,只将inode本身更新送入jbd2,主要目的是,方便其他位置,如sync、fsync、iput_final->write_inode_now()->write_inode等位置通过jbd2,确认inode更新落盘;xfs的dirty_inode方法作用也类似,主要是将inode放入日志系统
但是xfs为什么没有write_inode方法呢?
参考ext4_write_inode,其只有在WB_SYNC_ALL & !for_sync的时候才会发挥作用,主要调用方是iput_final()->write_inode_now();但是xfs并不会在iput_final()中调用write_inode_now(),因为xfs自己维护了inode cache,参考xfs_iget(),generic_drop_inode()中的inode_unhashed()返回true,
综上,在writeback过程中,确实会写inode元数据,即
ext2和ext4的元数据的脏页,最终都又还给了writeback子系统,不过对应的inode变成了block device的。
上一小节提到,对于ext2和ext4,元数据的writeback最终落到了block device的inode身上,那么block device的inode的writeback是怎么进行的呢?
首先,我们需要关注一个函数,
inode_to_bdi()
---
sb = inode->i_sb;
#ifdef CONFIG_BLOCK
if (sb_is_blkdev_sb(sb))
return I_BDEV(inode)->bd_bdi;
#endif
return sb->s_bdi;
---
block_device的bd_bdi来自
__blkdev_get()
---
if (!bdev->bd_openers) {
...
if (bdev->bd_bdi == &noop_backing_dev_info)
bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
...
}
---
也就是说,block device inode也使用其request_queue的bdi,这是合理的,一个块设备上的所有的脏数据都应该由一个wb writeback。
在writeback中,数据、inode本身和time做了单独处理,其中time的特殊处理源自文件系统的lazytime选项,其目的是减少在访问文件过程中造成的IO开销,参考commit comment:
commit 0ae45f63d4ef8d8eeec49c7d8b44a1775fff13e8
Author: Theodore Ts'o
Date: Mon Feb 2 00:37:00 2015 -0500
vfs: add support for a lazytime mount option
Add a new mount option which enables a new "lazytime" mode. This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode. The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.
This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.
For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table. The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives. Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact long tail latencies --- which
is a very big deal for web serving tiers (for example).
Google-Bug-Id: 18297052
Signed-off-by: Theodore Ts'o
Signed-off-by: Al Viro
generic_update_time()中,如果文件系统没有设置SB_LAZYTIME,就会给__mark_inode_dirty()传入I_DIRTY_SYNC | I_DIRTY_TIME,如果设置了,则只传入I_DIRTY_TIME;之后,__mark_inode_dirty()会调用dirty_inode,xfs_fs_dirty_inode()和ext4_dirty_inode()中会检查,如果只设置了I_DIRTY_TIME,则直接退出,不做任何处理。
在in-core inode中被修改的time,通过以下函数调用mark_inode_dirty_sync(),update到on-disk inode中,
mark_inode_dirty_sync()在调用dirty_inode时,带着I_DIRTY_SYNC。
注:lazytime只是延迟更新,且atime、mtime和ctime都涉及到;noatime是直接放弃更新atime,
mount option: noatime -> MNT_NOATIME
touch_atime()
---
if (!atime_needs_update(path, inode))
return;
...
update_time(inode, &now, S_ATIME);
...
---
当前内核中(基于v4.19.154),主要有以下几种writeback,参考enum wb_reason;
本小节我们主要看它们分别何时发起,何时结束。
wb_check_background_flush()是wb_do_writeback()的常备项,只要writeback kworker运行就会检查;发生background writeback的必要条件是
wb_over_bg_thresh()
---
if (gdtc->dirty > gdtc->bg_thresh)
return true;
if (wb_stat(wb, WB_RECLAIMABLE) >
wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
return true;
---
除了全局的dirty > bg_thresh条件外,还有wb级别thresh的判定;其中WB_RECLAIMABLE统计数据的来源为:
account_page_dirtied()
---
inc_wb_stat(wb, WB_RECLAIMABLE);
---
clear_page_dirty_for_io()
---
if (TestClearPageDirty(page)) {
dec_lruvec_page_state(page, NR_FILE_DIRTY);
dec_wb_stat(wb, WB_RECLAIMABLE);
ret = 1;
}
---
__wb_calc_thresh()会依据该wb完成writeback的数量占全局的比例,按比例划分出其应得的全局background thresh份额。
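举一个简化的例子(数值为假设):全局bg_thresh为100,000 pages,某个wb最近完成的writeback占全局的30%,则__wb_calc_thresh()划分给它的份额约为30,000 pages;即使全局dirty尚未超过bg_thresh,只要该wb上的WB_RECLAIMABLE超过30,000,也会触发background writeback。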
wb_over_bg_thresh()不仅控制着background writeback的开始,还控制其结束,如下:
wb_writeback()
---
/*
* For background writeout, stop when we are below the
* background dirty threshold
*/
if (work->for_background && !wb_over_bg_thresh(wb))
break;
---
和Background类似,Periodic也是wb_do_writeback()的常备项,参考函数wb_check_old_data_flush(),之前称为kupdate writeback;与其有关的两个变量为dirty_writeback_interval(/proc/sys/vm/dirty_writeback_centisecs,发起的周期)和dirty_expire_interval(/proc/sys/vm/dirty_expire_centisecs,脏数据的过期时间)。
Periodic或者说Kupdate writeback的触发有两个时机:
在POSIX的官方说明文档中,对sync的描述是这样的,
https://pubs.opengroup.org/onlinepubs/009695299/functions/sync.html
The sync() function shall cause all information in memory that updates file systems to be scheduled for writing out to all file systems.
The writing, although scheduled, is not necessarily complete upon return from sync().
其中并没有规定,写操作必须完成。
但是,fsync的规定比较严格,
https://pubs.opengroup.org/onlinepubs/009695299/functions/fsync.html
The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.
我们先看下fsync的实现:
vfs_fsync_range()
---
if (!datasync && (inode->i_state & I_DIRTY_TIME))
mark_inode_dirty_sync(inode);
return file->f_op->fsync(file, start, end, datasync);
---
不同的文件系统,fsync实现有差别:
syncfs并不是POSIX标准的接口,在linux中,其语义与fsync类似,不过,它针对的是整个文件系统;在sync_filesystem()中,它调用了两次__sync_filesystem(),第一次wait = 0,第二次wait = 1,之所以分两次,是出于性能上的考虑,第一次__sync_filesystem()可以保证尽量多的将IO发出,进而提高并发量;sync_filesystem()有以下一点需要关注:
注:__writeback_single_inode()会调用ext2_write_inode(),__sync_filesystem()再被第二次调用时,会使用sync_inodes_sb(),其中for_sync = 1、sync_mode = WB_SYNC_ALL;ext2_write_inode()此时会等待bh的写完成,是不是应该再识别下for_sync ?即当这两个条件同时满足时,不必等待,等__sync_filesystem()调用__sync_blockdev()一次完成。
Dirty Throttle的目的是限制系统中脏页的数量到一定的比例,即/proc/sys/vm/dirty_ratio,参考下图
Dirty throttle通过让APP dirty page的速度(上图中红线)与flusher的速度(上图中蓝线)一致,使系统的dirty page总量(上图中黑线)不再上升;
限制APP dirty page的速度的方法是,让其睡眠一定的时间,睡眠时间的计算方法为:
pause = page_dirtied / (pos_ratio * balanced_ratelimit)
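代入一组假设的数值:page_dirtied = 32(进程本次累积弄脏的page数),pos_ratio = 0.8,balanced_ratelimit = 8,000 pages/s,则 pause = 32 / (0.8 * 8000) = 5ms,即该进程会被强制睡眠5ms。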
下面,我们分别介绍此公式的基础数据的来源和计算方法。
全局的dirty thresh的计算公式为:
avail = free + active_file + inactive_file
dirty = NR_FILE_DIRTY + NR_FILE_WRITEBACK + NR_UNSTABLE_FILE
dirty_thresh = avail * dirty_ratio
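举例(数值为假设):avail = 8,000,000 pages(约30GB),dirty_ratio = 20%,则 dirty_thresh = 1,600,000 pages(约6GB);相应地,若dirty_background_ratio = 10%,background thresh即为800,000 pages。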
另外,还有bdi级别的dirty_thresh,其值会根据该wb的writeback完成的数据量,比例划分全局dirty thresh,参考函数__wb_calc_thresh(),数据来源参考
__wb_writeout_inc()
---
inc_wb_stat(wb, WB_WRITTEN);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
wb->bdi->max_prop_frac);
---
除了带有BDI_CAP_STRICTLIMIT(目前只有fuse),wb_thresh并不会用在freerun的判断中。
pos_ratio使我们在计算APP pause time的时候可以引入额外的策略;目前其包括两个方面:global control line和wb(bdi) control line,下面主要看全局部分:
freerun = (limit + background) / 2
setpoint = (limit + freerun) / 2
pos_ratio := 1.0 + ((setpoint - dirty) / (limit - setpoint))^3
假设limit = 200、background = 100,pos_ratio跟随dirty在freerun = 150到limit = 200变化的曲线如下;从中我们看到,其倾向于将系统的dirty page的数量控制在setpoint = 175附近。
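沿用limit = 200、background = 100的假设(freerun = 150,setpoint = 175),代入公式:dirty = 160 时,pos_ratio = 1 + ((175-160)/(200-175))^3 = 1 + 0.6^3 ≈ 1.22;dirty = 175 时,pos_ratio = 1;dirty = 190 时,pos_ratio = 1 + (-0.6)^3 ≈ 0.78;dirty越接近limit,pos_ratio越小,APP被惩罚的睡眠时间也就越长。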
上图左显示,当达到相同的dirty page X时,write_bw越大,则pos_ratio越大,也就是说pause time会越小;公式倾向于让write_bw小的wb睡眠的时间更长。
pos_ratio的计算在wb_pos_ratio()函数。
注:bdi control line还有一个pos_ratio最低1/4的限制,上图中没有体现出来
这个wb在过去这段时间内的writeback的真实带宽;
注:之所以是wb级别,是因为每个cgroup都有一个wb,如果cgroup writeback开启,每个cgroup的写速度不同;
其依赖统计数据WB_WRITTEN,来自
end_page_writeback()
-> test_clear_page_writeback()
---
ret = TestClearPageWriteback(page);
if (ret) {
if (bdi_cap_account_writeback(bdi)) {
struct bdi_writeback *wb = inode_to_wb(inode);
__wb_writeout_inc(wb);
-> inc_wb_stat(wb, WB_WRITTEN);
}
}
---
计算公式为:
period = roundup_pow_of_two(3 * HZ);
bw = (written - wb->written_stamp) * HZ / elapsed
write_bandwidth = (bw * elapsed + write_bandwidth * (period - elapsed)) / period
其中,elapsed的计算依赖wb->bw_time_stamp,__wb_update_bandwidth()每200ms调用一次。
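代入一组假设的数值:HZ = 1000,period = roundup_pow_of_two(3000) = 4096;某次更新时 elapsed = 200(即200ms),期间写完5,000个page,则 bw = 5000 * 1000 / 200 = 25,000 pages/s;若旧的write_bandwidth为20,000,则新值 = (25000 * 200 + 20000 * (4096 - 200)) / 4096 ≈ 20,244 pages/s。可见这是一个平滑的滑动平均,单次波动不会导致带宽估计值剧烈变化。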
dirty rate是一个wb在过去这段时间里写脏页的速度,其统计依赖上面的wb->bw_time_stamp以及统计数据WB_DIRTIED,参考代码:
account_page_dirtied()
---
if (mapping_cap_account_dirty(mapping)) {
struct bdi_writeback *wb;
inode_attach_wb(inode, page);
wb = inode_to_wb(inode);
__inc_lruvec_page_state(page, NR_FILE_DIRTY);
inc_wb_stat(wb, WB_DIRTIED);
current->nr_dirtied++;
}
---
dirty rate的计算公式为:
dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;
注:current->nr_dirtied累积了过去一段时间内当前进程写脏页的数量,进而避免每次写脏页都调用balance_dirty_pages()
那么dirty ratelimit是如何计算出的呢?
首先,我们需要明确一个问题,write_bw是否会受到APP的影响?
答案是否定的,在正常的balance_dirty_pages()过程中,脏页都是由background writeback写出,这个过程会持续到wb_over_bg_thresh()返回false,即全局和wb级别的dirty < bg_thresh,而dirty rate受到限制需要在dirty > freerun ((thresh + bg_thresh ) / 2)点之后,这就避免了两者在bg_thresh附近拉锯;所以,如果APP持续输出脏页,wb_writeback()会全力写出到设备。
另外,writeback会控制每个inode writeback的chunk size,避免dirty跌出(freerun, limit)的范围,参考commit
commit 1a12d8bd7b2998be01ee55edb64e7473728abb9c
Author: Wu Fengguang
Date: Sun Aug 29 13:28:09 2010 -0600
writeback: scale IO chunk size up to half device bandwidth
....
XFS is observed to do IO completions in a batch, and the batch size is
equal to the write chunk size. To avoid dirty pages to suddenly drop
out of balance_dirty_pages()'s dirty control scope and create large
fluctuations, the chunk size is also limited to half the control scope.
The balance_dirty_pages() control scrope is
[(background_thresh + dirty_thresh) / 2, dirty_thresh]
which is by default [15%, 20%] of global dirty pages, whose range size
is dirty_thresh / DIRTY_FULL_SCOPE.
The adpative write chunk size will be rounded to the nearest 4MB
boundary.
rate_T : 上个周期的dirty_ratelimit
rate_T+1 : 依据rate_T重新计算的新的dirty_ratelimit
依据设备实际的writeback bandwidth与APP产生dirty page的速度(dirty_rate)的比值,来对rate_T做修正
rate_T+1 = rate_T * (write_bw / dirty_rate)
这个公式有什么问题呢?
已知,
dirty_rate = rate_T * pos_ratio * N
于是得到,
rate_T+1 = write_bw / (pos_ratio * N)
那么,这个计算周期的dirty rate就是
dirty_rate = N * pos_ratio * rate_T+1
= write_bw
pos_ratio的调节功能完全丧失了 !
于是计算公式修正为:
rate_T+1 = rate_T * (write_bw / dirty_rate) * pos_ratio
dirty ratelimit的计算由函数wb_update_dirty_ratelimit()完成,其计算频率为BANDWIDTH_INTERVAL = max(HZ/5, 1),即200ms。
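代入一组假设的数值:rate_T = 10,000 pages/s,本周期实测 write_bw = 20,000 pages/s、dirty_rate = 40,000 pages/s、pos_ratio = 0.8,则 rate_T+1 = 10000 * (20000 / 40000) * 0.8 = 4,000 pages/s,即APP弄脏page的速度被进一步压低,直到与设备的写出能力和控制目标相匹配。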
page_counter结构及一系列接口,用来维护一个mem_cgroup的内存使用;同时page_counter本身还维护了mem_cgroup的层级关系。Max即mem_cgroup的memory.limit_in_bytes,Usage即memory.usage_in_bytes,下面,我们看它们两个是如何工作的;
这里比较特别的是,max在resize时并不会检查层级的max,也就是父mem_cgroup的max可以小于子mem_cgroup的max;但是,这并不会影响什么,因为usage charge时会检查父mem_cgroup的max和usage。
每个page结构体都携带其所在mem_cgroup的指针和一个引用计数。
当相关cgroup下线之后,其相关的memcg依然可能存在于内核中,直到所有的page都被释放
Non-Uniform Memory Access (NUMA) Performance is a MESI Situation
Updating to a new version of the Linux kernel is not too difficult, but it can have surprising performance impacts. This story is one of those times.
https://qumulo.com/blog/non-uniform-memory-access-numa/
这篇文档中介绍了一个有关Numa造成性能瓶颈的案例和原理解释,我们引用其中的图,
跨numa访问的性能瓶颈包括:
上面的文章中提到,home snoop协议对比early snoop,
Home Agent下,通过记录额外信息,可以避免广播,但是依然需要通信;所以,避免跨numa访问才是最优解。那么内核是怎么处理这些问题的呢?
我们会在哪个numa node上申请内存呢?
GFP_USER allocations are marked with the __GFP_HARDWALL bit,
and do not allow allocations outside the current tasks cpuset
unless the task has been OOM killed.
GFP_KERNEL allocations are not so marked, so can escape to the
nearest enclosing hardwalled ancestor cpuset.
cpuset的限制在所有手段都失效时,会被解除,参考函数__alloc_pages_cpuset_fallback();
注:cpuset有个spread mem的功能,可以让page cache或者slab page分配到多个node上
memory policy决定的是任务该从哪个numa node申请内存,最典型的函数如下,
#ifdef CONFIG_NUMA
static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
return alloc_pages_current(gfp_mask, order);
}
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
#endif
struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
struct mempolicy *pol = &default_policy;
struct page *page;
if (!in_interrupt() && !(gfp & __GFP_THISNODE))
pol = get_task_policy(current);
/*
* No reference counting needed for current->mempolicy
* nor system default_policy
*/
if (pol->mode == MPOL_INTERLEAVE)
page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
else
page = __alloc_pages_nodemask(gfp, order,
policy_node(gfp, pol, numa_node_id()),
policy_nodemask(gfp, pol));
return page;
}
memory policy包括default、preferred、bind和interleave几种;其中interleave模式下node的选择参考:
interleave_nodes()
---
struct task_struct *me = current;
next = next_node_in(me->il_prev, policy->v.nodes);
if (next < MAX_NUMNODES)
me->il_prev = next;
return next;
---
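作为补充,用户态可以通过libnuma(底层是set_mempolicy/mbind等系统调用)为自己设置memory policy,下面是一个最小sketch(假设系统安装了libnuma,编译时链接-lnuma):
---
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t sz = 64UL << 20;		/* 64MB */
	void *interleaved, *onnode;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported\n");
		return 1;
	}

	/* 按页轮流分布到所有允许的node上,对应interleave策略 */
	interleaved = numa_alloc_interleaved(sz);
	/* 在指定node(这里是node 0)上申请内存 */
	onnode = numa_alloc_onnode(sz, 0);

	if (interleaved) {
		memset(interleaved, 0, sz);	/* 写入才会真正分配物理页 */
		numa_free(interleaved, sz);
	}
	if (onnode) {
		memset(onnode, 0, sz);
		numa_free(onnode, sz);
	}
	return 0;
}
---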
在系统启动过程中,系统的mempolicy发生了两次变化,
中间短暂的interleave policy,内核文档的解释是:
However, during boot up, the system default policy will be set to interleave allocations across all nodes with “sufficient” memory, so as not to overload the initial boot node with boot-time allocations.
不过,我觉得interleave的原因也包括,让内核核心模块的内存分布在多个节点上,可以让性能更加均衡。
之后,任务的默认mempolicy为preferred_node_policy,参考代码
struct mempolicy *get_task_policy(struct task_struct *p)
{
struct mempolicy *pol = p->mempolicy;
int node;
if (pol)
return pol;
node = numa_node_id();
if (node != NUMA_NO_NODE) {
pol = &preferred_node_policy[node];
/* preferred_node_policy is not initialised early in boot */
if (pol->mode)
return pol;
}
return &default_policy;
}
for_each_node(nid) {
preferred_node_policy[nid] = (struct mempolicy) {
.refcnt = ATOMIC_INIT(1),
.mode = MPOL_PREFERRED,
.flags = MPOL_F_MOF | MPOL_F_MORON,
.v = { .preferred_node = nid, },
};
}
此时的preferred策略相当于local;
然而,任务并不会老老实实地待在其分配内存的node上,而很可能被调度到其他node上,此时,就会产生跨numa访问;memory numa balance是基于内存迁移的方案,即把内存迁移到任务所在的node上。具体使用方法和参数,可以参考,
Automatic Non-Uniform Memory Access (NUMA) Balancing
https://documentation.suse.com/sles/15-SP1/html/SLES-all/cha-tuning-numactl.html
memory numa balance工作流程分为三步:
(1) 周期性地扫描任务的地址空间,将页表项改为PROT_NONE,作为numa hinting fault的"埋点";
(2) 任务再次访问对应page时触发fault,在do_numa_page()中检查page所在node是否符合mempolicy,必要时迁移page;
(3) 统计任务在各个node上的numa fault,由调度器把任务迁移到合适的node上。
Shared library pages mapped by multiple processes are not migrated as it is expected they are cache replicated. Avoid hinting faults in read-only file-backed mappings or the vdso as migrating the pages will be of marginal benefit.
handle_pte_fault()
---
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
---
在do_numa_page()中,会更正页表项的权限,同时检查,page的node id是否符合mempolicy的设定,通常就是local node,如果不符合,就进入下一步,做page迁移
执行memory numa balance的page,都是mmap到程序地址空间的,可以是匿名页,也可以是文件页;文件页的迁移条件比较苛刻,
在调度方面,numa balance会统计这个任务在对应各个numa node上发生numa faults的数量,然后,将任务迁移到那个numa faults数量最多的那个numa 节点上去。
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
unsigned long faults = 0;
for (priv = 0; priv < NR_NUMA_HINT_FAULT_TYPES; priv++) {
long diff, f_diff, f_weight;
mem_idx = task_faults_idx(NUMA_MEM, nid, priv);
membuf_idx = task_faults_idx(NUMA_MEMBUF, nid, priv);
/* Decay existing window, copy faults since last scan */
diff = p->numa_faults[membuf_idx] - p->numa_faults[mem_idx] / 2;
p->numa_faults[membuf_idx] = 0;
p->numa_faults[mem_idx] += diff;
faults += p->numa_faults[mem_idx];
}
if (!ng) {
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
}
}
}
在决定出node id之后,会使用sched_setnuma()更新任务的numa_preferred_nid;
task_numa_migrate()会在numa_preferred_nid中选择一个合适的cpu,然后将该任务迁移过去;
注:do_numa_page()可能刚刚执行了page迁移,task_numa_fault()再把任务迁移走,这不会引起bounce吗?首先可以确定的是,task_numa_fault()执行任务迁移是有一定间隔的,不会每次执行完page migrate都发生任务迁移。