OOM notes


Let's start with an OOM log:

 md 265289728 sc: page allocation failure. order:7, mode:0xd0
Call Trace:
[<ffffffffc100e520>] dump_stack+0x8/0x34
[<ffffffffc10c95cc>] __alloc_pages_nodemask+0x5f4/0x700
[<ffffffffe037165c>] async_read_data_from_md+0x264/0x520 [vd]
[<ffffffffe0362d9c>] md_demo+0x14c/0x4e0 [vd]
[<ffffffffc1080648>] kthread+0x88/0x90
[<ffffffffc1028a50>] kernel_thread_helper+0x10/0x18

Mem-Info:
DMA per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
CPU    4: hi:  186, btch:  31 usd:   0
CPU    5: hi:  186, btch:  31 usd:   0
CPU    6: hi:  186, btch:  31 usd:   0
CPU    7: hi:  186, btch:  31 usd:   0
CBMEM per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
CPU    4: hi:  186, btch:  31 usd:   0
CPU    5: hi:  186, btch:  31 usd:   0
CPU    6: hi:  186, btch:  31 usd:   0
CPU    7: hi:  186, btch:  31 usd:   0
active_anon:4380 inactive_anon:1387 isolated_anon:0
 active_file:316 inactive_file:23814 isolated_file:32
 unevictable:65089 dirty:0 writeback:0 unstable:0
 free:103229 slab_reclaimable:1442 slab_unreclaimable:36971
 mapped:3854 shmem:916 pagetables:428 bounce:0
DMA free:134216kB min:1496kB low:1868kB high:2244kB active_anon:17520kB inactive_anon:5548kB active_file:1264kB inactive_file:95256kB unevictable:260356kB isolated(anon):0kB isolated(file):128kB present:770048kB mlocked:131028kB dirty:0kB writeback:0kB mapped:15416kB shmem:3664kB slab_reclaimable:5768kB slab_unreclaimable:76640kB kernel_stack:4624kB pagetables:1712kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 1008
CBMEM free:278700kB min:2008kB low:2508kB high:3012kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032192kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:71244kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 2135*4kB 2194*8kB 1420*16kB 743*32kB 390*64kB 139*128kB 52*256kB 9*512kB 1*1024kB 0*2048kB 0*4096kB = 134284kB
CBMEM: 265*4kB 123*8kB 43*16kB 22*32kB 5*64kB 2*128kB 1*256kB 2*512kB 1*1024kB 1*2048kB 66*4096kB = 278700kB
58626 total pagecache pages
524288 pages RAM
103684 pages reserved
10235 pages shared
305539 pages non-shared
[BLK][VD][md_demo] md 265289728 scan is released.
[BLK][VD][md_scan_complete_handle] vd_send_event_to_scs stop!!!!
[BLK][VD][md_scan_complete_handle] modify_ok 0,t


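Let's work out the cause of the OOM by following the code path.
Rough scenario: the kernel tries an allocation with gfp_mask = 0xd0 (GFP_KERNEL) and order = 7, i.e. 2^7 contiguous 4kB pages = 512kB.
At that moment the CBMEM zone still has 2 free 512kB blocks and the DMA zone has 9, and plenty of pages are free overall.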
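
As a quick check of the scenario above, the following C sketch converts an order into a contiguous size and tests whether a buddy free list still holds a block of at least that order. The helper names and the 4kB page size are assumptions made for illustration; the counts are copied from the "DMA:" and "CBMEM:" free-list lines of the log.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ORDERS 11   /* orders 0..10, i.e. 4kB..4096kB as printed in the log */
#define PAGE_KB   4    /* assumption: 4kB pages, matching the "*4kB" column */

/* Size in kB of one block of the given order. */
static unsigned long order_to_kb(unsigned int order)
{
	assert(order < NR_ORDERS);
	return (unsigned long)PAGE_KB << order;
}

/* Return 1 if any free block of at least `order` exists, else 0. */
static int has_block_at_or_above(const unsigned long nr_free[NR_ORDERS],
				 unsigned int order)
{
	assert(order < NR_ORDERS);
	for (size_t o = order; o < NR_ORDERS; o++)
		if (nr_free[o] > 0)
			return 1;
	return 0;
}

/* Free-list counts copied from the log (array index = order). */
static const unsigned long dma_free[NR_ORDERS] =
	{ 2135, 2194, 1420, 743, 390, 139, 52, 9, 1, 0, 0 };
static const unsigned long cbmem_free[NR_ORDERS] =
	{ 265, 123, 43, 22, 5, 2, 1, 2, 1, 1, 66 };
```

Calling has_block_at_or_above(dma_free, 7) and has_block_at_or_above(cbmem_free, 7) both report that a large enough block exists, which is why the failure looks surprising.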


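Next, let's look at what gfp_mask does and how it works.
gfp_mask decides which zone the allocation starts from; the key piece of the implementation is the gfp_zone() function: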

static inline enum zone_type gfp_zone(gfp_t flags)
{
	enum zone_type z;
	int bit = flags & GFP_ZONEMASK; // for GFP_KERNEL, bit is 0

	z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
					 ((1 << ZONES_SHIFT) - 1);
	if (__builtin_constant_p(bit))
		MAYBE_BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
	else {
#ifdef CONFIG_DEBUG_VM
		BUG_ON((GFP_ZONE_BAD >> bit) & 1);
#endif
	}
	return z;
}


#define GFP_ZONE_TABLE ( \
	(ZONE_NORMAL << 0 * ZONES_SHIFT)				\
	| (OPT_ZONE_DMA << __GFP_DMA * ZONES_SHIFT) 			\
	| (OPT_ZONE_HIGHMEM << __GFP_HIGHMEM * ZONES_SHIFT)		\
	| (OPT_ZONE_DMA32 << __GFP_DMA32 * ZONES_SHIFT)			\
	| (ZONE_NORMAL << __GFP_MOVABLE * ZONES_SHIFT)			\
	| (OPT_ZONE_DMA << (__GFP_MOVABLE | __GFP_DMA) * ZONES_SHIFT)	\
	| (ZONE_MOVABLE << (__GFP_MOVABLE | __GFP_HIGHMEM) * ZONES_SHIFT)\
	| (OPT_ZONE_DMA32 << (__GFP_MOVABLE | __GFP_DMA32) * ZONES_SHIFT)\
)

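If bit = 0, the result is ZONE_NORMAL.
If bit = __GFP_DMA, the result is OPT_ZONE_DMA (ZONE_DMA, or ZONE_NORMAL when no DMA zone is configured).
If bit = __GFP_HIGHMEM, the result is OPT_ZONE_HIGHMEM.
If bit = __GFP_DMA32, the result is OPT_ZONE_DMA32.
...
The kernel source documents this table with the following comment: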

/*
 * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
 * zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long
 * and there are 16 of them to cover all possible combinations of
 * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM
 *
 * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA.
 * But GFP_MOVABLE is not only a zone specifier but also an allocation
 * policy. Therefore __GFP_MOVABLE plus another zone selector is valid.
 * Only 1bit of the lowest 3 bit (DMA,DMA32,HIGHMEM) can be set to "1".
 *
 *       bit       result
 *       =================
 *       0x0    => NORMAL
 *       0x1    => DMA or NORMAL
 *       0x2    => HIGHMEM or NORMAL
 *       0x3    => BAD (DMA+HIGHMEM)
 *       0x4    => DMA32 or DMA or NORMAL
 *       0x5    => BAD (DMA+DMA32)
 *       0x6    => BAD (HIGHMEM+DMA32)
 *       0x7    => BAD (HIGHMEM+DMA32+DMA)
 *       0x8    => NORMAL (MOVABLE+0)
 *       0x9    => DMA or NORMAL (MOVABLE+DMA)
 *       0xa    => MOVABLE (Movable is valid only if HIGHMEM is set too)
 *       0xb    => BAD (MOVABLE+HIGHMEM+DMA)
 *       0xc    => DMA32 (MOVABLE+HIGHMEM+DMA32)
 *       0xd    => BAD (MOVABLE+DMA32+DMA)
 *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
 *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
 *
 * ZONES_SHIFT must be <= 2 on 32 bit platforms.
 */
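
To make the table lookup concrete, here is a minimal userspace sketch of the same idea. It is not the kernel code: the X_-prefixed flag values, the zone numbering, ZONES_SHIFT = 2 and the choice of a configuration with DMA and DMA32 but no HIGHMEM are all assumptions picked for illustration. The bad-combination mask mirrors the BAD rows of the comment above.

```c
#include <assert.h>

/* Assumed zone numbering for a config with DMA and DMA32 but no HIGHMEM. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE };

#define ZONES_SHIFT     2       /* assumption: 2 bits per table entry */
#define X_GFP_DMA       0x01u   /* assumed values of the four zone bits */
#define X_GFP_HIGHMEM   0x02u
#define X_GFP_DMA32     0x04u
#define X_GFP_MOVABLE   0x08u
#define X_GFP_ZONEMASK  0x0fu
#define X_GFP_ZONE_BAD  0xe8e8u /* bits 3,5,6,7,0xb,0xd,0xe,0xf are BAD */

/* Place a zone number in the table slot selected by the flag bits. */
#define ENTRY(zone, bits) \
	((unsigned long long)(zone) << ((bits) * ZONES_SHIFT))

static const unsigned long long zone_table =
	  ENTRY(ZONE_NORMAL,  0)
	| ENTRY(ZONE_DMA,     X_GFP_DMA)
	| ENTRY(ZONE_NORMAL,  X_GFP_HIGHMEM)  /* no HIGHMEM: falls to NORMAL */
	| ENTRY(ZONE_DMA32,   X_GFP_DMA32)
	| ENTRY(ZONE_NORMAL,  X_GFP_MOVABLE)
	| ENTRY(ZONE_DMA,     X_GFP_MOVABLE | X_GFP_DMA)
	| ENTRY(ZONE_MOVABLE, X_GFP_MOVABLE | X_GFP_HIGHMEM)
	| ENTRY(ZONE_DMA32,   X_GFP_MOVABLE | X_GFP_DMA32);

/* Look up the target zone from the low four bits of the flags. */
static enum zone_type sketch_gfp_zone(unsigned int flags)
{
	unsigned int bit = flags & X_GFP_ZONEMASK;

	assert(!((X_GFP_ZONE_BAD >> bit) & 1)); /* reject BAD combinations */
	return (enum zone_type)((zone_table >> (bit * ZONES_SHIFT)) &
				((1u << ZONES_SHIFT) - 1));
}
```

With these assumed values, sketch_gfp_zone(0xd0) yields ZONE_NORMAL, because 0xd0 has none of the low four zone bits set.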


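In this example gfp_mask is GFP_KERNEL, so gfp_zone() returns ZONE_NORMAL. The log lists only DMA and CBMEM zones, with no populated Normal zone, so in practice the allocation starts from the DMA zone.

Now that we know the zone, let's check whether it can supply the pages. The function that allocates pages is get_page_from_freelist.


Inside get_page_from_freelist, the function that decides whether a zone has enough memory is zone_watermark_ok.

The DMA zone information at the time of the OOM: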


DMA: 2135*4kB 2194*8kB 1420*16kB 743*32kB 390*64kB 139*128kB 52*256kB 9*512kB 1*1024kB 0*2048kB 0*4096kB = 134284kB
Watermark min: 1496kB


Plugging these numbers into zone_watermark_ok():
long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
free_pages = 134284kB / 4kB - (1 << 7) + 1 = 33444
min = 1496kB / 4kB = 374


	for (o = 0; o < order; o++) {
		free_pages -= z->free_area[o].nr_free << o;
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}


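Substituting:
      o = 0, 4kB blocks:   free_pages = 31309, min = 187
      o = 1, 8kB blocks:   free_pages = 26921, min = 93
      o = 2, 16kB blocks:  free_pages = 21241, min = 46
      ...
      o = 6, 256kB blocks: free_pages = 1281, min = 2
At every step free_pages stays above min, so the check should pass. Why the allocation failure was still reported here is not yet clear.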
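
The hand calculation above can be replayed with a small userspace sketch. It reproduces only the loop quoted above, not the full kernel function, which also adjusts min and adds a lowmem reserve before the loop. The function name and the `left` output parameter are assumptions for illustration; the input numbers come from the DMA lines of the log.

```c
#include <assert.h>

#define NR_ORDERS 11   /* orders 0..10 as printed in the free-list line */

/* Free blocks per order in the DMA zone, copied from the log. */
static const long dma_nr_free[NR_ORDERS] =
	{ 2135, 2194, 1420, 743, 390, 139, 52, 9, 1, 0, 0 };

/*
 * Replay of the quoted loop. free_kb and min_kb are in kB (4kB pages assumed).
 * Returns 1 if every step passes, 0 otherwise; *left receives the value of
 * free_pages after the last step that was evaluated.
 */
static int watermark_ok_sketch(long free_kb, long min_kb,
			       const long nr_free[NR_ORDERS],
			       int order, long *left)
{
	long free_pages, min;
	int ok = 1;

	assert(order >= 0 && order < NR_ORDERS);
	free_pages = free_kb / 4 - (1L << order) + 1;
	min = min_kb / 4;

	for (int o = 0; o < order; o++) {
		/* Blocks of this order cannot serve a higher-order request. */
		free_pages -= nr_free[o] << o;
		/* Require fewer free pages as the order grows. */
		min >>= 1;
		if (free_pages <= min) {
			ok = 0;
			break;
		}
	}
	if (left)
		*left = free_pages;
	return ok;
}
```

For the DMA zone, watermark_ok_sketch(134284, 1496, dma_nr_free, 7, &left) passes and leaves 1281 pages after the o = 6 step, matching the manual calculation.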


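Now let's look at another OOM scene.

Background: when no NUMA allocation policy has been set explicitly, a process uses the default policy, meaning that when allocation on the local node fails, the kernel tries neighbouring nodes and then remote nodes in turn.
Take the thread that hit the problem:
TEST             22228  
cat /proc/22228/numa_maps
7fff54af5000 default stack anon=21 dirty=21 active=0 N1=21
The "default" field shows that the default allocation policy is in use.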

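In the kernel this works as follows: at initialization, each node builds a chained zonelist. The first entries belong to the node itself, followed by the zones of neighbouring nodes and then of remote nodes, so alloc_pages simply walks the list in that order.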
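
The ordering described above can be sketched as a sort of nodes by distance from the local node. This is only an illustration of the idea, not the kernel's zonelist builder: the function name, MAX_NODES and the distance matrix are hypothetical.

```c
#include <assert.h>

#define MAX_NODES 4

/* Hypothetical node distance matrix (10 = local, larger = farther). */
static const int example_dist[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 30, 30 },
	{ 20, 10, 30, 30 },
	{ 30, 30, 10, 20 },
	{ 30, 30, 20, 10 },
};

/*
 * Fill order[] with node ids sorted by distance from `local`:
 * the local node first, then nearer nodes before farther ones.
 * Ties keep the lower node id first.
 */
static void build_fallback_order(int local, int nr_nodes,
				 const int dist[MAX_NODES][MAX_NODES],
				 int order[MAX_NODES])
{
	int used[MAX_NODES] = { 0 };

	assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);
	assert(local >= 0 && local < nr_nodes);

	for (int i = 0; i < nr_nodes; i++) {
		int best = -1;

		/* Pick the closest node that has not been placed yet. */
		for (int n = 0; n < nr_nodes; n++) {
			if (used[n])
				continue;
			if (best < 0 || dist[local][n] < dist[local][best])
				best = n;
		}
		used[best] = 1;
		order[i] = best;
	}
}
```

An allocator following the default policy would then try the zones of order[0], order[1], and so on until one of them passes its watermark check.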

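The messages below were presumably printed when TEST called alloc_pages(GFP_KERNEL: 0xd0) on Node 1 and found that Node 1 was short of memory.
In theory the DMA32 and DMA zones on Node 0 should still have plenty of free memory, so why did TEST not allocate from Node 0 instead of reporting a failure?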

[429108.012142] Page Alloc : the task sh(2824) order:1, mode:0xd0, oomadj:0 alloc kernel memory fail and will try again after a while!
[429108.017924] Page Alloc : the task TEST(17212) order:0, mode:0xd0, oomadj:0 alloc kernel memory fail and will try again after a while!
[429108.024193] sh: page allocation failure. order:1, mode:0xd0, oomadj:0
[429108.024220] Active:43 inactive:145 dirty:11 writeback:0 unstable:0
[429108.024221]  free:29423 slab:28887 mapped:4933014 pagetables:13977 bounce:0
[429108.024245] Node 0 <2>DMA free:15824kB min:32kB low:40kB high:48kB active:0kB inactive:0kB cachelimt:1552kB cachefailtimes:0 hi falldown times:0 present:15544kB pages_scanned:0 all_unreclaimable? yes
[429108.024251] lowmem_reserve[]:<2> 0<2> 3215<2> 16041<2>
[429108.024297] Node 0 <2>DMA32 free:57956kB min:6844kB low:8552kB high:10264kB active:12kB inactive:0kB cachelimt:329228kB cachefailtimes:0 hi falldown times:2336399 present:3292296kB pages_scanned:30 all_unreclaimable? yes
[429108.024303] lowmem_reserve[]:<2> 0<2> 0<2> 12826<2>
[429108.024341] Node 0 <2>Normal free:27112kB min:27312kB low:34140kB high:40968kB active:268kB inactive:360kB cachelimt:1313432kB cachefailtimes:0 hi falldown times:33285225 present:13134336kB pages_scanned:1101 all_unreclaimable? yes
[429108.024347] lowmem_reserve[]:<2> 0<2> 0<2> 0<2>
[429108.024386] Node 1 <2>Normal free:16800kB min:17004kB low:21252kB high:25504kB active:0kB inactive:220kB cachelimt:817676kB cachefailtimes:4 hi falldown times:28851188 present:8176768kB pages_scanned:539 all_unreclaimable? yes
[429108.024391] lowmem_reserve[]:<2> 0<2> 0<2> 0<2>
[429108.024690] Page Alloc : the task sh(2824) order:0, mode:0x200d0, oomadj:0 alloc kernel memory fail and will try again after a while!


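The reason is that every zone in the zonelist keeps part of its memory in reserve (lowmem_reserve), and allocations that target a higher zone are not allowed to use it.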
/*
 * results with 256, 32 in the lowmem_reserve sysctl:
 *	1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
 *	1G machine -> (16M dma, 784M normal, 224M high)
 *	NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
 *	HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
 *	HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA
 *
 * TBD: should special case ZONE_DMA32 machines here - in those we normally
 * don't need any ZONE_NORMAL reservation
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
	 256,
#endif
#ifdef CONFIG_ZONE_DMA32
	 256,
#endif
#ifdef CONFIG_HIGHMEM
	 32
#endif
};

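On Node 0, Normal has present:13134336kB and DMA32 has present:3292296kB, so a NORMAL allocation leaves 13134336kB / 256 = 51306kB reserved in ZONE_DMA32. That is the 12826 pages shown in the lowmem_reserve[] line for DMA32. DMA32 has free:57956kB, which is below min 6844kB plus the 51306kB reserve (about 58150kB), so a GFP_KERNEL request is refused there even though the zone appears to have free memory. The DMA zone is refused for the same reason (reserve of 16041 pages), and both Normal zones are already below their min watermark.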
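
The reserve arithmetic can be checked against the log with the sketch below. It assumes that this kernel applies the mainline-style test "free_pages <= min + lowmem_reserve" before the per-order loop and that pages are 4kB; the struct and function names are hypothetical. The free, min and reserve figures are copied from the log lines above.

```c
#include <assert.h>

#define PAGE_KB 4   /* assumption: 4kB pages */

struct zone_snapshot {
	const char *name;
	long free_kb;        /* "free:" from the log */
	long min_kb;         /* "min:" from the log */
	long reserve_pages;  /* lowmem_reserve[] entry for a Normal request */
};

/* Reserve imposed on a lower zone: present pages of higher zones / ratio. */
static long lowmem_reserve_pages(long higher_present_kb, long ratio)
{
	assert(ratio > 0);
	return higher_present_kb / PAGE_KB / ratio;
}

/* First part of the watermark check, including the lowmem reserve. */
static int zone_passes(const struct zone_snapshot *z, int order)
{
	long free_pages = z->free_kb / PAGE_KB - (1L << order) + 1;
	long min = z->min_kb / PAGE_KB;

	return free_pages > min + z->reserve_pages;
}

/* The four zones printed in the log, in the order they appear. */
static const struct zone_snapshot zones[] = {
	{ "Node 0 DMA",    15824,    32, 16041 },
	{ "Node 0 DMA32",  57956,  6844, 12826 },
	{ "Node 0 Normal", 27112, 27312,     0 },
	{ "Node 1 Normal", 16800, 17004,     0 },
};
```

Under these assumptions every zone fails the test for both the order:1 and order:0 requests in the log, which would explain why falling back to Node 0 did not help.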


