Let's start with a page allocation failure (OOM) log:
md 265289728 sc: page allocation failure. order:7, mode:0xd0
Call Trace:
[<ffffffffc100e520>] dump_stack+0x8/0x34
[<ffffffffc10c95cc>] __alloc_pages_nodemask+0x5f4/0x700
[<ffffffffe037165c>] async_read_data_from_md+0x264/0x520 [vd]
[<ffffffffe0362d9c>] md_demo+0x14c/0x4e0 [vd]
[<ffffffffc1080648>] kthread+0x88/0x90
[<ffffffffc1028a50>] kernel_thread_helper+0x10/0x18
Mem-Info:
DMA per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CBMEM per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
active_anon:4380 inactive_anon:1387 isolated_anon:0
 active_file:316 inactive_file:23814 isolated_file:32
 unevictable:65089 dirty:0 writeback:0 unstable:0
 free:103229 slab_reclaimable:1442 slab_unreclaimable:36971
 mapped:3854 shmem:916 pagetables:428 bounce:0
DMA free:134216kB min:1496kB low:1868kB high:2244kB active_anon:17520kB inactive_anon:5548kB active_file:1264kB inactive_file:95256kB unevictable:260356kB isolated(anon):0kB isolated(file):128kB present:770048kB mlocked:131028kB dirty:0kB writeback:0kB mapped:15416kB shmem:3664kB slab_reclaimable:5768kB slab_unreclaimable:76640kB kernel_stack:4624kB pagetables:1712kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 1008
CBMEM free:278700kB min:2008kB low:2508kB high:3012kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032192kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:71244kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 2135*4kB 2194*8kB 1420*16kB 743*32kB 390*64kB 139*128kB 52*256kB 9*512kB 1*1024kB 0*2048kB 0*4096kB = 134284kB
CBMEM: 265*4kB 123*8kB 43*16kB 22*32kB 5*64kB 2*128kB 1*256kB 2*512kB 1*1024kB 1*2048kB 66*4096kB = 278700kB
58626 total pagecache pages
524288 pages RAM
103684 pages reserved
10235 pages shared
305539 pages non-shared
[BLK][VD][md_demo] md 265289728 scan is released.
[BLK][VD][md_scan_complete_handle] vd_send_event_to_scs stop!!!!
[BLK][VD][md_scan_complete_handle] modify_ok 0,t
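Let's follow the code path to work out why the allocation failed.
Rough scenario: the caller tries to allocate with gfp_mask = 0xd0 (GFP_KERNEL) and order = 7, that is, 128 contiguous pages or 512 kB.
At that moment the CBMEM zone still has 2 free 512 kB blocks and the DMA zone has 9, and the total number of free pages is fairly large.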
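As a quick sanity check of these numbers, here is a minimal user-space sketch that converts an order into a size and tests whether a mask equals GFP_KERNEL. The flag values are assumptions taken from include/linux/gfp.h of 2.6.3x-era kernels (newer kernels use different values), the 4 kB page size is inferred from the log, and the helper names are made up for illustration.

```c
#include <stdio.h>

/* Assumed values, as in include/linux/gfp.h of 2.6.3x-era kernels. */
#define FLAG_WAIT 0x10u   /* __GFP_WAIT: the caller may sleep */
#define FLAG_IO   0x40u   /* __GFP_IO: may start physical I/O */
#define FLAG_FS   0x80u   /* __GFP_FS: may call into the filesystem */
#define PAGE_KB   4u      /* 4 kB pages, inferred from the log */

/* Size in kB of one buddy block of the given order. */
unsigned int order_to_kb(unsigned int order)
{
    return (1u << order) * PAGE_KB;
}

/* Returns 1 if mask is exactly WAIT|IO|FS, i.e. GFP_KERNEL in those kernels. */
int is_gfp_kernel(unsigned int mask)
{
    return mask == (FLAG_WAIT | FLAG_IO | FLAG_FS);
}

/* Hypothetical helper: print a one-line summary of a failed request. */
void describe_request(unsigned int order, unsigned int mask)
{
    printf("order %u = %u pages = %u kB, GFP_KERNEL: %s\n",
           order, 1u << order, order_to_kb(order),
           is_gfp_kernel(mask) ? "yes" : "no");
}
```

Calling describe_request(7, 0xd0) from a main() reports 128 pages, 512 kB, and that the mask is GFP_KERNEL.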
static inline enum zone_type gfp_zone(gfp_t flags)
{
	enum zone_type z;
	int bit = flags & GFP_ZONEMASK;	/* for GFP_KERNEL, bit is 0 */

	z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
					 ((1 << ZONES_SHIFT) - 1);

	if (__builtin_constant_p(bit))
		MAYBE_BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
	else {
#ifdef CONFIG_DEBUG_VM
		BUG_ON((GFP_ZONE_BAD >> bit) & 1);
#endif
	}
	return z;
}

#define GFP_ZONE_TABLE ( \
	(ZONE_NORMAL << 0 * ZONES_SHIFT) \
	| (OPT_ZONE_DMA << __GFP_DMA * ZONES_SHIFT) \
	| (OPT_ZONE_HIGHMEM << __GFP_HIGHMEM * ZONES_SHIFT) \
	| (OPT_ZONE_DMA32 << __GFP_DMA32 * ZONES_SHIFT) \
	| (ZONE_NORMAL << __GFP_MOVABLE * ZONES_SHIFT) \
	| (OPT_ZONE_DMA << (__GFP_MOVABLE | __GFP_DMA) * ZONES_SHIFT) \
	| (ZONE_MOVABLE << (__GFP_MOVABLE | __GFP_HIGHMEM) * ZONES_SHIFT)\
	| (OPT_ZONE_DMA32 << (__GFP_MOVABLE | __GFP_DMA32) * ZONES_SHIFT)\
)
/*
 * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
 * zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long
 * and there are 16 of them to cover all possible combinations of
 * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM
 *
 * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA.
 * But GFP_MOVABLE is not only a zone specifier but also an allocation
 * policy. Therefore __GFP_MOVABLE plus another zone selector is valid.
 * Only 1bit of the lowest 3 bit (DMA,DMA32,HIGHMEM) can be set to "1".
 *
 *       bit       result
 *       =================
 *       0x0    => NORMAL
 *       0x1    => DMA or NORMAL
 *       0x2    => HIGHMEM or NORMAL
 *       0x3    => BAD (DMA+HIGHMEM)
 *       0x4    => DMA32 or DMA or NORMAL
 *       0x5    => BAD (DMA+DMA32)
 *       0x6    => BAD (HIGHMEM+DMA32)
 *       0x7    => BAD (HIGHMEM+DMA32+DMA)
 *       0x8    => NORMAL (MOVABLE+0)
 *       0x9    => DMA or NORMAL (MOVABLE+DMA)
 *       0xa    => MOVABLE (Movable is valid only if HIGHMEM is set too)
 *       0xb    => BAD (MOVABLE+HIGHMEM+DMA)
 *       0xc    => DMA32 (MOVABLE+HIGHMEM+DMA32)
 *       0xd    => BAD (MOVABLE+DMA32+DMA)
 *       0xe    => BAD (MOVABLE+DMA32+HIGHMEM)
 *       0xf    => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
 *
 * ZONES_SHIFT must be <= 2 on 32 bit platforms.
 */
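In this example gfp_mask is GFP_KERNEL, so bit is 0 and gfp_zone() returns ZONE_NORMAL as the highest usable zone. The log lists only the DMA and CBMEM zones, so on this board the allocation presumably starts from the DMA zone.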
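The table lookup is easier to follow with concrete numbers. The sketch below rebuilds it in user space for a hypothetical configuration with only DMA, NORMAL and MOVABLE zones and ZONES_SHIFT = 2; the enum values, the ZBIT_* names and gfp_zone_demo are assumptions for illustration, not the layout of the board in the log.

```c
#include <stdio.h>

/* Hypothetical zone layout: DMA, NORMAL, MOVABLE (no DMA32, no HIGHMEM). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE };

#define ZONES_SHIFT      2            /* 2 bits are enough for 3 zones */
#define OPT_ZONE_DMA     ZONE_DMA
#define OPT_ZONE_HIGHMEM ZONE_NORMAL  /* absent zones fall back to NORMAL */
#define OPT_ZONE_DMA32   ZONE_NORMAL

/* Zone modifier bits: the lowest 4 bits of gfp_t (assumed values). */
#define ZBIT_DMA     0x01u
#define ZBIT_HIGHMEM 0x02u
#define ZBIT_DMA32   0x04u
#define ZBIT_MOVABLE 0x08u
#define ZBIT_MASK    0x0fu

/* One table entry: the zone number stored at slot `bits`. */
#define ENTRY(zone, bits) ((unsigned int)(zone) << ((bits) * ZONES_SHIFT))

#define ZONE_TABLE ( \
      ENTRY(ZONE_NORMAL,      0u) \
    | ENTRY(OPT_ZONE_DMA,     ZBIT_DMA) \
    | ENTRY(OPT_ZONE_HIGHMEM, ZBIT_HIGHMEM) \
    | ENTRY(OPT_ZONE_DMA32,   ZBIT_DMA32) \
    | ENTRY(ZONE_NORMAL,      ZBIT_MOVABLE) \
    | ENTRY(OPT_ZONE_DMA,     ZBIT_MOVABLE | ZBIT_DMA) \
    | ENTRY(ZONE_MOVABLE,     ZBIT_MOVABLE | ZBIT_HIGHMEM) \
    | ENTRY(OPT_ZONE_DMA32,   ZBIT_MOVABLE | ZBIT_DMA32))

/* Same lookup as gfp_zone(), without the GFP_ZONE_BAD sanity checks. */
enum zone_type gfp_zone_demo(unsigned int flags)
{
    unsigned int bit = flags & ZBIT_MASK;

    return (enum zone_type)((ZONE_TABLE >> (bit * ZONES_SHIFT)) &
                            ((1u << ZONES_SHIFT) - 1));
}

/* Hypothetical helper: print the zone index selected for a mask. */
void show_zone(unsigned int flags)
{
    printf("mask 0x%x -> zone index %d\n", flags, (int)gfp_zone_demo(flags));
}
```

With this layout, 0xd0 has no zone bits set, so slot 0 is read and the result is ZONE_NORMAL; adding the DMA bit selects slot 1 and returns ZONE_DMA.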
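Now that we know the zone, let's check whether it can actually satisfy the request. The function that allocates pages is get_page_from_freelist.
Inside get_page_from_freelist, the function that decides whether a zone has enough free memory is zone_watermark_ok.
The log shows the state of the DMA zone at that moment: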
DMA: 2135*4kB 2194*8kB 1420*16kB 743*32kB 390*64kB 139*128kB 52*256kB 9*512kB 1*1024kB 0*2048kB 0*4096kB = 134284kB
min watermark: 1496kB
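long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
/* here: free_pages = 134284kB / 4kB - (1 << 7) + 1 = 33444 */
/* here: min = 1496kB / 4kB = 374 */

for (o = 0; o < order; o++) {
	/* blocks of order o cannot serve a higher-order request */
	free_pages -= z->free_area[o].nr_free << o;
	/* require fewer free pages as the order goes up */
	min >>= 1;
	if (free_pages <= min)
		return 0;
}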
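To see what this loop does with the snapshot, here is a user-space sketch that replays it with the DMA free-list counts from the log. It is a simplification under stated assumptions: lowmem_reserve and the ALLOC_HIGH/ALLOC_HARDER adjustments are left out, and the names dma_nr_free, total_free_pages and watermark_ok_demo are hypothetical.

```c
#include <stdio.h>

#define MAX_ORDER 11

/* Free block counts per order for the DMA zone, copied from the log. */
const long dma_nr_free[MAX_ORDER] = {
    2135, 2194, 1420, 743, 390, 139, 52, 9, 1, 0, 0
};

/* Total free pages in a zone: the sum of nr_free[o] << o over all orders. */
long total_free_pages(const long *nr_free)
{
    long total = 0;

    for (int o = 0; o < MAX_ORDER; o++)
        total += nr_free[o] << o;
    return total;
}

/*
 * Replays the watermark loop for one zone. Returns 1 if the check passes.
 * *left receives the pages still counted when the function returns.
 */
int watermark_ok_demo(const long *nr_free, int order, long min, long *left)
{
    long free_pages = total_free_pages(nr_free) - (1L << order) + 1;
    int ok = free_pages > min;

    for (int o = 0; ok && o < order; o++) {
        free_pages -= nr_free[o] << o;  /* drop blocks that are too small */
        min >>= 1;                      /* relax the threshold per order */
        if (free_pages <= min)
            ok = 0;
    }
    *left = free_pages;
    return ok;
}

/* Hypothetical helper: print the outcome for the request in the log. */
void show_dma_check(void)
{
    long left;
    int ok = watermark_ok_demo(dma_nr_free, 7, 374, &left);

    printf("order 7: ok=%d, pages left=%ld\n", ok, left);
}
```

With these numbers the loop ends with 1281 pages left (9 blocks of order 7 plus 1 of order 8, minus the 127 pages subtracted up front) against a threshold of 2, so the check passes. The Mem-Info dump is printed after the failure, so it may not match what the allocator saw, and the snapshot alone may not explain the failure.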
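Now let's look at another OOM scene.
Some background: when no NUMA allocation policy is set explicitly, a process uses the default policy, which means that if allocation on the local node fails, the allocator tries the neighbor nodes and then the remote nodes in turn.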
Take the thread that ran into trouble:
TEST 22228
cat /proc/22228/numa_maps
7fff54af5000 default stack anon=21 dirty=21 active=0 N1=21
The "default" field shows that the default allocation policy is in effect.
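In the kernel code, each node builds a linked zonelist at initialization time. The zones of the local node come first in that list, followed by the zones of the neighbor nodes and then the remote nodes, so alloc_pages simply walks down the list in that order.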
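A minimal sketch of that ordering, assuming a hypothetical two-node machine where node 0 has DMA, DMA32 and Normal zones and node 1 has only Normal, with made-up NUMA distances of 10 (local) and 20 (remote). build_zonelist here is an illustration of the node-ordered idea, not the kernel function of a similar name.

```c
#include <stdio.h>

#define NR_NODES  2
#define MAX_ZONES 3

enum zone_id { Z_NONE = -1, Z_DMA, Z_DMA32, Z_NORMAL };

struct zone_ref {
    int node;
    enum zone_id zone;
};

/* Zones present on each node, highest zone first (hypothetical layout). */
const enum zone_id node_zones[NR_NODES][MAX_ZONES] = {
    { Z_NORMAL, Z_DMA32, Z_DMA  },
    { Z_NORMAL, Z_NONE,  Z_NONE },
};

/* Assumed NUMA distances: 10 to the local node, 20 to the other node. */
const int node_dist[NR_NODES][NR_NODES] = {
    { 10, 20 },
    { 20, 10 },
};

/*
 * Fill `out` with the fallback order seen from node `local`: nodes sorted
 * by distance, and within a node the highest zone first. Returns the
 * number of entries written.
 */
int build_zonelist(int local, struct zone_ref *out)
{
    int used[NR_NODES] = { 0 };
    int count = 0;

    for (int round = 0; round < NR_NODES; round++) {
        int best = -1;

        /* pick the nearest node that has not been added yet */
        for (int node = 0; node < NR_NODES; node++) {
            if (used[node])
                continue;
            if (best < 0 || node_dist[local][node] < node_dist[local][best])
                best = node;
        }
        used[best] = 1;

        for (int z = 0; z < MAX_ZONES && node_zones[best][z] != Z_NONE; z++) {
            out[count].node = best;
            out[count].zone = node_zones[best][z];
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: print the fallback order for one node. */
void print_zonelist(int local)
{
    struct zone_ref list[NR_NODES * MAX_ZONES];
    int n = build_zonelist(local, list);

    for (int i = 0; i < n; i++)
        printf("%d: node %d zone %d\n", i, list[i].node, (int)list[i].zone);
}
```

Seen from node 1, the resulting order is node 1 Normal, then node 0 Normal, DMA32 and DMA, which is the order an allocation on node 1 would try under these assumptions.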
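The messages below were presumably printed when TEST tried alloc_pages(GFP_KERNEL: 0xd0) on Node 1 and found that Node 1 was short of memory.
In principle the DMA32 and DMA zones on Node 0 should still have plenty of free memory, so why did TEST report an allocation failure instead of allocating from Node 0?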
[429108.012142] Page Alloc : the task sh(2824) order:1, mode:0xd0, oomadj:0 alloc kernel memory fail and will try again after a while!
[429108.017924] Page Alloc : the task TEST(17212) order:0, mode:0xd0, oomadj:0 alloc kernel memory fail and will try again after a while!
[429108.024193] sh: page allocation failure. order:1, mode:0xd0, oomadj:0
[429108.024220] Active:43 inactive:145 dirty:11 writeback:0 unstable:0
[429108.024221] free:29423 slab:28887 mapped:4933014 pagetables:13977 bounce:0
[429108.024245] Node 0 <2>DMA free:15824kB min:32kB low:40kB high:48kB active:0kB inactive:0kB cachelimt:1552kB cachefailtimes:0 hi falldown times:0 present:15544kB pages_scanned:0 all_unreclaimable? yes
[429108.024251] lowmem_reserve[]:<2> 0<2> 3215<2> 16041<2>
[429108.024297] Node 0 <2>DMA32 free:57956kB min:6844kB low:8552kB high:10264kB active:12kB inactive:0kB cachelimt:329228kB cachefailtimes:0 hi falldown times:2336399 present:3292296kB pages_scanned:30 all_unreclaimable? yes
[429108.024303] lowmem_reserve[]:<2> 0<2> 0<2> 12826<2>
[429108.024341] Node 0 <2>Normal free:27112kB min:27312kB low:34140kB high:40968kB active:268kB inactive:360kB cachelimt:1313432kB cachefailtimes:0 hi falldown times:33285225 present:13134336kB pages_scanned:1101 all_unreclaimable? yes
[429108.024347] lowmem_reserve[]:<2> 0<2> 0<2> 0<2>
[429108.024386] Node 1 <2>Normal free:16800kB min:17004kB low:21252kB high:25504kB active:0kB inactive:220kB cachelimt:817676kB cachefailtimes:4 hi falldown times:28851188 present:8176768kB pages_scanned:539 all_unreclaimable? yes
[429108.024391] lowmem_reserve[]:<2> 0<2> 0<2> 0<2>
[429108.024690] Page Alloc : the task sh(2824) order:0, mode:0x200d0, oomadj:0 alloc kernel memory fail and will try again after a while!
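One way to read this log is to apply the first test of zone_watermark_ok, free_pages <= min + lowmem_reserve[classzone_idx], to every zone. The sketch below does that with the values from the log. It assumes 4 kB pages, that the third lowmem_reserve[] entry is the one used for a request whose highest usable zone is Normal, and that the min watermark is not lowered by ALLOC_HIGH or ALLOC_HARDER; the struct and function names are hypothetical.

```c
#include <stdio.h>

struct zone_info {
    const char *name;
    long free_kb;        /* "free:" from the log */
    long min_kb;         /* "min:" from the log */
    long reserve_pages;  /* third lowmem_reserve[] entry, in pages */
};

#define NR_LOGGED_ZONES 4

/* Values copied from the log above. */
const struct zone_info zones[NR_LOGGED_ZONES] = {
    { "Node 0 DMA",    15824,    32, 16041 },
    { "Node 0 DMA32",  57956,  6844, 12826 },
    { "Node 0 Normal", 27112, 27312,     0 },
    { "Node 1 Normal", 16800, 17004,     0 },
};

/* First watermark test for one zone: free must exceed min + reserve. */
int zone_passes(const struct zone_info *z, int order)
{
    long free_pages = z->free_kb / 4 - (1L << order) + 1;
    long min_pages = z->min_kb / 4;

    return free_pages > min_pages + z->reserve_pages;
}

/* Number of zones that would accept a request of the given order. */
int count_usable_zones(int order)
{
    int usable = 0;

    for (int i = 0; i < NR_LOGGED_ZONES; i++)
        usable += zone_passes(&zones[i], order);
    return usable;
}

/* Hypothetical helper: print the verdict for every zone. */
void show_zones(int order)
{
    for (int i = 0; i < NR_LOGGED_ZONES; i++)
        printf("%s: %s\n", zones[i].name,
               zone_passes(&zones[i], order) ? "ok" : "below watermark");
}
```

Under these assumptions no zone passes for order 0 or order 1. Both Normal zones are below their own min watermark, while Node 0 DMA (3956 free pages against 8 + 16041) and DMA32 (14489 against 1711 + 12826) are rejected only because of the lowmem_reserve term.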
/*
 * results with 256, 32 in the lowmem_reserve sysctl:
 *	1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
 *	1G machine -> (16M dma, 784M normal, 224M high)
 *	NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
 *	HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
 *	HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA
 *
 * TBD: should special case ZONE_DMA32 machines here - in those we normally
 * don't need any ZONE_NORMAL reservation
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
#ifdef CONFIG_ZONE_DMA
	 256,
#endif
#ifdef CONFIG_ZONE_DMA32
	 256,
#endif
#ifdef CONFIG_HIGHMEM
	 32
#endif
};
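The lowmem_reserve[] values in the second log can be reproduced from these ratios. The sketch below assumes that a lower zone reserves (sum of present pages of all higher zones up to the target zone) / ratio[lower zone], uses the "present:" values of Node 0 with 4 kB pages, and takes 256 for both ratios; the array and function names are hypothetical.

```c
#include <stdio.h>

#define NR_ZONES 3  /* DMA, DMA32, Normal on Node 0 */

/* "present:" values of Node 0 from the log, in kB. */
const long present_kb[NR_ZONES] = { 15544, 3292296, 13134336 };

/* Assumed ratios for DMA and DMA32, as in sysctl_lowmem_reserve_ratio. */
const long reserve_ratio[NR_ZONES - 1] = { 256, 256 };

/*
 * Pages that zone `lower` holds back from requests whose highest usable
 * zone is `target`. Returns 0 when target is not above lower.
 */
long lowmem_reserve_pages(int lower, int target)
{
    long pages = 0;

    if (target <= lower)
        return 0;
    for (int z = lower + 1; z <= target; z++)
        pages += present_kb[z] / 4;     /* kB to 4 kB pages */
    return pages / reserve_ratio[lower];
}

/* Hypothetical helper: print the reserve table of one zone. */
void show_reserve(int lower)
{
    printf("lowmem_reserve[]:");
    for (int target = 0; target < NR_ZONES; target++)
        printf(" %ld", lowmem_reserve_pages(lower, target));
    printf("\n");
}
```

The results are 0 3215 16041 for DMA and 0 0 12826 for DMA32, the same numbers the log prints, so a GFP_KERNEL request has to leave about 62 MB free in DMA and about 50 MB in DMA32 on top of their min watermarks before it may fall back to them.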