Problem description
On October 20, 2018, a VM on one of our hosts triggered an OOM and was killed by the kernel. When the problem occurred the host still had plenty of free memory. The log in /var/log/messages is as follows:
Note
The order field in the log indicates how much memory was requested: order=0 means 2^0 = 1 page was requested, i.e. 4 KB.
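As a quick check, the request size for a given order is 2^order pages times the page size; a minimal shell sketch (a 4 KB page size is assumed):

# request size in KB for a given allocation order (assuming 4 KB pages)
order=0
echo $(( (1 << order) * 4 ))   # prints 4; with order=2 it would print 16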
Oct 20 00:43:07 kernel: qemu-kvm invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Oct 20 00:43:07 kernel: qemu-kvm cpuset=emulator mems_allowed=1
Oct 20 00:43:07 kernel: CPU: 7 PID: 1194284 Comm: qemu-kvm Tainted: G OE ------------ 3.10.0-327.el7.x86_64 #1
Oct 20 00:43:07 kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.5.5 08/16/2017
Oct 20 00:43:07 kernel: ffff882e328f0b80 000000008b0f4108 ffff882f6f367b00 ffffffff816351f1
Oct 20 00:43:07 kernel: ffff882f6f367b90 ffffffff81630191 ffff882e32a91980 0000000000000001
Oct 20 00:43:07 kernel: 000000000000420f 0000000000000010 ffffffff8197d740 00000000b922b922
Oct 20 00:43:07 kernel: Call Trace:
Oct 20 00:43:07 kernel: [] dump_stack+0x19/0x1b
Oct 20 00:43:07 kernel: [] dump_header+0x8e/0x214
Oct 20 00:43:07 kernel: [] oom_kill_process+0x24e/0x3b0
Oct 20 00:43:07 kernel: [] ? find_lock_task_mm+0x56/0xc0
Oct 20 00:43:07 kernel: [] out_of_memory+0x4b6/0x4f0
Oct 20 00:43:07 kernel: [] __alloc_pages_nodemask+0xa95/0xb90
Oct 20 00:43:07 kernel: [] alloc_pages_vma+0x9a/0x140
Oct 20 00:43:07 kernel: [] handle_mm_fault+0xb85/0xf50
Oct 20 00:43:07 kernel: [] ? eventfd_ctx_read+0x67/0x210
Oct 20 00:43:07 kernel: [] __do_page_fault+0x152/0x420
Oct 20 00:43:07 kernel: [] do_page_fault+0x23/0x80
Oct 20 00:43:07 kernel: [] page_fault+0x28/0x30
Oct 20 00:43:07 kernel: Mem-Info:
Oct 20 00:43:07 kernel: active_anon:87309259 inactive_anon:444334 isolated_anon:0#012 active_file:101827 inactive_file:1066463 isolated_file:0#012 unevictable:0 dirty:16777 writeback:0 unstable:0#012 free:8521193 slab_reclaimable:179558 slab_unreclaimable:138991#012 mapped:14804 shmem:1180357 pagetables:195678 bounce:0#012 free_cma:0
Oct 20 00:43:07 kernel: Node 1 Normal free:44244kB min:45096kB low:56368kB high:67644kB active_anon:194740280kB inactive_anon:795780kB active_file:80kB inactive_file:100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198168156kB mlocked:0kB dirty:4kB writeback:0kB mapped:2500kB shmem:2177236kB slab_reclaimable:158548kB slab_unreclaimable:199088kB kernel_stack:109552kB pagetables:478460kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:301 all_unreclaimable? yes
Oct 20 00:43:07 kernel: lowmem_reserve[]: 0 0 0 0
Oct 20 00:43:07 kernel: Node 1 Normal: 10147*4kB (UEM) 22*8kB (UE) 3*16kB (U) 11*32kB (UR) 8*64kB (R) 6*128kB (R) 2*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 44492kB
Oct 20 00:43:07 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 20 00:43:07 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 20 00:43:07 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 20 00:43:07 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 20 00:43:07 kernel: 2349178 total pagecache pages
Oct 20 00:43:07 kernel: 0 pages in swap cache
Oct 20 00:43:07 kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 20 00:43:07 kernel: Free swap = 0kB
Oct 20 00:43:07 kernel: Total swap = 0kB
Oct 20 00:43:07 kernel: 100639322 pages RAM
Oct 20 00:43:07 kernel: 0 pages HighMem/MovableOnly
Oct 20 00:43:07 kernel: 1646159 pages reserved
Oct 20 00:43:07 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Oct 20 00:43:07 kernel: Out of memory: Kill process 1409878 (qemu-kvm) score 666 or sacrifice child
Oct 20 00:43:07 kernel: Killed process 1409878 (qemu-kvm) total-vm:136850144kB, anon-rss:133909332kB, file-rss:4724kB
Oct 20 00:43:30 libvirtd: 2018-10-19 16:43:30.303+0000: 81546: error : qemuMonitorIO:705 : internal error: End of file from qemu monitor
Oct 20 00:43:30 systemd-machined: Machine qemu-7-c2683281-6cbd-4100-ba91-e221ed06ee60 terminated.
Oct 20 00:43:30 kvm: 6 guests now active
The detailed meminfo output and the per-process memory usage table have been omitted from the log above.
The log shows that free memory in the Node 1 Normal zone was down to roughly 44 MB, which is what triggered the OOM, even though node 0 still had plenty of unused memory at the time. The process that triggered the OOM was qemu-kvm with pid 1194284. By searching the logs we traced the problem to VM 25913bd0-d869-4310-ab53-8df6855dd258, and its domain XML showed the following NUMA memory configuration:
<memory mode='strict' placement='auto'/>
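For context, that element lives inside the domain's <numatune> block; a minimal sketch of the relevant fragment of the domain XML (all other elements omitted):

<domain type='kvm'>
  <numatune>
    <!-- strict + auto: one NUMA node is chosen automatically and the guest's memory is confined to it -->
    <memory mode='strict' placement='auto'/>
  </numatune>
</domain>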
The information obtained through the virsh client is as follows:
virsh # numatune 25913bd0-d869-4310-ab53-8df6855dd258
numa_mode : strict
numa_nodeset : 1
It turns out that when mode is strict and placement is auto, a suitable NUMA node is computed automatically and assigned to the VM. The VM's memory was therefore confined to node 1, and once node 1's memory was used up the OOM was triggered.
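One way to confirm the pinning on the host is to check which nodes the qemu-kvm process is actually allowed to allocate from; a minimal sketch (the pid is the one from the log above and is only illustrative):

pid=1194284
grep Mems_allowed_list /proc/$pid/status   # expect "1" for a guest confined to node 1
numastat -c -p $pid                        # per-node memory usage of this process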
Red Hat's official documentation describes the numatune mode options as follows (see the official documentation link):
- strict
The strict policy means that the allocation fails if the memory cannot be allocated on the target node. Specifying a NUMA nodeset without defining a memory mode defaults to the strict policy.
- interleave
Memory pages are allocated across the specified set of nodes in a round-robin fashion.
- preferred
Memory is allocated from a single preferred memory node; if sufficient memory is not available there, memory can be allocated from other nodes.
Important note
If memory is overcommitted in strict mode and the guest does not have sufficient swap space, the kernel will kill some guest processes to obtain additional memory. Red Hat therefore recommends using preferred and configuring a single node (for example, nodeset='0') to avoid this situation.
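Following that recommendation, the corresponding libvirt fragment would look roughly like the sketch below (nodeset='0' is just an example value):

<numatune>
  <!-- prefer node 0, but fall back to other nodes instead of failing the allocation -->
  <memory mode='preferred' nodeset='0'/>
</numatune>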
Reproducing the problem
We took a new host, created a VM on it, and modified the VM's numatune configuration, testing how the VM behaved in strict and preferred mode under the following three configurations.
interleave spreads memory across nodes and will certainly perform worse than the other two modes; since what we mainly wanted to test was whether strict and preferred trigger an OOM when a single node's memory is completely used up, interleave was left out of the test.
The three configurations
Configuration 1
mode is strict, placement is auto
<memory mode='strict' placement='auto'/>
Configuration 2
mode is preferred, placement is auto
<memory mode='preferred' placement='auto'/>
Configuration 3
mode is strict, nodeset is 0-1
<memory mode='strict' nodeset='0-1'/>
Test procedure
Fill up the memory of a single NUMA node on the host with memholder (a tool from the ssplatform2-tools rpm package), using the command numactl -i 0 memholder 64000 &, then run memholder inside the VM as well, and watch how memory is distributed across the NUMA nodes as the VM's memory usage keeps climbing.
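While the test runs, per-node usage can be watched with the same standard tools whose output appears below; a minimal monitoring sketch:

# refresh per-VM and per-node free memory every 5 seconds
watch -n 5 "numastat -c qemu-kvm; echo; numactl --hardware | grep -E 'node [01] free'"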
Test results
Configuration 1
The information obtained on the virsh client side is shown below; although placement is auto, the qemu-kvm process has still settled on a specific node:
virsh # numatune 638abba7-bba8-498b-88d6-ddc70f2cef18
numa_mode : strict
numa_nodeset : 1
The VM's memory usage at the start:
# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1332894 (qemu-kv 0 693 694
1764062 (qemu-kv 0 366 366
--------------- ------ ------ -----
Total 1 1060 1060
Host memory usage after memholder filled up node 1's memory:
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58476 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 64 MB
node distances:
node 0 1
0: 10 21
1: 21 10
After memholder started consuming memory inside the VM, the VM's memory usage was as follows:
numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1332894 (qemu-kv 6 685 692
1764062 (qemu-kv 7 4670 4677
--------------- ------ ------ -----
Total 13 5355 5368
Host memory usage:
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58650 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 52181 MB
node distances:
node 0 1
0: 10 21
1: 21 10
At this point the qemu-kvm process had triggered an OOM: the memholder process hogging memory on the host had been killed by the kernel, and the host's memory was freed up again.
The log in /var/log/messages:
Nov 13 21:07:07 kernel: qemu-kvm invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Nov 13 21:07:07 kernel: qemu-kvm cpuset=emulator mems_allowed=1
Nov 13 21:07:07 kernel: CPU: 28 PID: 1332894 Comm: qemu-kvm Not tainted 4.4.36-1.el7.elrepo.x86_64 #1
Nov 13 21:07:07 kernel: Mem-Info:
Nov 13 21:07:07 kernel: active_anon:1986423 inactive_anon:403229 isolated_anon:0#012 active_file:116773 inactive_file:577075 isolated_file:0#012 unevictable:14364416 dirty:142 writeback:0 unstable:0#012 slab_reclaimable:61182 slab_unreclaimable:296489#012 mapped:14400991 shmem:15542531 pagetables:35749 bounce:0#012 free:14983912 free_pcp:0 free_cma:0
Nov 13 21:07:07 kernel: Node 1 Normal free:44952kB min:45120kB low:56400kB high:67680kB active_anon:5485032kB inactive_anon:1571408kB active_file:308kB inactive_file:0kB unevictable:57286820kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66044484kB mlocked:57286820kB dirty:48kB writeback:0kB mapped:57330444kB shmem:61948048kB slab_reclaimable:143752kB slab_unreclaimable:1107004kB kernel_stack:16592kB pagetables:129312kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2248 all_unreclaimable? yes
Nov 13 21:07:07 kernel: lowmem_reserve[]: 0 0 0 0
Nov 13 21:07:07 kernel: Node 1 Normal: 1018*4kB (UME) 312*8kB (UE) 155*16kB (UE) 34*32kB (UE) 293*64kB (UM) 53*128kB (U) 5*256kB (U) 1*512kB (U) 1*1024kB (E) 2*2048kB (UM) 2*4096kB (M) = 50776kB
Nov 13 21:07:07 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 13 21:07:07 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 13 21:07:07 kernel: 16236582 total pagecache pages
Nov 13 21:07:07 kernel: 0 pages in swap cache
Nov 13 21:07:07 kernel: Swap cache stats: add 0, delete 0, find 0/0
Nov 13 21:07:07 kernel: Free swap = 0kB
Nov 13 21:07:07 kernel: Total swap = 0kB
Nov 13 21:07:07 kernel: 33530456 pages RAM
Nov 13 21:07:07 kernel: 0 pages HighMem/MovableOnly
Nov 13 21:07:07 kernel: 551723 pages reserved
Nov 13 21:07:07 kernel: 0 pages hwpoisoned
The process we were testing was 1764062, but the process that triggered the OOM was 1332894. The VM behind that process also uses configuration 1, and the nodeset reported by the virsh client is also 1:
virsh # numatune c11a155a-95b0-4593-9ce5-f2a42dc0ccca
numa_mode : strict
numa_nodeset : 1
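To map a qemu-kvm pid like 1332894 back to its libvirt domain, something along these lines can be used (a sketch, assuming libvirt's default pid-file location; adjust the path if your distribution uses /var/run):

pid=1332894
grep -l "^${pid}$" /run/libvirt/qemu/*.pid                         # the matching file name is the domain name
tr '\0' ' ' < /proc/$pid/cmdline | grep -o -- '-uuid [0-9a-f-]*'   # or read the guest UUID from the qemu command line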
Configuration 2
The VM's numatune as reported by the virsh client:
virsh # numatune 638abba7-bba8-498b-88d6-ddc70f2cef18
numa_mode : preferred
numa_nodeset : 1
The VM's memory usage at the start:
[@ ~]# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1332894 (qemu-kv 6 691 698
1897916 (qemu-kv 17 677 694
--------------- ------ ------ -----
Total 24 1368 1392
Host memory usage after memholder filled up node 1's memory:
[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58403 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 56 MB
node distances:
node 0 1
0: 10 21
1: 21 10
After memholder started consuming memory inside the VM, the VM's memory usage was as follows:
[@ ~]# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1332894 (qemu-kv 7 690 697
1897916 (qemu-kv 4012 682 4695
--------------- ------ ------ -----
Total 4019 1372 5391
Host memory usage:
[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 54395 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 55 MB
node distances:
node 0 1
0: 10 21
1: 21 10
From the above, even though the preferred node is node 1, when node 1 ran out of memory the process allocated memory from node 0 instead, and no OOM was triggered.
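For a VM that is already pinned by configuration 1, the policy can be switched to this behaviour with virsh numatune (a sketch; the UUID is the test VM's, and with --config the change only applies to the persistent definition, i.e. after the guest is restarted):

virsh numatune 638abba7-bba8-498b-88d6-ddc70f2cef18 --mode preferred --nodeset 1 --config
virsh numatune 638abba7-bba8-498b-88d6-ddc70f2cef18    # verify the stored settings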
Configuration 3
Note: process 1308480 is the VM process under test.
The VM's memory usage at the start:
[@ ~]# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1308480 (qemu-kv 141 584 725
1332894 (qemu-kv 0 707 708
--------------- ------ ------ -----
Total 141 1291 1432
Host memory usage:
[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58241 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 131 MB
node distances:
node 0 1
0: 10 21
1: 21 10
After memholder started consuming memory inside the VM, the VM's memory usage was as follows:
[@ ~]# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Total
--------------- ------ ------ -----
1308480 (qemu-kv 4017 682 4699
1332894 (qemu-kv 7 681 688
--------------- ------ ------ -----
Total 4024 1363 5387
Host memory usage:
[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 54410 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 55 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Summary
From these tests, neither configuration 2 nor configuration 3 leads to an OOM caused by unbalanced memory usage across the two NUMA nodes; which of the two performs better still needs further testing.
References
www.jianshu.com/p/c2e7d3682…