Problems caused by an incorrect VM NUMA configuration

Problem description

On October 20, 2018, a VM on one of our hosts triggered the OOM killer and was killed by the kernel, even though the host still had plenty of free memory at the time. The log in messages was as follows:

Note

The order value in the log indicates how much memory was requested: order=0 means a request for 2^0 pages, i.e. a single 4 KB page.
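
As a quick sanity check of that relationship (assuming the usual 4 KB page size), the size requested for a given order can be computed like this:

# an order-N allocation is 2^N contiguous pages; with 4 KB pages:
order=0
echo "$(( (1 << order) * 4 )) KB"   # prints "4 KB" for order=0, "8 KB" for order=1, ...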

Oct 20 00:43:07  kernel: qemu-kvm invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Oct 20 00:43:07  kernel: qemu-kvm cpuset=emulator mems_allowed=1
Oct 20 00:43:07  kernel: CPU: 7 PID: 1194284 Comm: qemu-kvm Tainted: G           OE  ------------   3.10.0-327.el7.x86_64 #1
Oct 20 00:43:07  kernel: Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.5.5 08/16/2017
Oct 20 00:43:07  kernel: ffff882e328f0b80 000000008b0f4108 ffff882f6f367b00 ffffffff816351f1
Oct 20 00:43:07  kernel: ffff882f6f367b90 ffffffff81630191 ffff882e32a91980 0000000000000001
Oct 20 00:43:07  kernel: 000000000000420f 0000000000000010 ffffffff8197d740 00000000b922b922
Oct 20 00:43:07  kernel: Call Trace:
Oct 20 00:43:07  kernel: [] dump_stack+0x19/0x1b
Oct 20 00:43:07  kernel: [] dump_header+0x8e/0x214
Oct 20 00:43:07  kernel: [] oom_kill_process+0x24e/0x3b0
Oct 20 00:43:07  kernel: [] ? find_lock_task_mm+0x56/0xc0
Oct 20 00:43:07  kernel: [] out_of_memory+0x4b6/0x4f0
Oct 20 00:43:07  kernel: [] __alloc_pages_nodemask+0xa95/0xb90
Oct 20 00:43:07  kernel: [] alloc_pages_vma+0x9a/0x140
Oct 20 00:43:07  kernel: [] handle_mm_fault+0xb85/0xf50
Oct 20 00:43:07  kernel: [] ? eventfd_ctx_read+0x67/0x210
Oct 20 00:43:07  kernel: [] __do_page_fault+0x152/0x420
Oct 20 00:43:07  kernel: [] do_page_fault+0x23/0x80
Oct 20 00:43:07  kernel: [] page_fault+0x28/0x30
Oct 20 00:43:07  kernel: Mem-Info:

Oct 20 00:43:07  kernel: active_anon:87309259 inactive_anon:444334 isolated_anon:0#012 active_file:101827 inactive_file:1066463 isolated_file:0#012 unevictable:0 dirty:16777 writeback:0 unstable:0#012 free:8521193 slab_reclaimable:179558 slab_unreclaimable:138991#012 mapped:14804 shmem:1180357 pagetables:195678 bounce:0#012 free_cma:0
Oct 20 00:43:07  kernel: Node 1 Normal free:44244kB min:45096kB low:56368kB high:67644kB active_anon:194740280kB inactive_anon:795780kB active_file:80kB inactive_file:100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:201326592kB managed:198168156kB mlocked:0kB dirty:4kB writeback:0kB mapped:2500kB shmem:2177236kB slab_reclaimable:158548kB slab_unreclaimable:199088kB kernel_stack:109552kB pagetables:478460kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:301 all_unreclaimable? yes
Oct 20 00:43:07  kernel: lowmem_reserve[]: 0 0 0 0
Oct 20 00:43:07  kernel: Node 1 Normal: 10147*4kB (UEM) 22*8kB (UE) 3*16kB (U) 11*32kB (UR) 8*64kB (R) 6*128kB (R) 2*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 44492kB
Oct 20 00:43:07  kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 20 00:43:07  kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 20 00:43:07  kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 20 00:43:07  kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 20 00:43:07  kernel: 2349178 total pagecache pages
Oct 20 00:43:07  kernel: 0 pages in swap cache
Oct 20 00:43:07  kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 20 00:43:07  kernel: Free swap  = 0kB
Oct 20 00:43:07  kernel: Total swap = 0kB
Oct 20 00:43:07  kernel: 100639322 pages RAM
Oct 20 00:43:07  kernel: 0 pages HighMem/MovableOnly
Oct 20 00:43:07  kernel: 1646159 pages reserved
Oct 20 00:43:07  kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name

Oct 20 00:43:07  kernel: Out of memory: Kill process 1409878 (qemu-kvm) score 666 or sacrifice child
Oct 20 00:43:07  kernel: Killed process 1409878 (qemu-kvm) total-vm:136850144kB, anon-rss:133909332kB, file-rss:4724kB
Oct 20 00:43:30  libvirtd: 2018-10-19 16:43:30.303+0000: 81546: error : qemuMonitorIO:705 : internal error: End of file from qemu monitor
Oct 20 00:43:30  systemd-machined: Machine qemu-7-c2683281-6cbd-4100-ba91-e221ed06ee60 terminated.
Oct 20 00:43:30  kvm: 6 guests now active

The detailed meminfo output and the per-process memory usage entries have been omitted from the log above.

The log shows that free memory in Node 1 Normal was down to only about 44 MB, which is what triggered the OOM, even though node0 still had plenty of unused memory at the time. The process that invoked the OOM killer was qemu-kvm with PID 1194284. By searching the logs we traced the problem to the VM 25913bd0-d869-4310-ab53-8df6855dd258, and its domain XML showed the following NUMA memory configuration:

  
<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>
  
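
For reference, one way to pull just this section out of a domain's live XML is virsh dumpxml piped through grep (the UUID below is the affected VM from this incident):

virsh dumpxml 25913bd0-d869-4310-ab53-8df6855dd258 | grep -A 2 '<numatune>'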

The information obtained through the virsh client was as follows:

virsh # numatune 25913bd0-d869-4310-ab53-8df6855dd258
numa_mode      : strict
numa_nodeset   : 1

It turns out that when mode is strict and placement is auto, a suitable NUMA node set is computed and assigned to the VM. This VM's memory was therefore confined to node1, and once node1 ran out of memory the OOM killer was triggered.
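
The same restriction can also be checked on the qemu-kvm process itself: the Mems_allowed_list field in /proc/<pid>/status should match the mems_allowed=1 value seen in the OOM log (the PID below is the one from this incident; the second line shows the expected output):

grep Mems_allowed_list /proc/1194284/status
# Mems_allowed_list:   1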

The Red Hat documentation describes the numatune modes as follows

(see the official documentation link)

  • strict

The strict policy means that the allocation fails if the memory cannot be allocated on the target node. Specifying a NUMA node list without defining a memory mode defaults to the strict policy.

  • interleave

Memory pages are allocated across the specified set of nodes in a round-robin fashion.

  • preferred

Memory is allocated from a single preferred node; if the preferred node does not have enough memory, memory is allocated from other nodes.

Important note

If memory is overcommitted in strict mode and the guest does not have enough swap space, the kernel will kill some guest processes to obtain enough memory. Red Hat therefore recommends using preferred with a single node (for example, nodeset='0') to avoid this situation.
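
A minimal sketch of applying that recommendation to an existing domain with virsh numatune (the domain name vm01 is a placeholder; --config writes the change into the persistent XML so it takes effect on the next start):

# switch the memory policy to preferred on a single node (node 0 here)
virsh numatune vm01 --mode preferred --nodeset 0 --config

# confirm the persistent setting
virsh numatune vm01 --config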

Reproducing the problem

We took a new host, created a VM on it, and modified the VM's numatune configuration to test how the strict and preferred modes behave under the following three configurations:

The interleave mode, which spreads memory across nodes, will certainly perform worse than the other two, and what we mainly wanted to test was whether strict and preferred trigger an OOM when a single node's memory is completely used up, so interleave was left out of the tests.

The three configurations

Configuration 1

mode set to strict, placement set to auto

  
<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>
  
Configuration 2

mode set to preferred, placement set to auto

  
<numatune>
  <memory mode='preferred' placement='auto'/>
</numatune>
  
Configuration 3

mode set to strict, nodeset set to 0-1

  
<numatune>
  <memory mode='strict' nodeset='0-1'/>
</numatune>
  
Test procedure

Fill up the memory of a single NUMA node on the host with memholder (a tool from the ssplatform2-tools rpm; the command used was numactl -i 0 memholder 64000 &), then run memholder inside the VM as well and, while the VM's memory usage keeps climbing, watch how its memory is distributed across the NUMA nodes. A sketch of this is shown below.
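
A rough sketch of the procedure as shell commands (memholder is the internal tool mentioned above; --membind is used here, which, like -i with a single node, confines the allocations to that node):

# 1. Fill one host node -- node 1 in the tests below, the node the test VM is bound to
numactl --membind=1 memholder 64000 &

# 2. Inside the guest, start memholder as well so the guest keeps allocating memory
#    (run on the VM; the size in MB is arbitrary for the test)
memholder 4096 &

# 3. On the host, watch where the guest's pages land and how much memory each node has left
watch -n 5 'numastat -c qemu-kvm; echo; numactl --hardware | grep free'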

Test results

Configuration 1

The information obtained on the virsh client side is shown below: placement is auto, but the qemu-kvm process has still been pinned to one specific node.

virsh # numatune 638abba7-bba8-498b-88d6-ddc70f2cef18
numa_mode      : strict
numa_nodeset   : 1

Initial memory usage of the VMs:

# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      0    693   694
1764062 (qemu-kv      0    366   366
---------------  ------ ------ -----
Total                 1   1060  1060

Host memory usage after memholder had filled up node1:

 numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58476 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 64 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

After memholder was started inside the VM and began consuming memory, the VMs' memory usage looked like this:

 numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      6    685   692
1764062 (qemu-kv      7   4670  4677
---------------  ------ ------ -----
Total                13   5355  5368

Host memory usage:

 numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58650 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 52181 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

At this point we found that a qemu-kvm process had already triggered an OOM: the memholder process occupying memory on the host had been killed by the kernel, and the host's memory had been freed up.

The log in messages was as follows:

Nov 13 21:07:07  kernel: qemu-kvm invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
Nov 13 21:07:07  kernel: qemu-kvm cpuset=emulator mems_allowed=1
Nov 13 21:07:07  kernel: CPU: 28 PID: 1332894 Comm: qemu-kvm Not tainted 4.4.36-1.el7.elrepo.x86_64 #1

Nov 13 21:07:07  kernel: Mem-Info:
Nov 13 21:07:07  kernel: active_anon:1986423 inactive_anon:403229 isolated_anon:0#012 active_file:116773 inactive_file:577075 isolated_file:0#012 unevictable:14364416 dirty:142 writeback:0 unstable:0#012 slab_reclaimable:61182 slab_unreclaimable:296489#012 mapped:14400991 shmem:15542531 pagetables:35749 bounce:0#012 free:14983912 free_pcp:0 free_cma:0
Nov 13 21:07:07  kernel: Node 1 Normal free:44952kB min:45120kB low:56400kB high:67680kB active_anon:5485032kB inactive_anon:1571408kB active_file:308kB inactive_file:0kB unevictable:57286820kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66044484kB mlocked:57286820kB dirty:48kB writeback:0kB mapped:57330444kB shmem:61948048kB slab_reclaimable:143752kB slab_unreclaimable:1107004kB kernel_stack:16592kB pagetables:129312kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2248 all_unreclaimable? yes
Nov 13 21:07:07  kernel: lowmem_reserve[]: 0 0 0 0
Nov 13 21:07:07  kernel: Node 1 Normal: 1018*4kB (UME) 312*8kB (UE) 155*16kB (UE) 34*32kB (UE) 293*64kB (UM) 53*128kB (U) 5*256kB (U) 1*512kB (U) 1*1024kB (E) 2*2048kB (UM) 2*4096kB (M) = 50776kB
Nov 13 21:07:07  kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 13 21:07:07  kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 13 21:07:07  kernel: 16236582 total pagecache pages
Nov 13 21:07:07  kernel: 0 pages in swap cache
Nov 13 21:07:07  kernel: Swap cache stats: add 0, delete 0, find 0/0
Nov 13 21:07:07  kernel: Free swap  = 0kB
Nov 13 21:07:07  kernel: Total swap = 0kB
Nov 13 21:07:07  kernel: 33530456 pages RAM
Nov 13 21:07:07  kernel: 0 pages HighMem/MovableOnly
Nov 13 21:07:07  kernel: 551723 pages reserved
Nov 13 21:07:07  kernel: 0 pages hwpoisoned

The process we were testing was 1764062, but the process that triggered the OOM was 1332894. The VM behind that process also used configuration 1, and the nodeset reported by the virsh client was likewise 1:

virsh # numatune c11a155a-95b0-4593-9ce5-f2a42dc0ccca
numa_mode      : strict
numa_nodeset   : 1

Configuration 2

The VM's numatune as reported by the virsh client:

virsh # numatune 638abba7-bba8-498b-88d6-ddc70f2cef18
numa_mode      : preferred
numa_nodeset   : 1

Initial memory usage of the VMs:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      6    691   698
1897916 (qemu-kv     17    677   694
---------------  ------ ------ -----
Total                24   1368  1392

Host memory usage after memholder had filled up node1:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58403 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 56 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

After memholder was started inside the VM and began consuming memory, the VMs' memory usage looked like this:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1332894 (qemu-kv      7    690   697
1897916 (qemu-kv   4012    682  4695
---------------  ------ ------ -----
Total              4019   1372  5391

Host memory usage:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 54395 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 55 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

From these results, even though the preferred node was node1, when node1 ran out of memory the process allocated memory from node0 instead, and no OOM was triggered.

Configuration 3

Note: process 1308480 is the qemu-kvm process of the VM under test.
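
As an aside, one way to map a qemu-kvm PID to its domain (or vice versa) is to match on the process command line, where libvirt passes the domain name and UUID; the UUID from the original incident is used here purely as an example:

pgrep -af 25913bd0-d869-4310-ab53-8df6855dd258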

Initial memory usage of the VMs:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1308480 (qemu-kv    141    584   725
1332894 (qemu-kv      0    707   708
---------------  ------ ------ -----
Total               141   1291  1432

Host memory usage:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 58241 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 131 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

After memholder was started inside the VM and began consuming memory, the VMs' memory usage looked like this:

[@ ~]# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Total
---------------  ------ ------ -----
1308480 (qemu-kv   4017    682  4699
1332894 (qemu-kv      7    681   688
---------------  ------ ------ -----
Total              4024   1363  5387

Host memory usage:

[@ ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64326 MB
node 0 free: 54410 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64496 MB
node 1 free: 55 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

Summary

These tests show that neither configuration 2 nor configuration 3 leads to an OOM caused by unbalanced memory usage across the two NUMA nodes; which of the two performs better still needs further testing.
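
As a follow-up, a small sketch for auditing which numatune policy each running domain on a host currently uses, so that strict/auto guests can be found before they run into this problem:

# list the numatune settings of every running domain on this host
for dom in $(virsh list --name); do
    echo "== $dom =="
    virsh numatune "$dom"
done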

References

www.jianshu.com/p/c2e7d3682…
