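Over the past few days, some of my Pods kept crashing, and the OS syslog showed that the OOM killer had killed the container processes. I did some research to find out how these things work.
Pod memory limit and cgroup memory settings
Create a Pod with its memory limit set to 128Mi: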
kubectl run --restart=Never --rm -it --image=ubuntu --limits='memory=128Mi' -- sh
If you don't see a command prompt, try pressing enter.
root@sh:/#
Open another terminal and get the Pod's uid:
kubectl get pods sh -o yaml | grep uid
uid: 98f587f8-8994-4eb4-a7b6-d62be890cc08
Then find the node the Pod is running on:
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
sh 1/1 Running 0 52s 10.107.1.136 10.67.62.22
On the server where the Pod is running (10.67.62.22), check the cgroup settings based on the Pod's uid.
First, change into the Pod's cgroup directory:
cd /sys/fs/cgroup/memory/kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08
Running ls there shows the following:
ls
bdc2f6d0b9791f9a8b86c1e877c830e387170c955cdc09866350344b19e08a6e memory.kmem.failcnt memory.kmem.tcp.usage_in_bytes memory.memsw.usage_in_bytes memory.swappiness
cgroup.clone_children memory.kmem.limit_in_bytes memory.kmem.usage_in_bytes memory.move_charge_at_immigrate memory.usage_in_bytes
cgroup.event_control memory.kmem.max_usage_in_bytes memory.limit_in_bytes memory.numa_stat memory.use_hierarchy
cgroup.procs memory.kmem.slabinfo memory.max_usage_in_bytes memory.oom_control notify_on_release
ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88 memory.kmem.tcp.failcnt memory.memsw.failcnt memory.pressure_level tasks
memory.failcnt memory.kmem.tcp.limit_in_bytes memory.memsw.limit_in_bytes memory.soft_limit_in_bytes
memory.force_empty memory.kmem.tcp.max_usage_in_bytes memory.memsw.max_usage_in_bytes memory.stat
Check the limit:
cat memory.limit_in_bytes
134217728
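The number 134217728 is exactly 128Mi (128 * 1024 * 1024). So it is now clear that Kubernetes enforces the memory limit through cgroups. Once the Pod's memory usage hits the limit and cannot be reduced, the kernel's OOM killer terminates a process in that cgroup.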
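As a quick cross-check, the cgroup directory and the expected limit can be derived from the Pod's QoS class, uid and memory limit. This is a minimal sketch: the helper names pod_cgroup_dir and mi_to_bytes are made up for illustration, and the path layout is assumed from the directory shown above (cgroup v1 memory controller). Other cgroup drivers, or cgroup v2, lay things out differently.

```shell
# Build the cgroup v1 memory directory for a Pod (layout assumed from the path above).
# $1 = QoS class directory (e.g. burstable), $2 = Pod uid
pod_cgroup_dir() {
    echo "/sys/fs/cgroup/memory/kubepods/$1/pod$2"
}

# Convert a quantity given in Mi to bytes.
mi_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

dir=$(pod_cgroup_dir burstable 98f587f8-8994-4eb4-a7b6-d62be890cc08)
expected=$(mi_to_bytes 128)
echo "limit file: $dir/memory.limit_in_bytes"
echo "expected:   $expected bytes"

# On the node itself, compare against the real value if the file is readable.
if [ -r "$dir/memory.limit_in_bytes" ]; then
    actual=$(cat "$dir/memory.limit_in_bytes")
    if [ "$actual" -eq "$expected" ]; then
        echo "match"
    else
        echo "mismatch: $actual"
    fi
fi
```

Run anywhere else, the script only prints the expected path and byte count; the comparison is skipped when the file does not exist.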
Stress test
Let's install the stress tool in the Pod through the open shell session.
root@sh:/# apt update; apt install -y stress
Meanwhile, watch the syslog on the node by running dmesg -Tw.
First, run the stress tool with 100M, which stays within the memory limit.
root@sh:/# stress --vm 1 --vm-bytes 100M &
[1] 253
root@sh:/# stress: info: [253] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
Then start a second stress run:
root@sh:/# stress --vm 1 --vm-bytes 100M
stress: info: [256] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [253] (415) <-- worker 254 got signal 9
stress: WARN: [253] (417) now reaping child worker processes
stress: FAIL: [253] (451) failed run completed in 66s
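The first stress run (process id 253 inside the container) lost its worker, process 254, to signal 9 (SIGKILL) right after the second run started.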
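The kill follows from simple arithmetic on the two runs. The sketch below assumes that the M suffix passed to stress means MiB and ignores the small footprint of bash and the stress parent processes; check_demand is a hypothetical helper name.

```shell
# Compare the combined demand of the stress workers with the cgroup limit.
# $1 = number of workers, $2 = MiB per worker, $3 = limit in MiB
check_demand() {
    total=$(( $1 * $2 ))
    if [ "$total" -gt "$3" ]; then
        echo "${total}Mi > ${3}Mi: over the limit"
    else
        echo "${total}Mi <= ${3}Mi: fits"
    fi
}

check_demand 1 100 128   # first run alone
check_demand 2 100 128   # both runs together
```

One worker fits under 128Mi, two do not, so some process in the cgroup has to go once the second worker touches its memory.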
At this point, the syslog shows:
[Thu May 21 08:48:41 2020] stress invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=999
[Thu May 21 08:48:41 2020] stress cpuset=ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88 mems_allowed=0-1
[Thu May 21 08:48:41 2020] CPU: 22 PID: 5222 Comm: stress Tainted: G O ---- ------- 3.10.0-862.14.1.5.h328.eulerosv2r7.x86_64 #1
[Thu May 21 08:48:41 2020] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
[Thu May 21 08:48:41 2020] Call Trace:
[Thu May 21 08:48:41 2020] [] dump_stack+0x19/0x1b
[Thu May 21 08:48:41 2020] [] dump_header+0x90/0x229
[Thu May 21 08:48:41 2020] [] ? find_lock_task_mm+0x56/0xc0
[Thu May 21 08:48:41 2020] [] oom_kill_process+0x254/0x3d0
[Thu May 21 08:48:41 2020] [] mem_cgroup_oom_synchronize+0x553/0x580
[Thu May 21 08:48:41 2020] [] ? mem_cgroup_charge_common+0xc0/0xc0
[Thu May 21 08:48:41 2020] [] pagefault_out_of_memory+0x14/0x90
[Thu May 21 08:48:41 2020] [] mm_fault_error+0x6a/0x157
[Thu May 21 08:48:41 2020] [] __do_page_fault+0x4a6/0x4f0
[Thu May 21 08:48:41 2020] [] trace_do_page_fault+0x56/0x150
[Thu May 21 08:48:41 2020] [] do_async_page_fault+0x22/0xf0
[Thu May 21 08:48:41 2020] [] async_page_fault+0x28/0x30
[Thu May 21 08:48:41 2020] Task in /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08/ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88 killed as a result of limit of /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08
[Thu May 21 08:48:41 2020] memory: usage 131072kB, limit 131072kB, failcnt 6711
[Thu May 21 08:48:41 2020] memory+swap: usage 131072kB, limit 9007199254740988kB, failcnt 0
[Thu May 21 08:48:41 2020] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[Thu May 21 08:48:41 2020] Memory cgroup stats for /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu May 21 08:48:41 2020] Memory cgroup stats for /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08/bdc2f6d0b9791f9a8b86c1e877c830e387170c955cdc09866350344b19e08a6e: cache:0KB rss:1656KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:1656KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu May 21 08:48:41 2020] Memory cgroup stats for /kubepods/burstable/pod98f587f8-8994-4eb4-a7b6-d62be890cc08/ffe85090722bdfbb94fab8a7b58ce714191c3456c2b405240afd58756664cc88: cache:100KB rss:129316KB rss_huge:98304KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:129268KB inactive_file:0KB active_file:0KB unevictable:0KB
[Thu May 21 08:48:41 2020] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[Thu May 21 08:48:41 2020] [25078] 0 25078 1104 397 7 0 -998 pause
[Thu May 21 08:48:41 2020] [25519] 0 25519 4624 104 14 0 999 bash
[Thu May 21 08:48:41 2020] [ 5221] 0 5221 2057 19 8 0 999 stress
[Thu May 21 08:48:41 2020] [ 5222] 0 5222 27658 17581 42 0 999 stress
[Thu May 21 08:48:41 2020] [ 6772] 0 6772 2057 19 8 0 999 stress
[Thu May 21 08:48:41 2020] [ 6774] 0 6774 27658 14566 36 0 999 stress
[Thu May 21 08:48:41 2020] Memory cgroup out of memory: Kill process 5222 (stress) score 1513 or sacrifice child
[Thu May 21 08:48:41 2020] Killed process 5222 (stress) total-vm:110632kB, anon-rss:70320kB, file-rss:4kB, shmem-rss:0kB
Process ID 5222 on the host was killed by the OOM killer. Let's look more closely at the last part of the syslog:
[Thu May 21 08:48:41 2020] Memory cgroup out of memory: Kill process 5222 (stress) score 1513 or sacrifice child
[Thu May 21 08:48:41 2020] Killed process 5222 (stress) total-vm:110632kB, anon-rss:70320kB, file-rss:4kB, shmem-rss:0kB
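For this Pod, several processes are candidates for the OOM killer. The pause container, which holds the Pod's network namespace, has an oom_score_adj of -998, which makes it very unlikely to be killed. All other processes in the container have an oom_score_adj of 999. We can check this value against the following formula from the Kubernetes documentation: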
min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
Find the node's allocatable memory:
kubectl describe nodes 10.67.62.22 | grep Allocatable -A 7
Allocatable:
attachable-volumes-csi-disk.csi.everest.io: 58
cce/eni: 15
cpu: 31850m
ephemeral-storage: 28411501317
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 60280908Ki
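If the memory request is not set, it defaults to the limit, so the request here is 128Mi. The formula calls for the machine's total memory capacity; using the Allocatable value above as an approximation gives:
min(max(2, 1000 - (1000 * 128 * 1024) / 60280908), 999)
which is about 998. Total capacity is at least as large as Allocatable, and usually larger, so the real quotient is smaller, which pushes the result toward the observed oom_score_adj of 999.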
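The formula quoted from the Kubernetes documentation can be evaluated with shell integer arithmetic, keeping the factor of 1000 in the numerator. This is a sketch under stated assumptions: burstable_oom_score_adj is a hypothetical helper name, the truncating division is a property of shell arithmetic rather than something the source confirms about the kubelet, and the 64Gi capacity in the second call is a made-up value used only to show how the result moves when capacity is larger than Allocatable.

```shell
# oom_score_adj for a Burstable container, per the formula quoted above.
# $1 = memory request, $2 = machine memory capacity (both in the same unit, here Ki)
burstable_oom_score_adj() {
    adj=$(( 1000 - (1000 * $1) / $2 ))
    if [ "$adj" -lt 2 ]; then adj=2; fi       # max(2, ...)
    if [ "$adj" -gt 999 ]; then adj=999; fi   # min(..., 999)
    echo "$adj"
}

request_ki=$(( 128 * 1024 ))   # 128Mi expressed in Ki

# Allocatable from "kubectl describe nodes" used as a stand-in for capacity.
burstable_oom_score_adj "$request_ki" 60280908    # -> 998

# Hypothetical node with 64Gi of total capacity (67108864Ki).
burstable_oom_score_adj "$request_ki" 67108864    # -> 999
```

With a request this small relative to the node, the result sits at or just below the 999 cap either way, so the container is among the first candidates when memory runs out.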
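Note that all processes in the container share the same oom_score_adj value. The OOM killer computes an OOM score from each process's memory usage and adjusts it with oom_score_adj. In the end it killed the stress process using the most memory: the first run's worker, host process id 5222.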
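The score of 1513 in the log can be reproduced from the process table. The sketch below assumes the badness heuristic of 3.10-era kernels: points = rss + nr_ptes + swapents + oom_score_adj * (totalpages / 1000), all counted in pages, with the logged score being points * 1000 / totalpages. It also assumes 4kB pages, that totalpages is the cgroup limit of 131072kB, and that no extra adjustment is applied for root-owned processes. oom_score is a hypothetical helper name.

```shell
# Approximate the score printed in the "Kill process ... score N" line.
# $1 = rss (pages), $2 = nr_ptes, $3 = swapents, $4 = oom_score_adj, $5 = total pages
oom_score() {
    points=$(( $1 + $2 + $3 + $4 * ($5 / 1000) ))
    echo $(( points * 1000 / $5 ))
}

total_pages=$(( 131072 / 4 ))   # limit 131072kB with 4kB pages -> 32768 pages

oom_score 17581 42 0 999 "$total_pages"   # PID 5222, first stress worker  -> 1513
oom_score 14566 36 0 999 "$total_pages"   # PID 6774, second stress worker -> 1421
oom_score 104 14 0 999 "$total_pages"     # PID 25519, bash                -> 979
```

Because every process carries the same oom_score_adj, the ranking comes down to resident memory, and PID 5222 ends up with the highest score, matching the process the kernel chose.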
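Conclusion
We only covered the memory limit of a Pod with the Burstable QoS class in detail; you can test the other classes yourself.
Pods of different QoS classes get different oom_score_adj values.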