On one node of a k8s cluster, newly created pods never came up: their status stayed at ContainerCreating, and describing a pod showed that its sandbox could not be created.
kubectl describe pod xxxxx -n xxx
The events were as follows:
Normal SandboxChanged 22m (x90 over 32m) kubelet, node4 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 32m (x7 over 32m) kubelet, node4 Failed create pod sandbox.
Warning FailedSync 27m (x48 over 32m) kubelet, node4 Error syncing pod
Log in to the node and check the system log:
tail -f /var/log/messages
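The log output is shown below. It indicates that the sandbox could not be created because the node's memory was filled up by buffer/cache: the kernel reports a page allocation failure (order:6) while runc is creating the new network namespace, and kmem_cache_create for nf_conntrack fails with error -12 (ENOMEM).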
Jan 25 15:12:52 node4 kubelet: W0125 15:12:52.446651 20224 cni.go:265] CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container "a722b111fe3a8e78d1d7ee49280ab743d80c6e6ba955195b65bcfe60d5cf3264"
Jan 25 15:12:53 node4 docker: time="2019-01-25T15:12:53.102069754+08:00" level=error msg="Handler for POST /v1.26/containers/a722b111fe3a8e78d1d7ee49280ab743d80c6e6ba955195b65bcfe60d5cf3264/stop returned error: Container a722b111fe3a8e78d1d7ee49280ab743d80c6e6ba955195b65bcfe60d5cf3264 is already stopped"
Jan 25 15:12:53 node4 kernel: runc:[1:CHILD]: page allocation failure: order:6, mode:0x10c0d0
Jan 25 15:12:53 node4 kernel: CPU: 2 PID: 26598 Comm: runc:[1:CHILD] Tainted: G ------------ T 3.10.0-693.5.2.el7.x86_64 #1
Jan 25 15:12:53 node4 kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
Jan 25 15:12:53 node4 kernel: 000000000010c0d0 000000008bf84052 ffff88003d21fa20 ffffffff816a3e51
Jan 25 15:12:53 node4 kernel: ffff88003d21fab0 ffffffff81188820 0000000000000000 ffff88023ffd9000
Jan 25 15:12:53 node4 kernel: 0000000000000006 000000000010c0d0 ffff88003d21fab0 000000008bf84052
Jan 25 15:12:53 node4 kernel: Call Trace:
Jan 25 15:12:53 node4 kernel: [] dump_stack+0x19/0x1b
Jan 25 15:12:53 node4 kernel: [] warn_alloc_failed+0x110/0x180
Jan 25 15:12:53 node4 kernel: [] __alloc_pages_slowpath+0x6b6/0x724
Jan 25 15:12:53 node4 kernel: [] __alloc_pages_nodemask+0x405/0x420
Jan 25 15:12:53 node4 kernel: [] alloc_pages_current+0x98/0x110
Jan 25 15:12:53 node4 kernel: [] __get_free_pages+0xe/0x40
Jan 25 15:12:53 node4 kernel: [] kmalloc_order_trace+0x2e/0xa0
Jan 25 15:12:53 node4 kernel: [] __kmalloc+0x211/0x230
Jan 25 15:12:53 node4 kernel: [] memcg_register_cache+0xb9/0xe0
Jan 25 15:12:53 node4 kernel: [] kmem_cache_create_memcg+0x110/0x230
Jan 25 15:12:53 node4 kernel: [] kmem_cache_create+0x2b/0x30
Jan 25 15:12:53 node4 kernel: [] nf_conntrack_init_net+0x101/0x250 [nf_conntrack]
Jan 25 15:12:53 node4 kernel: [] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]
Jan 25 15:12:53 node4 kernel: [] ops_init+0x41/0x150
Jan 25 15:12:53 node4 kernel: [] setup_net+0xa3/0x160
Jan 25 15:12:53 node4 kernel: [] copy_net_ns+0xb5/0x180
Jan 25 15:12:53 node4 kernel: [] create_new_namespaces+0xf9/0x180
Jan 25 15:12:53 node4 kernel: [] unshare_nsproxy_namespaces+0x5a/0xc0
Jan 25 15:12:53 node4 kernel: [] SyS_unshare+0x193/0x300
Jan 25 15:12:53 node4 kernel: [] system_call_fastpath+0x16/0x1b
Jan 25 15:12:53 node4 kernel: Mem-Info:
Jan 25 15:12:53 node4 kernel: active_anon:422434 inactive_anon:282 isolated_anon:0#012 active_file:389880 inactive_file:422271 isolated_file:0#012 unevictable:0 dirty:746 writeback:0 unstable:0#012 slab_reclaimable:404737 slab_unreclaimable:255217#012 mapped:62492 shmem:1127 pagetables:6817 bounce:0#012 free:51212 free_pcp:6 free_cma:0
Jan 25 15:12:53 node4 kernel: Node 0 DMA free:15900kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 25 15:12:53 node4 kernel: lowmem_reserve[]: 0 2814 7805 7805
Jan 25 15:12:53 node4 kernel: Node 0 DMA32 free:73680kB min:24324kB low:30404kB high:36484kB active_anon:547328kB inactive_anon:340kB active_file:538904kB inactive_file:565980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129216kB managed:2884232kB mlocked:0kB dirty:520kB writeback:0kB mapped:88968kB shmem:1488kB slab_reclaimable:754588kB slab_unreclaimable:332604kB kernel_stack:12608kB pagetables:7456kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 25 15:12:53 node4 kernel: lowmem_reserve[]: 0 0 4990 4990
Jan 25 15:12:53 node4 kernel: Node 0 Normal free:115268kB min:43124kB low:53904kB high:64684kB active_anon:1142408kB inactive_anon:788kB active_file:1020616kB inactive_file:1123104kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5242880kB managed:5110056kB mlocked:0kB dirty:2464kB writeback:0kB mapped:161000kB shmem:3020kB slab_reclaimable:864360kB slab_unreclaimable:688256kB kernel_stack:16928kB pagetables:19812kB unstable:0kB bounce:0kB free_pcp:24kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 25 15:12:53 node4 kernel: lowmem_reserve[]: 0 0 0 0
Jan 25 15:12:53 node4 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
Jan 25 15:12:53 node4 kernel: Node 0 DMA32: 2019*4kB (UEM) 781*8kB (UEM) 1373*16kB (UEM) 897*32kB (UEM) 139*64kB (UEM) 1*128kB (U) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 74020kB
Jan 25 15:12:53 node4 kernel: Node 0 Normal: 12464*4kB (UEM) 2962*8kB (UEM) 1229*16kB (UEM) 356*32kB (UEM) 129*64kB (UEM) 16*128kB (UEM) 2*256kB (E) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 115424kB
Jan 25 15:12:53 node4 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 25 15:12:53 node4 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 25 15:12:53 node4 kernel: 813284 total pagecache pages
Jan 25 15:12:53 node4 kernel: 0 pages in swap cache
Jan 25 15:12:53 node4 kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 25 15:12:53 node4 kernel: Free swap = 0kB
Jan 25 15:12:53 node4 kernel: Total swap = 0kB
Jan 25 15:12:53 node4 kernel: 2097022 pages RAM
Jan 25 15:12:53 node4 kernel: 0 pages HighMem/MovableOnly
Jan 25 15:12:53 node4 kernel: 94473 pages reserved
Jan 25 15:12:53 node4 kernel: kmem_cache_create(nf_conntrack_ffff88018f2fbcc0) failed with error -12
Jan 25 15:12:53 node4 kernel: CPU: 2 PID: 26598 Comm: runc:[1:CHILD] Tainted: G ------------ T 3.10.0-693.5.2.el7.x86_64 #1
Jan 25 15:12:53 node4 kernel: Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
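To see why an order:6 request is hard to satisfy, it can help to count how many large contiguous free blocks each zone still has. The sketch below is only an illustration: the function names order_to_kb and free_blocks_at_least are made up here, 4 kB pages are assumed (the x86_64 default), and the input is assumed to follow the /proc/buddyinfo layout, where the columns after the zone name are the free-block counts for order 0, 1, 2, and so on.

```shell
#!/usr/bin/env bash

# Size in kB of a single allocation of the given order.
# Assumption: 4 kB pages unless a second argument says otherwise.
order_to_kb() {
    local order=$1 page_kb=${2:-4}
    echo $(( (1 << order) * page_kb ))
}

# Read /proc/buddyinfo-style lines on stdin and print, per zone, how many
# free blocks of at least the given order remain.
# Line layout: "Node 0, zone Normal <order0> <order1> ...", so the count
# for order k sits in field 5 + k.
free_blocks_at_least() {
    awk -v order="$1" '{
        n = 0
        for (i = 5 + order; i <= NF; i++) n += $i
        print $4, n
    }'
}

# An order:6 request needs this many kB of physically contiguous memory.
echo "order 6 = $(order_to_kb 6) kB"

# On a live Linux node, show the current state of every zone.
if [ -r /proc/buddyinfo ]; then
    free_blocks_at_least 6 < /proc/buddyinfo
fi
```

Fed with the Normal zone counts from the log above (12464 2962 1229 356 129 16 2 0 0 0 0), free_blocks_at_least 6 reports only 2 blocks, and the DMA32 zone has none of 256 kB or larger.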
The problem can be resolved as follows:
Add the following line to /etc/sysctl.conf:
vm.zone_reclaim_mode = 1
Then run sysctl -p
This parameter tells the kernel to reclaim buffer/cache memory directly when a memory zone runs short of free pages.
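If the change has to be rolled out to several nodes, editing the file by hand can leave duplicate or conflicting entries. The following is a minimal sketch of an idempotent edit, not a definitive procedure: set_sysctl_conf is a hypothetical helper name introduced here, and it assumes the file uses plain "key = value" lines.

```shell
#!/usr/bin/env bash

# Hypothetical helper: ensure "key = value" appears exactly once in a
# sysctl config file. Existing assignments of the same key are dropped,
# all other lines (including comments) are kept, and the new assignment
# is appended at the end.
set_sysctl_conf() {
    local key=$1 value=$2 file=$3 tmp
    tmp=$(mktemp) || return 1
    if [ -f "$file" ]; then
        # Compare the text before "=" literally, so dots in the key
        # are not treated as regex wildcards.
        awk -v key="$key" '{
            line = $0
            sub(/^[ \t]+/, "", line)
            split(line, kv, /[ \t]*=/)
            if (kv[1] != key) print
        }' "$file" > "$tmp"
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage on the affected node (as root); left commented out so that
# sourcing this file changes nothing by itself:
#   set_sysctl_conf vm.zone_reclaim_mode 1 /etc/sysctl.conf
#   sysctl -p                          # reload /etc/sysctl.conf
#   sysctl -n vm.zone_reclaim_mode     # print the value now in effect
```

Running the helper twice leaves a single vm.zone_reclaim_mode line in the file, so it is safe to repeat.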
After that the problem was resolved: running get pod again showed that the pod status had changed to Running.