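A few days ago, the push process in our company's joint project with Shenzhou Zuche (神州租车) kept stopping on its own.
So I checked the server to find the cause.
First I checked memory and found there was still some free: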
free
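The same check can be scripted. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` exposes `MemFree`, `Buffers`, `Cached` and `SwapFree`; the function names are made up for illustration, and adding buffers and cache to free memory is only a rough estimate of what is reclaimable.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def usable_memory_kb(meminfo):
    """Rough estimate of usable memory: free plus buffers plus page cache."""
    return (meminfo.get("MemFree", 0)
            + meminfo.get("Buffers", 0)
            + meminfo.get("Cached", 0))


def swap_exhausted(meminfo):
    """True when swap is configured but none of it is free."""
    return meminfo.get("SwapTotal", 0) > 0 and meminfo.get("SwapFree", 0) == 0
```

On the server, the input would come from reading `/proc/meminfo`, for example `parse_meminfo(open("/proc/meminfo").read())`.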
Next I checked I/O usage: I/O turned out to be at 100%, with idle stuck at 0%:
top
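For reference, the idle and I/O-wait percentages that `top` prints can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. This is a sketch under the assumption that the figure read from `top` was the `wa` (iowait) field; iowait is CPU time spent waiting on I/O, which is not the same thing as per-device utilization. The function names are illustrative only.

```python
def cpu_times(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat-style text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def idle_and_iowait_percent(before, after):
    """Compute (idle %, iowait %) between two cpu_times() samples.

    Only the first eight counters (user, nice, system, idle, iowait, irq,
    softirq, steal) are summed, because guest time is already included
    in user time.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    idle = 100.0 * deltas[3] / total
    iowait = 100.0 * deltas[4] / total
    return idle, iowait
```

In practice the two samples would be read from `/proc/stat` a few seconds apart.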
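Ouch. Only then did I realize the monitoring had never been set up properly, so no alert had fired.
I immediately tracked down the process hogging I/O and went through its logs to fix the I/O problem.
But why was the push process stopping on its own?
Check the system log: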
less /var/log/messages
Sep 18 18:54:18 SZZC-DC-01 kernel: hv_utils: Shutdown request received - graceful shutdown initiated
Sep 18 21:39:36 SZZC-DC-01 kernel: node invoked oom-killer: gfp_mask=0x800d0, order=0, oom_adj=0, oom_score_adj=0
Sep 18 21:39:36 SZZC-DC-01 kernel: node cpuset=/ mems_allowed=0
Sep 18 21:39:36 SZZC-DC-01 kernel: Free swap = 0kB
Sep 18 21:39:36 SZZC-DC-01 kernel: [ 1575] 501 1575 600373 326996 5 0 0 node
Sep 18 21:39:36 SZZC-DC-01 kernel: Killed process 1575, UID 501, (node) total-vm:2401492kB, anon-rss:1306144kB, file-rss:1840kB
Sep 18 22:48:21 SZZC-DC-01 kernel: node invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Sep 18 22:48:21 SZZC-DC-01 kernel: node cpuset=/ mems_allowed=0
Sep 18 22:48:21 SZZC-DC-01 kernel: Free swap = 0kB
Sep 18 22:48:21 SZZC-DC-01 kernel: [ 1572] 501 1572 615537 386009 4 0 0 node
Sep 18 22:48:21 SZZC-DC-01 kernel: Out of memory: Kill process 1502 (node) score 173 or sacrifice child
Sep 18 22:48:21 SZZC-DC-01 kernel: Killed process 1572, UID 501, (node) total-vm:2462148kB, anon-rss:1541224kB, file-rss:2812kB
Sep 18 22:48:22 SZZC-DC-01 kernel: [ 1701] 501 1572 615537 386010 1 0 0 node
Sep 18 22:48:22 SZZC-DC-01 kernel: Out of memory: Kill process 1502 (node) score 173 or sacrifice child
Sep 18 22:48:22 SZZC-DC-01 kernel: Killed process 1701, UID 501, (node) total-vm:2462148kB, anon-rss:1541228kB, file-rss:2812kB
Sep 18 23:55:42 SZZC-DC-01 root: [euid=user01]:root pts/0 2015-09-18 22:43 (salt):[/data/shenzhouToBeiHang]sh start_push.sh
Sep 19 02:52:33 SZZC-DC-01 kernel: node invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Sep 19 02:52:33 SZZC-DC-01 kernel: node cpuset=/ mems_allowed=0
Sep 19 02:52:33 SZZC-DC-01 kernel: Free swap = 0kB
Sep 19 02:52:33 SZZC-DC-01 kernel: [ 1578] 501 1578 630976 397473 3 0 0 node
Sep 19 02:52:33 SZZC-DC-01 kernel: Out of memory: Kill process 1502 (node) score 166 or sacrifice child
Sep 19 02:52:33 SZZC-DC-01 kernel: Killed process 1578, UID 501, (node) total-vm:2523904kB, anon-rss:1589740kB, file-rss:148kB
Sep 19 06:51:44 SZZC-DC-01 kernel: node invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Sep 19 06:51:44 SZZC-DC-01 kernel: Killed process 1576, UID 501, (node) total-vm:2588936kB, anon-rss:1631824kB, file-rss:120kB
Sep 19 08:55:26 SZZC-DC-01 kernel: node invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Sep 19 08:55:26 SZZC-DC-01 kernel: Killed process 1581, UID 501, (node) total-vm:2749640kB, anon-rss:1822012kB, file-rss:72k
Sep 19 10:09:54 SZZC-DC-01 kernel: hv_utils: Shutdown request received - graceful shutdown initiated
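Picking the kills out of a long log by eye is error-prone, so a small parser can help. The sketch below assumes the syslog layout shown in the excerpt above (timestamp, host, `kernel:`, then `Killed process ...`); the regular expression and the function name are my own and may need adjusting for other kernel versions.

```python
import re

# Matches lines such as:
# Sep 18 21:39:36 SZZC-DC-01 kernel: Killed process 1575, UID 501, (node) total-vm:2401492kB, anon-rss:1306144kB, ...
KILLED_RE = re.compile(
    r"^(?P<when>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Killed process (?P<pid>\d+), UID (?P<uid>\d+), \((?P<name>[^)]+)\)\s+"
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def killed_processes(log_text):
    """Return one dict per 'Killed process' line found in the log text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.match(line)
        if match:
            kills.append({
                "when": match.group("when"),
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "total_vm_kb": int(match.group("total_vm")),
                "anon_rss_kb": int(match.group("anon_rss")),
            })
    return kills
```

Fed the contents of `/var/log/messages`, this yields one record per kill, which makes it easy to see that every victim here was a `node` process with well over 1 GB of anonymous memory.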
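Found the cause. To begin with, the server had been shut down and restarted by hand once; I asked around afterwards and it turned out to be a planned operation: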
hv_utils: Shutdown request received - graceful shutdown initiated
After that, however, the server was frequently short on memory, with swap fully used up:
node cpuset=/ mems_allowed=0
Free swap = 0kB
So the kernel's OOM killer kicked in (when system memory is exhausted, it applies its own algorithm to selectively kill some processes).
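/proc/[pid]/oom_adj ## Weight for this pid being killed by the OOM killer, in the range [-17, 15]. A higher weight makes the process more likely to be chosen; -17 exempts it from being killed. Newer kernels deprecate this file in favor of /proc/[pid]/oom_score_adj, whose range is [-1000, 1000].
/proc/[pid]/oom_score ## Current kill score for this pid. The higher the score, the more likely the process is to be killed. The kernel derives it mainly from the process's memory usage, adjusted by oom_adj, and it is the OOM killer's main criterion.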
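To see which processes the OOM killer would currently favor, the scores can be read for every pid and sorted. This is a minimal sketch, assuming a Linux `/proc` layout; the `proc_root` parameter and the function names are hypothetical and exist only so the code can be pointed at a test directory.

```python
import os


def read_int(path):
    """Read a single integer from a file, or return None if that fails."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def rank_by_oom_score(proc_root="/proc"):
    """Return (oom_score, pid) pairs, highest score first."""
    ranked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        score = read_int(os.path.join(proc_root, entry, "oom_score"))
        if score is None:
            continue  # the process exited or the file is unreadable
        ranked.append((score, int(entry)))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked
```

The first few entries of `rank_by_oom_score()` are the likeliest victims the next time memory runs out.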
Note: sysctl has two related configurable options:
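vim /etc/sysctl.conf
vm.panic_on_oom = 0 # whether the kernel panics outright when memory runs out (0 = do not panic, let the OOM killer run)
vm.oom_kill_allocating_task = 1 # whether the OOM killer kills the task that is currently requesting memory (1 = yes, instead of searching for the highest score)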
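How these two options combine can be spelled out in a few lines. The sketch below parses sysctl.conf-style text as annotated above (it strips everything after a `#`, which suits this snippet but is my own simplification) and reports which behavior results. The policy labels it returns are invented for illustration, and it assumes both options default to 0 when absent.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, ignoring blank lines and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith(";") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def describe_oom_policy(settings):
    """Summarize what happens on out-of-memory (labels are illustrative)."""
    if settings.get("vm.panic_on_oom", "0") != "0":
        return "panic"
    if settings.get("vm.oom_kill_allocating_task", "0") != "0":
        return "kill-allocating-task"
    return "kill-highest-score"
```

With the two values shown above, `describe_oom_policy` reports `kill-allocating-task`.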
It killed the process using the most memory, and that is how the push process got killed:
node invoked oom-killer: gfp_mask=0x800d0, order=0, oom_adj=0, oom_score_adj=0
Sep 18 21:39:36 SZZC-DC-01 kernel: [ 1575] 501 1575 600373 326996 5 0 0 node
Sep 18 21:39:36 SZZC-DC-01 kernel: Killed process 1575, UID 501, (node) total-vm:2401492kB, anon-rss:1306144kB, file-rss:1840kB
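Follow-up fixes:
Upgraded the server's configuration for more capacity;
Fixed the root cause of the process's high memory usage (at the time, it was a problem with the database connection);
Set the push process's /proc/[pid]/oom_adj as low as practical so that it is not the first to be killed;
Fixed the monitoring.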
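The oom_adj adjustment from the list above could be scripted roughly as follows. This is a sketch, not the script we actually ran: the function name and the `proc_root` parameter are hypothetical, and writing a lower value to the real `/proc` generally requires root privileges.

```python
import os


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an oom_adj value for one process and return the path written.

    Valid values range from -17 (never kill) to 15 (kill first).
    """
    if not -17 <= value <= 15:
        raise ValueError("oom_adj must be between -17 and 15")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as handle:
        handle.write(str(value))
    return path
```

The value is tied to the running process, so it would need to be applied again whenever the push process is restarted, for example from the start script.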
References:
http://blog.csdn.net/gugemichael/article/details/24017515
http://www.cnblogs.com/kerrycode/p/3889912.html