Troubleshooting sudden Out of Memory kills on a server

Recently, Tomcat on some servers in one of our data centers has been dying suddenly and without any warning. Checking catalina.out and the application logs turned up nothing.

However, inspecting /var/log/messages (or the output of the dmesg command) shows entries like the following:

[1884622.659293] salt invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[1884622.660034] salt cpuset=/ mems_allowed=0
[1884622.660927] CPU: 3 PID: 30519 Comm: salt Not tainted 3.10.0-327.36.3.el7.x86_64 #1
[1884622.661989] Hardware name: RDO OpenStack Compute, BIOS 1.9.1-5.el7_3.3 04/01/2014
[1884622.662822]  ffff880328f25080 000000000c51a1e0 ffff88032cd6f790 ffffffff81636431
[1884622.663701]  ffff88032cd6f820 ffffffff816313cc ffff8800357ff4b0 ffff8800357ff4c8
[1884622.664513]  0000000000000202 ffff880328f25080 ffff88032cd6f808 ffffffff81128cef
[1884622.665390] Call Trace:
[1884622.665721]  [<ffffffff81636431>] dump_stack+0x19/0x1b
[1884622.666431]  [<ffffffff816313cc>] dump_header+0x8e/0x214
[1884622.667001]  [<ffffffff81128cef>] ? delayacct_end+0x8f/0xb0
[1884622.667574]  [<ffffffff8116d21e>] oom_kill_process+0x24e/0x3b0
[1884622.668208]  [<ffffffff8116cd86>] ? find_lock_task_mm+0x56/0xc0
[1884622.668858]  [<ffffffff81088e4e>] ? has_capability_noaudit+0x1e/0x30
[1884622.669540]  [<ffffffff8116da46>] out_of_memory+0x4b6/0x4f0
[1884622.670108]  [<ffffffff81173c36>] __alloc_pages_nodemask+0xaa6/0xba0
[1884622.670735]  [<ffffffff811b7fca>] alloc_pages_vma+0x9a/0x150
[1884622.671304]  [<ffffffff81197b75>] handle_mm_fault+0xba5/0xf80
[1884622.671928]  [<ffffffffa015ba4c>] ? xfs_ilock+0xdc/0x120 [xfs]
[1884622.672515]  [<ffffffffa015bd64>] ? xfs_iunlock+0xa4/0x130 [xfs]
[1884622.673135]  [<ffffffffa01111fa>] ? xfs_attr_get+0x11a/0x1b0 [xfs]
[1884622.673776]  [<ffffffff81642040>] __do_page_fault+0x150/0x450
[1884622.674354]  [<ffffffff81642403>] trace_do_page_fault+0x43/0x110
[1884622.674987]  [<ffffffff81641ae9>] do_async_page_fault+0x29/0xe0
[1884622.675620]  [<ffffffff8163e678>] async_page_fault+0x28/0x30
[1884622.676267]  [<ffffffff811688dc>] ? file_read_actor+0x3c/0x180
[1884622.676926]  [<ffffffff812f9092>] ? radix_tree_lookup_slot+0x22/0x50
[1884622.677617]  [<ffffffff8116b0a8>] generic_file_aio_read+0x478/0x750
[1884622.678311]  [<ffffffffa014fed1>] xfs_file_aio_read+0x151/0x2f0 [xfs]
[1884622.679010]  [<ffffffff811de47d>] do_sync_read+0x8d/0xd0
[1884622.679553]  [<ffffffff811debdc>] vfs_read+0x9c/0x170
[1884622.680101]  [<ffffffff811df72f>] SyS_read+0x7f/0xe0
[1884622.680647]  [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[1884622.681272] Mem-Info:
[1884622.681534] Node 0 DMA per-cpu:
[1884622.681905] CPU    0: hi:    0, btch:   1 usd:   0
[1884622.682406] CPU    1: hi:    0, btch:   1 usd:   0
[1884622.682940] CPU    2: hi:    0, btch:   1 usd:   0
[1884622.683481] CPU    3: hi:    0, btch:   1 usd:   0
[1884622.683968] Node 0 DMA32 per-cpu:
[1884622.684329] CPU    0: hi:  186, btch:  31 usd:  20
[1884622.684822] CPU    1: hi:  186, btch:  31 usd:  34
[1884622.685318] CPU    2: hi:  186, btch:  31 usd: 144
[1884622.686644] CPU    3: hi:  186, btch:  31 usd:  21
[1884622.687978] Node 0 Normal per-cpu:
[1884622.689176] CPU    0: hi:  186, btch:  31 usd:  24
[1884622.690500] CPU    1: hi:  186, btch:  31 usd:  59
[1884622.691876] CPU    2: hi:  186, btch:  31 usd: 185
[1884622.693154] CPU    3: hi:  186, btch:  31 usd:  34
[1884622.694406] active_anon:2545397 inactive_anon:400896 isolated_anon:0
 active_file:0 inactive_file:6335 isolated_file:0
 unevictable:0 dirty:4 writeback:0 unstable:0
 free:34321 slab_reclaimable:6399 slab_unreclaimable:6568
 mapped:2071 shmem:16494 pagetables:7474 bounce:0
 free_cma:0
[1884622.701955] Node 0 DMA free:15908kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[1884622.709010] lowmem_reserve[]: 0 2815 11837 11837
[1884622.710489] Node 0 DMA32 free:57168kB min:16052kB low:20064kB high:24076kB active_anon:2271580kB inactive_anon:512908kB active_file:0kB inactive_file:12736kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129200kB managed:2884828kB mlocked:0kB dirty:8kB writeback:0kB mapped:7452kB shmem:25220kB slab_reclaimable:8136kB slab_unreclaimable:6100kB kernel_stack:3632kB pagetables:7964kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? yes
[1884622.718624] lowmem_reserve[]: 0 0 9022 9022
[1884622.720224] Node 0 Normal free:64208kB min:51440kB low:64300kB high:77160kB active_anon:7910008kB inactive_anon:1090676kB active_file:40kB inactive_file:12604kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:9437184kB managed:9238828kB mlocked:0kB dirty:8kB writeback:0kB mapped:832kB shmem:40756kB slab_reclaimable:17460kB slab_unreclaimable:20172kB kernel_stack:12624kB pagetables:21932kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61 all_unreclaimable? yes
[1884622.729016] lowmem_reserve[]: 0 0 0 0
[1884622.730648] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
[1884622.734773] Node 0 DMA32: 412*4kB (UEM) 527*8kB (UEM) 369*16kB (UEM) 151*32kB (UEM) 109*64kB (UEM) 52*128kB (UEM) 17*256kB (UEM) 8*512kB (UE) 18*1024kB (UM) 0*2048kB 0*4096kB = 57112kB
[1884622.739395] Node 0 Normal: 909*4kB (UE) 988*8kB (UEM) 470*16kB (UEM) 292*32kB (UEM) 212*64kB (UEM) 95*128kB (UEM) 30*256kB (UEM) 5*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 64372kB
[1884622.744255] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[1884622.746622] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[1884622.748888] 23034 total pagecache pages
[1884622.750701] 0 pages in swap cache
[1884622.752499] Swap cache stats: add 698056, delete 698056, find 6080023/6114645
[1884622.754633] Free swap  = 0kB
[1884622.756467] Total swap = 0kB
[1884622.758223] 3145594 pages RAM
[1884622.760036] 0 pages HighMem/MovableOnly
[1884622.761983] 110703 pages reserved
[1884622.763946] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[1884622.766399] [  384]     0   384    11354     2692      25        0             0 systemd-journal
[1884622.768951] [  418]     0   418    10779      135      21        0         -1000 systemd-udevd
[1884622.771380] [  524]     0   524     1094       39       8        0             0 rngd
[1884622.773629] [  526]     0   526     6633      110      17        0             0 systemd-logind
[1884622.776021] [  527]     0   527     4795       59      14        0             0 irqbalance
[1884622.778170] [  535]    81   535     6648       95      18        0          -900 dbus-daemon
[1884622.780404] [  542]    38   542     6296      144      16        0             0 ntpd
[1884622.782580] [  546]     0   546    50842      125      40        0             0 gssproxy
[1884622.784949] [  568]     0   568     6491       52      16        0             0 atd
[1884622.787030] [  569]     0   569    31583      156      21        0             0 crond
[1884622.789414] [  749]     0   749   138258     2656      90        0             0 tuned
[1884622.791470] [  753]     0   753   196627      453      49        0             0 collectd
[1884622.793512] [  762]     0   762    27509       32      13        0             0 agetty
[1884622.795484] [  765]    99   765     1665       34       8        0             0 dnsmasq
[1884622.797574] [  766]     0   766    27509       31      10        0             0 agetty
[1884622.799767] [  949]     0   949   155412     1066      34        0             0 qagent
[1884622.802013] [  951]   995   951   107531     1825      64        0             0 flume-ng-manage
[1884622.804271] [ 1315]     0  1315    22785      257      43        0             0 master
[1884622.806507] [ 1341]    89  1341    22828      251      46        0             0 qmgr
[1884622.808705] [ 8377]     0  8377    20640      212      44        0         -1000 sshd
[1884622.810884] [31919]     0 31919    84696     1294      55        0             0 rsyslogd
[1884622.813053] [16988]     0 16988     1084       24       8        0             0 runsvdir
[1884622.815210] [17283]     0 17283     1046       19       7        0             0 runsv
[1884622.817217] [17284]    99 17284     1082       21       7        0             0 svlogd
[1884622.819180] [17285] 40001 17285   595650    46186     179        0             0 java
[1884622.821245] [19650] 40001 19650  3697359  2848592    5915      375             0 java
[1884622.823242] [ 1566]     0  1566    80903     4492     113        0             0 salt-minion
[1884622.825186] [ 1579]     0  1579   171680     8500     146        0             0 salt-minion
[1884622.827061] [ 1581]     0  1581    99739     4612     109        0             0 salt-minion
[1884622.829025] [ 7454] 30301  7454    11063      165      25        0             0 nrpe
[1884622.831261] [ 7068]     0  7068    12803      117      23        0         -1000 auditd
[1884622.833769] [ 7077]     0  7077    20056       66       8        0             0 audispd
[1884622.836218] [19420]    89 19420    22811      249      44        0             0 pickup
[1884622.838464] [30516] 30301 30516    11099      176      25        0             0 nrpe
[1884622.840768] [30517] 30301 30517    11100      176      24        0             0 nrpe
[1884622.842880] [30518] 30301 30518    28793       57      14        0             0 check_proc_cpu
[1884622.845133] [30519] 30301 30519    88168     5633     128        0             0 salt
[1884622.847239] Out of memory: Kill process 19650 (java) score 940 or sacrifice child
[1884622.849499] Killed process 19650 (java) total-vm:14789436kB, anon-rss:11394368kB, file-rss:0kB
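
A quick way to locate records like these, assuming the default CentOS/RHEL 7 log locations used here, is:

dmesg -T | egrep -i 'killed process|out of memory'
grep -i oom-killer /var/log/messages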

This is the Linux OOM killer at work: when system memory is exhausted, it selectively kills processes to free memory. Linux allows programs to request more memory than the system can actually provide; this feature is called overcommit. It is an optimization: not every program immediately uses all the memory it allocates, and by the time it does, the system may have reclaimed resources elsewhere. Unfortunately, if a program touches its overcommitted memory at a moment when the system has nothing left to give, the OOM killer steps in.

Every process has an OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15; the higher the value, the more likely the process is to be killed. The OOM killer ultimately chooses its victim based on /proc/<pid>/oom_score, which the kernel derives from the process's memory consumption, CPU time (utime + stime), lifetime (uptime - start time) and oom_adj: the more memory a process uses, the higher its score, and the longer it has been alive, the lower. The overall policy is to lose as little work as possible, free as much memory as possible, avoid harming otherwise-innocent processes that legitimately use a lot of memory, and kill as few processes as possible.
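As a quick check, the overcommit policy and the per-process scores can be inspected directly. A minimal sketch follows; finding the JVM with pgrep -f tomcat is only an assumption about how the process was started, so substitute the real PID if needed:

# 0 = heuristic overcommit (default), 1 = always overcommit, 2 = never overcommit
cat /proc/sys/vm/overcommit_memory

# OOM score and adjustment for the Tomcat JVM (hypothetical way of finding the PID)
pid=$(pgrep -f tomcat | head -n 1)
cat /proc/$pid/oom_score
cat /proc/$pid/oom_score_adj   # newer interface; the older oom_adj file still exists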

Having found the direct cause of the Tomcat crashes, the next question was why running out of memory killed the process outright. If a swap partition is available, the system can swap pages out when physical memory runs low, and the process would not necessarily die. Checking swap usage on these servers showed that swap was not enabled at all:

              total        used        free      shared  buff/cache   available
Mem:             11           9           0           0           0           1
Swap:             0           0           0
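
Whether any swap device is configured can also be confirmed with the standard util-linux tools:

swapon -s          # lists active swap areas; empty output means swap is off
cat /proc/swaps    # the same information straight from the kernel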

This is the root cause of the OOM kills. Swap had been disabled on these machines because memory is far faster than disk: using swap increases system I/O and causes heavy page-in/page-out traffic, which can seriously degrade performance.

However, instead of disabling swap entirely, we can keep it enabled and tune the policy that governs how aggressively it is used.

The value in /proc/sys/vm/swappiness controls how the kernel uses swap: 0 means use physical memory as much as possible and fall back to swap only as a last resort, while 100 means swap aggressively, moving data from memory into swap space early. The Linux default is 60, which in practice makes the kernel start using swap well before physical memory is exhausted.
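
The current value can be checked before changing it:

cat /proc/sys/vm/swappiness    # or: sysctl vm.swappiness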

The value can be adjusted at runtime with:

sysctl vm.swappiness=0

This only changes the running system, though, and is lost on reboot. To make the setting permanent, edit /etc/sysctl.conf and add:

vm.swappiness=0

Then clear the current swap usage and reload /etc/sysctl.conf with sysctl -p.
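
One common sequence, assuming there is enough free physical memory to absorb whatever is currently swapped out, is:

swapoff -a    # move swapped-out pages back into RAM (fails if RAM cannot hold them)
swapon -a     # re-enable the swap areas defined in /etc/fstab
sysctl -p     # reload /etc/sysctl.conf so vm.swappiness takes effect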

References

http://blog.csdn.net/hunanchenxingyu/article/details/26271293

http://blog.csdn.net/lufeisan/article/details/53339991
