Recently, Tomcat instances on servers in one of our machine rooms kept dying suddenly, with no warning at all. Checking catalina.out and the application logs turned up nothing,
but /var/log/messages (or the output of dmesg) contained entries like the following:
[1884622.659293] salt invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[1884622.660034] salt cpuset=/ mems_allowed=0
[1884622.660927] CPU: 3 PID: 30519 Comm: salt Not tainted 3.10.0-327.36.3.el7.x86_64 #1
[1884622.661989] Hardware name: RDO OpenStack Compute, BIOS 1.9.1-5.el7_3.3 04/01/2014
[1884622.662822] ffff880328f25080 000000000c51a1e0 ffff88032cd6f790 ffffffff81636431
[1884622.663701] ffff88032cd6f820 ffffffff816313cc ffff8800357ff4b0 ffff8800357ff4c8
[1884622.664513] 0000000000000202 ffff880328f25080 ffff88032cd6f808 ffffffff81128cef
[1884622.665390] Call Trace:
[1884622.665721] [<ffffffff81636431>] dump_stack+0x19/0x1b
[1884622.666431] [<ffffffff816313cc>] dump_header+0x8e/0x214
[1884622.667001] [<ffffffff81128cef>] ? delayacct_end+0x8f/0xb0
[1884622.667574] [<ffffffff8116d21e>] oom_kill_process+0x24e/0x3b0
[1884622.668208] [<ffffffff8116cd86>] ? find_lock_task_mm+0x56/0xc0
[1884622.668858] [<ffffffff81088e4e>] ? has_capability_noaudit+0x1e/0x30
[1884622.669540] [<ffffffff8116da46>] out_of_memory+0x4b6/0x4f0
[1884622.670108] [<ffffffff81173c36>] __alloc_pages_nodemask+0xaa6/0xba0
[1884622.670735] [<ffffffff811b7fca>] alloc_pages_vma+0x9a/0x150
[1884622.671304] [<ffffffff81197b75>] handle_mm_fault+0xba5/0xf80
[1884622.671928] [<ffffffffa015ba4c>] ? xfs_ilock+0xdc/0x120 [xfs]
[1884622.672515] [<ffffffffa015bd64>] ? xfs_iunlock+0xa4/0x130 [xfs]
[1884622.673135] [<ffffffffa01111fa>] ? xfs_attr_get+0x11a/0x1b0 [xfs]
[1884622.673776] [<ffffffff81642040>] __do_page_fault+0x150/0x450
[1884622.674354] [<ffffffff81642403>] trace_do_page_fault+0x43/0x110
[1884622.674987] [<ffffffff81641ae9>] do_async_page_fault+0x29/0xe0
[1884622.675620] [<ffffffff8163e678>] async_page_fault+0x28/0x30
[1884622.676267] [<ffffffff811688dc>] ? file_read_actor+0x3c/0x180
[1884622.676926] [<ffffffff812f9092>] ? radix_tree_lookup_slot+0x22/0x50
[1884622.677617] [<ffffffff8116b0a8>] generic_file_aio_read+0x478/0x750
[1884622.678311] [<ffffffffa014fed1>] xfs_file_aio_read+0x151/0x2f0 [xfs]
[1884622.679010] [<ffffffff811de47d>] do_sync_read+0x8d/0xd0
[1884622.679553] [<ffffffff811debdc>] vfs_read+0x9c/0x170
[1884622.680101] [<ffffffff811df72f>] SyS_read+0x7f/0xe0
[1884622.680647] [<ffffffff81646b49>] system_call_fastpath+0x16/0x1b
[1884622.681272] Mem-Info:
[1884622.681534] Node 0 DMA per-cpu:
[1884622.681905] CPU 0: hi: 0, btch: 1 usd: 0
[1884622.682406] CPU 1: hi: 0, btch: 1 usd: 0
[1884622.682940] CPU 2: hi: 0, btch: 1 usd: 0
[1884622.683481] CPU 3: hi: 0, btch: 1 usd: 0
[1884622.683968] Node 0 DMA32 per-cpu:
[1884622.684329] CPU 0: hi: 186, btch: 31 usd: 20
[1884622.684822] CPU 1: hi: 186, btch: 31 usd: 34
[1884622.685318] CPU 2: hi: 186, btch: 31 usd: 144
[1884622.686644] CPU 3: hi: 186, btch: 31 usd: 21
[1884622.687978] Node 0 Normal per-cpu:
[1884622.689176] CPU 0: hi: 186, btch: 31 usd: 24
[1884622.690500] CPU 1: hi: 186, btch: 31 usd: 59
[1884622.691876] CPU 2: hi: 186, btch: 31 usd: 185
[1884622.693154] CPU 3: hi: 186, btch: 31 usd: 34
[1884622.694406] active_anon:2545397 inactive_anon:400896 isolated_anon:0
active_file:0 inactive_file:6335 isolated_file:0
unevictable:0 dirty:4 writeback:0 unstable:0
free:34321 slab_reclaimable:6399 slab_unreclaimable:6568
mapped:2071 shmem:16494 pagetables:7474 bounce:0
free_cma:0
[1884622.701955] Node 0 DMA free:15908kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[1884622.709010] lowmem_reserve[]: 0 2815 11837 11837
[1884622.710489] Node 0 DMA32 free:57168kB min:16052kB low:20064kB high:24076kB active_anon:2271580kB inactive_anon:512908kB active_file:0kB inactive_file:12736kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129200kB managed:2884828kB mlocked:0kB dirty:8kB writeback:0kB mapped:7452kB shmem:25220kB slab_reclaimable:8136kB slab_unreclaimable:6100kB kernel_stack:3632kB pagetables:7964kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? yes
[1884622.718624] lowmem_reserve[]: 0 0 9022 9022
[1884622.720224] Node 0 Normal free:64208kB min:51440kB low:64300kB high:77160kB active_anon:7910008kB inactive_anon:1090676kB active_file:40kB inactive_file:12604kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:9437184kB managed:9238828kB mlocked:0kB dirty:8kB writeback:0kB mapped:832kB shmem:40756kB slab_reclaimable:17460kB slab_unreclaimable:20172kB kernel_stack:12624kB pagetables:21932kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61 all_unreclaimable? yes
[1884622.729016] lowmem_reserve[]: 0 0 0 0
[1884622.730648] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15908kB
[1884622.734773] Node 0 DMA32: 412*4kB (UEM) 527*8kB (UEM) 369*16kB (UEM) 151*32kB (UEM) 109*64kB (UEM) 52*128kB (UEM) 17*256kB (UEM) 8*512kB (UE) 18*1024kB (UM) 0*2048kB 0*4096kB = 57112kB
[1884622.739395] Node 0 Normal: 909*4kB (UE) 988*8kB (UEM) 470*16kB (UEM) 292*32kB (UEM) 212*64kB (UEM) 95*128kB (UEM) 30*256kB (UEM) 5*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 64372kB
[1884622.744255] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[1884622.746622] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[1884622.748888] 23034 total pagecache pages
[1884622.750701] 0 pages in swap cache
[1884622.752499] Swap cache stats: add 698056, delete 698056, find 6080023/6114645
[1884622.754633] Free swap = 0kB
[1884622.756467] Total swap = 0kB
[1884622.758223] 3145594 pages RAM
[1884622.760036] 0 pages HighMem/MovableOnly
[1884622.761983] 110703 pages reserved
[1884622.763946] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[1884622.766399] [ 384] 0 384 11354 2692 25 0 0 systemd-journal
[1884622.768951] [ 418] 0 418 10779 135 21 0 -1000 systemd-udevd
[1884622.771380] [ 524] 0 524 1094 39 8 0 0 rngd
[1884622.773629] [ 526] 0 526 6633 110 17 0 0 systemd-logind
[1884622.776021] [ 527] 0 527 4795 59 14 0 0 irqbalance
[1884622.778170] [ 535] 81 535 6648 95 18 0 -900 dbus-daemon
[1884622.780404] [ 542] 38 542 6296 144 16 0 0 ntpd
[1884622.782580] [ 546] 0 546 50842 125 40 0 0 gssproxy
[1884622.784949] [ 568] 0 568 6491 52 16 0 0 atd
[1884622.787030] [ 569] 0 569 31583 156 21 0 0 crond
[1884622.789414] [ 749] 0 749 138258 2656 90 0 0 tuned
[1884622.791470] [ 753] 0 753 196627 453 49 0 0 collectd
[1884622.793512] [ 762] 0 762 27509 32 13 0 0 agetty
[1884622.795484] [ 765] 99 765 1665 34 8 0 0 dnsmasq
[1884622.797574] [ 766] 0 766 27509 31 10 0 0 agetty
[1884622.799767] [ 949] 0 949 155412 1066 34 0 0 qagent
[1884622.802013] [ 951] 995 951 107531 1825 64 0 0 flume-ng-manage
[1884622.804271] [ 1315] 0 1315 22785 257 43 0 0 master
[1884622.806507] [ 1341] 89 1341 22828 251 46 0 0 qmgr
[1884622.808705] [ 8377] 0 8377 20640 212 44 0 -1000 sshd
[1884622.810884] [31919] 0 31919 84696 1294 55 0 0 rsyslogd
[1884622.813053] [16988] 0 16988 1084 24 8 0 0 runsvdir
[1884622.815210] [17283] 0 17283 1046 19 7 0 0 runsv
[1884622.817217] [17284] 99 17284 1082 21 7 0 0 svlogd
[1884622.819180] [17285] 40001 17285 595650 46186 179 0 0 java
[1884622.821245] [19650] 40001 19650 3697359 2848592 5915 375 0 java
[1884622.823242] [ 1566] 0 1566 80903 4492 113 0 0 salt-minion
[1884622.825186] [ 1579] 0 1579 171680 8500 146 0 0 salt-minion
[1884622.827061] [ 1581] 0 1581 99739 4612 109 0 0 salt-minion
[1884622.829025] [ 7454] 30301 7454 11063 165 25 0 0 nrpe
[1884622.831261] [ 7068] 0 7068 12803 117 23 0 -1000 auditd
[1884622.833769] [ 7077] 0 7077 20056 66 8 0 0 audispd
[1884622.836218] [19420] 89 19420 22811 249 44 0 0 pickup
[1884622.838464] [30516] 30301 30516 11099 176 25 0 0 nrpe
[1884622.840768] [30517] 30301 30517 11100 176 24 0 0 nrpe
[1884622.842880] [30518] 30301 30518 28793 57 14 0 0 check_proc_cpu
[1884622.845133] [30519] 30301 30519 88168 5633 128 0 0 salt
[1884622.847239] Out of memory: Kill process 19650 (java) score 940 or sacrifice child
[1884622.849499] Killed process 19650 (java) total-vm:14789436kB, anon-rss:11394368kB, file-rss:0kB
This is the Linux OOM killer at work: when system memory is exhausted, it selectively kills processes to free some up. Linux allows programs to request more memory than the system actually has available, a feature called overcommit. It exists as an optimization: not every program uses the memory it requests right away, and by the time it does, the system may well have reclaimed resources elsewhere. Unfortunately, if a program touches its overcommitted memory at a moment when the system has nothing left to give, the OOM killer steps in.

Every process has an OOM weight in /proc/<pid>/oom_adj, with values from -17 to +15; the higher the value, the more likely the process is to be killed. The OOM killer ultimately picks its victim by /proc/<pid>/oom_score, which the kernel derives from the process's memory consumption, CPU time (utime + stime), run time (uptime - start time), and oom_adj: the more memory a process consumes, the higher its score; the longer it has been alive, the lower. In short, the overall strategy is to lose the least work and free the most memory, while sparing well-behaved processes that legitimately use a lot of memory and killing as few processes as possible.
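For illustration, these knobs can be inspected and adjusted from the shell. A minimal sketch, using the java PID 19650 from the log above as an example (substitute your own PID):

cat /proc/sys/vm/overcommit_memory       # kernel overcommit policy; 0 = heuristic overcommit, the default
cat /proc/19650/oom_score                # the score the OOM killer would use for this process
echo -17 > /proc/19650/oom_adj           # -17 exempts the process from OOM killing (legacy interface)
echo -1000 > /proc/19650/oom_score_adj   # same effect via the newer oom_score_adj interface (-1000..+1000)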
That explains the direct cause of the Tomcat deaths. The next question: when memory runs out, a swap partition, if present, would be used first, and the process would not simply be killed. Checking swap usage on these servers (free -g output below, in GB) shows that swap is not enabled at all:
              total        used        free      shared  buff/cache   available
Mem:             11           9           0           0           0           1
Swap:             0           0           0
This is what ultimately let the OOM kill happen. Swap had been disabled on these machines for performance reasons: memory is much faster than disk, and heavy swap use drives up system I/O while pages are constantly swapped in and out, which can seriously hurt performance.
But we don't have to disable swap entirely; we can keep it and tune the policy that governs when it is used.
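On machines like these, where no swap is provisioned at all, a swap file can be added without repartitioning. A minimal sketch, assuming the path /swapfile and a 4 GB size (both are just examples):

dd if=/dev/zero of=/swapfile bs=1M count=4096            # create a 4 GB file
chmod 600 /swapfile                                      # swap must not be world-readable
mkswap /swapfile                                         # format it as swap space
swapon /swapfile                                         # enable it immediately
echo '/swapfile none swap defaults 0 0' >> /etc/fstab    # persist across reboots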
The value in /proc/sys/vm/swappiness controls how the kernel uses swap. It is a relative preference for swapping out anonymous pages versus reclaiming page cache, not a hard threshold: 0 tells the kernel to use physical memory as long as possible before touching swap, while 100 makes it swap aggressively, moving data from memory into swap space early. Most distributions default to 60, which already makes the kernel quite willing to swap. (One caveat: on newer kernels, including the 3.10 kernel on these servers, swappiness=0 makes the kernel avoid swapping anonymous pages almost entirely, so under heavy memory pressure it can still end in an OOM kill; a small value such as 1 or 10 is often the safer choice.)
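To check the value currently in effect:

cat /proc/sys/vm/swappiness      # or: sysctl vm.swappiness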
To adjust it on the fly:
sysctl vm.swappiness=0
This only lasts until the next reboot, though. To make the change permanent, edit /etc/sysctl.conf and add:
vm.swappiness=0
Then reload /etc/sysctl.conf with sysctl -p and drain any existing swap back into RAM.
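A minimal sequence (draining swap assumes there is enough free RAM to absorb the swapped-out pages):

sysctl -p            # re-apply the settings from /etc/sysctl.conf
swapoff -a           # move swapped pages back into RAM; fails if RAM cannot hold them
swapon -a            # re-enable swap with the new policy in effect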