CentOS 7 (3.10.0-123.el7.x86_64) reboot problem

CentOS 7 (3.10.0-123.el7.x86_64) reboot problem  http://aperise.iteye.com/blog/2326082
CentOS 7 (3.10.0-327.el7.x86_64) reboot problem  http://aperise.iteye.com/blog/2425717



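We installed CentOS 7 on newly purchased servers (2U, 2 CPUs, 6 cores per CPU, 16 GB × 8 memory, 5 × 2 TB disks) and built a Hadoop cluster and a Spark cluster on them. Recently, while running Spark jobs, we found that after a job had executed a certain number of times, one of the servers, seemingly at random and for no apparent reason, would reboot on its own.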







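System operations is not my area of expertise, so I was inclined to trust the reply from the system operations engineer, or at least did not question it at first (which in the end proved to be a fatal misjudgment). I therefore spent a great deal of time checking the resource consumption (CPU, MEM, IO) of the Hadoop and Spark clusters. During that period I mainly used the tool nmon to capture detailed metrics from every server, ran Spark jobs relentlessly to reproduce the problem, and then analyzed the nmon logs.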



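nmon is a free tool for analyzing AIX and Linux performance, so I will also briefly cover how to use it here. The version I downloaded consists mainly of the following two files: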

  • nmon_x86_64_centos6.centos6          the nmon tool itself; it captures server resource logs and saves them as hostname_YYMMDD_HHMM.nmon
  • nmon analyser v40.xlsm               converts the "hostname_YYMMDD_HHMM.nmon" file above into readable Excel charts

    3.1 nmon command-line options

cd /home/hadoop/nmon
./nmon_x86_64_centos6 -h
Hint: nmon_x86_64_centos6 [-h] [-s <seconds>] [-c <count>] [-f -d <disks> -t -r <name>] [-x]

-h FULL help information
read startup banner and type: "h" once it is running
For Data-Collect-Mode (-f)
-f spreadsheet output format [note: default -s300 -c288]
-s <seconds> between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-d <disks> to increase the number of disks [default 256]
-t spreadsheet includes top processes
-x capacity planning (15 min for 1 day = -fdt -s 900 -c 96)

Version - nmon 14i

For Interactive-Mode
-s <seconds> time between refreshing the screen [default 2]
-c <number> of refreshes [default millions]
-g <filename> User Defined Disk Groups [hit g to show them]
- file = on each line: group_name <disks list> space separated
- like: database sdb sdc sdd sde
- upto 64 disk groups, 512 disks per line
- disks can appear more than once and in many groups
-b black and white [default is colour]
example: nmon_x86_64_centos6 -s 1 -c 100

For Data-Collect-Mode = spreadsheet format (comma separated values)
Note: use only one of f,F,z,x or X and make it the first argument
-f spreadsheet output format [note: default -s300 -c288]
output file is <hostname>_YYYYMMDD_HHMM.nmon
-F <filename> same as -f but user supplied filename
-r <runname> used in the spreadsheet file [default hostname]
-t include top processes in the output
-T as -t plus saves command line arguments in UARG section
-s <seconds> between snap shots
-c <number> of snapshots before nmon stops
-d <disks> to increase the number of disks [default 256]
-l <dpl> disks/line default 150 to avoid spreadsheet issues. EMC=64.
-g <filename> User Defined Disk Groups (see above) - see BBBG & DG lines
-N include NFS Network File System
-I <percent> Include process & disks busy threshold (default 0.1)
don't save or show proc/disk using less than this percent
-m <directory> nmon changes to this directory before saving to file
example: collect for 1 hour at 30 second intervals with top procs
nmon_x86_64_centos6 -f -t -r Test1 -s30 -c120

To load into a spreadsheet:
sort -A *nmon >stats.csv
transfer the stats.csv file to your PC
Start spreadsheet & then Open type=comma-separated-value ASCII file
The nmon analyser or consolidator does not need the file sorted.

Capacity planning mode - use cron to run each day
-x sensible spreadsheet output for CP = one day
every 15 mins for 1 day ( i.e. -ft -s 900 -c 96)
-X sensible spreadsheet output for CP = busy hour
every 30 secs for 1 hour ( i.e. -ft -s 30 -c 120)

Interactive Mode Commands
key --- Toggles to control what is displayed ---
h = Online help information
r = Machine type, machine name, cache details and OS version + LPAR
c = CPU by processor stats with bar graphs
l = long term CPU (over 75 snapshots) with bar graphs
m = Memory stats
L = Huge memory page stats
V = Virtual Memory and Swap stats
k = Kernel Internal stats
n = Network stats and errors
N = NFS Network File System
d = Disk I/O Graphs
D = Disk I/O Stats
o = Disk I/O Map (one character per disk showing how busy it is)
o = User Defined Disk Groups
j = File Systems
t = Top Process stats use 1,3,4,5 to select the data & order
u = Top Process full command details
v = Verbose mode - tries to make recommendations
b = black and white mode (or use -b option)
. = minimum mode i.e. only busy disks and processes

key --- Other Controls ---
+ = double the screen refresh time
- = halves the screen refresh time
q = quit (also x, e or control-C)
0 = reset peak counts to zero (peak = ">")
space = refresh screen now

Startup Control
If you find you always type the same toggles every time you start
then place them in the NMON shell variable. For example:
export NMON=cmdrvtan

a) To you want to stop nmon - kill -USR2 <nmon-pid>
b) Use -p and nmon outputs the background process pid
c) To limit the processes nmon lists (online and to a file)
Either set NMONCMD0 to NMONCMD63 to the program names
or use -C cmd:cmd:cmd etc. example: -C ksh:vi:syncd
d) If you want to pipe nmon output to other commands use a FIFO:
mkfifo /tmp/mypipe
nmon -F /tmp/mypipe &
grep /tmp/mypipe
e) If nmon fails please report it with:
1) nmon version like: 14i
2) the output of cat /proc/cpuinfo
3) some clue of what you were doing
4) I may ask you to run the debug version

Developer Nigel Griffiths
Feedback welcome - on the current release only and state exactly the problem
No warranty given or implied.



    3.2 Installing nmon on the server



    3.3 Capturing logs on the server

Run the following commands to capture server metrics:
cd /home/hadoop/nmon
./nmon_x86_64_centos6 -f -t -r name_view_in_excel_sheet -s 15 -c 960
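It helps to know how long a capture with a given -s and -c will run before starting it, since nmon stops on its own once the snapshot count is reached. The following is only a small sketch of that arithmetic; the function name capture_window is made up for illustration and is not part of nmon.

```shell
#!/bin/sh
# Total wall-clock time covered by an nmon capture:
# interval in seconds (-s) multiplied by the number of snapshots (-c).
# capture_window is a hypothetical helper name, not an nmon feature.
capture_window() {
    interval=$1   # seconds between snapshots (-s)
    count=$2      # number of snapshots before nmon stops (-c)
    total=$((interval * count))
    printf '%dh %dm %ds\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

capture_window 15 960    # the values used in the command above
```

With -s 15 -c 960 the capture covers 15 × 960 = 14400 seconds, i.e. four hours, so the stress run that is meant to reproduce the reboot has to fit inside that window.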



     Step 1: Open the file "nmon analyser v40.xlsm", click the "Analyze nmon data" button, and select the performance log file captured above, "hadoop31_160921_2357.nmon".



Manually release the cache:
free -m
echo 1 > /proc/sys/vm/drop_caches
free -m
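To put a number on what dropping the cache actually frees, the "Cached" line of /proc/meminfo can be read before and after. This is a rough sketch only: cached_mb is a hypothetical helper name, and the optional path argument exists purely so the function can be tried on a saved copy of meminfo.

```shell
#!/bin/sh
# Print the "Cached" figure from a meminfo-style file, in MB (truncated).
# cached_mb is a hypothetical helper; the file defaults to /proc/meminfo.
cached_mb() {
    # Match the field name exactly so that "SwapCached:" is not picked up.
    awk '$1 == "Cached:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Typical use on the server (writing to drop_caches needs root):
#   before=$(cached_mb)
#   sync                                  # flush dirty pages first
#   echo 1 > /proc/sys/vm/drop_caches     # 1 = drop the page cache only
#   after=$(cached_mb)
#   echo "page cache: ${before} MB -> ${after} MB"
```

Dropping the cache only discards clean cached pages, so it is a diagnostic aid for ruling out memory pressure rather than a fix for the reboots.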







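    2) Since no logs could be found, the next step was to get the system operations engineer to actively capture them: first, the operating-system-level crash log; second, a core dump of the Java program.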
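Before waiting for the next reboot, it is worth confirming that the machine can actually record a crash. The sketch below assumes the OS-level crash log comes from kdump (the post does not say which mechanism was used); crash_kernel_loaded is a hypothetical helper name, while /sys/kernel/kexec_crash_loaded is the standard kernel file that reads 1 when a crash kernel is loaded.

```shell
#!/bin/sh
# Succeed when a crash kernel is loaded, i.e. a kernel panic can be captured.
# crash_kernel_loaded is a hypothetical helper; the optional path argument
# exists only so the check can be tried against a test file.
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

# Typical use on the server:
#   crash_kernel_loaded && echo "crash capture armed" || echo "crash capture NOT armed"
#
# For the Java side, the core size limit has to be raised in the shell that
# launches the JVM, otherwise a native crash leaves no core file:
#   ulimit -c unlimited
```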




    Looking at the file vmcore-dmesg.txt revealed an error (an alarming kernel BUG):
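When several servers have crashed, searching every vmcore-dmesg.txt at once is quicker than opening them one by one. A minimal sketch, assuming the dumps live under /var/crash (the usual kdump default on CentOS, not something the post states); find_kernel_bugs is a hypothetical helper name.

```shell
#!/bin/sh
# Print every "kernel BUG" line, prefixed with its file name, from all
# vmcore-dmesg.txt files below a crash directory.
# find_kernel_bugs is a hypothetical helper; /var/crash is an assumed default.
find_kernel_bugs() {
    crash_dir=${1:-/var/crash}
    find "$crash_dir" -type f -name 'vmcore-dmesg.txt' \
        -exec grep -H 'kernel BUG' {} +
}

# Typical use on the server:
#   find_kernel_bugs              # search /var/crash
#   find_kernel_bugs /data/crash  # or a custom dump location
```

If the same source file and line show up in the BUG message on every crashed host, that points to a kernel defect rather than to the workload.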

     This is a kernel-level bug in CentOS 7. My Linux kernel version is 3.10.0-123.el7.x86_64.
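To see which machines in a cluster are still on an older kernel release, the output of uname -r can be compared against a reference release string. The following is a sketch only: kernel_older_than is a hypothetical helper name, it relies on GNU sort -V, and the reference release is whatever you choose to pass in. It says nothing about which release actually contains a fix.

```shell
#!/bin/sh
# Succeed when kernel release $1 sorts strictly before release $2.
# kernel_older_than is a hypothetical helper; it depends on GNU sort's
# -V (version sort) option, which CentOS 7 provides.
kernel_older_than() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use on each server, with a reference release of your choosing:
#   reference=3.10.0-327.el7.x86_64
#   if kernel_older_than "$(uname -r)" "$reference"; then
#       echo "$(hostname): kernel $(uname -r) is older than $reference"
#   fi
```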


     Contention on a page table entry in memory triggered the kernel crash.


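     6) Searching online, I found that others had run into a similar problem on CentOS 7 (Linux version 3.10.0-123.el7.x86_64): "RH5885 V3 CentOS7.0 (Redhat7.0) kernel problem causes the system to reboot automatically".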



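      At this point, the problem that had troubled us for so long was finally solved; the fix, without question, was to upgrade the CentOS 7 kernel version to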



