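Problem
The test server crashed frequently: about once a week at first, and later as soon as the application service started.
Server OS: CentOS 6.5
Kernel version: 2.6.32-431.el6.x86_64
Server system log analysis
I checked the log at /var/log/messages. The entries below are the errors that appeared most often: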
Dec 4 14:11:46 localhost abrtd: Init complete, entering main loop
Dec 4 14:11:53 localhost modem-manager: (ttyS1) closing serial device...
Dec 4 14:11:53 localhost modem-manager: (ttyS1) opening serial device...
Dec 4 14:11:59 localhost modem-manager: (ttyS1) closing serial device...
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: APEI generic hardware error status
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: severity: 2, corrected
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: section: 0, severity: 2, corrected
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: flags: 0x01
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: primary
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: fru_text: CorrectedErr
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: section_type: memory error
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: node: 15424
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: device: 12343
Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC
Dec 4 14:12:16 localhost kernel: [Hardware Error]: Machine check events logged 【crash】
Dec 9 04:05:06 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started. 【reboot】
Dec 9 04:05:06 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1601" x-info="http://www.rsyslog.com"] start
Dec 9 04:05:06 localhost kernel: Initializing cgroup subsys cpuset
Dec 9 04:05:11 localhost abrtd: Init complete, entering main loop
Dec 9 04:05:19 localhost modem-manager: (ttyS1) closing serial device...
Dec 9 04:05:19 localhost modem-manager: (ttyS1) opening serial device...
Dec 9 04:05:25 localhost modem-manager: (ttyS1) closing serial device...
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: APEI generic hardware error status
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: severity: 2, corrected
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: section: 0, severity: 2, corrected
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: flags: 0x01
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: primary
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: fru_text: CorrectedErr
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: section_type: memory error
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: node: 24208
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: device: 12343
Dec 9 04:05:52 localhost kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC
Dec 9 04:05:52 localhost kernel: [Hardware Error]: Machine check events logged 【crash】
Dec 11 10:40:00 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started. 【reboot】
Dec 11 10:40:00 localhost rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="1603" x-info="http://www.rsyslog.com"] start
Dec 11 10:40:00 localhost kernel: Initializing cgroup subsys cpuset
Dec 11 10:40:00 localhost kernel: Initializing cgroup subsys cpu
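When I first saw these errors I was fairly baffled: "Hardware Error" pointed to a hardware fault, and I assumed the machine was beyond saving.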
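Reading these multi-line records by eye gets tedious, so here is a minimal sketch of how they could be grouped into one record per event. It assumes the syslog line layout shown in the excerpt above; the names `LINE_RE` and `parse_hardware_errors` are made up for this illustration and are not part of any tool.

```python
import re
from typing import Dict, Iterable, List

# Matches kernel lines such as:
#   Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: node: 15424
# The "{1}" sequence prefix is optional because the final
# "Machine check events logged" line does not carry it.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?:\{\d+\})?\[Hardware Error\]:\s*(?P<body>.*)$"
)


def parse_hardware_errors(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group consecutive [Hardware Error] lines into one dict per event."""
    records: List[Dict[str, str]] = []
    current = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a hardware error line
        body = match.group("body")
        if body.startswith("Hardware error from"):
            # This line opens a new event in the excerpt above.
            current = {"timestamp": match.group("ts"), "source": body}
            records.append(current)
        elif current is not None and ": " in body:
            key, value = body.split(": ", 1)
            # setdefault keeps the first value seen for a repeated key.
            current.setdefault(key.strip(), value.strip())
    return records


# A few lines copied from the log excerpt, used as sample input.
sample = [
    "Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1",
    "Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: severity: 2, corrected",
    "Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: section_type: memory error",
    "Dec 4 14:12:16 localhost kernel: {1}[Hardware Error]: error_type: 2, single-bit ECC",
]

for record in parse_hardware_errors(sample):
    print(record["timestamp"], record.get("section_type"), record.get("error_type"))
```

To run it against the real log, pass an open file object for /var/log/messages instead of `sample`; reading that file typically requires root.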
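Solution
Searching Bing for the key phrase "Hardware error from APEI Generic Hardware Error Source: 1" turned up one reasonably close match, "APEI Generic Hardware Error". Roughly speaking, it attributes the problem to the way the system interacts with ECC memory.
I then did two things:
- 1. Pulled out the memory modules, cleaned off the dust, and reinserted them in different slots 【problem not solved after reboot】
- 2. Upgraded the kernel (from 2.6.32-431.el6.x86_64 to 3.17.1)
The server has now been running for more than a week with no crashes so far, and /var/log/messages shows no errors.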
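A check like the one just described can be scripted. The following is only a sketch under my own assumptions: `kernel_version_tuple`, `running_kernel_is_at_least`, and `count_hardware_errors` are hypothetical helper names, and the log is assumed to use the same "[Hardware Error]" marker as the excerpt earlier in this post.

```python
import platform
import re
from typing import Iterable, Tuple


def kernel_version_tuple(release: str) -> Tuple[int, ...]:
    """Return the leading numeric part of a kernel release string as a tuple.

    For example, '2.6.32-431.el6.x86_64' yields (2, 6, 32).
    """
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def running_kernel_is_at_least(minimum: str) -> bool:
    """Compare the running kernel (as reported by platform.release()) with a minimum."""
    return kernel_version_tuple(platform.release()) >= kernel_version_tuple(minimum)


def count_hardware_errors(lines: Iterable[str]) -> int:
    """Count log lines that carry the [Hardware Error] marker."""
    return sum(1 for line in lines if "[Hardware Error]" in line)


# The two versions from this post: the old kernel sorts before the new one.
old = kernel_version_tuple("2.6.32-431.el6.x86_64")
new = kernel_version_tuple("3.17.1")
print(old < new)
```

On the server itself, `running_kernel_is_at_least("3.17.1")` would confirm that the machine booted into the new kernel, and `count_hardware_errors(open("/var/log/messages"))` should return 0 if the log is clean.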
Afterthoughts
The problem may be related to the several sudden power outages the server went through beforehand.
References
Viewing Linux logs
CentOS kernel upgrade
List of the latest Linux kernels