一次诡异的linux系统重启故障

情况描述:
同事反应说oracle数据库在周末的时候宕了,排查下问题.
登到服务器上发现,oracle进程已经不存在,然后ps看了下监听进程,发现也不存在,这时候就怀疑是操作系统重启了.

操作系统版本信息:
[root@card-paopi log]# uname -a
Linux card-paopi 2.6.32-220.el6.x86_64 #1 SMP Wed Nov 9 08:03:13 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[oracle@card-paopi bdump]$ cat /proc/version
Linux version 2.6.32-220.el6.x86_64 ([email protected]) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
[oracle@card-paopi bdump]$ cat /etc/issue
Red Hat Enterprise Linux Server release 6.2 (Santiago)

oracle版本:
Release 10.2.0.4.0 - Production

下面就出现让我无法理解的情况了:
使用uptime命令,输出如下:
[oracle@card-paopi bdump]$ uptime
 10:12:15 up 1 day, 49 min,  4 users,  load average: 0.13, 0.19, 0.18

可见,服务器运行了1天49分钟

然后使用 last reboot命令,输出如下:
[root@card-paopi log]# last reboot
reboot   system boot  2.6.32-220.el6.x Sun Jan 15 01:03 - 10:13 (1+09:09)   
reboot   system boot  2.6.32-220.el6.x Tue Feb  2 01:47 - 10:13 (349+08:25) 
reboot   system boot  2.6.32-220.el6.x Thu Sep 25 06:38 - 10:13 (844+03:34) 
reboot   system boot  2.6.32-220.el6.x Thu Sep 11 16:16 - 10:13 (857+17:57) 
reboot   system boot  2.6.32-220.el6.x Thu Sep 11 16:08 - 10:13 (857+18:05) 
reboot   system boot  2.6.32-220.el6.x Thu Sep 11 13:42 - 10:13 (857+20:31) 
reboot   system boot  2.6.32-220.el6.x Wed Sep 10 16:45 - 13:41  (20:56)    
reboot   system boot  2.6.32-220.el6.x Wed Sep 10 16:16 - 16:45  (00:28)    
reboot   system boot  2.6.32-220.el6.x Wed Sep 10 16:05 - 16:45  (00:39)    
reboot   system boot  2.6.32-220.el6.x Wed Sep 10 15:52 - 16:45  (00:52)    
reboot   system boot  2.6.32-220.el6.x Wed Sep 10 15:45 - 15:50  (00:05)    

wtmp begins Wed Sep 10 15:45:41 2014

可以看到,最近的reboot是15号晚上1点多重启的,重启至今是1天9小时9分钟.....

然后我看了下oracle 监听的日志,listener.log:
15-JAN-2017 09:00:36 * service_update * orcl * 0
15-JAN-2017 09:10:37 * service_update * orcl * 0
15-JAN-2017 09:12:08 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)(CID=(PROGRAM=batchMain)(HOST=card-paopi)(USER=cardbt))) * (ADDRESS=(PROTOCOL=tcp)(HOST=32.113.55.100)(
PORT=32467)) * establish * orcl * 0
15-JAN-2017 09:12:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)(CID=(PROGRAM=batchMain)(HOST=card-paopi)(USER=cardbt))) * (ADDRESS=(PROTOCOL=tcp)(HOST=32.113.55.100)(
PORT=32468)) * establish * orcl * 0
15-JAN-2017 09:14:49 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)(CID=(PROGRAM=batchMain)(HOST=card-paopi)(USER=cardbt))) * (ADDRESS=(PROTOCOL=tcp)(HOST=32.113.55.100)(
PORT=32485)) * establish * orcl * 0
TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 16-JAN-2017 08:30:59
Copyright (c) 1991, 2007, Oracle.  All rights reserved.
System parameter file is /u01/oracle/product/10.2.0/db_1/network/admin/listener.ora
Log messages written to /u01/oracle/product/10.2.0/db_1/network/log/listener.log
Trace information written to /u01/oracle/product/10.2.0/db_1/network/trace/listener.trc
Trace level is currently 0
Started with pid=6319
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=card-paopi)(PORT=1521)))
Listener completed notification to CRS on start

TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
WARNING: Subscription for node down event still pending
16-JAN-2017 08:30:59 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=card-paopi)(USER=oracle))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=169870336)) * status * 0

可以看到,监听日志是在15号9点14分以后就没有记录了,符合uptime看到的时间.

看了下awr的时间:
Instance     DB Name        Snap Id    Snap Started    Level
------------ ------------ --------- ------------------ -----
orcl         ORCL             19545 15 Jan 2017 00:00      1
                              19546 15 Jan 2017 01:00      1
                              19547 15 Jan 2017 02:00      1
                              19548 15 Jan 2017 03:00      1
                              19549 15 Jan 2017 04:00      1
                              19550 15 Jan 2017 05:00      1
                              19551 15 Jan 2017 06:00      1
                              19552 15 Jan 2017 07:00      1
                              19553 15 Jan 2017 08:00      1
                              19554 15 Jan 2017 09:00      1


                              19555 16 Jan 2017 08:42      1
                              19556 16 Jan 2017 10:00      1
可以看到,awr也是在15号9点以后出现了中断.

再看oracle alert日志里面的信息:
Sun Jan 15 05:03:14 2017
Thread 1 cannot allocate new log, sequence 34007
Checkpoint not complete
  Current log# 1 seq# 34006 mem# 0: /u01/oracle/oradata/orcl/redo01.log
Sun Jan 15 05:03:16 2017
Thread 1 advanced to log sequence 34007 (LGWR switch)
  Current log# 2 seq# 34007 mem# 0: /u01/oracle/oradata/orcl/redo02.log
Mon Jan 16 08:31:08 2017

可以看到,在15号五点以后就没有任何日志输出了.由于该库正常情况下比较空闲,所以5点以后没用redo切换是正常情况.
但没有抛出任何错误,oracle就停掉了.说明oracle是意外中止的.

通过上述listener.log,awr情况,alert日志情况,可以推断出,oracle数据库是在1月15号9点14分之后被意外中止了.,这个时间点也符合uptime 所看到的服务器运行时间.
但是,却与last reboot所看到的时间冲突.
我只好认为是last reboot命令显示的信息有误.
于是我希望在操作系统日志中寻找到蛛丝马迹:
查看/var/log/messages
Jan 15 01:03:41 card-paopi kernel: Initializing cgroup subsys cpuset
Jan 15 01:03:41 card-paopi kernel: Initializing cgroup subsys cpu
Jan 15 01:03:41 card-paopi kernel: Linux version 2.6.32-220.el6.x86_64 ([email protected]) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed 
Nov 9 08:03:13 EST 2011
Jan 15 01:03:41 card-paopi kernel: Command line: ro root=/dev/mapper/vg_cardpaopi-LogVol00 rd_LVM_LV=vg_cardpaopi/LogVol00 LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-su
n16 rhgb crashkernel=128M rd_NO_LUKS  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
Jan 15 01:03:41 card-paopi kernel: KERNEL supported cpus:
Jan 15 01:03:41 card-paopi kernel:  Intel GenuineIntel
Jan 15 01:03:41 card-paopi kernel:  AMD AuthenticAMD
Jan 15 01:03:41 card-paopi kernel:  Centaur CentaurHauls
Jan 15 01:03:41 card-paopi kernel: Disabled fast string operations
Jan 15 01:03:41 card-paopi kernel: BIOS-provided physical RAM map:
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 0000000000100000 - 00000000bfef0000 (usable)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000bfef0000 - 00000000bfeff000 (ACPI data)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000bfeff000 - 00000000bff00000 (ACPI NVS)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000bff00000 - 00000000c0000000 (usable)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 00000000fffe0000 - 0000000100000000 (reserved)
Jan 15 01:03:41 card-paopi kernel: BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
Jan 15 01:03:41 card-paopi kernel: DMI present.
Jan 15 01:03:41 card-paopi kernel: SMBIOS version 2.4 @ 0xF68F0
Jan 15 01:03:41 card-paopi kernel: Phoenix BIOS detected: BIOS may corrupt low RAM, working around it.

在15号1点3分左右,确实出现了服务器重启的情况,符合last reboot看到的时间点.
紧接着,我查看15号9点14分左右是否有日志信息:
Jan 15 01:04:17 card-paopi abrtd: Email was sent to: root@localhost
Jan 15 01:04:18 card-paopi abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2014-09-11-08:38:41-16883
Jan 15 01:04:18 card-paopi abrtd: Dump directory is a duplicate of /var/spool/abrt/ccpp-2014-09-11-08:38:41-16883
Jan 15 01:04:18 card-paopi abrtd: Deleting dump directory ccpp-2017-01-15-01:03:57-2270 (dup of ccpp-2014-09-11-08:38:41-16883), sending dbus signal
Jan 15 05:00:02 card-paopi kernel: dataInsert[2964]: segfault at 88 ip 00007f3fe6764777 sp 00007fff8425c290 error 4 in libclntsh.so.10.1[7f3fe6435000+1230000]
Jan 15 05:00:02 card-paopi abrt[2965]: saved core dump of pid 2964 (/home/cardbt/bin/dataInsert) to /var/spool/abrt/ccpp-2017-01-15-05:00:02-2964.new/coredump (24805376 bytes)
Jan 15 05:00:02 card-paopi abrtd: Directory 'ccpp-2017-01-15-05:00:02-2964' creation detected
Jan 15 05:00:02 card-paopi abrtd: Executable '/home/cardbt/bin/dataInsert' doesn't belong to any package
Jan 15 05:00:02 card-paopi abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2017-01-15-05:00:02-2964 (res:2), deleting
Jan 15 05:00:03 card-paopi kernel: dataProcess[2971]: segfault at 88 ip 00007f8cd739f777 sp 00007fffae483a60 error 4 in libclntsh.so.10.1[7f8cd7070000+1230000]
Jan 15 05:00:03 card-paopi abrt[2972]: saved core dump of pid 2971 (/home/cardbt/bin/dataProcess) to /var/spool/abrt/ccpp-2017-01-15-05:00:03-2971.new/coredump (24805376 bytes)
Jan 15 05:00:03 card-paopi abrtd: Directory 'ccpp-2017-01-15-05:00:03-2971' creation detected
Jan 15 05:00:03 card-paopi abrtd: Executable '/home/cardbt/bin/dataProcess' doesn't belong to any package
Jan 15 05:00:03 card-paopi abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2017-01-15-05:00:03-2971 (res:2), deleting
Jan 15 23:59:01 card-paopi kernel: daySwitch[4414]: segfault at 88 ip 00007fc3947ab777 sp 00007fff49046430 error 4 in libclntsh.so.10.1[7fc39447c000+1230000]
Jan 15 23:59:01 card-paopi abrt[4415]: saved core dump of pid 4414 (/home/cardbt/bin/daySwitch) to /var/spool/abrt/ccpp-2017-01-15-23:59:01-4414.new/coredump (24764416 bytes)
Jan 15 23:59:01 card-paopi abrtd: Directory 'ccpp-2017-01-15-23:59:01-4414' creation detected
Jan 15 23:59:01 card-paopi abrtd: Executable '/home/cardbt/bin/daySwitch' doesn't belong to any package
Jan 15 23:59:01 card-paopi abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2017-01-15-23:59:01-4414 (res:2), deleting
Jan 16 00:05:01 card-paopi kernel: annualFee[4460]: segfault at 88 ip 00007f7c0cdd6777 sp 00007fff5fcb2eb0 error 4 in libclntsh.so.10.1[7f7c0caa7000+1230000]
Jan 16 00:05:02 card-paopi abrt[4461]: saved core dump of pid 4460 (/home/cardbt/bin/annualFee) to /var/spool/abrt/ccpp-2017-01-16-00:05:01-4460.new/coredump (24801280 bytes)
Jan 16 00:05:02 card-paopi abrtd: Directory 'ccpp-2017-01-16-00:05:01-4460' creation detected
Jan 16 00:05:02 card-paopi abrtd: Executable '/home/cardbt/bin/annualFee' doesn't belong to any package
Jan 16 00:05:02 card-paopi abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2017-01-16-00:05:01-4460 (res:2), deleting
Jan 16 03:24:02 card-paopi pcscd: winscard.c:309:SCardConnect() Reader E-Gate 0 0 Not Found

然后发现,15号那天,根本没有任何9点时间段的日志信息.

又看了下boot.log的生成时间信息:
[root@card-paopi log]# ls -l boot.log 
-rw-r--r-- 2 root root 2822  1? 15 01:03 boot.log

依然是15号1点零3分生成的,也就是说,根本没有9点中服务器重启所产生的信息!
----------------------------------------------------------------------------------------------------------------------------------
总述:
uptime命令,和oracle中awr,alert日志,listener.ora日志都可以看出,操作系统应该是在15号9点14分左右重启的
但last reboot命令,和messages日志中的信息,boot.log产生日期,却看不到9点时间段出现系统重启的情况.

我只好怀疑是redhat 6.2上面的bug了.
----------------------------------------------------------------------------------------------------------------------------------
接着:
我用top命令看了下服务器负载情况:
Tasks: 228 total,   4 running, 224 sleeping,   0 stopped,   0 zombie
Cpu(s): 23.8%us,  7.3%sy,  0.0%ni, 68.7%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   3925936k total,  2196992k used,  1728944k free,    20468k buffers
Swap:  4095992k total,        0k used,  4095992k free,  1518040k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                             
 1577 dbus      20   0 22500 2068  824 S 29.7  0.1  31:32.00 dbus-daemon                                                                                                         
 7050 root      20   0 40680 2724 2140 S 21.5  0.1  21:47.95 udisks-daemon                                                                                                       
 6908 query     20   0 21964 1424  628 R  8.3  0.0   9:00.05 dbus-daemon                                                                                                         
 6899 query     20   0  244m 6864 5388 S  2.3  0.2   2:02.74 gnome-session                                                                                                       
 7059 query     20   0  288m  10m 7700 S  1.7  0.3   1:49.54 gpk-update-icon                                                                                                     
13933 query     20   0  142m 3244 2672 R  1.7  0.1   0:00.05 gvfs-gdu-volume                                                                                                     
 6989 query     20   0  288m 8232 6208 S  1.3  0.2   1:15.42 seahorse-daemon                                                                                                     
 7039 query     20   0  296m  10m 8048 S  1.3  0.3   1:25.16 nm-applet                                                                                                           
 7064 query     20   0  289m 9908 7492 S  1.3  0.3   1:19.41 gnome-power-man                                                                                                     
 2212 root      20   0 45140 2568 2044 S  1.0  0.1   0:21.85 devkit-power-da                                                                                                     
 7018 query     20   0  458m 133m  10m S  1.0  3.5   1:04.64 gnome-panel                                                                                                         
 7022 query     20   0  440m  35m  10m S  1.0  0.9   0:52.65 nautilus                                                                                                            
 7071 query     20   0  257m 8880 6648 S  1.0  0.2   0:51.28 rhsm-icon                                                                                                           
12359 oracle    20   0 15224 1356  964 R  1.0  0.0   0:00.81 top                  

发现dbus-daemon  和 udisks-daemon  一直占据着较高的cpu,在生产环境中找了一台同是redhat 6.2的服务器,发现没有该问题.
百度下,发现网上有类似问题,是由于为了更好的支持中文,将LANG="zh_CN.UTF-8"改成了 LANG="zh_CN.GBK"                                                                                                                                                                                           
我看下本机的LANG情况:
[root@card-paopi log]# echo $LANG
zh_CN.GB18030
应该是这个问题.
但在网上并未看到该问题会导致服务器异常重启的案例.
-------------------------------------------------------------------------------------------------------------------------------------

你可能感兴趣的:(linux类)