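Using the kernel crash dump service (kdump)
Kernel crashes are rare over a Linux career, but they do happen.
Here is a brief description of how the kdump service works and what it is for (adapted from online sources).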
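Kdump is a crash dump capture mechanism: it captures the crash dump produced when a kernel crashes. Kdump needs two kernels with different purposes. One is referred to here as the standard (production) kernel, the other as the crash (capture) kernel.
The standard (production) kernel is the kernel in normal use. When it crashes, kdump switches to the crash kernel. In short, the standard kernel is the one that crashes while running, and the crash (capture) kernel is the one used to capture the crash dump that the production kernel leaves behind.
The dump is captured in the context of the freshly booted crash (capture) kernel, not in the context of the standard kernel.
Concretely, when the standard kernel crashes, kdump uses kexec (introduced below) to boot straight into the crash kernel. For this to work, the standard kernel reserves a region of memory at boot time, and that region is used to boot the crash kernel.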
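The kdump mechanism has two main components: kdump and kexec.
What is kexec?
kexec is a fast-boot mechanism: from within a running kernel it boots a new kernel (here, the crash kernel) without going back through the BIOS. BIOS initialization is usually slow, and it is slower still on large servers with many attached peripherals.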
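The reserved memory region mentioned above is normally requested with the crashkernel= parameter on the kernel command line. As a minimal sketch (the function name crashkernel_param is made up for illustration), the following shell function pulls that value out of a command-line string, so you can see whether a reservation was requested at all:

```shell
# Hypothetical helper: print the value of crashkernel= found in a kernel
# command line string; return non-zero if the parameter is absent.
crashkernel_param() {
    for word in $1; do
        case "$word" in
            crashkernel=*)
                printf '%s\n' "${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

On a live system it could be called as crashkernel_param "$(cat /proc/cmdline)"; no output and a non-zero exit status would suggest that no memory was reserved for the crash kernel.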
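Today one of our servers running RHEL 6.4 (64-bit) suddenly hung, and the developers wanted to know exactly what had happened.
The system log showed the following: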
Aug 30 16:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 30 21:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 30 22:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 30 23:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 31 00:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 31 01:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 31 02:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Aug 31 03:53:33 localhost kernel: [Hardware Error]: Machine check events logged
Sep 2 10:07:46 localhost kernel: imklog 5.8.10, log source = /proc/kmsg started.
Clearly there is nothing useful here.
So I thought of the kdump service.
I checked the state of the service:
# chkconfig --list|grep kdump
kdump 0:off 1:off 2:on 3:on 4:on 5:on 6:off
So it is enabled. Then:
[cai@localhost ~]$ ll /var/crash/
total 0
Evidently the service had not done its job: no dump was written.
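chkconfig only shows that the service is scheduled to start, not that a capture kernel is actually loaded. One hedged way to check, assuming the kernel exposes /sys/kernel/kexec_crash_loaded (which contains 1 when a crash kernel is loaded), is sketched below. The function name and its optional path argument are illustrative, not part of any tool:

```shell
# Hypothetical helper: succeed only if a crash (capture) kernel is loaded.
# $1 optionally overrides the flag file, which makes the check easy to test.
kdump_armed() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}
```

Running kdump_armed && echo "capture kernel loaded" || echo "no capture kernel loaded" on the affected machine would show whether kdump was armed. This is only one possible explanation for the empty directory; a hardware-level hang can also prevent a dump from being written.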
Having run into this, I dug into what the service does and how to use it. What follows is the whole of my test procedure.
[root@localhost ~]# rpm -qa|grep crash
crash-trace-command-1.0-4.el6.x86_64
crash-gcore-command-1.0-3.el6.x86_64
crash-6.1.0-1.el6.x86_64
Installing from source packages is also supported; I won't go into that here.
[root@localhost ~]# rpm -qa|grep kexec
kexec-tools-2.0.0-258.el6.x86_64
[root@localhost ~]# uname -r // show the kernel version
2.6.32-358.el6.x86_64
[root@localhost ~]# rpm -qa|grep $(uname -r)
kernel-debuginfo-common-x86_64-2.6.32-358.el6.x86_64 // the versions of these kernel packages must all match
kernel-debuginfo-2.6.32-358.el6.x86_64
kernel-2.6.32-358.el6.x86_64
***************************************************************
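Note: if the debug kernel does not match the version of the kernel that was actually running, an error like the following appears:
[root@localhost ~]# cd - && crash vmcore vmlinux // the arguments are not explained yet; this only demonstrates the error, and they are covered later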
/var/crash/127.0.0.1-2013-09-02-01:22:37
crash 6.1.0-1.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
crash: page excluded: kernel virtual address: ffffffff81c42d70 type: "pglist node_id" // this rather bizarre error
***************************************************************
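The version requirement in the note above can be checked mechanically. The sketch below (has_matching_debuginfo is a made-up name) takes a kernel release string followed by a list of installed package names, and succeeds only if the list contains the kernel-debuginfo package for exactly that release:

```shell
# Hypothetical helper: $1 is a kernel release (as printed by uname -r),
# the remaining arguments are installed package names.
has_matching_debuginfo() {
    release="$1"
    shift
    for pkg in "$@"; do
        if [ "$pkg" = "kernel-debuginfo-$release" ]; then
            return 0
        fi
    done
    return 1
}
```

A possible call is has_matching_debuginfo "$(uname -r)" $(rpm -qa | grep kernel-debuginfo), which reuses the rpm query shown earlier.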
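Next, trigger a kernel crash deliberately:
# echo "c" > /proc/sysrq-trigger
kdump then goes through its sequence of steps and the machine reboots. I can't easily take screenshots of this, so try it yourself!
After that, log in to the server and take a look: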
[root@localhost ~]# ll /var/crash/
total 4
drwxr-xr-x. 2 root root 4096 Sep 2 05:51 127.0.0.1-2013-09-02-05:50:47
One dump has now been saved.
[root@localhost ~]# cd /var/crash/127.0.0.1-2013-09-02-05\:50\:47/
[root@localhost 127.0.0.1-2013-09-02-05:50:47]# ll
total 43380
-rw-------. 1 root root 44333564 Sep 2 05:51 vmcore // the dump file
-rw-r--r--. 1 root root 78407 Sep 2 05:50 vmcore-dmesg.txt // kernel log (dmesg) saved from the crashed kernel
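Judging from the directory name, each crash gets its own timestamped directory, so after several crashes it helps to pick out the most recent dump. A small sketch, assuming the /var/crash/<directory>/vmcore layout shown above and directory names without newlines (latest_vmcore is an illustrative name):

```shell
# Hypothetical helper: print the path of the most recently modified vmcore
# under the crash directory ($1, defaulting to /var/crash).
latest_vmcore() {
    crash_root="${1:-/var/crash}"
    # ls -t sorts by modification time, newest first; keep only the first hit.
    ls -t "$crash_root"/*/vmcore 2>/dev/null | head -n 1
}
```

If no dump exists the function prints nothing, which matches the empty /var/crash seen before the test crash.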
Then use the crash tool to analyze the dump file:
[root@localhost 127.0.0.1-2013-09-02-05:50:47]# crash /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux vmcore
*********************************************************
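Note: in the command above, /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux
is the path to the debug kernel,
and vmcore is the path to the crash dump file.
The command syntax is: crash <debug kernel> <crash dump file>
The debug kernel is provided by the kernel-debuginfo-2.6.32-358.el6.x86_64 package.
Always remember: the debug kernel must be exactly the same version as the kernel in use, that is, the kernel that produced the dump.
For example, on my system: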
[root@localhost 127.0.0.1-2013-09-02-05:50:47]# rpm -qa|grep $(uname -r)
kernel-debuginfo-common-x86_64-2.6.32-358.el6.x86_64 // the x86_64-2.6.32-358 parts must be identical across these packages
kernel-debuginfo-2.6.32-358.el6.x86_64
kernel-2.6.32-358.el6.x86_64
[root@localhost 127.0.0.1-2013-09-02-05:50:47]# uname -r
2.6.32-358.el6.x86_64 // 2.6.32-358.el6.x86_64 must match the RPM packages above
*********************************************************
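Since the debug kernel path in the command above is built from the kernel release, it can be derived rather than typed by hand. A minimal sketch (vmlinux_for is a made-up name; the path pattern is simply the one used above):

```shell
# Hypothetical helper: print the debuginfo vmlinux path for a kernel release.
vmlinux_for() {
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}
```

For example, crash "$(vmlinux_for "$(uname -r)")" vmcore would work when the kernel now running is the same release as the one that crashed; otherwise pass the release of the crashed kernel instead of $(uname -r).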
crash 6.1.0-1.el6
Copyright (C) 2002-2012 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
crash: page excluded: kernel virtual address: ffffffffffffffff type: "possible"
WARNING: cannot read cpu_possible_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "present"
WARNING: cannot read cpu_present_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "online"
WARNING: cannot read cpu_online_map
WARNING: kernel version inconsistency between vmlinux and dumpfile
crash: page excluded: kernel virtual address: ffffffffffffffff type: "cpu_present_map"
crash: page excluded: kernel virtual address: ffffffffffffffff type: "cpu_present_map"
crash: page excluded: kernel virtual address: ffffffffffffffff type: "cpu_online_map"
KERNEL: /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 1
DATE: Mon Sep 2 05:50:42 2013
UPTIME: 03:42:31
LOAD AVERAGE: 0.00, 0.02, 0.00
TASKS: 127
NODENAME: localhost.localdomain
RELEASE: 2.6.32-358.el6.x86_64
VERSION: #1 SMP Fri Feb 22 00:31:26 UTC 2013
MACHINE: x86_64 (2128 Mhz)
MEMORY: 2 GB
PANIC: "Oops: 0002 [#1] SMP " (check log for details)
PID: 2390
COMMAND: "bash"
TASK: ffff88007a9b0080 [THREAD_INFO: ffff88007b824000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
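crash> // on success you are left at this prompt
A simple example, looking at the files a process has open:
crash> files // with no argument, shows the files of the current context: the task that triggered the panic, bash with PID 2390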
PID: 2390 TASK: ffff88007a9b0080 CPU: 0 COMMAND: "bash"
ROOT: / CWD: /root
FD FILE DENTRY INODE TYPE PATH
0 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
1 ffff88007d7bc800 ffff88007ca43b40 ffff88007bd32578 REG /proc/sysrq-trigger
2 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
10 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
255 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
Or give a PID to see which files that process had open:
crash> files 2390
PID: 2390 TASK: ffff88007a9b0080 CPU: 0 COMMAND: "bash"
ROOT: / CWD: /root
FD FILE DENTRY INODE TYPE PATH
0 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
1 ffff88007d7bc800 ffff88007ca43b40 ffff88007bd32578 REG /proc/sysrq-trigger
2 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
10 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
255 ffff88007d96d600 ffff88007cb65440 ffff880077f34878 CHR /dev/pts/0
crash> files 1 // PID 1 is the init process; this shows the files init had open
PID: 1 TASK: ffff88007e4c3500 CPU: 0 COMMAND: "init"
ROOT: / CWD: /
FD FILE DENTRY INODE TYPE PATH
0 ffff88007afe0c00 ffff88007eb18680 ffff88007e4f1108 CHR /null
1 ffff88007afe0c00 ffff88007eb18680 ffff88007e4f1108 CHR /null
2 ffff88007afe0c00 ffff88007eb18680 ffff88007e4f1108 CHR /null
3 ffff880037e0cc00 ffff88007eb1b300 ffff88007b0c6148 FIFO
4 ffff880037e0cb40 ffff88007eb1b300 ffff88007b0c6148 FIFO
5 ffff88007afe0d80 ffff88007b0a3b40 ffff88007b0db7f8 DIR inotify
6 ffff88007afe0840 ffff88007b0a3b40 ffff88007b0db7f8 DIR inotify
7 ffff88007afe0240 ffff88007eb28780 ffff88007b0f4c88 SOCK
9 ffff880037e0c6c0 ffff880077d33140 ffff88007b219d08 SOCK
crash> ps // the ps command lists all processes that existed when the kernel crashed
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff81a8d020 RU 0.0 0 0 [swapper]
1 0 0 ffff88007e4c3500 IN 0.1 19356 1448 init
2 0 0 ffff88007e4c2aa0 IN 0.0 0 0 [kthreadd]
3 2 0 ffff88007e4c2040 IN 0.0 0 0 [migration/0]
4 2 0 ffff88007e4d1540 IN 0.0 0 0 [ksoftirqd/0]
5 2 0 ffff88007e4d0ae0 IN 0.0 0 0 [migration/0]
6 2 0 ffff88007e4d0080 IN 0.0 0 0 [watchdog/0]
7 2 0 ffff88007e503500 IN 0.0 0 0 [events/0]
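The PANIC line in the summary above says to check the log for details. In crash, the log command prints the kernel message buffer and bt prints the stack backtrace of the current (panicking) task. To collect these without typing them interactively, one option is to put the commands in a file and hand that file to crash. The sketch below only writes such a file; the function name and file name are illustrative, and the -i option in the usage note is an assumption that should be checked against the help output of your crash version.

```shell
# Hypothetical helper: write a list of crash commands to the file named in $1.
write_crash_cmds() {
    cat > "$1" <<'EOF'
bt
log
ps
exit
EOF
}
```

After write_crash_cmds cmds.txt, something like crash -i cmds.txt /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux vmcore > report.txt should leave the backtrace, kernel log, and process list in report.txt.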
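So, in effect, the kdump service takes an image of memory at the moment the kernel crashes
and saves that image to a dump file; after the next boot you can search the dump file for clues.
It's time to leave work, so I'll stop here. Interested readers can head over to: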
http://people.redhat.com/anderson/crash_whitepaper/
####################################
This article is the author's original work
Author: john
Please credit the source when reposting