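How to tell whether a Kernel Panic has occurred. The steps below use CentOS 7.9 as the example.
# Check whether a directory has been created under /var/crash. After a Kernel Panic, kdump writes a new directory there, so finding one means a Kernel Panic has occurred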
ls /var/crash
#/var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore
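When several dumps have accumulated, it can help to list only the vmcore files rather than the directories. The following is a minimal sketch; list_vmcores is a hypothetical helper name, and it assumes the <ip>-<date>-<time>/vmcore layout shown above.

```shell
# list_vmcores is a hypothetical helper, not part of kdump or crash.
# It prints every vmcore found exactly one level below the crash directory.
list_vmcores() {
    crash_dir=${1:-/var/crash}
    # Assumed layout: <crash_dir>/<ip>-<date>-<time>/vmcore
    find "$crash_dir" -mindepth 2 -maxdepth 2 -type f -name vmcore 2>/dev/null | sort
}

# Prints nothing if no dump has been written yet.
list_vmcores /var/crash
```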
# Once the Kernel Panic dump file (vmcore) exists, analysing it requires the matching tools. The steps are as follows
# Install crash
yum install crash
# Check the kernel version
uname -r
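# Download the kernel debuginfo packages. 3.10.0-693.el7.x86_64 is the version that uname -r reported here; substitute your own, because the debuginfo version must match the kernel that crashed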
wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-693.el7.x86_64.rpm
wget http://linuxsoft.cern.ch/centos-debuginfo/7/x86_64/kernel-debuginfo-3.10.0-693.el7.x86_64.rpm
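# If the download is slow, open these sites in a browser and download the files from there
# After downloading, install both rpm packages with rpm -ivh xxx.rpm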
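Because both package names are derived from the kernel release, the URLs can be built from uname -r instead of typed by hand. This is only a sketch: debuginfo_urls is a hypothetical helper, and it assumes both packages are served from the debuginfo.centos.org path used above and that the architecture is x86_64.

```shell
# debuginfo_urls is a hypothetical helper.
# Assumption: both rpms live under the debuginfo.centos.org path used above.
debuginfo_urls() {
    release=${1:-$(uname -r)}   # e.g. 3.10.0-693.el7.x86_64
    base=http://debuginfo.centos.org/7/x86_64
    printf '%s/kernel-debuginfo-common-x86_64-%s.rpm\n' "$base" "$release"
    printf '%s/kernel-debuginfo-%s.rpm\n' "$base" "$release"
}

# Print the two URLs for the release used in this article.
debuginfo_urls 3.10.0-693.el7.x86_64
```

The printed lines can be passed to wget one by one, or pasted into a browser if wget is slow.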
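# Once they are installed, running crash with no arguments (which attaches to the live system) should print something like the following: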
[root@localhost vmcore]# crash
crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [184MB]: patching 87476 gdb minimal_symbol values
KERNEL: /usr/lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux
DUMPFILE: /dev/crash
CPUS: 12
DATE: Mon Dec 4 10:10:19 2023
UPTIME: 00:13:14
LOAD AVERAGE: 0.29, 0.32, 0.29
TASKS: 987
NODENAME: localhost.localdomain
RELEASE: 3.10.0-1160.88.1.el7.x86_64
VERSION: #1 SMP Tue Mar 7 15:41:52 UTC 2023
MACHINE: x86_64 (2096 Mhz)
MEMORY: 15.4 GB
PID: 4240
COMMAND: "crash"
TASK: ffff9e0d1eefc200 [THREAD_INFO: ffff9e0d083f4000]
CPU: 0
STATE: TASK_RUNNING (ACTIVE)
crash>
This output is expected. The dump can now be opened by passing crash the debug vmlinux and the vmcore:
crash /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore
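The path /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore above
is the dump file inside the directory that was created after the kernel panic.
Opening it with crash shows the cause of the kernel panic.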
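The command line above can be assembled from the kernel release and the newest dump. The sketch below only prints the command rather than running it; newest_vmcore and crash_command are hypothetical helpers, and it assumes the dump came from the kernel release you pass in (after a reboot into a different kernel, uname -r would be the wrong value).

```shell
# newest_vmcore is a hypothetical helper: it prints the most recently
# modified vmcore one level below the given directory (default /var/crash).
newest_vmcore() {
    ls -t "${1:-/var/crash}"/*/vmcore 2>/dev/null | head -n 1
}

# crash_command is a hypothetical helper: it prints (does not run) the
# crash invocation for a given kernel release and dump file.
crash_command() {
    release=$1   # release of the kernel that crashed
    vmcore=$2    # path to the dump file
    printf 'crash /usr/lib/debug/lib/modules/%s/vmlinux %s\n' "$release" "$vmcore"
}

core=$(newest_vmcore /var/crash)
if [ -n "$core" ]; then
    # Assumption: the running kernel is the same release as the one that crashed.
    crash_command "$(uname -r)" "$core"
fi
```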
Example 1:
Creating a kernel panic on purpose
The following command triggers one directly. The system reboots within a few seconds of running it.
echo c > /proc/sysrq-trigger
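A vmcore is only written if a crash kernel was loaded before the panic, so it may be worth checking that first. The sketch below reads /sys/kernel/kexec_crash_loaded, which contains 1 when a crash kernel is loaded; kdump_loaded is a hypothetical helper name, and the optional file argument exists only so the check can be exercised without root.

```shell
# kdump_loaded is a hypothetical helper. It succeeds only when the flag
# file is readable and contains "1" (a crash kernel has been loaded).
kdump_loaded() {
    flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if kdump_loaded; then
    echo "crash kernel loaded: a panic should produce a vmcore"
else
    echo "no crash kernel loaded: check 'systemctl status kdump' and the crashkernel= boot parameter"
fi
```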
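Example 2:
Triggering a panic through OOM:
As I mentioned before, my fio command ran the system out of memory and triggered the oom-killer. The kernel can be configured to panic and reboot as soon as OOM is triggered, using the following commands: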
sysctl -w vm.panic_on_oom=1
sysctl -w kernel.panic=10
echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
echo "kernel.panic=10" >> /etc/sysctl.conf
With these settings, the system panics when OOM is triggered and reboots 10 seconds later, and a new directory is created under /var/crash
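Running the two echo lines more than once leaves duplicate entries in /etc/sysctl.conf. A minimal sketch of an idempotent alternative follows; set_sysctl_line is a hypothetical helper, not a standard tool, and it replaces any existing line for the key before appending the new value.

```shell
# set_sysctl_line is a hypothetical helper: it stores "key=value" in a
# sysctl-style file, replacing an earlier line for the same key if present.
set_sysctl_line() {
    key=$1
    value=$2
    file=$3
    touch "$file"
    # Escape the dots so the key is matched literally by grep.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Keep every line that does not set this key (grep exits 1 if none remain).
    grep -v -E "^[[:space:]]*${key_re}[[:space:]]*=" "$file" > "$file.tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$file.tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$file.tmp" > "$file"
    rm -f "$file.tmp"
}

# Usage (as root), followed by "sysctl -p" to apply:
#   set_sysctl_line vm.panic_on_oom 1 /etc/sysctl.conf
#   set_sysctl_line kernel.panic 10 /etc/sysctl.conf
```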
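Here is the script I used to trigger OOM:
First the fio configuration. For why it leads to OOM, see my earlier article:
https://blog.csdn.net/weixin_44517278/article/details/131661105
Write the following configuration to fio.conf (note that filename=/dev/nvme0n1 makes fio write to the raw device, so any data on it is destroyed)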
[JEDEC-219]
ioengine=libaio
direct=1
rw=randrw
norandommap
randrepeat=0
rwmixread=40
iodepth=128
numjobs=4
bssplit=512/4:1024/1:1536/1:2048/1:2560/1:3072/1:3584/1:4k/67:8k/10:16k/7:32k/3:64k/3
blockalign=4k
random_distribution=zoned:50/5:30/15:20/80
loops=10000
filename=/dev/nvme0n1
group_reporting
write_iops_log=iops.log
write_bw_log=bw.log
write_lat_log=lat.log
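Then, to trigger it quickly, I start fio repeatedly in a for loop:
for i in {0..100}; do nohup fio fio.conf & sleep 1; done
This triggers the OOM panic quickly and the system reboots. After the reboot, /var/crash
contains a directory named with the date and time of the crash; in my test it was /var/crash/127.0.0.1-2023-12-04-09\:56\:53/vmcore.
It can then be analysed with the command described above, as follows: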
[root@localhost vmcore]# crash /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2023-12-04-09\:56\:53/vmcore
crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [932MB]: patching 87476 gdb minimal_symbol values
KERNEL: /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2023-12-04-09:56:53/vmcore [PARTIAL DUMP]
CPUS: 12
DATE: Mon Dec 4 09:56:51 2023
UPTIME: 00:11:41
LOAD AVERAGE: 60.96, 26.74, 13.81
TASKS: 915
NODENAME: localhost.localdomain
RELEASE: 3.10.0-1160.88.1.el7.x86_64
VERSION: #1 SMP Tue Mar 7 15:41:52 UTC 2023
MACHINE: x86_64 (2095 Mhz)
MEMORY: 15.4 GB
PANIC: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled"
PID: 7020
COMMAND: "fio"
TASK: ffff9b237633c200 [THREAD_INFO: ffff9b24f013c000]
CPU: 8
STATE: TASK_RUNNING (PANIC)
As shown above, the cause of the panic is reported on the line PANIC: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled"
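If the startup summary has been saved to a text file, the PANIC line can be pulled out without reading the whole banner. The following is a sketch under that assumption; panic_reason is a hypothetical helper that reads the summary on standard input.

```shell
# panic_reason is a hypothetical helper: it reads crash's startup summary
# on stdin and prints the text between the quotes on the PANIC: line.
panic_reason() {
    sed -n 's/^[[:space:]]*PANIC:[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p'
}

# Demo with a line copied from the summary above.
printf '%s\n' 'PANIC: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled"' | panic_reason
```

With several saved summaries, this makes it easy to compare panic reasons side by side.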
Example 3: a NULL pointer dereference during umount
WARNING: kernel relocated [54MB]: patching 87292 gdb minimal_symbol values
KERNEL: /lib/debug/lib/modules/3.10.0-1160.el7.x86_64/vmlinux
DUMPFILE: ./127.0.0.1-2023-10-15-12.14.31/vmcore [PARTIAL DUMP]
CPUS: 112
DATE: Mon Oct 16 00:13:16 2023
UPTIME: 2 days, 04:29:35
LOAD AVERAGE: 9.20, 8.26, 8.13
TASKS: 990
NODENAME: sh-dell01
RELEASE: 3.10.0-1160.el7.x86_64
VERSION: #1 SMP Mon Oct 19 16:18:59 UTC 2020
MACHINE: x86_64 (2000 Mhz)
MEMORY: 63.3 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
PID: 44133
COMMAND: "umount"
TASK: ffff8b476bc9b180 [THREAD_INFO: ffff8b40973d8000]
CPU: 61
STATE: TASK_RUNNING (PANIC)
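Analysis command: ps lists the processes that existed when the system crashed. Entries marked with > were running on a CPU at that moment, so they are the likeliest candidates for having caused the crash.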
crash> ps
44118 2 15 ffff8b4f4851e300 IN 0.0 0 0 [kworker/15:2]
44125 2 68 ffff8b479fb29080 IN 0.0 0 0 [kworker/68:2]
> 44133 1 61 ffff8b476bc9b180 RU 0.0 123620 1220 umount
44136 2 58 ffff8b4789091080 IN 0.0 0 0 [kworker/58:2]
44139 1 58 ffff8b476bc9c200 UN 0.0 123620 1224 umount
44141 1 59 ffff8b476bc9d280 UN 0.0 123608 996 swapoff
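On a machine with many tasks the ps listing is long, so filtering for the > marker is handy. The sketch below assumes the ps output has been saved as text and is fed on standard input; active_tasks is a hypothetical helper that prints the PID and command name of each marked task.

```shell
# active_tasks is a hypothetical helper: it reads saved "ps" output from
# crash on stdin and prints "PID COMMAND" for each line marked with ">".
active_tasks() {
    awk '$1 == ">" { print $2, $NF }'
}

# Demo with two lines copied from the listing above.
printf '%s\n' \
    '  44125 2 68 ffff8b479fb29080 IN 0.0 0 0 [kworker/68:2]' \
    '> 44133 1 61 ffff8b476bc9b180 RU 0.0 123620 1220 umount' | active_tasks
```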
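Analysis command: log prints the complete dmesg buffer as of the crash (the crash reboots the system, so this is where the dmesg from before the reboot can be read)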
crash> log
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
..............................
Analysis command: bt prints the stack backtrace of the task that was running when the system crashed
crash> bt
PID: 44133 TASK: ffff8b476bc9b180 CPU: 61 COMMAND: "umount"
#0 [ffff8b40973db980] machine_kexec at ffffffff84666294
#1 [ffff8b40973db9e0] __crash_kexec at ffffffff84722562
#2 [ffff8b40973dbab0] crash_kexec at ffffffff84722650
#3 [ffff8b40973dbac8] oops_end at ffffffff84d8b798
#4 [ffff8b40973dbaf0] no_context at ffffffff84675d14
#5 [ffff8b40973dbb40] __bad_area_nosemaphore at ffffffff84675fe2
#6 [ffff8b40973dbb90] bad_area_nosemaphore at ffffffff84676104
#7 [ffff8b40973dbba0] __do_page_fault at ffffffff84d8e750
#8 [ffff8b40973dbc10] do_page_fault at ffffffff84d8e975
#9 [ffff8b40973dbc40] page_fault at ffffffff84d8a778
[exception RIP: jbd2_superblock_csum+58]
RIP: ffffffffc06f969a RSP: ffff8b40973dbcf8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8b4778d39000 RCX: ffff8b40973dbfd8
RDX: 0000000000000000 RSI: ffff8b4778d39000 RDI: ffff8b4f72e9d800
RBP: ffff8b40973dbd28 R8: ffff8b40973dbdc8 R9: 0000000000000001
R10: 0000000000000001 R11: ffff8b47895d3200 R12: 000000000e33f513
R13: ffff8b4f72e9d800 R14: 0000000000001c11 R15: ffff8b4778d39000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8b40973dbd30] jbd2_write_superblock at ffffffffc06fa61c [jbd2]
#11 [ffff8b40973dbd70] jbd2_mark_journal_empty at ffffffffc06facbd [jbd2]
#12 [ffff8b40973dbda0] jbd2_journal_destroy at ffffffffc06faf6e [jbd2]
#13 [ffff8b40973dbe10] ext4_put_super at ffffffffc0913680 [ext4]
#14 [ffff8b40973dbe50] generic_shutdown_super at ffffffff8485051d
#15 [ffff8b40973dbe70] kill_block_super at ffffffff84850997
#16 [ffff8b40973dbe90] deactivate_locked_super at ffffffff84850cfe
#17 [ffff8b40973dbeb0] deactivate_super at ffffffff84851486
#18 [ffff8b40973dbec8] cleanup_mnt at ffffffff84870b0f
#19 [ffff8b40973dbee0] __cleanup_mnt at ffffffff84870ba2
#20 [ffff8b40973dbef0] task_work_run at ffffffff846c275b
#21 [ffff8b40973dbf30] do_notify_resume at ffffffff8462cc65
#22 [ffff8b40973dbf50] int_signal at ffffffff84d942ef
RIP: 00007f785a783a07 RSP: 00007ffc82e094e8 RFLAGS: 00000246
RAX: 0000000000000000 RBX: 00005597e34bc040 RCX: ffffffffffffffff
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005597e34c2280
RBP: 00005597e34c2280 R8: 00005597e34c21f0 R9: 0000000000000000
R10: 00007ffc82e08920 R11: 0000000000000246 R12: 00007f785b301d78
R13: 0000000000000000 R14: 00005597e34bc140 R15: 00005597e34bc040
ORIG_RAX: 00000000000000a6 CS: 0033 SS: 002b
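Note that bt prints the most recent frame first. Frames #0 to #9 are the page-fault and crash-dump path, and the instruction that actually faulted is the one reported at [exception RIP: jbd2_superblock_csum+58], which was reached through jbd2_journal_destroy and ext4_put_super while umount was tearing down an ext4 filesystem. The last line, #22 [ffff8b40973dbf50] int_signal at ffffffff84d942ef, is the outermost kernel frame, that is, the return-to-user-space path from which this work was run.
Any of these addresses can be examined further. The address I looked at was the one in frame #22, ffffffff84d942ef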
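Analysis command: dis disassembles the code at an address, and with -l it also prints the source file and line that the address maps to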
crash> dis -l ffffffff84d942ef
/usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S: 701
0xffffffff84d942ef <int_signal+18>: mov $0xfe0e,%edi
The output above maps the address to /usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S: 701
The source at that location can be viewed directly:
crash> cat -n /usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S
# output trimmed to the relevant lines.....
695 int_signal:
696 testl $_TIF_DO_NOTIFY_MASK,%edx
697 jz 1f
698 movq %rsp,%rdi # &ptregs -> arg1
699 xorl %esi,%esi # oldset -> arg2
700 call do_notify_resume
701 1: movl $_TIF_WORK_MASK,%edi
702 int_restore_rest:
703 RESTORE_REST
704 DISABLE_INTERRUPTS(CLBR_NONE)
705 TRACE_IRQS_OFF
706 jmp int_with_check
707 CFI_ENDPROC
708 END(system_call)
Awkwardly, getting this far was not much use to me, since I cannot really read this source…
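A possibly more useful address to give to dis -l is the one on the RIP: line directly under [exception RIP: ...] in the bt output, since that is where the fault was raised. The sketch below assumes the bt output has been saved as text; fault_address is a hypothetical helper that prints that address.

```shell
# fault_address is a hypothetical helper: it reads saved "bt" output on
# stdin and prints the RIP value that follows the "[exception RIP: ...]" line.
fault_address() {
    awk '/\[exception RIP:/ { seen = 1; next }
         seen && $1 == "RIP:" { print $2; exit }'
}

# Demo with two lines copied from the backtrace above.
printf '%s\n' \
    '    [exception RIP: jbd2_superblock_csum+58]' \
    '    RIP: ffffffffc06f969a  RSP: ffff8b40973dbcf8  RFLAGS: 00010246' | fault_address
# The printed address can then be used inside crash, e.g.  crash> dis -l <address>
```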
That is all I have recorded for now… I will add more Kernel Panic cases here as I run into them
References:
https://blog.csdn.net/linuxvfast/article/details/116591523
https://blog.csdn.net/weixin_45030965/article/details/124960224