A Simple Procedure for Analyzing Reboots Caused by a Linux Kernel Panic

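Detecting a Kernel Panic on a Linux System

How to tell whether a kernel panic has occurred, using CentOS 7.9 as the example

# Check whether a directory has been created under /var/crash. With kdump enabled, a kernel panic leaves a new directory here, so its presence indicates that a panic occurred.
ls /var/crash
# Example: /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore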
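When several dumps have accumulated, it can help to list every vmcore at once. The following is a minimal sketch, assuming the default layout shown above (one directory per crash, each containing a file named vmcore); the function name list_vmcores is made up for this illustration.

```shell
#!/bin/sh
# list_vmcores (hypothetical helper): print every vmcore found one level
# below the crash directory, one path per line. Directory names embed the
# crash timestamp, so for a single host the sorted list is chronological.
list_vmcores() {
    crash_dir="${1:-/var/crash}"
    [ -d "$crash_dir" ] || return 1
    find "$crash_dir" -mindepth 2 -maxdepth 2 -type f -name vmcore | sort
}
```

Calling list_vmcores with no argument scans /var/crash; appending `| tail -n 1` gives the most recent dump.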

How to set up the debugging environment

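# With the vmcore in hand, the matching tools are needed to analyze it. The steps are:
#   Install crash
yum install crash
#   Check the kernel version
uname -r
# Download the kernel debuginfo packages. 3.10.0-693.el7.x86_64 here stands for the version reported by uname -r; substitute your own. The debuginfo version must match the kernel that produced the vmcore exactly (the sample output later in this article comes from 3.10.0-1160.88.1.el7.x86_64).
wget http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-693.el7.x86_64.rpm
wget http://linuxsoft.cern.ch/centos-debuginfo/7/x86_64/kernel-debuginfo-3.10.0-693.el7.x86_64.rpm
# If wget is slow, download the two files from these sites in a browser instead.
# After downloading, install both packages with rpm -ivh in a single command, since kernel-debuginfo depends on kernel-debuginfo-common.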
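Because the package version has to match the kernel exactly, the file names can be derived from the kernel release rather than typed by hand. This is a small sketch under the assumption that the packages follow the naming pattern of the two URLs above; debuginfo_rpms is a hypothetical helper name, and the function only prints names without downloading anything.

```shell
#!/bin/sh
# debuginfo_rpms (hypothetical helper): print the two debuginfo package
# file names for a kernel release string such as 3.10.0-693.el7.x86_64.
# With no argument, the release of the running kernel is used.
debuginfo_rpms() {
    release="${1:-$(uname -r)}"
    arch="${release##*.}"    # text after the last dot, e.g. x86_64
    printf 'kernel-debuginfo-common-%s-%s.rpm\n' "$arch" "$release"
    printf 'kernel-debuginfo-%s.rpm\n' "$release"
}
```

For a vmcore copied from another machine, pass that machine's kernel release instead of relying on uname -r.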

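# Once installed, running crash with no arguments attaches to the live kernel (note DUMPFILE: /dev/crash below) and should print something like this: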
[root@localhost vmcore]# crash

crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [184MB]: patching 87476 gdb minimal_symbol values

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux
    DUMPFILE: /dev/crash
        CPUS: 12
        DATE: Mon Dec  4 10:10:19 2023
      UPTIME: 00:13:14
LOAD AVERAGE: 0.29, 0.32, 0.29
       TASKS: 987
    NODENAME: localhost.localdomain
     RELEASE: 3.10.0-1160.88.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 7 15:41:52 UTC 2023
     MACHINE: x86_64  (2096 Mhz)
      MEMORY: 15.4 GB
         PID: 4240
     COMMAND: "crash"
        TASK: ffff9e0d1eefc200  [THREAD_INFO: ffff9e0d083f4000]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)

crash> 

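This output is normal. Exit with q, then move on to the next step and open the dump by passing crash the debug vmlinux and the vmcore: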

crash /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore

Here /var/crash/127.0.0.1-2023-12-04-08\:57\:47/vmcore is the dump file inside the directory created after the kernel panic.
Opening it with crash shows the cause of the panic.
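For a quick first look, crash can also be run non-interactively. The sketch below writes a small command file and shows, in a comment, how it could be passed to crash with the -i (read commands from a file) and -s (silent start-up) options; write_crash_cmds, the choice of commands, and the /tmp paths are assumptions made for this illustration.

```shell
#!/bin/sh
# write_crash_cmds (hypothetical helper): create a crash command file that
# prints the system summary, the backtrace of the panic task and the
# kernel log, then exits so that crash does not wait for input.
write_crash_cmds() {
    cmd_file="$1"
    printf '%s\n' 'sys' 'bt' 'log' 'quit' > "$cmd_file"
}

# Usage (not executed here; VMLINUX and VMCORE are placeholders for the
# debug vmlinux and the dump file):
#   write_crash_cmds /tmp/crash.cmds
#   crash -s -i /tmp/crash.cmds "$VMLINUX" "$VMCORE" > /tmp/crash-report.txt
```

The resulting report is plain text, so it can be searched with grep or attached to a ticket.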

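Example 1:
Create a kernel panic scenario
Run as root, the following command triggers a panic directly, and the system reboots within a few seconds (a vmcore is only saved if kdump is active):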

echo c > /proc/sysrq-trigger 
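Since the panic is only useful for this exercise if a dump gets written, it may be worth checking that the kdump crash kernel is loaded before pulling the trigger. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded (which reads 1 when a crash kernel is loaded); both function names are made up here.

```shell
#!/bin/sh
# crash_kernel_loaded (hypothetical helper): succeed only if the state
# file contains "1". The path is a parameter so the check can be tried
# out against an ordinary file.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}

# trigger_test_panic (hypothetical helper): crash the machine on purpose,
# but refuse when no crash kernel is loaded. Must be run as root.
trigger_test_panic() {
    if crash_kernel_loaded; then
        echo c > /proc/sysrq-trigger
    else
        echo "kdump crash kernel not loaded; refusing to panic" >&2
        return 1
    fi
}
```

Nothing happens until trigger_test_panic is called explicitly, so the file can be sourced safely.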

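Example 2:
Trigger a panic through OOM:
As I mentioned in an earlier post, one of my fio runs exhausted memory and invoked the oom-killer. The kernel can be configured to panic and reboot when OOM occurs instead; the commands are: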

sysctl -w vm.panic_on_oom=1
sysctl -w kernel.panic=10
echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
echo "kernel.panic=10" >> /etc/sysctl.conf
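One caveat with the two echo lines above is that every re-run appends another copy to /etc/sysctl.conf. The following sketch updates an existing entry in place and appends only when the key is missing. It assumes GNU sed (for -i); set_sysctl_conf is a hypothetical name, and the key is used as a regular expression, so the dots in it match any character, which is acceptable for this illustration.

```shell
#!/bin/sh
# set_sysctl_conf (hypothetical helper): make sure "key=value" appears
# exactly once in the given file (default /etc/sysctl.conf).
set_sysctl_conf() {
    key="$1"
    value="$2"
    conf="${3:-/etc/sysctl.conf}"
    touch "$conf"
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$conf"; then
        # Key already present: rewrite that line with the new value.
        sed -i "s|^[[:space:]]*$key[[:space:]]*=.*|$key=$value|" "$conf"
    else
        printf '%s=%s\n' "$key" "$value" >> "$conf"
    fi
}
```

After editing the file, sysctl -p loads the settings from /etc/sysctl.conf without a reboot.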

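With these settings, the system panics when OOM is triggered and reboots 10 s later, and a directory is created under /var/crash
Below is the script I used to trigger OOM:
First the fio configuration. For why it leads to OOM, see my earlier article:
https://blog.csdn.net/weixin_44517278/article/details/131661105
Write the following configuration to fio.conf. Note that it reads and writes the raw device /dev/nvme0n1, so any data on that device will be destroyed.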

[JEDEC-219]
ioengine=libaio
direct=1
rw=randrw
norandommap
randrepeat=0
rwmixread=40
iodepth=128
numjobs=4
bssplit=512/4:1024/1:1536/1:2048/1:2560/1:3072/1:3584/1:4k/67:8k/10:16k/7:32k/3:64k/3
blockalign=4k
random_distribution=zoned:50/5:30/15:20/80
loops=10000

filename=/dev/nvme0n1
group_reporting
write_iops_log=iops.log
write_bw_log=bw.log
write_lat_log=lat.log

Then, to reach OOM quickly, I start fio repeatedly in a for loop:

for i in {0..100}; do nohup fio fio.conf & sleep 1; done

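This triggers the OOM panic quickly and the system reboots. After the reboot, /var/crash contains a directory named with the date and time of the crash, for example /var/crash/127.0.0.1-2023-12-04-09\:56\:53/vmcore from my test. It can then be analyzed with the command described above: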

[root@localhost vmcore]# crash /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux /var/crash/127.0.0.1-2023-12-04-09\:56\:53/vmcore

crash 7.2.3-11.el7_9.1
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [932MB]: patching 87476 gdb minimal_symbol values

      KERNEL: /lib/debug/lib/modules/3.10.0-1160.88.1.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2023-12-04-09:56:53/vmcore  [PARTIAL DUMP]
        CPUS: 12
        DATE: Mon Dec  4 09:56:51 2023
      UPTIME: 00:11:41
LOAD AVERAGE: 60.96, 26.74, 13.81
       TASKS: 915
    NODENAME: localhost.localdomain
     RELEASE: 3.10.0-1160.88.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 7 15:41:52 UTC 2023
     MACHINE: x86_64  (2095 Mhz)
      MEMORY: 15.4 GB
       PANIC: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled"
         PID: 7020
     COMMAND: "fio"
        TASK: ffff9b237633c200  [THREAD_INFO: ffff9b24f013c000]
         CPU: 8
       STATE: TASK_RUNNING (PANIC)

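As shown above, the reason appears on the PANIC line: "Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled", and the COMMAND field shows that the task running at that moment was fio.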
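The same reason can often be read without starting crash at all. The sketch below assumes that kdump saved a text copy of the kernel log next to the dump (on CentOS 7 this is normally a file named vmcore-dmesg.txt in the same directory) and that GNU grep is available; panic_reason is a hypothetical helper name.

```shell
#!/bin/sh
# panic_reason (hypothetical helper): print the first panic or oops
# message found in a saved kernel log, without the timestamp prefix.
panic_reason() {
    grep -m 1 -o \
        -e 'Kernel panic - not syncing: .*' \
        -e 'BUG: unable to handle kernel .*' \
        "$1"
}

# Usage (DUMP_DIR is a placeholder for one directory under /var/crash):
#   panic_reason "$DUMP_DIR/vmcore-dmesg.txt"
```

Only the two message prefixes that appear in this article are matched; other kinds of crashes would need additional patterns.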

Example 3:
A vmcore from another machine (sh-dell01), opened with crash in the same way:

WARNING: kernel relocated [54MB]: patching 87292 gdb minimal_symbol values

      KERNEL: /lib/debug/lib/modules/3.10.0-1160.el7.x86_64/vmlinux
    DUMPFILE: ./127.0.0.1-2023-10-15-12.14.31/vmcore  [PARTIAL DUMP]
        CPUS: 112
        DATE: Mon Oct 16 00:13:16 2023
      UPTIME: 2 days, 04:29:35
LOAD AVERAGE: 9.20, 8.26, 8.13
       TASKS: 990
    NODENAME: sh-dell01
     RELEASE: 3.10.0-1160.el7.x86_64
     VERSION: #1 SMP Mon Oct 19 16:18:59 UTC 2020
     MACHINE: x86_64  (2000 Mhz)
      MEMORY: 63.3 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at           (null)"
         PID: 44133
     COMMAND: "umount"
        TASK: ffff8b476bc9b180  [THREAD_INFO: ffff8b40973d8000]
         CPU: 61
       STATE: TASK_RUNNING (PANIC)

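Analysis command: ps lists the processes that existed just before the crash. Entries marked with > were active on a CPU and are the likely candidates for having caused the crash. (Excerpt; the columns are PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM.)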

crash> ps
  44118      2  15  ffff8b4f4851e300  IN   0.0       0      0  [kworker/15:2]
  44125      2  68  ffff8b479fb29080  IN   0.0       0      0  [kworker/68:2]
> 44133      1  61  ffff8b476bc9b180  RU   0.0  123620   1220  umount
  44136      2  58  ffff8b4789091080  IN   0.0       0      0  [kworker/58:2]
  44139      1  58  ffff8b476bc9c200  UN   0.0  123620   1224  umount
  44141      1  59  ffff8b476bc9d280  UN   0.0  123608    996  swapoff
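On a machine with around a thousand tasks the ps listing is long, so filtering for the > marker is handy. A minimal sketch, assuming the ps output was first saved to a text file (crash accepts shell-style redirection such as ps > ps.txt); active_tasks is a made-up name.

```shell
#!/bin/sh
# active_tasks (hypothetical helper): from saved crash "ps" output, print
# "PID COMM" for each task that was active on a CPU (marked with ">").
active_tasks() {
    awk '/^>/ { sub(/^>[ ]*/, ""); print $1, $NF }' "$1"
}
```

The command name is taken from the last column, so a name containing spaces would be cut short.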

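Analysis command: log prints the kernel message buffer (dmesg) as it was at the time of the crash. Because the crash rebooted the system, this is where the pre-reboot dmesg can be read.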

crash> log
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Initializing cgroup subsys cpuacct
..............................

Analysis command: bt shows the stack backtrace of the task that was running when the system crashed

crash> bt
PID: 44133  TASK: ffff8b476bc9b180  CPU: 61  COMMAND: "umount"
 #0 [ffff8b40973db980] machine_kexec at ffffffff84666294
 #1 [ffff8b40973db9e0] __crash_kexec at ffffffff84722562
 #2 [ffff8b40973dbab0] crash_kexec at ffffffff84722650
 #3 [ffff8b40973dbac8] oops_end at ffffffff84d8b798
 #4 [ffff8b40973dbaf0] no_context at ffffffff84675d14
 #5 [ffff8b40973dbb40] __bad_area_nosemaphore at ffffffff84675fe2
 #6 [ffff8b40973dbb90] bad_area_nosemaphore at ffffffff84676104
 #7 [ffff8b40973dbba0] __do_page_fault at ffffffff84d8e750
 #8 [ffff8b40973dbc10] do_page_fault at ffffffff84d8e975
 #9 [ffff8b40973dbc40] page_fault at ffffffff84d8a778
    [exception RIP: jbd2_superblock_csum+58]
    RIP: ffffffffc06f969a  RSP: ffff8b40973dbcf8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff8b4778d39000  RCX: ffff8b40973dbfd8
    RDX: 0000000000000000  RSI: ffff8b4778d39000  RDI: ffff8b4f72e9d800
    RBP: ffff8b40973dbd28   R8: ffff8b40973dbdc8   R9: 0000000000000001
    R10: 0000000000000001  R11: ffff8b47895d3200  R12: 000000000e33f513
    R13: ffff8b4f72e9d800  R14: 0000000000001c11  R15: ffff8b4778d39000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8b40973dbd30] jbd2_write_superblock at ffffffffc06fa61c [jbd2]
#11 [ffff8b40973dbd70] jbd2_mark_journal_empty at ffffffffc06facbd [jbd2]
#12 [ffff8b40973dbda0] jbd2_journal_destroy at ffffffffc06faf6e [jbd2]
#13 [ffff8b40973dbe10] ext4_put_super at ffffffffc0913680 [ext4]
#14 [ffff8b40973dbe50] generic_shutdown_super at ffffffff8485051d
#15 [ffff8b40973dbe70] kill_block_super at ffffffff84850997
#16 [ffff8b40973dbe90] deactivate_locked_super at ffffffff84850cfe
#17 [ffff8b40973dbeb0] deactivate_super at ffffffff84851486
#18 [ffff8b40973dbec8] cleanup_mnt at ffffffff84870b0f
#19 [ffff8b40973dbee0] __cleanup_mnt at ffffffff84870ba2
#20 [ffff8b40973dbef0] task_work_run at ffffffff846c275b
#21 [ffff8b40973dbf30] do_notify_resume at ffffffff8462cc65
#22 [ffff8b40973dbf50] int_signal at ffffffff84d942ef
    RIP: 00007f785a783a07  RSP: 00007ffc82e094e8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00005597e34bc040  RCX: ffffffffffffffff
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 00005597e34c2280
    RBP: 00005597e34c2280   R8: 00005597e34c21f0   R9: 0000000000000000
    R10: 00007ffc82e08920  R11: 0000000000000246  R12: 00007f785b301d78
    R13: 0000000000000000  R14: 00005597e34bc140  R15: 00005597e34bc040
    ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b
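In a backtrace like this one, the faulting location is the symbol on the [exception RIP: ...] line, and the frames printed after it are its callers. The sketch below pulls that symbol out of saved bt output; it assumes the backtrace was written to a text file (for example with bt > bt.txt inside crash), and exception_rip is a hypothetical helper name.

```shell
#!/bin/sh
# exception_rip (hypothetical helper): print the symbol+offset from every
# "[exception RIP: ...]" line in saved crash "bt" output.
exception_rip() {
    sed -n 's/.*\[exception RIP: \([^]]*\)].*/\1/p' "$1"
}
```

For the dump above this yields jbd2_superblock_csum+58, which can then be passed to dis -l inside crash.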

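bt prints the most recent frame first, so #22 [ffff8b40973dbf50] int_signal at ffffffff84d942ef is the oldest frame, on the exit path of the umount system call, and not the point of failure. The fault itself is on the [exception RIP: jbd2_superblock_csum+58] line (RIP ffffffffc06f969a, in the jbd2 module), reached through ext4_put_super -> jbd2_journal_destroy -> jbd2_mark_journal_empty -> jbd2_write_superblock while an ext4 file system was being unmounted. RAX and RDX are both 0, which is consistent with the NULL pointer dereference reported in PANIC.
When I first worked through this dump I followed the address of frame #22, ffffffff84d942ef, so the session below uses that address. The same steps apply to ffffffffc06f969a, provided crash has the debuginfo for the jbd2 module available.
Analysis command: dis -l disassembles the given address and prints the source file and line it maps to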

crash> dis -l ffffffff84d942ef
/usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S: 701
0xffffffff84d942ef <int_signal+18>:     mov    $0xfe0e,%edi

The output maps the address to /usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S: 701
The source at that location can be viewed directly:

crash> cat -n /usr/src/debug/kernel-3.10.0-1160.el7/linux-3.10.0-1160.el7.x86_64/arch/x86/kernel/entry_64.S
# Output trimmed to the relevant lines .....
   695  int_signal:
   696          testl $_TIF_DO_NOTIFY_MASK,%edx
   697          jz 1f
   698          movq %rsp,%rdi          # &ptregs -> arg1
   699          xorl %esi,%esi          # oldset -> arg2
   700          call do_notify_resume
   701  1:      movl $_TIF_WORK_MASK,%edi
   702  int_restore_rest:
   703          RESTORE_REST
   704          DISABLE_INTERRUPTS(CLBR_NONE)
   705          TRACE_IRQS_OFF
   706          jmp int_with_check
   707          CFI_ENDPROC
   708  END(system_call)

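Awkwardly, getting this far did not help me much: line 701 is simply the instruction after the call to do_notify_resume on line 700, that is, a return address, and I cannot yet read much more than that out of this source...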

That is all for now. I will add more cases here as I run into other kernel panics.

References:
https://blog.csdn.net/linuxvfast/article/details/116591523
https://blog.csdn.net/weixin_45030965/article/details/124960224
