System crashes - an introduction to the crash tool

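A craftsman who wants to do good work must first sharpen his tools. This article introduces what the commonly used commands of the crash tool on Linux do and how to use them.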

Background

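crash was developed by engineers at Red Hat and is mainly used for offline analysis of Linux kernel dump files. It integrates gdb and is very powerful: it can show stacks, the dmesg log, and kernel data structures, disassemble code, and more. crash supports the dump file formats produced by several tools, such as kdump, LKCD, netdump, and diskdump, and it can also analyze kernel dumps generated on Xen and KVM virtual machines. crash can debug a live system as well: just run crash without a dump file. On Ubuntu the memory image of the running kernel is exposed at /proc/kcore.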

Debugging a live system

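crash is tightly coupled to the Linux kernel and is updated continuously as the kernel changes. It is backward compatible: a newer crash can analyze dump files from older kernels. If your kernel is too new for your crash to parse, try installing the latest crash.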

Common commands

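The sections below describe how to use the common commands, drawing mainly on crash_whitepaper and the help built into crash. crash_whitepaper covers the motivation for the tool, how to build it, how the commands are grouped and used, and how to add your own commands; it is an excellent reference. I used crash-7.2.6 with gdb-7.6. While working you can run "help command" to see the detailed help for a command; the full command list is in the appendix.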

Help documentation

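When crash loads a kernel dump file, it prints basic system information such as the task that hit the problem (bash, PID 2613), the amount of system memory (7.9 GB), and the architecture (x86_64). The output below shows that this dump comes from a panic triggered through sysrq.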

KERNEL: ../kernel-src/linux-4.19.53/vmlinux
DUMPFILE: crash/201907070732/dump.201907070732 [PARTIAL DUMP]
CPUS: 4
DATE: Sun Jul 7 07:31:34 2019
UPTIME: 00:10:27
LOAD AVERAGE: 0.14, 0.16, 0.12
TASKS: 584
NODENAME: glbian-OptiPlex-990
RELEASE: 4.19.53
VERSION: #1 SMP Sun Jun 23 11:01:25 CST 2019
MACHINE: x86_64 (3292 Mhz)
MEMORY: 7.9 GB
PANIC: "sysrq: SysRq : Trigger a crash"
PID: 2613
COMMAND: "bash"
TASK: ffff8b7df3cdae00 [THREAD_INFO: ffff8b7df3cdae00]
CPU: 2
STATE: TASK_RUNNING (SYSRQ)

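Viewing the stack

A good first step is usually to look at the stack (bt) to see where the system died and decide where to investigate. Here the exception occurred inside the sysrq handler, sysrq_handle_crash.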

crash> bt
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
#0 [ffffa0f442cd7a08] machine_kexec at ffffffff99a69313
#1 [ffffa0f442cd7a68] __crash_kexec at ffffffff99b3e6b9
#2 [ffffa0f442cd7b30] crash_kexec at ffffffff99b3f441
#3 [ffffa0f442cd7b50] oops_end at ffffffff99a32bed
#4 [ffffa0f442cd7b78] no_context at ffffffff99a7997c
#5 [ffffa0f442cd7bd8] __bad_area_nosemaphore at ffffffff99a79d15
#6 [ffffa0f442cd7c20] bad_area at ffffffff99a79f86
#7 [ffffa0f442cd7c48] __do_page_fault at ffffffff99a7a486
#8 [ffffa0f442cd7cc0] do_page_fault at ffffffff99a7a60d
#9 [ffffa0f442cd7cf0] page_fault at ffffffff9a6010ae
[exception RIP: sysrq_handle_crash+22]
RIP: ffffffff9a034066 RSP: ffffa0f442cd7da8 RFLAGS: 00010286
RAX: ffffffff9a034050 RBX: 0000000000000063 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000063
RBP: ffffa0f442cd7da8 R8: 00000000000002f2 R9: 0000000000000007
R10: 0000000000000000 R11: ffffffff9b39c3ed R12: 0000000000000004
R13: 0000000000000000 R14: ffffffff9afa7300 R15: ffff8b7de5af9100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffffa0f442cd7db0] __handle_sysrq at ffffffff9a0347e8
#11 [ffffa0f442cd7de0] write_sysrq_trigger at ffffffff9a034cbf
... ...
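When several backtraces have to be compared, it can help to split each frame line into fields. The following is only a sketch: the regular expression and the `parse_frames` helper are my own and not part of crash, and the sample text is copied from the bt output above.

```python
import re

# Matches a frame line such as "#3 [ffffa0f442cd7b50] oops_end at ffffffff99a32bed".
# FRAME_RE and parse_frames are illustrative names, not part of crash.
FRAME_RE = re.compile(
    r"#(?P<num>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+(?P<ip>[0-9a-f]+)"
)


def parse_frames(bt_text):
    """Return a list of (frame number, stack address, function, text address)."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.search(line)
        if m:  # register dumps and "[exception RIP: ...]" lines are skipped
            frames.append(
                (int(m["num"]), int(m["sp"], 16), m["func"], int(m["ip"], 16))
            )
    return frames


sample = """\
#8 [ffffa0f442cd7cc0] do_page_fault at ffffffff99a7a60d
#9 [ffffa0f442cd7cf0] page_fault at ffffffff9a6010ae
    [exception RIP: sysrq_handle_crash+22]
#10 [ffffa0f442cd7db0] __handle_sysrq at ffffffff9a0347e8
"""

frames = parse_frames(sample)
for num, sp, func, ip in frames:
    print(f"#{num:<2} {func:<16} sp={sp:#x} ip={ip:#x}")
```

The same helper can be run on the output of bt -a saved to a file to get a quick per-CPU list of the functions on each stack.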

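You can also add options so that bt shows the offset within each function (-s), the source file and line of each function (-l), and the raw contents of every frame (-f). That lets you compare against the source and the disassembly and inspect function arguments and local variables.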

crash> bt -slf
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
#0 [ffffa0f442cd7a08] machine_kexec+451 at ffffffff99a69313
/home/glbian/data/kernel-src/linux-4.19.53/arch/x86/kernel/machine_kexec_64.c: 346
ffffa0f442cd7a10: 0000a0f442cd7a50 ffff8b7c40000000
ffffa0f442cd7a20: 0000000024001000 ffff8b7c64001000
ffffa0f442cd7a30: 0000000024000000 a05cedc0dfb99200
ffffa0f442cd7a40: a05cedc0dfb99200 ffffa0f442cd7cf8
ffffa0f442cd7a50: 0000000000000009 ffffa0f442cd7cf8
ffffa0f442cd7a60: ffffa0f442cd7b28 ffffffff99b3e6b9
... ...
#8 [ffffa0f442cd7cc0] do_page_fault+45 at ffffffff99a7a60d
/home/glbian/data/kernel-src/linux-4.19.53/arch/x86/mm/fault.c: 1470
ffffa0f442cd7cc8: ffff8b7e6500d140 0000000000000000
ffffa0f442cd7cd8: 0000000000000000 0000000000000000
ffffa0f442cd7ce8: ffffa0f442cd7cf9 ffffffff9a6010ae
#9 [ffffa0f442cd7cf0] page_fault+30 at ffffffff9a6010ae
/home/glbian/data/kernel-src/linux-4.19.53/arch/x86/entry/entry_64.S: 1181
[exception RIP: sysrq_handle_crash+22]
RIP: ffffffff9a034066 RSP: ffffa0f442cd7da8 RFLAGS: 00010286
RAX: ffffffff9a034050 RBX: 0000000000000063 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000063
RBP: ffffa0f442cd7da8 R8: 00000000000002f2 R9: 0000000000000007
R10: 0000000000000000 R11: ffffffff9b39c3ed R12: 0000000000000004
R13: 0000000000000000 R14: ffffffff9afa7300 R15: ffff8b7de5af9100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
/home/glbian/data/kernel-src/linux-4.19.53/drivers/tty/sysrq.c: 147
ffffa0f442cd7cf8: ffff8b7de5af9100 ffffffff9afa7300
ffffa0f442cd7d08: 0000000000000000 0000000000000004
ffffa0f442cd7d18: ffffa0f442cd7da8 0000000000000063
ffffa0f442cd7d28: ffffffff9b39c3ed 0000000000000000
ffffa0f442cd7d38: 0000000000000007 00000000000002f2
ffffa0f442cd7d48: ffffffff9a034050 0000000000000006
ffffa0f442cd7d58: 0000000000000000 0000000000000096
ffffa0f442cd7d68: 0000000000000063 ffffffffffffffff
ffffa0f442cd7d78: ffffffff9a034066 0000000000000010
ffffa0f442cd7d88: 0000000000010286 ffffa0f442cd7da8
ffffa0f442cd7d98: 0000000000000018 0000000000000000
ffffa0f442cd7da8: ffffa0f442cd7dd8 ffffffff9a0347e8
#10 [ffffa0f442cd7db0] __handle_sysrq+136 at ffffffff9a0347e8
/home/glbian/data/kernel-src/linux-4.19.53/drivers/tty/sysrq.c: 583
ffffa0f442cd7db8: 0000000000000002 fffffffffffffffb
ffffa0f442cd7dc8: ffffa0f442cd7ee8 0000563d45717780
ffffa0f442cd7dd8: ffffa0f442cd7df0 ffffffff9a034cbf
... ...

Use the dis command to disassemble and inspect the code at a given address.
crash> dis -r ffffffff9a6010ae
0xffffffff9a601090 : data32 xchg %ax,%ax
0xffffffff9a601093 : callq 0xffffffff9a601230
0xffffffff9a601098 : mov %rsp,%rdi
0xffffffff9a60109b : mov 0x78(%rsp),%rsi
0xffffffff9a6010a0 : movq $0xffffffffffffffff,0x78(%rsp)
0xffffffff9a6010a9 : callq 0xffffffff99a7a5e0
0xffffffff9a6010ae : jmpq 0xffffffff9a601330
crash> dis -f ffffffff9a6010ae
0xffffffff9a6010ae : jmpq 0xffffffff9a601330
0xffffffff9a6010b3 : nopl (%rax)
0xffffffff9a6010b6 : nopw %cs:0x0(%rax,%rax,1)

Sometimes the stack is corrupted. In that case use -t/-T to dump every kernel text symbol found on the stack, which often turns up a few clues.

crash> bt -t
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
START: machine_kexec at ffffffff99a69313
[ffffa0f442cd7a08] machine_kexec at ffffffff99a69313
[ffffa0f442cd7a68] __crash_kexec at ffffffff99b3e6b9
[ffffa0f442cd7ac0] sysrq_handle_crash at ffffffff9a034050
[ffffa0f442cd7af0] sysrq_handle_crash at ffffffff9a034066
[ffffa0f442cd7b30] crash_kexec at ffffffff99b3f441
[ffffa0f442cd7b38] __die at ffffffff99a33375
[ffffa0f442cd7b50] oops_end at ffffffff99a32bed
[ffffa0f442cd7b78] no_context at ffffffff99a7997c
[ffffa0f442cd7bd8] __bad_area_nosemaphore at ffffffff99a79d15
[ffffa0f442cd7c20] bad_area at ffffffff99a79f86
[ffffa0f442cd7c48] __do_page_fault at ffffffff99a7a486
[ffffa0f442cd7cc0] do_page_fault at ffffffff99a7a60d
[ffffa0f442cd7cf0] page_fault at ffffffff9a6010ae
[ffffa0f442cd7d48] sysrq_handle_crash at ffffffff9a034050
[ffffa0f442cd7d78] sysrq_handle_crash at ffffffff9a034066
[ffffa0f442cd7db0] __handle_sysrq at ffffffff9a0347e8
[ffffa0f442cd7de0] write_sysrq_trigger at ffffffff9a034cbf
[ffffa0f442cd7df8] proc_reg_write at ffffffff99d2a0ee
[ffffa0f442cd7e18] __vfs_write at ffffffff99ca8a0a
[ffffa0f442cd7e40] apparmor_file_permission at ffffffff99e53a0a
[ffffa0f442cd7e50] security_file_permission at ffffffff99e06cf1
[ffffa0f442cd7e78] _cond_resched at ffffffff9a4153f9
[ffffa0f442cd7ea0] vfs_write at ffffffff99ca8d11
[ffffa0f442cd7ed8] ksys_write at ffffffff99ca8fcc
[ffffa0f442cd7f20] __x64_sys_write at ffffffff99ca906a
[ffffa0f442cd7f30] do_syscall_64 at ffffffff99a0428a
[ffffa0f442cd7f50] entry_SYSCALL_64_after_hwframe at ffffffff9a600088
RIP: 00007ff47e1ef154 RSP: 00007ffee9226298 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ff47e1ef154
RDX: 0000000000000002 RSI: 0000563d45717780 RDI: 0000000000000001
RBP: 0000563d45717780 R8: 000000000000000a R9: 0000000000000001
R10: 000000000000000a R11: 0000000000000246 R12: 00007ff47e4cb760
R13: 0000000000000002 R14: 00007ff47e4c72a0 R15: 00007ff47e4c6760
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

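By default bt dumps the context of the faulting task. bt -a shows the stacks of the active tasks on all CPUs, and bt -c shows the stack on a specified CPU.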

crash> bt -c 1
PID: 0 TASK: ffff8b7e64165c00 CPU: 1 COMMAND: "swapper/1"
#0 [fffffe0000034e38] crash_nmi_callback at ffffffff99a5d3d7
#1 [fffffe0000034e48] nmi_handle at ffffffff99a33691
... ...
#12 [ffffa0f440cd7f50] secondary_startup_64 at ffffffff99a000d4

crash> bt -a
PID: 0 TASK: ffffffff9ae13740 CPU: 0 COMMAND: "swapper/0"
... ...
PID: 0 TASK: ffff8b7e64165c00 CPU: 1 COMMAND: "swapper/1"
... ...
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
... ...
PID: 0 TASK: ffff8b7e642c4500 CPU: 3 COMMAND: "swapper/3"
... ...

You can also use the set command to change the task context and then look at the stack of a task on another CPU.

crash> set 1
PID: 1
COMMAND: "systemd"
TASK: ffff8b7e6413c500 [THREAD_INFO: ffff8b7e6413c500]
CPU: 3
STATE: TASK_INTERRUPTIBLE
crash> bt
PID: 1 TASK: ffff8b7e6413c500 CPU: 3 COMMAND: "systemd"
#0 [ffffa0f440c6fce0] __schedule at ffffffff9a414ba7
#1 [ffffa0f440c6fd80] schedule at ffffffff9a41519c
#2 [ffffa0f440c6fd90] schedule_hrtimeout_range_clock at ffffffff9a419691
#3 [ffffa0f440c6fe20] schedule_hrtimeout_range at ffffffff9a4196b3
#4 [ffffa0f440c6fe30] ep_poll at ffffffff99cf8941
#5 [ffffa0f440c6fee0] do_epoll_wait at ffffffff99cf8ae0
#6 [ffffa0f440c6ff20] __x64_sys_epoll_wait at ffffffff99cf8b0e
#7 [ffffa0f440c6ff30] do_syscall_64 at ffffffff99a0428a
#8 [ffffa0f440c6ff50] entry_SYSCALL_64_after_hwframe at ffffffff9a600088
RIP: 00007ffa791c6bb7 RSP: 00007ffc1c00b9d0 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ffa791c6bb7
RDX: 00000000000000eb RSI: 00007ffc1c00ba10 RDI: 0000000000000004
RBP: 00007ffc1c00ba10 R8: 0000000000000000 R9: 7465677261742e79
R10: 00000000ffffffff R11: 0000000000000293 R12: 00000000000000eb
R13: 00000000ffffffff R14: 00007ffc1c00ba10 R15: 0000000000000001
ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b

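System log

The log command shows the kernel log. "log -a" dumps the audit records still held in kernel audit buffers that have not yet been copied out to user space.
The output can also be redirected to a file (log > logfile).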
crash> log

... ...
[ 1610.759133] sysrq: SysRq : Trigger a crash
[ 1610.759147] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 1610.759150] PGD 0 P4D 0
[ 1610.759154] Oops: 0002 [#1] SMP PTI
[ 1610.759159] CPU: 2 PID: 2613 Comm: bash Kdump: loaded Not tainted 4.19.53 #1
[ 1610.759161] Hardware name: Dell Inc. OptiPlex 990/0RVG2C, BIOS A13 04/02/2012
[ 1610.759167] RIP: 0010:sysrq_handle_crash+0x16/0x20
[ 1610.759170] Code: e8 9f fb ff ff e9 c0 fe ff ff 90 90 90 90 90 90 90 90 90 90 66 66 66 66 90 55 48 89 e5 c7 05 85 10 36 01 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 c7 05 40 fa e2 00
[ 1610.759173] RSP: 0018:ffffa0f442cd7da8 EFLAGS: 00010286
[ 1610.759176] RAX: ffffffff9a034050 RBX: 0000000000000063 RCX: 0000000000000006
[ 1610.759178] RDX: 0000000000000000 RSI: 0000000000000096 RDI: 0000000000000063
[ 1610.759180] RBP: ffffa0f442cd7da8 R08: 00000000000002f2 R09: 0000000000000007
[ 1610.759182] R10: 0000000000000000 R11: ffffffff9b39c3ed R12: 0000000000000004
[ 1610.759184] R13: 0000000000000000 R14: ffffffff9afa7300 R15: ffff8b7de5af9100
[ 1610.759186] FS: 00007ff47eb0a740(0000) GS:ffff8b7e65880000(0000) knlGS:0000000000000000
[ 1610.759189] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1610.759191] CR2: 0000000000000000 CR3: 0000000205db0003 CR4: 00000000000606e0
[ 1610.759193] Call Trace:
[ 1610.759199] __handle_sysrq+0x88/0x140
[ 1610.759203] write_sysrq_trigger+0x2f/0x40
[ 1610.759208] proc_reg_write+0x3e/0x60
[ 1610.759212] __vfs_write+0x3a/0x190
[ 1610.759216] ? apparmor_file_permission+0x1a/0x20
[ 1610.759220] ? security_file_permission+0x31/0xc0
[ 1610.759224] ? _cond_resched+0x19/0x40
[ 1610.759226] vfs_write+0xb1/0x1a0
[ 1610.759229] ksys_write+0x5c/0xe0
[ 1610.759232] __x64_sys_write+0x1a/0x20
[ 1610.759237] do_syscall_64+0x5a/0x120
[ 1610.759241] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1610.759245] RIP: 0033:0x7ff47e1ef154
[ 1610.759247] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[ 1610.759249] RSP: 002b:00007ffee9226298 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1610.759252] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007ff47e1ef154
[ 1610.759254] RDX: 0000000000000002 RSI: 0000563d45717780 RDI: 0000000000000001
[ 1610.759256] RBP: 0000563d45717780 R08: 000000000000000a R09: 0000000000000001
[ 1610.759258] R10: 000000000000000a R11: 0000000000000246 R12: 00007ff47e4cb760
[ 1610.759260] R13: 0000000000000002 R14: 00007ff47e4c72a0 R15: 00007ff47e4c6760
[ 1610.759263] Modules linked in: nls_iso8859_1 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic pcbc aesni_intel snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep aes_x86_64 snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi input_leds crypto_simd cryptd snd_seq snd_seq_device snd_timer dcdbas snd glue_helper intel_cstate intel_rapl_perf lpc_ich serio_raw soundcore sch_fq_codel mei_me mei mac_hid parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage i915 kvmgt vfio_mdev mdev vfio_iommu_type1 vfio kvm irqbypass i2c_algo_bit cec rc_core drm_kms_helper psmouse syscopyarea sysfillrect video sysimgblt fb_sys_fops ahci drm libahci e1000e
[ 1610.759320] CR2: 0000000000000000
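The log and bt describe the same faulting instruction in two notations: the log prints sysrq_handle_crash+0x16/0x20 (hex offset and function size), while bt prints sysrq_handle_crash+22 (decimal offset). A small sketch of the conversion follows. The helper name `parse_symbol_offset` is my own, and the function start address is the one shown for sysrq_handle_crash in the bt -t output.

```python
import re


def parse_symbol_offset(text):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size).

    Illustrative helper, not part of crash or the kernel.
    """
    m = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    if m is None:
        raise ValueError(f"unexpected format: {text!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


symbol, offset, size = parse_symbol_offset("sysrq_handle_crash+0x16/0x20")

# Start of sysrq_handle_crash as printed by bt -t for this dump.
func_start = 0xFFFFFFFF9A034050
fault_ip = func_start + offset

print(f"{symbol}: offset {offset} of {size} bytes")
print(f"faulting address: {fault_ip:#x}")
```

The computed address is the RIP value shown in the exception frame of the bt output, which is a quick way to confirm that the log and the backtrace describe the same fault.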

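Viewing data structures

struct and union display structures and unions and are used the same way. Below are some examples of printing with struct.
Given an address, the contents at that address are interpreted and printed as a task_struct; without an address, the structure definition and size are shown.
1 Print a task_struct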

crash> task_struct ffff8b7df3cdae00 -x
struct task_struct {
thread_info = {
flags = 0x80000000,
status = 0x0
},
state = 0x0,
stack = 0xffffa0f442cd4000,
usage = {
counter = 0x2
},
... ...

2 Print the definition and size of task_struct

struct task_struct {
[0x0] struct thread_info thread_info;
[0x10] volatile long state;
[0x18] void *stack;
... ...
[0x1288] void *security;
[0x12c0] struct thread_struct thread;
}
SIZE: 0x23c0

3 View a member

crash> task_struct.stack_refcount ffff8b7df3cdae00 -xo
struct task_struct {
[ffff8b7df3cdc080] atomic_t stack_refcount;
}
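With -o and an address, the bracketed value is the address of the member itself rather than its offset. Subtracting the task_struct base address gives the member offset, which can then be checked against the structure layout. A minimal sketch of that arithmetic, using the two addresses and the SIZE value from the output above (the variable names are mine):

```python
# Addresses copied from the crash output for PID 2613.
task_base = 0xFFFF8B7DF3CDAE00        # address of the task_struct
member_addr = 0xFFFF8B7DF3CDC080      # address printed for stack_refcount

offset = member_addr - task_base
struct_size = 0x23C0                  # SIZE reported for task_struct

print(f"stack_refcount offset: {offset:#x}")
print(f"inside the structure: {0 <= offset < struct_size}")
```

The result sits just below the 0x1288 offset listed for the security member in the layout above, which is consistent with the two members being adjacent.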

4 View a pointer member

crash> task_struct.mm ffff8b7df3cdae00
mm = 0xffff8b7e5af06600
crash> task_struct.mm ffff8b7df3cdae00 -p
struct mm_struct *mm = 0xffff8b7e5af06600
-> {
{
mmap = 0xffff8b7dec0520c8,
mm_rb = {
rb_node = 0xffff8b7dec003b78
},
vmacache_seqnum = 17,
get_unmapped_area = 0xffffffff99a35760,

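You can also display array contents, per-cpu variables, and more; see the help for details.

Viewing and searching memory

Besides printing data structures, you sometimes need to view and search raw memory, for example to check whether a specified data pattern is present.
1 View the kernel version string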

crash> rd -a linux_banner
ffffffff9aa00100: Linux version 4.19.53 (glbian@glbian-OptiPlex-990) (gcc vers
ffffffff9aa0013c: ion 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #1 SMP Sun Jun 23
ffffffff9aa00178: 11:01:25 CST 2019

2 View memory contents

crash> rd ffffa0f442cd7a08 32
ffffa0f442cd7a08: ffffffff99a69313 0000a0f442cd7a50 ........Pz.B....
ffffa0f442cd7a18: ffff8b7c40000000 0000000024001000 ...@|......$....
ffffa0f442cd7a28: ffff8b7c64001000 0000000024000000 ...d|......$....
ffffa0f442cd7a38: a05cedc0dfb99200 a05cedc0dfb99200 ......\.......\.
ffffa0f442cd7a48: ffffa0f442cd7cf8 0000000000000009 .|.B............
ffffa0f442cd7a58: ffffa0f442cd7cf8 ffffa0f442cd7b28 .|.B....({.B....
ffffa0f442cd7a68: ffffffff99b3e6b9 ffff8b7de5af9100 ............}...
ffffa0f442cd7a78: ffffffff9afa7300 0000000000000000 .s..............
ffffa0f442cd7a88: 0000000000000004 ffffa0f442cd7da8 .........}.B....
ffffa0f442cd7a98: 0000000000000063 ffffffff9b39c3ed c.........9.....
ffffa0f442cd7aa8: 0000000000000000 0000000000000007 ................
ffffa0f442cd7ab8: 00000000000002f2 ffffffff9a034050 ........P@......
ffffa0f442cd7ac8: 0000000000000006 0000000000000000 ................
ffffa0f442cd7ad8: 0000000000000096 0000000000000063 ........c.......
ffffa0f442cd7ae8: ffffffffffffffff ffffffff9a034066 ........f@......
ffffa0f442cd7af8: 0000000000000010 0000000000010286 ................
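The right-hand column of the rd output is the same 16 bytes rendered as ASCII, with unprintable bytes shown as dots. Because x86_64 is little-endian, the characters appear in the reverse order of the hex digits in each 64-bit word. Here is a short sketch that reproduces the column; `rd_ascii` is a name I made up, and the printable range used is an assumption about how crash filters characters.

```python
import struct


def rd_ascii(*words):
    """Render 64-bit words the way the ASCII column of rd appears.

    Assumes a little-endian machine and that only bytes 0x20..0x7e
    are treated as printable (illustrative, not taken from crash source).
    """
    raw = b"".join(struct.pack("<Q", w) for w in words)
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# First row of the rd output above.
print(rd_ascii(0xFFFFFFFF99A69313, 0x0000A0F442CD7A50))
# Row at ffffa0f442cd7a98.
print(rd_ascii(0x0000000000000063, 0xFFFFFFFF9B39C3ED))
```

For example, the word 0000a0f442cd7a50 is stored as the bytes 50 7a cd 42 f4 a0 00 00, and 0x50, 0x7a, and 0x42 are the characters P, z, and B.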

3 Display memory with symbols

crash> rd ffffa0f442cd7a08 32 -s
ffffa0f442cd7a08: machine_kexec+451 0000a0f442cd7a50
ffffa0f442cd7a18: ffff8b7c40000000 0000000024001000
ffffa0f442cd7a28: ffff8b7c64001000 0000000024000000
ffffa0f442cd7a38: a05cedc0dfb99200 a05cedc0dfb99200
ffffa0f442cd7a48: ffffa0f442cd7cf8 0000000000000009
ffffa0f442cd7a58: ffffa0f442cd7cf8 ffffa0f442cd7b28
ffffa0f442cd7a68: __crash_kexec+105 ffff8b7de5af9100
ffffa0f442cd7a78: sysrq_crash_op 0000000000000000
ffffa0f442cd7a88: 0000000000000004 ffffa0f442cd7da8
ffffa0f442cd7a98: 0000000000000063 text.45672+13
ffffa0f442cd7aa8: 0000000000000000 0000000000000007
ffffa0f442cd7ab8: 00000000000002f2 sysrq_handle_crash
ffffa0f442cd7ac8: 0000000000000006 0000000000000000
ffffa0f442cd7ad8: 0000000000000096 0000000000000063
ffffa0f442cd7ae8: ffffffffffffffff sysrq_handle_crash+22
ffffa0f442cd7af8: 0000000000000010 0000000000010286

4 View a specified memory range

crash> rd ffffa0f442cd7a08 -e ffffa0f442cd7a68
ffffa0f442cd7a08: ffffffff99a69313 0000a0f442cd7a50 ........Pz.B....
ffffa0f442cd7a18: ffff8b7c40000000 0000000024001000 ...@|......$....
ffffa0f442cd7a28: ffff8b7c64001000 0000000024000000 ...d|......$....
ffffa0f442cd7a38: a05cedc0dfb99200 a05cedc0dfb99200 ......\.......\.
ffffa0f442cd7a48: ffffa0f442cd7cf8 0000000000000009 .|.B............
ffffa0f442cd7a58: ffffa0f442cd7cf8 ffffa0f442cd7b28 .|.B....({.B....

5 Search a memory range for a value

crash> search -s ffffa0f442cd7a08 -e ffffa0f442cd7db0 ffffffff9b39c3ed
ffffa0f442cd7aa0: ffffffff9b39c3ed
ffffa0f442cd7d28: ffffffff9b39c3ed

6 Search with a mask

crash> search -p babe0000 -m ffff
1c4cc6530: babec685
21f7d35b8: babe4550
crash>
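The -m option tells search to ignore the bits that are set in the mask, so "search -p babe0000 -m ffff" matches any value in physical memory whose bits above the low 16 equal babe. The comparison can be sketched as follows; `masked_match` is an illustrative name, and this models only the comparison, not how crash walks memory.

```python
def masked_match(value, pattern, mask):
    """True when value equals pattern on every bit NOT set in mask."""
    return (value | mask) == (pattern | mask)


pattern = 0xBABE0000
mask = 0xFFFF  # ignore the low 16 bits

# The two hits reported by crash above, plus one value that should not match.
for value in (0xBABEC685, 0xBABE4550, 0xBABF0000):
    print(f"{value:#x}: {masked_match(value, pattern, mask)}")
```

Forcing the masked bits to 1 on both sides before comparing is one simple way to express "ignore these bits"; clearing them on both sides with `& ~mask` would work equally well.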

Viewing task status

1 View the status of all tasks

crash> ps
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff9ae13740 RU 0.0 0 0 [swapper/0]
0 0 1 ffff8b7e64165c00 RU 0.0 0 0 [swapper/1]
0 0 2 ffff8b7e64162e00 RU 0.0 0 0 [swapper/2]
0 0 3 ffff8b7e642c4500 RU 0.0 0 0 [swapper/3]
1 0 3 ffff8b7e6413c500 IN 0.1 225916 9716 systemd
2 0 2 ffff8b7e64138000 IN 0.0 0 0 [kthreadd]

2 View the parent chain of a task

crash> ps -p 2613
PID: 0 TASK: ffffffff9ae13740 CPU: 0 COMMAND: "swapper/0"
PID: 1 TASK: ffff8b7e6413c500 CPU: 3 COMMAND: "systemd"
PID: 1081 TASK: ffff8b7e5dc81700 CPU: 1 COMMAND: "gdm3"
PID: 2114 TASK: ffff8b7e584f2e00 CPU: 0 COMMAND: "gdm-session-wor"
PID: 2136 TASK: ffff8b7e63cc4500 CPU: 1 COMMAND: "gdm-x-session"
PID: 2149 TASK: ffff8b7e5dfaae00 CPU: 0 COMMAND: "gnome-session-b"
PID: 2254 TASK: ffff8b7e5e04dc00 CPU: 0 COMMAND: "gnome-shell"
PID: 2582 TASK: ffff8b7dec3bae00 CPU: 0 COMMAND: "terminator"
PID: 2592 TASK: ffff8b7dec05ae00 CPU: 1 COMMAND: "bash"
PID: 2611 TASK: ffff8b7df3f8ae00 CPU: 0 COMMAND: "sudo"
PID: 2612 TASK: ffff8b7dec3b9700 CPU: 3 COMMAND: "su"
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"

3 View child tasks

crash> ps -c 2582
PID: 2582 TASK: ffff8b7dec3bae00 CPU: 0 COMMAND: "terminator"
PID: 2592 TASK: ffff8b7dec05ae00 CPU: 1 COMMAND: "bash"
PID: 2600 TASK: ffff8b7df3f88000 CPU: 0 COMMAND: "bash"
PID: 2787 TASK: ffff8b7df9f80000 CPU: 3 COMMAND: "bash"

4 View task run time

crash> ps -t 2613
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
RUN TIME: 00:00:00
START TIME: 1296209749767
UTIME: 36000000
STIME: 16000000

5 View active tasks

crash> ps -A
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff9ae13740 RU 0.0 0 0 [swapper/0]
0 0 1 ffff8b7e64165c00 RU 0.0 0 0 [swapper/1]
0 0 3 ffff8b7e642c4500 RU 0.0 0 0 [swapper/3]
2613 2612 2 ffff8b7df3cdae00 RU 0.0 28708 4352 bash

6 View kernel threads

crash> ps -k
PID PPID CPU TASK ST %MEM VSZ RSS COMM
0 0 0 ffffffff9ae13740 RU 0.0 0 0 [swapper/0]
0 0 1 ffff8b7e64165c00 RU 0.0 0 0 [swapper/1]
0 0 2 ffff8b7e64162e00 RU 0.0 0 0 [swapper/2]
0 0 3 ffff8b7e642c4500 RU 0.0 0 0 [swapper/3]
2 0 2 ffff8b7e64138000 IN 0.0 0 0 [kthreadd]

7 View user-space tasks

crash> ps -u
PID PPID CPU TASK ST %MEM VSZ RSS COMM
1 0 3 ffff8b7e6413c500 IN 0.1 225916 9716 systemd
298 1 3 ffff8b7e5879c500 IN 0.4 126508 38028 systemd-journal
318 1 0 ffff8b7e584f5c00 IN 0.1 48004 6360 systemd-udevd
822 1 2 ffff8b7e59c71700 IN 0.1 70756 6176 systemd-resolve
824 1 2 ffff8b7e586e5c00 IN 0.1 146108 5540 systemd-timesyn
834 1 3 ffff8b7e63881700 IN 0.1 146108 5540 sd-resolve
863 1 3 ffff8b7e5d790000 IN 0.1 51612 6112 dbus-daemon
864 1 1 ffff8b7e5d794500 IN 0.1 427264 9404 ModemManager

8 View last-run timestamps

crash> ps -l
[1610759003323] [IN] PID: 2582 TASK: ffff8b7dec3bae00 CPU: 0 COMMAND: "terminator"
[1610758998404] [ID] PID: 211 TASK: ffff8b7e585aae00 CPU: 3 COMMAND: "kworker/u32:5"
[1610758938747] [RU] PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
[1610758009873] [IN] PID: 2587 TASK: ffff8b7e06cd5c00 CPU: 2 COMMAND: "gdbus"
crash> ps -m
[0 00:00:00.000] [IN] PID: 2582 TASK: ffff8b7dec3bae00 CPU: 0 COMMAND: "terminator"
[0 00:00:00.000] [ID] PID: 211 TASK: ffff8b7e585aae00 CPU: 3 COMMAND: "kworker/u32:5"
[0 00:00:00.000] [RU] PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
[0 00:00:00.000] [IN] PID: 2587 TASK: ffff8b7e06cd5c00 CPU: 2 COMMAND: "gdbus"
[0 00:00:00.001] [IN] PID: 2138 TASK: ffff8b7e26801700 CPU: 0 COMMAND: "Xorg"
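The raw values printed by ps -l are easier to read next to the kernel log if both are put on the same scale. The sketch below assumes that the ps -l timestamps are nanoseconds on the same clock as the log timestamps; the numbers in this dump are consistent with that, but it is my assumption, and `log_time_to_ns` is a helper I made up.

```python
def log_time_to_ns(stamp):
    """Convert a log timestamp such as '1610.759133' (seconds.microseconds) to ns."""
    seconds, micros = stamp.split(".")
    return int(seconds) * 10**9 + int(micros) * 1000


# bash, PID 2613, from the ps -l output (assumed to be nanoseconds).
last_run_ns = 1610758938747
# Timestamp of the "sysrq: SysRq : Trigger a crash" line in the log.
panic_ns = log_time_to_ns("1610.759133")

gap_ns = panic_ns - last_run_ns
print(f"bash last ran at {last_run_ns / 1e9:.6f} s")
print(f"gap to the sysrq message: {gap_ns / 1000:.0f} us")
```

Under that assumption, bash was last scheduled well under a millisecond before the sysrq message was logged, which fits the ps -m output showing an elapsed time of 00:00:00.000 for it.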

9 View task resource limits

crash> ps -r 2613
PID: 2613 TASK: ffff8b7df3cdae00 CPU: 2 COMMAND: "bash"
RLIMIT CURRENT MAXIMUM
CPU (unlimited) (unlimited)
FSIZE (unlimited) (unlimited)
DATA (unlimited) (unlimited)
STACK 8388608 (unlimited)
CORE 0 (unlimited)
RSS (unlimited) (unlimited)
NPROC 30393 30393
NOFILE 1024 1048576
MEMLOCK 16777216 16777216
AS (unlimited) (unlimited)
LOCKS (unlimited) (unlimited)
SIGPENDING 30393 30393
MSGQUEUE 819200 819200
NICE 0 0
RTPRIO 0 0
RTTIME (unlimited) (unlimited)

Context switching

Some commands, such as bt, depend on the current task context. Use the set command to switch context.
1 Switch to a specified task

crash> set ffff8b7e6413c500
PID: 1
COMMAND: "systemd"
TASK: ffff8b7e6413c500 [THREAD_INFO: ffff8b7e6413c500]
CPU: 3
STATE: TASK_INTERRUPTIBLE
crash> bt
PID: 1 TASK: ffff8b7e6413c500 CPU: 3 COMMAND: "systemd"
#0 [ffffa0f440c6fce0] __schedule at ffffffff9a414ba7
#1 [ffffa0f440c6fd80] schedule at ffffffff9a41519c
#2 [ffffa0f440c6fd90] schedule_hrtimeout_range_clock at ffffffff9a419691
#3 [ffffa0f440c6fe20] schedule_hrtimeout_range at ffffffff9a4196b3
#4 [ffffa0f440c6fe30] ep_poll at ffffffff99cf8941
#5 [ffffa0f440c6fee0] do_epoll_wait at ffffffff99cf8ae0
#6 [ffffa0f440c6ff20] __x64_sys_epoll_wait at ffffffff99cf8b0e
#7 [ffffa0f440c6ff30] do_syscall_64 at ffffffff99a0428a
#8 [ffffa0f440c6ff50] entry_SYSCALL_64_after_hwframe at ffffffff9a600088
RIP: 00007ffa791c6bb7 RSP: 00007ffc1c00b9d0 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ffa791c6bb7
RDX: 00000000000000eb RSI: 00007ffc1c00ba10 RDI: 0000000000000004
RBP: 00007ffc1c00ba10 R8: 0000000000000000 R9: 7465677261742e79
R10: 00000000ffffffff R11: 0000000000000293 R12: 00000000000000eb
R13: 00000000ffffffff R14: 00007ffc1c00ba10 R15: 0000000000000001
ORIG_RAX: 00000000000000e8 CS: 0033 SS: 002b

2 Switch back to the panic task

crash> set -p
PID: 2613
COMMAND: "bash"
TASK: ffff8b7df3cdae00 [THREAD_INFO: ffff8b7df3cdae00]
CPU: 2
STATE: TASK_RUNNING (SYSRQ)

Loading module symbols

1 View the currently loaded modules

crash> mod
MODULE NAME SIZE OBJECT FILE
ffffffffc019d0c0 vfio_iommu_type1 24576 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01a4440 uas 24576 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01b0b40 rc_core 45056 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01e76c0 e1000e 249856 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01fcbc0 usbhid 49152 (not loaded) [CONFIG_KALLSYMS]
ffffffffc0207580 libahci 32768 (not loaded) [CONFIG_KALLSYMS]

2 Load symbols for all modules

crash> mod -S
MODULE NAME SIZE OBJECT FILE
ffffffffc019d0c0 vfio_iommu_type1 24576 /lib/modules/4.19.53/kernel/drivers/vfio/vfio_iommu_type1.ko
ffffffffc01a4440 uas 24576 /lib/modules/4.19.53/kernel/drivers/usb/storage/uas.ko
ffffffffc01b0b40 rc_core 45056 /lib/modules/4.19.53/kernel/drivers/media/rc/rc-core.ko
ffffffffc01e76c0 e1000e 249856 /lib/modules/4.19.53/kernel/drivers/net/ethernet/intel/e1000e/e1000e.ko
ffffffffc01fcbc0 usbhid 49152 /lib/modules/4.19.53/kernel/drivers/hid/usbhid/usbhid.ko

3 Load symbols for a specified module

crash> mod -s rc_core /lib/modules/4.19.53/kernel/drivers/media/rc/rc-core.ko
MODULE NAME SIZE OBJECT FILE
ffffffffc01b0b40 rc_core 45056 /lib/modules/4.19.53/kernel/drivers/media/rc/rc-core.ko
crash> mod
MODULE NAME SIZE OBJECT FILE
ffffffffc019d0c0 vfio_iommu_type1 24576 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01a4440 uas 24576 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01b0b40 rc_core 45056 /lib/modules/4.19.53/kernel/drivers/media/rc/rc-core.ko
ffffffffc01e76c0 e1000e 249856 (not loaded) [CONFIG_KALLSYMS]
ffffffffc01fcbc0 usbhid 49152 (not loaded) [CONFIG_KALLSYMS]

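Other commands

There are many more commands aimed at specific kernel subsystems, such as kmem, vm, tree, list, and pte. See the command list in the appendix; I will study them as I come to need them.

Command extensions

crash also lets users add their own debugging commands. You can add a new command directly to the crash source, but the more common way is to build a shared library and load it dynamically with extend. The help contains a simple example: create a new file test.c in the crash source directory, copy the example code into it, and compile it. The compile command has this form:
gcc -nostartfiles -shared -rdynamic -o echo.so echo.c -fPIC -D<machine-type> $(TARGET_CFLAGS)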

crash> sys
KERNEL: ../../kernel-src/linux-4.19.53/vmlinux
DUMPFILE: 201907070732/dump.201907070732 [PARTIAL DUMP]
CPUS: 4
DATE: Sun Jul 7 07:31:34 2019
UPTIME: 00:10:27
LOAD AVERAGE: 0.14, 0.16, 0.12
TASKS: 584
NODENAME: glbian-OptiPlex-990
RELEASE: 4.19.53
VERSION: #1 SMP Sun Jun 23 11:01:25 CST 2019
MACHINE: x86_64 (3292 Mhz)
MEMORY: 7.9 GB
PANIC: "sysrq: SysRq : Trigger a crash"

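Use the sys command to check the machine architecture. On my machine the machine-type is x86_64, so the compile command is:
gcc -shared -rdynamic -o test.so test.c -fPIC -Dx86_64 -D_FILE_OFFSET_BITS=64
This produces test.so, which can be loaded directly with extend. After it loads, the help menu lists a new echo command, and you can develop your own commands starting from the echo example.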

crash> extend ../../src/crash-7.2.6/test.so
../../src/crash-7.2.6/test.so: shared object loaded
crash> extend
SHARED OBJECT COMMANDS
../../src/crash-7.2.6/test.so echo
crash> help
* extend mach runq union
alias files mod search vm
ascii foreach mount set vtop
bpf fuser net sig waitq
bt gdb p struct whatis
btop help ps swap wr
dev ipcs pte sym q
dis irq ptob sys
echo kmem ptov task
eval list rd timer
exit log repeat tree

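Conclusion

System crashes are usually very tricky problems. Solving them takes solid knowledge of the kernel and the relevant subsystems, combined with analysis in crash. Above all it takes experience built up through practice: real knowledge comes from practice.

Appendix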

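crash command list
Command Function
* Pointer shortcut
alias Command aliases
ascii ASCII translation and chart
bpf eBPF (extended Berkeley Packet Filter)
bt Stack backtrace
btop Convert a byte address to a page number
dev Device data
dis Disassemble
eval Calculator
exit Exit
extend Extend the command set
files Open files
foreach Run a command across multiple tasks
fuser File users
gdb Run a gdb command
help Help
ipcs System V IPC facilities
irq IRQ data
kmem Kernel memory
list Linked lists
log System message buffer
mach Machine-specific information
mod Load module symbols
mount Mounted filesystem data
net Network command
p Print data structures and variables
ps Process status information
pte View page table entries
ptob Convert a page number to a byte address
ptov Physical-to-virtual address translation
rd Read memory
repeat Repeat a command
runq Tasks on the run queues
search Search memory
set Set the task context and crash internal variables
sig Task signal information
struct Structure contents
swap Swap information
sym Translate between symbols and virtual addresses
sys System information
task task_struct and thread_info contents
timer Timer queue
tree Radix trees and red-black trees
union Union contents
vm Virtual memory
vtop Virtual-to-physical address translation
waitq Tasks on a wait queue
whatis Look up a symbol or type in the symbol table
wr Write memory
q Exit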