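Take the simplest crash that the system itself provides as an example: echo c > /proc/sysrq-trigger.
Once you have the crash dump, what you usually want to see first is the error type, plus the registers and backtrace at the moment of the fault. In the crash utility, log | tail -200 prints the last 200 lines of the kernel log:
[2207.597488:0] Unable to handle kernel NULL pointer dereference at virtual address 00000000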
[2207.605719:0] pgd = ddf30000
[2207.608588:0] [00000000] *pgd=00000000
[2207.612339:0] Internal error: Oops: 805 [#1] PREEMPT SMP ARM
[2207.617975:0] Modules linked in:
[2207.621205:0] CPU: 0 Not tainted (3.4.0-gc37fe8c-dirty #651)
[2207.627196:0] PC is at sysrq_handle_crash+0x38/0x48
[2207.632059:0] LR is at _raw_spin_unlock_irqrestore+0x20/0x40
[2207.637699:0] pc : [
[2207.637704:0] sp : e7c61ec8 ip : e7c61e98 fp : e7c61ed4
[2207.649487:0] r10: e7c61f70 r9 : e7c60000 r8 : 00000000
[2207.654865:0] r7 : 60000013 r6 : 00000063 r5 : 00000004 r4 : c079fc74
[2207.661539:0] r3 : 00000000 r2 : 00000001 r1 : 20000093 r0 : 00000001
[2207.668215:0] Flags: nZCv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
[2207.675582:0] Control: 10c53c7d Table: 9ff3004a DAC: 00000015
[2207.681477:0]
[2208.257865:0] Process sh (pid: 2309, stack limit = 0xe7c602f0)
[2208.351387:0] Backtrace:
[2208.354019:0] [
[2208.363293:0] [
[2208.372647:0] r8:b76e780c r7:00000002 r6:e331c5c0 r5:c01ed8e4 r4:00000002
[2208.379373:0] r3:e7c61f70
[2208.382188:0] [
[2208.391455:0] r4:ed98b5e0 r3:e7c61f70
[2208.395221:0] [
[2208.403715:0] [
[2208.411772:0] r8:00000002 r7:00000000 r6:00000000 r5:b76e780c r4:e331c5c0
[2208.418695:0] [
[2208.427185:0] r8:c000e1e8 r7:00000004 r6:00000001 r5:00000002 r4:00000003
[2208.434101:0] Code: 0a000000 e12fff33 e3a03000 e3a02001 (e5c32000)
[2208.440344:0] Enter crash kexec !!
[2208.443747:1] CPU 1 will stop doing anything useful since another CPU has crashed
[2208.451905:0] Loading crashdump kernel...
[2208.455900:0] Software reset on panic!
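When the log is long, it can help to pull out just the fields that matter. The following is a minimal Python sketch, not part of the crash utility; the pattern names and the `summarize_oops` helper are hypothetical, and the sample lines are copied from the log above with spacing normalized.

```python
import re

# Lines copied from the oops log above (spacing normalized).
OOPS_LOG = """\
[2207.597488:0] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[2207.612339:0] Internal error: Oops: 805 [#1] PREEMPT SMP ARM
[2207.621205:0] CPU: 0 Not tainted (3.4.0-gc37fe8c-dirty #651)
[2207.627196:0] PC is at sysrq_handle_crash+0x38/0x48
[2207.632059:0] LR is at _raw_spin_unlock_irqrestore+0x20/0x40
[2208.257865:0] Process sh (pid: 2309, stack limit = 0xe7c602f0)
"""

# Hypothetical field names; each regex targets one line shape seen above.
PATTERNS = {
    "fault_address": re.compile(r"at\s*virtual address ([0-9a-f]+)"),
    "cpu": re.compile(r"\bCPU: (\d+)"),
    "pc": re.compile(r"PC is at (\S+)"),
    "lr": re.compile(r"LR is at (\S+)"),
    "process": re.compile(r"Process (\S+) \(pid: (\d+)"),
}


def summarize_oops(text):
    """Return the first match of each pattern found in the log text."""
    summary = {}
    for line in text.splitlines():
        for key, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match and key not in summary:
                groups = match.groups()
                summary[key] = groups[0] if len(groups) == 1 else groups
    return summary


summary = summarize_oops(OOPS_LOG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The summary gives the three facts used in the rest of this walkthrough: the fault address (0), the symbol the PC was in, and the process that was running.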
The tasks that were running on each CPU are the ones ps marks with >:
crash> ps | grep ">"
> 1904 1384 1 e3355000 RU 2.9 663340 23740 MediaScannerSer
> 2309 1394 0 d9742c00 RU 0.1 820 480 sh
Because this is a multi-core system, more than one task was running. Which one crashed?
The log above already tells us:
[2207.621205:0] CPU: 0 Not tainted (3.4.0-gc37fe8c-dirty #651)
[2208.257865:0] Process sh (pid: 2309, stack limit = 0xe7c602f0)
The set command gives the same answer:
crash> set 2309
PID: 2309
COMMAND: "sh"
TASK: d9742c00 [THREAD_INFO: e7c60000]
CPU: 0
STATE: TASK_RUNNING (PANIC)
crash> set 1904
PID: 1904
COMMAND: "MediaScannerSer"
TASK: e3355000 [THREAD_INFO: e0422000]
CPU: 1
STATE: TASK_RUNNING (ACTIVE)
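The same lookup can be scripted against saved ps output. This is only a sketch: `active_tasks_by_cpu` is a hypothetical helper, and it assumes the column order PID, PPID, CPU, TASK, ST, %MEM, VSZ, RSS, COMM, which matches the two lines shown earlier.

```python
# Active-task lines copied from the ps output above.
PS_ACTIVE = """\
> 1904 1384 1 e3355000 RU 2.9 663340 23740 MediaScannerSer
> 2309 1394 0 d9742c00 RU 0.1 820 480 sh
"""


def active_tasks_by_cpu(ps_output):
    """Map CPU number -> (pid, command) for lines marked active with '>'."""
    tasks = {}
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 10 or fields[0] != ">":
            continue
        # Assumed columns: > PID PPID CPU TASK ST %MEM VSZ RSS COMM
        pid = int(fields[1])
        cpu = int(fields[3])
        comm = " ".join(fields[9:])
        tasks[cpu] = (pid, comm)
    return tasks


tasks = active_tasks_by_cpu(PS_ACTIVE)
crashed_cpu = 0  # from the "CPU: 0" line in the oops log
print(tasks[crashed_cpu])
```

With the crashing CPU taken from the oops log, the lookup lands on pid 2309 (sh), the task that set reports as PANIC.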
[2207.621205:0] CPU: 0 Not tainted (3.4.0-gc37fe8c-dirty #651)
[2207.627196:0] PC is at sysrq_handle_crash+0x38/0x48
[2207.632059:0] LR is at _raw_spin_unlock_irqrestore+0x20/0x40
[2207.637699:0] pc : [
[2207.637704:0] sp : e7c61ec8 ip : e7c61e98 fp : e7c61ed4
[2207.649487:0] r10: e7c61f70 r9 : e7c60000 r8 : 00000000
[2207.654865:0] r7 : 60000013 r6 : 00000063 r5 : 00000004 r4 : c079fc74
[2207.661539:0] r3 : 00000000 r2 : 00000001 r1 : 20000093 r0 : 00000001
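The current PC is c01ed158. The command dis -r <address> shows the exact faulting instruction together with all the code from the function entry up to that point.
help dis
-r (reverse) displays all instructions from the start of the
routine up to and including the designated address.
crash> dis -r c01ed158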
0xc01ed120
0xc01ed124
0xc01ed128
0xc01ed12c
0xc01ed130
0xc01ed134
0xc01ed138
0xc01ed13c
0xc01ed140
0xc01ed144
0xc01ed148
0xc01ed14c
0xc01ed150
0xc01ed154
0xc01ed158
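The first address in this listing can be cross-checked against the "PC is at sysrq_handle_crash+0x38/0x48" line, which reads as offset/size within the function. A small sketch of that arithmetic follows; `split_symbol` is a hypothetical helper, and the 4-byte instruction size assumes ARM mode, as the "ISA ARM" flag in the log indicates.

```python
def split_symbol(sym):
    """Split 'name+0xOFF/0xSIZE' into (name, offset, size)."""
    name, rest = sym.split("+")
    offset, size = (int(part, 16) for part in rest.split("/"))
    return name, offset, size


pc = 0xC01ED158
name, offset, size = split_symbol("sysrq_handle_crash+0x38/0x48")

entry = pc - offset       # address of the first instruction of the function
end = entry + size        # first address past the function
count = offset // 4 + 1   # ARM-mode instructions are 4 bytes each

print(f"{name}: entry=0x{entry:08x} end=0x{end:08x} instructions={count}")
# prints "sysrq_handle_crash: entry=0xc01ed120 end=0xc01ed168 instructions=15"
```

The computed entry, 0xc01ed120, is the first address dis -r prints, and the count of 15 matches the number of addresses listed.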
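The faulting instruction is strb r2, [r3], and at that point r3 is 00000000; storing data to address 0 is bound to fault.
Next, find the cause: where does r3 get its value? Looking upward in the listing, it is set at: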
0xc01ed150
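The "Code:" line of the oops carries the same information: the word in parentheses is the instruction at the PC, and the words before it are the preceding instructions. The sketch below decodes only the two ARM encodings needed here (MOV with an immediate, and STRB with an immediate offset) and leaves everything else undecoded. It is a hand-written illustration, not a real disassembler, and the helper names are hypothetical.

```python
CODE_LINE = "Code: 0a000000 e12fff33 e3a03000 e3a02001 (e5c32000)"
PC = 0xC01ED158


def words_with_addresses(code_line, pc):
    """Pair each word with its address; the parenthesized word sits at pc."""
    tokens = code_line.split()[1:]
    fault_index = next(i for i, t in enumerate(tokens) if t.startswith("("))
    return [
        (pc + 4 * (i - fault_index), int(tok.strip("()"), 16))
        for i, tok in enumerate(tokens)
    ]


def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def decode(word):
    """Decode MOV #imm and STRB [Rn, #imm]; anything else is left alone."""
    if (word >> 28) != 0xE:  # only the "always" condition is handled
        return "(not decoded)"
    rd = (word >> 12) & 0xF
    if (word & 0x0FE00000) == 0x03A00000:  # MOV Rd, #imm
        imm = ror32(word & 0xFF, 2 * ((word >> 8) & 0xF))
        return f"mov r{rd}, #{imm}"
    if (word & 0x0F700000) == 0x05400000:  # STRB Rd, [Rn, #+/-imm12]
        rn = (word >> 16) & 0xF
        off = word & 0xFFF
        if off == 0:
            return f"strb r{rd}, [r{rn}]"
        sign = "" if word & 0x00800000 else "-"
        return f"strb r{rd}, [r{rn}, #{sign}{off}]"
    return "(not decoded)"


listing = words_with_addresses(CODE_LINE, PC)
for address, word in listing:
    print(f"0x{address:08x}: {word:08x}  {decode(word)}")
```

Under these assumptions the last three words decode to mov r3, #0 at 0xc01ed150, mov r2, #1 at 0xc01ed154, and strb r2, [r3] at 0xc01ed158, which agrees with r3 = 00000000 and r2 = 00000001 in the register dump.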
Now look up the corresponding source code to find the cause.
help dis
-l displays source code line number data in addition to the
disassembly output.
crash> dis -rl c01ed158
/home/wenshuai/code/kernel3.4/linux_kernel/drivers/tty/sysrq.c:129
0xc01ed120
0xc01ed124
0xc01ed128
/home/wenshuai/code/kernel3.4/linux_kernel/drivers/tty/sysrq.c:132
0xc01ed12c
0xc01ed130
0xc01ed134
/home/wenshuai/code/kernel3.4/linux_kernel/drivers/tty/sysrq.c:133
0xc01ed138
/home/wenshuai/code/kernel3.4/linux_kernel/arch/arm/include/asm/outercache.h:114
0xc01ed13c
0xc01ed140
0xc01ed144
0xc01ed148
/home/wenshuai/code/kernel3.4/linux_kernel/arch/arm/include/asm/outercache.h:115
0xc01ed14c
/home/wenshuai/code/kernel3.4/linux_kernel/drivers/tty/sysrq.c:134
0xc01ed150
0xc01ed154
0xc01ed158
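In a listing like this, each source location applies to the addresses that follow it until the next location line. A minimal sketch of that mapping is shown below; `source_lines` is a hypothetical helper, and the sample is the tail of the listing above.

```python
import re

# Tail of the dis -rl listing above.
DIS_RL = """\
/home/wenshuai/code/kernel3.4/linux_kernel/arch/arm/include/asm/outercache.h:115
0xc01ed14c
/home/wenshuai/code/kernel3.4/linux_kernel/drivers/tty/sysrq.c:134
0xc01ed150
0xc01ed154
0xc01ed158
"""

LOCATION = re.compile(r"^(/.+):(\d+)$")


def source_lines(listing):
    """Map each address to the (file, line) header that precedes it."""
    mapping = {}
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        match = LOCATION.match(line)
        if match:
            current = (match.group(1), int(match.group(2)))
        elif line.startswith("0x") and current is not None:
            # Only the first token is used, so trailing text is ignored.
            address = int(line.split()[0].rstrip(":"), 16)
            mapping[address] = current
    return mapping


mapping = source_lines(DIS_RL)
path, lineno = mapping[0xC01ED158]
print(f"{path}:{lineno}")
```

For the PC value 0xc01ed158 this yields drivers/tty/sysrq.c line 134.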
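From the listing above, the fault is at line 134 of linux_kernel/drivers/tty/sysrq.c, inside sysrq_handle_crash:
static void sysrq_handle_crash(int key)
{
    char *killer = NULL;
    panic_on_oops = 1; /* force panic */
    wmb();
    *killer = 1;
}
This example is of course very simple, and the cause is easy to see: killer is NULL, so *killer = 1 writes to address 0. More often the fault comes from a function's input arguments, for example a member of an argument structure that was never assigned.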