待机时反复按Power键概率性重启-定位问题

现象:

待机时反复按power键休眠会有概率性重启,概率 >= 50%,但连接USB线情况下不会出现,所有怀疑是系统进入休眠才发生了异常,取出mtklog看到有db产生,确实发生了KE(kernel exception),取出db 和 vmlinux (必须是和当前软件是同一次编译的)后,使用GAT工具解开db,取出 SYS_MINI_RDUMP,使用 gdb 调试:

android@c0490:~/log/6601/20151218-KE$ arm-linux-androideabi-gdb vmlinux SYS_MINI_RDUMP 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-linux-gnu --target=arm-linux-android".
For bug reporting instructions, please see:
...
Reading symbols from /home/android/log/6601/20151218-KE/vmlinux...done.
[New LWP 100]
[New LWP 1]
[New LWP 2]
[New LWP 3]
[New LWP 4]
[New LWP 5]
[New LWP 6]
[New LWP 7]
[New LWP 8]
Core was generated by `console=tty0 console=ttyMT0,921600n1 root=/dev/ram vmalloc=496M slub_max_order='.
#0  device_prepare (dev=, state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1483
1483			callback = dev->bus->pm->prepare;
(gdb) list
1478		} else if (dev->class && dev->class->pm) {
1479			info = "preparing class ";
1480			callback = dev->class->pm->prepare;
1481		} else if (dev->bus && dev->bus->pm) {
1482			info = "preparing bus ";
1483			callback = dev->bus->pm->prepare;
1484		}
1485	
1486		if (!callback && dev->driver && dev->driver->pm) {
1487			info = "preparing driver ";
(gdb) bt
#0  device_prepare (dev=, state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1483
#1  dpm_prepare (state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1520
#2  0xc0327cc8 in dpm_suspend_start (state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1556
#3  0xc0090390 in suspend_devices_and_enter (state=0) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/suspend.c:273
#4  0xc0090a50 in enter_state (state=3) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/suspend.c:426
#5  0xc0090ac0 in pm_suspend (state=0) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/suspend.c:464
#6  0xc0098b24 in try_to_suspend (work=) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/power/autosleep.c:81
#7  0xc0042b60 in process_one_work (worker=0xd79a9600, work=0xc0e056e4 )
    at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/workqueue.c:2216
#8  0xc0043098 in worker_thread (__worker=0x0 <__vectors_start>) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/workqueue.c:2348
#9  0xc004c0bc in kthread (_create=0xdb1ffdf0) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/kernel/kthread.c:200
#10 0xc000f308 in ret_from_fork () at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/arch/arm/kernel/entry-common.S:91
#11 0xc000f308 in ret_from_fork () at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/arch/arm/kernel/entry-common.S:91
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

看到这个KE调用栈,心情是比较压抑的o(╯□╰)o,感觉就不太爽,因为里面都是kernel代码错误,并没有直接给出可以定位到具体的客制化代码位置,这种问题明显是原本是A处埋的雷可却在B处炸了!
本来A处埋了地雷,如果在A处炸了,那么通常来说问题还比较容易定位和解决。但是结果却是在B处炸了,然而B和A之间距离有多远呢?从这个bt是难以定位出来的, 所以使用有限的debug线索实现从B抓到A的过程成了本次解题的关键之关键!!!

从调用栈 可以看到进程跑到进程在  kernel-3.10/drivers/base/power/main.c:1483  位置发生KE了,看看具体源代码 :

static int device_prepare(struct device *dev, pm_message_t state)
{
	int (*callback)(struct device *) = NULL;
	char *info = NULL;
	int error = 0;

	if (dev->power.syscore)
		return 0;

	/*
	 * If a device's parent goes into runtime suspend at the wrong time,
	 * it won't be possible to resume the device.  To prevent this we
	 * block runtime suspend here, during the prepare phase, and allow
	 * it again during the complete phase.
	 */
	pm_runtime_get_noresume(dev);

	device_lock(dev);

	dev->power.wakeup_path = device_may_wakeup(dev);

	if (dev->pm_domain) {
		info = "preparing power domain ";
		callback = dev->pm_domain->ops.prepare;
	} else if (dev->type && dev->type->pm) {
		info = "preparing type ";
		callback = dev->type->pm->prepare;
	} else if (dev->class && dev->class->pm) {
		info = "preparing class ";
		callback = dev->class->pm->prepare;
	} else if (dev->bus && dev->bus->pm) {
		info = "preparing bus ";
		callback = dev->bus->pm->prepare;
	}



从调用栈可以知道死机的具体kernel代码位置,但是为什么会导致死机呢?是空指针,还是野指针导致?这个就需要分析汇编指令了,下面分析发生死机的原因, 使用gdb输出崩溃时的关键汇编指令:
(gdb) disassemble
Dump of assembler code for function dpm_prepare:
...
   0xc0327ba8 <+252>:	mov	r0, r8
   0xc0327bac <+256>:	bl	0xc0998e1c 
   0xc0327bb0 <+260>:	b	0xc0327ad4 
   0xc0327bb4 <+264>:	ldr	r3, [r4, #-128]	; 0x80
   0xc0327bb8 <+268>:	cmp	r3, #0
   0xc0327bbc <+272>:	beq	0xc0327c18 
   0xc0327bc0 <+276>:	ldr	r3, [r3, #20]
   0xc0327bc4 <+280>:	cmp	r3, #0
   0xc0327bc8 <+284>:	beq	0xc0327c18 
=> 0xc0327bcc <+288>:	ldr	r9, [r3]
   0xc0327bd0 <+292>:	cmp	r9, #0
   0xc0327bd4 <+296>:	bne	0xc0327b98 
   0xc0327bd8 <+300>:	ldr	r3, [r4, #-48]	; 0x30
   0xc0327bdc <+304>:	cmp	r3, #0
   0xc0327be0 <+308>:	beq	0xc0327ba8 
   0xc0327be4 <+312>:	ldr	r3, [r3, #52]	; 0x34
---Type  to continue, or q  to quit---q
Quit


汇编指令打印出来有点长,但我们只看关键的部分,如上的 =>  提示指令:
=> 0xc0327bcc <+288>:	ldr	r9, [r3]

 
   
gdb 工具已标记出进程是运行到这条 ldr(Memory Access) 指令时挂了!这个时候我们需要知道此刻的各个寄存器内容才能判断分析KE的原因:
(gdb) info reg
r0             0x0	0
r1             0x0	0
r2             0x0	0
r3             0x6b6b6b6b	1802201963
r4             0xd90680b0	3641082032
r5             0xd9068000	3641081856
r6             0xc0e1c280	3236020864
r7             0xc0e1c2e0	3236020960
r8             0xd9068034	3641081908
r9             0x0	0
r10            0xc0f07f3c	3236986684
r11            0xd7b83d1c	3619175708
r12            0xd7b83c88	3619175560
sp             0xd7b83cf0	0xd7b83cf0
lr             0xc099767c	3231282812
pc             0xc0327bcc	0xc0327bcc 
cpsr           0x200e0013	537788435
(gdb)

从当前寄存器值可以看出,r3 的 值是 0x6b6b6b6b  ==> 非kernel虚拟内存范围(从 0xc0008000 开始),执行 Memroy access 操作会导致MMU报 page translation fault(缺页异常), MMU首先会自己尝试处理缺页异常,通常是试图从Low level的memroy(如disk)中重新 load 数据进来,然后再次执行 TTW (Translation table walk) 地址转换, 最后发现处理不了就只能执行 __die()  ==> panic ==> 发生Kernel exception ,所以此题死机重启的原因就是 野指针 导致!

到现在这一步,死机的原因也明确了,那么关键是怎么根据现有的线索抓到最终导致问题的真凶呢?
基本思路就是:找出出错指针 dev->bus->pm 的地址 & dev->bus->pm ,根据这个出错指针的地址来想办法抓到真凶.
现在需要打印出汇编-C混合代码供分析,下面这段代码很关键:
1481		} else if (dev->bus && dev->bus->pm) {
   0xc0327c30 <+388>:	ldr	r3, [r4, #-52]	; 0x34
   0xc0327c34 <+392>:	cmp	r3, #0
   0xc0327c38 <+396>:	beq	0xc0327bd8 
   0xc0327c3c <+400>:	ldr	r3, [r3, #52]	; 0x34
   0xc0327c40 <+404>:	cmp	r3, #0
   0xc0327c44 <+408>:	bne	0xc0327bcc 
   0xc0327c48 <+412>:	b	0xc0327bd8 

1482			info = "preparing bus ";
1483			callback = dev->bus->pm->prepare;
=> 0xc0327bcc <+288>:	ldr	r9, [r3]

上面两句ldr指令加上下面的cmp指令,很明显就是代表上面 if(dev->bus && dev->bus->pm) 的描述,所以这里就可以判断出:
dev->bus  ==> r4 - #52 == 0xd906807c
dev->bus->pm  ==> r4 == 0xd90680b0

 
  
也可以使用gdb直接打印出来验证分析是否正确:
(gdb) p dev
$2 = (struct device *) 0xd9068000
(gdb) p &dev->bus
$3 = (struct bus_type **) 0xd906807c
(gdb) p &dev->bus->pm
$4 = (const struct dev_pm_ops **) 0x34
我们看到&dev->bus-pm的值是个0x34,其实这个值就是pm成员在bus结构体中的偏移刚好就是0x34,所以上面的分析的正确的,接下来就要根据dev->bus->pm的地址找到具体是哪里给出来的了,我们需要用gdb分析调用栈分析dev是从哪里传过来的,如下frame 0 是指最顶层的栈帧,通常被损坏了无法打印出当时的变量值,需要看上一级frame 1的栈帧分析:
(gdb) f 0
#0  device_prepare (dev=, state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1483
1483			callback = dev->bus->pm->prepare;
(gdb) p dev
$1 = 
(gdb) f 1
#1  dpm_prepare (state=...) at /home/android/work/prj/6601-32bit-kernel/baseline/kernel-3.10/drivers/base/power/main.c:1520
1520			error = device_prepare(dev, state);
(gdb) list
1515			struct device *dev = to_device(dpm_list.next);
1516	
1517			get_device(dev);
1518			mutex_unlock(&dpm_list_mtx);
1519	
1520			error = device_prepare(dev, state);
1521	
1522			mutex_lock(&dpm_list_mtx);
1523			if (error) {
1524				if (error == -EAGAIN) {
(gdb) p dev
$2 = (struct device *) 0xd9068000
(gdb) p &dev->bus
$3 = (struct bus_type **) 0xd906807c
(gdb) p &dev->bus->pm
$4 = (const struct dev_pm_ops **) 0x34
(gdb) p dpm_list
$5 = {next = 0xd90680b0, prev = 0xdb1df5b0}

在frame 1栈帧里可以看到此刻的dev具体地址,我们发现dev是从链表 dpm_list得来的,而且此刻出问题的dpm_list.next ==  0xd90680b0!!!,发现这个值很眼熟,==》就是 r4的值,也是刚刚要找的&dev->bus->pm的值!为什么会这样呢?
下面分析源代码:
int dpm_prepare(pm_message_t state)
{
	int error = 0;

	might_sleep();

	mutex_lock(&dpm_list_mtx);
	while (!list_empty(&dpm_list)) {
		struct device *dev = to_device(dpm_list.next);

		get_device(dev);
		mutex_unlock(&dpm_list_mtx);

		error = device_prepare(dev, state);

		mutex_lock(&dpm_list_mtx);
		if (error) {
			if (error == -EAGAIN) {
				put_device(dev);
				error = 0;
				continue;
			}
			printk(KERN_INFO "PM: Device %s not prepared "
				"for power transition: code %d\n",
				dev_name(dev), error);
			put_device(dev);
			break;
		}
		dev->power.is_prepared = true;
		if (!list_empty(&dev->power.entry))
			list_move_tail(&dev->power.entry, &dpm_prepared_list);
		put_device(dev);
	}
	mutex_unlock(&dpm_list_mtx);
	return error;
}


从上面代码可知:dev是从dmp_list得来的,继续追踪代码可以追踪发现其实是在开机的时候从 device_pm_add 函数加载的:
/**
 * device_pm_add - Add a device to the PM core's list of active devices.
 * @dev: Device to add to the list.
 */
void device_pm_add(struct device *dev)
{
	pr_debug("PM: Adding info for %s:%s\n",
		 dev->bus ? dev->bus->name : "No Bus", dev_name(dev));

	mutex_lock(&dpm_list_mtx);
	if (dev->parent && dev->parent->power.is_prepared)
		dev_warn(dev, "parent %s should not be sleeping\n",
			dev_name(dev->parent));
	list_add_tail(&dev->power.entry, &dpm_list);
	mutex_unlock(&dpm_list_mtx);
}

上面代码中,dev->power.entry的地址作为成员添加到dmp_list中,猜想&dev->power.entry 其实和  &dev->bus->pm 是同一个东西,dmp_list就是以各个dev的bus->pm成员的链表。既然dev是从这里加载来的,那么就可以在这里加打印,抓开机log抓出导致KE的真凶!
到这一步,心情总算是好点了,感觉就要快抓到它了!
void device_pm_add(struct device *dev)
{
	pr_debug("PM: Adding info for %s:%s\n",
		 dev->bus ? dev->bus->name : "No Bus", dev_name(dev));

	pr_debug("[KE],[%s],[0x%lx],[0x%lx],[0x%lx],[0x%lx],[%s]\n",
		__func__, 
		(long)dev, 
		(long)&dev->bus,
		(long)&dev->bus->pm, 
		(long)&dev->power.entry,
		dev_name(dev));

	mutex_lock(&dpm_list_mtx);
	if (dev->parent && dev->parent->power.is_prepared)
		dev_warn(dev, "parent %s should not be sleeping\n",
			dev_name(dev->parent));
	list_add_tail(&dev->power.entry, &dpm_list);
	mutex_unlock(&dpm_list_mtx);
}


编译kernel、烧机、抓开机log :
[   21.491480]<2>.(2)[1:swapper/0]PM: Adding info for No Bus:silead_fp_dev
[   21.492349]<2>.(2)[1:swapper/0][KE],[device_pm_add],[0xd9068000],[0xd906807c],[0x34],[0xd90680b0],[silead_fp_dev]


看到这个结果,心情马上就好了^_^ ! 因为造成死机重启的真凶露面了!------>>>>  silead_fp_dev
看到这个设备其实就可以去代码里面搜索关键字找到代码了,但是能否来得跟直接一点呢?能否直接定位到具体什么时候执行的呢?当然答案是肯定的,加上dump_stack() 函数就行了:
void device_pm_add(struct device *dev)
{
	pr_debug("PM: Adding info for %s:%s\n",
		 dev->bus ? dev->bus->name : "No Bus", dev_name(dev));

	pr_debug("[KE],[%s],[0x%lx],[0x%lx],[0x%lx],[0x%lx],[%s]\n",
		__func__, 
		(long)dev, 
		(long)&dev->bus,
		(long)&dev->bus->pm, 
		(long)&dev->power.entry,
		dev_name(dev));

	if(strcmp(dev_name(dev), "silead_fp_dev") == 0) 
	{
		pr_debug("[KE]:[0x%lx]\n",(long)dev);
		dump_stack();   
	}

	mutex_lock(&dpm_list_mtx);
	if (dev->parent && dev->parent->power.is_prepared)
		dev_warn(dev, "parent %s should not be sleeping\n",
			dev_name(dev->parent));
	list_add_tail(&dev->power.entry, &dpm_list);
	mutex_unlock(&dpm_list_mtx);
}


编译kernel、烧机、抓开机log :
[   21.493949]<2>.(2)[1:swapper/0][KE]:[0xd9068000]
[   21.494539]Backtrace: 
[   21.494883]<2>.(2)[1:swapper/0][] (dump_backtrace+0x0/0x110) from [] (dump_stack+0x18/0x1c)
[   21.496161] r6:d9068000 r5:d90680b0 r4:d9068000 r3:dbc3d000
[   21.496891]<2>.(2)[1:swapper/0][] (dump_stack+0x0/0x1c) from [] (device_pm_add+0x118/0x120)
[   21.498191]<2>.(2)[1:swapper/0][] (device_pm_add+0x0/0x120) from [] (device_add+0x4e0/0x5d0)
[   21.499470] r5:c0df60c8 r4:d9068008
[   21.499945]<2>.(2)[1:swapper/0][] (device_add+0x0/0x5d0) from [] (device_create_vargs+0xa8/0xc0)
[   21.501304]<2>.(2)[1:swapper/0][] (device_create_vargs+0x0/0xc0) from [] (device_create+0x28/0x30)
[   21.502648] r9:00000000 r8:c14698c0 r7:dbcd72a8 r6:c165b144 r5:d9046a00
r4:dbcd7280
[   21.503754]<2>.(2)[1:swapper/0][] (device_create+0x0/0x30) from [] (spidev_probe+0x2fc/0x528)
[   21.505076]<2>.(2)[1:swapper/0][] (spidev_probe+0x0/0x528) from [] (spi_drv_probe+0x20/0x24)

[   21.505076]<2>.(2)[1:swapper/0][] (spidev_probe+0x0/0x528) from [] (spi_drv_probe+0x20/0x24)

结合addr2Line工具可以解析出backtrace对于的函数所在文件行号,然后直接去源代码搜索相关函数就可以了.

到这一步已经大大缩小了异常的范围到一个设备驱动代码上了, 已完成了从汇编代码到源代码转换的过程,此题已解70%,
接下来就需要分析 silead_fp_dev  代码 logic了,找出具体异常原因和解决问题了,剩下内容 待下一章分析  ^_^ 。













你可能感兴趣的:(【解题笔记】,【系统异常分析】)