Three weeks of debugging, and the problem is finally found. No overtime tonight :-)
In one of our Qualcomm projects we deliberately trigger a crash on the modem side; the crash is reported to the Linux side through a pipe. When Linux receives the message, it puts the restart job into a work item and queues it onto system_wq with schedule_work(), then waits for a kworker thread to perform the restart. But after repeated restart testing we occasionally hit a system hang in which the device never restarted.
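For context, the queuing side is the completely standard schedule_work() pattern. Below is a minimal sketch of it; the handler body and the on_modem_crash() entry point are made-up placeholders, not the project's real code.

#include <linux/workqueue.h>

/* Sketch only: on_modem_crash() stands in for the real crash-notification
 * callback, and the handler body is omitted. */
static void device_restart_work_hdlr(struct work_struct *work)
{
    /* tear down the modem subsystem and restart the device here */
}

static DECLARE_WORK(device_restart_work, device_restart_work_hdlr);

static void on_modem_crash(void)
{
    /* queue onto the shared system_wq; some kworker will run it */
    schedule_work(&device_restart_work);
}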
For system-hang problems, Qualcomm provides a tool called linux-ramdump-parser-v2: capture a ramdump of the device and then parse it with this tool.
Below is the parsed workqueue state. This platform has only one CPU. For bound worker pools, each CPU has two: a normal-priority pool and a high-priority pool; workers in the high-priority pool are named like kworker/0:0H. In the output below, pool 0 is the normal-priority pool and pool 1 is the high-priority pool.
CPU 0
pool 0
BUSY Workqueue worker: kworker/0:8 current_work: (None)
BUSY Workqueue worker: kworker/0:9 current_work: (None)
BUSY Workqueue worker: kworker/0:10 current_work: (None)
BUSY Workqueue worker: kworker/0:11 current_work: (None)
BUSY Workqueue worker: kworker/0:12 current_work: (None)
BUSY Workqueue worker: kworker/0:13 current_work: (None)
BUSY Workqueue worker: kworker/0:2 current_work: (None)
BUSY Workqueue worker: kworker/0:14 current_work: (None)
BUSY Workqueue worker: kworker/0:5 current_work: (None)
BUSY Workqueue worker: kworker/0:4 current_work: (None)
BUSY Workqueue worker: kworker/0:3 current_work: (None)
BUSY Workqueue worker: kworker/0:0 current_work: (None)
BUSY Workqueue worker: kworker/0:6 current_work: (None)
BUSY Workqueue worker: kworker/0:7 current_work: (None)
IDLE Workqueue worker: kworker/0:1 current_work: (None)
IDLE Workqueue worker: kworker/0:15 current_work: (None)
Pending entry: device_restart_work_hdlr
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: pm_runtime_work
Pending entry: neigh_periodic_work
Pending entry: do_cache_clean
Pending entry: neigh_periodic_work
Pending entry: push_to_pool
Pending entry: push_to_pool
Pending entry: addrconf_verify_work
Pending entry: check_lifetime
Pending entry: release_one_tty
Pending entry: console_callback
pool 1
IDLE Workqueue worker: kworker/0:0H current_work: (None)
The log above shows that the normal-priority pool has a large number of BUSY workers, along with two IDLE ones. In theory, as long as there is an IDLE worker, our restart work device_restart_work_hdlr should be picked up and executed by it. But by the design of CMWQ (Concurrency Managed Workqueue), a worker pool keeps only one worker running at any given time, and it wakes another worker only when the running one goes to sleep. So a reasonable guess is that one worker in the pool really is running, and that it holds some lock, causing the other workers to block on that lock and sleep. We need the backtraces of these kworkers to see what they are doing. Only two examples are shown here: kworker/0:10 and kworker/0:14.
=====================================================
Process: kworker/0:10, cpu: 0 pid: 292 start: 0xddea3600
=====================================================
Task name: kworker/0:10 pid: 292 cpu: 0
state: 0x2 exit_state: 0x0 stack base: 0xdd9dc000
Stack:
[] __schedule+0x2c8
[] msm_nand_read_oob+0xec
[] msm_nand_read+0x22c
[] part_read+0x44
[] mtd_read+0x70
[] ubi_io_read+0x1b0
[] ubi_eba_read_leb+0x254
[] ubi_eba_read_leb_sg+0xcc
[] ubi_leb_read_sg+0x68
[] ubiblock_do_work+0x8c
[] process_one_work+0x1e4
[] worker_thread+0x2f0
[] kthread+0xc4
[] ret_from_fork+0x14
=======================================================
Process: kworker/0:14, cpu: 0 pid: 296 start: 0xdd076880
=====================================================
Task name: kworker/0:14 pid: 296 cpu: 0
state: 0x0 exit_state: 0x0 stack base: 0xde3d4000
Stack:
[] __schedule+0x2c8
[] preempt_schedule+0x3c
[] sps_get_iovec+0x20c
[] msm_nand_sps_get_iovec+0x1c
[] msm_nand_read_oob+0x4c8
[] msm_nand_read+0x22c
[] part_read+0x44
[] mtd_read+0x70
[] ubi_io_read+0x1b0
[] ubi_eba_read_leb+0x254
[] ubi_eba_read_leb_sg+0xcc
[] ubi_leb_read_sg+0x68
[] ubiblock_do_work+0x8c
[] process_one_work+0x1e4
[] worker_thread+0x2f0
[] kthread+0xc4
[] ret_from_fork+0x14
It is not hard to see from the code below that msm_nand_read_oob() takes a mutex (info->lock). kworker/0:14 acquired this mutex and carried on, while the other kworkers in pool 0 failed to get it and went to sleep via schedule(). That matches kworker/0:10 above: its state is 0x2 (TASK_UNINTERRUPTIBLE) and its stack goes straight from msm_nand_read_oob into __schedule, i.e. it is sleeping on the mutex.
for (n = 0; n < (cmd_list->count - 1); n++) {
    iovec->addr = msm_virt_to_dma(chip,
            &cmd_list->cw_desc[n].ce[0]);
    iovec->size = sizeof(struct sps_command_element) *
            cmd_list->cw_desc[n].num_ce;
    iovec->flags = cmd_list->cw_desc[n].flags;
    iovec++;
}

mutex_lock(&info->lock);
err = msm_nand_get_device(chip->dev);
if (err)
    goto unlock_mutex;
/* Submit data descriptors */
for (n = rw_params.start_sector; n < cwperpage; n++) {
    err = msm_nand_submit_rw_data_desc(ops,
            &rw_params, info, n, 0);
    if (err) {
        pr_err("Failed to submit data descs %d\n", err);
        panic("error in nand driver\n");
        goto put_dev;
    }
}
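To see why one mutex is enough to explain all those backtraces, here is a minimal standalone sketch (not the driver code): many work items funnel into a function that takes the same mutex, so only the current holder makes progress, and every other kworker sleeps inside mutex_lock() with __schedule at the top of its stack, exactly like kworker/0:10 above.

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(shared_lock);    /* plays the role of info->lock */

static void contended_work_fn(struct work_struct *work)
{
    /* If another worker holds the lock, we sleep here in
     * TASK_UNINTERRUPTIBLE (state 0x2) until it is released. */
    mutex_lock(&shared_lock);

    /* ... long-running NAND I/O while holding the lock ... */

    mutex_unlock(&shared_lock);
}

As long as the holder eventually drops the lock this is merely slow, not broken; it turns into a hang only if the holder itself stops making progress, which is exactly what happens next.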
Back to kworker/0:14's backtrace. At first glance this kworker also ended up in __schedule and apparently never got scheduled back. But is that really the case? Note that its state is 0x0 (TASK_RUNNING) and that it reached __schedule via preempt_schedule, meaning it was preempted rather than blocked. CFS gives us the answer: this kworker is still sitting on the CFS runqueue as a pending task, so it will get another chance to run.
======================= RUNQUEUE STATE ============================
CPU0 2 process is running
|--curr: fwupdateDaemon(567)
|--idle: swapper(0)
|--stop: None(0)
CFS 2 process is pending
|--curr: 0 process is grouping
| |--curr: None(0)
| |--next: None(0)
| |--last: None(0)
| |--skip: None(0)
|--next: None(0)
|--last: None(0)
|--skip: None(0)
|--pend: kworker/0:14(296)
That leaves the other functions on kworker/0:14's stack. sps_get_iovec() shows up under __schedule only because preemption is re-enabled when it releases a spinlock, which lets preempt_schedule() run; that function itself is fine. What finally looked suspicious was msm_nand_sps_get_iovec(), which contains a while loop. Our guess was that the loop condition could stay true forever, so the function would never return. We added a pr_err() inside the loop and reran the test. Sure enough, when the problem reproduced, the loop never exited. The SPS hardware is shared between the modem and Linux, so when the modem crashes, SPS may stop working and the Linux side never reads back the descriptor it is polling for. And because this worker keeps running without ever sleeping, the normal-priority pool never wakes one of its IDLE workers to handle our restart work.
static int msm_nand_sps_get_iovec(struct sps_pipe *pipe, uint32_t indx,
        unsigned int cnt, struct sps_iovec *iovec)
{
    int ret = 0;

    do {
        do {
            ret = sps_get_iovec((pipe), (iovec));
        } while (((iovec)->addr == 0x0) && ((iovec)->size == 0x0));
        if (ret)
            return ret;
    } while (--(cnt));
    return ret;
}
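For reference, the instrumentation mentioned above amounts to something like the sketch below; the exact message and the rate limiting are my own additions, not the original debug patch.

static int msm_nand_sps_get_iovec(struct sps_pipe *pipe, uint32_t indx,
        unsigned int cnt, struct sps_iovec *iovec)
{
    int ret = 0;

    do {
        do {
            ret = sps_get_iovec((pipe), (iovec));
            /* Debug aid: if SPS keeps handing back empty iovecs,
             * the inner loop spins forever; make that visible. */
            if (((iovec)->addr == 0x0) && ((iovec)->size == 0x0))
                pr_err_ratelimited("%s: pipe %u returned empty iovec, retrying\n",
                        __func__, indx);
        } while (((iovec)->addr == 0x0) && ((iovec)->size == 0x0));
        if (ret)
            return ret;
    } while (--(cnt));
    return ret;
}

Two follow-up ideas, neither of which is claimed to be the fix that actually shipped: bound the busy-wait (for example with a retry counter or a time_after() deadline) so that a dead SPS pipe becomes an error return instead of a silent hang; and queue the restart work on its own WQ_HIGHPRI workqueue so it is served by the high-priority pool, which was completely idle in the dump above, instead of competing with everything else on system_wq. A sketch of the latter, reusing device_restart_work from the first sketch:

/* Hypothetical mitigation: a dedicated high-priority workqueue so the
 * restart work cannot be starved by the normal-priority pool. */
static struct workqueue_struct *restart_wq;

static int __init restart_wq_init(void)
{
    restart_wq = alloc_workqueue("restart_wq", WQ_HIGHPRI, 0);
    return restart_wq ? 0 : -ENOMEM;
}

static void on_modem_crash(void)
{
    queue_work(restart_wq, &device_restart_work);
}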
References:
http://kernel.meizu.com/linux-workqueue.html
http://www.wowotech.net/irq_subsystem/cmwq-intro.html