一次system hang的调试经历

调试了3周,终于把问题找到了,今晚不加班:-).

问题描述

在高通的一个项目中,我们在modem侧制造crash,这个crash会通过管道传递到linux这端。Linux这侧收到这个消息后,就会将重启任务放到一个work中,通过schedule_work将这个work加入system_wq. 等待worker线程执行重启任务。但是经过反复的重启测试,发现会有偶现的系统hang无法重启的情况。

问题定位

对于system hang的问题,高通提供了一套工具,名叫linux-ramdump-parser-v2,可以将ramdump出来,然后用这个工具去解析。

Log分析

下面是workqueue的解析情况,当前平台只有一个CPU,对于bind的线程池,每个 CPU有两个,一个是普通优先级的线程池,一个是高优先级的线程池。高优先级线程池中的线程如kwoker/0:0H。下面pool 0是普通线程池,pool 1是高优先级线程池。

CPU 0
pool 0
BUSY Workqueue worker: kworker/0:8 current_work: (None)
BUSY Workqueue worker: kworker/0:9 current_work: (None)
BUSY Workqueue worker: kworker/0:10 current_work: (None)
BUSY Workqueue worker: kworker/0:11 current_work: (None)
BUSY Workqueue worker: kworker/0:12 current_work: (None)
BUSY Workqueue worker: kworker/0:13 current_work: (None)
BUSY Workqueue worker: kworker/0:2 current_work: (None)
BUSY Workqueue worker: kworker/0:14 current_work: (None)
BUSY Workqueue worker: kworker/0:5 current_work: (None)
BUSY Workqueue worker: kworker/0:4 current_work: (None)
BUSY Workqueue worker: kworker/0:3 current_work: (None)
BUSY Workqueue worker: kworker/0:0 current_work: (None)
BUSY Workqueue worker: kworker/0:6 current_work: (None)
BUSY Workqueue worker: kworker/0:7 current_work: (None)
IDLE Workqueue worker: kworker/0:1 current_work: (None)
IDLE Workqueue worker: kworker/0:15 current_work: (None)
Pending entry: device_restart_work_hdlr
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: ubiblock_do_work
Pending entry: pm_runtime_work
Pending entry: neigh_periodic_work
Pending entry: do_cache_clean
Pending entry: neigh_periodic_work
Pending entry: push_to_pool
Pending entry: push_to_pool
Pending entry: addrconf_verify_work
Pending entry: check_lifetime
Pending entry: release_one_tty
Pending entry: console_callback
pool 1
IDLE Workqueue worker: kworker/0:0H current_work: (None)

从上面的日志可以看出普通线程池有大量的BUSY线程,同时也有两个IDLE线程。理论上,只要有IDLE线程,我们的重启任务device_restart_work_hdlr就会被IDLE线程执行。从CMWQ(Concurrency Managed Workqueue)的设计上来说,线程池中一个时间点只允许1个线程运行。于是我们可以猜测,线程池中确实有一个线程正在执行,并且这个线程获取了某种锁,导致其他线程无法获取到锁而睡眠。我们需要这些kworker的backtrace,看看他们在干什么。这里只举两个例子,kworker/0:10和kworker/0:14。

===================================================== 
 Process: kworker/0:10, cpu: 0 pid: 292 start: 0xddea3600 
===================================================== 
Task name: kworker/0:10 pid: 292 cpu: 0 
state: 0x2 exit_state: 0x0 stack base: 0xdd9dc000 
Stack: 
[] __schedule+0x2c8 
[] msm_nand_read_oob+0xec 
[] msm_nand_read+0x22c 
[] part_read+0x44 
[] mtd_read+0x70 
[] ubi_io_read+0x1b0 
[] ubi_eba_read_leb+0x254 
[] ubi_eba_read_leb_sg+0xcc 
[] ubi_leb_read_sg+0x68 
[] ubiblock_do_work+0x8c 
[] process_one_work+0x1e4 
[] worker_thread+0x2f0 
[] kthread+0xc4 
[] ret_from_fork+0x14
======================================================= 
Process: kworker/0:14, cpu: 0 pid: 296 start: 0xdd076880 
===================================================== 
Task name: kworker/0:14 pid: 296 cpu: 0 
state: 0x0 exit_state: 0x0 stack base: 0xde3d4000 
Stack: 
[] __schedule+0x2c8 
[] preempt_schedule+0x3c 
[] sps_get_iovec+0x20c 
[] msm_nand_sps_get_iovec+0x1c 
[] msm_nand_read_oob+0x4c8 
[] msm_nand_read+0x22c 
[] part_read+0x44 
[] mtd_read+0x70 
[] ubi_io_read+0x1b0 
[] ubi_eba_read_leb+0x254 
[] ubi_eba_read_leb_sg+0xcc 
[] ubi_leb_read_sg+0x68 
[] ubiblock_do_work+0x8c 
[] process_one_work+0x1e4 
[] worker_thread+0x2f0 
[] kthread+0xc4 
[] ret_from_fork+0x14

从下面代码不难看出,msm_nand_read_oob中有个mutex_lock,kworker/0:14获取这里这把锁,继续往下走,而其他pool 0中的kworker则获取不到这个锁就schedule了。

        for (n = 0; n < (cmd_list->count - 1); n++) {
            iovec->addr =  msm_virt_to_dma(chip,
                        &cmd_list->cw_desc[n].ce[0]);
            iovec->size = sizeof(struct sps_command_element) *
                        cmd_list->cw_desc[n].num_ce;
            iovec->flags = cmd_list->cw_desc[n].flags;
            iovec++;
        }
        mutex_lock(&info->lock);
        err = msm_nand_get_device(chip->dev);
        if (err)
            goto unlock_mutex;
        /* Submit data descriptors */
        for (n = rw_params.start_sector; n < cwperpage; n++) {
            err = msm_nand_submit_rw_data_desc(ops,
                        &rw_params, info, n, 0);
            if (err) {
                pr_err("Failed to submit data descs %d\n", err);
                panic("error in nand driver\n");
                goto put_dev;
            }
        }

重新回到kworker/0:14的backtrace,从这里看,这个kworker也最终调用__schedule,而且貌似再也没有调度回来了。但真的是这样吗?CFS给了我们答案,这个kworker线程一直在CFS pending队列中,也就是说他一定会有机会运行。

======================= RUNQUEUE STATE ============================                                                                                                                                                                    
CPU0 2 process is running                                                                                                                                                                                                              
|--curr: fwupdateDaemon(567)                                                                                                                                                                                                          
|--idle: swapper(0)                                                                                                                                                                                                                   
|--stop: None(0)                                                                                                                                                                                                                      
CFS 2 process is pending                                                                                                                                                                                                               
|--curr: 0 process is grouping                                                                                                                                                                                                        
|   |--curr: None(0)                                                                                                                                                                                                                  
|   |--next: None(0)                                                                                                                                                                                                                  
|   |--last: None(0)                                                                                                                                                                                                                  
|   |--skip: None(0)                                                                                                                                                                                                                  
|--next: None(0)                                                                                                                                                                                                                      
|--last: None(0)                                                                                                                                                                                                                      
   8  |--skip: None(0)                                                                                                                                                                                                                      
   9  |--pend: kworker/0:14(296)              

只能看看kworker/0:14调用的其他函数。sps_get_iovec这个函数在spin_unlock的时候开启抢占,继而调用__schedule,这个函数是没有问题的。最后发现msm_nand_sps_get_iovec貌似有些问题,中间有个while循环。我们猜测,在条件判断中,可能出现一直是true,然后无法返回的情况。最后在while循环中加上pr_err,再次测试。果然,出现问题时,这个while没有退出。因为sps设备是modem和Linux共用的,当modem crash的时候,sps可能无法正常运行,造成Linux这端无法读取到数据。因为这个线程一直运行,普通优先级线程池中无法唤醒其他的IDLE线程处理我们的重启任务。

static int msm_nand_sps_get_iovec(struct sps_pipe *pipe, uint32_t indx,
                unsigned int cnt, struct sps_iovec *iovec)
{
    int ret = 0;

    do {
        do {
            ret = sps_get_iovec((pipe), (iovec));
        } while (((iovec)->addr == 0x0) && ((iovec)->size == 0x0));
        if (ret)
            return ret;
    } while (--(cnt));
    return ret;
}

解决方法

  1. 因为重启任务本来就是非常紧急的,可以将这个work加入pool 1,也就是高优先级线程池。即使pool 0所有线程挂起,pool 1仍然可以唤起一个IDLE工作线程处理重启任务。
  2. 可以在上面的循环加上计数,在重试一定次数后退出。

工作队列相关文档

http://kernel.meizu.com/linux-workqueue.html
http://www.wowotech.net/irq_subsystem/cmwq-intro.html

你可能感兴趣的:(process)