测试组发现用户进程在某种特定情况下,会出现死锁,现象是进程还在S状态,但没有任何反应,所以怀疑死锁。
通过几次测试发现,进程中设置的参数恢复出厂后重启进程很大概率会出现死锁,这时候已经把复现的方法明确,但是从复现的场景来看暂时无法定位出原因。接下来就编译问题版本进行问题跟踪。
追查进程死锁方法我知道的有这么几种:另开线程心跳监控、另开进程心跳监控,打印调试,gdb调试,git回溯版本范围缩小;
由于进程中开了2个业务线程,所以使用另开线程心跳监控方法有弊端,死锁后调度也会卡住心跳线程导致不能准确定位;综合来看使用gdb进行调试;
下载gdb源码并进行交叉编译,然后拷贝到盒子进行调试;编译进程时加上-g调试信息:
./arm-linux-gnueabihf-gdb ifotond 开始复现死锁,死锁后打印如下
root@www:/mnt/emmc/lock# ./arm-linux-gnueabihf-gdb ifotond-25-g
GNU gdb (GDB) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
.
Find the GDB manual and other documentation resources online at:
.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ifotond-25-g...done.
(gdb) r
Starting program: /mnt/emmc/lock/ifotond-25-g
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
[Detaching after fork from child process 28336]
not set the app nameSet sleep time is 28800
[New LWP 28341]
[New LWP 28342]
[New LWP 28343]
[WARNING]:not set the ubus name
regis id = 8
regis id = 3
regis id = 1
regis id = 2
[LWP 28343 exited]
227 ota_event_set etype:5, e_status:1, LR:00040600
[Detaching after fork from child process 28346]
227 ota_event_set etype:6, e_status:1, LR:000283fc
[Detaching after fork from child process 28348]
send data!type is 1!data is 07 01 ac 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 be 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 10 30 d4 59 56 71 e8 9a 6a b2 fc db 7e f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 04 07 60 2a 22 00 00 00 00 00 00 00 00 00 00 00 00 08 07 60 2a 22 d9 db 2b 89 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
recv data!type is 0!data is 01 00
[Detaching after fork from child process 28357]
[ERROR]:register timeout = 30 name = ifoton
Successfully captured all of multi-frame. Freeing memory.
卡住了
^C 断下来
Thread 1 "ifotond-25-g" received signal SIGINT, Interrupt.
0xb6cc41b0 in poll () from /lib/arm-linux-gnueabihf/libc.so.6
(gdb)
(gdb)
(gdb) info thread
Id Target Id Frame
* 1 LWP 28322 "ifotond-25-g" 0xb6cc41b0 in poll () 主进程==线程1
from /lib/arm-linux-gnueabihf/libc.so.6
2 LWP 28341 "ifotond-25-g" 0xb6caa0c0 in nanosleep () 线程2
from /lib/arm-linux-gnueabihf/libc.so.6
3 LWP 28342 "ifotond-25-g" 0xb6cc41b0 in poll () 线程3
from /lib/arm-linux-gnueabihf/libc.so.6
(gdb) thread 1
[Switching to thread 1 (LWP 28322)]
#0 0xb6cc41b0 in poll () from /lib/arm-linux-gnueabihf/libc.so.6
(gdb) bt
#0 0xb6cc41b0 in poll () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
从上面打印看,死锁后主线程和线程3卡在同一个poll函数,由于进程中使用的socket通信使用的是select,所以没有直接调用poll函数;poll函数也是用于网络通信的,进程中频繁使用的ubus内部的机制就是使用的网络通信;
查询进程代码没有直接调用poll函数,poll函数在libc库实现,又由于bt没有打印出回溯信息,所以怀疑poll函数是在动态库里面调用的;
搜索动态库,可知在/lib/libubus.so:1742:poll调用,查看ubus源码,确实有调用poll函数;
ubus-1d2b3bb/libubus-io.c
static void wait_data(int fd, bool write)
{
struct pollfd pfd = { .fd = fd };
pfd.events = write ? POLLOUT : POLLIN;
poll(&pfd, 1, -1);
}
void __hidden ubus_poll_data(struct ubus_context *ctx, int timeout)
{
struct pollfd pfd = {
.fd = ctx->sock.fd,
.events = POLLIN | POLLERR,
};
poll(&pfd, 1, timeout ? timeout : -1);
ubus_handle_data(&ctx->sock, ULOOP_READ);
}
到这里就知道是ubus导致的死锁,我们知道,ubus不支持多线程调用,否则容易出现死锁;进程代码中调用ubus是主线程负责,出现死锁的原因可能就是其他线程调用了ubus,这点从gdb打印也可看出;
在ubus提供的接口ubus_call、ubus_reply、ubus_send中添加参数和in/out打印,待死锁后查看参数就可知道在代码中调用的位置;
最后查出是ubus_send在复位情况后会被线程3调用,导致了主线程调用ubus_call卡住死锁,ubus_call可以明确是正常调用,通过在ubus_send中造一个空指针把pg调用顺序打出就知道了调用者,最后查出了问题原因:没有注意到复位流程会走线程3调用;
问题解决:把这个ubus_send调用加入到主线程队列等待被调用就可以了,可能会有不实时的风险;
打印调试,在怀疑死锁的模块里面加上这段代码,db_msg换成printf。
#if 1
#define pthread_mutex_lock(lock) do { \
db_msg("lock: in %d, %s", __LINE__, __FUNCTION__); \
pthread_mutex_lock(lock); \
db_msg("locked: in %d, %s", __LINE__, __FUNCTION__); \
} while(0)
#define pthread_mutex_unlock(lock) do { \
db_msg("unlock: in %d, %s", __LINE__, __FUNCTION__); \
pthread_mutex_unlock(lock); \
db_msg("unlocked: in %d, %s", __LINE__, __FUNCTION__); \
} while(0)
#endif
提示warning: Unable to find libthread_db,应该是libthread.so strap过了或者需要调用libthread_db库来支持,需要验证一下,又有说法是需要额外的libc库和libthread库(size很大)在支持调试,否则info thread信息不准确。