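Most readers are probably already familiar with signal(), sigaction(), and the related signal-handling APIs. This post covers a common pitfall in writing signal handlers: a signal handler must be reentrant, which in practice means it may call only reentrant functions. A handler that is not reentrant can cause all sorts of baffling problems.
The "Reentrant Functions" section of Advanced Programming in the UNIX Environment puts it this way:
"But in a signal handler, we cannot tell where the process was executing when the signal was caught. Suppose the process was in the middle of malloc, allocating additional memory on its heap, and at that moment a signal is caught and the handler runs, and the handler calls malloc again. What happens then?"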
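The concept of a reentrant function is not hard to grasp, but many people overlook it when they actually use signals, especially when the non-reentrant function involved is a subtle one. I have twice run into deadlocks in projects caused by calling a non-reentrant function from a signal handler. In one project, after running for a while the process all but stopped responding to external commands, and logging all but stopped too (only a few simple polling threads still printed something periodically), yet ps showed that the process was still running. My first guess was a deadlock. I attached gdb to the process and examined each thread's stack, and sure enough, many threads were stuck in malloc calls: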
Thread 152 (Thread 0x7f020abf5700 (LWP 7801)):
#0 0x00000032120f6dde in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000321207c59b in _L_lock_9495 () from /lib64/libc.so.6
#2 0x0000003212079b86 in malloc () from /lib64/libc.so.6
#3 0x00000030142bd09d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#4 0x00000000005b9092 in __gnu_cxx::new_allocator<std::_List_node<Memory::TSmartObjectPtr<CPacketBase> > >::allocate(unsigned long, void const*) ()
#5 0x00000000005b8f10 in std::_List_base<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::_M_g
#6 0x00000000005b8cff in std::list<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::_M_create_
#7 0x00000000005b889b in std::list<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::_M_insert(
#8 0x00000000005b8020 in std::list<Memory::TSmartObjectPtr<CPacketBase>, std::allocator<Memory::TSmartObjectPtr<CPacketBase> > >::push_back(
#9 0x00000000006194cd in CProtoParser::parser(char*, unsigned int) ()
#10 0x0000000000618fb5 in CProtoParser::putDataLen(unsigned int) ()
#11 0x00000000006666be in CConnection::handle_input(int) ()
#12 0x000000000069595a in NetFramework::CNetThread::handle_netevent(NetFramework::list_node*) ()
#13 0x0000000000695bbf in NetFramework::CNetThread::ThreadProc(Infra::CThreadLite&) ()
#14 0x00000000006a3f38 in (anonymous namespace)::InternalThreadBody(void*) ()
#15 0x0000003212407851 in start_thread () from /lib64/libpthread.so.0
#16 0x00000032120e767d in clone () from /lib64/libc.so.6
------------------------------------------------------------------------------------------------------------------------
Thread 19 (Thread 0x7f01019ec700 (LWP 7939)):
#0 0x00000032120f6dde in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000321207bede in _L_lock_44 () from /lib64/libc.so.6
#2 0x0000003212074d4c in ptmalloc_lock_all () from /lib64/libc.so.6
#3 0x00000032120ab9a5 in fork () from /lib64/libc.so.6
#4 0x0000003212067c07 in _IO_proc_open@@GLIBC_2.2.5 () from /lib64/libc.so.6
#5 0x0000003212067ef9 in popen@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6 0x0000000000548746 in os::shell(std::basic_ostream<char, std::char_traits<char> >*, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
#7 0x00000000005f5a5c in CDiskPerfCollector::IsCollectorRunning() ()
#8 0x00000000005f5766 in CDiskPerfCollector::threadProc() ()
#9 0x00000000006a3f38 in (anonymous namespace)::InternalThreadBody(void*) ()
#10 0x0000003212407851 in start_thread () from /lib64/libpthread.so.0
#11 0x00000032120e767d in clone () from /lib64/libc.so.6
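At this point some might suspect a bug in glibc's malloc. It is not. A closer look reveals what is going on: one thread was in the middle of malloc when it jumped into a signal handler, and the handler called a system API that itself called malloc internally. The glibc source shows that malloc takes a lock internally and that this lock is non-recursive. If the interrupted call already holds the lock and the signal handler then calls malloc again, the result is naturally a deadlock. Even when no deadlock occurs, the reentry is very likely to corrupt the global state that malloc maintains, leading to inexplicable crashes later on.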
Thread 63 (Thread 0x7f010b3f9700 (LWP 7890)):
#0 0x00000032120f6dde in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000321207c59b in _L_lock_9495 () from /lib64/libc.so.6
#2 0x0000003212079b86 in malloc () from /lib64/libc.so.6
#3 0x000000321180cb8d in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#4 0x0000003211812a11 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#5 0x000000321180e196 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x000000321181246a in _dl_open () from /lib64/ld-linux-x86-64.so.2
#7 0x00000032121250a0 in do_dlopen () from /lib64/libc.so.6
#8 0x000000321180e196 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#9 0x00000032121251f7 in __libc_dlopen_mode () from /lib64/libc.so.6
#10 0x00000032120fd5f5 in init () from /lib64/libc.so.6
#11 0x000000321240cb23 in pthread_once () from /lib64/libpthread.so.0
#12 0x00000032120fd6f4 in backtrace () from /lib64/libc.so.6
#13 0x0000000000614363 in printStackTrace() ()
#14 0x000000000061497a in interruptTrigger(int, siginfo*, void*) ()
#15 <signal handler called>
#16 0x0000003212078c33 in _int_malloc () from /lib64/libc.so.6
#17 0x0000003212079b91 in malloc () from /lib64/libc.so.6
#18 0x00000030142bd09d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#19 0x00000000006982ce in NetFramework::CSockAddrStorage::CSockAddrStorage() ()
#20 0x000000000066639d in CConnection::attach(NetFramework::CSockStream&) ()
#21 0x0000000000620be0 in CServiceBase::init(NetFramework::CSockStream&) ()
#22 0x00000000005c1193 in CSession::init(NetFramework::CSockStream&) ()
#23 0x00000000005705a8 in CDNServer::accept(NetFramework::CSockStream&) ()
#24 0x000000000061f3e4 in CServer::Internal::handle_input(int) ()
#25 0x000000000069595a in NetFramework::CNetThread::handle_netevent(NetFramework::list_node*) ()
#26 0x0000000000695bbf in NetFramework::CNetThread::ThreadProc(Infra::CThreadLite&) ()
#27 0x00000000006a3f38 in (anonymous namespace)::InternalThreadBody(void*) ()
#28 0x0000003212407851 in start_thread () from /lib64/libpthread.so.0
#29 0x00000032120e767d in clone () from /lib64/libc.so.6
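Because thread LWP 7890 deadlocked by entering malloc a second time while handling a signal, many other threads hung as soon as they reached malloc. Those threads may themselves have been holding application-level locks, so the deadlock spread quickly until almost the entire process was stuck.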
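The double entry in the Thread 63 stack can be modeled without actually hanging a process. The following is only a minimal sketch, not glibc's real code: `toy_heap_lock`, `probe_lock_handler`, and `demo_reentry` are hypothetical names, and a C11 `atomic_flag` stands in for malloc's non-recursive internal lock. The handler merely probes the lock instead of waiting on it, so the program can report the condition that would have been a deadlock.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Toy stand-in for malloc's internal, non-recursive lock (hypothetical). */
static atomic_flag toy_heap_lock = ATOMIC_FLAG_INIT;

/* Set to 1 when the handler finds the lock already held. */
static volatile sig_atomic_t reentered_while_locked = 0;

static void probe_lock_handler(int signo)
{
    (void)signo;
    /* A handler that really called malloc would wait here forever: the lock
     * owner is the very code this handler interrupted. We only probe. */
    if (atomic_flag_test_and_set(&toy_heap_lock)) {
        reentered_while_locked = 1;         /* already held: would deadlock */
    } else {
        atomic_flag_clear(&toy_heap_lock);  /* was free: undo our probe */
    }
}

/* Returns 1 if the handler saw the lock held, 0 if not, -1 on setup error. */
int demo_reentry(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_lock_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    reentered_while_locked = 0;
    while (atomic_flag_test_and_set(&toy_heap_lock)) {
        /* spin until acquired: "entering malloc" */
    }
    raise(SIGUSR1);                     /* signal arrives while the lock is held */
    atomic_flag_clear(&toy_heap_lock);  /* "leaving malloc" */
    return reentered_while_locked;
}
```

Calling `demo_reentry()` from `main` shows the handler observing a lock that its own thread holds. With a blocking lock, as in malloc, that thread could never make progress again.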
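It is also worth pointing out that calls to malloc can be well hidden, for example in assigning to a std::string or in writing a log line, so it is easy to fall into this trap. The code involved here was in fact written by some of the more senior engineers on our team. Their intent was to print the current thread's stack to the log via backtrace and related functions when certain special signals arrived, to make problems easier to diagnose. Little did they know that this seemingly clever trick would cause a far more complicated problem. The rule that a signal handler must be reentrant therefore demands great care and constant attention in real code.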
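To illustrate how a handler can emit a log line without any hidden malloc, here is a hedged sketch that formats the signal number by hand and calls only write(), which POSIX lists as async-signal-safe. The names `format_uint`, `log_signal`, `install_signal_logger`, and `sig_log_fd` are all hypothetical, and the message text is made up for the example. This is not the project's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* File descriptor the handler writes to (hypothetical; defaults to stderr). */
static volatile sig_atomic_t sig_log_fd = STDERR_FILENO;

/* Writes value in decimal into buf without printf; returns the digit count.
 * buf must have room for at least 10 bytes. No terminating NUL is added. */
static size_t format_uint(unsigned int value, char *buf)
{
    char tmp[12];
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = tmp[n - 1 - i];   /* digits were produced in reverse */
    return n;
}

/* Handler: builds the line on the stack and emits it with one write(). */
static void log_signal(int signo)
{
    static const char prefix[] = "caught signal ";
    char line[sizeof(prefix) + 12];
    size_t len = 0;
    int saved_errno = errno;       /* write() may clobber errno */

    for (size_t i = 0; i < sizeof(prefix) - 1; i++)
        line[len++] = prefix[i];
    len += format_uint((unsigned int)signo, line + len);
    line[len++] = '\n';

    ssize_t rc = write((int)sig_log_fd, line, len);
    (void)rc;                      /* nothing safe to do on failure here */
    errno = saved_errno;
}

/* Installs log_signal for signo, logging to fd. Returns 0 on success. */
int install_signal_logger(int signo, int fd)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = log_signal;
    sigemptyset(&sa.sa_mask);
    sig_log_fd = fd;
    return sigaction(signo, &sa, NULL);
}
```

The design choice is to avoid printf, std::string, and any logging framework inside the handler, since each of those may allocate. Anything richer than a fixed-format line is better deferred to ordinary code.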
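In general, a signal handler should do as little as possible. A common approach is to set a flag in the handler and let another thread detect that flag and do the actual work, rather than doing the work directly in the handler.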
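The flag approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual fix: `dump_requested`, `request_dump_handler`, `install_dump_request_handler`, and `take_dump_request` are hypothetical names, and the sketch assumes `atomic_int` is lock-free on the target platform (checked at compile time), so the handler can store to it safely and other threads can read it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Assumption: atomic_int is always lock-free, so it is usable in a handler. */
_Static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic_int must be lock-free");

/* Set by the handler, consumed by ordinary code (hypothetical name). */
static atomic_int dump_requested;

/* Handler: records the request and returns. No malloc, no locks, no I/O. */
static void request_dump_handler(int signo)
{
    (void)signo;
    atomic_store(&dump_requested, 1);
}

/* Installs the handler for signo. Returns 0 on success, -1 on error. */
int install_dump_request_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = request_dump_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted slow system calls */
    return sigaction(signo, &sa, NULL);
}

/* Called from ordinary (non-handler) context, e.g. a polling thread's loop.
 * Atomically reads and clears the flag; returns 1 if a request was pending. */
int take_dump_request(void)
{
    return atomic_exchange(&dump_requested, 0);
}
```

A polling thread would call `take_dump_request()` on each iteration and, when it returns 1, do the heavy work there, where malloc-using calls are safe. One trade-off to keep in mind: work done this way runs in the polling thread, so a stack dump taken there shows that thread's stack rather than the stack of the thread the signal interrupted.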