上一篇博客《freeswitch 1.10.10-dev录音早期媒体卡通道的bug分析》 描述了找出死锁地方,但是还没深层次的描述出现死锁的原因,我又花了半天分析代码,和写这个blog 描述死锁原因。
我们知道fs一个通道一个状态机线程,发起呼叫又是一个单独的线程,发起呼叫的线程一般 switch_ivr_originate 执行返回就可以退出了。
mod_cti模块(顶顶通呼叫中心中间件模块) 在 switch_ivr_originate 执行完成后没有马上退出而是执行了等待通话应答的操作和获取早期媒体的操作,也就是 switch_core_media_read_frame(#7) 函数。
#0 0x00007f9dc597a54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f9dc5975eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2 0x00007f9dc5975daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f9dc8687759 in fspr_thread_mutex_lock (mutex=) at locks/unix/thread_mutex.c:92
#4 0x00007f9dc850e895 in switch_mutex_lock (lock=) at src/switch_apr.c:301
#5 0x00007f9db9c9029c in sofia_receive_message (session=0x7f9cf5b4a7c8, msg=0x7f9d345ac4d0) at mod_sofia.c:1526
#6 0x00007f9dc853f80b in switch_core_session_perform_receive_message (session=session@entry=0x7f9cf5b4a7c8, message=message@entry=0x7f9d345ac4d0, file=file@entry=0x7f9dc86a961d "src/switch_core_io.c", func=func@entry=0x7f9dc86a9970 <__func__.19414> "switch_core_session_read_frame", line=line@entry=416) at src/switch_core_session.c:930
#7 0x00007f9dc854aa7e in switch_core_session_read_frame (session=session@entry=0x7f9cf5b4a7c8, frame=frame@entry=0x7f9d345ae568, flags=flags@entry=0, stream_id=stream_id@entry=0) at src/switch_core_io.c:416
#8 0x00007f9dc8608251 in switch_ivr_sleep (session=0x7f9cf5b4a7c8, ms=100, sync=, args=0x0) at src/switch_ivr.c:294
#9 0x00007f9daeb4dcc5 in ?? () from /ddt/fs/mod/mod_cti.so
#10 0x00007f9dc8543b5b in switch_core_session_exec (session=0x7f9cf5b4a7c8, application_interface=application_interface@entry=0x19077a0, arg=arg@entry=0x7f9d345aeb40 "38981 2000") at src/switch_core_session.c:2964
#11 0x00007f9dc85441ef in switch_core_session_execute_application_get_flags (session=, app=0x7f9db140b89b "cti_wait_for_answer", arg=0x7f9d345aeb40 "38981 2000", flags=) at src/switch_core_session.c:2824
#12 0x00007f9daeb8eb7e in ?? () from /ddt/fs/mod/mod_cti.so
#13 0x00007f9daeb92619 in ?? () from /ddt/fs/mod/mod_cti.so
#14 0x00007f9dc853c9c5 in switch_core_session_thread_pool_worker (thread=0x7f9d45af47a0, obj=) at src/switch_core_session.c:1790
#15 0x00007f9dc868d760 in dummy_worker (opaque=0x7f9d45af47a0) at threadproc/unix/thread.c:151
#16 0x00007f9dc5973ea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f9dc4fc7b0d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f9d2f76b700 (LWP 30595)):
switch_ivr_originate函数 在发起呼叫成功后 会创建一个 状态机线程
if (!switch_core_session_running(oglobals.originate_status[i].peer_session)) {
if (oglobals.originate_status[i].per_channel_delay_start) {
switch_channel_set_flag(oglobals.originate_status[i].peer_channel, CF_BLOCK_STATE);
}
switch_core_session_thread_launch(oglobals.originate_status[i].peer_session);
}
}
状态机线程启动后 里面会调用 switch_ivr_parse_all_messages(#18); 来处理线通道消息。
#0 0x00007f9dc597a54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f9dc5975eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2 0x00007f9dc5975daf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f9dc8687759 in fspr_thread_mutex_lock (mutex=) at locks/unix/thread_mutex.c:92
#4 0x00007f9dc850e895 in switch_mutex_lock (lock=) at src/switch_apr.c:301
#5 0x00007f9dc852c6ec in switch_core_session_lock_codec_read (session=) at src/switch_core_codec.c:74
#6 0x00007f9dc8561e7c in switch_core_media_set_codec (session=session@entry=0x7f9cf5b4a7c8, force=force@entry=0, codec_flags=0) at src/switch_core_media.c:3608
#7 0x00007f9dc856763e in switch_core_media_activate_rtp (session=0x7f9cf5b4a7c8) at src/switch_core_media.c:8565
#8 0x00007f9db9cf73de in sofia_media_activate_rtp (tech_pvt=tech_pvt@entry=0x7f9cf5b53ff8) at sofia_media.c:58
#9 0x00007f9db9cf745e in sofia_media_tech_media (tech_pvt=tech_pvt@entry=0x7f9cf5b53ff8, r_sdp=, type=type@entry=SDP_TYPE_REQUEST) at sofia_media.c:189
#10 0x00007f9db9ccab5a in sofia_handle_sip_i_state (de=0x7f9d983bcea0, tags=0x7f9d882ae950, sip=0x0, sofia_private=, nh=0x7f9da1db69b0, profile=0x1ac35070, nua=0x7f9d98040be0, phrase=0x7f9d882aece6 "Ringing", status=183, session=0x7f9cf5b4a7c8) at sofia.c:7683
#11 our_sofia_event_callback (event=nua_i_state, status=, phrase=0x7f9d882aece6 "Ringing", nua=0x7f9d98040be0, profile=0x1ac35070, nh=0x7f9da1db69b0, sofia_private=0x7f9d1927a640, sip=0x0, de=de@entry=0x7f9d983bcea0, tags=0x7f9d882ae950) at sofia.c:1813
#12 0x00007f9db9ccea5b in sofia_process_dispatch_event (dep=0x7f9d2f76a2c0) at sofia.c:2253
#13 0x00007f9db9c8ffc9 in sofia_receive_message (session=0x7f9cf5b4a7c8, msg=0x7f9d2f76aa20) at mod_sofia.c:1347
#14 0x00007f9dc853f6d5 in switch_core_session_perform_receive_message (session=session@entry=0x7f9cf5b4a7c8, message=message@entry=0x7f9d2f76aa20, file=file@entry=0x7f9dc86c5e35 "src/switch_ivr.c", func=func@entry=0x7f9dc86c6ed0 <__func__.19070> "switch_ivr_parse_signal_data", line=line@entry=893) at src/switch_core_session.c:853
#15 0x00007f9dc8604f9c in switch_ivr_parse_signal_data (session=session@entry=0x7f9cf5b4a7c8, all=all@entry=SWITCH_TRUE, only_session_thread=only_session_thread@entry=SWITCH_FALSE) at src/switch_ivr.c:893
#16 0x00007f9dc8604fec in switch_ivr_parse_all_signal_data (session=session@entry=0x7f9cf5b4a7c8) at src/switch_ivr.c:906
#17 0x00007f9dc8605007 in switch_ivr_parse_all_messages (session=session@entry=0x7f9cf5b4a7c8) at src/switch_ivr.c:852
#18 0x00007f9dc8607b4a in switch_ivr_parse_all_events (session=session@entry=0x7f9cf5b4a7c8) at src/switch_ivr.c:925
#19 0x00007f9dc8548632 in switch_core_session_run (session=0x7f9cf5b4a7c8) at src/switch_core_state_machine.c:710
#20 0x00007f9dc85412ae in switch_core_session_thread (thread=, obj=0x7f9cf5b4a7c8) at src/switch_core_session.c:1726
#21 0x00007f9dc853c9c5 in switch_core_session_thread_pool_worker (thread=0x7f9d38410920, obj=) at src/switch_core_session.c:1790
#22 0x00007f9dc868d760 in dummy_worker (opaque=0x7f9d38410920) at threadproc/unix/thread.c:151
#23 0x00007f9dc5973ea5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f9dc4fc7b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f9dc8db78c0 (LWP 11507)):
本例中switch_ivr_parse_all_messages处理中接收到183消息执行到 switch_core_media_set_codec ,本例子 死锁触发的 就很明确了
1 发起呼叫 switch_ivr_originate
2 switch_ivr_originate 创建 通道状态机线程
3 收到183
1 发起呼叫的线程switch_ivr_originate 返回执行 switch_core_media_read_frame
2 通道状态机线程执行初始化媒体操作 switch_core_media_set_codec
由于 发起呼叫的线程 switch_core_media_read_frame 中先锁定了 session->codec_read_mutex,并且执行到sofia_process_dispatch_event 卡在锁定tech_pvt->sofia_mutex
状态机线程 锁定了tech_pvt->sofia_mutex,执行到 switch_core_media_set_codec 卡在锁定session->codec_read_mutex
死锁的问题到这里已经描述完成了。避开死锁的方法就是发起呼叫的线程不要执行媒体操作。比如不要再 execute_on_post_originate 里面执行媒体函数。所有媒体操作都放到状态机线程去执行。
能不能通过修改fs代码来解决这个死锁呢?
我们看switch_core_media_read_frame代码,
if (!switch_test_flag(session, SSF_WARN_TRANSCODE)) {
switch_core_session_message_t msg = { 0 };
msg.message_id = SWITCH_MESSAGE_INDICATE_TRANSCODING_NECESSARY;
switch_core_session_receive_message(session, &msg);
switch_set_flag(session, SSF_WARN_TRANSCODE);
}
如果媒体没初始化,就通过消息 SWITCH_MESSAGE_INDICATE_TRANSCODING_NECESSARY 去执行初始化媒体的代码,上一篇说了,可以再switch_core_session_receive_message之前先解锁session->codec_read_mutex,就可以避免这次死锁,,可是switch_core_media_read_frame里面有大量的 switch_core_session_receive_message调用,除非switch_core_media_read_frame里面的每次switch_core_session_receive_message调用,都修改成先解锁codec_read_mutex,执行完成再锁定 codec_read_mutex,否则多线程操作一个通道,都有出现死锁的概率。所以正确的做法是 不要多线程去操作同一个通道。让通道状态机去执行媒体操作。