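Using MDB to Analyze a Solaris Kernel Deadlock: A Worked Example
This article uses a real case to show how to debug and analyze a Solaris kernel deadlock with MDB.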
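Deadlock is a problem every multithreaded kernel has to face. Some deadlocks can be avoided by careful design, but as modern operating systems grow in complexity and concurrency and the code base expands quickly, it is hard to avoid introducing potentially dangerous code. The example below is a system hang caused by a reader/writer lock deadlock that we hit during Solaris system testing. All code quoted below comes from www.opensolaris.org.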
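Any synchronization primitive the kernel provides can lead to deadlock, for example mutexes and reader/writer locks (rwlocks). The deadlock in this example involves rwlocks.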
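The problem occurred during system testing on the x64 platform. We started heavy background network traffic (TCP/UDP) and, at the same time, repeatedly ran administrative operations with dladm against a network aggregation. Eventually the system hung and stopped sending and receiving packets. Because the hang was reproducible, we rebooted with kmdb loaded at boot, and when the hang happened again we obtained a core dump.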
Below we analyze this core dump to show how MDB can be used to find a deadlock.
First, load the core dump image:
<SUT> mdb 0
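As mentioned, the deadlock occurred while we were running network administration commands, so the hang is probably related to dladm, the aggregation administration command. Let's see what dladm is doing: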
> ::ps ! grep dladm
R 12476 12268 12268 725 0 0x42004000 fffffe87705e23d0 dladm
> fffffe87705e23d0::ps -t
S PID PPID PGID SID UID FLAGS ADDR NAME
R 12476 12268 12268 725 0 0x42004000 fffffe87705e23d0 dladm
T 0xffffffffa69be740 <TS_SLEEP>
dladm has a single thread, 0xffffffffa69be740, and it is asleep. That looks promising, so let's see why it is sleeping:
> 0xffffffffa69be740::findstack -v
stack pointer for thread ffffffffa69be740: fffffe8000a2e570
[ fffffe8000a2e570 _resume_from_idle+0xf8() ]
fffffe8000a2e5b0 swtch+0x185()
fffffe8000a2e650 turnstile_block+0x80d(0, 0, ffffffff93785580,
fffffffffbc039d8, 0, 0)
fffffe8000a2e6b0 rw_enter_sleep+0x186(ffffffff93785580, 0)
fffffe8000a2e6f0 mac_rx_remove+0x32(ffffffff93785310, fffffe86d2d1e368)
fffffe8000a2e720 aggr_port_delete+0x3d(ffffffff8614ccb0)
fffffe8000a2e780 aggr_grp_rem_port+0x1e4(ffffffff85798720, ffffffff8614ccb0,
fffffe8000a2e7e4)
fffffe8000a2e800 aggr_grp_rem_ports+0x182(1, 1, ffffffff9d1faeb8)
fffffe8000a2e840 aggr_ioc_remove+0xc5(ffffffffa27164c0, 100000)
fffffe8000a2e890 aggr_ioctl+0x9b(fffffe8692c06108, ffffffffa27164c0)
fffffe8000a2e8b0 aggr_wput+0x2e(fffffe8692c06108, ffffffffa27164c0)
fffffe8000a2e920 putnext+0x246(fffffe80caeb3658, ffffffffa27164c0)
fffffe8000a2e9f0 strdoioctl+0x3bb(fffffe8692bfee68, fffffe8000a2ea68, 100003,
1, ffffffff9ec93f40, fffffe8000a2ee9c)
fffffe8000a2ed20 strioctl+0x3a73(ffffffff94512e80, 5308, 8037490, 100003, 1,
ffffffff9ec93f40, fffffe8000a2ee9c)
fffffe8000a2ed70 spec_ioctl+0x83(ffffffff94512e80, 5308, 8037490, 100003,
ffffffff9ec93f40, fffffe8000a2ee9c)
fffffe8000a2edc0 fop_ioctl+0x36(ffffffff94512e80, 5308, 8037490, 100003,
ffffffff9ec93f40, fffffe8000a2ee9c)
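The leftmost column of the stack trace is the frame pointer of each function. If a function's argument list is not decoded, the frame pointer can help you locate the arguments that were passed to it. Here the arguments are shown directly, so we can move on.
The thread is asleep on the rwlock 0xffffffff93785580. What state is that lock in?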
> ffffffff93785580::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffff93785580        READERS=1  B011 ffffffffa69be740 (W)
                                     ||
                 WRITE_WANTED -------+|
                  HAS_WAITERS --------+
>
To explain the MDB output above, let's first look at the raw data of this lock.
> 0xffffffff93785580/J
0xffffffff93785580: b
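The value is 0x0B, binary 1011, which corresponds to the OWNER/COUNT and FLAGS columns above. Bit 0 is "wait"; it is 1 here, shown as "HAS_WAITERS", meaning some thread is waiting for this lock. Bit 1 is "wrwant"; it is also 1, shown as "WRITE_WANTED", meaning at least one thread is waiting to take the lock as a writer. Here that thread is 0xffffffffa69be740, listed under "WAITERS". Bit 2 is "wrlock", and it determines how the remaining high bits are interpreted. Here it is 0, so the high bits hold the number of readers, currently 1 (READERS=1); the lock word alone does not identify which thread that reader is. If bit 2 were 1, the high bits would instead hold the address of the thread that currently owns the write lock.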
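The bit layout just described can be written down as a small decoder. This is a minimal user-space sketch based only on the description above, not the kernel's own code: the macro names simply mirror the labels printed by ::rwlock, and `rw_decoded_t`, `rw_decode()` and `rw_print()` are hypothetical helpers introduced for illustration.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Flag bits as described in the text; names mirror the ::rwlock labels. */
#define RW_HAS_WAITERS   UINT64_C(0x1)  /* bit 0: "wait"   */
#define RW_WRITE_WANTED  UINT64_C(0x2)  /* bit 1: "wrwant" */
#define RW_WRITE_LOCKED  UINT64_C(0x4)  /* bit 2: "wrlock" */
#define RW_FLAG_MASK     UINT64_C(0x7)  /* the three flag bits */
#define RW_FLAG_BITS     3              /* high bits start at bit 3 */

/* Hypothetical decoded view of one rwlock word. */
typedef struct rw_decoded {
	int      has_waiters;
	int      write_wanted;
	int      write_locked;
	uint64_t readers;  /* meaningful only when write_locked == 0 */
	uint64_t owner;    /* meaningful only when write_locked == 1 */
} rw_decoded_t;

/* Split a raw lock word into flags plus reader count or owner address. */
rw_decoded_t
rw_decode(uint64_t word)
{
	rw_decoded_t d = { 0, 0, 0, 0, 0 };

	d.has_waiters  = (word & RW_HAS_WAITERS) != 0;
	d.write_wanted = (word & RW_WRITE_WANTED) != 0;
	d.write_locked = (word & RW_WRITE_LOCKED) != 0;

	if (d.write_locked)
		d.owner = word & ~RW_FLAG_MASK;     /* owning thread address */
	else
		d.readers = word >> RW_FLAG_BITS;   /* number of read holds */
	return (d);
}

/* Print a one-line summary of a raw lock word. */
void
rw_print(uint64_t word)
{
	rw_decoded_t d = rw_decode(word);

	if (d.write_locked)
		printf("OWNER=%" PRIx64, d.owner);
	else
		printf("READERS=%" PRIu64, d.readers);
	printf(" wrlock=%d wrwant=%d wait=%d\n",
	    d.write_locked, d.write_wanted, d.has_waiters);
}
```

Calling `rw_decode(0x0b)` yields one reader with both the wait and wrwant flags set, matching the ::rwlock output above. A write-locked word would be the owner's thread address with the low flag bits OR'ed in, so the owner is recovered by masking those bits off.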
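So far we do not know which thread holds this lock as a reader. We search kernel memory for the address of the lock and check which threads' stacks it appears on, using ::kgrep: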
> ffffffff93785580::kgrep | ::whatis
fffffe8000359b28 is in thread fffffe8000359c80's stack
fffffe8000a2e5d8 is in thread ffffffffa69be740's stack
fffffe8000a2e638 is in thread ffffffffa69be740's stack
fffffe8000a2e670 is in thread ffffffffa69be740's stack
fffffe8000a2e6a8 is in thread ffffffffa69be740's stack
ffffffffa2f3ce48 is ffffffffa2f3ce38+10, bufctl ffffffffa2f5b478 allocated from
turnstile_cache
ffffffffa69be7c0 is ffffffffa69be740+80, allocated as a thread structure
The output shows the lock address on the stacks of two threads, 0xfffffe8000359c80 and 0xffffffffa69be740. We already know that 0xffffffffa69be740 is dladm, but what is 0xfffffe8000359c80? Let's look at its call stack:
> fffffe8000359c80::findstack -v
stack pointer for thread fffffe8000359c80: fffffe80003592b0
[ fffffe80003592b0 resume_from_intr+0xbb() ]
fffffe80003592f0 swtch+0xad()
fffffe8000359390 turnstile_block+0x80d(0, 1, ffffffff85798720,
fffffffffbc039d8, 0, 0)
fffffe80003593f0 rw_enter_sleep+0x1fb(ffffffff85798720, 1)
fffffe8000359430 aggr_m_tx+0x2d(ffffffff85798720, ffffffff98778be0)
fffffe8000359450 dls_tx+0x20(ffffffff9540ca88, ffffffff98778be0)
fffffe8000359480 str_mdata_fastpath_put+0x2c(ffffffff85be6338,
ffffffff98778be0)
fffffe8000359520 tcp_send_data+0x6d7(fffffe80caedbf80, ffffffffa12eb0f8,
ffffffff98778be0)
fffffe8000359600 tcp_send+0x87b(ffffffffa12eb0f8, fffffe80caedbf80, 5b4, 28,
14, 0, fffffe80003596ac, fffffe80003596b0, fffffe80003596b4, fffffe8000359678
, 4fe574, 7fffffff)
fffffe80003596d0 tcp_wput_data+0x6db(fffffe80caedbf80, 0, 0)
fffffe8000359820 tcp_rput_data+0x2dbc(fffffe80caedbd80, ffffffffa0ecd8c0,
ffffffff843def40)
fffffe80003598b0 squeue_drain+0x212(ffffffff843def40, 4, 2fc20433e599)
fffffe8000359930 squeue_enter_chain+0x3bb(ffffffff843def40, ffffffffa66b4f60,
ffffffff98ede720, 10, 1)
fffffe8000359a00 ip_input+0x780(ffffffff85fadae8, fffffe8692bf3088,
ffffffffa66b4f60, e)
fffffe8000359ab0 i_dls_link_ether_rx+0x1ae(ffffffff93784290, fffffe8692bf3088
, ffffffffa66b4f60)
fffffe8000359b00 mac_rx+0x7a(ffffffff85798750, fffffe8692bf3088,
ffffffffa66b4f60)
fffffe8000359b70 aggr_recv_cb+0x1b9(ffffffff8614ccb0, fffffe8692bf3088,
ffffffffa66b4f60)
fffffe8000359bc0 mac_rx+0x7a(ffffffffa2112e00, fffffe8692bf3088,
ffffffffa66b4f60)
fffffe8000359c00 e1000g_intr+0xd2(fffffe813c11f000)
fffffe8000359c60 av_dispatch_autovect+0x83(19)
fffffe8000359c70 intr_thread+0x50()
>
This is an interrupt thread servicing a NIC receive interrupt, and it is asleep on a different rwlock, 0xffffffff85798720. Let's look at that lock:
> ffffffff85798720::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffffff85798720 ffffffffa69be740  B101 fffffe8000359c80 (R)
                                    | | fffffe800016dc80 (R)
                 WRITE_LOCKED ------+ |
                  HAS_WAITERS --------+
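Based on the rwlock layout described earlier, this lock is write-locked and is currently held by thread 0xffffffffa69be740, that is, dladm.
Summarizing the analysis, the deadlock looks like this:
For convenience, call 0xffffffff85798720 lock A and the other lock, 0xffffffff93785580, lock B. Call the NIC interrupt thread thread A and the dladm thread thread B.
Lock B is held by thread A as a reader; its waiter is thread B, which is trying to acquire it as a writer. Thread B blocks.
Lock A is held by thread B as a writer; it has two waiters. The one we care about is thread A; the other is unrelated to this problem. Thread A blocks.
The two threads are each waiting for the other to release the resource it needs. This is the classic deadlock condition, and it produced the hang.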
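The summary above is a wait-for relationship between threads, and a deadlock is a cycle in it. The following is a minimal sketch of that check, assuming each blocked thread waits on exactly one lock owner; the function name `in_wait_cycle()` and the small integer thread indices are hypothetical and only stand in for the thread addresses seen in MDB.

```c
#include <stddef.h>

/*
 * waits_for[i] is the index of the thread that thread i is blocked on,
 * or -1 if thread i is not blocked.
 *
 * Returns 1 if following the wait-for edges from 'start' leads back to
 * 'start' (the thread is part of a deadlock cycle), 0 otherwise.
 */
int
in_wait_cycle(const int *waits_for, size_t n, int start)
{
	int cur = start;
	size_t step;

	if (start < 0 || (size_t)start >= n)
		return (0);

	/* A cycle through 'start' must close within n edges. */
	for (step = 0; step < n; step++) {
		cur = waits_for[cur];
		if (cur < 0 || (size_t)cur >= n)
			return (0);   /* chain ends at a runnable thread */
		if (cur == start)
			return (1);   /* came back to where we began */
	}
	return (0);
}
```

With index 0 for thread A, 1 for thread B, and 2 for the unrelated second waiter on lock A, the table is `{1, 0, 1}`: threads A and B are reported as part of a cycle, while the third thread is merely stuck behind it.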
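Finally, let's use the OpenSolaris source to walk through the sequence of operations behind this deadlock. The example involves Solaris link aggregation; for background on aggregation, see the dladm(1M) man page.
/* The OpenSolaris code keeps evolving; the code below is taken from http://cvs.opensolaris.org/source/xref/on/ as of September 20, 2006 and is for reference only. */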
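1. When a TCP packet arrives at the NIC and raises an interrupt, the following call sequence occurs (see the code; e1000g is used as the example):
e1000g_intr -> mac_rx -> (a series of TCP rx/tx funcs) -> aggr_m_tx
This call sequence produces the following order of lock requests: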
(1) mac_rx -> mi_rx_lock (as RW_READER) mac.c, LINE 1353
(2) aggr_m_tx -> lg_lock (as RW_READER) aggr_send.c, LINE 220
/* See the code below */
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/mac/mac.c,
1344 void
1345 mac_rx(mac_handle_t mh, mac_resource_handle_t mrh, mblk_t *bp)
1346 {
1347 mac_impl_t *mip = (mac_impl_t *)mh;
1348 mac_rx_fn_t *mrfp;
1349
1350 /*
1351 * Call all registered receive functions.
1352 */
1353 rw_enter(&mip->mi_rx_lock, RW_READER);
1354 mrfp = mip->mi_mrfp;
1355 if (mrfp == NULL) {
1356 /* There are no registered receive functions. */
1357 freemsgchain(bp);
1358 rw_exit(&mip->mi_rx_lock);
1359 return;
1360 }
1361 do {
1362 mblk_t *recv_bp;
1363
1364 if (mrfp->mrf_nextp != NULL) {
1365 /* XXX Do we bump a counter if copymsgchain() fails? */
1366 recv_bp = copymsgchain(bp);
1367 } else {
1368 recv_bp = bp;
1369 }
1370 if (recv_bp != NULL)
1371 mrfp->mrf_fn(mrfp->mrf_arg, mrh, recv_bp);
1372 mrfp = mrfp->mrf_nextp;
1373 } while (mrfp != NULL);
1374 rw_exit(&mip->mi_rx_lock);
1375 }
1376
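When a packet arrives at the network interface, the interrupt handler ends up calling mac_rx. At line 1353, mip->mi_rx_lock is acquired as RW_READER, and it is not released until line 1374, after all registered receive functions have returned.
In the aggr code,
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/aggr/aggr_send.c, line 220 tries to acquire grp->lg_lock as RW_READER.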
208 /*
209 * Send function invoked by the MAC service module.
210 */
211 mblk_t *
212 aggr_m_tx(void *arg, mblk_t *mp)
213 {
214 aggr_grp_t *grp = arg;
215 aggr_port_t *port;
216 mblk_t *nextp;
217 const mac_txinfo_t *mtp;
218
219 for (;;) {
220 rw_enter(&grp->lg_lock, RW_READER);
221 if (grp->lg_ntx_ports == 0) {
222 /*
223 * We could have returned from aggr_m_start() before
224 * the ports were actually attached. Drop the chain.
225 */
226 rw_exit(&grp->lg_lock);
227 freemsgchain(mp);
228 return (NULL);
229 }
230 nextp = mp->b_next;
231 mp->b_next = NULL;
232
233 port = grp->lg_tx_ports[aggr_send_port(grp, mp)];
234 ASSERT(port->lp_state == AGGR_PORT_STATE_ATTACHED);
235
236 rw_exit(&grp->lg_lock);
237
238 /*
239 * We store the transmit info pointer locally in case it
240 * changes between loading mt_fn and mt_arg.
241 */
242 mtp = port->lp_txinfo;
243 if ((mp = mtp->mt_fn(mtp->mt_arg, mp)) != NULL) {
244 mp->b_next = nextp;
245 break;
246 }
247
248 if ((mp = nextp) == NULL)
249 break;
250 }
251 return (mp);
252 }
2. Meanwhile, when the administrator uses dladm to remove a network interface from the aggregation (dladm remove-aggr), the following call sequence occurs:
aggr_ioctl -> aggr_ioc_remove -> aggr_grp_rem_ports -> aggr_grp_rem_port -> aggr_port_delete -> mac_rx_remove
This path also acquires both mi_rx_lock and lg_lock, as shown below:
(1) aggr_grp_rem_ports -> "lg_lock" (RW_WRITER), aggr_grp.c LINE 943
(2) mac_rx_remove -> "mi_rx_lock" (RW_WRITER), mac.c LINE 925
/* See the code below */
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/aggr/aggr_grp.c,
920 /*
921 * Remove one or more ports from an existing link aggregation group.
922 */
923 int
924 aggr_grp_rem_ports(uint32_t key, uint_t nports, laioc_port_t *ports)
925 {
926 int rc = 0, i;
927 aggr_grp_t *grp = NULL;
928 aggr_port_t *port;
929 boolean_t mac_addr_update = B_FALSE, mac_addr_changed;
930 boolean_t link_state_update = B_FALSE, link_state_changed;
931
932 /* get group corresponding to key */
933 rw_enter(&aggr_grp_lock, RW_READER);
934 if (mod_hash_find(aggr_grp_hash, GRP_HASH_KEY(key),
935 (mod_hash_val_t *)&grp) != 0) {
936 rw_exit(&aggr_grp_lock);
937 return (ENOENT);
938 }
939 AGGR_GRP_REFHOLD(grp);
940 rw_exit(&aggr_grp_lock);
941
942 AGGR_LACP_LOCK(grp);
943 rw_enter(&grp->lg_lock, RW_WRITER);
944
945 /* we need to keep at least one port per group */
946 if (nports >= grp->lg_nports) {
947 rc = EINVAL;
948 goto bail;
949 }
950
951 /* first verify that all the groups are valid */
952 for (i = 0; i < nports; i++) {
953 if (aggr_grp_port_lookup(grp, ports[i].lp_devname) == NULL) {
954 /* port not found */
955 rc = ENOENT;
956 goto bail;
957 }
958 }
959
960 /* remove the specified ports from group */
961 for (i = 0; i < nports && !grp->lg_closing; i++) {
962 /* lookup port */
963 port = aggr_grp_port_lookup(grp, ports[i].lp_devname);
964 ASSERT(port != NULL);
965
966 /* stop port if group has already been started */
967 if (grp->lg_started) {
968 rw_enter(&port->lp_lock, RW_WRITER);
969 aggr_port_stop(port);
970 rw_exit(&port->lp_lock);
971 }
972
973 /* remove port from group */
974 rc = aggr_grp_rem_port(grp, port, &mac_addr_changed,
975 &link_state_changed);
976 ASSERT(rc == 0);
977 mac_addr_update = mac_addr_update || mac_addr_changed;
978 link_state_update = link_state_update || link_state_changed;
979 }
980
981 bail:
982 rw_exit(&grp->lg_lock);
983 AGGR_LACP_UNLOCK(grp);
984 if (!grp->lg_closing) {
985 if (mac_addr_update)
986 mac_unicst_update(grp->lg_mh, grp->lg_addr);
987 if (link_state_update)
988 mac_link_update(grp->lg_mh, grp->lg_link_state);
989 if (rc == 0)
990 mac_resource_update(grp->lg_mh);
991 }
992 AGGR_GRP_REFRELE(grp);
993
994 return (rc);
995 }
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/io/mac/mac.c,
910 /*
911 * Unregister a receive function for this mac. This removes the function
912 * from the list of receive functions for this mac.
913 */
914 void
915 mac_rx_remove(mac_handle_t mh, mac_rx_handle_t mrh)
916 {
917 mac_impl_t *mip = (mac_impl_t *)mh;
918 mac_rx_fn_t *mrfp = (mac_rx_fn_t *)mrh;
919 mac_rx_fn_t **pp;
920 mac_rx_fn_t *p;
921
922 /*
923 * Search the 'rx' callback list for the function closure.
924 */
925 rw_enter(&(mip->mi_rx_lock), RW_WRITER);
926 for (pp = &(mip->mi_mrfp); (p = *pp) != NULL; pp = &(p->mrf_nextp)) {
927 if (p == mrfp)
928 break;
929 }
930 ASSERT(p != NULL);
931
932 /* Remove it from the list. */
933 *pp = p->mrf_nextp;
934 kmem_free(mrfp, sizeof (mac_rx_fn_t));
935 rw_exit(&(mip->mi_rx_lock));
936 }
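3. With the code in hand, the timing of the deadlock is easier to see:
(1) Thread B: dladm calls aggr_grp_rem_ports and successfully acquires lg_lock as RW_WRITER.
(2) Thread A: a TCP packet arrives on the aggregation; the NIC interrupt handler's call sequence reaches mac_rx, which successfully acquires mi_rx_lock as RW_READER.
(3) Thread A: TCP processing completes and an ACK has to be sent back. Still inside the receive callback invoked by mac_rx, the call chain reaches aggr_m_tx, which tries to acquire lg_lock as RW_READER. lg_lock has been held as RW_WRITER by thread B since step (1), so thread A blocks.
(4) Thread B continues. The call chain from aggr_grp_rem_ports reaches mac_rx_remove, which tries to acquire mi_rx_lock as RW_WRITER. mi_rx_lock is held as RW_READER by thread A, so thread B blocks here as well.
(5) Deadlock.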
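The sequence above can be replayed against a toy model of a reader/writer lock to confirm which steps block. This is an illustrative sketch only: `toy_rwlock_t`, `toy_rw_tryenter()` and `replay_deadlock()` are hypothetical names, the model ignores waiter queues and writer preference, and it is not the kernel's rw_enter() implementation.

```c
#include <stdbool.h>

/* Toy rwlock: either one writer or any number of readers. */
typedef struct toy_rwlock {
	int readers;  /* number of read holds */
	int writer;   /* id of the write owner, 0 if none */
} toy_rwlock_t;

/* Return true if the lock is granted, false if the caller would block. */
bool
toy_rw_tryenter(toy_rwlock_t *l, int tid, bool as_writer)
{
	if (l->writer != 0)
		return (false);         /* write-locked: everyone waits */
	if (as_writer) {
		if (l->readers != 0)
			return (false); /* readers present: writer waits */
		l->writer = tid;
	} else {
		l->readers++;
	}
	return (true);
}

/*
 * Replay steps (1) to (4). Bit (k - 1) of the result is set when
 * step (k) would block.
 */
unsigned
replay_deadlock(void)
{
	enum { THREAD_A = 1, THREAD_B = 2 };
	toy_rwlock_t lg_lock = { 0, 0 };
	toy_rwlock_t mi_rx_lock = { 0, 0 };
	unsigned blocked = 0;

	/* (1) thread B: aggr_grp_rem_ports takes lg_lock as writer */
	if (!toy_rw_tryenter(&lg_lock, THREAD_B, true))
		blocked |= 1u << 0;
	/* (2) thread A: mac_rx takes mi_rx_lock as reader */
	if (!toy_rw_tryenter(&mi_rx_lock, THREAD_A, false))
		blocked |= 1u << 1;
	/* (3) thread A: aggr_m_tx wants lg_lock as reader */
	if (!toy_rw_tryenter(&lg_lock, THREAD_A, false))
		blocked |= 1u << 2;
	/* (4) thread B: mac_rx_remove wants mi_rx_lock as writer */
	if (!toy_rw_tryenter(&mi_rx_lock, THREAD_B, true))
		blocked |= 1u << 3;

	return (blocked);
}
```

In this model steps (1) and (2) succeed and steps (3) and (4) both fail, so `replay_deadlock()` returns 0xC: each thread holds one lock and is refused the other, which is the state captured in the core dump.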