最近又遇到一次hung,dmesg中堆栈如下:
[176223.913270] ------------[ cut here ]------------ [176223.913280] WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:300 dev_watchdog+0x248/0x260-----注意cpu号是10 [176223.913282] NETDEV WATCHDOG: eth4 (ixgbe): transmit queue 5 timed out----------------warn的内容 [176223.913283] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE) [176223.913340] ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver] [176223.913347] CPU: 10 PID: 0 Comm: swapper/10 Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1 [176223.913348] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017 [176223.913350] Call Trace: [176223.913351][ ] dump_stack+0x19/0x1b [176223.913361] [ ] __warn+0xd8/0x100----------------------warn了,内容见第二行 [176223.913364] [ ] warn_slowpath_fmt+0x5f/0x80 [176223.913369] [ ] dev_watchdog+0x248/0x260----------------在处理到dev_watchdog时候出现问题 [176223.913372] [ ] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176223.913375] [ ] call_timer_fn+0x38/0x110----------------处理回调 [176223.913378] [ ] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176223.913382] [ ] run_timer_softirq+0x22d/0x310-----------这个软中断是timer [176223.913386] [ ] __do_softirq+0xfd/0x290 [176223.913390] [ ] call_softirq+0x1c/0x30 [176223.913394] [ ] do_softirq+0x65/0xa0---------------------硬中断处理后,顺便处理软中断 [176223.913397] [ ] irq_exit+0x175/0x180---------------------硬中断结束 [176223.913400] [ ] smp_apic_timer_interrupt+0x48/0x60 [176223.913402] [ ] apic_timer_interrupt+0x162/0x170---------硬中断开始 [176223.913403] [ ] ? cpuidle_enter_state+0x57/0xd0 [176223.913410] [ ] cpuidle_idle_call+0xde/0x230 [176223.913414] [ ] arch_cpu_idle+0xe/0x40 [176223.913420] [ ] cpu_startup_entry+0x14a/0x1c0 [176223.913423] [ ] start_secondary+0x1f2/0x270 [176223.913425] ---[ end trace 30e7271cf4a53655 ]--- [176223.913430] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds [176224.918915] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds [176224.918917] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 5 seconds [176224.918948] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 5 seconds [176224.934872] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 5 seconds [176226.403293] hrtimer_gaq_time = 912735, soft_irq_time = 42216, cpu = 7 [176233.921764] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds [176234.911438] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 10 seconds [176234.911441] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds [176234.911474] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 10 seconds [176234.943364] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 10 seconds [176237.664127] hrtimer_gaq_time = 854268, soft_irq_time = 72171, cpu = 43
如果只盯着这个堆栈看,可以看出,网卡的 dev_watchdog 函数检测到了eth4的queue 5 出现了 trans_timeout。
超时的检测周期,不同的设备是不一样的,intel的ixgbe对应的超时时间是 ixgbe_main.c 中设置的
netdev->watchdog_timeo = 5 * HZ;
检测的原理就是,每当发包成功的时候,设置 txq->trans_start ,如果这个值在timer到期检测的时候,和当前时间相差 一个超时周期(默认是5s,各种驱动可能不同),则认为出现发送
超时。依赖于定时器软中断,检测的对象是处于running状态的网卡已经处于stop状态的发送队列,话说网卡如果没有running,检测也没有意义。
static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, bool more) { const struct net_device_ops *ops = dev->netdev_ops; int rc; /*__netdev_start_xmit 里面就完全是使用driver 的ops去发包了,其实到此为止,一个skb已经从netdevice层送到driver层了,接下来会等待driver的返回*/ rc = __netdev_start_xmit(ops, skb, dev, more); if (rc == NETDEV_TX_OK) txq_trans_update(txq); return rc; }
static inline void txq_trans_update(struct netdev_queue *txq) { if (txq->xmit_lock_owner != -1) txq->trans_start = jiffies;---------------------更新当前发送队列txq的最后发送时间 }
ixgbe驱动会检测是否pending,也就是有的tx中的ring没变动了,那么打印warning。
接下来,又有一个堆栈:
[176242.505583] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kthread_send/9:178838]-----------注意这个时候是cpu9 [176242.506532] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE) [176242.506588] ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver] [176242.506597] CPU: 9 PID: 178838 Comm: kthread_send/9 Tainted: G W OE ------------ 3.10.0-693.21.1.el7.x86_64 #1 [176242.506599] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017 [176242.506601] task: ffff88290f7ddee0 ti: ffff882b680d0000 task.ti: ffff882b680d0000 [176242.506603] RIP: 0010:[] [ ] native_queued_spin_lock_slowpath+0x1ce/0x200 [176242.506610] RSP: 0018:ffff882fbe843dc0 EFLAGS: 00000202 [176242.506611] RAX: 0000000000000001 RBX: ffff882fbe843d50 RCX: 0000000000000001 [176242.506613] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff882fb80a0bc0-------大家都在抢的锁 [176242.506614] RBP: ffff882fbe843dc0 R08: 0000000000000101 R09: ffff8801363f4000 [176242.506616] R10: 0000000000000000 R11: ffff882fbe843da8 R12: ffff882fbe843d38 [176242.506617] R13: ffffffff816c6732 R14: ffff882fbe843dc0 R15: ffff882fb80a0b40 [176242.506619] FS: 0000000000000000(0000) GS:ffff882fbe840000(0000) knlGS:0000000000000000 [176242.506621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [176242.506622] CR2: 00007fd86fd46316 CR3: 0000000001a0a000 CR4: 00000000003607e0 [176242.506624] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [176242.506626] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [176242.506627] Call Trace: [176242.506629] [176242.506630] [176242.506635] [ ] queued_spin_lock_slowpath+0xb/0xf [176242.506640] [ ] _raw_spin_lock+0x20/0x30 [176242.506645] [ ] dev_watchdog+0x72/0x260-----------------------同上分析,不过这次是拿不到锁 [176242.506649] [ ] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176242.506653] [ ] call_timer_fn+0x38/0x110 [176242.506655] [ ] ? dev_deactivate_queue.constprop.33+0x60/0x60 [176242.506658] [ ] run_timer_softirq+0x22d/0x310 [176242.506663] [ ] __do_softirq+0xfd/0x290 [176242.506667] [ ] call_softirq+0x1c/0x30 [176242.506672] [ ] do_softirq+0x65/0xa0 [176242.506675] [ ] irq_exit+0x175/0x180 [176242.506679] [ ] smp_apic_timer_interrupt+0x48/0x60 [176242.506682] [ ] apic_timer_interrupt+0x162/0x170----------------被硬中断打断 [176242.506684]
[176242.506702] [
[176242.506710] [
[176242.506715] [
[176242.506719] [
[176242.506723] [
[176242.506726] [
[176242.506729] [
[176242.506733] [
[176242.506736] [
分析这个堆栈,dev_watchdog 这次拿不到锁了,已经持续了22s,走查代码,它需要哪一把锁呢?
etif_tx_lock(dev);
static inline void netif_tx_lock(struct net_device *dev) { unsigned int i; int cpu; spin_lock(&dev->tx_global_lock); cpu = smp_processor_id(); for (i = 0; i < dev->num_tx_queues; i++) { struct netdev_queue *txq = netdev_get_tx_queue(dev, i); /* We are the only thread of execution doing a * freeze, but we have to grab the _xmit_lock in * order to synchronize with threads which are in * the ->hard_start_xmit() handler and already * checked the frozen bit. */ __netif_tx_lock(txq, cpu); set_bit(__QUEUE_STATE_FROZEN, &txq->state); __netif_tx_unlock(txq); } }
那这把锁被谁拿了呢?被 wit_kthread_xmit_fn 函数拿了,因为我们正在发包啊。
也就是说,我们的内核线程发包的时候,拿了锁,结果被中断,然后硬中断处理dev_watchdog的时候,也要拿这把锁,死锁了。
接下来,又引发其他cpu死锁:(每次进入_raw_spin_lock,rdi中必然存放的就是锁,x86反汇编看的话确实是rdi,其他架构不是如此)
[176246.611936] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [a.out:189281]-------------------------------cpu25 [176246.612901] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE) [176246.612961] ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver] [176246.612969] CPU: 25 PID: 189281 Comm: a.out Tainted: G W OEL ------------ 3.10.0-693.21.1.el7.x86_64 #1 [176246.612970] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017 [176246.612973] task: ffff882ba14a0000 ti: ffff882ba92c8000 task.ti: ffff882ba92c8000 [176246.612974] RIP: 0010:[] [ ] native_queued_spin_lock_slowpath+0x156/0x200 [176246.612982] RSP: 0018:ffff882ba92cb978 EFLAGS: 00000202 [176246.612983] RAX: 0000000000000101 RBX: 0000090190134ba9 RCX: 0000000000c90000 [176246.612985] RDX: 0000000000c90101 RSI: 0000000000000101 RDI: ffff882fb80a0bc0------------大家都在抢的香饽饽存放在此,就是这把锁 [176246.612986] RBP: ffff882ba92cb978 R08: ffff882fbe959700 R09: 0000000000000000 [176246.612988] R10: 0000000000000001 R11: ffffea017c3e5580 R12: ffff882c18a85c00 [176246.612989] R13: 00000000000005dc R14: ffff882ba2091100 R15: ffffffff811e63d1 [176246.612991] FS: 00007fbf25f9b700(0000) GS:ffff882fbe940000(0000) knlGS:0000000000000000 [176246.612993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [176246.612995] CR2: 0000000000c79000 CR3: 000000593a2ec000 CR4: 00000000003607e0 [176246.612996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [176246.612998] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [176246.612999] Call Trace: [176246.613006] [ ] queued_spin_lock_slowpath+0xb/0xf [176246.613011] [ ] _raw_spin_lock+0x20/0x30 [176246.613015] [ ] sch_direct_xmit+0x13f/0x240------------qdisc中sys调用发包 [176246.613017] [ ] __qdisc_run+0x95/0x1c0 [176246.613022] [ ] __dev_queue_xmit+0x2c9/0x550-----------这个是实际物理网卡的 [176246.613025] [ ] dev_queue_xmit+0x10/0x20 [176246.613032] [ ] bond_dev_queue_xmit+0x32/0x80 [bonding] [176246.613036] [ ] bond_start_xmit+0x1be/0x420 [bonding] [176246.613039] [ ] dev_hard_start_xmit+0x90/0x1a0 [176246.613041] [ ] __dev_queue_xmit+0x448/0x550-----------这个是bond的 [176246.613044] [ ] dev_queue_xmit+0x10/0x20 [176246.613048] [ ] ip_finish_output+0x546/0x7a0 [176246.613050] [ ] ip_output+0x73/0xe0 [176246.613053] [ ] ? __ip_local_out_sk+0xf6/0x100 [176246.613055] [ ] ip_local_out_sk+0x37/0x40 [176246.613058] [ ] ip_send_skb+0x16/0x50 [176246.613062] [ ] udp_send_skb+0xac/0x2b0 [176246.613065] [ ] udp_push_pending_frames+0x3e/0x60 [176246.613068] [ ] udp_sendpage+0x119/0x1c0 [176246.613071] [ ] inet_sendpage+0x70/0xe0 [176246.613075] [ ] kernel_sendpage+0x1e/0x30 [176246.613079] [ ] sendfile_slowpath_2+0x1a8/0x1f0 [witdriver] [176246.613084] [ ] sys_mycall+0x776/0x8f0 [witdriver] [176246.613090] [ ] system_call_fastpath+0x1c/0x21 [176246.613091] Code: 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2 85 c0 74 21 83 f8 03 75 10 eb 1a 66 2e 0f 1f 84 00 00 00 00 00 85 c0 74 0c f3 90 <8b> 17 0f b7 c2 83 f8 03 75 f0 be 01 00 00 00 eb 15 66 0f 1f 84
由于我们拿的是dev的锁,在没开启抢占的情况下不调度出去,但也并没有关中断,这样其他cpu在系统调用中要发包,也会去尝试拿这把锁,好吧,你也死锁了,等着吧。
可以想象,后续其他cpu要拿这把锁的话,大家都会挂,一个cpu的softlock导致其他cpu全挂,只要你这个cpu来拿那锁就会如此。
由于我们拿的是spin_lock,并没有关中断,所以它还可以响应中断,看如下堆栈就明白,最早打印网卡异常的9号cpu,堆栈在死的时候是如下这样的:
crash> bt PID: 178838 TASK: ffff88290f7ddee0 CPU: 9 COMMAND: "kthread_send/9" #0 [ffff882fbe843a70] machine_kexec at ffffffff8105d77b #1 [ffff882fbe843ad0] __crash_kexec at ffffffff8110aca2 #2 [ffff882fbe843ba0] panic at ffffffff816ad52f #3 [ffff882fbe843c20] watchdog_timer_fn at ffffffff81135a51----------------------------hardlock检测中hrtimer的回调 #4 [ffff882fbe843c58] __hrtimer_run_queues at ffffffff810b93a6 #5 [ffff882fbe843cb0] hrtimer_interrupt at ffffffff810b993f #6 [ffff882fbe843cf8] local_apic_timer_interrupt at ffffffff8105467b #7 [ffff882fbe843d10] smp_apic_timer_interrupt at ffffffff816c9e83 #8 [ffff882fbe843d28] apic_timer_interrupt at ffffffff816c6732--------------------------死之前又增加的堆栈,这个是硬中断 #9 [ffff882fbe843dc8] queued_spin_lock_slowpath at ffffffff816adeee---------------------一开始打印hung的堆栈,在自旋 #10 [ffff882fbe843dd8] _raw_spin_lock at ffffffff816bb080 #11 [ffff882fbe843de8] dev_watchdog at ffffffff815bca52 #12 [ffff882fbe843e28] call_timer_fn at ffffffff8109a9c8 #13 [ffff882fbe843e60] run_timer_softirq at ffffffff8109ceed #14 [ffff882fbe843ed8] __do_softirq at ffffffff8109404d #15 [ffff882fbe843f48] call_softirq at ffffffff816c8afc #16 [ffff882fbe843f60] do_softirq at ffffffff8102d435 #17 [ffff882fbe843f80] irq_exit at ffffffff81094495 #18 [ffff882fbe843f98] smp_apic_timer_interrupt at ffffffff816c9e88 #19 [ffff882fbe843fb0] apic_timer_interrupt at ffffffff816c6732 ------ #20 [ffff882b680d3c28] apic_timer_interrupt at ffffffff816c6732 [exception RIP: ixgbe_xmit_frame_ring+83] RIP: ffffffffc01299e3 RSP: ffff882b680d3cd0 RFLAGS: 00000212 RAX: 0000000000000562 RBX: 0000000000000001 RCX: 000000000000403d RDX: ffff882fb9331c00 RSI: ffff8828d7b8fac0 RDI: 0000000000000001 RBP: ffff882b680d3d48 R8: 0000000000000008 R9: 0000a0a5447b9d78 R10: ffff8828c6e84f00 R11: 000000002b3000b8 R12: ffff8828c0291b00 R13: 0000000022300000 R14: 0000000000000001 R15: ffff882b680d3cc0 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018 #21 [ffff882b680d3d50] ixgbe_xmit_frame at ffffffffc012a918 [ixgbe] #22 [ffff882b680d3d80] wit_send_tasklet at ffffffffc043b63c [witdriver] #23 [ffff882b680d3e78] wit_kthread_xmit_fn at ffffffffc043ba95 [witdriver] #24 [ffff882b680d3ec8] kthread at ffffffff810b5241 #25 [ffff882b680d3f50] ret_from_fork at ffffffff816c5577
总结:
1、解决办法很简单,就是在 wit_kthread_xmit_fn 函数中,加了一个关闭软中断就行,因为就是软中断和sys拿锁冲突。
2、分析软锁,最好从dmesg中开始分析,那个时候时间还早,会按时间顺序来打印堆栈,比起直接crash来分析要快一些,特别是在对自己模块代码比较熟悉的情况下。