Another hang on Linux 3.10

I recently ran into another hang. The stack in dmesg looked like this:

[176223.913270] ------------[ cut here ]------------
[176223.913280] WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:300 dev_watchdog+0x248/0x260-----note that this is CPU 10
[176223.913282] NETDEV WATCHDOG: eth4 (ixgbe): transmit queue 5 timed out----------------the warning message
[176223.913283] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE)
[176223.913340]  ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver]
[176223.913347] CPU: 10 PID: 0 Comm: swapper/10 Tainted: G           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
[176223.913348] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017
[176223.913350] Call Trace:
[176223.913351]    [] dump_stack+0x19/0x1b
[176223.913361]  [] __warn+0xd8/0x100----------------------the WARN fires; its message is the second line above
[176223.913364]  [] warn_slowpath_fmt+0x5f/0x80
[176223.913369]  [] dev_watchdog+0x248/0x260----------------the problem is detected in dev_watchdog
[176223.913372]  [] ? dev_deactivate_queue.constprop.33+0x60/0x60
[176223.913375]  [] call_timer_fn+0x38/0x110----------------the timer callback is being run
[176223.913378]  [] ? dev_deactivate_queue.constprop.33+0x60/0x60
[176223.913382]  [] run_timer_softirq+0x22d/0x310-----------this softirq is the timer softirq
[176223.913386]  [] __do_softirq+0xfd/0x290
[176223.913390]  [] call_softirq+0x1c/0x30
[176223.913394]  [] do_softirq+0x65/0xa0---------------------softirqs are handled on the way out of the hard IRQ
[176223.913397]  [] irq_exit+0x175/0x180---------------------end of the hard IRQ
[176223.913400]  [] smp_apic_timer_interrupt+0x48/0x60
[176223.913402]  [] apic_timer_interrupt+0x162/0x170---------start of the hard IRQ
[176223.913403]    [] ? cpuidle_enter_state+0x57/0xd0
[176223.913410]  [] cpuidle_idle_call+0xde/0x230
[176223.913414]  [] arch_cpu_idle+0xe/0x40
[176223.913420]  [] cpu_startup_entry+0x14a/0x1c0
[176223.913423]  [] start_secondary+0x1f2/0x270
[176223.913425] ---[ end trace 30e7271cf4a53655 ]---
[176223.913430] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 5 seconds
[176224.918915] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds
[176224.918917] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 5 seconds
[176224.918948] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 5 seconds
[176224.934872] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 5 seconds
[176226.403293] hrtimer_gaq_time = 912735, soft_irq_time = 42216, cpu = 7
[176233.921764] ixgbe 0000:86:00.0 eth4: Fake Tx hang detected with timeout of 10 seconds
[176234.911438] ixgbe 0000:af:00.1 eth7: Fake Tx hang detected with timeout of 10 seconds
[176234.911441] ixgbe 0000:86:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds
[176234.911474] ixgbe 0000:af:00.0 eth6: Fake Tx hang detected with timeout of 10 seconds
[176234.943364] ixgbe 0000:18:00.1 eth1: Fake Tx hang detected with timeout of 10 seconds
[176237.664127] hrtimer_gaq_time = 854268, soft_irq_time = 72171, cpu = 43

Looking at this stack alone, we can see that the NIC's dev_watchdog function detected a transmit timeout (trans_timeout) on queue 5 of eth4.

The timeout period differs from device to device. For Intel's ixgbe it is set in ixgbe_main.c:

netdev->watchdog_timeo = 5 * HZ;

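The check works as follows: every successful transmit sets txq->trans_start. If, when the watchdog timer fires, that value is more than one timeout period behind the current time (5 s by default; drivers may differ), the queue is considered to have timed out. The check relies on the timer softirq, and it looks at TX queues that are in the stopped state on an interface that is running. If the interface is not running, there is nothing worth checking.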

static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev,
                        struct netdev_queue *txq, bool more)
{
    const struct net_device_ops *ops = dev->netdev_ops;
    int rc;
    /* __netdev_start_xmit sends the packet purely through the driver's ops. At this point the skb has passed from the netdevice layer to the driver layer, and we wait for the driver's return value. */
    rc = __netdev_start_xmit(ops, skb, dev, more);
    if (rc == NETDEV_TX_OK)
        txq_trans_update(txq);

    return rc;
}
static inline void txq_trans_update(struct netdev_queue *txq)
{
    if (txq->xmit_lock_owner != -1)
        txq->trans_start = jiffies;  /* record the time of the last transmit on this txq */
}
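
The comparison the watchdog makes against trans_start can be sketched in user space. This is only a simplified model, not the kernel source: the struct txq_model type, the tick_after() helper (written in the same wrap-safe style as the kernel's time_after()) and the HZ value of 1000 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define HZ 1000UL  /* assumed tick rate, for this sketch only */

/* Wrap-safe "a is later than b", in the style of the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Hypothetical, simplified view of one TX queue. */
struct txq_model {
    bool stopped;               /* the driver has stopped the queue */
    unsigned long trans_start;  /* tick of the last successful transmit */
};

/* True if the queue would be reported as timed out at tick `now`. */
static bool txq_timed_out(const struct txq_model *q, unsigned long now,
                          unsigned long watchdog_timeo)
{
    assert(watchdog_timeo > 0);
    return q->stopped && tick_after(now, q->trans_start + watchdog_timeo);
}
```

With a 5 * HZ timeout, a stopped queue whose last transmit was at tick 100 is reported only once `now` is past tick 100 + 5 * HZ. A queue that is not stopped is never reported, however old its trans_start is.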

The ixgbe driver also checks whether work is still pending, that is, whether some TX ring has stopped making progress, and prints a warning if so.

Next, a second stack shows up:

[176242.505583] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kthread_send/9:178838]-----------note that this time it is CPU 9
[176242.506532] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE)
[176242.506588]  ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver]
[176242.506597] CPU: 9 PID: 178838 Comm: kthread_send/9 Tainted: G        W  OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
[176242.506599] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017
[176242.506601] task: ffff88290f7ddee0 ti: ffff882b680d0000 task.ti: ffff882b680d0000
[176242.506603] RIP: 0010:[]  [] native_queued_spin_lock_slowpath+0x1ce/0x200
[176242.506610] RSP: 0018:ffff882fbe843dc0  EFLAGS: 00000202
[176242.506611] RAX: 0000000000000001 RBX: ffff882fbe843d50 RCX: 0000000000000001
[176242.506613] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff882fb80a0bc0-------the lock everyone is fighting over
[176242.506614] RBP: ffff882fbe843dc0 R08: 0000000000000101 R09: ffff8801363f4000
[176242.506616] R10: 0000000000000000 R11: ffff882fbe843da8 R12: ffff882fbe843d38
[176242.506617] R13: ffffffff816c6732 R14: ffff882fbe843dc0 R15: ffff882fb80a0b40
[176242.506619] FS:  0000000000000000(0000) GS:ffff882fbe840000(0000) knlGS:0000000000000000
[176242.506621] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[176242.506622] CR2: 00007fd86fd46316 CR3: 0000000001a0a000 CR4: 00000000003607e0
[176242.506624] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[176242.506626] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[176242.506627] Call Trace:
[176242.506629]  
[176242.506630]
[176242.506635]  [] queued_spin_lock_slowpath+0xb/0xf
[176242.506640]  [] _raw_spin_lock+0x20/0x30
[176242.506645]  [] dev_watchdog+0x72/0x260-----------------------same path as above, but this time it cannot get the lock
[176242.506649]  [] ? dev_deactivate_queue.constprop.33+0x60/0x60
[176242.506653]  [] call_timer_fn+0x38/0x110
[176242.506655]  [] ? dev_deactivate_queue.constprop.33+0x60/0x60
[176242.506658]  [] run_timer_softirq+0x22d/0x310
[176242.506663]  [] __do_softirq+0xfd/0x290
[176242.506667]  [] call_softirq+0x1c/0x30
[176242.506672]  [] do_softirq+0x65/0xa0
[176242.506675]  [] irq_exit+0x175/0x180
[176242.506679]  [] smp_apic_timer_interrupt+0x48/0x60
[176242.506682]  [] apic_timer_interrupt+0x162/0x170----------------interrupted by a hard IRQ
[176242.506684]  

[176242.506702] [] ? ixgbe_xmit_frame_ring+0x53/0xf30 [ixgbe]
[176242.506710] [] ixgbe_xmit_frame+0x58/0xd0 [ixgbe]
[176242.506715] [] wit_send_tasklet+0x73c/0xae0 [witdriver]
[176242.506719] [] wit_kthread_xmit_fn+0xb5/0x150 [witdriver]----------the transmit function called by our kernel thread kthread_send
[176242.506723] [] ? wit_send_tasklet+0xae0/0xae0 [witdriver]
[176242.506726] [] kthread+0xd1/0xe0
[176242.506729] [] ? insert_kthread_work+0x40/0x40
[176242.506733] [] ret_from_fork+0x77/0xb0
[176242.506736] [] ? insert_kthread_work+0x40/0x40

In this stack, dev_watchdog cannot get the lock and has already been spinning for 22 s. Walking through the code, which lock does it need?

netif_tx_lock(dev);
static inline void netif_tx_lock(struct net_device *dev)
{
    unsigned int i;
    int cpu;

    spin_lock(&dev->tx_global_lock);
    cpu = smp_processor_id();
    for (i = 0; i < dev->num_tx_queues; i++) {
        struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

        /* We are the only thread of execution doing a
         * freeze, but we have to grab the _xmit_lock in
         * order to synchronize with threads which are in
         * the ->hard_start_xmit() handler and already
         * checked the frozen bit.
         */
        __netif_tx_lock(txq, cpu);
        set_bit(__QUEUE_STATE_FROZEN, &txq->state);
        __netif_tx_unlock(txq);
    }
}

 

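So who holds this lock? wit_kthread_xmit_fn does, because we are in the middle of transmitting.

In other words, our kernel thread took the lock while transmitting and was then interrupted. The softirq that runs on the way out of that hard interrupt executes dev_watchdog on the same CPU, which needs the same lock. The holder cannot run again until the softirq returns, so the CPU deadlocks against itself.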
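
The same-CPU re-acquisition can be modelled in user space. This is a rough sketch under stated assumptions: struct model_lock is a minimal non-recursive test-and-set lock, not the kernel's queued spinlock, and all function names here are hypothetical. Where a real spin_lock() would loop forever, the model makes a single attempt and reports whether it would have succeeded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal non-recursive spinlock; NOT the kernel's queued spinlock. */
struct model_lock {
    atomic_flag held;
};

/* One attempt to take the lock: true on success, false if already held. */
static bool model_trylock(struct model_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void model_unlock(struct model_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Stand-in for the watchdog running in softirq context on the same CPU.
 * A real spin_lock() would spin here; the model tries once and reports. */
static bool watchdog_softirq_can_lock(struct model_lock *l)
{
    if (!model_trylock(l))
        return false;
    model_unlock(l);
    return true;
}

/* Stand-in for the sender: it takes the lock, and the "timer interrupt"
 * arrives before it unlocks. Returns what the softirq observed. */
static bool xmit_interrupted_while_locked(struct model_lock *l)
{
    bool got = model_trylock(l);
    assert(got);  /* the lock is expected to be free on entry */
    bool softirq_ok = watchdog_softirq_can_lock(l);  /* interrupt lands here */
    model_unlock(l);
    return softirq_ok;
}
```

Starting from `struct model_lock l = { ATOMIC_FLAG_INIT };`, xmit_interrupted_while_locked(&l) returns false: the softirq's attempt fails while the interrupted sender holds the lock. In the kernel the softirq would spin instead of returning, and the sender would never get back on the CPU to unlock.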

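This then causes other CPUs to lock up as well. (On entry to _raw_spin_lock, rdi always holds the lock address: on x86-64 the first argument is passed in rdi, as the disassembly confirms. Other architectures use different registers.)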

[176246.611936] NMI watchdog: BUG: soft lockup - CPU#25 stuck for 22s! [a.out:189281]-------------------------------now CPU 25
[176246.612901] Modules linked in: witdriver(OE) mysendmsg(OE) fuse tipc(OE) ossmod(OE) iptable_filter mptctl mptbase bonding dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support skx_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev ses enclosure sg mei_me mei shpchp lpc_ich i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler nfit libnvdimm acpi_power_meter acpi_cpufreq acpi_pad tcp_bbr sch_fq binfmt_misc ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul crct10dif_common crc32c_intel ahci libahci libata ixgbe(OE) i40e(OE) mpt3sas(OE)
[176246.612961]  ptp raid_class pps_core scsi_transport_sas i2c_core dca [last unloaded: witdriver]
[176246.612969] CPU: 25 PID: 189281 Comm: a.out Tainted: G        W  OEL ------------   3.10.0-693.21.1.el7.x86_64 #1
[176246.612970] Hardware name: ZTE ZXCDN/SC621DI-16F, BIOS 1.0b 09/21/2017
[176246.612973] task: ffff882ba14a0000 ti: ffff882ba92c8000 task.ti: ffff882ba92c8000
[176246.612974] RIP: 0010:[]  [] native_queued_spin_lock_slowpath+0x156/0x200
[176246.612982] RSP: 0018:ffff882ba92cb978  EFLAGS: 00000202
[176246.612983] RAX: 0000000000000101 RBX: 0000090190134ba9 RCX: 0000000000c90000
[176246.612985] RDX: 0000000000c90101 RSI: 0000000000000101 RDI: ffff882fb80a0bc0------------the same hotly contested lock, at the same address
[176246.612986] RBP: ffff882ba92cb978 R08: ffff882fbe959700 R09: 0000000000000000
[176246.612988] R10: 0000000000000001 R11: ffffea017c3e5580 R12: ffff882c18a85c00
[176246.612989] R13: 00000000000005dc R14: ffff882ba2091100 R15: ffffffff811e63d1
[176246.612991] FS:  00007fbf25f9b700(0000) GS:ffff882fbe940000(0000) knlGS:0000000000000000
[176246.612993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[176246.612995] CR2: 0000000000c79000 CR3: 000000593a2ec000 CR4: 00000000003607e0
[176246.612996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[176246.612998] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[176246.612999] Call Trace:
[176246.613006]  [] queued_spin_lock_slowpath+0xb/0xf
[176246.613011]  [] _raw_spin_lock+0x20/0x30
[176246.613015]  [] sch_direct_xmit+0x13f/0x240------------transmit from the syscall path, through the qdisc
[176246.613017]  [] __qdisc_run+0x95/0x1c0
[176246.613022]  [] __dev_queue_xmit+0x2c9/0x550-----------this one is for the physical NIC
[176246.613025]  [] dev_queue_xmit+0x10/0x20
[176246.613032]  [] bond_dev_queue_xmit+0x32/0x80 [bonding]
[176246.613036]  [] bond_start_xmit+0x1be/0x420 [bonding]
[176246.613039]  [] dev_hard_start_xmit+0x90/0x1a0
[176246.613041]  [] __dev_queue_xmit+0x448/0x550-----------this one is for the bond device
[176246.613044]  [] dev_queue_xmit+0x10/0x20
[176246.613048]  [] ip_finish_output+0x546/0x7a0
[176246.613050]  [] ip_output+0x73/0xe0
[176246.613053]  [] ? __ip_local_out_sk+0xf6/0x100
[176246.613055]  [] ip_local_out_sk+0x37/0x40
[176246.613058]  [] ip_send_skb+0x16/0x50
[176246.613062]  [] udp_send_skb+0xac/0x2b0
[176246.613065]  [] udp_push_pending_frames+0x3e/0x60
[176246.613068]  [] udp_sendpage+0x119/0x1c0
[176246.613071]  [] inet_sendpage+0x70/0xe0
[176246.613075]  [] kernel_sendpage+0x1e/0x30
[176246.613079]  [] sendfile_slowpath_2+0x1a8/0x1f0 [witdriver]
[176246.613084]  [] sys_mycall+0x776/0x8f0 [witdriver]
[176246.613090]  [] system_call_fastpath+0x1c/0x21
[176246.613091] Code: 8b 08 4d 85 c9 74 04 41 0f 18 09 8b 17 0f b7 c2 85 c0 74 21 83 f8 03 75 10 eb 1a 66 2e 0f 1f 84 00 00 00 00 00 85 c0 74 0c f3 90 <8b> 17 0f b7 c2 83 f8 03 75 f0 be 01 00 00 00 eb 15 66 0f 1f 84

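We hold the device's TX lock. With preemption off, the holder is not scheduled out, but interrupts are not disabled either. Any other CPU that transmits from a system call will also try to take this lock, and it gets stuck waiting as well.

As you would expect, every CPU that later tries to take this lock hangs too. A soft lockup on one CPU takes down every other CPU that comes for the same lock.

Because we took a plain spin_lock and did not disable interrupts, the CPU can still respond to interrupts. The stack below makes this clear. It is the stack of CPU 9, the first CPU to report a soft lockup, at the moment the system died: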

crash> bt
PID: 178838  TASK: ffff88290f7ddee0  CPU: 9   COMMAND: "kthread_send/9"
 #0 [ffff882fbe843a70] machine_kexec at ffffffff8105d77b
 #1 [ffff882fbe843ad0] __crash_kexec at ffffffff8110aca2
 #2 [ffff882fbe843ba0] panic at ffffffff816ad52f
 #3 [ffff882fbe843c20] watchdog_timer_fn at ffffffff81135a51----------------------------hrtimer callback of the lockup watchdog; the panic is raised from here
 #4 [ffff882fbe843c58] __hrtimer_run_queues at ffffffff810b93a6
 #5 [ffff882fbe843cb0] hrtimer_interrupt at ffffffff810b993f
 #6 [ffff882fbe843cf8] local_apic_timer_interrupt at ffffffff8105467b
 #7 [ffff882fbe843d10] smp_apic_timer_interrupt at ffffffff816c9e83
 #8 [ffff882fbe843d28] apic_timer_interrupt at ffffffff816c6732--------------------------frames added just before the crash; this is a hard IRQ
 #9 [ffff882fbe843dc8] queued_spin_lock_slowpath at ffffffff816adeee---------------------the stack from the original soft lockup report, still spinning
#10 [ffff882fbe843dd8] _raw_spin_lock at ffffffff816bb080
#11 [ffff882fbe843de8] dev_watchdog at ffffffff815bca52
#12 [ffff882fbe843e28] call_timer_fn at ffffffff8109a9c8
#13 [ffff882fbe843e60] run_timer_softirq at ffffffff8109ceed
#14 [ffff882fbe843ed8] __do_softirq at ffffffff8109404d
#15 [ffff882fbe843f48] call_softirq at ffffffff816c8afc
#16 [ffff882fbe843f60] do_softirq at ffffffff8102d435
#17 [ffff882fbe843f80] irq_exit at ffffffff81094495
#18 [ffff882fbe843f98] smp_apic_timer_interrupt at ffffffff816c9e88
#19 [ffff882fbe843fb0] apic_timer_interrupt at ffffffff816c6732
--- <IRQ stack> ---
#20 [ffff882b680d3c28] apic_timer_interrupt at ffffffff816c6732
    [exception RIP: ixgbe_xmit_frame_ring+83]
    RIP: ffffffffc01299e3  RSP: ffff882b680d3cd0  RFLAGS: 00000212
    RAX: 0000000000000562  RBX: 0000000000000001  RCX: 000000000000403d
    RDX: ffff882fb9331c00  RSI: ffff8828d7b8fac0  RDI: 0000000000000001
    RBP: ffff882b680d3d48   R8: 0000000000000008   R9: 0000a0a5447b9d78
    R10: ffff8828c6e84f00  R11: 000000002b3000b8  R12: ffff8828c0291b00
    R13: 0000000022300000  R14: 0000000000000001  R15: ffff882b680d3cc0
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#21 [ffff882b680d3d50] ixgbe_xmit_frame at ffffffffc012a918 [ixgbe]
#22 [ffff882b680d3d80] wit_send_tasklet at ffffffffc043b63c [witdriver]
#23 [ffff882b680d3e78] wit_kthread_xmit_fn at ffffffffc043ba95 [witdriver]
#24 [ffff882b680d3ec8] kthread at ffffffff810b5241
#25 [ffff882b680d3f50] ret_from_fork at ffffffff816c5577

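Summary:

1. The fix is simple: disable softirqs in wit_kthread_xmit_fn while it holds the lock, because the conflict is between the softirq and the process-context sender taking the same lock.

2. When analysing a soft lockup, start from dmesg. It captures the early moments and prints the stacks in chronological order, which is faster than going straight to crash, especially when you know your own module's code well.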
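
Why disabling softirqs removes the deadlock can be shown with a small single-CPU model. This is a sketch under assumptions, not the witdriver code: struct cpu_model and every function below are hypothetical stand-ins. In the kernel the corresponding calls would be local_bh_disable() / local_bh_enable(), or a _bh lock variant such as spin_lock_bh().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model: one lock flag, a softirq-disable depth,
 * and a pending-softirq flag. */
struct cpu_model {
    bool lock_held;
    int  bh_disable_depth;
    bool softirq_pending;
    int  watchdog_runs;   /* times the watchdog body completed */
    int  watchdog_stuck;  /* times it found the lock already held */
};

static void run_watchdog(struct cpu_model *c)
{
    if (c->lock_held) {   /* real code would spin forever here */
        c->watchdog_stuck++;
        return;
    }
    c->watchdog_runs++;
}

/* Timer interrupt: raise the softirq, and run it on exit only if allowed. */
static void timer_interrupt(struct cpu_model *c)
{
    c->softirq_pending = true;
    if (c->bh_disable_depth == 0) {
        c->softirq_pending = false;
        run_watchdog(c);
    }
}

static void model_bh_disable(struct cpu_model *c)
{
    c->bh_disable_depth++;
}

/* Re-enabling softirqs runs any softirq that was deferred meanwhile. */
static void model_bh_enable(struct cpu_model *c)
{
    assert(c->bh_disable_depth > 0);
    if (--c->bh_disable_depth == 0 && c->softirq_pending) {
        c->softirq_pending = false;
        run_watchdog(c);
    }
}

/* Sender with or without the fix; the interrupt lands while it holds the lock. */
static void xmit(struct cpu_model *c, bool disable_bh)
{
    if (disable_bh)
        model_bh_disable(c);
    c->lock_held = true;
    timer_interrupt(c);
    c->lock_held = false;
    if (disable_bh)
        model_bh_enable(c);
}
```

Calling xmit(&c, false) on a zeroed model leaves watchdog_stuck at 1, which is the hang described above. Calling xmit(&c, true) defers the watchdog until after the unlock, so it completes once and never sees the lock held.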

 

Reposted from: https://www.cnblogs.com/10087622blog/p/9770945.html
