Kernel Panic

Kernel Panic Analysis

I recently ran into a system reboot caused by an Ethernet driver.

Here is the log:

[59744.009642] -(0)[2349:ave_monitor]ave 65000000.ethernet eth0: AVE: doing ave_rxfifo_reset ...... 
[59744.009648] -(0)[2349:ave_monitor]ave 65000000.ethernet eth0: AVE: RxFIFO Overflow !
[59744.009652] -(0)[2349:ave_monitor]ave 65000000.ethernet eth0: AVE: doing ave_rxfifo_reset ...... 
[59744.009659] -(0)[2349:ave_monitor]watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ave_monitor:2349]
--------------------------***-------------------
[59744.009774]  pow(O) map_reg(O)
[59744.009785] -(0)[2349:ave_monitor]CPU: 0 PID: 2349 Comm: ave_monitor Tainted: P        W  O      4.19.176 #1
[59744.009790] -(0)[2349:ave_monitor]Hardware name: XXX
[59744.009795] -(0)[2349:ave_monitor]pstate: 20400005 (nzCv daif +PAN -UAO)
[59744.009808] -(0)[2349:ave_monitor]pc : __do_softirq+0xa8/0x39c
[59744.009813] -(0)[2349:ave_monitor]lr : __do_softirq+0xa4/0x39c
[59744.009816] -(0)[2349:ave_monitor]sp : ffffff8008003ed0
[59744.009819] -(0)[2349:ave_monitor]x29: ffffff8008003ef0 x28: ffffffc0b9901d80 
[59744.009826] -(0)[2349:ave_monitor]x27: ffffff800942a058 x26: ffffff80097f60c0 
[59744.009831] -(0)[2349:ave_monitor]x25: 000000000000000a x24: ffffffc0b9901d80 
[59744.009836] -(0)[2349:ave_monitor]x23: ffffff80094353d0 x22: 0000000000000001 
[59744.009842] -(0)[2349:ave_monitor]x21: ffffff8009435400 x20: 0000000000000202 
[59744.009847] -(0)[2349:ave_monitor]x19: 0000000100e29685 x18: 0000000000f8c913 
[59744.009852] -(0)[2349:ave_monitor]x17: 0000000000000000 x16: 0000000000000000 
[59744.009857] -(0)[2349:ave_monitor]x15: 000000000000134d x14: ffffff800bd7bb48 
[59744.009863] -(0)[2349:ave_monitor]x13: 00000000018a8f22 x12: 0000000000000020 
[59744.009868] -(0)[2349:ave_monitor]x11: 3c0c51ada1937c00 x10: 0000000000000001 
[59744.009873] -(0)[2349:ave_monitor]x9 : 0000000000000000 x8 : 0000000000000101 
[59744.009878] -(0)[2349:ave_monitor]x7 : 0000000000000000 x6 : ffffff80099fc620 
[59744.009883] -(0)[2349:ave_monitor]x5 : ffffff8008003cc0 x4 : 0000000000000001 
[59744.009888] -(0)[2349:ave_monitor]x3 : ffffff8008003e68 x2 : ffffff80081154cc 
[59744.009892] -(0)[2349:ave_monitor]x1 : ffffff8008172bf4 x0 : ffffff80081018a4 
[59744.009899] -(0)[2349:ave_monitor]Call trace:
[59744.009904] -(0)[2349:ave_monitor] __do_softirq+0xa8/0x39c
[59744.009912] -(0)[2349:ave_monitor] irq_exit+0xd8/0xdc
[59744.009920] -(0)[2349:ave_monitor] __handle_domain_irq+0x8c/0xc4
[59744.009924] -(0)[2349:ave_monitor] gic_handle_irq+0x10c/0x184
[59744.009928] -(0)[2349:ave_monitor] el1_irq+0xec/0x198
[59744.009936] -(0)[2349:ave_monitor] console_unlock+0x330/0x480
[59744.009941] -(0)[2349:ave_monitor] vprintk_emit+0x160/0x2ac
[59744.009950] -(0)[2349:ave_monitor] dev_vprintk_emit+0x1a4/0x1f8
[59744.009955] -(0)[2349:ave_monitor] dev_printk_emit+0x74/0xa0
[59744.009963] -(0)[2349:ave_monitor] __netdev_printk+0xd4/0x1c0
[59744.009967] -(0)[2349:ave_monitor] netdev_warn+0x68/0x90
[59744.009979] -(0)[2349:ave_monitor] 0xffffff800134c704
[59744.009986] -(0)[2349:ave_monitor] kthread+0x138/0x154
[59744.009991] -(0)[2349:ave_monitor] ret_from_fork+0x10/0x18
[59744.009996] -(0)[2349:ave_monitor]Kernel panic - not syncing: softlockup: hung tasks
[59744.010002] -(0)[2349:ave_monitor]CPU: 0 PID: 2349 Comm: ave_monitor Tainted: P        W  O L    4.19.176 #1
[59744.010006] -(0)[2349:ave_monitor]Hardware name: UniPhier LD20 Global Board v4 (REF_LD20_GP_V4) (DT)
[59744.010009] -(0)[2349:ave_monitor]Call trace:
[59744.010015] -(0)[2349:ave_monitor] dump_backtrace+0x0/0x194
[59744.010020] -(0)[2349:ave_monitor] show_stack+0x20/0x2c
[59744.010028] -(0)[2349:ave_monitor] dump_stack+0xd8/0x128
[59744.010033] -(0)[2349:ave_monitor] panic+0x134/0x2b4
[59744.010039] -(0)[2349:ave_monitor] softlockup_fn+0x0/0x60
[59744.010045] -(0)[2349:ave_monitor] __run_hrtimer+0xa8/0x2b0
[59744.010049] -(0)[2349:ave_monitor] hrtimer_interrupt+0x174/0x3c8
[59744.010058] -(0)[2349:ave_monitor] arch_timer_handler_virt+0x40/0x50
[59744.010065] -(0)[2349:ave_monitor] handle_percpu_devid_irq+0x88/0x278
[59744.010069] -(0)[2349:ave_monitor] __handle_domain_irq+0x84/0xc4
[59744.010073] -(0)[2349:ave_monitor] gic_handle_irq+0x10c/0x184
[59744.010077] -(0)[2349:ave_monitor] el1_irq+0xec/0x198
[59744.010081] -(0)[2349:ave_monitor] __do_softirq+0xa8/0x39c
[59744.010086] -(0)[2349:ave_monitor] irq_exit+0xd8/0xdc
[59744.010090] -(0)[2349:ave_monitor] __handle_domain_irq+0x8c/0xc4
[59744.010094] -(0)[2349:ave_monitor] gic_handle_irq+0x10c/0x184
[59744.010098] -(0)[2349:ave_monitor] el1_irq+0xec/0x198
[59744.010103] -(0)[2349:ave_monitor] console_unlock+0x330/0x480
[59744.010108] -(0)[2349:ave_monitor] vprintk_emit+0x160/0x2ac
[59744.010114] -(0)[2349:ave_monitor] dev_vprintk_emit+0x1a4/0x1f8
[59744.010118] -(0)[2349:ave_monitor] dev_printk_emit+0x74/0xa0
[59744.010123] -(0)[2349:ave_monitor] __netdev_printk+0xd4/0x1c0
[59744.010127] -(0)[2349:ave_monitor] netdev_warn+0x68/0x90
[59744.010131] -(0)[2349:ave_monitor] 0xffffff800134c704
[59744.010136] -(0)[2349:ave_monitor] kthread+0x138/0x154
[59744.010141] -(0)[2349:ave_monitor] ret_from_fork+0x10/0x18
[59744.010150] -(0)[2349:ave_monitor]SMP: stopping secondary CPUs
[59744.010168] -(0)[2349:ave_monitor]Kernel Offset: 0x80000 from 0xffffff8008000000
[59744.010173] -(0)[2349:ave_monitor]CPU features: 0x00000000,2180600c
[59744.010177] -(0)[2349:ave_monitor]Memory Limit: none
[59751.455492] -(0)[2349:ave_monitor]Rebooting in 5 seconds..

Two lines in the log are key:

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ave_monitor:2349]

Kernel panic - not syncing: softlockup: hung tasks
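
When the log is long, these two lines can be pulled out mechanically. The following is a minimal Python sketch, assuming the message formats are exactly those shown in the log above; the function name `find_lockup_events` and the dictionary keys it returns are made up for illustration.

```python
import re

# Patterns match the two message formats seen in the log above.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.+)")


def find_lockup_events(lines):
    """Return a list of dicts describing soft lockup and panic lines."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            events.append({
                "kind": "soft_lockup",
                "cpu": int(m.group("cpu")),
                "stuck_seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
            continue
        m = PANIC_RE.search(line)
        if m:
            events.append({"kind": "panic", "reason": m.group("reason").strip()})
    return events


sample = [
    "[59744.009659] -(0)[2349:ave_monitor]watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ave_monitor:2349]",
    "[59744.009996] -(0)[2349:ave_monitor]Kernel panic - not syncing: softlockup: hung tasks",
]
for event in find_lockup_events(sample):
    print(event)
```

Feeding it the whole log (for example `open(path).read().splitlines()`) gives the stuck CPU, task name, and PID without scrolling through the register dump.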

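Lockups come in two kinds: soft lockup and hard lockup.

  • A soft lockup occurs when a CPU cannot schedule other threads normally: some piece of code keeps occupying the CPU, so the watchdog/x kernel thread never gets scheduled. Interrupts can still be serviced in this state.
  • A hard lockup occurs when interrupts cannot be serviced normally, that is, interrupts stay disabled for too long or an interrupt handler runs for too long.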

In my case, the reboot happened because the thread [ave_monitor:2349] got stuck, so the other threads on CPU#0 could not be scheduled.

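The watchdog then kicked in. (The watchdog monitors the health of the running system; once it detects one of the anomalies above, it reports it and, if configured to panic as on this system, the system reboots.)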
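
To see how a given system is configured to react, the relevant knobs can be read from procfs. This is a small sketch, assuming a Linux system that exposes `/proc/sys/kernel/watchdog_thresh`, `/proc/sys/kernel/softlockup_panic`, and `/proc/sys/kernel/panic`; the helper name `read_watchdog_settings` is hypothetical, and a knob that is missing (for example when the detector is not built in) is reported as `None`.

```python
from pathlib import Path

# Sysctl files under /proc/sys/kernel that govern lockup detection and reboot.
KNOBS = ("watchdog_thresh", "softlockup_panic", "panic")


def read_watchdog_settings(root="/proc/sys/kernel"):
    """Read integer sysctl values; missing or unreadable knobs map to None."""
    settings = {}
    for name in KNOBS:
        try:
            settings[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


print(read_watchdog_settings())
```

The `panic` value is the number of seconds the kernel waits before rebooting after a panic; a value of 5 would be consistent with the `Rebooting in 5 seconds..` line in the log.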

So in my code, ave_rxfifo_reset kept executing and never returned. I will not post the analysis of the specific business logic here.
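
Since the actual driver logic is not shown, the following is only a hypothetical model of a common remedy for a loop like this: bound the number of reset attempts and pause between them, instead of retrying until the condition clears. Everything here (`reset_until_clear`, `fifo_overflowed`, `do_reset`, the limits) is invented for illustration and written in Python rather than kernel C; in a real kernel thread the pause would be something like `msleep()` or `cond_resched()`.

```python
import time


def reset_until_clear(fifo_overflowed, do_reset, max_attempts=5, pause_s=0.01):
    """Retry a reset a bounded number of times, pausing between attempts.

    Returns True if the overflow cleared, False if we gave up.
    """
    for _ in range(max_attempts):
        if not fifo_overflowed():
            return True
        do_reset()
        time.sleep(pause_s)  # give other work a chance to run
    return not fifo_overflowed()


# Simulated hardware whose overflow never clears: the loop still terminates.
resets = []
ok = reset_until_clear(lambda: True, lambda: resets.append(1), pause_s=0)
print(ok, len(resets))  # False 5
```

The caller can then decide what to do on `False` (log once, back off, or take the interface down) rather than spinning and logging on every pass.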

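When dealing with this kind of problem, the following guidelines help:

  • Check whether the watchdog_touch_ts variable has been updated by the watchdog thread within the last 20 seconds (watchdog_thresh * 2, with the default watchdog_thresh of 10). If it has not, the watchdog thread is not getting scheduled. Most likely some CPU has had preemption disabled, or has been busy in interrupt handling, for too long, so the scheduler cannot run the watchdog thread.
  • In this situation the system usually does not die, but it becomes very slow. If the kernel parameter softlockup_panic (whose default comes from the CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC option) is set to 1, the system panics. Otherwise, only a warning is printed.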
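
The check described in the list above can be modelled in a few lines. This is a simplified Python model rather than the kernel's code: the function names are made up, timestamps are plain seconds, and the defaults (`watchdog_thresh=10`, `softlockup_panic=False`) are assumptions.

```python
def softlockup_duration(now, touch_ts, watchdog_thresh=10):
    """Return seconds since the last touch if over the limit, else 0."""
    limit = 2 * watchdog_thresh  # soft lockup threshold, 20 s by default
    stalled = now - touch_ts
    return stalled if stalled > limit else 0


def watchdog_action(now, touch_ts, watchdog_thresh=10, softlockup_panic=False):
    """Decide what the detector would do: nothing, warn, or panic."""
    duration = softlockup_duration(now, touch_ts, watchdog_thresh)
    if duration == 0:
        return "ok"
    return "panic" if softlockup_panic else "warn"


# A CPU whose timestamp was last touched 22 s ago, as in "stuck for 22s".
print(watchdog_action(now=59744, touch_ts=59722, softlockup_panic=True))  # panic
print(watchdog_action(now=59744, touch_ts=59722))  # warn
print(watchdog_action(now=59744, touch_ts=59730))  # ok
```

With the 22-second stall from the log and `softlockup_panic` enabled, the model lands on the same outcome as the trace: a panic rather than a warning.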

Reference: 请记住内核中这个勤劳的监测卫士---Watchdog(Soft lockup篇) ("Remember this hardworking sentinel in the kernel: Watchdog (soft lockup part)")
