Using RCU’s CPU Stall Detector (to be updated)

Table of contents

    • What Causes RCU CPU Stall Warnings?
    • Fine-Tuning the RCU CPU Stall Detector
    • Interpreting RCU’s CPU Stall Detector “Splats”

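This article first discusses the kinds of problems that RCU’s CPU stall detector can locate, then the kernel parameters and Kconfig options that can be used to fine-tune the detector’s operation. Finally, it explains the format of the stall detector’s “splat”.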

What Causes RCU CPU Stall Warnings?

So your kernel printed an RCU CPU stall warning. The next question is “What caused it?” The following problems can result in RCU CPU stall warnings:

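  • A CPU looping in an RCU read-side critical section.
  • A CPU looping with interrupts disabled.
  • A CPU looping with preemption disabled.
  • A CPU looping with bottom halves disabled.
  • For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel without invoking schedule(). If the looping in the kernel really is expected and desirable behavior, you might need to add some calls to cond_resched().
  • Booting Linux over a console connection that is too slow to keep up with the boot-time console-message rate. For example, a 115Kbaud serial console can be too slow to keep up with boot-time message rates, and will frequently result in RCU CPU stall warning messages, especially if you have added debug printk() calls.
  • Anything that prevents RCU’s grace-period kthreads from running. This can result in the “All QSes seen” console-log message, which includes information on when the kthread last ran and how often it should be expected to run. It can also result in the “rcu_.*kthread starved for” console-log message, which includes additional debugging information.
  • A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might happen to preempt a low-priority task in the middle of an RCU read-side critical section. This is especially damaging if that low-priority task is not permitted to run on any other CPU, in which case the next RCU grace period can never complete, which will eventually cause the system to run out of memory and hang. While the system is in the process of running itself out of memory, you might see stall-warning messages.
  • A CPU-bound real-time task in a preemptible kernel that runs at a higher priority than the RCU softirq threads. This prevents RCU callbacks from ever being invoked, and in a CONFIG_PREEMPT_RCU kernel it further prevents RCU grace periods from ever completing. Either way, the system will eventually run out of memory and hang. In the CONFIG_PREEMPT_RCU case, you might see stall-warning messages. You can use the rcutree.kthread_prio kernel boot parameter to raise the scheduling priority of RCU’s kthreads, which can help avoid this problem. Note, however, that doing so can increase the system’s context-switch rate and thus degrade performance.
  • A periodic interrupt whose handler takes longer than the time interval between successive pairs of interrupts. This can prevent RCU’s kthreads and softirq handlers from running. Note that certain high-overhead debugging options, for example the function-graph tracer, can make interrupt handlers take considerably longer than normal, which can in turn result in RCU CPU stall warnings.
  • Testing a workload on a fast system, tuning the stall-warning timeout down to just barely avoid RCU CPU stall warnings, and then running the same workload with the same stall-warning timeout on a slow system. Note that thermal throttling and on-demand governors can cause a single system to be sometimes fast and sometimes slow!
  • A hardware or software issue that shuts off the scheduler-clock interrupt on a CPU that is not in dyntick-idle mode. This problem really has happened, and seems most likely to result in RCU CPU stall warnings on CONFIG_NO_HZ_COMMON=n kernels.
  • A bug in the RCU implementation.
  • A hardware failure. This is quite unlikely, but it has occurred at least once in real life. A CPU failed in a running system, becoming unresponsive, but without causing an immediate crash. This resulted in a series of RCU CPU stall warnings, which eventually led to the realization that the CPU had failed.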

Fine-Tuning the RCU CPU Stall Detector

The rcupdate.rcu_cpu_stall_suppress module parameter disables RCU’s
CPU stall detector, which detects conditions that unduly delay RCU grace
periods. By default this parameter leaves CPU stall detection enabled,
but it may be overridden via a boot-time parameter or at runtime via sysfs.
The stall detector’s idea of what constitutes “unduly delayed” is
controlled by a set of kernel configuration variables and cpp macros:

CONFIG_RCU_CPU_STALL_TIMEOUT

This kernel configuration parameter defines the period of time
that RCU will wait from the beginning of a grace period until it
issues an RCU CPU stall warning.  This time period is normally
21 seconds.

This configuration parameter may be changed at runtime via
/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout. However,
the parameter is checked only at the beginning of a cycle.
So if you are 10 seconds into a 40-second stall, setting this
sysfs parameter to (say) five will shorten the timeout for the
-next- stall, or the following warning for the current stall
(assuming the stall lasts long enough).  It will not affect the
timing of the next warning for the current stall.

Stall-warning messages may be enabled and disabled completely via
/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress.
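
As a rough illustration of inspecting these two sysfs files, the following Python sketch reads them and reports the current settings. The helper names read_rcu_param() and describe_stall_settings() are hypothetical, and the sketch assumes that each file holds a single integer, with a nonzero rcu_cpu_stall_suppress meaning that warnings are suppressed.

```python
from pathlib import Path

# Directory named in the text above; pass another directory to try this offline.
RCU_PARAM_DIR = Path("/sys/module/rcupdate/parameters")


def read_rcu_param(name, param_dir=RCU_PARAM_DIR):
    """Return one parameter as an int, or None if it cannot be read (hypothetical helper)."""
    path = Path(param_dir) / name
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        return None


def describe_stall_settings(param_dir=RCU_PARAM_DIR):
    """Summarize the stall-detector settings found under param_dir."""
    timeout = read_rcu_param("rcu_cpu_stall_timeout", param_dir)
    suppress = read_rcu_param("rcu_cpu_stall_suppress", param_dir)
    return {
        "timeout_seconds": timeout,
        # Assumption: a nonzero suppress value means warnings are disabled.
        "warnings_enabled": None if suppress is None else suppress == 0,
    }
```

Calling describe_stall_settings() with no arguments reads the live system; on a machine without these files, both values come back as None.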

RCU_STALL_DELAY_DELTA

Although the lockdep facility is extremely useful, it does add
some overhead.  Therefore, under CONFIG_PROVE_RCU, the
RCU_STALL_DELAY_DELTA macro allows five extra seconds before
giving an RCU CPU stall warning message.  (This is a cpp
macro, not a kernel configuration parameter.)
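
The effect of this macro on the first warning can be sketched as simple arithmetic. The function below is a hypothetical illustration, not kernel code: it takes the 21-second default and the five-second allowance from the text and ignores any other adjustment the kernel may apply.

```python
DEFAULT_STALL_TIMEOUT_S = 21   # normal CONFIG_RCU_CPU_STALL_TIMEOUT, per the text
PROVE_RCU_EXTRA_S = 5          # RCU_STALL_DELAY_DELTA under CONFIG_PROVE_RCU, per the text


def first_warning_delay(timeout_s=DEFAULT_STALL_TIMEOUT_S, prove_rcu=False):
    """Seconds from grace-period start to the first stall warning (illustrative only)."""
    extra = PROVE_RCU_EXTRA_S if prove_rcu else 0
    return timeout_s + extra
```

With the defaults this gives 21 seconds, or 26 seconds when prove_rcu=True.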

RCU_STALL_RAT_DELAY

The CPU stall detector tries to make the offending CPU print its
own warnings, as this often gives better-quality stack traces.
However, if the offending CPU does not detect its own stall in
the number of jiffies specified by RCU_STALL_RAT_DELAY, then
some other CPU will complain.  This delay is normally set to
two jiffies.  (This is a cpp macro, not a kernel configuration
parameter.)

rcupdate.rcu_task_stall_timeout

This boot/sysfs parameter controls the RCU-tasks stall warning
interval.  A value of zero or less suppresses RCU-tasks stall
warnings.  A positive value sets the stall-warning interval
in seconds.  An RCU-tasks stall warning starts with the line:

	INFO: rcu_tasks detected stalls on tasks:

And continues with the output of sched_show_task() for each
task stalling the current RCU-tasks grace period.

Interpreting RCU’s CPU Stall Detector “Splats”

For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling, it will print a message similar to the following:

	INFO: rcu_sched detected stalls on CPUs/tasks:
	2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0
	16-...: (0 ticks this GP) idle=81c/0/0 softirq=764/764 fqs=0
	(detected by 32, t=2603 jiffies, g=7075, q=625)

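This message indicates that CPU 32 detected that CPUs 2 and 16 were both causing stalls, and that the stall was affecting RCU-sched. This message will normally be followed by a stack dump for each CPU. Note that PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that tasks are indicated by PID, for example, “P3421”. It is even possible for a stall to be caused by both CPUs and tasks, in which case the offending CPUs and tasks will all be called out in the list.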

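CPU 2’s “(3 GPs behind)” indicates that this CPU has not interacted with the RCU core for the past three grace periods. In contrast, CPU 16’s “(0 ticks this GP)” indicates that this CPU has not taken any scheduling-clock interrupts during the current stalled grace period.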

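The “idle=” portion of the message prints the dyntick-idle state. The hex number before the first “/” is the low-order 12 bits of the dynticks counter, which has an even value if the CPU is in dyntick-idle mode and an odd value otherwise. The hex number between the two “/”s is the nesting value, which is a small non-negative number if the CPU is in the idle loop (as shown above) and a very large positive number otherwise.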
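
The even/odd rule for the “idle=” field can be sketched in a few lines of Python. decode_idle() is a hypothetical helper; it assumes the field always has the “hex/hex/…” shape shown in the sample and leaves the third component unparsed, since the text does not describe it.

```python
def decode_idle(field):
    """Decode the text after "idle=", for example "06c/0/0" (hypothetical helper)."""
    parts = field.split("/")
    counter = int(parts[0], 16)   # low-order 12 bits of the dynticks counter
    nesting = int(parts[1], 16)   # nesting value between the two slashes
    return {
        "dynticks_low_bits": counter,
        "in_dyntick_idle": counter % 2 == 0,  # even means dyntick-idle
        "nesting": nesting,
    }
```

For the sample value “06c/0/0”, the counter 0x06c is even, so the sketch reports the CPU as being in dyntick-idle mode with a nesting value of zero.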

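The “softirq=” portion of the message tracks the number of RCU softirq handlers that the stalled CPU has executed. The number before the “/” is the number that had executed since boot at the time this CPU last noted the beginning of a grace period, which might be the current (stalled) grace period or some earlier grace period (for example, if the CPU has been in dyntick-idle mode for an extended time). The number after the “/” is the number that have executed since boot until the current time. If this latter number stays constant across repeated stall-warning messages, RCU’s softirq handlers may no longer be able to execute on this CPU. This can happen if the stalled CPU is spinning with interrupts disabled or, in -rt kernels, if a high-priority process is starving RCU’s softirq handler.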

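The “fqs=” shows the number of force-quiescent-state idle/offline detection passes that the grace-period kthread has made across this CPU since the last time that this CPU noted the beginning of a grace period.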

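The “detected by” line indicates which CPU detected the stall (in this case, CPU 32), how many jiffies have elapsed since the start of the grace period (in this case, 2603), the grace-period sequence number (7075), and an estimate of the total number of RCU callbacks queued across all CPUs (625 in this case).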
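
To make the field layout concrete, here is a minimal Python sketch that pulls the per-CPU lines and the “detected by” line out of a splat like the sample above. The regular expressions and the parse_splat() name are assumptions modeled only on that one sample; real kernels may print additional or differently ordered fields, and task entries such as “P3421” are not handled.

```python
import re

# Per-CPU line, e.g. "2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0"
CPU_LINE = re.compile(
    r"(?P<cpu>\d+)-\S*:\s+\((?P<note>[^)]*)\)\s+idle=(?P<idle>\S+)"
    r"\s+softirq=(?P<sirq_gp>\d+)/(?P<sirq_now>\d+)\s+fqs=(?P<fqs>\d+)"
)
# Summary line, e.g. "(detected by 32, t=2603 jiffies, g=7075, q=625)"
DETECT_LINE = re.compile(
    r"\(detected by (?P<cpu>\d+), t=(?P<t>\d+) jiffies, g=(?P<g>\d+), q=(?P<q>\d+)\)"
)


def parse_splat(text):
    """Return (stalled_cpus, detector) parsed from a stall-warning splat."""
    stalled, detector = [], None
    for line in text.splitlines():
        m = CPU_LINE.search(line)
        if m:
            gp, now = int(m.group("sirq_gp")), int(m.group("sirq_now"))
            stalled.append({
                "cpu": int(m.group("cpu")),
                "note": m.group("note"),
                "idle": m.group("idle"),
                # RCU softirq handlers run since this CPU last noted a GP start
                "softirq_since_gp_start": now - gp,
                "fqs": int(m.group("fqs")),
            })
            continue
        m = DETECT_LINE.search(line)
        if m:
            detector = {
                "cpu": int(m.group("cpu")),
                "t_jiffies": int(m.group("t")),
                "gp_seq": int(m.group("g")),
                "callbacks_queued": int(m.group("q")),
            }
    return stalled, detector
```

Fed the sample splat, the sketch reports CPUs 2 and 16 as stalled and CPU 32 as the detector. A softirq_since_gp_start value that stays the same across repeated warnings corresponds to the “latter number stays constant” case described above.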

In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed
for each CPU:

	0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 dyntick_enabled: 1

The “last_accelerate:” prints the low-order 16 bits (in hex) of the
jiffies counter when this CPU last invoked rcu_try_advance_all_cbs()
from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from
rcu_prepare_for_idle(). “dyntick_enabled: 1” indicates that dyntick-idle
processing is enabled.

If the grace period ends just as the stall warning starts printing,
there will be a spurious stall-warning message, which will include
the following:

	INFO: Stall ended before state dump start

This is rare, but does happen from time to time in real life. It is also
possible for a zero-jiffy stall to be flagged in this case, depending
on how the stall warning and the grace-period initialization happen to
interact. Please note that it is not possible to entirely eliminate this
sort of false positive without resorting to things like stop_machine(),
which is overkill for this sort of problem.

If all CPUs and tasks have passed through quiescent states, but the
grace period has nevertheless failed to end, the stall-warning splat
will include something like the following:

	All QSes seen, last rcu_preempt kthread activity 23807 (4297905177-4297881370), jiffies_till_next_fqs=3, root ->qsmask 0x0

The “23807” indicates that it has been more than 23 thousand jiffies
since the grace-period kthread ran. The “jiffies_till_next_fqs”
indicates how frequently that kthread should run, giving the number
of jiffies between force-quiescent-state scans, in this case three,
which is way less than 23807. Finally, the root rcu_node structure’s
->qsmask field is printed, which will normally be zero.
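
The numbers in this line can be cross-checked with a short Python sketch. The parse_all_qs() helper and its regular expression are hypothetical and modeled only on the sample line. Reading the two parenthesized values as the current jiffies and the jiffies at the kthread’s last activity is an assumption, though it is consistent with their difference equaling 23807.

```python
import re

ALL_QS = re.compile(
    r"All QSes seen, last (?P<kthread>\S+) kthread activity (?P<idle>\d+) "
    r"\((?P<now>\d+)-(?P<last>\d+)\), jiffies_till_next_fqs=(?P<fqs>\d+)"
    r", root ->qsmask (?P<qsmask>0x[0-9a-fA-F]+)"
)


def parse_all_qs(line):
    """Parse an "All QSes seen" line; return None if the line does not match."""
    m = ALL_QS.search(line)
    if m is None:
        return None
    idle = int(m.group("idle"))
    now, last = int(m.group("now")), int(m.group("last"))
    fqs = int(m.group("fqs"))
    return {
        "kthread": m.group("kthread"),
        "jiffies_since_activity": idle,
        # Assumed meaning: (current jiffies - jiffies at last kthread activity)
        "difference_matches": now - last == idle,
        "jiffies_till_next_fqs": fqs,
        # Roughly how many scan intervals elapsed without the kthread running
        "missed_scan_intervals": idle // fqs if fqs else None,
        "root_qsmask": int(m.group("qsmask"), 16),
    }
```

For the sample line, the kthread has been idle for thousands of its three-jiffy scan intervals, which is the mismatch the text points out.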

If the relevant grace-period kthread has been unable to run prior to
the stall warning, as was the case in the “All QSes seen” line above,
the following additional line is printed:

	kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5

Starving the grace-period kthreads of CPU time can of course result
in RCU CPU stall warnings even when all CPUs and tasks have passed
through the required quiescent states. The “g” number shows the current
grace-period sequence number, the “f” precedes the ->gp_flags command
to the grace-period kthread, the “RCU_GP_WAIT_FQS” indicates that the
kthread is waiting for a short timeout, the “state” precedes the value of
the task_struct ->state field, and the “cpu” indicates that the
grace-period kthread last ran on CPU 5.
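
The fields of the “kthread starved for” line can be separated with a similar Python sketch. The pattern and the parse_starved() name are hypothetical and based only on the sample line above; other kernel versions may format this line differently.

```python
import re

STARVED = re.compile(
    r"kthread starved for (?P<jiffies>\d+) jiffies! "
    r"g(?P<gp_seq>\d+) f(?P<gp_flags>0x[0-9a-fA-F]+) "
    r"(?P<gp_state>\w+)\((?P<gp_state_num>\d+)\) "
    r"->state=(?P<task_state>0x[0-9a-fA-F]+) ->cpu=(?P<cpu>\d+)"
)


def parse_starved(line):
    """Parse a "kthread starved for" line; return None if it does not match."""
    m = STARVED.search(line)
    if m is None:
        return None
    return {
        "starved_jiffies": int(m.group("jiffies")),
        "gp_seq": int(m.group("gp_seq")),              # the "g" number
        "gp_flags": int(m.group("gp_flags"), 16),      # the "f" value
        "gp_state": m.group("gp_state"),               # e.g. RCU_GP_WAIT_FQS
        "task_state": int(m.group("task_state"), 16),  # task_struct ->state
        "last_cpu": int(m.group("cpu")),               # where the kthread last ran
    }
```

Because the sketch uses search() rather than match(), it also accepts lines that carry a prefix before “kthread starved for”, such as a kthread name or a log timestamp.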

Multiple Warnings From One Stall

If a stall lasts long enough, multiple stall-warning messages will be
printed for it. The second and subsequent messages are printed at
longer intervals, so that the time between (say) the first and second
message will be about three times the interval between the beginning
of the stall and the first message.
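
The spacing described here can be approximated with a small Python sketch. warning_times() is a hypothetical helper, and it assumes that every gap after the first warning is three times the initial delay. The text only states this for the gap between the first and second messages, so treat later entries as a guess.

```python
def warning_times(first_delay_s, count):
    """Approximate times, in seconds since the stall began, of the first `count` warnings."""
    times = []
    t = first_delay_s
    for _ in range(count):
        times.append(t)
        # Assumption: each later gap is about three times the initial delay.
        t += 3 * first_delay_s
    return times
```

With a 21-second initial delay, warning_times(21, 3) yields warnings at roughly 21, 84, and 147 seconds into the stall.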
