NAME
mpstat - Report processors related statistics.
vmstat observes the performance of the system as a whole, and pidstat observes the performance of a single process; mpstat observes the performance of individual CPUs.
The mpstat command writes to standard output the activity of each available processor, processor 0 being the first one (cpu0 in the output). It also reports the global average activity across all processors. mpstat can be used on both SMP and UP machines, but on the latter (UP machines) only the global average activity is printed. If no option is specified, the default report is the overall CPU utilization report.
Note:
UP (Uni-Processor): the system has a single processing unit, i.e. a single-core CPU system.
SMP (Symmetric Multi-Processors): the system has multiple processing units that share the bus, memory, and so on.
[root@localhost ~]# mpstat
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
02:48:09 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:48:09 PM all 0.31 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.40
mpstat ...... [ interval [ count ] ]
The interval parameter specifies the amount of time in seconds between each report. A value of 0 (or no parameters at all) means processor statistics are reported for the time since system startup (boot). The count parameter can be specified together with interval when interval is not set to zero; its value determines the number of reports generated interval seconds apart. If interval is specified without count, mpstat generates reports continuously.
[root@localhost ~]# mpstat 2 5
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
02:55:10 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:55:12 PM all 0.50 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 99.00
02:55:14 PM all 0.25 0.00 0.63 0.00 0.00 0.00 0.00 0.00 0.00 99.12
02:55:16 PM all 0.38 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 99.12
02:55:18 PM all 0.50 0.00 0.62 0.00 0.00 0.00 0.00 0.00 0.00 98.88
02:55:20 PM all 0.38 0.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 99.12
Average: all 0.40 0.00 0.55 0.00 0.00 0.00 0.00 0.00 0.00 99.05
Display five reports of global statistics among all processors at two second intervals.
Under the hood, mpstat simply reads the data in /proc/stat:
open("/proc/stat", O_RDONLY) = 3
-P { cpu [,...] | ON | ALL }
Indicates the processor numbers for which statistics are to be reported. cpu is the processor number; processor 0 is the first processor. The ON keyword reports statistics for every online processor, while the ALL keyword reports statistics for all processors.
Report statistics for processor 1 (the second processor):
[root@localhost ~]# mpstat -P 1
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
03:08:34 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:08:34 PM 1 0.28 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.43
Report statistics for every online processor:
[root@localhost ~]# mpstat -P ON
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
03:09:59 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:09:59 PM all 0.31 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.40
03:09:59 PM 0 0.31 0.00 0.27 0.00 0.00 0.00 0.00 0.00 0.00 99.41
03:09:59 PM 1 0.28 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.43
03:09:59 PM 2 0.33 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.38
03:09:59 PM 3 0.34 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.37
Report statistics for all processors:
[root@localhost ~]# mpstat -P ALL
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
03:11:24 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:11:24 PM all 0.31 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.40
03:11:24 PM 0 0.31 0.00 0.27 0.00 0.00 0.00 0.00 0.00 0.00 99.41
03:11:24 PM 1 0.28 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.43
03:11:24 PM 2 0.33 0.00 0.29 0.00 0.00 0.00 0.00 0.00 0.00 99.38
03:11:24 PM 3 0.34 0.00 0.28 0.01 0.00 0.00 0.00 0.00 0.00 99.37
Meaning of each field:
CPU
Processor number. The keyword all indicates that statistics are calculated as averages among all processors.
%usr
Show the percentage of CPU utilization that occurred while executing at the user level (application).
%nice
Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.
%sys
Show the percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing hardware and software interrupts.
%iowait
Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%irq
Show the percentage of time spent by the CPU or CPUs to service hardware interrupts.
%soft
Show the percentage of time spent by the CPU or CPUs to service software interrupts.
%steal
Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
%guest
Show the percentage of time spent by the CPU or CPUs to run a virtual processor.
%gnice
Show the percentage of time spent by the CPU or CPUs to run a niced guest.
%idle
Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
For details on each field, see the explanations in the article "Linux top命令的cpu使用率和内存使用率".
-I { SUM | CPU | SCPU | ALL }
Report interrupt statistics.
With the SUM keyword, mpstat reports the total number of interrupts per processor. The following values are displayed:
CPU: processor number. The keyword all indicates that statistics are calculated as averages among all processors.
intr/s: the total number of interrupts received per second by the CPU or CPUs.
[root@localhost ~]# mpstat -I SUM
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
03:21:23 PM CPU intr/s
03:21:23 PM all 93.37
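For the since-boot report shown above, intr/s is just a cumulative interrupt count divided by the time since boot. A minimal sketch of that calculation, assuming the total is taken from the intr line of /proc/stat (per proc(5), its first value is the number of interrupts serviced since boot) and the elapsed time from /proc/uptime:

#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("/proc/stat", "r");
	char line[4096];
	unsigned long long intr = 0;
	double uptime = 0.0;

	if (!fp)
		return 1;
	/* Find the "intr" line; its first value is the since-boot total. */
	while (fgets(line, sizeof(line), fp))
		if (sscanf(line, "intr %llu", &intr) == 1)
			break;
	fclose(fp);

	fp = fopen("/proc/uptime", "r");
	if (!fp || fscanf(fp, "%lf", &uptime) != 1)
		return 1;
	fclose(fp);

	printf("intr/s since boot: %.2f\n", uptime > 0 ? intr / uptime : 0.0);
	return 0;
}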
With the CPU keyword, the number of each individual (hardware) interrupt received per second by the CPU or CPUs is shown.
A hard interrupt runs the hardware interrupt handler, known in Linux as the top half. It has the highest priority, and other interrupts are masked while a hardware interrupt handler runs.
[root@localhost ~]# mpstat -I CPU
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
03:23:41 PM CPU 0/s 1/s 8/s 9/s 12/s 16/s 20/s 120/s 121/s 122/s 123/s 124/s 125/s 126/s 127/s NMI/s LOC/s SPU/s PMI/s IWI/s RTR/s RES/s CAL/s TLB/s TRM/s THR/s DFR/s MCE/s MCP/s ERR/s MIS/s PIN/s NPI/s PIW/s
03:23:41 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 4.03 0.00 0.00 0.00 20.01 0.00 0.00 0.19 0.00 0.21 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:23:41 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 21.62 0.00 0.00 0.21 0.00 0.15 0.00 0.14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:23:41 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 22.75 0.00 0.00 0.17 0.00 0.15 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:23:41 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.00 0.00 0.00 22.86 0.00 0.00 0.25 0.00 0.15 0.00 0.08 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The data comes from reading /proc/interrupts, which shows how hard interrupts are being serviced.
Note: an interrupt is in essence a special electrical signal sent from a hardware device to the processor. When the processor receives an interrupt, it immediately signals the operating system, which is then responsible for handling the newly arrived data. Hardware generates interrupts without any regard for the processor's clock, i.e. an interrupt can arrive at any time.
Interrupts are thus an asynchronous event-handling mechanism, and they improve the system's capacity for concurrent processing.
Because an interrupt handler preempts whatever else is running, it must run as quickly as possible to minimize the disruption to normal process scheduling.
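This is why drivers split interrupt handling into a fast top half and a deferred bottom half. Below is a sketch of that split against the 3.10-era APIs; the device, the mydev_* names, and the registration details are hypothetical, for illustration only:

#include <linux/interrupt.h>

/* Bottom half: runs later in softirq context (TASKLET_SOFTIRQ) with
 * interrupts enabled, so lengthy work here does not block other IRQs. */
static void mydev_do_work(unsigned long data)
{
	/* ... time-consuming processing deferred from the handler ... */
}
static DECLARE_TASKLET(mydev_tasklet, mydev_do_work, 0);

/* Top half: do the minimum (acknowledge the hardware, capture volatile
 * state), hand the rest off, and return quickly. */
static irqreturn_t mydev_isr(int irq, void *dev_id)
{
	tasklet_schedule(&mydev_tasklet);
	return IRQ_HANDLED;
}

/* Registration, e.g. in the driver's probe routine:
 *	request_irq(irq, mydev_isr, 0, "mydev", dev);
 */

As with /proc/stat, strace shows mpstat reading the interrupt counters straight from procfs: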
open("/proc/interrupts", O_RDONLY) = 3
[root@localhost ~]# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 55 0 0 0 IR-IO-APIC-edge timer
1: 4 0 0 0 IR-IO-APIC-edge i8042
8: 1 0 0 0 IR-IO-APIC-edge rtc0
9: 4 0 0 0 IR-IO-APIC-fasteoi acpi
12: 3 3 0 0 IR-IO-APIC-edge i8042
16: 0 0 0 0 IR-IO-APIC-fasteoi i801_smbus
20: 0 0 0 0 IR-IO-APIC-fasteoi idma64.0
120: 0 0 0 0 DMAR_MSI-edge dmar0
121: 0 0 0 0 DMAR_MSI-edge dmar1
122: 0 0 0 0 IR-PCI-MSI-edge aerdrv, PCIe PME
123: 148 16 10 2 IR-PCI-MSI-edge xhci_hcd
124: 4738 519 421 54943 IR-PCI-MSI-edge 0000:00:17.0
125: 1111752 0 0 0 IR-PCI-MSI-edge enp1s0
126: 38 1 109 9 IR-PCI-MSI-edge i915
127: 541 136 191 85 IR-PCI-MSI-edge snd_hda_intel:card0
NMI: 56 52 55 56 Non-maskable interrupts
LOC: 5504316 5950291 6263079 6292876 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 56 52 55 56 Performance monitoring interrupts
IWI: 52427 58892 47990 68017 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 56937 40801 42634 41527 Rescheduling interrupts
CAL: 1150 1147 1194 1149 Function call interrupts
TLB: 28982 39043 19609 20986 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 918 918 918 918 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event
Some of the fields:
NMI (Non-maskable interrupts): incremented because every timer interrupt generates an NMI (non-maskable interrupt), which the NMI watchdog uses to detect lockups.
LOC (Local timer interrupts): the local interrupt counter of each CPU's internal APIC.
SPU (Spurious interrupts): a spurious interrupt is one that was raised and then lowered by some I/O device before the APIC could fully process it. The APIC therefore sees the interrupt but does not know which device it came from; in this case it generates the interrupt with IRQ vector 0xff. Spurious interrupts can also be caused by chipset bugs.
RES (Rescheduling interrupts), CAL (Function call interrupts), TLB (TLB shootdowns): rescheduling, function-call, and TLB-flush interrupts sent from one CPU to another as the OS requires. Kernel developers and interested users typically use these statistics to determine how often a given type of interrupt occurs.
TRM (Thermal event interrupts): raised when a CPU temperature threshold is exceeded; the interrupt may also be generated when the temperature drops back to normal.
THR (Threshold APIC interrupts): raised when the machine-check threshold counter (typically counting ECC-corrected memory or cache errors) exceeds a configurable threshold. Only available on some systems.
// linux-3.10/fs/proc/interrupts.c
/*
* /proc/interrupts
*/
static void *int_seq_start(struct seq_file *f, loff_t *pos)
{
return (*pos <= nr_irqs) ? pos : NULL;
}
static void *int_seq_next(struct seq_file *f, void *v, loff_t *pos)
{
(*pos)++;
if (*pos > nr_irqs)
return NULL;
return pos;
}
static void int_seq_stop(struct seq_file *f, void *v)
{
/* Nothing to do */
}
static const struct seq_operations int_seq_ops = {
.start = int_seq_start,
.next = int_seq_next,
.stop = int_seq_stop,
.show = show_interrupts
};
static int interrupts_open(struct inode *inode, struct file *filp)
{
return seq_open(filp, &int_seq_ops);
}
static const struct file_operations proc_interrupts_operations = {
.open = interrupts_open,
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release,
};
static int __init proc_interrupts_init(void)
{
proc_create("interrupts", 0, NULL, &proc_interrupts_operations);
return 0;
}
module_init(proc_interrupts_init);
The show_interrupts function:
// linux-3.10/kernel/irq/proc.c
int show_interrupts(struct seq_file *p, void *v)
{
......
arch_show_interrupts(p, prec);
......
}
arch_show_interrupts is an architecture-specific function; for x86:
// linux-3.10/arch/x86/kernel/irq.c
#define irq_stats(x) (&per_cpu(irq_stat, x))
/*
* /proc/interrupts printing for arch specific interrupts
*/
int arch_show_interrupts(struct seq_file *p, int prec)
{
int j;
seq_printf(p, "%*s: ", prec, "NMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
seq_printf(p, " Non-maskable interrupts\n");
#ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
seq_printf(p, " Local timer interrupts\n");
seq_printf(p, "%*s: ", prec, "SPU");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_spurious_count);
seq_printf(p, " Spurious interrupts\n");
seq_printf(p, "%*s: ", prec, "PMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
seq_printf(p, " Performance monitoring interrupts\n");
seq_printf(p, "%*s: ", prec, "IWI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_irq_work_irqs);
seq_printf(p, " IRQ work interrupts\n");
seq_printf(p, "%*s: ", prec, "RTR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->icr_read_retry_count);
seq_printf(p, " APIC ICR read retries\n");
#endif
if (x86_platform_ipi_callback) {
seq_printf(p, "%*s: ", prec, "PLT");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->x86_platform_ipis);
seq_printf(p, " Platform interrupts\n");
}
#ifdef CONFIG_SMP
seq_printf(p, "%*s: ", prec, "RES");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_resched_count);
seq_printf(p, " Rescheduling interrupts\n");
seq_printf(p, "%*s: ", prec, "CAL");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_call_count -
irq_stats(j)->irq_tlb_count);
seq_printf(p, " Function call interrupts\n");
seq_printf(p, "%*s: ", prec, "TLB");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_tlb_count);
seq_printf(p, " TLB shootdowns\n");
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
seq_printf(p, "%*s: ", prec, "TRM");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_thermal_count);
seq_printf(p, " Thermal event interrupts\n");
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
seq_printf(p, "%*s: ", prec, "THR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_threshold_count);
seq_printf(p, " Threshold APIC interrupts\n");
#endif
#ifdef CONFIG_X86_MCE
seq_printf(p, "%*s: ", prec, "MCE");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
seq_printf(p, " Machine check exceptions\n");
seq_printf(p, "%*s: ", prec, "MCP");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
seq_printf(p, " Machine check polls\n");
#endif
seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
#if defined(CONFIG_X86_IO_APIC)
seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
#endif
return 0;
}
As the code shows, the data is mainly read from per-cpu memory areas. For background on x86_64 per-cpu variables, see: Linux per-cpu
// linux-3.10/arch/x86/include/asm/hardirq.h
typedef struct {
unsigned int __softirq_pending;
unsigned int __nmi_count; /* arch dependent */
#ifdef CONFIG_X86_LOCAL_APIC
unsigned int apic_timer_irqs; /* arch dependent */
unsigned int irq_spurious_count;
unsigned int icr_read_retry_count;
#endif
#ifdef CONFIG_HAVE_KVM
unsigned int kvm_posted_intr_ipis;
#endif
unsigned int x86_platform_ipis; /* arch dependent */
unsigned int apic_perf_irqs;
unsigned int apic_irq_work_irqs;
#ifdef CONFIG_SMP
unsigned int irq_resched_count;
unsigned int irq_call_count;
/*
* irq_tlb_count is double-counted in irq_call_count, so it must be
* subtracted from irq_call_count when displaying irq_call_count
*/
unsigned int irq_tlb_count;
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
unsigned int irq_thermal_count;
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
unsigned int irq_threshold_count;
#endif
} ____cacheline_aligned irq_cpustat_t;
DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
// linux-3.10/include/linux/irq_cpustat.h
/*
* Simple wrappers reducing source bloat. Define all irq_stat fields
* here, even ones that are arch dependent. That way we get common
* definitions instead of differing sets for each arch.
*/
#ifndef __ARCH_IRQ_STAT
extern irq_cpustat_t irq_stat[]; /* defined in asm/hardirq.h */
#define __IRQ_STAT(cpu, member) (irq_stat[cpu].member)
#endif
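The write side of these counters is the inc_irq_stat() helper, which bumps one field of the current CPU's irq_stat. The local APIC timer handler, for example, increments apic_timer_irqs on every tick (abridged from the same 3.10 tree):

// linux-3.10/arch/x86/include/asm/hardirq.h
#define inc_irq_stat(member)	this_cpu_inc(irq_stat.member)

// linux-3.10/arch/x86/kernel/apic/apic.c
static void local_apic_timer_interrupt(void)
{
	......
	/*
	 * the NMI deadlock-detector uses this.
	 */
	inc_irq_stat(apic_timer_irqs);
	evt->event_handler(evt);
}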
With the SCPU keyword, the number of each individual software interrupt received per second by the CPU or CPUs is shown.
Softirqs are reserved for the most timing-critical and important bottom halves in the system (the top half being the hardware interrupt handler, which has the highest priority). While a softirq executes, other interrupts can still be serviced.
Among drivers, only the block device and network subsystems use softirqs.
[root@localhost ~]# mpstat -I SCPU
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain) 11/28/2022 _x86_64_ (4 CPU)
04:48:54 PM CPU HI/s TIMER/s NET_TX/s NET_RX/s BLOCK/s BLOCK_IOPOLL/s TASKLET/s SCHED/s HRTIMER/s RCU/s
04:48:54 PM 0 0.00 10.62 0.17 4.26 0.02 0.00 0.02 6.63 0.00 3.92
04:48:54 PM 1 0.00 12.78 0.00 0.03 0.00 0.00 0.00 7.19 0.00 4.89
04:48:54 PM 2 0.00 12.41 0.00 0.03 0.00 0.00 0.00 7.28 0.00 4.48
04:48:54 PM 3 0.00 12.97 0.00 0.02 0.19 0.00 0.00 7.13 0.00 4.90
The data comes from reading /proc/softirqs, which shows how softirqs are being serviced:
open("/proc/softirqs", O_RDONLY) = 3
[root@localhost ~]# cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI: 29 12 81 4
TIMER: 2978642 3586869 3484711 3642228
NET_TX: 46707 2 2 1
NET_RX: 1195259 8070 7563 6755
BLOCK: 5432 776 578 53783
BLOCK_IOPOLL: 0 0 0 0
TASKLET: 5769 0 0 0
SCHED: 1860352 2017852 2042455 1999179
HRTIMER: 0 0 0 0
RCU: 1100586 1372825 1258876 1377782
Softirqs fall into 10 categories, each corresponding to a different type of work. For example, NET_RX is the network receive softirq and NET_TX the network transmit softirq.
The categories in detail:
Softirq | Priority | Description |
---|---|---|
HI | 0 | Highest-priority softirq |
TIMER | 1 | Timer bottom half |
NET_TX | 2 | Network transmit softirq |
NET_RX | 3 | Network receive softirq |
BLOCK | 4 | Block device softirq |
BLOCK_IOPOLL | 5 | Block device softirq (I/O polling) |
TASKLET | 6 | Softirq that runs the tasklet mechanism |
SCHED | 7 | Process scheduling and load balancing |
HRTIMER | 8 | High-resolution timers |
RCU | 9 | Softirq servicing RCU |
A higher-priority softirq (e.g. priority 0) executes before a lower-priority one (e.g. priority 9); a smaller number means a higher priority.
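A subsystem claims one of these slots by registering a handler with open_softirq() and later marks it pending with raise_softirq(); the network subsystem, for example, registers its transmit and receive handlers at init time (abridged from the 3.10 tree):

// linux-3.10/kernel/softirq.c
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}

// linux-3.10/net/core/dev.c
static int __init net_dev_init(void)
{
	......
	open_softirq(NET_TX_SOFTIRQ, net_tx_action);
	open_softirq(NET_RX_SOFTIRQ, net_rx_action);
	......
}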
When the kernel generates softirqs in large numbers (softirqs take precedence over ordinary processes, so ordinary processes could be starved of processor time), softirqs are run by kernel threads instead. Each CPU has one softirq kernel thread, named ksoftirqd/<CPU number>.
[root@localhost ~]# top -n 1 | grep ksoftirqd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.89 ksoftirqd/0
14 root 20 0 0 0 0 S 0.0 0.0 0:00.10 ksoftirqd/1
19 root 20 0 0 0 0 S 0.0 0.0 0:00.13 ksoftirqd/2
24 root 20 0 0 0 0 S 0.0 0.0 0:00.18 ksoftirqd/3
Or:
[root@localhost ~]# ps aux | grep ksoftirq
root 3 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/0]
root 14 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/1]
root 19 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/2]
root 24 0.0 0.0 0 0 ? S Nov25 0:00 [ksoftirqd/3]
The thread names are enclosed in square brackets, which means ps could not obtain their command line (cmdline). As a rule of thumb, a name shown in square brackets in ps output is a kernel thread.
Note: softirqs take precedence over ordinary processes. While executing, a softirq can re-raise itself so that it runs again (the network subsystem does this, for example); if softirqs fire frequently and keep re-marking themselves runnable, user-space processes cannot get enough run time.
The ksoftirqd kernel threads exist for exactly this situation: when the kernel is flooded with softirqs, the threads take over the softirq work. They are scheduled like ordinary processes (as seen above, their nice value is 0, the same as a normal process), so deferring the work to them keeps ordinary processes from being starved of run time.
Kernel-thread performance issues:
In Linux, each CPU has a matching softirq kernel thread named ksoftirqd/<CPU number>. When softirq events fire at too high a rate, these threads can consume a great deal of CPU, softirq processing falls behind, and performance problems follow, such as network send/receive latency and sluggish scheduling.
High softirq CPU usage (softirq) is a very common performance problem. Although there are many softirq types, the bottlenecks met in production are mostly the network send/receive softirqs, especially network receive.
(1) softirqs
// linux-3.10/fs/proc/softirqs.c
/*
* /proc/softirqs ... display the number of softirqs
*/
static int show_softirqs(struct seq_file *p, void *v)
{
int i, j;
seq_puts(p, " ");
for_each_possible_cpu(i)
seq_printf(p, "CPU%-8d", i);
seq_putc(p, '\n');
for (i = 0; i < NR_SOFTIRQS; i++) {
seq_printf(p, "%12s:", softirq_to_name[i]);
for_each_possible_cpu(j)
seq_printf(p, " %10u", kstat_softirqs_cpu(i, j));
seq_putc(p, '\n');
}
return 0;
}
static int softirqs_open(struct inode *inode, struct file *file)
{
return single_open(file, show_softirqs, NULL);
}
static const struct file_operations proc_softirqs_operations = {
.open = softirqs_open,
.read = seq_read,
.llseek = seq_lseek,
.release = single_release,
};
static int __init proc_softirqs_init(void)
{
proc_create("softirqs", 0, NULL, &proc_softirqs_operations);
return 0;
}
module_init(proc_softirqs_init);
// linux-3.10/include/linux/interrupt.h
/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
frequency threaded job scheduling. For almost all the purposes
tasklets are more than enough. F.e. all serial device BHs et
al. should be converted to tasklets, not to softirqs.
*/
enum
{
HI_SOFTIRQ=0,
TIMER_SOFTIRQ,
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
BLOCK_SOFTIRQ,
BLOCK_IOPOLL_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */
NR_SOFTIRQS
};
/* map softirq index to softirq name. update 'softirq_to_name' in
* kernel/softirq.c when adding a new softirq.
*/
extern char *softirq_to_name[NR_SOFTIRQS];
// linux-3.10/kernel/softirq.c
static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
char *softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
};
The per-cpu variable struct kernel_stat kstat is defined, and the kstat symbol is exported:
// linux-3.10/kernel/sched/core.c
DEFINE_PER_CPU(struct kernel_stat, kstat);
EXPORT_PER_CPU_SYMBOL(kstat);
[root@localhost ~]# cat /proc/kallsyms | grep '\<kstat\>'
0000000000015b60 A kstat
[root@localhost ~]# cat /proc/kallsyms | grep '\<__per_cpu_start\>'
0000000000000000 A __per_cpu_start
[root@localhost ~]# cat /proc/kallsyms | grep '\<__per_cpu_end\>'
000000000001d000 A __per_cpu_end
kstat falls within the range [__per_cpu_start, __per_cpu_end], so it is a per-cpu variable in the kernel.
Reading the softirqs data:
// linux-3.10/include/linux/kernel_stat.h
struct kernel_stat {
#ifndef CONFIG_GENERIC_HARDIRQS
unsigned int irqs[NR_IRQS];
#endif
unsigned long irqs_sum;
unsigned int softirqs[NR_SOFTIRQS];
};
DECLARE_PER_CPU(struct kernel_stat, kstat);
#define kstat_cpu(cpu) per_cpu(kstat, cpu)
static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
{
return kstat_cpu(cpu).softirqs[irq];
}
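These counters are incremented on the other side in __do_softirq(), which walks the pending bitmap and bumps the per-cpu count before invoking each action (abridged from the 3.10 tree):

// linux-3.10/include/linux/kernel_stat.h
static inline void kstat_incr_softirqs_this_cpu(unsigned int irq)
{
	__this_cpu_inc(kstat.softirqs[irq]);
}

// linux-3.10/kernel/softirq.c
asmlinkage void __do_softirq(void)
{
	......
	do {
		if (pending & 1) {
			unsigned int vec_nr = h - softirq_vec;

			kstat_incr_softirqs_this_cpu(vec_nr);
			h->action(h);
			......
		}
		h++;
		pending >>= 1;
	} while (pending);
	......
}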
(2) ksoftirqd
ksoftirqd is represented by a struct task_struct pointer, stored in per-cpu memory:
// linux-3.10/kernel/softirq.c
DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
/*
* we cannot loop indefinitely here to avoid userspace starvation,
* but we also don't want to introduce a worst case 1/HZ latency
* to the pending events, so lets the scheduler to balance
* the softirq load for us.
*/
static void wakeup_softirqd(void)
{
/* Interrupts are disabled: no need to stop preemption */
struct task_struct *tsk = __this_cpu_read(ksoftirqd);
if (tsk && tsk->state != TASK_RUNNING)
wake_up_process(tsk);
}
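wakeup_softirqd() is reached from raise_softirq(): if a softirq is raised outside interrupt context, the per-CPU thread is woken so the work gets scheduled soon (abridged from the same file):

// linux-3.10/kernel/softirq.c
inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * If we're in an interrupt or softirq, we're done ......
	 * Otherwise we wake up ksoftirqd to make sure we
	 * schedule the softirq soon.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}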
The declaration of ksoftirqd:
// linux-3.10/include/linux/interrupt.h
DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
static inline struct task_struct *this_cpu_ksoftirqd(void)
{
return this_cpu_read(ksoftirqd);
}
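The threads themselves are created through the smpboot per-CPU thread infrastructure; the .thread_comm format string is where the ksoftirqd/<N> names seen in ps come from (abridged from the 3.10 tree):

// linux-3.10/kernel/softirq.c
static void run_ksoftirqd(unsigned int cpu)
{
	local_irq_disable();
	if (local_softirq_pending()) {
		__do_softirq();
		rcu_note_context_switch(cpu);
		local_irq_enable();
		cond_resched();
		return;
	}
	local_irq_enable();
}

static struct smp_hotplug_thread softirq_threads = {
	.store			= &ksoftirqd,
	.thread_should_run	= ksoftirqd_should_run,
	.thread_fn		= run_ksoftirqd,
	.thread_comm		= "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
	register_cpu_notifier(&cpu_nfb);
	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
	return 0;
}
early_initcall(spawn_ksoftirqd);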
References:
Linux kernel 3.10.0 source
Linux内核设计与实现 (Linux Kernel Development)
极客时间:Linux性能优化实战 (Geek Time: Linux Performance Optimization in Practice)