Linux Performance Analysis: Load Average

Understanding Load Average

When a system becomes slow, we usually start by checking its load with the top or uptime command.

[root@localhost shell]# uptime
 13:51:08 up 5 days, 21:50,  3 users,  load average: 0.00, 0.02, 0.05

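load average: 0.00, 0.02, 0.05 gives the average load over the last 1, 5, and 15 minutes, respectively. But what does load average actually mean?

Run man uptime to read the detailed description: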

 man uptime

Its explanation of load average reads as follows:
System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk.
The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.

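In other words, load average counts the processes that are in the runnable or uninterruptible state, and the figure is only meaningful when compared with the number of CPUs in the system.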
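To make the comparison with the CPU count concrete, the sketch below divides each load average by the number of CPUs. It is a minimal illustration rather than part of the original walkthrough: the function names are hypothetical, and it assumes the usual /proc/loadavg layout, where the first three space-separated fields are the 1-, 5-, and 15-minute averages.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages as floats."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("expected at least three load average fields")
    return tuple(float(value) for value in fields[:3])


def load_per_cpu(loads, cpu_count):
    """Divide each load average by the CPU count (must be >= 1)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in loads)


def current_load_per_cpu(path="/proc/loadavg"):
    """Read the live values on a Linux host and normalize them."""
    with open(path) as handle:
        loads = parse_loadavg(handle.read())
    # os.cpu_count() can return None, so fall back to a single CPU.
    return load_per_cpu(loads, os.cpu_count() or 1)
```

For example, load_per_cpu(parse_loadavg("4.00 2.00 1.00"), 4) gives (1.0, 0.5, 0.25). Following the man page, a value near 1.0 per CPU means the processors were, on average, fully in demand over that interval.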

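Scenario Simulation

Load average measures how much demand processes place on system resources, covering both CPU and I/O, so a high load average does not necessarily mean high CPU usage.
The three scenarios simulated below all produce a high load average, but they correspond to a CPU-intensive process, an I/O-intensive process, and a large group of processes waiting to be scheduled on the CPU, respectively.

First install stress, a stress-testing tool.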

yum install -y epel-release
yum install -y stress

Scenario 1: CPU-intensive

Run the following stress command to keep one core fully busy:

stress --cpu 1 --timeout 600

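In a new terminal, check the usage of each core. Here 5 is the sampling interval in seconds and 20 is the total number of reports to print.
In the output below, CPU 3 (the fourth core, since numbering starts at 0) reaches 100% usage in the second sample.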

[root@localhost shell]# mpstat -P ALL 5 20
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020年07月07日     _x86_64_(4 CPU)

14时11分49秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14时11分54秒  all   25.29    0.00    0.05    0.00    0.00    0.00    0.00    0.00    0.00   74.66
14时11分54秒    0    0.60    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.20
14时11分54秒    1   14.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   85.80
14时11分54秒    2    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.80
14时11分54秒    3   86.23    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   13.57

14时11分54秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14时11分59秒  all   25.23    0.00    0.05    0.00    0.00    0.00    0.00    0.00    0.00   74.72
14时11分59秒    0    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.80
14时11分59秒    1    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.80
14时11分59秒    2    0.40    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.40
14时11分59秒    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
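Picking out the busy core by eye works for four CPUs but gets tedious on larger machines. As a rough sketch, the hypothetical helpers below parse mpstat -P ALL text and return the core with the lowest %idle. They assume the column layout shown above, with a header row that contains CPU and %idle; other sysstat versions may print different columns.

```python
MPSTAT_SAMPLE = """\
14时11分54秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14时11分59秒  all   25.23    0.00    0.05    0.00    0.00    0.00    0.00    0.00    0.00   74.72
14时11分59秒    0    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.80
14时11分59秒    1    0.20    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.80
14时11分59秒    2    0.40    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.40
14时11分59秒    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
"""


def parse_mpstat(text):
    """Turn mpstat rows into dicts keyed by the header's column names."""
    columns = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if "CPU" in fields and "%idle" in fields:
            # Header row: remember the column names from "CPU" onwards.
            columns = fields[fields.index("CPU"):]
            continue
        if columns is None or len(fields) < len(columns):
            continue  # banner, blank line, or data before any header
        # Count from the right so the timestamp's token count is irrelevant.
        values = fields[-len(columns):]
        row = {"CPU": values[0]}
        for name, value in zip(columns[1:], values[1:]):
            row[name] = float(value)
        rows.append(row)
    return rows


def busiest_cpu(rows):
    """Return the per-core row with the lowest %idle, ignoring 'all'."""
    cores = [row for row in rows if row["CPU"] != "all"]
    return min(cores, key=lambda row: row["%idle"])


busiest = busiest_cpu(parse_mpstat(MPSTAT_SAMPLE))
print(busiest["CPU"], busiest["%usr"], busiest["%idle"])
```

On the second sample above this selects CPU 3, which matches the reading of the table.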

In another terminal, print the CPU usage of each process. The stress process with PID 6325 is using 100% of one CPU.

[root@localhost shell]# pidstat -u 5 10
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020年07月07日     _x86_64_    (4 CPU)

14时13分11秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
14时13分16秒     0      1891    0.40    0.00    0.00    0.40     3  X
14时13分16秒     0      2581    0.60    0.20    0.00    0.79     3  gnome-shell
14时13分16秒     0      2604    0.00    0.20    0.00    0.20     0  ibus-daemon
14时13分16秒     0      2668    0.00    0.20    0.00    0.20     0  goa-identity-se
14时13分16秒     0      2847    0.20    0.00    0.00    0.20     3  vmtoolsd
14时13分16秒     0      6325   99.40    0.00    0.00   99.40     1  stress
14时13分16秒     0      6487    0.00    0.20    0.00    0.20     3  pidstat

14时13分16秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
14时13分21秒     0      1891    0.40    0.00    0.00    0.40     3  X
14时13分21秒     0      2581    0.80    0.00    0.00    0.80     3  gnome-shell
14时13分21秒     0      6325  100.00    0.00    0.00  100.00     1  stress
14时13分21秒     0      6487    0.20    0.20    0.00    0.40     3  pidstat

14时13分21秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
14时13分26秒     0       500    0.00    0.20    0.00    0.20     0  xfsaild/dm-0
14时13分26秒     0      1891    0.60    0.40    0.00    1.00     1  X
14时13分26秒     0      2581    1.00    0.40    0.00    1.40     1  gnome-shell
14时13分26秒     0      6206    0.20    0.20    0.00    0.40     1  gnome-terminal-
14时13分26秒     0      6325  100.00    0.00    0.00  100.00     2  stress
14时13分26秒     0      6487    0.00    0.20    0.00    0.20     3  pidstat

Scenario 2: I/O-intensive

Install stress-ng:

yum install stress-ng

Run the following command to simulate an I/O-intensive workload:

stress-ng -i 1 --hdd 1 --timeout 600

Check mpstat: %iowait has risen noticeably.

[root@localhost shell]# mpstat -P ALL 5 20
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020年07月07日     _x86_64_    (4 CPU)

14时30分20秒  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14时30分25秒  all    0.81    0.00   12.68   37.67    0.00    1.83    0.00    0.00    0.00   47.02
14时30分25秒    0    0.22    0.00   15.08   32.59    0.00    1.11    0.00    0.00    0.00   51.00
14时30分25秒    1    1.65    0.00   22.63   37.65    0.00    0.41    0.00    0.00    0.00   37.65
14时30分25秒    2    0.88    0.00    8.10   43.98    0.00    3.06    0.00    0.00    0.00   43.98
14时30分25秒    3    0.00    0.00    4.75   36.29    0.00    2.81    0.00    0.00    0.00   56.16

Check which application accounts for most of the I/O: the stress-ng-hdd process is writing heavily to disk.

[root@localhost shell]# pidstat -d 5 10 | grep "stress"
                           UID       PID   kB_rd/s   kB_wr/s   kB_ccwr/s  Command
14时32分34秒     0      6730      0.00 549011.95 143466.14  stress-ng-hdd
14时32分34秒     0      6731      0.00      0.00  13406.37  stress-ng-io
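When many processes show up in this report, ranking them by write rate can be scripted. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the parser expects exactly the column order shown above (PID, kB_rd/s, kB_wr/s, kB_ccwr/s, Command), and it would need adjusting for sysstat versions that print extra columns or for command names containing spaces.

```python
PIDSTAT_SAMPLE = """\
14时32分34秒     0      6730      0.00 549011.95 143466.14  stress-ng-hdd
14时32分34秒     0      6731      0.00      0.00  13406.37  stress-ng-io
"""


def parse_pidstat_io(text):
    """Parse 'pidstat -d' data rows, counting fields from the right."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        try:
            pid = int(fields[-5])
            read_kb, write_kb, cancelled_kb = (float(v) for v in fields[-4:-1])
        except ValueError:
            continue  # header row or an unrelated line
        rows.append({
            "PID": pid,
            "kB_rd/s": read_kb,
            "kB_wr/s": write_kb,
            "kB_ccwr/s": cancelled_kb,
            "Command": fields[-1],
        })
    return rows


def top_writer(rows):
    """Return the row with the highest kB_wr/s."""
    return max(rows, key=lambda row: row["kB_wr/s"])


writer = top_writer(parse_pidstat_io(PIDSTAT_SAMPLE))
print(writer["PID"], writer["Command"], writer["kB_wr/s"])
```

For the two rows above, the top writer is PID 6730 (stress-ng-hdd).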

Scenario 3: a large number of processes

Run the following command to start 16 CPU-bound worker processes:

stress -c 16 --timeout 600

Check uptime: the 1-minute load average has risen sharply.

[root@localhost shell]# uptime
 14:38:53 up 5 days, 22:38,  4 users,  load average: 12.05, 5.38, 2.74
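The 1-minute figure is still below 16, and the 5- and 15-minute figures lag further behind, because each load average is an exponentially damped moving average rather than an instantaneous count. The sketch below is a simplified floating-point model of that behaviour, not the kernel's actual code: it assumes one sample every 5 seconds and a constant number of runnable processes, and the function name is hypothetical.

```python
import math


def simulate_load(initial, runnable, seconds, window=60.0, tick=5.0):
    """Model a damped load average after `seconds` of constant demand.

    initial  -- load average at the start
    runnable -- constant number of runnable processes during the run
    window   -- averaging window in seconds (60, 300 or 900)
    tick     -- assumed sampling period in seconds
    """
    decay = math.exp(-tick / window)
    load = initial
    for _ in range(int(seconds // tick)):
        # Each sample keeps `decay` of the old value and blends in the new one.
        load = load * decay + runnable * (1.0 - decay)
    return load


for elapsed in (60, 120, 300):
    one_minute = simulate_load(0.0, 16, elapsed, window=60.0)
    five_minute = simulate_load(0.0, 16, elapsed, window=300.0)
    print(elapsed, round(one_minute, 2), round(five_minute, 2))
```

In this model, starting from an idle system, the 1-minute average reaches roughly 10.1 after one minute of 16 runnable processes and roughly 15.9 after five minutes, so a reading taken shortly after stress starts is expected to sit below 16.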

Check the CPU usage of each process: every stress process gets only about 23% of a CPU, because 16 stress processes are competing for 4 CPUs.

[root@localhost shell]# pidstat -u 5 5
Linux 3.10.0-1062.el7.x86_64 (localhost.localdomain)    2020年07月07日     _x86_64_    (4 CPU)

14时38分26秒   UID       PID    %usr %system  %guest    %CPU   CPU  Command
14时38分31秒     0      1891    0.00    0.18    0.00    0.18     2  X
14时38分31秒     0      2581    0.18    0.18    0.00    0.37     1  gnome-shell
14时38分31秒     0      7225   23.29    0.00    0.00   23.29     3  stress
14时38分31秒     0      7226   23.48    0.00    0.00   23.48     0  stress
14时38分31秒     0      7227   23.11    0.00    0.00   23.11     3  stress
14时38分31秒     0      7228   23.29    0.00    0.00   23.29     0  stress
14时38分31秒     0      7229   23.29    0.00    0.00   23.29     3  stress
14时38分31秒     0      7230   23.11    0.00    0.00   23.11     2  stress
14时38分31秒     0      7231   23.11    0.00    0.00   23.11     1  stress
14时38分31秒     0      7232   23.11    0.00    0.00   23.11     0  stress
14时38分31秒     0      7233   23.29    0.00    0.00   23.29     1  stress
14时38分31秒     0      7234   23.11    0.00    0.00   23.11     2  stress
14时38分31秒     0      7235   23.29    0.00    0.00   23.29     1  stress
14时38分31秒     0      7236   23.48    0.00    0.00   23.48     2  stress
14时38分31秒     0      7237   23.29    0.00    0.00   23.29     0  stress
14时38分31秒     0      7238   23.29    0.00    0.00   23.29     3  stress
14时38分31秒     0      7239   23.11    0.00    0.00   23.11     1  stress
14时38分31秒     0      7240   22.92    0.00    0.00   22.92     2  stress
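The roughly 23% per process is close to what simple arithmetic predicts when 16 equally busy processes share 4 CPUs. The sketch below works through that arithmetic with the %CPU values from the sample above; the function name is hypothetical, and the even split is an idealization that ignores the other processes on the system.

```python
def fair_cpu_share(cpu_count, busy_processes):
    """Ideal %CPU per process when CPU-bound processes share the CPUs evenly."""
    if cpu_count < 1 or busy_processes < 1:
        raise ValueError("both counts must be at least 1")
    if busy_processes <= cpu_count:
        return 100.0  # every process can have a CPU to itself
    return 100.0 * cpu_count / busy_processes


# %CPU column of the 16 stress processes in the pidstat sample above.
OBSERVED = [
    23.29, 23.48, 23.11, 23.29, 23.29, 23.11, 23.11, 23.11,
    23.29, 23.11, 23.29, 23.48, 23.29, 23.29, 23.11, 22.92,
]

ideal = fair_cpu_share(4, 16)
mean_observed = sum(OBSERVED) / len(OBSERVED)
print(ideal, round(mean_observed, 2))
```

The ideal share is 25% per process and the measured average is about 23.2%. At any instant only 4 of the 16 processes can run, so the rest sit in the run queue, which is why the load average climbs well above the CPU count.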