Every run of uptime prints a line like this:
root@user:~# uptime
20:54:13 up 428 days, 4:28, 7 users, load average: 0.00, 0.10, 0.55
The meaning of each field is documented in man uptime:
man uptime
UPTIME(1) User Commands UPTIME(1)
NAME
uptime - Tell how long the system has been running.
SYNOPSIS
uptime [options]
DESCRIPTION
uptime gives a one line display of the following information. The current time, how long the system has been running, how many users are currently logged on,
and the system load averages for the past 1, 5, and 15 minutes.
This is the same information contained in the header line displayed by w(1).
System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using
the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the
three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the
time while on a 4 CPU system it means it was idle 75% of the time.
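The command mainly shows how long the system has been running. The first field is the current time, the second is how long the system has been up, the third is the number of users currently logged in, and the fourth is the system load average over the past 1, 5, and 15 minutes.
20:54:13 //current time
up 428 days, 4:28, //how long the system has been running
7 users, //number of users currently logged in
load average: 0.00, 0.10, 0.55 //system load average over the past 1, 5, and 15 minutes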
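The field breakdown above can also be done in a script. The following is a minimal sketch, assuming the Linux output format shown above (the text "load average:" followed by three comma-separated numbers); `load_averages` is a hypothetical helper name, not part of uptime.

```shell
#!/bin/sh
# Extract the three load averages from one line of `uptime` output.
# `load_averages` is a hypothetical helper; it assumes the line contains
# "load average:" followed by three comma-separated numbers.
load_averages() {
    # Drop everything up to "load average:" and then remove the commas.
    printf '%s\n' "$1" | sed -e 's/.*load average: *//' -e 's/,//g'
}

line='20:54:13 up 428 days, 4:28, 7 users, load average: 0.00, 0.10, 0.55'
load_averages "$line"   # prints: 0.00 0.10 0.55
```

On Linux the same three numbers are also the first three fields of /proc/loadavg, so `cut -d' ' -f1-3 /proc/loadavg` avoids parsing the uptime line at all.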
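The first three fields are self-explanatory, but what is the system load average in the fourth? First of all, it is not CPU usage. According to the man page, it is the average number of processes per unit of time that are in the **runnable or uninterruptible state** (the man page spells the latter "uninterruptable").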
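So how should the load average be interpreted?
If the load average is 2, the meaning depends on the number of CPUs: on a system with 2 CPUs, every CPU is exactly fully occupied; on a system with 4 CPUs, the CPUs are idle 50% of the time; on a system with a single CPU, half of the processes are competing for CPU time they cannot get.
CPU usage close to 100% is an obvious warning sign, but what counts as a reasonable load average? By the explanation above, the ideal case is exactly one process running on each CPU, so we first need to know how many CPUs the system has: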
grep -c 'model name' /proc/cpuinfo
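Once the CPU count is known, the rule is simple: when the load average exceeds the number of CPUs, the system is overloaded. In production, a common rule of thumb is to start investigating once the load average exceeds 70% of the CPU count, because a high load slows the system's responses and eventually affects the service.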
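The two thresholds above can be written down as a small check. This is only a sketch of the rule of thumb, not a standard tool: `load_status` and its three labels are hypothetical names, and 0.7 is the 70% figure from the text.

```shell
#!/bin/sh
# Classify a load average against the CPU count.
# `load_status` and its labels are hypothetical; 0.7 is the rule of thumb
# from the text, not a value defined by the kernel or by uptime.
load_status() {
    # $1 = load average, $2 = number of CPUs
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        load += 0; cpus += 0                 # force numeric comparison
        if (load > cpus)            print "overloaded"
        else if (load > cpus * 0.7) print "investigate"
        else                        print "ok"
    }'
}

# Values taken from the uptime samples in this article (1 CPU machine).
load_status 0.55 1   # prints: ok
load_status 7.91 1   # prints: overloaded
```

To check a live system, the inputs could come from `cut -d' ' -f1 /proc/loadavg` for the 1-minute load and from the grep command above (or `nproc`) for the CPU count.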
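But does a high load average necessarily mean high CPU usage?
Look at the definition again: the load average counts not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O. CPU usage, by contrast, measures how busy the CPU is per unit of time, so the two do not always match.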
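The definition can be made concrete by counting process states. The sketch below is a rough illustration only: it counts an instantaneous snapshot rather than an average, and `count_active` is a hypothetical helper that expects one process state per line on standard input.

```shell
#!/bin/sh
# Count entries whose state starts with R (runnable) or D (uninterruptible
# sleep), the two states that contribute to the load average.
# `count_active` is a hypothetical helper; it reads one state per line.
count_active() {
    awk '$1 ~ /^[RD]/ { n++ } END { print n + 0 }'
}

# Illustrative input rather than live data, so the result is reproducible:
# two runnable entries, one in uninterruptible sleep, the rest sleeping or idle.
printf '%s\n' R S D S S R I | count_active   # prints: 3
```

On a live system the input could come from `ps -eo state=`. The ps process itself is running at that moment, so it is included in the count.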
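That leaves three possibilities: CPU-intensive processes use a lot of CPU and raise the load average, so the two agree; I/O-intensive processes raise the load average while waiting for I/O, but CPU usage is not necessarily high; and a large number of processes waiting to be scheduled also raises the load average, with CPU usage high as well.
To verify these three cases, first install a few performance testing tools: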
apt-get install -y stress
apt-get install -y stress-ng
apt-get install -y sysstat
First check the current load average:
root@user:~# uptime
21:53:02 up 428 days, 5:27, 7 users, load average: 0.01, 0.01, 0.00
In the first terminal, run stress to simulate 100% CPU usage:
stress --cpu 1 --timeout 600
In the second terminal, watch the system load average in real time:
root@user:~# watch -d uptime # -d highlights the fields that change
Every 2.0s: uptime Sat Jul 13 21:56:17 2019
21:56:17 up 428 days, 5:30, 7 users, load average: 0.12, 0.04, 0.01
In the third terminal, use mpstat to watch how CPU usage changes:
root@user:~/soft/systat/sysstat-11.5.5# mpstat -P ALL 5
Linux 4.4.0-105-generic (user) 07/13/2019 _x86_64_ (1 CPU)
10:06:23 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:06:28 PM all 99.58 0.00 0.42 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:28 PM 0 99.58 0.00 0.42 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:28 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:06:33 PM all 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:33 PM 0 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:33 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:06:38 PM all 99.78 0.00 0.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:38 PM 0 99.78 0.00 0.22 0.00 0.00 0.00 0.00 0.00 0.00 0.00
In the fourth terminal, use pidstat to find out which process is using the CPU:
root@user:~# pidstat -u 5 1
Linux 4.4.0-105-generic (user) 07/13/2019 _x86_64_ (1 CPU)
10:18:43 PM UID PID %usr %system %guest %wait %CPU CPU Command
10:18:48 PM 0 9165 0.21 0.00 0.00 0.00 0.21 0 AliYunDun
10:18:48 PM 0 17652 100.00 0.00 0.00 0.64 100.00 0 stress
Average: UID PID %usr %system %guest %wait %CPU CPU Command
Average: 0 9165 0.21 0.00 0.00 0.00 0.21 - AliYunDun
Average: 0 17652 100.00 0.00 0.00 0.64 100.00 - stress
root@user:~# ps -ef| grep 17652
root 17652 17651 98 22:18 pts/5 00:00:53 stress --cpu 1 --timeout 600
root 17769 17695 0 22:19 pts/8 00:00:00 grep --color=auto 17652
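Terminal 2 shows the 1-minute load average gradually approaching 1.
Terminal 3 shows CPU 0 at close to 100% usage.
Terminal 4 shows that the stress process accounts for that CPU usage.
Next, in terminal 1, run stress-ng to simulate I/O pressure: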
root@user:~# stress-ng -i 1 --hdd 1 --timeout 600
stress-ng: info: [17352] dispatching hogs: 1 hdd, 1 iosync
stress-ng: info: [17352] cache allocate: default cache size: 33792K
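In terminal 2, check the load average:
root@user:~# watch -d uptime # -d highlights the fields that change
The load average climbs above 3.
In terminal 3, check CPU usage and I/O wait: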
root@user:~/soft/systat/sysstat-11.5.5# mpstat -P ALL 5
Linux 4.4.0-105-generic (user) 07/13/2019 _x86_64_ (1 CPU)
10:15:38 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:15:43 PM all 0.45 0.00 7.80 91.76 0.00 0.00 0.00 0.00 0.00 0.00
10:15:43 PM 0 0.45 0.00 7.80 91.76 0.00 0.00 0.00 0.00 0.00 0.00
10:15:43 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:15:48 PM all 0.43 0.00 13.36 86.21 0.00 0.00 0.00 0.00 0.00 0.00
10:15:48 PM 0 0.43 0.00 13.36 86.21 0.00 0.00 0.00 0.00 0.00 0.00
10:15:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:15:53 PM all 0.65 0.00 9.29 90.06 0.00 0.00 0.00 0.00 0.00 0.00
10:15:53 PM 0 0.65 0.00 9.29 90.06 0.00 0.00 0.00 0.00 0.00 0.00
10:15:53 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:15:58 PM all 0.67 0.00 7.78 91.56 0.00 0.00 0.00 0.00 0.00 0.00
10:15:58 PM 0 0.67 0.00 7.78 91.56 0.00 0.00 0.00 0.00 0.00 0.00
10:15:58 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:16:03 PM all 0.65 0.00 14.07 85.28 0.00 0.00 0.00 0.00 0.00 0.00
10:16:03 PM 0 0.65 0.00 14.07 85.28 0.00 0.00 0.00 0.00 0.00 0.00
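CPU usage is low (%sys peaks at about 14%), while %iowait stays between roughly 85% and 92%.
In terminal 4, check which process is responsible for the I/O: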
root@user:~# pidstat -u 5 1
Linux 4.4.0-105-generic (user) 07/13/2019 _x86_64_ (1 CPU)
10:21:25 PM UID PID %usr %system %guest %wait %CPU CPU Command
10:21:30 PM 0 156 0.00 0.90 0.00 0.68 0.90 0 jbd2/vda1-8
10:21:30 PM 0 9165 0.45 0.00 0.00 0.00 0.45 0 AliYunDun
10:21:30 PM 0 15654 0.23 0.00 0.00 0.00 0.23 0 watch
10:21:30 PM 0 17918 0.00 15.32 0.00 0.90 15.32 0 stress-ng-hdd
10:21:30 PM 0 20490 0.00 2.03 0.00 0.45 2.03 0 kworker/u2:2
Average: UID PID %usr %system %guest %wait %CPU CPU Command
Average: 0 156 0.00 0.90 0.00 0.68 0.90 - jbd2/vda1-8
Average: 0 9165 0.45 0.00 0.00 0.00 0.45 - AliYunDun
Average: 0 15654 0.23 0.00 0.00 0.00 0.23 - watch
Average: 0 17918 0.00 15.32 0.00 0.90 15.32 - stress-ng-hdd
Average: 0 20490 0.00 2.03 0.00 0.45 2.03 - kworker/u2:2
root@user:~# ps -ef| grep 17918
root 17918 17917 14 22:20 pts/5 00:00:15 stress-ng -i 1 --hdd 1 --timeout 600
root 18083 17695 0 22:22 pts/8 00:00:00 grep --color=auto 17918
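The stress-ng hdd worker (stress-ng-hdd) stands out with about 15% %system, which points to stress-ng as the source of the I/O load.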
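Since pidstat -u reports CPU time rather than I/O, another way to look for the culprit is to list the processes that are in uninterruptible sleep at that moment. The sketch below assumes input lines of the form "STATE PID COMMAND"; `blocked_procs` is a hypothetical helper, and the sample input is illustrative (it reuses PIDs from the output above but is not captured output).

```shell
#!/bin/sh
# Print the PID and command of processes in uninterruptible sleep (state D).
# `blocked_procs` is a hypothetical helper; input columns: STATE PID COMMAND.
blocked_procs() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Illustrative input, not captured from a real system.
printf '%s\n' \
    'S 9165 AliYunDun' \
    'D 17918 stress-ng-hdd' \
    'R 15654 watch' | blocked_procs   # prints: 17918 stress-ng-hdd
```

On a live system the input could come from `ps -eo state=,pid=,comm=`. The sysstat package also offers `pidstat -d 5 1`, which reports per-process I/O statistics directly.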
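When more processes want to run than the CPUs can serve, processes have to wait for the CPU.
In terminal 1, use stress to start 8 CPU-bound processes: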
stress -c 8 --timeout 600
In terminal 2, check the load average:
Every 2.0s: uptime Sat Jul 13 22:30:47 2019
22:30:57 up 428 days, 6:05, 4 users, load average: 7.91, 5.70, 3.19
The load average approaches the number of stress processes (8).
In terminal 3, check CPU usage:
root@user:~/soft/systat/sysstat-11.5.5# mpstat -P ALL 5
Linux 4.4.0-105-generic (user) 07/13/2019 _x86_64_ (1 CPU)
10:27:21 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:26 PM all 99.56 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:26 PM 0 99.56 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:26 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:31 PM all 99.80 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:31 PM 0 99.80 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:31 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:36 PM all 99.57 0.00 0.43 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:36 PM 0 99.57 0.00 0.43 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:36 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:41 PM all 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:41 PM 0 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:41 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:46 PM all 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:46 PM 0 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:46 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:27:51 PM all 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:27:51 PM 0 99.79 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00
CPU usage is at essentially 100%.
In terminal 4, check which processes are driving the CPU usage:
root@user:~# pidstat -u 5 1
Linux 4.4.0-105-generic (user) 07/13/2019 _x86_64_ (1 CPU)
10:29:30 PM UID PID %usr %system %guest %wait %CPU CPU Command
10:29:35 PM 0 9165 0.21 0.21 0.00 0.00 0.43 0 AliYunDun
10:29:35 PM 0 18501 13.52 0.00 0.00 93.99 13.52 0 stress
10:29:35 PM 0 18502 13.52 0.00 0.00 94.21 13.52 0 stress
10:29:35 PM 0 18503 11.16 0.00 0.00 100.00 11.16 0 stress
10:29:35 PM 0 18504 14.38 0.00 0.00 83.48 14.38 0 stress
10:29:35 PM 0 18505 13.52 0.00 0.00 94.21 13.52 0 stress
10:29:35 PM 0 18506 13.52 0.00 0.00 94.21 13.52 0 stress
10:29:35 PM 0 18507 13.52 0.00 0.00 94.42 13.52 0 stress
10:29:35 PM 0 18508 13.30 0.00 0.00 93.78 13.30 0 stress
root@user:~# ps -ef| grep 18503
root 18503 18500 12 22:27 pts/5 00:00:36 stress -c 8 --timeout 600
root 18948 17695 0 22:32 pts/8 00:00:00 grep --color=auto 18503
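The stress processes are responsible for the high CPU usage. Each one gets only about 13% of the CPU and spends most of its time waiting for it (%wait is mostly above 90%).
This gives a practical feel for the relationship between load average and CPU usage, and for how to tell the different situations apart.
Reference: https://time.geekbang.org/column/article/69618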