中午了,正在吃着黄焖鸡外卖,突然手机短信声音响了,一看一台mysql数据库服务器的cpu,IO使用率查过告警阀值了,都快到100%。赶紧放下黄焖鸡,快速登录到有问题的mysql数据库服务器。
下面模拟一下当时排查的步骤,使用的mysql5.7.26版本
用iostat看一下IO情况
[mysql@localhost ~]$ iostat -mxt 1
Linux 3.10.0-1062.9.1.el7.x86_64 (localhost.localdomain) 08/23/2020 _x86_64_ (1 CPU)
08/23/2020 03:32:41 PM
avg-cpu: %user %nice %system %iowait %steal %idle
5.49 0.00 78.02 6.59 0.00 9.89
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 3460.44 0.00 54.07 0.00 32.00 0.89 0.26 0.26 0.00 0.26 89.45
scd0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 3460.44 0.00 54.07 0.00 32.00 0.90 0.26 0.26 0.00 0.26 89.56
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
用top看看一下cpu使用情况,看看什么进程导致cpu高
top - 15:37:21 up 8:50, 5 users, load average: 0.45, 0.19, 0.12
Tasks: 144 total, 1 running, 143 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.0 us, 1.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 2175376 total, 663640 free, 599256 used, 912480 buff/cache
KiB Swap: 2097148 total, 2096884 free, 264 used. 1417460 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40474 mysql 20 0 1266400 324264 9496 S 96.7 14.9 3:42.00 mysqld
40916 root 20 0 204492 12108 4304 S 2.7 0.6 1:22.59 iotop
1 root 20 0 125520 3716 2344 S 0.0 0.2 0:05.60 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
4 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
6 root 20 0 0 0 0 S 0.0 0.0 0:01.45 ksoftirqd/0
果然是mysql进程导致cpu使用高,知道了是mysql数据库导致的,可是还是不知道什么原因,如果是新手,这个时候是不是有点慌,无从下手的感觉,如果是老手,肯定知道大概率是慢sql的问题,其它的原因概率比较低,例如mysql的bug,业务qps暴增等等。
于是用show看看mysql线程情况
mysql> show full processlist;
+----+------+-----------+--------+---------+------+----------+-----------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+--------+---------+------+----------+-----------------------+
| 8 | root | localhost | sbtest | Query | 0 | starting | show full processlist |
| 10 | root | localhost | NULL | Sleep | 28 | | NULL |
+----+------+-----------+--------+---------+------+----------+-----------------------+
2 rows in set (0.00 sec)
在实际的生产上,结果可能有好几十,上百个结果,甚至有上千,要从这么多的线程里排查出导致cpu高的那个,简直是折磨人,这个不仅仅要经验足,还得对跑的业务非常熟悉,知道那些业务,那些大表可能会导致cpu高,你想想公司里有多少人满足这个条件,估计屈指可数。
那么有没有更直接,更直观的工具进行精确定位呢,在mysql5.7以后的版本,performance_schema.threads这个性能视图中,就有linxu系统的的线程ID,有了这个ID,就把mysql和操作系统精密的关联在一起了,下面就带大家,一步一步从Linux线程到MySQL慢sql精确定位
1.找到mysql服务的进程PID
[mysql@localhost ~]$ ps -ef|grep -i mysqld|grep -v grep
mysql 39463 39441 0 14:21 pts/0 00:00:00 /bin/sh /u02/mysql/bin/mysqld_safe --defaults-file=/u02/conf/my3308.cnf
mysql 40474 39463 7 14:21 pts/0 00:07:57 /u02/mysql/bin/mysqld --defaults-file=/u02/conf/my3308.cnf --basedir=/u02/mysql --datadir=/u02/data/3308 --plugin-dir=/u02/mysql/lib/plugin --log-error=/u02/log/3308/error.log --open-files-limit=65535 --pid-file=/u02/run/3308/mysqld.pid --socket=/u02/run/3308/mysql.sock --port=3308
在这里mysql的PID为:40474
2.用top找到cpu消耗最大的线程ID
[mysql@localhost ~]$ top -H -p 40474
Threads: 33 total, 1 running, 32 sleeping, 0 stopped, 0 zombie
%Cpu(s):100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 2175376 total, 670468 free, 592304 used, 912604 buff/cache
KiB Swap: 2097148 total, 2096884 free, 264 used. 1424364 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40624 mysql 20 0 1266400 324264 9496 R 99.3 14.9 5:14.97 mysqld
40474 mysql 20 0 1266400 324264 9496 S 0.0 14.9 0:05.29 mysqld
40475 mysql 20 0 1266400 324264 9496 S 0.0 14.9 0:00.00 mysqld
40476 mysql 20 0 1266400 324264 9496 S 0.0 14.9 0:00.00 mysqld
40477 mysql 20 0 1266400 324264 9496 S 0.0 14.9 0:00.00 mysqld
消耗cpu资源最高的线程PID为:40624
3.根据系统的线程PID,精确定位慢sql
连接mysql数据库,使用以下sql语句,进行慢sql的精确定位
mysql> select PROCESSLIST_ID,THREAD_OS_ID,PROCESSLIST_USER,PROCESSLIST_HOST,PROCESSLIST_DB,PROCESSLIST_COMMAND,PROCESSLIST_INFO from performance_schema.threads where THREAD_OS_ID=40624;
+----------------+--------------+------------------+------------------+----------------+---------------------+-------------------------------------------+
| PROCESSLIST_ID | THREAD_OS_ID | PROCESSLIST_USER | PROCESSLIST_HOST | PROCESSLIST_DB | PROCESSLIST_COMMAND | PROCESSLIST_INFO |
+----------------+--------------+------------------+------------------+----------------+---------------------+-------------------------------------------+
| 8 | 40624 | root | localhost | sbtest | Query | select count(*) from sbtest1 a ,sbtest1 b |
+----------------+--------------+------------------+------------------+----------------+---------------------+-------------------------------------------+
1 row in set (0.00 sec)
“select count(*) from sbtest1 a ,sbtest1 b”
这条sql语句就是导致cpu暴增的元凶,找到之后,该怎么办,当然是先快速恢复业务,将这个select语句的连接kill掉。
[mysql@localhost ~]$ /u02/mysql/bin/mysql -uroot -proot --socket=/u02/run/3308/mysql.sock
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 5.7.26-log Source distribution
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>kill 8;
到这里,整个cpu使用率100%导致业务响应速度变慢,定位到慢sql,恢复业务的全过程,这个技能你get了吗