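In day-to-day operations, a SQL statement running on a MySQL server can make CPU usage spike. How can we quickly identify the offending statement with the tools already at hand and take emergency action? This article starts from conventional per-thread CPU monitoring at the operating-system level and uses the mapping between operating-system thread IDs and MySQL thread IDs to track down the abnormal SQL statement and its transaction step by step.
MySQL is a single-process, multi-threaded database. A process is an instance of a running program; a thread is the smallest unit the operating system can schedule.
MySQL uses a single-process, multi-threaded design because one process can run many threads concurrently on different CPUs, and all of those threads share the same memory address space. Since threads communicate within a single address space, no additional inter-process communication mechanism is required, which makes communication simpler and faster.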
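The figure above shows the background threads of InnoDB, MySQL's storage engine, which are covered in detail in "InnoDB存储引擎解密" (InnoDB Storage Engine Demystified). InnoDB background threads keep the server running and carry out the work that users submit. They include the master thread, page cleaner thread, purge thread, read thread, write thread, redo log thread, insert buffer thread, monitor thread, error monitor thread, and lock monitor thread.
The master thread is the core background thread. It asynchronously flushes data from the buffer pool to disk to keep data consistent, which covers flushing dirty pages, merging the insert buffer, and reclaiming undo pages. Internally it consists of several loops (the main loop, background loop, flush loop, and suspend loop) and switches between them according to the state of the database. In versions before InnoDB 1.0.x, the main loop performs two groups of operations: those run once per second and those run once every 10 seconds.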
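1) Operations performed once per second include:
Even if a transaction has not yet committed, InnoDB still flushes the contents of the redo log buffer to the redo log every second.
2) Operations performed once every 10 seconds include:
In this stage, InnoDB first checks whether fewer than 200 disk I/O operations occurred in the past 10 seconds. If so, it concludes that the disk has spare I/O capacity and flushes 100 dirty pages to disk. Next it merges the insert buffer and then flushes the log buffer to disk. Finally it performs a full purge to remove undo pages that are no longer needed: it checks whether rows already marked as deleted can actually be removed and, if they can, deletes them immediately.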
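The order of the 10-second operations can be sketched in a few lines of Python. This is only an illustrative model of the description above, not InnoDB source code: the function name `ten_second_actions` and the action strings are made up for this example, and the threshold of 200 I/O operations is the hard-coded value mentioned in the text.

```python
from typing import List


def ten_second_actions(io_ops_last_10s: int, io_threshold: int = 200) -> List[str]:
    """Return, in order, the actions of the 10-second branch described above.

    Hypothetical helper for illustration only; names are not InnoDB's own.
    """
    actions = []
    # Fewer than 200 I/O operations in the last 10 s means spare I/O capacity,
    # so 100 dirty pages are flushed first.
    if io_ops_last_10s < io_threshold:
        actions.append("flush 100 dirty pages")
    # The remaining steps always run, in this order.
    actions.append("merge insert buffer")
    actions.append("flush log buffer")
    actions.append("full purge")
    return actions
```

Calling `ten_second_actions(150)` yields all four steps, while `ten_second_actions(500)` skips the dirty-page flush because the disk is considered busy.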
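Before InnoDB 1.0.x, the engine limited its own I/O: the number of pages flushed from the buffer pool to disk was hard-coded. As disk hardware got faster, these fixed values held back InnoDB's disk I/O performance. From 1.0.x onward, InnoDB therefore provides the parameter innodb_io_capacity, which represents the disk's I/O throughput and defaults to 200. The number of pages flushed to disk is then controlled as a percentage of innodb_io_capacity.
The command show engine innodb status; shows information about the master thread: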
=====================================
2021-08-22 20:38:37 140186194323200 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 12 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 1 srv_active, 0 srv_shutdown, 255 srv_idle
srv_master_thread log flush and writes: 0
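After a transaction commits, its undo log may no longer be needed, so the purge thread reclaims undo pages that have been allocated and used. Its job is to physically delete records and remove undo log entries.
For example, consider the statement delete from tb1 where pk=1;
The record on the page is only marked as deleted, for two reasons: 1) the transaction may need to roll back, so the record is kept for now; 2) while transaction 1 is deleting pk=1 and has not yet committed, transaction 2 must still be able to see the pk=1 record (transaction isolation). Delete-marked records are also handled differently depending on the filter condition, as shown in the following table:
Records flagged delete-mark are therefore eventually reclaimed by the purge thread. Purge checks whether any other transaction still references the record's undo information and deletes the record if none does. Starting with InnoDB 1.2, multiple purge threads are supported, which speeds up undo page reclamation; because undo pages are read in a scattered pattern, this also makes better use of the disk's random-read performance. In MySQL 8.0 the default is 4.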
mysql> show variables like 'innodb_purge_threads';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| innodb_purge_threads | 4 |
+----------------------+-------+
1 row in set (0.00 sec)
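The page cleaner thread was introduced in InnoDB 1.2.x. It moves the dirty-page flushing that earlier versions performed in the master thread into a dedicated thread, which lightens the master thread's workload and reduces blocking of user query threads.
InnoDB makes heavy use of asynchronous I/O to handle write I/O requests, and the I/O threads are mainly responsible for handling the callbacks of those requests. InnoDB has four kinds of I/O thread: write, read, insert buffer, and log.
The I/O threads can be observed with SHOW ENGINE INNODB STATUS: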
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
854 OS file reads, 207 OS file writes, 38 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
On Linux, finding the MySQL process ID is straightforward: ps -ef|grep mysqld shows the system process ID. In the output below, the MySQL process ID is 1479:
[root@tango-GDB-DB01 ~]# ps -ef|grep mysqld
mysql 1479 1090 1 16:59 ? 00:00:03 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=tango-GDB-DB01.err --pid-file=/usr/local/mysql/data/tango-GDB-DB01.pid --socket=/tmp/mysql.sock
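Linux offers several ways to display information about individual threads:
1) top -H shows thread information
top displays the state of each thread in real time. Its -H option lists all Linux threads, and adding -p restricts the output to the threads of a specific process, as shown below: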
[root@tango-GDB-DB01 ~]# top -H -p 1479
top - 17:10:44 up 11 min, 2 users, load average: 0.00, 0.05, 0.05
Threads: 39 total, 0 running, 39 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1867024 total, 1051452 free, 495804 used, 319768 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 1185628 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1479 mysql 20 0 1318236 367380 16280 S 0.0 19.7 0:01.55 mysqld
1527 mysql 20 0 1318236 367380 16280 S 0.0 19.7 0:00.00 mysqld
1528 mysql 20 0 1318236 367380 16280 S 0.0 19.7 0:00.00 mysqld
2) The -T option of ps enables the thread view
The following command lists all threads created by the process with PID 1479.
ps -aT -p
[root@tango-GDB-DB01 ~]# ps -aT -p 1479
PID SPID TTY TIME CMD
1479 1479 ? 00:00:01 mysqld
1479 1527 ? 00:00:00 mysqld
1479 1528 ? 00:00:00 mysqld
3) ps -Lef also lists threads
[root@tango-GDB-DB01 ~]# ps -Lef|grep 1479
UID PID PPID LWP C NLWP STIME TTY TIME CMD
mysql 1479 1090 1479 0 39 16:59 ? 00:00:01 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=tango-GDB-DB01.err --pid-file=/usr/local/mysql/data/tango-GDB-DB01.pid --socket=/tmp/mysql.sock
mysql 1479 1090 1527 0 39 16:59 ? 00:00:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=tango-GDB-DB01.err --pid-file=/usr/local/mysql/data/tango-GDB-DB01.pid --socket=/tmp/mysql.sock
In this output, PID is the process ID, LWP is the thread ID, C is the CPU utilization, and NLWP is the number of threads in the thread group.
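When the thread list needs to be processed by a script rather than read by eye, the ps -Lef columns described above can be parsed directly. The following is a minimal sketch, assuming the ten-column layout shown in the output (UID PID PPID LWP C NLWP STIME TTY TIME CMD); the function name `parse_ps_lef` is hypothetical.

```python
from typing import Dict, List


def parse_ps_lef(output: str, pid: int) -> List[Dict[str, int]]:
    """Extract the threads (LWPs) of one process from `ps -Lef` output."""
    threads = []
    for line in output.splitlines():
        # CMD is the 10th column and may itself contain spaces,
        # so split into at most 10 fields.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[1].isdigit():
            continue  # skip the header line and anything malformed
        if int(fields[1]) != pid:
            continue  # a different process that merely matched the grep
        threads.append({
            "pid": int(fields[1]),   # process ID
            "lwp": int(fields[3]),   # thread ID
            "cpu": int(fields[4]),   # C column: CPU utilization
            "nlwp": int(fields[5]),  # number of threads in the thread group
        })
    return threads
```

Filtering on the PID column instead of relying on grep alone avoids picking up unrelated lines that happen to contain the same digits.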
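The pstack command prints the stack traces of a process. Note that running pstack briefly stalls the mysqld process, so do not run it during peak business hours.
1) Connect to MySQL and run the following statements
[root@tango-GDB-DB01 local]# mysql -h192.168.112.121 -P3306 -uroot -p -A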
mysql> begin;
mysql> begin;select count(1),sleep(2000) from tango.t2 for update;
2) Run show processlist and find the processlist_id
mysql> show processlist;
+----+-----------------+----------------------+-------+---------+------+------------------------+-------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+----------------------+-------+---------+------+------------------------+-------------------------------------------+
| 10 | root | tango-GDB-DB01:50620 | tango | Query | 52 | User sleep | select count(1),sleep(2000) from tango.t2 for update |
| 11 | root | localhost | NULL | Query | 0 | init | show processlist |
+----+-----------------+----------------------+-------+---------+------+------------------------+-------------------------------------------+
3 rows in set (0.00 sec)
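3) Find the operating-system thread ID that corresponds to the MySQL internal thread
Starting with MySQL 5.7, the performance_schema.threads table has a THREAD_OS_ID column that records the operating-system thread ID of each MySQL internal thread.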
mysql> select * from performance_schema.threads where processlist_id=10\G
*************************** 1. row ***************************
THREAD_ID: 49
NAME: thread/sql/one_connection
TYPE: FOREGROUND
PROCESSLIST_ID: 10
PROCESSLIST_USER: root
PROCESSLIST_HOST: tango-GDB-DB01
PROCESSLIST_DB: tango
PROCESSLIST_COMMAND: Query
PROCESSLIST_TIME: 84
PROCESSLIST_STATE: User sleep
PROCESSLIST_INFO: select count(1),sleep(2000) from tango.t2 for update
PARENT_THREAD_ID: NULL
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: SSL/TLS
THREAD_OS_ID: 1657
RESOURCE_GROUP: USR_default
1 row in set (0.00 sec)
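This gives us both MySQL's internal THREAD_ID (49) and the corresponding operating-system thread ID, THREAD_OS_ID (1657).
4) Look up the corresponding operating-system thread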
[root@tango-GDB-DB01 ~]# top -H -p 1479
top - 17:41:22 up 41 min, 3 users, load average: 0.00, 0.01, 0.05
Threads: 39 total, 0 running, 39 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 1.5 sy, 0.0 ni, 98.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1867024 total, 483128 free, 513864 used, 870032 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 1158844 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1657 mysql 20 0 1318236 377188 16536 S 0.0 20.2 0:00.04 mysqld
top -H shows the resource usage of that thread, including CPU and memory.
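During a real CPU spike the lookup usually runs in the other direction: take the busiest threads from top and join them to performance_schema.threads on THREAD_OS_ID. The sketch below illustrates that join under a few assumptions: the top output was captured in batch mode (for example with top -H -b -n 1 -p 1479), the rows of performance_schema.threads have already been fetched into dictionaries by whatever client you use, and the helper names `parse_top_threads` and `hottest_sessions` are made up for this example.

```python
from typing import Dict, List, Tuple


def parse_top_threads(output: str) -> List[Tuple[int, float]]:
    """Return (os_thread_id, cpu_percent) pairs from `top -H -b` output, busiest first."""
    rows = []
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":  # column header: thread rows follow
            in_table = True
            continue
        if in_table and len(fields) >= 12 and fields[0].isdigit():
            # Column 1 is the thread ID (shown as PID with -H), column 9 is %CPU.
            rows.append((int(fields[0]), float(fields[8])))
    return sorted(rows, key=lambda r: r[1], reverse=True)


def hottest_sessions(top_output: str, thread_rows: List[Dict], limit: int = 3) -> List[Dict]:
    """Join the busiest OS threads to MySQL threads via THREAD_OS_ID."""
    by_os_id = {int(r["THREAD_OS_ID"]): r for r in thread_rows
                if r.get("THREAD_OS_ID") is not None}
    result = []
    for os_tid, cpu in parse_top_threads(top_output):
        row = by_os_id.get(os_tid)
        if row is None:
            continue  # no matching row was fetched for this OS thread
        result.append({
            "os_thread_id": os_tid,
            "cpu_percent": cpu,
            "thread_id": row.get("THREAD_ID"),
            "processlist_id": row.get("PROCESSLIST_ID"),
            "sql": row.get("PROCESSLIST_INFO"),
        })
        if len(result) == limit:
            break
    return result
```

Keeping the parsing separate from the join means the same join can be reused with thread IDs collected by other tools.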
1) Query the current statements table
mysql> select * from performance_schema.events_statements_current WHERE THREAD_ID = 49\G
*************************** 1. row ***************************
THREAD_ID: 49
EVENT_ID: 20
END_EVENT_ID: NULL
EVENT_NAME: statement/sql/select
SOURCE: init_net_server_extension.cc:96
TIMER_START: 10615744184481000
TIMER_END: 10743215769959000
TIMER_WAIT: 127471585478000
LOCK_TIME: 121000000
SQL_TEXT: select count(1),sleep(2000) from tango.t2 for update
DIGEST: 91558228446c9877c86805735096b85b8afe5a8148c4ea9c315a1948464ab27f
DIGEST_TEXT: SELECT COUNT (?) , `sleep` (?) FROM `tango` . `t2` FOR UPDATE
CURRENT_SCHEMA: tango
OBJECT_TYPE: NULL
OBJECT_SCHEMA: NULL
OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: NULL
MYSQL_ERRNO: 0
RETURNED_SQLSTATE: NULL
MESSAGE_TEXT: NULL
ERRORS: 0
WARNINGS: 0
ROWS_AFFECTED: 0
ROWS_SENT: 0
ROWS_EXAMINED: 0
CREATED_TMP_DISK_TABLES: 0
CREATED_TMP_TABLES: 0
SELECT_FULL_JOIN: 0
SELECT_FULL_RANGE_JOIN: 0
SELECT_RANGE: 0
SELECT_RANGE_CHECK: 0
SELECT_SCAN: 1
SORT_MERGE_PASSES: 0
SORT_RANGE: 0
SORT_ROWS: 0
SORT_SCAN: 0
NO_INDEX_USED: 1
NO_GOOD_INDEX_USED: 0
NESTING_EVENT_ID: 18
NESTING_EVENT_TYPE: TRANSACTION
NESTING_EVENT_LEVEL: 0
STATEMENT_ID: 37
1 row in set (0.00 sec)
2) Run SHOW ENGINE INNODB STATUS\G to check the transaction state:
---TRANSACTION 2715147, ACTIVE 480 sec
mysql tables in use 1, locked 1
2 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 10, OS thread handle 140127942276864, query id 37 tango-GDB-DB01 192.168.112.121 root User sleep
select count(1),sleep(2000) from tango.t2 for update
Trx read view will not see trx with id >= 2715147, sees < 2715147
Here the MySQL connection ID is 10 and the OS thread handle is 140127942276864.
3) Look at the statement history for the thread
mysql> select * from performance_schema.events_statements_history WHERE THREAD_ID = 49 limit 1 \G
*************************** 1. row ***************************
THREAD_ID: 49
EVENT_ID: 17
END_EVENT_ID: 18
EVENT_NAME: statement/sql/begin
SOURCE: init_net_server_extension.cc:96
TIMER_START: 10288037810168000
TIMER_END: 10288037917673000
TIMER_WAIT: 107505000
LOCK_TIME: 0
SQL_TEXT: begin
DIGEST: 55fa5810fbb2760e86d578526176c1497b134d4ef3dd0863dd78b1c5e819848c
DIGEST_TEXT: BEGIN
CURRENT_SCHEMA: tango
OBJECT_TYPE: NULL
OBJECT_SCHEMA: NULL
OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: NULL
MYSQL_ERRNO: 0
RETURNED_SQLSTATE: 00000
MESSAGE_TEXT: NULL
ERRORS: 0
WARNINGS: 0
ROWS_AFFECTED: 0
ROWS_SENT: 0
ROWS_EXAMINED: 0
CREATED_TMP_DISK_TABLES: 0
CREATED_TMP_TABLES: 0
SELECT_FULL_JOIN: 0
SELECT_FULL_RANGE_JOIN: 0
SELECT_RANGE: 0
SELECT_RANGE_CHECK: 0
SELECT_SCAN: 0
SORT_MERGE_PASSES: 0
SORT_RANGE: 0
SORT_ROWS: 0
SORT_SCAN: 0
NO_INDEX_USED: 0
NO_GOOD_INDEX_USED: 0
NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
NESTING_EVENT_LEVEL: 0
STATEMENT_ID: 28
1 row in set (0.00 sec)
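The TIMER_START, TIMER_END, TIMER_WAIT and LOCK_TIME columns in the two outputs above are expressed in picoseconds, which makes the raw numbers hard to read. A small conversion helper makes them comparable with the Time column of show processlist; the function name `ps_to_seconds` is just an illustrative choice.

```python
PICOSECONDS_PER_SECOND = 10 ** 12


def ps_to_seconds(picoseconds: int) -> float:
    """Convert a performance_schema timer value (picoseconds) to seconds."""
    return picoseconds / PICOSECONDS_PER_SECOND


# TIMER_WAIT of the running select ... for update statement shown above.
select_wait_s = ps_to_seconds(127471585478000)
# TIMER_WAIT of the earlier begin statement from the history table.
begin_wait_s = ps_to_seconds(107505000)
```

With these values, the select has been running for roughly 127 seconds, while the begin statement took only about 0.0001 seconds.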
1) Use show processlist to find long-running SQL
mysql> show processlist;
+----+-----------------+----------------------+-------+---------+-------+------------------------+------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+----------------------+-------+---------+-------+------------------------+------------------------------------------------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 11423 | Waiting on empty queue | NULL |
| 10 | root | tango-GDB-DB01:50620 | tango | Query | 816 | User sleep | select count(1),sleep(2000) from tango.t2 for update |
2) Reverse-look up the processlist id from the operating-system thread ID
mysql> select * from performance_schema.threads WHERE THREAD_OS_ID = 1657\G
*************************** 1. row ***************************
THREAD_ID: 49
NAME: thread/sql/one_connection
TYPE: FOREGROUND
PROCESSLIST_ID: 10
PROCESSLIST_USER: root
PROCESSLIST_HOST: tango-GDB-DB01
PROCESSLIST_DB: tango
PROCESSLIST_COMMAND: Query
PROCESSLIST_TIME: 1050
PROCESSLIST_STATE: User sleep
PROCESSLIST_INFO: select count(1),sleep(2000) from tango.t2 for update
PARENT_THREAD_ID: NULL
ROLE: NULL
INSTRUMENTED: YES
HISTORY: YES
CONNECTION_TYPE: SSL/TLS
THREAD_OS_ID: 1657
RESOURCE_GROUP: USR_default
1 row in set (0.00 sec)
3) Kill the processlist id from the mysql client
mysql> kill 10;
Query OK, 0 rows affected (0.00 sec)
This terminates the long-running transaction or thread.
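If several sessions have to be cleaned up at once, the KILL statements can be generated from the show processlist rows instead of being typed by hand. This is a rough sketch, assuming the rows have been loaded into dictionaries keyed by the processlist column names; the function name `kill_statements` and the 600-second default threshold are arbitrary choices for illustration. It only builds the statements and leaves the decision to execute them to the operator.

```python
from typing import Dict, List


def kill_statements(processlist_rows: List[Dict], min_seconds: int = 600,
                    query_only: bool = False) -> List[str]:
    """Build KILL statements for queries running at least `min_seconds`.

    query_only=True emits KILL QUERY, which stops the statement but keeps
    the connection; the default KILL also closes the connection.
    """
    verb = "KILL QUERY" if query_only else "KILL"
    statements = []
    for row in processlist_rows:
        # Skip daemons, sleeping connections and anything below the threshold.
        if row["Command"] != "Query" or int(row["Time"]) < min_seconds:
            continue
        statements.append(f"{verb} {int(row['Id'])};")
    return statements
```

For the processlist shown above, only Id 10 qualifies: the event_scheduler row has a larger Time value but its Command is Daemon, so it is left alone.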
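1) The OS thread handle 140127942276864 found in step 2.2 (the OS thread handle is an ID the process uses internally to identify its threads) is a decimal value and must first be converted to hexadecimal: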
mysql> select lower(conv(140127942276864, 10, 16));
+--------------------------------------+
| lower(conv(140127942276864, 10, 16)) |
+--------------------------------------+
| 7f721438f700 |
+--------------------------------------+
1 row in set (0.00 sec)
2) Use pstack to find the operating-system thread ID associated with this handle:
[root@tango-GDB-DB01 ~]# pstack 1479 |grep 7f721438f700
Thread 3 (Thread 0x7f721438f700 (LWP 1657)):
This shows LWP=1657, which matches the THREAD_OS_ID value above. LWP stands for Light-Weight Process.
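The two steps above (decimal-to-hex conversion, then searching the pstack output) can also be scripted. The sketch below assumes pstack prints thread headers in the form shown above, "Thread N (Thread 0x... (LWP n)):"; the helper names `handle_to_hex` and `lwp_for_handle` are hypothetical.

```python
import re
from typing import Optional

# Matches pstack thread headers such as:
#   Thread 3 (Thread 0x7f721438f700 (LWP 1657)):
THREAD_HEADER = re.compile(
    r"Thread\s+\d+\s+\(Thread\s+0x([0-9a-f]+)\s+\(LWP\s+(\d+)\)\)"
)


def handle_to_hex(handle: int) -> str:
    """Convert the decimal OS thread handle to lowercase hex without the 0x prefix."""
    return format(handle, "x")


def lwp_for_handle(handle: int, pstack_output: str) -> Optional[int]:
    """Return the LWP (OS thread ID) whose pstack header carries this handle."""
    target = handle_to_hex(handle)
    for line in pstack_output.splitlines():
        match = THREAD_HEADER.search(line)
        if match and match.group(1) == target:
            return int(match.group(2))
    return None  # handle not present in this pstack output
```

This performs the same conversion as the conv() query and the same lookup as the grep, in one place.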