故障分析 | 是谁偷走了我的 IO

作者:孙桥

爱可生DBA团队成员,主要负责MySQL和DMP平台日常的维持工作,对数据库自动化运维存有浓厚兴趣。

本文来源:原创投稿

*爱可生开源社区出品,原创内容未经授权不得随意使用,转载请联系小编并注明来源。


前言

对于刚接触 MySQL 数据库不久的小伙伴来说,或多或少会遇到一些棘手的小问题;比如在数据库磁盘无损坏的情况下,我们经常会遇到数据库服务器的磁盘IO压力很大,但由于服务器上只有 MySQL 程序,我们第一想法肯定是某个连接用户在连到数据库后,背着我们在瞎搞着什么;然而直接通过 show processlist 命令,看到的结果不一定准确。下面模拟一个简单场景,让我们确切地定位问题所在。

场景模拟,制造 IO 压力

## 生成一张大表,用于测试[root@tidb-master ~]# /usr/share/sysbench/oltp_write_only.lua --mysql-host=10.186.61.36 --mysql-user=sun --mysql-password='xx' --mysql-port=3306 --mysql-db=testdb --tables=1 --table_size=15000000 --auto_inc=off --report-interval=2 --db-ps-mode=disable --time=100000 --threads=1 prepare
sysbench 1.0.17 (using system LuaJIT 2.0.4)

Creating table 'sbtest1'...
Inserting 15000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...## 对生成的表进行写操作
[root@tidb-master ~]# /usr/share/sysbench/oltp_write_only.lua --mysql-host=10.186.61.36 --mysql-user=sun --mysql-password='xx' --mysql-port=3306 --mysql-db=testdb --tables=1 --table_size=15000000 --auto_inc=off --report-interval=2 --db-ps-mode=disable --time=100000 --threads=1 run
sysbench 1.0.17 (using system LuaJIT 2.0.4)

Running the test with following options:
Number of threads: 1
Report intermediate results every 2 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 2s ] thds: 1 tps: 24.47 qps: 148.33 (r/w/o: 0.00/98.89/49.44) lat (ms,95%): 86.00 err/s: 0.00 reconn/s: 0.00
[ 4s ] thds: 1 tps: 53.02 qps: 318.62 (r/w/o: 0.00/212.58/106.04) lat (ms,95%): 48.34 err/s: 0.00 reconn/s: 0.00
[ 6s ] thds: 1 tps: 60.50 qps: 363.50 (r/w/o: 0.00/242.50/121.00) lat (ms,95%): 43.39 err/s: 0.00 reconn/s: 0.00
[ 8s ] thds: 1 tps: 65.50 qps: 391.02 (r/w/o: 0.00/260.01/131.01) lat (ms,95%): 39.65 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 1 tps: 55.50 qps: 334.51 (r/w/o: 0.00/223.51/111.00) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00
[ 12s ] thds: 1 tps: 57.00 qps: 340.99 (r/w/o: 0.00/227.00/114.00) lat (ms,95%): 38.25 err/s: 0.00 reconn/s: 0.00

问题分析

1、通过top命令发现io wait比较高,咦,What is going on?(假装不知道我在压测(●'◡'●))

top - 17:05:59 up 179 days,  7:05,  2 users,  load average: 0.84, 0.70, 0.93
Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
%Cpu0  :  1.3 us,  0.7 sy,  0.0 ni, 95.3 id,  2.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  6.9 us,  3.8 sy,  0.0 ni, 60.3 id, 29.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  2.4 us,  2.0 sy,  0.0 ni, 90.8 id,  4.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  1.7 us,  1.3 sy,  0.0 ni, 95.3 id,  1.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  1.4 us,  1.0 sy,  0.0 ni, 95.6 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  3.1 us,  1.4 sy,  0.0 ni, 81.4 id, 14.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.7 us,  1.0 sy,  0.0 ni, 96.6 id,  1.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  1.0 us,  0.3 sy,  0.0 ni, 96.6 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.3 us,  1.0 sy,  0.0 ni, 94.6 id,  4.1 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  2.4 us,  1.0 sy,  0.0 ni, 91.2 id,  5.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 14.7 us,  2.7 sy,  0.0 ni, 82.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  2.0 us,  1.0 sy,  0.0 ni, 96.0 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 12137024 total,   185984 free,  3987936 used,  7963104 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7150088 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                            
 8858 actiont+  20   0 7249940   1.1g  11892 S  30.0  9.3 210:38.75 mysqld                                                                                             
26447 tidb      20   0 4000364   1.0g   4740 S   5.3  9.0  10018:23 TiFlashMain                                                                                        
22761 actiont+  20   0 2526348  36188   9756 S   4.0  0.3   1197:43 ustats                                                                                             
29004 root      20   0 3784084 103676   2012 S   2.0  0.9   1802:29 beam.s

2、通过iostat命令进一步分析,我们看到磁盘在频繁的写操作,维持在26M/s左右(%uitl值比较大,一般来说,值大于75%是表明磁盘压力比较大了)

[root@tidb-tikv01 ~]# iostat  -xm 1 
Linux 3.10.0-862.14.4.el7.x86_64 (tidb-tikv01)  04/22/2022      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.21    0.00    0.44    0.18    0.10   98.07

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.01     0.18    3.53   22.99     0.08     0.25    25.84     0.19    7.16    5.64    7.40   0.71   1.88
scd0              0.00     0.00    0.00    0.00     0.00     0.00     6.00     0.00    0.50    0.50    0.00   0.50   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.22    0.00    1.19    8.28    0.09   88.23

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00  328.00 1800.00     5.12    26.24    30.17     3.18    1.49    4.68    0.91   0.41  88.30
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.63    0.00    1.03    6.68    0.09   90.57

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00  352.00 1918.00     5.50    27.82    30.06     2.96    1.31    3.21    0.96   0.37  83.60
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.54    0.00    1.96    8.77    0.09   83.65

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00  300.00 1682.00     4.69    25.68    31.38     3.09    1.55    3.68    1.17   0.45  88.60
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

3、通过iotop命令可以看到与MySQL相关线程ID 22233的IO消耗为52.59%,读为2.56M/s,同时线程ID 8879的写为27.31M/s(一般来说,IO列值越大,说明IO压力最有可能与其相关)

Total DISK READ :       5.29 M/s | Total DISK WRITE :      29.42 M/s
Actual DISK READ:       5.29 M/s | Actual DISK WRITE:      26.83 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                  
22233 be/4 actionte    2.56 M/s  473.07 K/s  0.00 % 52.59 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8879 be/4 actionte    0.00 B/s   27.31 M/s  0.00 % 17.76 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8887 be/4 actionte    0.00 B/s    0.00 B/s  0.00 % 10.10 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8907 be/4 actionte  116.33 K/s    0.00 B/s  0.00 %  6.27 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8930 be/4 actionte    7.76 K/s    0.00 B/s  0.00 %  3.45 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8910 be/4 actionte 1217.57 K/s    0.00 B/s  0.00 %  3.33 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 8911 be/4 actionte  372.25 K/s    0.00 B/s  0.00 %  1.53 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306
 2946 be/4 actionte    0.00 B/s 1985.34 B/s  0.00 %  1.52 % ustats
 8909 be/4 actionte  418.78 K/s    0.00 B/s  0.00 %  1.08 % mysqld --defaults-file=/opt/mysql/etc/3306/my.cnf --d~ --socket=/opt/mysql/data/3306/mysqld.sock --port=3306

4、将上面获取的系统线程ID带入到MySQL中,我们便可以查到连接用户sun一直在频繁写操作,现在我很怀疑你,但我先保持沉默

[root@tidb-tikv01 ~]# /opt/mysql/base/8.0.21/bin/mysql -S /opt/mysql/data/3306/mysqld.sock -pxx
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 15041
Server version: 8.0.21 MySQL Community Server - GPL

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from performance_schema.threads where THREAD_OS_ID=22233\G
*************************** 1. row ***************************
          THREAD_ID: 14880
               NAME: thread/sql/one_connection
               TYPE: FOREGROUND
     PROCESSLIST_ID: 14830           ### ID为148830连接在疯狂的进行写操作,show processlist可以看到正在进行的连接
   PROCESSLIST_USER: sun
   PROCESSLIST_HOST: 10.186.61.16
     PROCESSLIST_DB: testdb
PROCESSLIST_COMMAND: Query
   PROCESSLIST_TIME: 0
  PROCESSLIST_STATE: updating
   PROCESSLIST_INFO: UPDATE sbtest1 SET k=k+1 WHERE id=7899611
   PARENT_THREAD_ID: NULL
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: TCP/IP
       THREAD_OS_ID: 22233          ### THREAD_OS_ID为linux系统中的线程ID
     RESOURCE_GROUP: USR_default
1 row in set (0.00 sec)

mysql> select * from performance_schema.threads where THREAD_OS_ID=8879\G
*************************** 1. row ***************************
          THREAD_ID: 21
               NAME: thread/innodb/page_flush_coordinator_thread ### MySQL自身刷脏数据线程,因为用户的频繁写操作,导致该线程一直进行落盘操作
               TYPE: BACKGROUND
     PROCESSLIST_ID: NULL
   PROCESSLIST_USER: NULL
   PROCESSLIST_HOST: NULL
     PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: NULL
   PROCESSLIST_TIME: 264523
  PROCESSLIST_STATE: NULL
   PROCESSLIST_INFO: NULL
   PARENT_THREAD_ID: NULL
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: NULL
       THREAD_OS_ID: 8879
     RESOURCE_GROUP: SYS_default
1 row in set (0.00 sec)

mysql> show processlist;
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id    | User            | Host               | db     | Command          | Time   | State                                                         | Info                                                                                                 |
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
|     5 | event_scheduler | localhost          | NULL   | Daemon           | 264544 | Waiting on empty queue                                        | NULL                                                                                                 |
|     8 | universe_op     | 127.0.0.1:38362    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|     9 | universe_op     | 127.0.0.1:38374    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    10 | universe_op     | 127.0.0.1:38376    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    11 | universe_op     | 127.0.0.1:38380    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    12 | universe_op     | 127.0.0.1:38378    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    13 | universe_op     | 127.0.0.1:38384    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    14 | universe_op     | 127.0.0.1:38382    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
|    15 | universe_op     | 10.186.61.45:50322 | NULL   | Binlog Dump GTID | 264542 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |
|    17 | universe_op     | 127.0.0.1:38454    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    18 | universe_op     | 127.0.0.1:38456    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    19 | universe_op     | 127.0.0.1:38458    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    20 | universe_op     | 127.0.0.1:38460    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    21 | universe_op     | 127.0.0.1:38462    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    22 | universe_op     | 127.0.0.1:38464    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    23 | universe_op     | 127.0.0.1:38452    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|    24 | universe_op     | 127.0.0.1:38466    | NULL   | Sleep            |     14 |                                                               | NULL                                                                                                 |
|   142 | universe_op     | 127.0.0.1:33390    | NULL   | Sleep            |      0 |                                                               | NULL                                                                                                 |
| 14830 | sun             | 10.186.61.16:60730 | testdb | Query            |      0 | update                                                        | INSERT INTO sbtest1 (id, k, c, pad) VALUES (7493994, 6518751, '15795471662-35813531247-93886706048-2 |
| 15041 | root            | localhost          | NULL   | Query            |      0 | starting                                                      | show processlist                                                                                     |
+-------+-----------------+--------------------+--------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
20 rows in set (0.00 sec)

5、通过kill掉14830后,发现磁盘压力果然over了,你这“小阳人”( ̄ε(# ̄)☆╰╮o( ̄皿 ̄///)

mysql> kill 14830;
Query OK, 0 rows affected (0.00 sec)

mysql> exit
Bye
Segmentation fault
[root@tidb-tikv01 ~]# iostat -xm 1
Linux 3.10.0-862.14.4.el7.x86_64 (tidb-tikv01)  04/22/2022      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.21    0.00    0.44    0.18    0.10   98.06

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.01     0.18    3.83   24.55     0.09     0.27    26.12     0.19    6.78    5.46    6.99   0.69   1.95
scd0              0.00     0.00    0.00    0.00     0.00     0.00     6.00     0.00    0.50    0.50    0.00   0.50   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.85    0.00    0.59    0.00    0.00   97.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.00    0.00   61.00     0.00     0.46    15.52     0.01    0.20    0.00    0.20   0.18   1.10
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.36    0.00    0.84    0.08    0.00   95.71

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    10.00    1.00   53.00     0.02     0.48    18.81     2.87    0.22    1.00    0.21   1.31   7.10
scd0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

备注说明:

生产环境可不要随随便便kill哦,出问题是要被鞭尸的呦~

总结

对于一些新入门的小可爱来说,以上方法可以循序渐进的帮助你定位到是哪些连接线程把你的服务器IO折腾高的,可以让你更加了解Linux系统与MySQL之间的联动关系。老鸟嘛,emmm...

你可能感兴趣的:(linuxmysql)