Excessive MySQL Replication Lag Caused by Insufficient IO Capacity

Background

We received an alert: MySQL master-slave replication lag too high (instance: XXXX), current value: 1032.00, threshold: 1000 - 100000000, time: 12/30/2021 02:27:16.

Analysis

Normally, when replication lag shows up, the first things to check are the replica's replication status and its active sessions. By the time we looked into it the next morning, the lag was already gone. Fortunately, our monitoring and operations scripts are fairly thorough: the monitoring script runs once a minute and, whenever it detects lag, records both of those pieces of information. From the logs it kept, we could see that a 9-second lag was already present at 2021-12-30 02:02:48.
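
Below is a minimal sketch of what such a per-minute monitoring job might look like. It is an illustration rather than our actual script: the log path, the threshold, and the reliance on ~/.my.cnf for credentials are all assumptions.

#!/bin/bash
# Hypothetical per-minute cron job: whenever Seconds_Behind_Master exceeds a
# threshold, dump the replication status and the process list to a log file.
# Credentials are assumed to come from ~/.my.cnf; paths and values are examples.
LOG=/mysql/scripts/monitor/repl_lag_$(date +%F).log
THRESHOLD=5

LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')

if [ -n "$LAG" ] && [ "$LAG" != "NULL" ] && [ "$LAG" -ge "$THRESHOLD" ]; then
    {
        echo "=== $(date '+%F %T') lag=${LAG}s ==="
        mysql -e "SHOW SLAVE STATUS\G"
        mysql -e "SHOW PROCESSLIST;"
    } >> "$LOG"
fi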

  • Check the replication status

The replication status shows that both the IO thread and the SQL thread are running normally (Yes), that the replication lag Seconds_Behind_Master is 9 seconds, and that the SQL thread's state is Waiting for dependent transaction to commit, i.e. it is waiting for a transaction to commit.

[kingly@rhel7]$ 
mysql > show slave status\G
        Relay_Master_Log_File: master-bin.039806
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 927883985
              Relay_Log_Space: 929519379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 9
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
                  Master_UUID: 157bcec0-fc94-11e9-a77d-48df372eb155
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:10146560435-10186081960
            Executed_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:1-10186081526,
68eb64a3-fc94-11e9-b9c7-48df372eaef5:1-228645
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: 
           Master_TLS_Version:

  • Check the active sessions

Among the active sessions, several DELETE statements are stuck in the System lock state. The usual causes of System lock are: 1) a large number of small transactions, such as updates to single rows of a big table; 2) tables without a primary key; 3) an InnoDB lock conflict, where a session on the replica modifies the same data as the sql_thread; 4) excessive IO pressure. In addition, the other SQL applier (worker) threads are all waiting for transactions to commit (a performance_schema check on the applier workers is sketched after the listing).

[kingly@rhel7]$ 
mysql >show processlist;           
2021-12-30 02:02:48     NULL    NULL    114314  Waiting for master to send event        NULL
2021-12-30 02:02:48     NULL    NULL    0       Waiting for dependent transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_48 WHERE  id=45702564
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_75 WHERE  id=47490222
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_84 WHERE  id=46076462
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_95 WHERE  id=46705915
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
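
The per-worker state can also be pulled from performance_schema instead of being read off SHOW PROCESSLIST. A minimal check (MySQL 5.7+ with the multi-threaded applier; SERVICE_STATE only says whether a worker is running, so this supplements rather than replaces the process list):

SELECT WORKER_ID, THREAD_ID, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
  FROM performance_schema.replication_applier_status_by_worker;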

The next step was to determine whether an uncommitted transaction on the replica was blocking the SQL thread's replay and these DELETE operations. Analyzing the binlog around the time the problem started, roughly 02:00 on 2021-12-30, turned up no large transactions in that window, only a small number of single-row DELETEs by primary key, which should execute quickly. Moreover, our replica is configured read-only and normally runs no transactions of its own. The investigation therefore shifted to the server itself.
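
Both checks can be expressed roughly as follows. The time window and the binlog file name (taken from Relay_Master_Log_File above) are illustrative:

-- Is any long-running local transaction still open on the replica?
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
  FROM information_schema.innodb_trx
 ORDER BY trx_started;

# Decode the row events around the time the lag started (run against the
# master's binlog or the replica's relay logs).
mysqlbinlog --base64-output=decode-rows --verbose \
    --start-datetime="2021-12-30 01:55:00" \
    --stop-datetime="2021-12-30 02:10:00" \
    master-bin.039806 | less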

  • Check the server load

From the vmstat samples collected by the OSWBB scripts deployed on the server, we found that around 02:00 on 2021-12-30 only one or two tasks were runnable (the r column) while more than 40 were blocked (the b column), which is highly abnormal; the CPU columns also show roughly 10% of CPU time spent in IO wait. In other words, the server was under very heavy IO pressure at the time (a live cross-check with iostat is sketched after the output).

zzz ***Thu Dec 30 02:02:35 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 42 126088 1259068    404 110934224    0    0   169   166    0    0  0  0 99  0  0
 1 42 126088 1251800    404 110934512    0    0 612144   502 8380 4156  0  0 91  8  0
 2 41 126088 1252672    404 110934576    0    0 620544   358 8844 4115  0  1 90  9  0
zzz ***Thu Dec 30 02:02:50 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 41 126088 1256476    404 110936800    0    0   169   166    0    0  0  0 99  0  0
 1 41 126088 1249628    404 110937072    0    0 596992  2147 9037 4735  0  0 89 11  0
 2 42 126088 1247176    404 110937536    0    0 630704   605 12100 6991  1  0 87 12  0
zzz ***Thu Dec 30 02:03:05 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 42 126088 1251024    404 110939968    0    0   169   166    0    0  0  0 99  0  0
 3 37 126088 1242940    404 110940208    0    0 626448   475 8316 3920  0  0 89 10  0
 1 42 126088 1246844    404 110940472    0    0 606976 11179 12130 8121  0  0 88 11  0
zzz ***Thu Dec 30 02:03:20 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 39 126088 1247112    404 110943920    0    0   169   166    0    0  0  0 99  0  0
 1 43 126088 1240348    404 110944080    0    0 629248   442 9168 4325  0  0 88 11  0
 1 41 126088 1242900    404 110944848    0    0 578576   498 7578 3871  0  0 89 10  0
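
Had the incident been caught live, per-device saturation could also have been confirmed with iostat (a suggestion only; OSWBB archives iostat output as well, so the same numbers were likely already on disk):

# Extended device statistics every 5 seconds; look for sustained read
# throughput and %util pinned near 100 on the data volume.
iostat -xm 5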

  • Analyze the database logs

The database error log also offered a clue: a flood of [Note] InnoDB: page_cleaner messages, meaning dirty-page flushing was slow at the time, which further confirms the heavy IO pressure on the server (a quick check of the flushing-related settings is sketched after the log excerpt).

2021-12-30T01:55:17.492848+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270201857; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682298959400 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:57:18.156330+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270245889; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682611334700 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:59:20.712082+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270290945; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683053354600 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T02:01:20.422439+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270330881; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683501041500 waited (count) when Workers occupied = 2554254 waited when Workers occupied = 6778342662900
2021-12-30T02:01:33.788727+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5527ms. The settings might not be optimal. (flushed=74 and evicted=0, during the time.)
2021-12-30T02:01:52.969122+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18181ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:02:36.051478+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13227ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:03:22.916208+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270349313; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18770439067400 waited (count) when Workers occupied = 2554266 waited when Workers occupied = 6779684449500
2021-12-30T02:03:49.254684+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13291ms. The settings might not be optimal. (flushed=72 and evicted=0, during the time.)
2021-12-30T02:05:23.176365+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270361601; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18859577400400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:06:25.751734+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 19666ms. The settings might not be optimal. (flushed=70 and evicted=0, during the time.)
2021-12-30T02:07:38.989406+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18069ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:07:44.028187+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 141; events assigned = 270376961; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18970653888400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:08:22.735042+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 15828ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:09:32.283672+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 12917ms. The settings might not be optimal. (flushed=75 and evicted=0, during the time.)
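
The page_cleaner notes complain that a loop intended to take 1000ms ran for 10+ seconds, i.e. flushing could not keep up with the IO actually available. The flushing-related settings can be reviewed as below (a general sanity check; in this incident the settings were not the culprit):

SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('innodb_io_capacity', 'innodb_io_capacity_max',
   'innodb_page_cleaners', 'innodb_buffer_pool_instances');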

The real cause

This is a high-spec server that runs only the database service, with no other applications, hosts a single MySQL instance, and stores its data on SAN storage (rather extravagant), so why would it hit an IO bottleneck? Given the time the problem occurred, the backup job was the obvious suspect. The backup schedule is shown below: an incremental backup runs at 02:00 every day, plus a weekly full backup.

[kingly@rhel7]$ crontab -l
00 00 * * 7 sh /mysql/scripts/backup/mysql_backup.sh FULL 1>/dev/null 2>&1
00 2 * * *   sh /mysql/scripts/backup/mysql_backup.sh INCR 1>/dev/null 2>&1


Why had this never been a problem before, only showing up recently? The database holds several terabytes of data and the business has grown considerably of late, so the backups were taking longer and longer. To keep the backup IO from hogging the link and affecting other systems on the same SAN switch, we had agreed with the storage team to rate-limit this server's switch port. The consequence: when an xtrabackup backup starts, it copies the ibd files directly, which generates a very large burst of IO. The server itself has plenty of IO capacity, so this traffic backed up at the rate-limited switch port, which in turn slowed the server's writes to the SAN storage. That is why the server showed heavy IO wait, the SQL thread replayed events slowly, and replication lag appeared.
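
One alternative mitigation (a sketch, not what we actually deployed) is to throttle the backup at the source rather than at the switch port, using xtrabackup's --throttle option, which caps the number of IO operation pairs per second during the backup phase. The value and target directory below are purely illustrative:

# Cap backup IO (illustrative value) so the initial copy of the ibd files
# cannot saturate the SAN link; the target directory is an example path.
xtrabackup --backup --throttle=100 --target-dir=/mysql/backup/$(date +%F)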

So the cause of this MySQL replication lag was insufficient IO capacity.

Summary

There are many possible causes of MySQL replication lag, and I will write up more cases when I get the chance. Lag caused by insufficient IO capacity, however, is probably fairly uncommon; if your servers are modestly specced or weak on IO, it is worth keeping an eye out for it.

================ You are also welcome to follow my articles on these other platforms ===============
WeChat Official Account | 老王聊MySQL
Zhihu | 老王聊MySQL | https://www.zhihu.com/people/old_king
CSDN | 老王聊MySQL | https://blog.csdn.net/wly619?spm=1010.2135.3001.5421
Jianshu | 老王聊MySQL | https://www.jianshu.com/u/80e4c9cd3af5
