MySQL: 3 Bugs I Ran Into Recently and One Question About the Slow Query Log


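These are brief notes, for reference only. Two of the issues are related to the slow query log. Why? Because the MySQL slow query log really matters, and we pay a lot of attention to it.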


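I. FLUSH LOGS hangs

This bug is related to insufficient permissions on the error log. When mysqld lacks permission on the error log file, executing flush logs ends up in the following wait stack: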


[image: wait stack of the hung flush logs thread]

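Here reload_acl_and_cache is the common entry point for FLUSH-type commands. It runs a different code path depending on the command, and in this case it reopens all log files.
There is a problem, however. Reopening the error log requires the LOCK_error_log mutex. If the open fails, the failure itself has to be written to the error log, which acquires LOCK_error_log a second time. In other words, the same thread locks the same mutex twice, which is a self-deadlock, and the thread waits forever: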

Self-deadlock occurs when a single thread attempts to lock a mutex twice: the second attempt will block indefinitely. This can easily happen when the same resource is used at multiple levels within an algorithm.
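
The double-lock pattern can be sketched in Python, purely as an illustration. threading.Lock is non-reentrant, much like the mutex described above. The names error_log_lock, log_error, and reopen_error_log are hypothetical stand-ins and not MySQL code, and a timeout is used so that the sketch reports the deadlock instead of hanging:

```python
import threading

# Hypothetical stand-in for LOCK_error_log: a plain, non-reentrant mutex.
error_log_lock = threading.Lock()


def log_error(message, timeout=0.2):
    """Write to the 'error log'. Returns False if the lock could not be taken."""
    if not error_log_lock.acquire(timeout=timeout):
        return False
    try:
        return True  # a real logger would write `message` here
    finally:
        error_log_lock.release()


def reopen_error_log(open_fails, timeout=0.2):
    """Reopen the log under the lock; on failure, try to log the failure."""
    with error_log_lock:
        if open_fails:
            # Second acquisition by the same thread. Without the timeout,
            # this call would block forever (the self-deadlock).
            return log_error("could not reopen error log", timeout)
        return True
```

Calling reopen_error_log(True) returns False once the timeout expires, because the nested acquire can never succeed while the outer one is held. With a blocking lock and no timeout, as in the server, the call simply never returns.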

If debug mode is enabled, this error is detected and the server crashes outright:

safe_mutex: Trying to lock mutex at /opt/percona-server-locks-detail-5.7.22/sql/log.cc, line 2419, when the mutex was already locked at /opt/percona-server-locks-detail-5.7.22/sql/log.cc, line 2384 in thread T@2

Once this happens, the only option is a restart. The bug has been fixed in 5.7.24:

    Bug#27891472: FLUSH LOGS WITH NO LOG FILE PERMISSION LEADS HANG
    
    Changing the log file acquires the log-lock. If the file cannot
    be changed, we try to inform the user of this. To do this, we
    call the error logger, which in turn waits to acquire the log-lock,
    and thus hangs forever.

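If you are running an older version, you may need to watch out for this.

II. Inaccurate slow query log in 8.0.31 (bug pending confirmation)

This one is fairly simple. When a query uses a derived table that contains something like GROUP BY, the Rows_examined value in the slow query log covers only the top-level SELECT_LEX and does not include the derived table's SELECT_LEX. For example: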

select * from (select id,count(*) from myt1 group by id) a;

It can be reproduced as follows:

mysql> show create table myt1;
+-------+------------------------------------------------------------------------------------+
| Table | Create Table                                                                       |
+-------+------------------------------------------------------------------------------------+
| myt1  | CREATE TABLE `myt1` (
  `id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> set persist long_query_time=0.1;

mysql> insert into myt1 values(10);
...

mysql> insert into myt1 select *from myt1;
Query OK, 1048576 rows affected (6.78 sec)

mysql> select id,count(*) from myt1 group by id  ;
+------+----------+
| id   | count(*) |
+------+----------+
|   10 |  2097152 |
+------+----------+
1 row in set (1.15 sec)

mysql> select * from (select id,count(*) from myt1 group by id) a;
+------+----------+
| id   | count(*) |
+------+----------+
|   10 |  2097152 |
+------+----------+
1 row in set (1.09 sec)

slow log:

# Time: 2022-11-09T03:37:19.466846Z
# User@Host: root[root] @ localhost []  Id:    18
# Query_time: 1.144293  Lock_time: 0.000003 Rows_sent: 1  Rows_examined: 2097152
SET timestamp=1667965038;
select id,count(*) from myt1 group by id;
# Time: 2022-11-09T03:37:32.895778Z
# User@Host: root[root] @ localhost []  Id:    18
# Query_time: 1.091765  Lock_time: 0.000002 Rows_sent: 1  Rows_examined: 1
SET timestamp=1667965051;
select * from (select id,count(*) from myt1 group by id) a;

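The two statements obviously scan the same number of rows, yet for the statement with the derived table Rows_examined is only 1. This does not seem to happen in 5.7. It can lead to misjudging a statement's performance, so I consider it a fairly serious bug, especially since it affected me recently.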

The bug I filed:

  • https://bugs.mysql.com/bug.php?id=109034

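It is still pending confirmation.

III. Ambiguous error when the REPL_SLAVE_ACL privilege is missing (bug confirmed)

This is also a minor bug. I recently hit an error caused by this missing privilege, but the message reads like a wrong password: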

2022-11-07T22:30:01.777653+08:00 74 [ERROR] [MY-013120] [Repl] Slave I/O for channel '': Master command COM_REGISTER_SLAVE failed: Access denied for user 'mysql_innodb_cluster_3570435201'@'%' (using password: YES) (Errno: 1045), Error_code: MY-013120

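The cause is a flaw in the authorization logic. It should use the check_global_access function to check the global privilege. The check for a missing REPLICATION CLIENT privilege (REPL_CLIENT_ACL), for example, handles this well and reports: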

Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation

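This is 8.0.31 already, so issues like this deserve attention. For now, at least, when a wrong-password error shows up it seems we should also check the account's privileges. The bug I filed, which has been confirmed: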

  • https://bugs.mysql.com/bug.php?id=109023
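
Until the message is improved, a quick way to rule the privilege explanation in or out is to check the SHOW GRANTS output of the replication account for REPLICATION SLAVE on *.*. The sketch below works on the text lines of that output, assuming the usual `GRANT ... ON *.* TO ...` form. The function name is hypothetical, and privileges inherited through roles are not handled.

```python
import re

# Only global grants (ON *.*) matter for REPLICATION SLAVE.
GLOBAL_GRANT = re.compile(
    r"^GRANT\s+(?P<privs>.+?)\s+ON\s+\*\.\*\s+TO\s", re.IGNORECASE
)


def has_global_privilege(show_grants_lines, privilege="REPLICATION SLAVE"):
    """Return True if any global GRANT line lists the privilege (or ALL)."""
    wanted = privilege.upper()
    for line in show_grants_lines:
        m = GLOBAL_GRANT.match(line.strip())
        if not m:
            continue  # schema-level or table-level grant, or a role grant
        privs = {p.strip().upper() for p in m.group("privs").split(",")}
        if wanted in privs or "ALL PRIVILEGES" in privs or "ALL" in privs:
            return True
    return False
```

If this returns False for the account named in the error, the "Access denied ... (using password: YES)" message is more likely a privilege problem than a password problem.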

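IV. The slow query log and network transfer time

The question is whether the time recorded for a statement in the slow query log includes network transfer time. I think the answer has to start from how the slow query log works. The recorded time actually covers:

  • mysqld calls read to fetch the SQL statement from the socket's SO_RCVBUF (Recv-Q)
  • mysqld executes the statement
  • mysqld copies each result row into its net buffer
  • when the net buffer is full, mysqld calls write to push the data into SO_SNDBUF (Send-Q)

Because a socket buffer sits between mysqld and the TCP protocol, mysqld is done as soon as the data has been handed to SO_SNDBUF (Send-Q). Everything after that is TCP's job.
There is one case to consider, though. If the client program does not consume the data, its SO_RCVBUF (Recv-Q) can back up, and the TCP window may even drop to 0: the client's TCP stack advertises that it has no room left, so the server's TCP stack has to stop sending. In that situation mysqld's SO_SNDBUF (Send-Q) can fill up as well, the write blocks, and the slow query time then does include time spent sending data. The statement state is shown as Sending to client.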

For the details, see chapter 15 of TCP/IP Illustrated, Volume 1: The Protocols by the legendary Richard Stevens, and material on Linux network programming.
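
The effect of a peer that never reads can be reproduced without MySQL at all. The sketch below is only an illustration of the socket behaviour and not how mysqld is implemented: it opens a TCP connection over the loopback interface, never reads on one side, and keeps sending on the other side in non-blocking mode until the kernel refuses more data. The function name and parameters are hypothetical, and the number of bytes accepted before blocking depends on the platform's buffer sizes.

```python
import socket


def fill_send_path(chunk_size=65536, limit=1 << 30):
    """Send to a peer that never reads. Returns (bytes_accepted, blocked)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # any free port
    server.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())  # plays the stalled mysql client
    conn, _ = server.accept()             # plays the mysqld side

    conn.setblocking(False)  # a blocking socket would simply hang in send()
    chunk = b"a" * chunk_size
    sent = 0
    blocked = False
    try:
        while sent < limit:
            try:
                sent += conn.send(chunk)
            except BlockingIOError:
                # The kernel accepts no more data for now: the point at
                # which a blocking write() would start to wait.
                blocked = True
                break
    finally:
        conn.close()
        client.close()
        server.close()
    return sent, blocked
```

With a blocking socket, which is the situation described above, the sender would wait inside write at this point, and that waiting time is what ends up in Query_time.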

A simple test follows:


1. Create the table
create table  myt4(name longtext);
insert into myt4 values(repeat('a',100000));
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;

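2. In the mysql client, set a gdb breakpoint with b vio_read so that it blocks and stops reading data, simulating a client that does not process its data.

3. Start tcpdump to capture local port 3325

4. Run the statement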
select * from ttt.myt4;

5. Observe Recv-Q and Send-Q

Before:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp        0      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql         
tcp6       0      0 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld        
After:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp   944171      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql         
tcp6       0 656018 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld   

6. Check the statement: it has already finished executing and is not counted as a slow query
mysql> show processlist;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| 14 | root | localhost | ttt  | Query   |    0 | starting | show processlist |         0 |             0 |
| 24 | root | mgr3:2222 | NULL | Sleep   |   10 |          | NULL             |        16 |            16 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
2 rows in set (0.00 sec)

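The slow query log does not record this statement.

7. Release the data with gdb

Let the client through once every 5 seconds or so. After everything has been released: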
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp        0      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql         
tcp6       0      0 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld  

8. Analyze the tcpdump output

The query was issued at about 09:38:46
65  2022-11-05 09:38:46.768521  192.168.1.63    192.168.1.63    MySQL   95  65  Request Query { select * from ttt.myt4 } 
The last piece of data was returned to the client at about 09:39:20
141 2022-11-05 09:39:20.264467  192.168.1.63    192.168.1.63    MySQL   32488   141 Response [Malformed Packet]Response  OK 
9. With more data, the following happens
A Cartesian product of the 16-row table with itself is enough:
select * from ttt.myt4 a,ttt.myt4 b

mysql> show processlist;
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State             | Info                                | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
| 14 | root | localhost | ttt  | Query   |    0 | starting          | show processlist                    |         0 |             0 |
| 26 | root | mgr3:2226 | NULL | Query   |    9 | Sending to client | select * from ttt.myt4 a,ttt.myt4 b |        28 |             0 |
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
10. The slow query log records this statement

# Time: 2022-11-05T01:57:28.353335Z
# User@Host: root[root] @ mgr3 [192.168.1.63]  Id:    26
# Schema:   Last_errno: 0  Killed: 0
# Query_time: 62.718498  Lock_time: 0.000513  Rows_sent: 256  Rows_examined: 112  Rows_affected: 0
# Bytes_sent: 51203172
SET timestamp=1667613448;
select * from ttt.myt4 a,ttt.myt4 b;

In this case the packet capture shows TCP ZeroWindow messages like the following:


[image: packet capture showing TCP ZeroWindow]
