MySQL: Three bugs I ran into recently, and one question about the slow query log

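Brief notes, for reference only. Two of the four items concern the slow query log, because the slow log really matters in MySQL and we pay a lot of attention to it.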

I. FLUSH LOGS hangs

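This bug is tied to insufficient permissions on the error log. When mysqld lacks permission on the error log file, running flush logs produces the following wait stack: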

[Figure: wait stack of the thread hung in FLUSH LOGS]

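reload_acl_and_cache is the common entry point for FLUSH-type commands; it follows a different path depending on the command, and here it reopens all log files.
The problem is that reopening the error log requires the LOCK_error_log mutex. If the open fails, the failure itself has to be written to the error log, which takes LOCK_error_log a second time. The same thread therefore locks the same mutex twice, which is a self-deadlock, and it waits forever: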

Self-deadlock occurs when a single thread attempts to lock a mutex twice: the second attempt will block indefinitely. This can easily happen when the same resource is used at multiple levels within an algorithm.
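
The double-lock pattern can be sketched in a few lines of Python. This is only a model: error_log_lock, reopen_error_log and log_error are hypothetical names standing in for LOCK_error_log and the server's code paths, and the inner acquire uses a timeout so that the sketch returns instead of hanging the way the real server does.

```python
import threading

# Non-reentrant lock, a hypothetical stand-in for LOCK_error_log.
error_log_lock = threading.Lock()

def log_error(message, timeout=0.2):
    """Write to the error log; needs the log lock. Returns False if the lock never came."""
    if not error_log_lock.acquire(timeout=timeout):
        return False  # the real server has no timeout and waits forever here
    try:
        return True   # the message would be written at this point
    finally:
        error_log_lock.release()

def reopen_error_log(open_fails):
    """Reopen the error log under the log lock; report a failure through log_error."""
    with error_log_lock:                                  # first lock
        if open_fails:
            return log_error("cannot reopen error log")   # second lock, same thread
        return True

print(reopen_error_log(open_fails=False))  # True
print(reopen_error_log(open_fails=True))   # False: the second acquire timed out
```

A reentrant lock (threading.RLock in this sketch) would let the second acquire succeed, which is one common way to avoid this class of problem; the actual fix in the server may differ.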

If debug mode is enabled, this error is detected and the server crashes immediately:

safe_mutex: Trying to lock mutex at /opt/percona-server-locks-detail-5.7.22/sql/log.cc, line 2419, when the mutex was already locked at /opt/percona-server-locks-detail-5.7.22/sql/log.cc, line 2384 in thread T@2

Once this happens, the only way out is a restart. The bug has been fixed in 5.7.24:

    Bug#27891472: FLUSH LOGS WITH NO LOG FILE PERMISSION LEADS HANG
    
    Changing the log file acquires the log-lock. If the file cannot
    be changed, we try to inform the user of this. To do this, we
    call the error logger, which in turn waits to acquire the log-lock,
    and thus hangs forever.

If you run an older version, this is something to watch out for.

II. Inaccurate slow query log in 8.0.31 (bug pending confirmation)

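This one is simple. When a query uses a derived table that contains something like GROUP BY, the Rows_examined value in the slow log counts only the rows of the top-level SELECT_LEX and leaves out the SELECT_LEX of the derived table. A statement of this kind: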

select * from (select id,count(*) from myt1 group by id) a;

It can be reproduced as follows:

mysql> show create table myt1;
+-------+------------------------------------------------------------------------------------+
| Table | Create Table                                                                       |
+-------+------------------------------------------------------------------------------------+
| myt1  | CREATE TABLE `myt1` (
  `id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> set persist long_query_time=0.1;

mysql> insert into myt1 values(10);
...

mysql> insert into myt1 select *from myt1;
Query OK, 1048576 rows affected (6.78 sec)

mysql> select id,count(*) from myt1 group by id  ;
+------+----------+
| id   | count(*) |
+------+----------+
|   10 |  2097152 |
+------+----------+
1 row in set (1.15 sec)

mysql> select * from (select id,count(*) from myt1 group by id) a;
+------+----------+
| id   | count(*) |
+------+----------+
|   10 |  2097152 |
+------+----------+
1 row in set (1.09 sec)

slow log:

# Time: 2022-11-09T03:37:19.466846Z
# User@Host: root[root] @ localhost []  Id:    18
# Query_time: 1.144293  Lock_time: 0.000003 Rows_sent: 1  Rows_examined: 2097152
SET timestamp=1667965038;
select id,count(*) from myt1 group by id;
# Time: 2022-11-09T03:37:32.895778Z
# User@Host: root[root] @ localhost []  Id:    18
# Query_time: 1.091765  Lock_time: 0.000002 Rows_sent: 1  Rows_examined: 1
SET timestamp=1667965051;
select * from (select id,count(*) from myt1 group by id) a;

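The two statements clearly scan the same number of rows, yet for the statement with the derived table Rows_examined is only 1. As far as I can tell this does not happen in 5.7. It can lead to misjudging a statement's performance, so I consider it a fairly serious bug, and it has affected me recently.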
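
Because Rows_examined can undercount in this way, one possible triage aid is to flag slow log entries whose Query_time is long while Rows_examined is tiny. The sketch below parses the header format shown above; the function names and both thresholds are assumptions chosen for illustration, not part of MySQL.

```python
import re

# Matches the "# Query_time: ..." header line of a slow log entry.
HEADER = re.compile(
    r"# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+"
    r"Rows_sent: (\d+)\s+Rows_examined: (\d+)"
)

def parse_slow_log(text):
    """Return one dict per slow log entry with query_time, rows_examined and sql."""
    entries = []
    current = None
    for line in text.splitlines():
        match = HEADER.search(line)
        if match:
            current = {
                "query_time": float(match.group(1)),
                "rows_examined": int(match.group(4)),
                "sql": "",
            }
            entries.append(current)
        elif (current is not None and line.strip()
              and not line.startswith("#")
              and not line.startswith("SET timestamp=")):
            # Statement text; may span several lines.
            current["sql"] = (current["sql"] + " " + line.strip()).strip()
    return entries

def suspicious(entries, min_seconds=1.0, max_rows=10):
    """Entries that ran long but report almost no examined rows (assumed thresholds)."""
    return [e for e in entries
            if e["query_time"] >= min_seconds and e["rows_examined"] <= max_rows]

sample = """# Query_time: 1.144293  Lock_time: 0.000003 Rows_sent: 1  Rows_examined: 2097152
SET timestamp=1667965038;
select id,count(*) from myt1 group by id;
# Query_time: 1.091765  Lock_time: 0.000002 Rows_sent: 1  Rows_examined: 1
SET timestamp=1667965051;
select * from (select id,count(*) from myt1 group by id) a;
"""
for entry in suspicious(parse_slow_log(sample)):
    print(entry["sql"])
```

A flagged entry is only a hint to look closer; a query that waits on locks or on the client can also be slow with few examined rows.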

The bug report I filed:

https://bugs.mysql.com/bug.php?id=109034

It is still awaiting confirmation.

III. Misleading error when the REPL_SLAVE_ACL privilege is missing (bug confirmed)

This is another small bug. I recently hit a failure caused by this missing privilege, yet the error reported was a wrong password:

2022-11-07T22:30:01.777653+08:00 74 [ERROR] [MY-013120] [Repl] Slave I/O for channel '': Master command COM_REGISTER_SLAVE failed: Access denied for user 'mysql_innodb_cluster_3570435201'@'%' (using password: YES) (Errno: 1045), Error_code: MY-013120

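The cause is a flaw in the authorization logic. I think the check_global_access function should be used to check the global privilege here, so that the error names what is missing. When the REPLICATION CLIENT privilege is missing, for example, the message is handled well: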

Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation

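We are at 8.0.31 already, so details like this deserve attention by now. In the meantime, when a wrong-password error shows up, it seems we also need to check privileges, at least for the time being. The bug I filed, which has been confirmed: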

https://bugs.mysql.com/bug.php?id=109023
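
Following that advice, a small helper can turn such an error log line into the privilege check to run. This sketch only pattern-matches the message format shown above; grants_check_for is a hypothetical name, and SHOW GRANTS FOR is the standard statement for listing an account's privileges.

```python
import re

# Matches the COM_REGISTER_SLAVE failure shown in the error log line above.
REGISTER_DENIED = re.compile(
    r"COM_REGISTER_SLAVE failed: Access denied for user '([^']*)'@'([^']*)'"
)

def grants_check_for(error_line):
    """Return a SHOW GRANTS statement for the account named in the error, or None."""
    match = REGISTER_DENIED.search(error_line)
    if match is None:
        return None
    user, host = match.groups()
    return f"SHOW GRANTS FOR '{user}'@'{host}';"

line = ("2022-11-07T22:30:01.777653+08:00 74 [ERROR] [MY-013120] [Repl] "
        "Slave I/O for channel '': Master command COM_REGISTER_SLAVE failed: "
        "Access denied for user 'mysql_innodb_cluster_3570435201'@'%' "
        "(using password: YES) (Errno: 1045), Error_code: MY-013120")
print(grants_check_for(line))
```

Run the printed statement on the source server and check whether REPLICATION SLAVE appears in the output before resetting any password.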

IV. Slow queries and network transfer time

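The question is whether the time recorded for a statement in the slow log includes network transfer time. I think the answer has to start from how the slow log works. The recorded time covers these steps: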

mysqld calls read to fetch the SQL statement from the socket's SO_RCVBUF (Recv-Q)
mysqld executes the statement
mysqld copies each result row into its own net buffer
When the net buffer is full, mysqld calls write to push the data into SO_SNDBUF (Send-Q)

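Between mysqld and the TCP stack sits the socket buffer, so from mysqld's point of view the work is done as soon as the data has been copied into SO_SNDBUF (Send-Q); what happens afterwards is up to TCP.
There is one exception. If the client program does not process the data, its SO_RCVBUF (Recv-Q) backs up and the advertised TCP window can drop to 0, meaning the client's receive buffer is full and the server's TCP stack has to stop sending. In that case mysqld's SO_SNDBUF (Send-Q) can fill up as well, write blocks, the slow query time then includes the time spent sending data, and the statement state is Sending to client.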

For details, see Chapter 15 of TCP/IP Illustrated, Volume 1: The Protocols by the legendary Richard Stevens, and material on Linux network programming.
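
The buffering behaviour can be reproduced without MySQL. The sketch below is a stand-in under stated assumptions: it uses a local socket pair from socket.socketpair() rather than a real TCP connection, so there is no TCP window involved, and send_until_full and drain are hypothetical helper names. It shows that a sender keeps succeeding while the peer reads nothing, then hits the point where a blocking write would stall.

```python
import socket

def send_until_full(sock, chunk=b"a" * 65536):
    """Send without blocking until the kernel refuses more data; return bytes accepted."""
    sock.setblocking(False)
    accepted = 0
    while True:
        try:
            accepted += sock.send(chunk)
        except BlockingIOError:
            # Buffers are full: a blocking write() would stall at this point.
            return accepted

def drain(sock, expected):
    """Read `expected` bytes, like a client that finally processes its data."""
    sock.settimeout(5)
    received = 0
    while received < expected:
        data = sock.recv(65536)
        if not data:
            break
        received += len(data)
    return received

server_side, client_side = socket.socketpair()
buffered = send_until_full(server_side)   # the client is not reading yet
drained = drain(client_side, buffered)    # the client starts reading
print(f"accepted before blocking: {buffered} bytes, drained: {drained} bytes")
server_side.close()
client_side.close()
```

Everything counted in buffered was accepted instantly from the sender's point of view, which corresponds to a result set that fits in the socket buffers and never shows up as sending time.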

A simple test follows:


1. Create the table
create table  myt4(name longtext);
insert into myt4 values(repeat('a',100000));
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;

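2. In the mysql client, set a gdb breakpoint (b vio_read) so that the client blocks and stops reading, simulating a client that does not process its data.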

3. Start tcpdump to capture traffic on local port 3325

4. Run the statement
select * from ttt.myt4;

5. Observe Recv-Q and Send-Q

Before the query:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp        0      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql         
tcp6       0      0 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld        
After the query:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp   944171      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql         
tcp6       0 656018 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld   
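
In netstat output the second and third columns are Recv-Q and Send-Q. A short sketch (parse_queues is a hypothetical helper) pulls them out of the two lines above and adds them up. The total comes out close to the 16 rows of 100000 bytes held in myt4, which suggests that nearly the whole result set is parked in the two socket buffers.

```python
def parse_queues(netstat_line):
    """Return (recv_q, send_q) from one netstat connection line."""
    fields = netstat_line.split()
    return int(fields[1]), int(fields[2])

client_line = "tcp   944171      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql"
server_line = "tcp6       0 656018 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld"

client_recv_q, _ = parse_queues(client_line)   # bytes the client has not read yet
_, server_send_q = parse_queues(server_line)   # bytes mysqld handed over, still unsent or unacknowledged

in_flight = client_recv_q + server_send_q
table_payload = 16 * 100000                    # 16 rows of repeat('a',100000)
print(in_flight, table_payload)                # 1600189 1600000
```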

6. Check the statement: it has already finished and is not counted as a slow query
mysql> show processlist;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| 14 | root | localhost | ttt  | Query   |    0 | starting | show processlist |         0 |             0 |
| 24 | root | mgr3:2222 | NULL | Sleep   |   10 |          | NULL             |        16 |            16 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
2 rows in set (0.00 sec)

The slow log does not record this statement.

7. Release the data with gdb

Pause about 5 seconds between releases; once everything has been released:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp        0      0 192.168.1.63:2224       192.168.1.63:3325       ESTABLISHED 12584/mysql         
tcp6       0      0 192.168.1.63:3325       192.168.1.63:2224       ESTABLISHED 11404/mysqld  

8. Analyze the tcpdump output

The query was issued at about 09:38:46
65  2022-11-05 09:38:46.768521  192.168.1.63    192.168.1.63    MySQL   95  65  Request Query { select * from ttt.myt4 } 
The last data was returned to the client at about 09:39:20
141 2022-11-05 09:39:20.264467  192.168.1.63    192.168.1.63    MySQL   32488   141 Response [Malformed Packet]Response  OK 
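
The two capture timestamps above give the elapsed time as seen on the wire. A quick calculation, assuming both timestamps come from the same clock and time zone:

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two tcpdump lines above.
request_sent = datetime.strptime("2022-11-05 09:38:46.768521", FORMAT)
last_response = datetime.strptime("2022-11-05 09:39:20.264467", FORMAT)

elapsed = (last_response - request_sent).total_seconds()
print(f"{elapsed:.6f} seconds")  # 33.495946 seconds
```

So the client needed roughly 33 seconds to receive everything, while step 6 shows that mysqld considered the statement finished and wrote nothing to the slow log.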
9. With more data, the following happens
A Cartesian product of two copies of the 16-row table is enough:
select * from ttt.myt4 a,ttt.myt4 b

mysql> show processlist;
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State             | Info                                | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
| 14 | root | localhost | ttt  | Query   |    0 | starting          | show processlist                    |         0 |             0 |
| 26 | root | mgr3:2226 | NULL | Query   |    9 | Sending to client | select * from ttt.myt4 a,ttt.myt4 b |        28 |             0 |
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
10. The slow log records this statement

# Time: 2022-11-05T01:57:28.353335Z
# User@Host: root[root] @ mgr3 [192.168.1.63]  Id:    26
# Schema:   Last_errno: 0  Killed: 0
# Query_time: 62.718498  Lock_time: 0.000513  Rows_sent: 256  Rows_examined: 112  Rows_affected: 0
# Bytes_sent: 51203172
SET timestamp=1667613448;
select * from ttt.myt4 a,ttt.myt4 b;

In this case the packet capture shows TCP ZeroWindow notices.