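Brief notes, for reference only. Two of the issues below involve the slow query log, because the MySQL slow log really matters and we pay a lot of attention to it.
I. FLUSH LOGS hangs
This bug is tied to insufficient permissions on the error log: when mysqld lacks permission on the error log file, running FLUSH LOGS hangs, and the wait stack passes through reload_acl_and_cache.
reload_acl_and_cache is the common entry point for FLUSH-type commands. It runs a different code path depending on the command, and here it reopens all log files.
The problem is that reopening the error log requires the LOCK_error_log mutex. If the open fails, the failure must itself be written to the error log, which takes LOCK_error_log again. The same thread therefore locks the same mutex twice, which is a self-deadlock, and it waits forever: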
Self-deadlock occurs when a single thread attempts to lock a mutex twice: the second attempt will block indefinitely. This can easily happen when the same resource is used at multiple levels within an algorithm.
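The double-lock pattern described above can be reproduced in miniature. The following is a minimal Python sketch, not MySQL code: `error_log_lock`, `log_error`, and `reopen_error_log` are hypothetical names standing in for LOCK_error_log and the reopen and logging paths. A timeout is used on the second acquisition so the demonstration reports the blocked lock instead of hanging.

```python
import threading

# Hypothetical stand-in for LOCK_error_log: a plain, non-reentrant mutex.
error_log_lock = threading.Lock()


def log_error(message, timeout=0.2):
    """Try to write to the 'error log'; this needs the lock, like a real logger.

    Returns True if the lock was obtained, False if the attempt timed out.
    A real mutex lock has no timeout, so False here corresponds to a hang.
    """
    if not error_log_lock.acquire(timeout=timeout):
        return False
    try:
        return True  # the message would be written here
    finally:
        error_log_lock.release()


def reopen_error_log(can_open):
    """Reopen the log under the lock; on failure, report it via log_error."""
    with error_log_lock:
        if not can_open:
            # Second acquisition by the same thread: self-deadlock.
            return log_error("cannot reopen error log")
    return True
```

Calling `reopen_error_log(True)` succeeds, while `reopen_error_log(False)` returns False once the inner acquisition times out, which is the point where a mutex without a timeout would block indefinitely.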
In a debug build this error is detected and the server crashes immediately:
safe_mutex: Trying to lock mutex at /opt/percona-server-locks-detail-5.7.22/sql/log.cc, line 2419, when the mutex was already locked at /opt/percona-server-locks-detail-5.7.22/sql/log.cc, line 2384 in thread T@2
When this happens, the only way out is to restart the server. The bug has been fixed in 5.7.24:
Bug#27891472: FLUSH LOGS WITH NO LOG FILE PERMISSION LEADS HANG
Changing the log file acquires the log-lock. If the file cannot
be changed, we try to inform the user of this. To do this, we
call the error logger, which in turn waits to acquire the log-lock,
and thus hangs forever.
If you are running an older version, keep this issue in mind.
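On an affected version, one precaution is to confirm that the error log is writable before issuing FLUSH LOGS. The sketch below is only an illustration under assumptions: `error_log_is_writable` is a hypothetical helper, the path must be whatever the server's error log setting points to, and the check is only meaningful when run as the OS user that runs mysqld (root bypasses permission checks).

```python
import os


def error_log_is_writable(path):
    """Return True if the current OS user could open `path` for writing."""
    if os.path.exists(path):
        # An existing log must be a regular file we may write to.
        return os.path.isfile(path) and os.access(path, os.W_OK)
    # A missing log can only be created if its directory is writable.
    parent = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(parent) and os.access(parent, os.W_OK | os.X_OK)
```

This does not fix the bug; it only helps avoid triggering it until the server can be upgraded.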
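II. Inaccurate slow query log in 8.0.31 (bug pending confirmation)
This one is simple. When a statement such as GROUP BY runs inside a derived table, the Rows_examined value in the slow log only counts the rows of the top-level SELECT_LEX and does not include the derived table's SELECT_LEX. A statement like this shows it: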
select * from (select id,count(*) from myt1 group by id) a;
It can be reproduced as follows:
mysql> show create table myt1;
+-------+------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------------------------+
| myt1 | CREATE TABLE `myt1` (
`id` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
mysql> set persist long_query_time=0.1;
mysql> insert into myt1 values(10);
...
mysql> insert into myt1 select *from myt1;
Query OK, 1048576 rows affected (6.78 sec)
mysql> select id,count(*) from myt1 group by id ;
+------+----------+
| id | count(*) |
+------+----------+
| 10 | 2097152 |
+------+----------+
1 row in set (1.15 sec)
mysql> select * from (select id,count(*) from myt1 group by id) a;
+------+----------+
| id | count(*) |
+------+----------+
| 10 | 2097152 |
+------+----------+
1 row in set (1.09 sec)
slow log:
# Time: 2022-11-09T03:37:19.466846Z
# User@Host: root[root] @ localhost [] Id: 18
# Query_time: 1.144293 Lock_time: 0.000003 Rows_sent: 1 Rows_examined: 2097152
SET timestamp=1667965038;
select id,count(*) from myt1 group by id;
# Time: 2022-11-09T03:37:32.895778Z
# User@Host: root[root] @ localhost [] Id: 18
# Query_time: 1.091765 Lock_time: 0.000002 Rows_sent: 1 Rows_examined: 1
SET timestamp=1667965051;
select * from (select id,count(*) from myt1 group by id) a;
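Both statements clearly scan the same number of rows, yet the one with the derived table reports Rows_examined of only 1. As far as I can tell this does not happen in 5.7. It can lead to a wrong assessment of a statement's performance, so it is a fairly serious bug, and it affected me recently.
The bug report: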
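Until this is resolved, entries that ran for a long time but report very few examined rows deserve a second look. The following is a rough Python sketch for spotting them; `parse_slow_log`, `suspicious`, and the thresholds are hypothetical, and the regular expression only assumes the header layout shown in the slow log excerpt above. A low Rows_examined can also be legitimate (for example, time spent waiting on locks), so this is a hint, not proof.

```python
import re

# Matches the "# Query_time: ..." header line of a slow log entry.
HEADER_RE = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)

SAMPLE = """\
# Time: 2022-11-09T03:37:19.466846Z
# User@Host: root[root] @ localhost [] Id: 18
# Query_time: 1.144293 Lock_time: 0.000003 Rows_sent: 1 Rows_examined: 2097152
SET timestamp=1667965038;
select id,count(*) from myt1 group by id;
# Time: 2022-11-09T03:37:32.895778Z
# User@Host: root[root] @ localhost [] Id: 18
# Query_time: 1.091765 Lock_time: 0.000002 Rows_sent: 1 Rows_examined: 1
SET timestamp=1667965051;
select * from (select id,count(*) from myt1 group by id) a;
"""


def parse_slow_log(text):
    """Return one dict per entry with query_time, rows_examined and sql."""
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER_RE.match(line)
        if match:
            current = {
                "query_time": float(match["query_time"]),
                "rows_sent": int(match["rows_sent"]),
                "rows_examined": int(match["rows_examined"]),
                "sql": "",
            }
            entries.append(current)
        elif current and line and not line.startswith(("#", "SET timestamp")):
            current["sql"] = (current["sql"] + " " + line).strip()
    return entries


def suspicious(entries, min_seconds=1.0, max_rows=10):
    """Entries that were slow but claim to have examined almost nothing."""
    return [e for e in entries
            if e["query_time"] >= min_seconds and e["rows_examined"] <= max_rows]
```

Running `suspicious(parse_slow_log(SAMPLE))` on the two entries above returns only the derived-table statement.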
- https://bugs.mysql.com/bug.php?id=109034
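It is still pending confirmation.
III. Ambiguous error when the REPL_SLAVE_ACL privilege is missing (bug confirmed)
This is also a minor bug. I recently hit a failure caused by this missing privilege, but the error that was reported reads like a wrong password: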
2022-11-07T22:30:01.777653+08:00 74 [ERROR] [MY-013120] [Repl] Slave I/O for channel '': Master command COM_REGISTER_SLAVE failed: Access denied for user 'mysql_innodb_cluster_3570435201'@'%' (using password: YES) (Errno: 1045), Error_code: MY-013120
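The cause is a flaw in the authorization logic: it should use the check_global_access function to check the global privilege. Other checks already do this well. A missing REPLICATION CLIENT privilege, for example, produces a clear message: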
Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation
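By 8.0.31 this kind of detail should have been taken care of. So when a password error is reported, it seems we also need to check the privileges, at least for now. The submitted and confirmed bug: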
- https://bugs.mysql.com/bug.php?id=109023
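IV. Slow queries and network transfer time
The question is whether the time recorded for a statement in the slow log includes network transfer time. Answering it requires starting from how the slow log works. The recorded time covers these stages:
- mysqld calls read to fetch the SQL statement from the socket's SO_RCVBUF (Recv-Q)
- mysqld executes the statement
- mysqld copies each result row into its net buffer
- when the net buffer is full, mysqld calls write to push the data into SO_SNDBUF (Send-Q)
Because a socket buffer sits between mysqld and the TCP stack, mysqld is done as soon as the data is in SO_SNDBUF (Send-Q). Delivering it from there is the job of TCP.
There is one exception. If the client program does not consume the data, its SO_RCVBUF (Recv-Q) fills up, and the TCP window it advertises can drop to 0, meaning the receiver has no buffer space left and the server's TCP stack must stop sending. In that case mysqld's SO_SNDBUF (Send-Q) can fill up as well, mysqld blocks in write, the slow log then includes the time spent sending data, and the statement state is Sending to client.
For details, see the classic TCP/IP Illustrated, Volume 1: The Protocols by W. Richard Stevens, Chapter 15, and material on Linux network programming.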
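The back-pressure described above can be observed without MySQL. The following Python sketch is an illustration under assumptions: it uses `socket.socketpair()` as a stand-in for the client/server connection (so the exact buffer sizes differ from a real TCP connection), and `fill_until_blocked` is a hypothetical helper. One end writes without the other end ever reading, and the writer is non-blocking so that a full buffer shows up as an exception instead of a stall.

```python
import socket


def fill_until_blocked(chunk_size=65536, max_bytes=256 * 1024 * 1024):
    """Write to a peer that never reads; return (bytes_accepted, blocked)."""
    writer, silent_reader = socket.socketpair()
    try:
        writer.setblocking(False)  # a blocking socket would simply hang here
        chunk = b"a" * chunk_size
        sent_total = 0
        blocked = False
        while sent_total < max_bytes:
            try:
                # send() only copies into the kernel buffer; it says nothing
                # about whether the peer application has read the data.
                sent_total += writer.send(chunk)
            except BlockingIOError:
                blocked = True  # buffers are full: a blocking write would wait
                break
        return sent_total, blocked
    finally:
        writer.close()
        silent_reader.close()
```

The first writes are accepted immediately even though nobody reads them, which corresponds to a small result set finishing without delay. Once the buffers are full the writer can make no progress until the peer reads, which is the situation in which mysqld's elapsed time starts to include the client.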
A simple test follows:
1. Create the table
create table myt4(name longtext);
insert into myt4 values(repeat('a',100000));
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
insert into myt4 select * from myt4;
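2. In the mysql client, set a gdb breakpoint (b vio_read) so the client blocks and does not read data, simulating a client that does not process its results.
3. Start tcpdump capturing on local port 3325
4. Run the statement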
select * from ttt.myt4;
5. Observe Recv-Q and Send-Q
Before:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp 0 0 192.168.1.63:2224 192.168.1.63:3325 ESTABLISHED 12584/mysql
tcp6 0 0 192.168.1.63:3325 192.168.1.63:2224 ESTABLISHED 11404/mysqld
After:
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp 944171 0 192.168.1.63:2224 192.168.1.63:3325 ESTABLISHED 12584/mysql
tcp6 0 656018 192.168.1.63:3325 192.168.1.63:2224 ESTABLISHED 11404/mysqld
6. Check the statement: it has already finished executing and is not counted as a slow query
mysql> show processlist;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| 14 | root | localhost | ttt | Query | 0 | starting | show processlist | 0 | 0 |
| 24 | root | mgr3:2222 | NULL | Sleep | 10 | | NULL | 16 | 16 |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
2 rows in set (0.00 sec)
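The slow log does not record this statement.
7. Release the data with gdb
Continue roughly once every 5 seconds. Once everything has been released, the queues look like this: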
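A quick arithmetic check on the netstat figures suggests why the statement finished: the whole result set was already sitting in the two socket queues. The sketch below uses only numbers from the test above (16 rows of 100000 bytes, and the Recv-Q and Send-Q values). Reading the small remainder as protocol overhead is an assumption.

```python
rows = 16                # 1 row doubled four times by INSERT ... SELECT
bytes_per_row = 100_000  # repeat('a', 100000)
payload = rows * bytes_per_row

client_recv_q = 944_171  # Recv-Q on the mysql client socket
server_send_q = 656_018  # Send-Q on the mysqld socket
queued = client_recv_q + server_send_q

# prints: 1600000 1600189 189
print(payload, queued, queued - payload)
```

The queued bytes cover the full payload, so mysqld never had to wait for room in SO_SNDBUF and the statement completed quickly from the server's point of view.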
[root@mgr3 tmptbs]# netstat -anlp|grep 3325|grep 192.168.1.63
tcp 0 0 192.168.1.63:2224 192.168.1.63:3325 ESTABLISHED 12584/mysql
tcp6 0 0 192.168.1.63:3325 192.168.1.63:2224 ESTABLISHED 11404/mysqld
8. Parse the tcpdump result
The query was sent at about 09:38:46
65 2022-11-05 09:38:46.768521 192.168.1.63 192.168.1.63 MySQL 95 65 Request Query { select * from ttt.myt4 }
The last of the data was returned to the client at about 09:39:20
141 2022-11-05 09:39:20.264467 192.168.1.63 192.168.1.63 MySQL 32488 141 Response [Malformed Packet]Response OK
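The gap between the two captured packets can be computed directly. This small Python sketch only uses the two timestamps from the capture above; `elapsed_seconds` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two tcpdump timestamps given in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


request_sent = "2022-11-05 09:38:46.768521"   # packet 65, Request Query
last_response = "2022-11-05 09:39:20.264467"  # packet 141, last response
print(elapsed_seconds(request_sent, last_response))
```

The client needed about 33.5 seconds to receive everything, yet the statement was not logged as slow, because mysqld had handed all of the data to the socket buffer long before the client read it.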
9. With a larger data volume, the statement does get stuck sending
A Cartesian product of the 16-row table with itself is enough:
select * from ttt.myt4 a,ttt.myt4 b
mysql> show processlist;
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
| 14 | root | localhost | ttt | Query | 0 | starting | show processlist | 0 | 0 |
| 26 | root | mgr3:2226 | NULL | Query | 9 | Sending to client | select * from ttt.myt4 a,ttt.myt4 b | 28 | 0 |
+----+------+-----------+------+---------+------+-------------------+-------------------------------------+-----------+---------------+
10. The slow log records this statement
# Time: 2022-11-05T01:57:28.353335Z
# User@Host: root[root] @ mgr3 [192.168.1.63] Id: 26
# Schema: Last_errno: 0 Killed: 0
# Query_time: 62.718498 Lock_time: 0.000513 Rows_sent: 256 Rows_examined: 112 Rows_affected: 0
# Bytes_sent: 51203172
SET timestamp=1667613448;
select * from ttt.myt4 a,ttt.myt4 b;
In this situation, the TCP packet capture shows TCP ZeroWindow notifications.