A SQL Query Performance Problem in MySQL 5.6.x/5.7.x
I. Create a simple table and load some data with a stored procedure
CREATE TABLE users (
user_id int(11) unsigned NOT NULL,
user_name varchar(64) DEFAULT NULL,
PRIMARY KEY (user_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
DELIMITER $$
DROP PROCEDURE IF EXISTS proc_auto_insertdata$$
-- Create the stored procedure
CREATE PROCEDURE proc_auto_insertdata()
BEGIN
DECLARE init_data INTEGER DEFAULT 1;
WHILE init_data <= 20000 DO
INSERT INTO users VALUES(init_data, CONCAT('用户-',init_data));
SET init_data = init_data + 1;
END WHILE;
END$$
DELIMITER ;
CALL proc_auto_insertdata();
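Before timing anything, it can help to confirm that the procedure loaded what the later sections assume. A minimal check, relying only on the users table defined above (the column aliases are illustrative, not part of the original setup):

```sql
-- Sanity check after CALL proc_auto_insertdata():
-- the loop inserts user_id 1 through 20000, so 20000 rows are expected.
SELECT COUNT(*)     AS row_count,
       MIN(user_id) AS first_id,
       MAX(user_id) AS last_id
FROM users;

-- List the indexes on the table; the definition above declares only the primary key.
SHOW INDEX FROM users;
```

The table definition puts no index on user_name, which is worth keeping in mind when reading the execution plans later on.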
II. Run the following queries
Q1:
SELECT u.user_id, u.user_name FROM users u
WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
Q2: (Q2 adds just one more condition to Q1, joined with OR; no row in the data satisfies that condition)
SELECT u.user_id, u.user_name FROM users u WHERE
(u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
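Question: which query is faster, Q1 or Q2? By what factor? Why?
III. Actual results
Q1 and Q2 are modified slightly to avoid printing a large result set: the select list is replaced with COUNT().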
mysql> SELECT COUNT(u.user_id) FROM users u
-> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+------------------+
| COUNT(u.user_id) |
+------------------+
| 1999 |
+------------------+
1 row in set (19.93 sec)
mysql> SELECT COUNT(u.user_id) FROM users u
-> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
-> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
+------------------+
| COUNT(u.user_id) |
+------------------+
| 1999 |
+------------------+
1 row in set (0.50 sec)
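Compare the elapsed times: Q1 took 19.93 sec and Q2 took 0.50 sec, so Q1 took nearly 40 times as long as Q2. Why?
IV. Looking for the cause
1: Check the execution plans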
mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
-> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+----+--------------+-------------+-------+---------+-------+----------+----------------------------------------------------+
| id | select_type | table | type | key | rows | filtered | Extra |
+----+--------------+-------------+-------+---------+-------+----------+----------------------------------------------------+
| 1 | SIMPLE | <subquery2> | ALL | NULL | NULL | 100.00 | NULL |
| 1 | SIMPLE | u | ALL | NULL | 19761 | 10.00 | Using where; Using join buffer(Block Nested Loop) |
| 2 | MATERIALIZED | t | range | PRIMARY | 1999 | 100.00 | Using where |
+----+--------------+-------------+-------+---------+-------+----------+----------------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
-> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
-> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
+----+-------------+-------+-------+---------+---------+-------+----------+--------------------------------+
| id | select_type | table | type | key | key_len | rows | filtered | Extra |
+----+-------------+-------+-------+---------+---------+-------+----------+--------------------------------+
| 1 | PRIMARY | u | ALL | NULL | NULL | 19761 | 100.00 | Using where |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | no matching row in const table |
| 2 | SUBQUERY | t | range | PRIMARY | 4 | 1999 | 100.00 | Using where |
+----+-------------+-------+-------+---------+---------+-------+----------+--------------------------------+
3 rows in set, 1 warning (0.00 sec)
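Comparing the two plans shows that Q1 stores the intermediate result of its subquery by materialization ("MATERIALIZED"). Is materialization what makes Q1 slow?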
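To look more closely at how the materialized subquery is consumed, the JSON form of EXPLAIN (available in MySQL 5.6 and later) gives a nested view of the same plan. A sketch for Q1; the output is not reproduced here because it varies with server version and table statistics:

```sql
-- Same statement as the tabular EXPLAIN for Q1, with a structured plan as output.
EXPLAIN FORMAT=JSON
SELECT COUNT(u.user_id) FROM users u
WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
```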
2: Check the handler (I/O) counters
mysql> flush status; -- reset the counters so that each measurement starts from zero
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT COUNT(u.user_id) FROM users u
-> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+------------------+
| COUNT(u.user_id) |
+------------------+
| 1999 |
+------------------+
1 row in set (19.93 sec)
mysql> show status like 'Handler%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
...
| Handler_external_lock | 5 |
...
| Handler_read_first | 2 |
| Handler_read_key | 2 |
| Handler_read_last | 0 |
| Handler_read_next | 1999 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 22001 |
...
| Handler_write | 1999 |
+----------------------------+-------+
18 rows in set (0.00 sec)
mysql> flush status; -- reset the counters so that each measurement starts from zero
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT COUNT(u.user_id) FROM users u
-> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
-> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
+------------------+
| COUNT(u.user_id) |
+------------------+
| 1999 |
+------------------+
1 row in set (0.50 sec)
mysql> show status like 'Handler%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_commit | 1 |
...
| Handler_external_lock | 7 |
...
| Handler_read_first | 2 |
| Handler_read_key | 20002 |
| Handler_read_last | 0 |
| Handler_read_next | 1999 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 20001 |
...
| Handler_write | 1999 |
+----------------------------+-------+
18 rows in set (0.00 sec)
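The notable difference between Q2 and Q1 is the value of "Handler_read_key": 20002 for Q2, far higher than the 2 recorded for Q1. This shows that Q2 makes much heavier use of index lookups.
The official MySQL documentation explains it as follows: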
Handler_read_key
The number of requests to read a row based on a key. If this value is high, it is a good indication that your tables are properly indexed for your queries.
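To focus on the counters that matter in this comparison, SHOW STATUS also accepts a WHERE clause. A small sketch for Q1 that reports only Handler_read_key and Handler_read_rnd_next, the two read counters that differ between the runs above:

```sql
FLUSH STATUS;  -- reset the session counters

SELECT COUNT(u.user_id) FROM users u
WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);

-- Handler_read_key counts key-based row reads;
-- Handler_read_rnd_next counts next-row reads during table scans.
SHOW SESSION STATUS
WHERE Variable_name IN ('Handler_read_key', 'Handler_read_rnd_next');
```

Running the same three statements with Q2 in place of Q1 gives the second half of the comparison.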
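Questions:
Why does Q2 perform so many more index reads? Where does the index come from?
Q1 is materialized, which means Q1 uses a temporary table. Are Q2's subqueries materialized as well, and do they use temporary tables?
V. New questions, further investigation
Run the following, and note the use of the show warnings trick. The output has been reformatted for readability.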
mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
-> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
...
mysql> show warnings;
/* select#1 */ select count(`d2`.`u`.`user_id`) AS `COUNT(u.user_id)`
from `d2`.`users` `u` semi join (`d2`.`users` `t`)
where ((`d2`.`u`.`user_name` = `<subquery2>`.`user_name`) and (`d2`.`t`.`user_id` < 2000))
1 row in set (0.00 sec)
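As shown, after Q1's subquery is materialized, the semi-join optimization is applied as well, which means the subquery is optimized by being pulled up into the outer query.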
mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
-> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
-> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
...
mysql> show warnings;
/* select#1 */ select count(`d2`.`u`.`user_id`) AS `COUNT(u.user_id)`
from `d2`.`users` `u`
where
(
<in_optimizer>(`d2`.`u`.`user_name`,`d2`.`u`.`user_name` in
(
<materialize> (/* select#2 */ select `d2`.`t`.`user_name`
from `d2`.`users` `t`
where (`d2`.`t`.`user_id` < 2000) ),
<primary_index_lookup>(`d2`.`u`.`user_name` in <temporary table> on <auto_key>
where ((`d2`.`u`.`user_name` = `materialized-subquery`.`user_name`)))
)
)
or
<in_optimizer>(`d2`.`u`.`user_name`,`d2`.`u`.`user_name` in
(
<materialize> (/* select#3 */ select `d2`.`t`.`user_name`
from `d2`.`users` `t`
where (`d2`.`t`.`user_id` < -(1)) ),
<primary_index_lookup>(`d2`.`u`.`user_name` in <temporary table> on <auto_key>
where ((`d2`.`u`.`user_name` = `materialized-subquery`.`user_name`)))
)
)
)
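The rewritten form of Q2 shows that temporary tables are used here too, but unlike Q1, the subqueries are not pulled up.
However, when MySQL materializes a subquery into a temporary table, it automatically creates an index on that table, which is why we see a "primary_index_lookup" executed on "auto_key". This is why Q2 is faster than Q1, and also why Q2's index-read counter is so much higher.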
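The idea behind Q2's plan (materialize the subquery once, then probe it through an index) can be imitated by hand. The following is only a sketch: the table name tmp_names and the index name idx_user_name are made up for illustration, and the optimizer is still free to choose its own join order and access method.

```sql
-- Hypothetical helper table standing in for MySQL's internal materialized subquery.
CREATE TEMPORARY TABLE tmp_names (
  user_name varchar(64) DEFAULT NULL,
  KEY idx_user_name (user_name)   -- plays the role of the automatic key
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Materialize the subquery result once; DISTINCT keeps the join below
-- from counting an outer row more than once.
INSERT INTO tmp_names
SELECT DISTINCT t.user_name FROM users t WHERE t.user_id < 2000;

-- Each row of u can now be matched through the index on tmp_names.
SELECT COUNT(u.user_id)
FROM users u
JOIN tmp_names m ON m.user_name = u.user_name;

DROP TEMPORARY TABLE tmp_names;
```

Checking this query with EXPLAIN shows whether the index on tmp_names is actually chosen.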
Question: what about the semi-join optimization?
VI. Investigating further
mysql> SET optimizer_switch='semijoin=off'; -- turn off the semi-join optimization
Query OK, 0 rows affected (0.00 sec)
mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
-> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
| 1 | PRIMARY | u | NULL | ALL | NULL | NULL | NULL | NULL | 19761 | 100.00 | Using where |
| 2 | SUBQUERY | t | NULL | range | PRIMARY | PRIMARY | 4 | NULL | 1999 | 100.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
2 rows in set, 1 warning (0.01 sec)
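The plan does not seem to change much, but it now resembles Q2's plan. (Run show warnings; to see the details; more detailed information makes for a more reliable conclusion.)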
mysql> SELECT COUNT(u.user_id) FROM users u
-> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+------------------+
| COUNT(u.user_id) |
+------------------+
| 1999 |
+------------------+
1 row in set (0.41 sec)
With semi-join disabled, execution speed takes off: 0.41 sec against the original 19.93 sec, an improvement of more than 40 times.
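Because optimizer_switch set this way stays in effect for the rest of the session, an experiment like this one is easier to repeat if the original flags are saved and restored. A minimal sketch; the user variable name @saved_optimizer_switch is only an illustrative choice:

```sql
-- Remember the session's current optimizer flags.
SET @saved_optimizer_switch = @@optimizer_switch;

-- Disable semi-join for the experiment only.
SET optimizer_switch = 'semijoin=off';

SELECT COUNT(u.user_id) FROM users u
WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);

-- Put the original flags back and confirm.
SET optimizer_switch = @saved_optimizer_switch;
SELECT @@optimizer_switch;
```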
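VII. Conclusions
1. Q1 uses materialization plus the semi-join optimization; Q2 runs as plain subqueries without the semi-join optimization. Evidently the semi-join optimization in MySQL is not always the more efficient choice.
2. It seems that MySQL's criteria for deciding when to pull a materialized subquery up into a semi-join still have some problems.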
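If changing optimizer_switch for the whole session is too broad, the same effect can be limited to a single statement with optimizer hints. This sketch assumes MySQL 5.7.8 or later, where the NO_SEMIJOIN and QB_NAME hints are documented; it does not apply to 5.6, and the query block name subq1 is an arbitrary label chosen here:

```sql
-- Ask the optimizer not to use semi-join for the named subquery only.
SELECT /*+ NO_SEMIJOIN(@subq1) */ COUNT(u.user_id)
FROM users u
WHERE u.user_name IN (SELECT /*+ QB_NAME(subq1) */ t.user_name
                      FROM users t WHERE t.user_id < 2000);
```

Running EXPLAIN on the hinted statement is the way to confirm that the plan matches the one obtained with semijoin=off.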