A SQL Query Performance Problem in MySQL V5.6.x/5.7.x

I. Create a simple table and populate it with a stored procedure

CREATE TABLE users (
  user_id int(11) unsigned NOT NULL,
  user_name varchar(64) DEFAULT NULL,
  PRIMARY KEY (user_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DELIMITER $$
DROP PROCEDURE IF EXISTS proc_auto_insertdata$$
-- Create the stored procedure
CREATE PROCEDURE proc_auto_insertdata()
BEGIN
  DECLARE init_data INTEGER DEFAULT 1;
  -- Insert 20,000 rows with user_id 1..20000 and user_name '用户-<user_id>'
  WHILE init_data <= 20000 DO
    INSERT INTO users VALUES(init_data, CONCAT('用户-',init_data));
    SET init_data = init_data + 1;
  END WHILE;
END$$
DELIMITER ;
CALL proc_auto_insertdata();

II. Run the following queries

Q1:
SELECT u.user_id, u.user_name FROM users u
WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);

Q2: (Q2 adds only one condition to Q1, joined with OR; no row in the data satisfies it)
SELECT u.user_id, u.user_name FROM users u
WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
       u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));

Questions: Which is faster, Q1 or Q2? By what factor? Why?

III. Actual results

Q1 and Q2 are modified slightly to avoid printing a large result set: the select list is replaced with the COUNT() function.

mysql> SELECT COUNT(u.user_id) FROM users u
    -> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+------------------+
| COUNT(u.user_id) |
+------------------+
|             1999 |
+------------------+
1 row in set (19.93 sec)

mysql> SELECT COUNT(u.user_id) FROM users u WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
+------------------+
| COUNT(u.user_id) |
+------------------+
|             1999 |
+------------------+
1 row in set (0.50 sec)

Compare the elapsed times: Q1 (19.93 sec) takes nearly 40 times as long as Q2 (0.50 sec), even though Q2 is Q1 plus an extra condition. Why?
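Timings like these depend on the server version and on which optimizer features are switched on, so it can help to record both next to the numbers. The following is a minimal sketch of that bookkeeping step, which the original test does not show; it assumes nothing beyond the built-in VERSION() function and the optimizer_switch system variable.

```sql
-- Server version the test is running on (this article targets 5.6.x / 5.7.x).
SELECT VERSION();

-- Optimizer flags in effect for the current session,
-- returned as one comma-separated string of name=on|off pairs.
SELECT @@SESSION.optimizer_switch;
```

Saving this output with each timing makes it easier to tell later whether a difference comes from the query or from the configuration.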

IV. Looking for the cause

1: Check the execution plans

mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
    -> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+----+--------------+-------------+-------+---------+-------+----------+----------------------------------------------------+
| id | select_type  | table       | type  | key     | rows  | filtered | Extra                                              |
+----+--------------+-------------+-------+---------+-------+----------+----------------------------------------------------+
|  1 | SIMPLE       | <subquery2> | ALL   | NULL    |  NULL |   100.00 | NULL                                               |
|  1 | SIMPLE       | u           | ALL   | NULL    | 19761 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
|  2 | MATERIALIZED | t           | range | PRIMARY |  1999 |   100.00 | Using where                                        |
+----+--------------+-------------+-------+---------+-------+----------+----------------------------------------------------+
3 rows in set, 1 warning (0.00 sec)

mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
    -> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
    -> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
+----+-------------+-------+-------+---------+---------+-------+----------+--------------------------------+
| id | select_type | table | type  | key     | key_len | rows  | filtered | Extra                          |
+----+-------------+-------+-------+---------+---------+-------+----------+--------------------------------+
|  1 | PRIMARY     | u     | ALL   | NULL    | NULL    | 19761 |   100.00 | Using where                    |
|  3 | SUBQUERY    | NULL  | NULL  | NULL    | NULL    |  NULL |     NULL | no matching row in const table |
|  2 | SUBQUERY    | t     | range | PRIMARY | 4       |  1999 |   100.00 | Using where                    |
+----+-------------+-------+-------+---------+---------+-------+----------+--------------------------------+
3 rows in set, 1 warning (0.00 sec)

Comparing the two plans, Q1 stores the intermediate result of its subquery using the "MATERIALIZED" strategy. Is materialization what makes Q1 slow?
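The tabular plans above show only some of EXPLAIN's columns. As a sketch of one way to get more detail for the same comparison, MySQL 5.6 and later also accept EXPLAIN FORMAT=JSON, which reports each subquery nested inside the part of the plan that uses it. This step is an addition for illustration, not part of the original investigation.

```sql
-- Q1: the plan as a JSON document instead of a table.
EXPLAIN FORMAT=JSON
SELECT COUNT(u.user_id) FROM users u
WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);

-- Q2: same query plus the OR branch, for a side-by-side comparison.
EXPLAIN FORMAT=JSON
SELECT COUNT(u.user_id) FROM users u
WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
       u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
```

The exact keys in the JSON differ between server versions, so it is safer to compare the two documents from the same server than to rely on particular field names.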

2: Check the I/O counters

mysql> flush status; -- reset the counters so that each run starts from zero
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT COUNT(u.user_id) FROM users u
    -> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+------------------+
| COUNT(u.user_id) |
+------------------+
|             1999 |
+------------------+
1 row in set (19.93 sec)

mysql> show status like 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
...
| Handler_external_lock      | 5     |
...
| Handler_read_first         | 2     |
| Handler_read_key           | 2     |
| Handler_read_last          | 0     |
| Handler_read_next          | 1999  |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 22001 |
...
| Handler_write              | 1999  |
+----------------------------+-------+
18 rows in set (0.00 sec)

mysql> flush status; -- reset the counters so that each run starts from zero
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT COUNT(u.user_id) FROM users u
    -> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
    -> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
+------------------+
| COUNT(u.user_id) |
+------------------+
|             1999 |
+------------------+
1 row in set (0.50 sec)

mysql> show status like 'Handler%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
...
| Handler_external_lock      | 7     |
...
| Handler_read_first         | 2     |
| Handler_read_key           | 20002 |
| Handler_read_last          | 0     |
| Handler_read_next          | 1999  |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 20001 |
...
| Handler_write              | 1999  |
+----------------------------+-------+
18 rows in set (0.00 sec)

The main difference is Handler_read_key: 20002 for Q2 against only 2 for Q1. This indicates that Q2 makes far more use of an index. The official MySQL documentation describes the counter as follows:

Handler_read_key
The number of requests to read a row based on a key. If this value is high, it is a good indication that your tables are properly indexed for your queries.

Questions: Why does Q2 perform so many more index reads? Where does the index come from?
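The same reset-then-read procedure can be pointed at other session counters. As a sketch, the statements below repeat the Q2 run and also read the Created_tmp% counters, on the assumption that they help show whether internal temporary tables were created. This measurement is not part of the original runs, so no values are shown for it.

```sql
-- Reset the session counters, as in the runs above.
FLUSH STATUS;

-- Q2, unchanged.
SELECT COUNT(u.user_id) FROM users u
WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
       u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));

-- Temporary-table counters first, then the row-read counters used above.
SHOW SESSION STATUS LIKE 'Created_tmp%';
SHOW SESSION STATUS LIKE 'Handler_read%';
```

On some versions SHOW STATUS can itself use an internal temporary table, so small differences in Created_tmp_tables should be read with caution.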
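
Q1's subquery was materialized, which means Q1 used a temporary table. Were Q2's subqueries materialized as well, and did they use temporary tables?

V. A new question, another round of investigation

Run the statements below, noting the show warnings technique: after an EXPLAIN, it prints the query as the optimizer rewrote it. The output has been re-indented to make it easier to read.

mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
    -> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
...
mysql> show warnings;
/* select#1 */ select count(`d2`.`u`.`user_id`) AS `COUNT(u.user_id)`
from `d2`.`users` `u` semi join (`d2`.`users` `t`)
where ((`d2`.`u`.`user_name` = `<subquery2>`.`user_name`) and (`d2`.`t`.`user_id` < 2000))
1 row in set (0.00 sec)

As this shows, Q1's subquery was materialized and then also given the semi-join optimization, which means the subquery was pulled up into the outer query.

mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
    -> WHERE (u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000) OR
    -> u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < -1));
...
mysql> show warnings;
/* select#1 */ select count(`d2`.`u`.`user_id`) AS `COUNT(u.user_id)`
from `d2`.`users` `u`
where (
  <in_optimizer>(`d2`.`u`.`user_name`,`d2`.`u`.`user_name` in (
    <materialize> (/* select#2 */
      select `d2`.`t`.`user_name` from `d2`.`users` `t` where (`d2`.`t`.`user_id` < 2000)
    ),
    <primary_index_lookup>(`d2`.`u`.`user_name` in <temporary table> on <auto_key>
      where ((`d2`.`u`.`user_name` = `materialized-subquery`.`user_name`)))
    )
  )
  or
  <in_optimizer>(`d2`.`u`.`user_name`,`d2`.`u`.`user_name` in (
    <materialize> (/* select#3 */
      select `d2`.`t`.`user_name` from `d2`.`users` `t` where (`d2`.`t`.`user_id` < -(1))
    ),
    <primary_index_lookup>(`d2`.`u`.`user_name` in <temporary table> on <auto_key>
      where ((`d2`.`u`.`user_name` = `materialized-subquery`.`user_name`)))
    )
  )
)

The rewritten Q2 shows that its subqueries also use temporary tables, but unlike Q1 they were not pulled up, because an IN subquery combined with OR is not eligible for semi-join conversion.
MySQL automatically creates an index on the temporary table, which is why a "primary_index_lookup" is performed on "auto_key": each row of u is checked with a single index lookup. In Q1, by contrast, the execution plan shown earlier scans the materialized table <subquery2> in full and joins it to u with Block Nested Loop, so user_name is compared for on the order of 1,999 × 20,000 (about 40 million) row pairs with no index to narrow the search. That is why Q2 is faster than Q1, and why Q2's index-read counter is so much higher.

Question: is the semi-join optimization responsible?

VI. Further investigation

mysql> SET optimizer_switch='semijoin=off'; -- disable the semi-join optimization
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN SELECT COUNT(u.user_id) FROM users u
    -> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);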
+----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
|  1 | PRIMARY     | u     | NULL       | ALL   | NULL          | NULL    | NULL    | NULL | 19761 |   100.00 | Using where |
|  2 | SUBQUERY    | t     | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |  1999 |   100.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+-------+----------+-------------+
2 rows in set, 1 warning (0.01 sec)

The plan does not look very different at first glance, but it now resembles Q2's plan. (Running show warnings; here gives more detail, and a reliable conclusion needs that fuller information.)

mysql> SELECT COUNT(u.user_id) FROM users u
    -> WHERE u.user_name IN (SELECT t.user_name FROM users t WHERE t.user_id < 2000);
+------------------+
| COUNT(u.user_id) |
+------------------+
|             1999 |
+------------------+
1 row in set (0.41 sec)

With semi-join disabled, Q1's execution time drops from 19.93 sec to 0.41 sec, a speedup of more than 40 times.

VII. Conclusions

1. Q1 used materialization plus the semi-join optimization. Q2 ran its subqueries without semi-join. Evidently, the semi-join optimization in MySQL is not always the more efficient choice.
2. MySQL's criteria for deciding when to pull up a materialized subquery as a semi-join still appear to have some problems.
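
Turning semijoin off through optimizer_switch affects every later statement in the session. A narrower alternative is to disable semi-join for the one subquery with an optimizer hint and set the session flag back to its default. The sketch below assumes a 5.7 release recent enough to support optimizer hints; 5.6 does not, and treats the hint as an ordinary comment. The query block name subq1 is chosen here purely for illustration.

```sql
-- Put the session-wide flag changed in section VI back to its default.
SET optimizer_switch='semijoin=default';

-- Disable semi-join for the named subquery only.
-- QB_NAME labels the subquery's query block; NO_SEMIJOIN(@subq1) refers to that label.
SELECT /*+ NO_SEMIJOIN(@subq1) */ COUNT(u.user_id) FROM users u
WHERE u.user_name IN (SELECT /*+ QB_NAME(subq1) */ t.user_name
                      FROM users t WHERE t.user_id < 2000);
```

Because an unsupported hint is silently ignored, it is worth running EXPLAIN on the hinted statement to confirm that the plan no longer shows the MATERIALIZED semi-join form before relying on it.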