MySQL Optimizer的子查询优化行为

MySQL Optimizer

先谈谈Optimizer。

Optimizer即数据库优化器,在MySQL中大家普遍称之为查询优化器。在任何一款数据库中,Optimizer一直占据着最为核心的角色,有些人将它比喻为数据库的大脑,真是太贴切了,因为从一定程度上来讲优化器的设计原则决定着一款数据库的行为。那么优化的前提是了解优化器,优化的作用是辅助优化器

MySQL Optimizer属于CBO,并且在不断地改进中已经渐渐开始支持各种主流的优化技术,在生成执行计划之前进行部分优化工作。mySQL中的SQL处理过程大致如下,

MySQL Optimizer的子查询优化行为_第1张图片

语法分析器处理的结果是将SQL语句解析为查询树(select_lex),优化器以JOIN::prepare方法作为入口,最终将语法树转变为最优的执行计划。期间交错进行着逻辑优化与物理优化,各个组件及数据结构彼此配合,共同完成查询优化以及执行计划生成这一个完整的工作流程。


再来说一下逻辑优化与物理优化。

逻辑优化,主要功能是基于关系代数以及启发式规则,找出SQL语句等价的变换形式,使得SQL执行更高效;物理优化,即根据代价估算模型,选择最优的表连接顺序以及数据访问方式,这个阶段依赖于对数据的了解。

从我们的角度来看,逻辑优化就是优化器对我们的SQL进行了一次改进优化,那么优化器就不单单只是简单地解析SQL,为了让生活更美好,它会选择自己更乐意接受的语法形式

子查询优化技术

在逻辑优化阶段,其中一个核心工作就是进行子查询的优化,因为子查询实在是太消耗性能了,MySQL查询优化器在演进历程中渐渐支持了子查询的优化,这还是相当有必要的。在MySQL Optimizer中,子查询的处理贯穿在整个优化器代码中,


在JOIN::prepare查询准备阶段就进行了一部分特殊子查询的优化,如

remove_redundant_subquery_clauses //去冗余子句
resolve_subquery //简单格式的semi-join优化,否则调用select_transformer
select_transformer //其他类型子查询优化,如IN/ALL/ANY等价转换

JOIN::optimize方法中,融合了semi-join、Materialization以及EXISTS strategy这三种主要的优化策略,如

flatten_subqueries //子查询展开,即子查询上拉,无法上拉则使用EXISTS strategy方式
make_join_statistics //完成多表连接,此函数会进行嵌套连接的Materialization优化,并确定子查询的优化策略

子查询优化的主流技术有三种:

子查询合并。即把多个子查询合并成一个子查询,作用是减少元组扫描的次数

子查询反嵌套。也称作子查询上拉,即将子查询重写为等价的多表连接

聚集子查询消除。即聚集函数上推,将消除聚集函数的子查询与父查询做外连接


其中,子查询反嵌套是一种最常用的子查询优化技术,同时MySQL中也只支持这种优化技术。MySQL 5.6中处理子查询的思路是,基于查询重写技术的规则,尽可能将子查询转换为连接(semi-join Semi-Join Transformations),并配合基于代价估算的Materialize、exists 等优化策略让子查询执行更优


接下来会对各种技术以及实现场景进行测试验证,测试用表如下,

[root@sakila 11:49:54]> desc customer_t;
+-------------+----------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------------------+------+-----+-------------------+-----------------------------+
| | smallint(5) unsigned | NO | PRI | NULL | |
...
...
9 rows in set (0.07 sec)

[root@sakila 02:38:29]> desc payment_t;
+--------------+----------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------------------+------+-----+-------------------+-----------------------------+
| payment_id | smallint(5) unsigned | NO | PRI | NULL | |
|  | smallint(5) unsigned | NO | MUL | NULL | |
...
...
7 rows in set (0.00 sec)

不支持的优化技术

1、子查询合并技术

对于两个相似的子查询作为where条件的查询,MySQL不支持合并处理。如下,

[root@sakila 05:00:22]> explain extended select customer_t.store_id  from customer_t where customer_id in (select customer_id from payment_t where payment_id>1000) or  customer_id in (select customer_id from payment_t where payment_id<10);
+----+-------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
+----+-------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
|  1 | PRIMARY     | customer_t | ALL   | NULL          | NULL    | NULL    | NULL |  644 |   100.00 | Using where |
|  3 | SUBQUERY    | payment_t  | range | PRIMARY       | PRIMARY | 2       | NULL |    9 |   100.00 | Using where |
|  2 | SUBQUERY    | payment_t  | range | PRIMARY       | PRIMARY | 2       | NULL | 8272 |   100.00 | Using where |
+----+-------------+------------+-------+---------------+---------+---------+------+------+----------+-------------+

改写后的SQL语句如下,

/* select#1 */ select `sakila`.`customer_t`.`store_id` AS `store_id` from `sakila`.`customer_t` where ((`sakila`.`customer_t`.`customer_id`,`sakila`.`customer_t`.`customer_id` in (  (/* select#2 */ select `sakila`.`payment_t`.`customer_id` from `sakila`.`payment_t` where (`sakila`.`payment_t`.`payment_id` > 1000) ), (`sakila`.`customer_t`.`customer_id` in  on  where ((`sakila`.`customer_t`.`customer_id` = `materialized-subquery`.`customer_id`))))) or (`sakila`.`customer_t`.`customer_id`,`sakila`.`customer_t`.`customer_id` in (  (/* select#3 */ select `sakila`.`payment_t`.`customer_id` from `sakila`.`payment_t` where (`sakila`.`payment_t`.`payment_id` < 10) ), (`sakila`.`customer_t`.`customer_id` in  on  where ((`sakila`.`customer_t`.`customer_id` = `materialized-subquery`.`customer_id`))))))

从转换结果来看,优化器使用了多种优化策略来获得更快的元组扫描速率,但是两个子查询仍然单独进行,表payment_t被扫描了两次,而没有将两个子查询进行合并处理

2、聚集子查询消除。

子查询包含聚集函数,MySQL不支持消除,如下,

[root@sakila 09:38:54]> explain extended select * from customer_t where customer_id in (select min(customer_id) from payment_t);
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| 1 | PRIMARY | customer_t | ALL | NULL | NULL | NULL | NULL | 644 | 100.00 | Using where |
| 2 | SUBQUERY | payment_t | ALL | NULL | NULL | NULL | NULL | 15448 | 100.00 | NULL |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-------------+

优化后语句如下,

/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`sakila`.`customer_t`.`store_id` AS `store_id`,`sakila`.`customer_t`.`first_name` AS `first_name`,`sakila`.`customer_t`.`last_name` AS `last_name`,`sakila`.`customer_t`.`email` AS `email`,`sakila`.`customer_t`.`address_id` AS `address_id`,`sakila`.`customer_t`.`active` AS `active`,`sakila`.`customer_t`.`create_date` AS `create_date`,`sakila`.`customer_t`.`last_update` AS `last_update` from `sakila`.`customer_t` where (`sakila`.`customer_t`.`customer_id`,`sakila`.`customer_t`.`customer_id` in (  (/* select#2 */ select min(`sakila`.`payment_t`.`customer_id`) from `sakila`.`payment_t` having 1 ), (`sakila`.`customer_t`.`customer_id` in  on  where ((`sakila`.`customer_t`.`customer_id` = `materialized-subquery`.`min(customer_id)`)))))

可以发现被优化后的查询仍然有子查询,select_type值为SUBQUERY,这里用到了materialize、auto_key等各种优化手段,但是没有实现聚集子查询消除

特殊类型子查询的优化

1.IN类型。

MySQL支持简单in查询的优化,主要策略为semi-join、Materialization,否则会使用EXISTS strategy优化。这里用两个测试场景来分析优化行为,


场景一:... out_expr in (select primary_key ... )

[root@sakila 08:30:46]> explain extended select * from payment_t where customer_id in (select customer_id from customer_t);
+----+-------------+------------+--------+---------------+---------+---------+------------------------------+-------+----------+-------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                          | rows  | filtered | Extra       |
+----+-------------+------------+--------+---------------+---------+---------+------------------------------+-------+----------+-------------+
|  1 | SIMPLE      | payment_t  | ALL    | NULL          | NULL    | NULL    | NULL                         | 15448 |   100.00 | NULL        |
|  1 | SIMPLE      | customer_t | eq_ref | PRIMARY       | PRIMARY | 2       | sakila.payment_t.customer_id |     1 |   100.00 | Using index |
+----+-------------+------------+--------+---------------+---------+---------+------------------------------+-------+----------+-------------+
改写后的语句如下,
 /* select#1 */ select `sakila`.`payment_t`.`payment_id` AS `payment_id`,`sakila`.`payment_t`.`customer_id` AS `customer_id`,`sakila`.`payment_t`.`staff_id` AS `staff_id`,`sakila`.`payment_t`.`rental_id` AS `rental_id`,`sakila`.`payment_t`.`amount` AS `amount`,`sakila`.`payment_t`.`payment_date` AS `payment_date`,`sakila`.`payment_t`.`last_update` AS `last_update` from `sakila`.`customer_t` join `sakila`.`payment_t` where (`sakila`.`customer_t`.`customer_id` = `sakila`.`payment_t`.`customer_id`)
通过优化器的转换结果,我们发现这是一个标准的子查询反嵌套,子查询被上拉至外部表,句式被转换成内表与外表的inner join,并且执行计划中也没有subquery。


场景二:... primary_key in (select inner_expr ... )


[root@sakila 08:30:21]> explain extended select * from customer_t where customer_id in (select customer_id from payment_t);
+----+--------------+-------------+--------+---------------+------------+---------+-------------------------------+-------+----------+-------------+
| id | select_type  | table       | type   | possible_keys | key        | key_len | ref                           | rows  | filtered | Extra       |
+----+--------------+-------------+--------+---------------+------------+---------+-------------------------------+-------+----------+-------------+
|  1 | SIMPLE       | customer_t  | ALL    | PRIMARY       | NULL       | NULL    | NULL                          |   644 |   100.00 | Using where |
|  1 | SIMPLE       |  | eq_ref |     |  | 2       | sakila.customer_t.customer_id |     1 |   100.00 | NULL        |
|  2 | MATERIALIZED | payment_t   | ALL    | NULL          | NULL       | NULL    | NULL                          | 15448 |   100.00 | NULL        |
+----+--------------+-------------+--------+---------------+------------+---------+-------------------------------+-------+----------+-------------+

转换后的语句如下,
/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`sakila`.`customer_t`.`store_id` AS `store_id`,`sakila`.`customer_t`.`first_name` AS `first_name`,`sakila`.`customer_t`.`last_name` AS `last_name`,`sakila`.`customer_t`.`email` AS `email`,`sakila`.`customer_t`.`address_id` AS `address_id`,`sakila`.`customer_t`.`active` AS `active`,`sakila`.`customer_t`.`create_date` AS `create_date`,`sakila`.`customer_t`.`last_update` AS `last_update` from `sakila`.`customer_t` semi join (`sakila`.`payment_t`) where (``.`customer_id` = `sakila`.`customer_t`.`customer_id`)

这个语句和前面一个很类似,但是处理结果有一些区别。从执行计划看,好像还是有subquery,但是结合转换结果看,这种子查询被物化处理,然后进行了与外部表的semi join转换。


这两个语句的差别在哪里呢?

语句本身无差异,但是转换成join就会有很大区别,就是场景二的SQL转换为join之后还需要进行消重处理,这里的处理策略是物化为临时表,然后semi join消重。在这种场景下MySQL优化器还提供多种优化策略,如FirstMatch、LooseScan等,根据成本选择。


这两种场景都实现了MySQL的子查询反嵌套技术,只是在后一种场景下进行了更细致的处理,让SQL的处理消耗更少的资源。在MySQL中,与各种独特优化策略相结合的子查询上拉技术,被统称为Semi-Join Transformations。

2.ANY/ALL/SOME。

MySQL对于ANY/ALL/SOME类型的优化,主要策略是基于查询重写技术,将查询转换成更易被优化的min()/max(),= ANY则等价于in,= ALL一般会通过EXISTS strategy进行优化


场景一:outer_expr > ANY(select inner_expr ... )

[root@sakila 10:32:10]> explain extended select * from customer_t where customer_id > ANY(select customer_id from payment_t);
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | PRIMARY     | customer_t | ALL  | NULL          | NULL | NULL    | NULL |   644 |   100.00 | Using where |
|  2 | SUBQUERY    | payment_t  | ALL  | NULL          | NULL | NULL    | NULL | 15448 |   100.00 | NULL        |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------+-------------+

转换后的语句如下,
/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`sakila`.`customer_t`.`store_id` AS `store_id`,`sakila`.`customer_t`.`first_name` AS `first_name`,`sakila`.`customer_t`.`last_name` AS `last_name`,`sakila`.`customer_t`.`email` AS `email`,`sakila`.`customer_t`.`address_id` AS `address_id`,`sakila`.`customer_t`.`active` AS `active`,`sakila`.`customer_t`.`create_date` AS `create_date`,`sakila`.`customer_t`.`last_update` AS `last_update` from `sakila`.`customer_t` where ((`sakila`.`customer_t`.`customer_id` > (/* select#2 */ select min(`sakila`.`payment_t`.`customer_id`) from `sakila`.`payment_t`)))

从等价转换后的句式可以看到,子查询还是存在的,但是ANY被转换为>min()。那么这种转换的意义在哪里呢?

现在对payment_t表添加一个索引:alter table payment_t add key(customer_id),然后再看看这条查询的执行计划

[root@sakila 12:02:16]> explain extended select * from customer_t where customer_id > ANY(select customer_id from payment_t);
+----+-------------+------------+------+---------------+------+---------+------+------+----------+------------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                        |
+----+-------------+------------+------+---------------+------+---------+------+------+----------+------------------------------+
|  1 | PRIMARY     | customer_t | ALL  | NULL          | NULL | NULL    | NULL |  644 |   100.00 | Using where                  |
|  2 | SUBQUERY    | NULL       | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | Select tables optimized away |
+----+-------------+------------+------+---------------+------+---------+------+------+----------+------------------------------+

可以发现,在customer_id字段添加了索引之后,优化器直接通过索引获取min(),Extra信息为Select tables optimized away。那么优化器的转换思路就是,朝更容易被使用索引进行物理优化的方向进行改写。


场景二:... outer_expr < ALL(select inner_expr ... )

[root@sakila 12:02:21]> explain extended select * from customer_t where customer_id < ALL(select customer_id from payment_t);

转换后的语句如下,
/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`sakila`.`customer_t`.`store_id` AS `store_id`,`sakila`.`customer_t`.`first_name` AS `first_name`,`sakila`.`customer_t`.`last_name` AS `last_name`,`sakila`.`customer_t`.`email` AS `email`,`sakila`.`customer_t`.`address_id` AS `address_id`,`sakila`.`customer_t`.`active` AS `active`,`sakila`.`customer_t`.`create_date` AS `create_date`,`sakila`.`customer_t`.`last_update` AS `last_update` from `sakila`.`customer_t` where ((`sakila`.`customer_t`.`customer_id` >= (/* select#2 */ select min(`sakila`.`payment_t`.`customer_id`) from `sakila`.`payment_t`)))
和场景一类似,将其改写成更易获取的min()形式


场景三:... outer_expr = ANY(select inner_expr ... )

[root@sakila 12:06:15]> explain extended select * from customer_t where customer_id = ANY (select customer_id from payment_t);
+----+--------------+-------------+--------+---------------+------------+---------+-------------------------------+-------+----------+-------------+
| id | select_type  | table       | type   | possible_keys | key        | key_len | ref                           | rows  | filtered | Extra       |
+----+--------------+-------------+--------+---------------+------------+---------+-------------------------------+-------+----------+-------------+
|  1 | SIMPLE       | customer_t  | ALL    | PRIMARY       | NULL       | NULL    | NULL                          |   644 |   100.00 | Using where |
|  1 | SIMPLE       |  | eq_ref |     |  | 2       | sakila.customer_t.customer_id |     1 |   100.00 | NULL        |
|  2 | MATERIALIZED | payment_t   | ALL    | NULL          | NULL       | NULL    | NULL                          | 15448 |   100.00 | NULL        |
+----+--------------+-------------+--------+---------------+------------+---------+-------------------------------+-------+----------+-------------+

转换后如下,
/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`sakila`.`customer_t`.`store_id` AS `store_id`,`sakila`.`customer_t`.`first_name` AS `first_name`,`sakila`.`customer_t`.`last_name` AS `last_name`,`sakila`.`customer_t`.`email` AS `email`,`sakila`.`customer_t`.`address_id` AS `address_id`,`sakila`.`customer_t`.`active` AS `active`,`sakila`.`customer_t`.`create_date` AS `create_date`,`sakila`.`customer_t`.`last_update` AS `last_update` from `sakila`.`customer_t` semi join (`sakila`.`payment_t`) where (``.`customer_id` = `sakila`.`customer_t`.`customer_id`)
此场景下,= ANY被作为IN处理,然后使用物化策略以及半连接转换进行了优化


场景四:... outer_expr = ALL(select inner_expr ... )

[root@sakila 12:06:21]> explain extended select * from customer_t where customer_id = ALL (select customer_id from payment_t);
+----+--------------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type        | table      | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+--------------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | PRIMARY            | customer_t | ALL  | NULL          | NULL | NULL    | NULL |   644 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | payment_t  | ALL  | NULL          | NULL | NULL    | NULL | 15448 |   100.00 | Using where |
+----+--------------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
转换后句式如下,
/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`sakila`.`customer_t`.`store_id` AS `store_id`,`sakila`.`customer_t`.`first_name` AS `first_name`,`sakila`.`customer_t`.`last_name` AS `last_name`,`sakila`.`customer_t`.`email` AS `email`,`sakila`.`customer_t`.`address_id` AS `address_id`,`sakila`.`customer_t`.`active` AS `active`,`sakila`.`customer_t`.`create_date` AS `create_date`,`sakila`.`customer_t`.`last_update` AS `last_update` from `sakila`.`customer_t` where ((`sakila`.`customer_t`.`customer_id`,(/* select#2 */ select 1 from `sakila`.`payment_t` where ((`sakila`.`customer_t`.`customer_id`) <> `sakila`.`payment_t`.`customer_id`))))
从优化后的查询语句来看,= ALL被转换为exists语句,这说明使用了EXISTS strategy进行优化

非特殊子查询

1.Derived Tables。

MySQL支持对派生表形式的子查询进行优化。

对于派生类子查询的处理,优化器进一步优化了Materialization策略,只有当派生表中的数据必须加载时才可能被物化;对于join派生表的子查询类型,MySQL Optimizer支持auto_key策略优化,该策略会在物化的子查询结果集上添加索引,从而加速访问效率。


模拟SQL场景如下,

[root@sakila 02:32:16]> explain extended select customer_t.customer_id,derived_t.payment_id from customer_t join (select * from payment_t) 
as derived_t on customer_t.customer_id=derived_t.customer_id where derived_t.customer_id <100;
+----+-------------+------------+-------+---------------+-------------+---------+-------------------------------+-------+----------+--------------------------+
| id | select_type | table      | type  | possible_keys | key         | key_len | ref                           | rows  | filtered | Extra                    |
+----+-------------+------------+-------+---------------+-------------+---------+-------------------------------+-------+----------+--------------------------+
|  1 | PRIMARY     | customer_t | range | PRIMARY       | PRIMARY     | 2       | NULL                          |    98 |   100.00 | Using where; Using index |
|  1 | PRIMARY     |  | ref   |    |  | 2       | sakila.customer_t.customer_id |    23 |   100.00 | NULL                     |
|  2 | DERIVED     | payment_t  | ALL   | NULL          | NULL        | NULL    | NULL                          | 15448 |   100.00 | NULL                     |
+----+-------------+------------+-------+---------------+-------------+---------+-------------------------------+-------+----------+--------------------------+

优化器转换后如下,
/* select#1 */ select `sakila`.`customer_t`.`customer_id` AS `customer_id`,`derived_t`.`payment_id` AS `payment_id` from `sakila`.`customer_t` join 
(/* select#2 */ select `sakila`.`payment_t`.`payment_id` AS `payment_id`,`sakila`.`payment_t`.`customer_id` AS `customer_id`,
`sakila`.`payment_t`.`staff_id` AS `staff_id`,`sakila`.`payment_t`.`rental_id` AS `rental_id`,`sakila`.`payment_t`.`amount` AS `amount`,
`sakila`.`payment_t`.`payment_date` AS `payment_date`,`sakila`.`payment_t`.`last_update` AS `last_update` from `sakila`.`payment_t`) `derived_t` 
where ((`derived_t`.`customer_id` = `sakila`.`customer_t`.`customer_id`) and (`sakila`.`customer_t`.`customer_id` < 100))
可以看到,payment_t表首先被扫描,而在join操作时,派生表被使用索引方式访问,大大减少了不必要的扫描开销。但实际上这个SQL我们可以进行更高性能的上拉改写,从逻辑优化角度看,优化器能做到的并不多。

2.其他复杂子查询

MySQL Optimizer对具有UNION、ORDER BY ...LIMIT、GROUP BY等操作的子查询支持的也不是很好。大家可以找几个复杂的例子做下测试。

另外,对于NULL值的忠告,当一个涉及多表操作的子查询的外部表达式包含NULL值时(如NULL in (select inner_expe ...)),NULL值将会在join optimizer中进行单独处理,无法享受到优化待遇,大量NULL值可能会带来严重的性能问题。

辅助优化策略

知道了优化器怎么做,我们才知道如何让它更好的工作,那么想要构造高性能的子查询,使用策略主要是以下几点,

1、NOT NULL约束。不管是否为子查询,NULL值得处理都会是一个棘手的问题

2、建立合适的索引。优化器总是朝更容易利用索引进行优化的方向进行转换

3、多用连接而不是子查询。优化器处理子查询总是想方设法的将它转换为join,看来join才是他更喜欢的形式

4、不要太复杂。要知道,目前的MySQL查询优化器对复杂查询的优化支持的并不好






你可能感兴趣的:(MySQL Optimizer的子查询优化行为)