MySQL SQL Optimization Case Study: Large-Offset Pagination with LIMIT M,N

The original query:
SELECT
  loan_document_id,
  contract_id,
  applicant_contract_id,
  buyer_id,
  buyer_name,
  seller_id,
  seller_name,
  loan_document_no,
  loan_document_type,
  order_content,
  amount,
  buyer_cost,
  seller_cost,
  apply_date,
  apply_amount,
  can_apply_amount,
  start_time,
  end_time,
  loan_due_date,
  ar_due_date,
  buyback_due_date,
  lending_date,
  lending_amount,
  write_off_amount,
  write_off_date,
  submit_time,
  loan_document_state,
  state_change_time,
  attachment_count,
  created_by,
  create_time,
  update_by,
  update_time,
  delete_flag,
  digital_sign,
  pay_state,
  pay_state_change_time,
  pay_apply_time,
  loan_state,
  apply_date,
  applied_pay_amount,
  loan_state_change_time
FROM
  t_loan_document
WHERE
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 439000, 100;

Execution plan of the original statement:
+----+-------------+-----------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table           | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-----------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | t_loan_document | ALL  | NULL          | NULL | NULL    | NULL | 608512 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+--------+-------------+
This means a full table scan, with an estimated 608512 rows to examine. Actually running it gives: 100 rows in set (12.03 sec), i.e. returning 100 rows took 12.03 seconds.

The query fetches a large number of columns, and the LIMIT offset is huge. LIMIT 439000,100 means the server must first find the first 439000 rows that satisfy the WHERE condition, throw them all away, and only then return the next 100 matching rows, so the cost is bound to be very high.
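The cost can be sketched in plain Python (a toy model of the behavior, not MySQL internals): every matching row before the offset is still located and filtered, only to be thrown away.

```python
def offset_page(rows, predicate, offset, limit):
    """Naive offset pagination: scan in order, locate and discard the
    first `offset` matching rows, then return the next `limit` matches."""
    page, skipped, scanned = [], 0, 0
    for row in rows:
        scanned += 1
        if not predicate(row):
            continue
        if skipped < offset:
            skipped += 1            # found, filtered... and thrown away
            continue
        page.append(row)
        if len(page) == limit:
            break
    return page, scanned

rows = range(1000)                  # stand-in for the table, in storage order
page, scanned = offset_page(rows, lambda r: r % 2 == 0, offset=400, limit=5)
# page == [800, 802, 804, 806, 808]; 809 rows were scanned to return just 5
```

With offset=439000, the same pattern scans nearly the whole table to hand back 100 rows.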

Let's look at the resources each stage of execution consumes:
set profiling=1;
select loan_document_id ...
show profiles;
show profile cpu,block io for query 1;
+----------------------+-----------+----------+------------+--------------+---------------+
| Status               | Duration  | CPU_user | CPU_system | Block_ops_in | Block_ops_out |
+----------------------+-----------+----------+------------+--------------+---------------+
| starting             |  0.000098 | 0.000000 |   0.000000 |            0 |             0 |
| checking permissions |  0.000007 | 0.000000 |   0.000000 |            0 |             0 |
| Opening tables       |  0.000018 | 0.000000 |   0.000000 |            0 |             0 |
| System lock          |  0.000009 | 0.000000 |   0.000000 |            0 |             0 |
| init                 |  0.000054 | 0.000000 |   0.000000 |            0 |             0 |
| optimizing           |  0.000013 | 0.000000 |   0.000000 |            0 |             0 |
| statistics           |  0.000014 | 0.000000 |   0.000000 |            0 |             0 |
| preparing            |  0.000016 | 0.000000 |   0.000000 |            0 |             0 |
| executing            |  0.000003 | 0.000000 |   0.000000 |            0 |             0 |
| Sending data         | 12.025931 | 6.341036 |   0.928859 |           16 |          1872 |
| end                  |  0.000048 | 0.000000 |   0.000000 |            0 |             0 |
| query end            |  0.000006 | 0.000000 |   0.000000 |            0 |             0 |
| closing tables       |  0.000006 | 0.000000 |   0.000000 |            0 |             0 |
| freeing items        |  0.000170 | 0.000000 |   0.000000 |            0 |             0 |
| logging slow query   |  0.000005 | 0.000000 |   0.000000 |            0 |             0 |
| logging slow query   |  0.000003 | 0.000000 |   0.000000 |            0 |             0 |
| cleaning up          |  0.000005 | 0.000000 |   0.000000 |            0 |             0 |
+----------------------+-----------+----------+------------+--------------+---------------+
17 rows in set (0.00 sec)
set profiling=0;
So, can we optimize this by creating an index on the filter columns? That is usually the first idea that comes to mind.

First, look at the data distribution of the columns in the WHERE condition:
loan_document_state
+---------------------+----------------------------+
| loan_document_state | count(loan_document_state) |
+---------------------+----------------------------+
|                  -1 |                        132 |
|                   0 |                     503061 |
|                   1 |                          3 |
|                   2 |                          1 |
|                   3 |                        708 |
|                   4 |                        809 |
|                   5 |                       9588 |
|                   6 |                         12 |
|                   7 |                       1475 |
|                   8 |                      89014 |
+---------------------+----------------------------+

delete_flag
+-------------+--------------------+
| delete_flag | count(delete_flag) |
+-------------+--------------------+
|           0 |             603129 |
|           1 |               1674 |
+-------------+--------------------+

ar_due_date
+-------------------+--------------------------+
| year(ar_due_date) | count(year(ar_due_date)) |
+-------------------+--------------------------+
|              2014 |                   103896 |
|              2015 |                   456324 |
|              2016 |                    43270 |
|              2017 |                      641 |
|              2018 |                      672 |
+-------------------+--------------------------+
Every filter column has a great many duplicate values, and the values that satisfy the condition account for the majority of the rows. In other words, even with indexes on these columns the selectivity would be very low; the optimizer would likely still choose a full table scan, so the query would hardly benefit, while the extra indexes could slow down inserts and updates.
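A quick back-of-the-envelope check using the counts above (this ignores the ar_due_date and delete_flag filters, so it is a rough figure for the state filter alone):

```python
# Row counts taken from the loan_document_state distribution above.
total = 132 + 503061 + 3 + 1 + 708 + 809 + 9588 + 12 + 1475 + 89014  # 604803 rows
matching = 503061 + 9588 + 1475        # loan_document_state IN (0, 5, 7)
print(f"{matching / total:.1%}")       # about 85.0% of rows pass the state filter
```

An index is only worthwhile when it can skip most of the table; filtering out 15% of the rows is not enough.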

Another approach: the biggest problem with this statement is that the offset M in LIMIT M,N is far too large (leaving aside whether the filter columns should be indexed). Every query must first find the first M rows in the whole table that satisfy the condition, discard them, and then, starting from row M+1, collect the next N matching rows. When the table is very large, the filter columns have no useful index, and M is huge, this cost is enormous. Now imagine that each query could instead start from a position marked at the end of the previous query, find the next 100 matching rows, and record where the following query should begin. Then no query would ever have to re-locate and discard the first M matching rows again.
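The idea can be sketched in Python (a toy model: a sorted list stands in for the primary-key index, and bisect plays the role of the index seek; the key values and filter are made up for illustration):

```python
import bisect

def keyset_page(sorted_keys, predicate, last_key, limit):
    """One page of keyset ('seek') pagination: jump past last_key with a
    binary search (the job a PRIMARY KEY index does), then collect the
    next `limit` rows that pass the filter."""
    start = 0 if last_key is None else bisect.bisect_right(sorted_keys, last_key)
    page = []
    for key in sorted_keys[start:]:
        if predicate(key):
            page.append(key)
            if len(page) == limit:
                break
    return page

keys = [f"LD{n:05d}" for n in range(1000)]  # stand-in for loan_document_id values
even = lambda k: int(k[2:]) % 2 == 0        # stand-in for the WHERE conditions
p1 = keyset_page(keys, even, last_key=None, limit=3)
p2 = keyset_page(keys, even, last_key=p1[-1], limit=3)
# p1 == ['LD00000', 'LD00002', 'LD00004']; p2 resumes at 'LD00006', not from the start
```

No matter how deep the page, each call starts with a cheap seek to last_key instead of re-scanning everything before it.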

So the problem becomes choosing a field to serve as this marker. It must be sequentially ordered and indexed, so that the position recorded by the previous query can be located quickly. The primary key is the natural choice.

mysql> show create table t_loan_document\G
*************************** 1. row ***************************
       Table: t_loan_document
Create Table: CREATE TABLE `t_loan_document` (
  `loan_document_id` varchar(36) COLLATE utf8_bin NOT NULL COMMENT 'financing document ID',
  `contract_id` bigint(20) DEFAULT NULL COMMENT 'contract ID, contract pkey',
  ...
  PRIMARY KEY (`loan_document_id`),
  ...
  )

loan_document_id is the table's primary key, so (in InnoDB's clustered index) the rows are stored in ascending loan_document_id order. We can exploit this to optimize the SQL: each query, besides returning the matching rows with the required columns, also records the largest loan_document_id it returned; the next query then starts right after that value instead of scanning the table from the beginning. (Strictly speaking, an explicit ORDER BY loan_document_id should be added to make the page order deterministic; here the rows happen to come back in primary-key order because of the clustered index scan.)

For example:
First query (for clarity, only loan_document_id is selected; the other columns are omitted):
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 0, 100;

+----------------------+
| loan_document_id     |
+----------------------+
| LD120140519140909023 |
| LD120140519140909044 |
| LD120140519140909065 |
|         ......       | 
| LD120140519140911117 |
| LD120140519140911142 |
| LD120140519140911160 |
+----------------------+

The largest loan_document_id returned is LD120140519140911160.

The second query can then be written as:
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  loan_document_id > 'LD120140519140911160'
AND
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 100;
+----------------------+
| loan_document_id     |
+----------------------+
| LD120140519140911174 |
| LD120140519140911205 |
|         .........    |
| LD120140519140913184 |
| LD120140519140913199 |
| LD120140519140913216 |
+----------------------+
100 rows in set (0.00 sec)


This returns exactly the same result as:

SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 100,100;
+----------------------+
| loan_document_id     |
+----------------------+
| LD120140519140911174 |
| LD120140519140911205 |
|         .........    |
| LD120140519140913184 |
| LD120140519140913199 |
| LD120140519140913216 |
+----------------------+
100 rows in set (0.00 sec)
This time the largest loan_document_id is LD120140519140913216.

Third query:
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  loan_document_id > 'LD120140519140913216'
AND
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 100;
returns the same result as
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
(
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 200,100;
for the same reason.
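That page-by-page equivalence is easy to sanity-check with a toy model in Python (sorted integer keys stand in for loan_document_id; the filter is arbitrary):

```python
def matches(k):
    return k % 7 != 0                       # stand-in for the WHERE conditions

keys = list(range(500))                     # sorted, unique key values
hits = [k for k in keys if matches(k)]

# Offset paging: page i is simply the slice hits[i*100:(i+1)*100].
offset_pages = [hits[i:i + 100] for i in range(0, len(hits), 100)]

# Keyset paging: each page resumes after the last key of the previous page.
keyset_pages, last = [], None
while True:
    page = [k for k in keys if (last is None or k > last) and matches(k)][:100]
    if not page:
        break
    keyset_pages.append(page)
    last = page[-1]

assert offset_pages == keyset_pages         # identical pages, page by page
```

Both enumerate the matching rows in key order, so the pages must coincide; only the cost of reaching each page differs.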

When M in LIMIT M,N is small, the two forms perform about the same; but once M grows large enough, the gap becomes enormous.

For example:
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 439000, 100;
This takes about 7 seconds; the largest loan_document_id returned is LD220150513105539579.

If the next page still uses the offset form:
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 439100, 100;
it takes about 10 seconds,

whereas rewriting it as:
SELECT
  loan_document_id
FROM
  t_loan_document
WHERE
  loan_document_id > 'LD220150513105539579'
AND
  (
    loan_document_state = 0
    OR loan_document_state = 5
    OR loan_document_state = 7
  )
AND ar_due_date < '2015-12-24 11:09:09'
AND delete_flag = 0
LIMIT 100;
takes only 0.00 seconds.

Clearly, a qualitative leap in execution efficiency.
