MySQL详细基础语法2

GROUP BY

GROUP BY即分组主要作用是把一个集合中根据group by后面的关键字进行分组，里面包含2个计算

去掉重复值(参考丁奇文章)

当没有索引时，根据group by后面的列，在内存中创建一个对应列为pk的临时表。对数据集进行遍历，没有则插入临时表，有则对对应的聚合函数列进行比较和更新，使最终结果中没有重复值

有索引时，直接使用索引，只需要一个计数器，对索引列进行更新，不需要维护临时表。

所以对group by进行优化的方式之一就是对group by后面的列添加索引，减少临时表(use tempfile)的产生和维护

减少use tempfile的两种方式
1. Loose Index Scan索引跳跃式扫描
  
  要实现跳跃式扫描要满足如下条件
  1. 联合索引的前列选择率要低且必须在索引中，group by依次按照索引顺序，且select列中只有索引的左边的一部分且不能回表
    
    例如如下表结构和执行计划
```
CREATE TABLE `t_group6` (
  `emp_no` int(11) NOT NULL,
  `dept_no` char(4) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  KEY `ix_empno_to_date` (`dept_no`,`to_date`,`emp_no`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
    按照索引顺序的语句执行计划如下

group1.png

     但不按照索引顺序的时候则没有优化效果

group2.png

  2. 前导列是关键

Tight Index Scan

range scan或者index scan
1. 创建索引的时候必须按照where条件中的等号字段再加上group后面的列的顺序来创建

group3.png

进行排序(5.5~5.7)，8.0开始不在对group by后面的关键字进行排序

group by的排序方式，字符类型是按照ascii码从左到右进行排序的。当含有空和null时 NULL在前空在后
相关参数

5.7 tmp_table_size,max_heap_table_size
group by分为完全模式和不完全模式

开启完全group by形式之后 select后面的字段中不能出现非聚合函数包裹的字段否则会报错

通过set session sql_mode = ' '; 这个配置来控制

并且不完全group模式会随着执行计划不同导致结果不同

例如有如下的表，索引情况和数据如下

group4.png

当使用ix_torder_2索引时，执行结果如下

group5.png

当时用ix_torder_3索引时，执行结果就改变了

group6.png

DISTINCT

和group by一样，可以去掉重复值。区别在于distinct不排序。

在内存中创建一个对应列为pk的临时表。对数据集进行遍历，没有则插入临时表，有则违反uk约束

从执行计划上也能看出distinct和group by的区别

group.png

order by

order by后面可以写数字表示列名，如下例子中1表示emp_no

order.png

order by有两种排序方式

single pass

原理：一次性把sql中涉及的字段全部多出来，然后依据排序字段排序，最后直接返回结果。

优点：只需要一次顺序I/O，无需任何随机I/O，大大降低I/O开销

缺点：内存容不下大量数据时，可能会先放到磁盘上对数据切分，对小数据块逐个排序，然后将结果集merge，在返回结果集

例子如下，下图为数据库排序相关数据记录,sort_merge_passes为0，一般表示执行了一次排序

order1.png

执行语句如下

SELECT emp_no,first_name,last_name FROM employees ORDER BY first_name LIMIT 100;

执行后的数据纪录如下

order2.png

执行计划如下

order3.png

简单的说就是把emp_no,first_name,last_name全部装到sort_buffer中排序

SQL单次排序效率与order by后面的字段以及select后面的字段都有关系，主要优化点在于减少数据集面积。所以查询的时候一般不要用*

two pass

原理：先读取行指针和排序字段进行排序，然后依据排序结果在读取所需数据

优点：排序数据量小，完全可以在内存中完成

缺点：第二次读取时，会发生大量随机I/O，太昂贵

two pass有两个要求
1. 排序的列的值要大于max_length_for_sort_data
2. select列当中包含blob或者text的值
实例如下，下图为排序语句执行前的数据库记录

order1.png

执行语句如下

SELECT SQL_NO_CACHE COUNT(1) FROM (
    SELECT * FROM t11 IGNORE INDEX (PRIMARY) ORDER BY c10 DESC LIMIT 10000
) a;

执行语句后，服务器记录如下

order4.png

sort_merge_pass超过sort buffer大小时排序数据就会写入temp file中 temp file合并的次数数据的大小都和sort_buffer_size有直接关系

order by limit有重复列的情况

如下表，其中ID不是primary key

order5.png

三个问题

当只有8条数据记录的时候，用create_time排序分页会有数据重复和丢失的问题，而用age则没有出现，age也有重复值

运行SQL如下
```
SELECT * FROM test ORDER BY create_time LIMIT 0,4;
SELECT * FROM test ORDER BY create_time LIMIT 4,4;
```
运行结果如下

order6.png

  如图所示，id=8的数据重复，$\color{red}{原因是create_time重复，MySQL再回表查询剩余数据的时候，只要找到符合当前create_time的数据即可，所以造成了数据重复}$

当增加了4条记录后，用create_time排序分页没有问题，而使用age排序会出现数据重复或者丢失，age的重复值没有create_time多
给排序列加索引对这个问题没有帮助

解决方法

加入id列作为二级排序列
切换innodb引擎，对create_time列添加索引

创建内部视图，使用distinct关键字阻止视图合并

语句如下

SELECT * FROM (SELECT DISTINCT t.* FROM test t ORDER BY create_time) a LIMIT 0,4;

COUNT

count不计算NULL，在不影响业务的情况下可以使用count(1)或者count(*)。但是由于count是不计算NULL的，所以最好使用非空并且not null的列进行统计，如自增的primary key

count优化的例子

有如下的两条SQL
```
1. SELECT COUNT(fid=6 AND displayorder >= 0) AS posts FROM cdb_threads;
2. SELECT COUNT(fid) AS threads FROM cdb_threads WHERE fid=6 AND displayorder >= 0;
```
上面两条SQL对比，2要比1效率高，主要原因在于count的调用次数。假设当前表没有任何索引，查询fid=6并且displayorder >= 0的数据都需要全表扫描，效果相同。如果结果集数据为100条，2只需要对结果集进行一次count，但是1在整个过程中要进行100次count，显然1的效率更低
原语句如下
```
SELECT COUNT(emp_no IS NULL) AS c1 FROM t_order;
```
优化语句如下
```
1. SELECT COUNT(*) c1 FROM t_order WHERE emp_no IS NULL;
2. SELECT COUNT(case when emp_no is null then 1 end) c1 FROM t_order;
```
如上两个SQL 2相对1来说，效率会低。原因是行数不同。1会根据条件进行过滤，然后count统计实际的数据量。2则是全表扫描，会产生大量I/O

MIN/MAX

min和max可以求出数据列的最大最小值，但是不包含null
min/max使用例子

查看employess表中是否存在emp_no=1000111，存在返回1不存在返回0。直接使用case加where不能满足，如图

max.png

根据图片我们可以发现，直接使用case加where有值返回1，但是没值不会返回0。我们可以使用max或者min解决上面的问题，如图

max1.png

max/min在没有值返回的时候，会返回NULL，可以使用这个特性解决问题

实际优化案例

有如下语句，运行时间超过5s

SELECT t.systen_phone AS plattorm, t.platform_name AS channel,
COALESCE(
  sum(case 
      when t.is_success='T' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'            
      then 1 else 0 end
     )
  ,0) as applytotal ,
COALESCE(
  sum(case 
      when is_påss='T' and tcl.is_refuse = 'F' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59' 
      then 1 else 0 end
     )
  ,0) as tpasscount,
COALESCE(
  sum(case 
      when (status >= 40 or status = -40) and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
      then 1 else 0 end
     )
  ,0) as transportnum,
COALESCE(
  sum(case 
      when (status >= 40 or status = -40)and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
      then t.apply_amount/10000 else 0 end
     )
  ,0) as transportamount, 
COALESCE(
  sum(
    case 
    when audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
  )
  ,0) as cponum,
COALESCE(
  sum(
    case 
    when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
  )
  ,0) as cpopassnum,
COALESCE(
  sum(
    case when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then t.apply_amount/10000 else 0 end
  )
  ,0) as cpopassamount,
COALESCE(
  sum(
    case 
    when (status = 90 or status = 80 or status = 60) and tc.contractamount is not NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then tc.contractamount/10000 
    when (status = 90 or status = 80 or status = 60 ) and tc.contractamount is NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then t.applt_amount/10000 else 0 end
  )
  ,0) as contractamount,
COALESCE(
  sum(
    case when (status = 90 or status = 80 or status = 60) and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then 1 else 0 end
  )
  ,0) as closenum
FROM tb_cbm_transport_yrd AS t
LEFT JOIN tb_cbm_transport_result tc ON t.id = tc.transport_id
LEFT JOIN tb_cbm_1imit tcl ON t.limit_id = tcl.limit_id
WHERE t.tools_platorm_code = 'uxWap'
GROUP BY left(t.system_phone,1);

执行计划如下图

max2.png

查询效率低的主要是原因是where条件提前，造成全表扫描，I/O较大，优化思路就是where条件优化，提取公共部分，并且创建索引

优化后语句如下，优化后时间为0.07s。可以看出优化的主要位置是将where提取出来

SELECT t.systen_phone AS plattorm, t.platform_name AS channel,
COALESCE(
    sum(case 
    when t.is_success='T' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as applytotal ,
COALESCE(
    sum(case 
    when is_påss='T' and tcl.is_refuse = 'F' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as tpasscount,
COALESCE(
    sum(case 
    when (status >= 40 or status = -40) and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as transportnum,
COALESCE(
    sum(case 
    when (status >= 40 or status = -40)and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
    then t.apply_amount/10000 else 0 end
    )
,0) as transportamount, 
COALESCE(
    sum(case 
    when audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as cponum,
COALESCE(
    sum(case 
    when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as cpopassnum,
COALESCE(
    sum(case 
    when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then t.apply_amount/10000 else 0 end
    )
,0) as cpopassamount,
COALESCE(
    sum(case 
    when (status = 90 or status = 80 or status = 60) and tc.contractamount is not NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then tc.contractamount/10000 
    when (status = 90 or status = 80 or status = 60 ) and tc.contractamount is NULL and t.loan_date >= '2016-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then t.applt_amount/10000 else 0 end
    )
,0) as contractamount,
COALESCE(
    sum(case 
    when (status = 90 or status = 80 or status = 60) and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as closenum
FROM(
    SELECT ifnull(max(t.system_phone) , 'androld') system_phone,
    ifnull(max(t.platform_name),'YRD_APP') platform_name,
    t.is_success, t.create_time ,t.is_påss,t.status,t.yrd_time,
    t.apply_amount, t.audit_time, t.loan_date,t.id ,t.limit_id
    FROM
    tb_cbm_transport_yrd AS t
    where t.tools_platorm_code = 'uxWap'
    and e.system_phone = 'android'
    and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
    and (
        (t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59') 
        or (t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59') 
        or (t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59')
        or (t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59')
    )
    union all
    SELECT ifnull(max(t.system_phone) , 'androld') system_phone,
    ifnull(max(t.platform_name),'YRD_APP') platform_name,
    t.is_success, t.create_time ,t.is_påss,t.status,t.yrd_time,
    t.apply_amount, t.audit_time, t.loan_date,t.id ,t.limit_id
    FROM
    tb_cbm_transport_yrd AS t
    where t.tools_platorm_code = 'uxWap'
    and e.system_phone = 'ios'
    and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
    and (
        (t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59') 
        or (t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59') 
        or (t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59')
        or (t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59')
    )
) t
LEFT JOIN tb_cbm_transport_result tc ON t.id = tc.transport_id
LEFT JOIN tb_cbm_1imit tcl ON t.limit_id = tcl.limit_id
WHERE t.tools_platorm_code = 'uxWap' GROUP BY left(t.system_phone,1);

sum/avg

只能对数字或者能转化成数字的字符串类型计算

在MySQL中1/0返回NULL，如下图

sum.png

avg是不会计算NULL那一行的，有如下例子

sum1.png

$\color{red}{用avg计算平均数和使用sum加count计算平均数的值存在差异，原因是avg会排除NULL那一行，不进行计算。从count可以看出，count进行统计时，使用*会统计全部的行数，但是使用列名则会排除空行，所以avg是总数除以9行的结果。}$

HAVING

having主要用于group by后的二次过滤

having可以减少子查询嵌套，如下例子

SELECT * FROM (SELECT 'wassup' AS hi FROM (select 1) x) a WHERE hi = 'wassup';

可以使用having改写，简化嵌套

SELECT 'wassup' AS hi FROM (select 1) x HAVING hi = 'wassup';

行转列，列转行

行转列

line.png

列转行

line1.png

思路：

先把一行边两行

line2.png

使用case when分类

line3.png

GROUP BY和JOIN优化案例

有如下语句，运行时间为166s

SELECT product_sku.branch_id AS branchId,product_sku.product_id AS productId,
sum((product_sku.num - ifnull(branch_return_goods_info.num,0))) AS totalNum,
sum((product_sku.unit_price * product_sku.num) - ifnull(branch_return_goods_info.money,0)) AS totalMoney,
sum(ifnull(branch_return_goods_info.num,0)) returnNum,
sum(ifnull(branch_return_goods_info.money,0)) returnMoney,
product_info.branch_name AS branchName,
product_info.brand_name AS brandName,
product_info.product_name AS productName,
product_info.cate_one_name AS cateOneName,
product_info.cate_two_name AS cateTwoName,
product_info.cate_three_name AS cateThreeName,
sk.miniNum AS miniNum,
product_info.sell_unit_convert AS sellUnitConvert,
sk.stockMoney As stockMoney
FROM product_sku_201704 poduct_sku FORCE INDEX (IDX_time_order)
INNER JOIN product_info on product_info.branch_id= product_sku.branch_id AND product_info.pcoduct_id= poduct_sku.product_id
AND product_sku.store_id NOT in (108,109)
LEFT JOIN branch_return_goods_info on branch_return_goods_info.order_id = product_sku.orderNo AND branch_return_goods_info.hmp_id = product_sku.product_id AND branch_return_goods_info.type = 2
LEFT JOIN stock_turnover as sk on sk.store_id = product_sku.store_id AND sk.product_id = product_sku.product_id and sk.datetime ='2017-04-25'
WHERE product_sku.order_type >= 0
AND product_sku.orderNo not like 'YL%'
AND product_sku.ordetrNo not like 'PT%' 
AND product_sku.order_time >= '2017-04-01 00:00:00'
AND product_sku.order_time <= '2017-04-25 23:59:59'
AND product_sku.orderState = 100
GROUP BY product_sku.product_id,product_sku.branch_id;

执行计划如下

group7.png

从执行计划我们可以看出， $\color{red}{using where，using temporary，using filesort同时产生，说明group by发生在join之后，是对整个结果集进行操作，性能会比较差。}$ 阅读上述SQL可以发现，where过滤条件只和product_sku表有关，和join表没有关系，所以优化思路是先过滤，缩小结果集，然后join

优化后的SQL如下，优化后执行时间为25s

SELECT
product_sku.btanch_id as btanchId,
product_sku.ptoduct_id As ptoductId,
sum((product_sku.num - ifnull(branch_return_goods_info.num,0))) AS totalNum,
sum((product_sku.unit_price * product_sku.num) - ifnull(branch_return_goods_info.money,0)) AS totalMoney,
sum(ifnull(branch_return_goods_info.num,0)) returnNum,
sum(ifnull(branch_return_goods_info.money,0)) returnMoney,
product_info.branch_name AS branchName,
product_info.brand_name AS brandName,
product_info.product_name AS productName,
product_info.cate_one_name AS cateOneName,
product_info.cate_two_name AS cateTwoName,
product_info.cate_three_name AS cateThreeName,
sk.miniNum AS miniNum,
product_info.sell_unit_convert AS sellUnitConvert,
sk.stockMoney As stockMoney
FROM (
    SELECT 
    product_sku.prcdct_id,
    pcoduct_sku.branch_id,
    sum(product_sku.num) num,
    SUM(product_sku.unit_price) unit_price,
    product_sku.orderNo,
    product_sku.store_id
    FROM proauct_sku_201704 produet_sku
    WHERE product_sku.order_type >= 0
    AND product_sku.orderNo not like 'YL%'
    AND product_sku.ordetrNo not like 'PT%' 
    AND product_sku.order_time >= '2017-04-01 00:00:00'
    AND product_sku.order_time <= '2017-04-25 23:59:59'
    AND product_sku.orderState = 100
    AND product_sku.store_id NOT in (108,109)
    GROUP BY product_sku.product_id,product_sku.branch_id
) product_sku
INNER JOIN product_info on product_info.branch_id= product_sku.branch_id AND product_info.pcoduct_id= poduct_sku.product_id
LEFT JOIN branch_return_goods_info on branch_return_goods_info.order_id = product_sku.orderNo AND branch_return_goods_info.hmp_id = product_sku.product_id AND branch_return_goods_info.type = 2
LEFT JOIN stock_turnover as sk on sk.store_id = product_sku.store_id AND sk.product_id = product_sku.product_id and sk.datetime ='2017-04-25'
GROUP BY product_sku.product_id,product_sku.branch_id;

优化后执行计划如下

group8.png