MySQL Basic Syntax in Detail (Part 2)

GROUP BY


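GROUP BY groups the rows of a result set by the columns listed after GROUP BY. It involves two operations:

  1. Removing duplicate values (see Ding Qi's article)

    When no index is available, MySQL creates an in-memory temporary table whose primary key is the GROUP BY column(s). It then walks the data set: if a key is not yet in the temporary table, a row is inserted; if it is, the aggregate columns of the existing row are compared and updated. The final result therefore contains no duplicate keys.

    When an index is available, MySQL reads the rows in index order. Equal keys are adjacent, so a single counter that is updated as the index column changes is enough, and no temporary table has to be maintained.

    One way to optimize GROUP BY is therefore to index the GROUP BY columns, which avoids creating and maintaining a temporary table (Using temporary in the execution plan).

    Two ways to avoid Using temporary

    1. Loose Index Scan (a skip scan over the index)

      A loose index scan requires the following conditions

      1. The leading column of the composite index should have low selectivity (few distinct values) and must be part of the index; the GROUP BY columns must follow the index order; and the SELECT list may reference only a leftmost part of the index, so that no lookup back to the table is needed

        For example, take the following table definition and execution plans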

        CREATE TABLE `t_group6` (
          `emp_no` int(11) NOT NULL,
          `dept_no` char(4) NOT NULL,
          `from_date` date NOT NULL,
          `to_date` date NOT NULL,
          KEY `ix_empno_to_date` (`dept_no`,`to_date`,`emp_no`)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
        

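        The execution plan for a statement that follows the index order is shown below

group1.png
        When the statement does not follow the index order, the optimization does not apply
group2.png
      2. The leading column is the key
    2. Tight Index Scan

      Uses a range scan or an index scan

      1. The index must be created with the equality columns from the WHERE clause first, followed by the GROUP BY columns, in that order
group3.png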
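  2. Sorting (5.5 to 5.7); from 8.0 on, MySQL no longer sorts by the GROUP BY columns

    When GROUP BY sorts, character values are compared from left to right by ASCII code. When the data contains both empty strings and NULL, NULL sorts first and the empty string after it

  3. Related parameters

    5.7: tmp_table_size, max_heap_table_size

  4. GROUP BY has a full mode and a non-full mode

    With full GROUP BY mode enabled, the SELECT list must not contain columns that are neither wrapped in an aggregate function nor listed in GROUP BY; otherwise the statement fails with an error

    The mode is controlled through sql_mode, for example set session sql_mode = ' '; (full mode corresponds to the ONLY_FULL_GROUP_BY flag in sql_mode)

    In non-full mode, the result can change when the execution plan changes

    For example, take the following table, with the indexes and data shown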

group4.png

When the ix_torder_2 index is used, the result is as follows

group5.png

When the ix_torder_3 index is used, the result changes

group6.png
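
The two GROUP BY strategies described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the idea, not MySQL's implementation: a dict plays the role of the temporary table, `sorted()` stands in for reading rows in index order, and the rows and column names are made up.

```python
# Hypothetical rows of (dept_no, emp_no); the data is illustrative only.
rows = [("d002", 3), ("d001", 1), ("d002", 4), ("d001", 2), ("d003", 5)]

def group_count_temp_table(rows):
    """No usable index: keep a table keyed by the GROUP BY column."""
    temp = {}  # plays the role of the temporary table whose key is dept_no
    for dept_no, _emp_no in rows:
        if dept_no not in temp:
            temp[dept_no] = 1   # key absent: insert a new row
        else:
            temp[dept_no] += 1  # key present: update the aggregate in place
    return temp

def group_count_index_order(sorted_rows):
    """Rows arrive in index order: equal keys are adjacent, one counter is enough."""
    result = []
    current, counter = None, 0
    for dept_no, _emp_no in sorted_rows:
        if dept_no != current:
            if current is not None:
                result.append((current, counter))  # previous group is complete
            current, counter = dept_no, 0
        counter += 1
    if current is not None:
        result.append((current, counter))
    return result

via_temp = group_count_temp_table(rows)
via_index = group_count_index_order(sorted(rows))  # sorted() stands in for an index scan
print(via_temp)
print(via_index)
```

Both functions produce the same counts; the second never holds more than one group in memory, which is why reading in index order removes the need for a temporary table.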

DISTINCT


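Like GROUP BY, DISTINCT removes duplicate values. The difference is that DISTINCT does not sort.

MySQL creates an in-memory temporary table whose primary key is the selected column(s) and walks the data set: if a value is not yet in the temporary table it is inserted; if it is, the insert violates the unique-key constraint and the row is dropped

The execution plans also show the difference between DISTINCT and GROUP BY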

group.png

order by


ORDER BY accepts a number that refers to a column by its position in the SELECT list; in the example below, 1 refers to emp_no

order.png
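
A minimal sketch of positional ORDER BY, using Python's built-in sqlite3 as a stand-in for MySQL (both accept a column position here). The table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(3, "Cara"), (1, "Bob"), (2, "Ann")],
)

# ORDER BY 1 sorts by the first column of the SELECT list, here emp_no.
by_position = conn.execute(
    "SELECT emp_no, first_name FROM employees ORDER BY 1"
).fetchall()
by_name = conn.execute(
    "SELECT emp_no, first_name FROM employees ORDER BY emp_no"
).fetchall()
print(by_position)
```

Positional references are convenient interactively, but they silently change meaning if the SELECT list is reordered, so named columns are usually safer in application code.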

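ORDER BY has two sorting methods

  1. single pass

    How it works: read all the columns the SQL statement needs in one go, sort by the sort columns, and return the result directly.

    Advantage: a single round of sequential I/O and no random I/O, which greatly reduces I/O cost

    Disadvantage: when memory cannot hold the data, it may be split into chunks on disk; each small chunk is sorted separately, and the sorted chunks are then merged before the result set is returned

    Example: the figure below shows the server's sort-related status counters. A sort_merge_passes value of 0 generally means the sort completed in one pass, with no merge passes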

order1.png

The statement executed is

SELECT emp_no,first_name,last_name FROM employees ORDER BY first_name LIMIT 100;

The status counters after execution are

order2.png

The execution plan is

order3.png

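Put simply, emp_no, first_name and last_name are all loaded into the sort buffer and sorted there

The efficiency of a single-pass sort depends on both the ORDER BY columns and the SELECT columns; the main optimization is to reduce the width of the data being sorted. That is why queries should generally avoid SELECT *

  2. two pass

    How it works: first read only the row pointers and the sort columns and sort them, then read the required data according to the sorted order

    Advantage: there is little data to sort, so the sort can usually complete entirely in memory

    Disadvantage: the second read causes a large amount of random I/O, which is expensive

    Two conditions lead to two pass

    1. The size of the row data to be sorted is larger than max_length_for_sort_data
    2. The SELECT list contains BLOB or TEXT columns

    Example: the figure below shows the server status counters before the sort statement runs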

order1.png

The statement executed is

SELECT SQL_NO_CACHE COUNT(1) FROM (
    SELECT * FROM t11 IGNORE INDEX (PRIMARY) ORDER BY c10 DESC LIMIT 10000
) a;

After the statement runs, the server status counters are

order4.png

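When the data to be sorted exceeds the sort buffer, it is written to temporary files, and sort_merge_passes counts the merge passes over those files. Both the number of merge passes and the amount of data written depend directly on sort_buffer_size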
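
The difference between the two sorting methods can be sketched as follows. This is a simplified model under stated assumptions, not MySQL's filesort code: a dict keyed by a made-up row id stands in for the table, and a Python list stands in for the sort buffer.

```python
# Hypothetical table: row id -> full row (illustrative data only).
table = {
    101: {"emp_no": 101, "first_name": "Mary", "last_name": "Sluis"},
    102: {"emp_no": 102, "first_name": "Anneke", "last_name": "Preusig"},
    103: {"emp_no": 103, "first_name": "Georgi", "last_name": "Facello"},
}

def single_pass(table, sort_key, columns):
    # The sort buffer holds the sort key AND every selected column.
    buffer = [(row[sort_key], tuple(row[c] for c in columns)) for row in table.values()]
    buffer.sort(key=lambda entry: entry[0])
    # The result is returned straight from the buffer, no second read.
    return [cols for _key, cols in buffer]

def two_pass(table, sort_key, columns):
    # Pass 1: the sort buffer holds only (sort key, row pointer).
    buffer = [(row[sort_key], row_id) for row_id, row in table.items()]
    buffer.sort(key=lambda entry: entry[0])
    # Pass 2: fetch each row by pointer in sorted order (random access).
    return [tuple(table[row_id][c] for c in columns) for _key, row_id in buffer]

columns = ("emp_no", "first_name", "last_name")
one = single_pass(table, "first_name", columns)
two = two_pass(table, "first_name", columns)
print(one)
```

Both return the same rows in the same order; the trade-off is a wider sort buffer in the first case versus an extra lookup per row in the second.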

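ORDER BY with LIMIT when the sort column has duplicate values

Take the following table, where id is not the primary key

order5.png

Three issues

  1. With only 8 rows, paging sorted by create_time returns duplicated and missing rows, while paging sorted by age does not, even though age also contains duplicate values

    The SQL statements are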

    SELECT * FROM test ORDER BY create_time LIMIT 0,4;
    SELECT * FROM test ORDER BY create_time LIMIT 4,4;
    

    The results are

order6.png
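  As the figure shows, the row with id=8 appears twice. The cause is that create_time contains duplicate values: when MySQL goes back to the table for the remaining rows, any row matching the current create_time qualifies, so the same row can be returned again. More generally, MySQL does not guarantee any particular order among rows with equal sort values
  2. After adding 4 more rows, paging sorted by create_time works, while paging sorted by age returns duplicated or missing rows, even though age has fewer duplicate values than create_time

  3. Adding an index on the sort column does not help with this problem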

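Solutions

  1. Add the id column as a secondary sort column

  2. Switch to the InnoDB engine and add an index on the create_time column

  3. Create an inline view and use the DISTINCT keyword to prevent view merging

    The statement is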

    SELECT * FROM (SELECT DISTINCT t.* FROM test t ORDER BY create_time) a LIMIT 0,4;
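
The first solution, adding id as a secondary sort column, can be sketched like this. sqlite3 stands in for MySQL and the 8 rows are invented, so the snippet does not reproduce the bug itself; it only shows that a unique tie-breaker gives every row a fixed position, so pages cannot overlap or skip rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, create_time TEXT, age INTEGER)")
# Hypothetical data: only two distinct create_time values across 8 rows.
rows = [
    (i, "2018-01-01 00:00:00" if i % 2 else "2018-01-02 00:00:00", 20 + i % 3)
    for i in range(1, 9)
]
conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

def page(offset, size):
    # create_time alone has ties; id breaks them, making the order total.
    sql = "SELECT id FROM test ORDER BY create_time, id LIMIT ? OFFSET ?"
    return [r[0] for r in conn.execute(sql, (size, offset))]

page1 = page(0, 4)
page2 = page(4, 4)
print(page1, page2)
```

The tie-breaker only works if it is unique; if id itself can repeat, the order among those rows is still undefined.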
    

COUNT


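COUNT(column) does not count NULL values. Where the business logic allows it, use COUNT(1) or COUNT(*), which count every row. Because COUNT(column) skips NULLs, when counting by column it is best to pick one declared NOT NULL, such as an auto-increment primary key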
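
A small check of these COUNT rules, using sqlite3 as a stand-in for MySQL (the NULL handling of COUNT is the same in both). The table and its four rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_order (id INTEGER PRIMARY KEY, emp_no INTEGER)")
# Two of the four rows have a NULL emp_no.
conn.executemany(
    "INSERT INTO t_order (emp_no) VALUES (?)",
    [(10,), (None,), (30,), (None,)],
)

count_star, count_one, count_col, count_pk = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(emp_no), COUNT(id) FROM t_order"
).fetchone()
print(count_star, count_one, count_col, count_pk)
```

COUNT(*) and COUNT(1) see all four rows, COUNT(emp_no) skips the two NULLs, and counting the NOT NULL primary key matches COUNT(*) again.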

COUNT optimization examples

  1. Consider the following two SQL statements

    1. SELECT COUNT(fid=6 AND displayorder >= 0) AS posts FROM cdb_threads;
    2. SELECT COUNT(fid) AS threads FROM cdb_threads WHERE fid=6 AND displayorder >= 0;
    

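    Comparing the two statements, 2 is more efficient than 1. Assuming the table has no indexes, both need a full table scan to find the rows with fid=6 and displayorder >= 0, so the scan cost is the same; the difference is how many rows reach COUNT. If 100 rows match, statement 2 filters first and runs COUNT over those 100 rows only, whereas statement 1 evaluates the expression and calls COUNT for every row in the table. Note also that statement 1 does not return the same number: COUNT(expr) counts every row where expr is not NULL, including rows where it evaluates to 0, so it counts non-matching rows as well

  2. The original statement is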

    SELECT COUNT(emp_no IS NULL) AS c1 FROM t_order;
    

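    Rewritten statements (emp_no IS NULL evaluates to 0 or 1 and never to NULL, so the original actually counts every row; the statements below count only the rows where emp_no is NULL)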

    1. SELECT COUNT(*) c1 FROM t_order WHERE emp_no IS NULL;
    2. SELECT COUNT(case when emp_no is null then 1 end) c1 FROM t_order;
    

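    Of these two statements, 2 is less efficient than 1 because of the number of rows processed: 1 filters on the condition first and COUNT only sees the matching rows, while 2 scans the whole table and generates a lot of I/O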
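
The three ways of writing the NULL count can be compared directly. The sketch below uses sqlite3 as a stand-in for MySQL with five invented rows; it checks results only and says nothing about I/O cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_order (emp_no INTEGER)")
# Five rows, two of them with a NULL emp_no.
conn.executemany(
    "INSERT INTO t_order VALUES (?)",
    [(10,), (None,), (30,), (None,), (50,)],
)

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

# The expression is 0 or 1, never NULL, so every row is counted.
count_expr = scalar("SELECT COUNT(emp_no IS NULL) FROM t_order")
# Filter first, then count the remaining rows.
count_where = scalar("SELECT COUNT(*) FROM t_order WHERE emp_no IS NULL")
# CASE without ELSE yields NULL for non-matching rows, which COUNT skips.
count_case = scalar("SELECT COUNT(CASE WHEN emp_no IS NULL THEN 1 END) FROM t_order")
print(count_expr, count_where, count_case)
```

Only the WHERE and CASE forms count the NULL rows; the CASE form is the one to reach for when several conditional counts are needed in a single scan.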

MIN/MAX


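  1. MIN and MAX return the minimum and maximum of a column, ignoring NULL values

  2. MIN/MAX usage example

    Check whether emp_no=1000111 exists in the employees table: return 1 if it does and 0 if it does not. CASE combined with WHERE alone cannot do this, as shown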

max.png

The figure shows that CASE plus WHERE returns 1 when a row matches, but returns no row at all, rather than 0, when nothing matches. MAX or MIN solves this, as shown

max1.png

When no rows match, MAX/MIN still return one row containing NULL, and this property can be used to solve the problem
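
The exact statements in the figures are not reproduced in the text, so the following is only a sketch of the idea, with sqlite3 standing in for MySQL and a two-row employees table invented for the purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO employees VALUES (?)", [(10001,), (10002,)])

def exists_flag(emp_no):
    # Plain CASE + WHERE: no matching row means an empty result, not 0.
    plain = conn.execute(
        "SELECT CASE WHEN emp_no IS NOT NULL THEN 1 ELSE 0 END "
        "FROM employees WHERE emp_no = ?",
        (emp_no,),
    ).fetchall()
    # An aggregate without GROUP BY always yields one row; MAX is NULL on no match.
    with_max = conn.execute(
        "SELECT CASE WHEN MAX(emp_no) IS NULL THEN 0 ELSE 1 END "
        "FROM employees WHERE emp_no = ?",
        (emp_no,),
    ).fetchall()
    return plain, with_max

found = exists_flag(10001)
missing = exists_flag(1000111)
print(found, missing)
```

For the missing emp_no, the plain query returns an empty list while the MAX version returns a single row containing 0.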

  3. A real optimization case

    The following statement takes more than 5s to run

    SELECT t.system_phone AS platform, t.platform_name AS channel,
    COALESCE(
      sum(case 
          when t.is_success='T' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'            
          then 1 else 0 end
         )
      ,0) as applytotal ,
    COALESCE(
      sum(case 
          when is_pass='T' and tcl.is_refuse = 'F' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
          then 1 else 0 end
         )
      ,0) as tpasscount,
    COALESCE(
      sum(case 
          when (status >= 40 or status = -40) and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
          then 1 else 0 end
         )
      ,0) as transportnum,
    COALESCE(
      sum(case 
          when (status >= 40 or status = -40)and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
          then t.apply_amount/10000 else 0 end
         )
      ,0) as transportamount, 
    COALESCE(
      sum(
        case 
        when audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
        then 1 else 0 end
      )
      ,0) as cponum,
    COALESCE(
      sum(
        case 
        when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
        then 1 else 0 end
      )
      ,0) as cpopassnum,
    COALESCE(
      sum(
        case when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
        then t.apply_amount/10000 else 0 end
      )
      ,0) as cpopassamount,
    COALESCE(
      sum(
        case 
        when (status = 90 or status = 80 or status = 60) and tc.contractamount is not NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
        then tc.contractamount/10000 
        when (status = 90 or status = 80 or status = 60 ) and tc.contractamount is NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
        then t.apply_amount/10000 else 0 end
      )
      ,0) as contractamount,
    COALESCE(
      sum(
        case when (status = 90 or status = 80 or status = 60) and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
        then 1 else 0 end
      )
      ,0) as closenum
    FROM tb_cbm_transport_yrd AS t
    LEFT JOIN tb_cbm_transport_result tc ON t.id = tc.transport_id
    LEFT JOIN tb_cbm_limit tcl ON t.limit_id = tcl.limit_id
    WHERE t.tools_platorm_code = 'uxWap'
    GROUP BY left(t.system_phone,1);
    

    The execution plan is shown below

max2.png

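The main reason for the poor performance is that the filter conditions are applied early, inside the CASE expressions of the SELECT list, rather than in the WHERE clause, which causes a full table scan with heavy I/O. The optimization idea is to work on the WHERE clause: extract the conditions the expressions share, and create an index for them

The optimized statement follows; it runs in 0.07s. The main change is that the shared conditions are pulled out into a WHERE clause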

SELECT t.system_phone AS platform, t.platform_name AS channel,
COALESCE(
    sum(case 
    when t.is_success='T' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as applytotal ,
COALESCE(
    sum(case 
    when is_pass='T' and tcl.is_refuse = 'F' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
    then 1 else 0 end
    )
,0) as tpasscount,
COALESCE(
    sum(case 
    when (status >= 40 or status = -40) and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as transportnum,
COALESCE(
    sum(case 
    when (status >= 40 or status = -40)and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' 
    then t.apply_amount/10000 else 0 end
    )
,0) as transportamount, 
COALESCE(
    sum(case 
    when audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as cponum,
COALESCE(
    sum(case 
    when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as cpopassnum,
COALESCE(
    sum(case 
    when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' 
    then t.apply_amount/10000 else 0 end
    )
,0) as cpopassamount,
COALESCE(
    sum(case 
    when (status = 90 or status = 80 or status = 60) and tc.contractamount is not NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then tc.contractamount/10000 
    when (status = 90 or status = 80 or status = 60 ) and tc.contractamount is NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then t.apply_amount/10000 else 0 end
    )
,0) as contractamount,
COALESCE(
    sum(case 
    when (status = 90 or status = 80 or status = 60) and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' 
    then 1 else 0 end
    )
,0) as closenum
FROM(
    SELECT ifnull(max(t.system_phone) , 'android') system_phone,
    ifnull(max(t.platform_name),'YRD_APP') platform_name,
    t.is_success, t.create_time ,t.is_pass,t.status,t.yrd_time,
    t.apply_amount, t.audit_time, t.loan_date,t.id ,t.limit_id
    FROM
    tb_cbm_transport_yrd AS t
    where t.tools_platorm_code = 'uxWap'
    and t.system_phone = 'android'
    and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
    and (
        (t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59') 
        or (t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59') 
        or (t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59')
        or (t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59')
    )
    union all
    SELECT ifnull(max(t.system_phone) , 'android') system_phone,
    ifnull(max(t.platform_name),'YRD_APP') platform_name,
    t.is_success, t.create_time ,t.is_pass,t.status,t.yrd_time,
    t.apply_amount, t.audit_time, t.loan_date,t.id ,t.limit_id
    FROM
    tb_cbm_transport_yrd AS t
    where t.tools_platorm_code = 'uxWap'
    and t.system_phone = 'ios'
    and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
    and (
        (t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59') 
        or (t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59') 
        or (t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59')
        or (t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59')
    )
) t
LEFT JOIN tb_cbm_transport_result tc ON t.id = tc.transport_id
LEFT JOIN tb_cbm_limit tcl ON t.limit_id = tcl.limit_id
WHERE t.tools_platorm_code = 'uxWap' GROUP BY left(t.system_phone,1);

sum/avg


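  1. SUM and AVG only work on numbers or on strings that can be converted to numbers

    In MySQL, 1/0 returns NULL, as shown below

sum.png
AVG skips rows whose value is NULL, as in the following example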
sum1.png

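Computing the average with AVG gives a different value from computing it as SUM divided by COUNT(*). The reason is that AVG excludes the NULL row from the calculation. The counts show the same thing: COUNT(*) counts all rows, while COUNT(column) excludes the NULL row, so AVG is the total divided by 9 rows.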
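
The same effect can be reproduced with a small table. The data in the figure is not available in the text, so this sketch assumes ten made-up rows, the values 1 to 9 plus one NULL, and uses sqlite3 as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INTEGER)")
# Nine numeric rows (1..9) and one NULL row.
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(v,) for v in range(1, 10)] + [(None,)],
)

avg_val, total, rows_all, rows_non_null = conn.execute(
    "SELECT AVG(val), SUM(val), COUNT(*), COUNT(val) FROM t"
).fetchone()

avg_over_all_rows = total / rows_all  # divides by 10, NULL row included
print(avg_val, avg_over_all_rows)
```

AVG(val) equals SUM(val) / COUNT(val), not SUM(val) / COUNT(*). If the NULL rows should count as zero, AVG(COALESCE(val, 0)) makes that explicit.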

HAVING


  1. HAVING is mainly used as a second filter applied after GROUP BY

  2. HAVING can reduce subquery nesting, as in this example

    SELECT * FROM (SELECT 'wassup' AS hi FROM (select 1) x) a WHERE hi = 'wassup';
    

    It can be rewritten with HAVING to flatten the nesting

    SELECT 'wassup' AS hi FROM (select 1) x HAVING hi = 'wassup';
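
The more common use of HAVING, filtering on an aggregate after GROUP BY, removes a level of nesting in the same way. A minimal sketch with sqlite3 standing in for MySQL; the dept_emp table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept_emp (emp_no INTEGER, dept_no TEXT)")
conn.executemany(
    "INSERT INTO dept_emp VALUES (?, ?)",
    [(1, "d001"), (2, "d001"), (3, "d002"), (4, "d002"), (5, "d002"), (6, "d003")],
)

# Nested form: aggregate in a derived table, filter in the outer query.
nested = conn.execute(
    "SELECT dept_no, n FROM ("
    "  SELECT dept_no, COUNT(*) AS n FROM dept_emp GROUP BY dept_no"
    ") a WHERE n >= 2 ORDER BY dept_no"
).fetchall()

# Flat form: HAVING filters the groups directly.
flat = conn.execute(
    "SELECT dept_no, COUNT(*) AS n FROM dept_emp "
    "GROUP BY dept_no HAVING COUNT(*) >= 2 ORDER BY dept_no"
).fetchall()
print(flat)
```

Conditions on plain columns still belong in WHERE, where they are applied before grouping and can use an index; HAVING is for conditions that need the grouped result.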
    

Rows to columns, columns to rows


  1. Rows to columns
line.png
  2. Columns to rows
line1.png

Approach:

  1. First turn one row into two
line2.png
  2. Use CASE WHEN to classify
line3.png
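
Since the queries above exist only as images, here is a sketch of both directions under assumed names: the scores and wide tables, their columns and the data are all hypothetical, and sqlite3 stands in for MySQL. Columns to rows follows the approach listed above: duplicate each row with a cross join, then use CASE WHEN to pick the value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Ann", "math", 90), ("Ann", "english", 80),
     ("Bob", "math", 70), ("Bob", "english", 60)],
)

# Rows to columns: one CASE WHEN per target column, collapsed by GROUP BY.
pivoted = conn.execute(
    "SELECT name, "
    "  MAX(CASE WHEN subject = 'math' THEN score END) AS math, "
    "  MAX(CASE WHEN subject = 'english' THEN score END) AS english "
    "FROM scores GROUP BY name ORDER BY name"
).fetchall()

conn.execute("CREATE TABLE wide (name TEXT, math INTEGER, english INTEGER)")
conn.executemany("INSERT INTO wide VALUES (?, ?, ?)", pivoted)

# Columns to rows: a 2-row helper duplicates each row, CASE picks the column.
unpivoted = conn.execute(
    "SELECT w.name, k.subject, "
    "  CASE k.subject WHEN 'math' THEN w.math WHEN 'english' THEN w.english END AS score "
    "FROM wide w CROSS JOIN (SELECT 'math' AS subject UNION ALL SELECT 'english') k "
    "ORDER BY w.name, k.subject"
).fetchall()
print(pivoted)
print(unpivoted)
```

The unpivoted rows match the original scores table, which shows the two transformations are inverses for this data.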

GROUP BY and JOIN optimization case


The following statement takes 166s to run

SELECT product_sku.branch_id AS branchId,product_sku.product_id AS productId,
sum((product_sku.num - ifnull(branch_return_goods_info.num,0))) AS totalNum,
sum((product_sku.unit_price * product_sku.num) - ifnull(branch_return_goods_info.money,0)) AS totalMoney,
sum(ifnull(branch_return_goods_info.num,0)) returnNum,
sum(ifnull(branch_return_goods_info.money,0)) returnMoney,
product_info.branch_name AS branchName,
product_info.brand_name AS brandName,
product_info.product_name AS productName,
product_info.cate_one_name AS cateOneName,
product_info.cate_two_name AS cateTwoName,
product_info.cate_three_name AS cateThreeName,
sk.miniNum AS miniNum,
product_info.sell_unit_convert AS sellUnitConvert,
sk.stockMoney As stockMoney
FROM product_sku_201704 product_sku FORCE INDEX (IDX_time_order)
INNER JOIN product_info on product_info.branch_id= product_sku.branch_id AND product_info.product_id= product_sku.product_id
AND product_sku.store_id NOT in (108,109)
LEFT JOIN branch_return_goods_info on branch_return_goods_info.order_id = product_sku.orderNo AND branch_return_goods_info.hmp_id = product_sku.product_id AND branch_return_goods_info.type = 2
LEFT JOIN stock_turnover as sk on sk.store_id = product_sku.store_id AND sk.product_id = product_sku.product_id and sk.datetime ='2017-04-25'
WHERE product_sku.order_type >= 0
AND product_sku.orderNo not like 'YL%'
AND product_sku.orderNo not like 'PT%'
AND product_sku.order_time >= '2017-04-01 00:00:00'
AND product_sku.order_time <= '2017-04-25 23:59:59'
AND product_sku.orderState = 100
GROUP BY product_sku.product_id,product_sku.branch_id;

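The execution plan is

group7.png

The execution plan shows Using where, Using temporary and Using filesort all at once, which indicates that the GROUP BY runs after the joins and operates on the entire joined result set, so performance is poor. Reading the SQL, the WHERE filters involve only the product_sku table and none of the joined tables, so the optimization idea is to filter first, shrink the result set, and only then join

The optimized SQL follows; it runs in 25s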

SELECT
product_sku.branch_id AS branchId,
product_sku.product_id AS productId,
sum((product_sku.num - ifnull(branch_return_goods_info.num,0))) AS totalNum,
sum((product_sku.unit_price * product_sku.num) - ifnull(branch_return_goods_info.money,0)) AS totalMoney,
sum(ifnull(branch_return_goods_info.num,0)) returnNum,
sum(ifnull(branch_return_goods_info.money,0)) returnMoney,
product_info.branch_name AS branchName,
product_info.brand_name AS brandName,
product_info.product_name AS productName,
product_info.cate_one_name AS cateOneName,
product_info.cate_two_name AS cateTwoName,
product_info.cate_three_name AS cateThreeName,
sk.miniNum AS miniNum,
product_info.sell_unit_convert AS sellUnitConvert,
sk.stockMoney As stockMoney
FROM (
    SELECT 
    product_sku.product_id,
    product_sku.branch_id,
    sum(product_sku.num) num,
    SUM(product_sku.unit_price) unit_price,
    product_sku.orderNo,
    product_sku.store_id
    FROM product_sku_201704 product_sku
    WHERE product_sku.order_type >= 0
    AND product_sku.orderNo not like 'YL%'
    AND product_sku.orderNo not like 'PT%'
    AND product_sku.order_time >= '2017-04-01 00:00:00'
    AND product_sku.order_time <= '2017-04-25 23:59:59'
    AND product_sku.orderState = 100
    AND product_sku.store_id NOT in (108,109)
    GROUP BY product_sku.product_id,product_sku.branch_id
) product_sku
INNER JOIN product_info on product_info.branch_id= product_sku.branch_id AND product_info.product_id= product_sku.product_id
LEFT JOIN branch_return_goods_info on branch_return_goods_info.order_id = product_sku.orderNo AND branch_return_goods_info.hmp_id = product_sku.product_id AND branch_return_goods_info.type = 2
LEFT JOIN stock_turnover as sk on sk.store_id = product_sku.store_id AND sk.product_id = product_sku.product_id and sk.datetime ='2017-04-25'
GROUP BY product_sku.product_id,product_sku.branch_id;

The execution plan after optimization is

group8.png
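
The filter-and-aggregate-first idea can be checked on a toy schema. The sketch below uses sqlite3 as a stand-in for MySQL with heavily simplified, hypothetical versions of the product_sku and product_info tables; it compares results only and makes no claim about timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sku (product_id INTEGER, branch_id INTEGER, num INTEGER, order_state INTEGER)")
conn.execute("CREATE TABLE product_info (product_id INTEGER, branch_id INTEGER, product_name TEXT)")
conn.executemany(
    "INSERT INTO product_sku VALUES (?, ?, ?, ?)",
    [(1, 1, 5, 100), (1, 1, 3, 100), (2, 1, 4, 100), (2, 1, 9, 0), (1, 2, 7, 100)],
)
conn.executemany(
    "INSERT INTO product_info VALUES (?, ?, ?)",
    [(1, 1, "apple"), (2, 1, "pear"), (1, 2, "apple")],
)

# Join every detail row first, then group the joined result.
join_then_group = conn.execute(
    "SELECT s.product_id, s.branch_id, i.product_name, SUM(s.num) "
    "FROM product_sku s "
    "JOIN product_info i ON i.product_id = s.product_id AND i.branch_id = s.branch_id "
    "WHERE s.order_state = 100 "
    "GROUP BY s.product_id, s.branch_id, i.product_name "
    "ORDER BY s.product_id, s.branch_id"
).fetchall()

# Filter and aggregate inside a derived table, then join the small result.
group_then_join = conn.execute(
    "SELECT s.product_id, s.branch_id, i.product_name, s.total_num "
    "FROM (SELECT product_id, branch_id, SUM(num) AS total_num "
    "      FROM product_sku WHERE order_state = 100 "
    "      GROUP BY product_id, branch_id) s "
    "JOIN product_info i ON i.product_id = s.product_id AND i.branch_id = s.branch_id "
    "ORDER BY s.product_id, s.branch_id"
).fetchall()
print(group_then_join)
```

The two forms agree here because product_info has exactly one row per (product_id, branch_id). When a joined table can match several rows per group, or when the join columns are finer than the grouping columns, pre-aggregating changes the result and the rewrite has to be checked case by case.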
