GROUP BY
GROUP BY即分组 主要作用是把一个集合中根据group by后面的关键字进行分组,里面包含2个计算
-
去掉重复值(参考丁奇文章)
当没有索引时,根据group by后面的列,在内存中创建一个对应列为pk的临时表。对数据集进行遍历,没有则插入临时表,有则对对应的聚合函数列进行比较和更新,使最终结果中没有重复值
有索引时,直接使用索引,只需要一个计数器,对索引列进行更新,不需要维护临时表。
所以对group by进行优化的方式之一就是对group by后面的列添加索引,减少临时表(use tempfile)的产生和维护
减少use tempfile的两种方式
-
Loose Index Scan索引跳跃式扫描
要实现跳跃式扫描要满足如下条件
-
联合索引的前列选择率要低且必须在索引中,group by依次按照索引顺序,且select列中只有索引的左边的一部分且不能回表
例如如下表结构和执行计划
CREATE TABLE `t_group6` ( `emp_no` int(11) NOT NULL, `dept_no` char(4) NOT NULL, `from_date` date NOT NULL, `to_date` date NOT NULL, KEY `ix_empno_to_date` (`dept_no`,`to_date`,`emp_no`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
按照索引顺序的语句执行计划如下
-
-
但不按照索引顺序的时候则没有优化效果
2. 前导列是关键
-
Tight Index Scan
range scan或者index scan
- 创建索引的时候必须按照where条件中的等号字段再加上group后面的列的顺序来创建
-
进行排序(5.5~5.7),8.0开始不在对group by后面的关键字进行排序
group by的排序方式,字符类型是按照ascii码从左到右进行排序的。当含有空和null时 NULL在前空在后
-
相关参数
5.7 tmp_table_size,max_heap_table_size
-
group by分为完全模式和不完全模式
开启完全group by形式之后 select后面的字段中不能出现非聚合函数包裹的字段否则会报错
通过set session sql_mode = ' '; 这个配置来控制
并且不完全group模式会随着执行计划不同导致结果不同
例如有如下的表,索引情况和数据如下
当使用ix_torder_2索引时,执行结果如下
当时用ix_torder_3索引时,执行结果就改变了
DISTINCT
和group by一样,可以去掉重复值。区别在于distinct不排序。
在内存中创建一个对应列为pk的临时表。对数据集进行遍历,没有则插入临时表,有则违反uk约束
从执行计划上也能看出distinct和group by的区别
order by
order by后面可以写数字表示列名,如下例子中1表示emp_no
order by有两种排序方式
-
single pass
原理:一次性把sql中涉及的字段全部多出来,然后依据排序字段排序,最后直接返回结果。
优点:只需要一次顺序I/O,无需任何随机I/O,大大降低I/O开销
缺点:内存容不下大量数据时,可能会先放到磁盘上对数据切分,对小数据块逐个排序,然后将结果集merge,在返回结果集
例子如下,下图为数据库排序相关数据记录,sort_merge_passes为0,一般表示执行了一次排序
执行语句如下
SELECT emp_no,first_name,last_name FROM employees ORDER BY first_name LIMIT 100;
执行后的数据纪录如下
执行计划如下
简单的说就是把emp_no,first_name,last_name全部装到sort_buffer中排序
SQL单次排序效率与order by后面的字段以及select后面的字段都有关系,主要优化点在于减少数据集面积。所以查询的时候一般不要用*
-
two pass
原理:先读取行指针和排序字段进行排序,然后依据排序结果在读取所需数据
优点:排序数据量小,完全可以在内存中完成
缺点:第二次读取时,会发生大量随机I/O,太昂贵
two pass有两个要求
- 排序的列的值要大于max_length_for_sort_data
- select列当中包含blob或者text的值
实例如下,下图为排序语句执行前的数据库记录
执行语句如下
SELECT SQL_NO_CACHE COUNT(1) FROM (
SELECT * FROM t11 IGNORE INDEX (PRIMARY) ORDER BY c10 DESC LIMIT 10000
) a;
执行语句后,服务器记录如下
sort_merge_pass超过sort buffer大小时 排序数据就会写入temp file中 temp file合并的次数 数据的大小都和sort_buffer_size有直接关系
order by limit有重复列的情况
如下表,其中ID不是primary key
三个问题
-
当只有8条数据记录的时候,用create_time排序分页会有数据重复和丢失的问题,而用age则没有出现,age也有重复值
运行SQL如下
SELECT * FROM test ORDER BY create_time LIMIT 0,4; SELECT * FROM test ORDER BY create_time LIMIT 4,4;
运行结果如下
如图所示,id=8的数据重复,$\color{red}{原因是create_time重复,MySQL再回表查询剩余数据的时候,只要找到符合当前create_time的数据即可,所以造成了数据重复}$
当增加了4条记录后,用create_time排序分页没有问题,而使用age排序会出现数据重复或者丢失,age的重复值没有create_time多
给排序列加索引对这个问题没有帮助
解决方法
加入id列作为二级排序列
切换innodb引擎,对create_time列添加索引
-
创建内部视图,使用distinct关键字阻止视图合并
语句如下
SELECT * FROM (SELECT DISTINCT t.* FROM test t ORDER BY create_time) a LIMIT 0,4;
COUNT
count不计算NULL,在不影响业务的情况下可以使用count(1)或者count(*)。但是由于count是不计算NULL的,所以最好使用非空并且not null的列进行统计,如自增的primary key
count优化的例子
-
有如下的两条SQL
1. SELECT COUNT(fid=6 AND displayorder >= 0) AS posts FROM cdb_threads; 2. SELECT COUNT(fid) AS threads FROM cdb_threads WHERE fid=6 AND displayorder >= 0;
上面两条SQL对比,2要比1效率高,主要原因在于count的调用次数。假设当前表没有任何索引,查询fid=6并且displayorder >= 0的数据都需要全表扫描,效果相同。如果结果集数据为100条,2只需要对结果集进行一次count,但是1在整个过程中要进行100次count,显然1的效率更低
-
原语句如下
SELECT COUNT(emp_no IS NULL) AS c1 FROM t_order;
优化语句如下
1. SELECT COUNT(*) c1 FROM t_order WHERE emp_no IS NULL; 2. SELECT COUNT(case when emp_no is null then 1 end) c1 FROM t_order;
如上两个SQL 2相对1来说,效率会低。原因是行数不同。1会根据条件进行过滤,然后count统计实际的数据量。2则是全表扫描,会产生大量I/O
MIN/MAX
min和max可以求出数据列的最大最小值,但是不包含null
-
min/max使用例子
查看employess表中是否存在emp_no=1000111,存在返回1不存在返回0。直接使用case加where不能满足,如图
根据图片我们可以发现,直接使用case加where有值返回1,但是没值不会返回0。我们可以使用max或者min解决上面的问题,如图
max/min在没有值返回的时候,会返回NULL,可以使用这个特性解决问题
-
实际优化案例
有如下语句,运行时间超过5s
SELECT t.systen_phone AS plattorm, t.platform_name AS channel, COALESCE( sum(case when t.is_success='T' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59' then 1 else 0 end ) ,0) as applytotal , COALESCE( sum(case when is_påss='T' and tcl.is_refuse = 'F' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59' then 1 else 0 end ) ,0) as tpasscount, COALESCE( sum(case when (status >= 40 or status = -40) and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' then 1 else 0 end ) ,0) as transportnum, COALESCE( sum(case when (status >= 40 or status = -40)and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59' then t.apply_amount/10000 else 0 end ) ,0) as transportamount, COALESCE( sum( case when audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' then 1 else 0 end ) ,0) as cponum, COALESCE( sum( case when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' then 1 else 0 end ) ,0) as cpopassnum, COALESCE( sum( case when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59' then t.apply_amount/10000 else 0 end ) ,0) as cpopassamount, COALESCE( sum( case when (status = 90 or status = 80 or status = 60) and tc.contractamount is not NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' then tc.contractamount/10000 when (status = 90 or status = 80 or status = 60 ) and tc.contractamount is NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' then t.applt_amount/10000 else 0 end ) ,0) as contractamount, COALESCE( sum( case when (status = 90 or status = 80 or status = 60) and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59' then 1 else 0 end ) ,0) as closenum FROM tb_cbm_transport_yrd AS t LEFT JOIN tb_cbm_transport_result tc ON t.id = tc.transport_id LEFT JOIN tb_cbm_1imit tcl ON t.limit_id = tcl.limit_id WHERE t.tools_platorm_code = 'uxWap' GROUP BY left(t.system_phone,1);
执行计划如下图
查询效率低的主要是原因是where条件提前,造成全表扫描,I/O较大,优化思路就是where条件优化,提取公共部分,并且创建索引
优化后语句如下,优化后时间为0.07s。可以看出优化的主要位置是将where提取出来
SELECT t.systen_phone AS plattorm, t.platform_name AS channel,
COALESCE(
sum(case
when t.is_success='T' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
then 1 else 0 end
)
,0) as applytotal ,
COALESCE(
sum(case
when is_påss='T' and tcl.is_refuse = 'F' and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
then 1 else 0 end
)
,0) as tpasscount,
COALESCE(
sum(case
when (status >= 40 or status = -40) and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59'
then 1 else 0 end
)
,0) as transportnum,
COALESCE(
sum(case
when (status >= 40 or status = -40)and yrd_time IS NOT NULL and t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59'
then t.apply_amount/10000 else 0 end
)
,0) as transportamount,
COALESCE(
sum(case
when audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59'
then 1 else 0 end
)
,0) as cponum,
COALESCE(
sum(case
when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59'
then 1 else 0 end
)
,0) as cpopassnum,
COALESCE(
sum(case
when (status !=50 ) and audit_time IS NOT NULL and t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59'
then t.apply_amount/10000 else 0 end
)
,0) as cpopassamount,
COALESCE(
sum(case
when (status = 90 or status = 80 or status = 60) and tc.contractamount is not NULL and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59'
then tc.contractamount/10000
when (status = 90 or status = 80 or status = 60 ) and tc.contractamount is NULL and t.loan_date >= '2016-04-24' and t.loan_date <= '2018-04-24 23:59:59'
then t.applt_amount/10000 else 0 end
)
,0) as contractamount,
COALESCE(
sum(case
when (status = 90 or status = 80 or status = 60) and t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59'
then 1 else 0 end
)
,0) as closenum
FROM(
SELECT ifnull(max(t.system_phone) , 'androld') system_phone,
ifnull(max(t.platform_name),'YRD_APP') platform_name,
t.is_success, t.create_time ,t.is_påss,t.status,t.yrd_time,
t.apply_amount, t.audit_time, t.loan_date,t.id ,t.limit_id
FROM
tb_cbm_transport_yrd AS t
where t.tools_platorm_code = 'uxWap'
and e.system_phone = 'android'
and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
and (
(t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59')
or (t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59')
or (t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59')
or (t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59')
)
union all
SELECT ifnull(max(t.system_phone) , 'androld') system_phone,
ifnull(max(t.platform_name),'YRD_APP') platform_name,
t.is_success, t.create_time ,t.is_påss,t.status,t.yrd_time,
t.apply_amount, t.audit_time, t.loan_date,t.id ,t.limit_id
FROM
tb_cbm_transport_yrd AS t
where t.tools_platorm_code = 'uxWap'
and e.system_phone = 'ios'
and t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59'
and (
(t.create_time >= '2018-04-24' and t.create_time <= '2018-04-24 23:59:59')
or (t.yrd_time >= '2018-04-24' and t.yrd_time <= '2018-04-24 23:59:59')
or (t.audit_time >= '2018-04-24' and t.audit_time <= '2018-04-24 23:59:59')
or (t.loan_date >= '2018-04-24' and t.loan_date <= '2018-04-24 23:59:59')
)
) t
LEFT JOIN tb_cbm_transport_result tc ON t.id = tc.transport_id
LEFT JOIN tb_cbm_1imit tcl ON t.limit_id = tcl.limit_id
WHERE t.tools_platorm_code = 'uxWap' GROUP BY left(t.system_phone,1);
sum/avg
-
只能对数字或者能转化成数字的字符串类型计算
在MySQL中1/0返回NULL,如下图
avg是不会计算NULL那一行的,有如下例子
HAVING
having主要用于group by后的二次过滤
-
having可以减少子查询嵌套,如下例子
SELECT * FROM (SELECT 'wassup' AS hi FROM (select 1) x) a WHERE hi = 'wassup';
可以使用having改写,简化嵌套
SELECT 'wassup' AS hi FROM (select 1) x HAVING hi = 'wassup';
行转列,列转行
- 行转列
- 列转行
思路:
- 先把一行边两行
- 使用case when分类
GROUP BY和JOIN优化案例
有如下语句,运行时间为166s
SELECT product_sku.branch_id AS branchId,product_sku.product_id AS productId,
sum((product_sku.num - ifnull(branch_return_goods_info.num,0))) AS totalNum,
sum((product_sku.unit_price * product_sku.num) - ifnull(branch_return_goods_info.money,0)) AS totalMoney,
sum(ifnull(branch_return_goods_info.num,0)) returnNum,
sum(ifnull(branch_return_goods_info.money,0)) returnMoney,
product_info.branch_name AS branchName,
product_info.brand_name AS brandName,
product_info.product_name AS productName,
product_info.cate_one_name AS cateOneName,
product_info.cate_two_name AS cateTwoName,
product_info.cate_three_name AS cateThreeName,
sk.miniNum AS miniNum,
product_info.sell_unit_convert AS sellUnitConvert,
sk.stockMoney As stockMoney
FROM product_sku_201704 poduct_sku FORCE INDEX (IDX_time_order)
INNER JOIN product_info on product_info.branch_id= product_sku.branch_id AND product_info.pcoduct_id= poduct_sku.product_id
AND product_sku.store_id NOT in (108,109)
LEFT JOIN branch_return_goods_info on branch_return_goods_info.order_id = product_sku.orderNo AND branch_return_goods_info.hmp_id = product_sku.product_id AND branch_return_goods_info.type = 2
LEFT JOIN stock_turnover as sk on sk.store_id = product_sku.store_id AND sk.product_id = product_sku.product_id and sk.datetime ='2017-04-25'
WHERE product_sku.order_type >= 0
AND product_sku.orderNo not like 'YL%'
AND product_sku.ordetrNo not like 'PT%'
AND product_sku.order_time >= '2017-04-01 00:00:00'
AND product_sku.order_time <= '2017-04-25 23:59:59'
AND product_sku.orderState = 100
GROUP BY product_sku.product_id,product_sku.branch_id;
执行计划如下
从执行计划我们可以看出,阅读上述SQL可以发现,where过滤条件只和product_sku表有关,和join表没有关系,所以优化思路是先过滤,缩小结果集,然后join
优化后的SQL如下,优化后执行时间为25s
SELECT
product_sku.btanch_id as btanchId,
product_sku.ptoduct_id As ptoductId,
sum((product_sku.num - ifnull(branch_return_goods_info.num,0))) AS totalNum,
sum((product_sku.unit_price * product_sku.num) - ifnull(branch_return_goods_info.money,0)) AS totalMoney,
sum(ifnull(branch_return_goods_info.num,0)) returnNum,
sum(ifnull(branch_return_goods_info.money,0)) returnMoney,
product_info.branch_name AS branchName,
product_info.brand_name AS brandName,
product_info.product_name AS productName,
product_info.cate_one_name AS cateOneName,
product_info.cate_two_name AS cateTwoName,
product_info.cate_three_name AS cateThreeName,
sk.miniNum AS miniNum,
product_info.sell_unit_convert AS sellUnitConvert,
sk.stockMoney As stockMoney
FROM (
SELECT
product_sku.prcdct_id,
pcoduct_sku.branch_id,
sum(product_sku.num) num,
SUM(product_sku.unit_price) unit_price,
product_sku.orderNo,
product_sku.store_id
FROM proauct_sku_201704 produet_sku
WHERE product_sku.order_type >= 0
AND product_sku.orderNo not like 'YL%'
AND product_sku.ordetrNo not like 'PT%'
AND product_sku.order_time >= '2017-04-01 00:00:00'
AND product_sku.order_time <= '2017-04-25 23:59:59'
AND product_sku.orderState = 100
AND product_sku.store_id NOT in (108,109)
GROUP BY product_sku.product_id,product_sku.branch_id
) product_sku
INNER JOIN product_info on product_info.branch_id= product_sku.branch_id AND product_info.pcoduct_id= poduct_sku.product_id
LEFT JOIN branch_return_goods_info on branch_return_goods_info.order_id = product_sku.orderNo AND branch_return_goods_info.hmp_id = product_sku.product_id AND branch_return_goods_info.type = 2
LEFT JOIN stock_turnover as sk on sk.store_id = product_sku.store_id AND sk.product_id = product_sku.product_id and sk.datetime ='2017-04-25'
GROUP BY product_sku.product_id,product_sku.branch_id;
优化后执行计划如下