Scenario 1: Deduplication
1) The difference between UNION and UNION ALL, and how to choose between them
2) GROUP BY as a replacement for DISTINCT
1) The difference between UNION and UNION ALL, and how to choose between them
Note that in SQL, UNION ALL and UNION are not the same:
UNION ALL does not deduplicate the merged data
UNION deduplicates the merged data
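To make the difference concrete, here is a minimal sketch of my own (it assumes Hive 1.2+, where UNION means UNION DISTINCT, and a version that allows SELECT without a FROM clause; the literal rows are invented for illustration):
SELECT 1 AS id, 'a' AS tag
UNION ALL
SELECT 1 AS id, 'a' AS tag;
-- UNION ALL keeps both copies: 2 rows
SELECT 1 AS id, 'a' AS tag
UNION
SELECT 1 AS id, 'a' AS tag;
-- UNION deduplicates: 1 row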
Example (on the datacube_salary_org test table):
EXPLAIN
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION / UNION ALL -- placeholder: run the query once with each
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
;
EXPLAIN result for UNION ALL:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: datacube_salary_org
filterExpr: (pt = '20200405') (type: boolean)
Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
alias: datacube_salary_org
filterExpr: (pt = '20200406') (type: boolean)
Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
EXPLAIN result for UNION:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: datacube_salary_org
filterExpr: (pt = '20200405') (type: boolean)
Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
sort order: ++++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: datacube_salary_org
filterExpr: (pt = '20200406') (type: boolean)
Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
sort order: ++++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Comparing the two EXPLAIN results, it is easy to see that UNION introduces an extra Reduce phase. That also makes it easy to understand why, when no deduplication is required, you should use UNION ALL rather than UNION.
It is also sometimes claimed that deduplicating with UNION ALL followed by GROUP BY is more efficient than using UNION.
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
;
Rewritten as:
SELECT
company_name
,dep_name
,user_id
,user_name
FROM
(
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200405'
UNION ALL
SELECT
company_name
,dep_name
,user_id
,user_name
FROM datacube_salary_org
WHERE pt = '20200406'
) tmp
GROUP BY
company_name
,dep_name
,user_id
,user_name
;
I suspect the efficiency is identical; let's look at the EXPLAIN result of the rewritten query:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: datacube_salary_org
filterExpr: (pt = '20200405') (type: boolean)
Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 342 Basic stats: COMPLETE Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
sort order: ++++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: datacube_salary_org
filterExpr: (pt = '20200406') (type: boolean)
Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint), user_name (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 412 Basic stats: COMPLETE Column stats: NONE
Union
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
sort order: ++++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint), _col3 (type: string)
Statistics: Num rows: 2 Data size: 754 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint), KEY._col3 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 377 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
The two approaches produce identical EXPLAIN output (as the plans show, Hive already compiles UNION into a Union followed by a Group By Operator), so the rewrite is not an optimization.
Comparing running times (at small data volume): the UNION query and the UNION ALL + GROUP BY rewrite both took 5.2s.
Based on this comparison, the two can be considered equivalent.
2) GROUP BY as a replacement for DISTINCT
In real deduplication scenarios, we usually reach for DISTINCT.
In practice, however, GROUP BY is often the more efficient choice, at least on large data sets. Let's run an experiment.
First, the less efficient COUNT(DISTINCT ...) approach.
SQL:
SELECT
COUNT(DISTINCT company_name, dep_name, user_id)
FROM datacube_salary_org
;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: datacube_salary_org
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint)
outputColumnNames: company_name, dep_name, user_id
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT company_name, dep_name, user_id)
keys: company_name (type: string), dep_name (type: string), user_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint)
sort order: +++
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Running time on a small data set: 4s
====================
Next, the more efficient GROUP BY approach.
SQL:
SELECT COUNT(1)
FROM (
SELECT
company_name
,dep_name
,user_id
FROM datacube_salary_org
GROUP BY
company_name
,dep_name
,user_id
) AS tmp
;
EXPLAIN result:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: datacube_salary_org
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: company_name (type: string), dep_name (type: string), user_id (type: bigint)
outputColumnNames: company_name, dep_name, user_id
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: company_name (type: string), dep_name (type: string), user_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: bigint)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: bigint)
Statistics: Num rows: 7 Data size: 340 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Select Operator
Statistics: Num rows: 3 Data size: 145 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Running time on a small data set: 8s
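One caveat before moving on (my own note, not part of the original experiment): the GROUP BY rewrite is not always semantically identical to COUNT(DISTINCT ...). In Hive, COUNT(DISTINCT c1, c2, ...) only counts combinations in which every column is non-NULL, while GROUP BY treats NULL as a regular key value, so the subquery-plus-COUNT(1) form also counts groups whose key contains a NULL. A minimal sketch with invented rows (again assuming FROM-less SELECT is available):
SELECT COUNT(DISTINCT company_name, user_id)
FROM
(
SELECT 'a' AS company_name, 1 AS user_id
UNION ALL
SELECT 'a' AS company_name, CAST(NULL AS INT) AS user_id
) t
;
-- returns 1: the row whose user_id is NULL is ignored
SELECT COUNT(1)
FROM
(
SELECT
company_name
,user_id
FROM
(
SELECT 'a' AS company_name, 1 AS user_id
UNION ALL
SELECT 'a' AS company_name, CAST(NULL AS INT) AS user_id
) t
GROUP BY
company_name
,user_id
) g
;
-- returns 2: (a, 1) and (a, NULL) are separate groups
If the keys can contain NULLs and you want the COUNT(DISTINCT ...) semantics, filter the NULL rows out before the GROUP BY.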
Why the optimization works:
First, why on a large data set running GROUP BY first and then COUNT beats a direct COUNT(DISTINCT ...).
COUNT(DISTINCT ...) packs the relevant columns into a single key that is shipped to the reducer, i.e. count(DISTINCT KEY._col0:0._col0, KEY._col0:0._col1, KEY._col0:0._col2). Note in the plan above that its Reduce Output Operator has no Map-reduce partition columns, so a single reducer has to perform the full sort and the deduplication.
With GROUP BY first and COUNT afterwards, the GROUP BY distributes the different keys across multiple reducers and completes the deduplication within the GROUP BY phase. The data no longer funnels through one reducer, which exploits the parallelism of the cluster and makes the deduplication faster. The subsequent COUNT stage then simply counts the keys that the GROUP BY step already deduplicated.
Therefore, on large data sets, GROUP BY first and COUNT afterwards is more efficient than COUNT(DISTINCT ...).
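A rough way to observe the parallelism difference on a real cluster (a sketch of my own; mapreduce.job.reduces is the standard Hadoop setting for requesting reducers, and 8 is an arbitrary illustrative value):
-- The GROUP BY stage can spread its keys across all requested reducers
SET mapreduce.job.reduces=8;
SELECT COUNT(1)
FROM
(
SELECT
company_name
,dep_name
,user_id
FROM datacube_salary_org
GROUP BY
company_name
,dep_name
,user_id
) AS tmp
;
-- The COUNT(DISTINCT ...) form still pushes every row through a single reducer;
-- in its plan above, the Reduce Output Operator has no "Map-reduce partition columns" line.
SELECT
COUNT(DISTINCT company_name, dep_name, user_id)
FROM datacube_salary_org
;
Incidentally, the map-side "Group By Operator ... mode: hash" visible in both plans is Hive's map-side partial aggregation (hive.map.aggr, on by default), which already shrinks the data before it reaches the reducers.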
Now let's compare the small-data results above.
In the EXPLAIN output, COUNT(DISTINCT ...) has fewer stages than GROUP BY followed by COUNT, because the GROUP BY is already one MR stage and the COUNT is another.
In running time, the two are close; in fact, the total time of COUNT(DISTINCT ...) (4s) was lower than that of GROUP BY followed by COUNT (8s). This is because launching a stage means requesting resources and spinning up containers, which costs time; on a small data set, the extra stage of the GROUP BY approach mostly pays for resource requests and container creation rather than for useful work.
The root cause of this result is, again, the size of the data set: it comes down to weighing the time cost of a global sort in a single reducer against the cost of requesting resources for additional job stages.
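Related to the container cost just described: for genuinely small inputs, Hive's local mode can avoid much of the job-launch overhead. A hedged sketch (these settings exist in Hive, but the thresholds below are illustrative examples, not recommendations from this article):
-- Run qualifying small jobs in a single local JVM instead of launching cluster containers
SET hive.exec.mode.local.auto=true;
-- Auto-switch to local mode only when the input is small enough (example values)
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.input.files.max=4;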
Therefore, we should make a sensible choice based on the actual data volume.