Scenario 2: Reduce the number of jobs
1) Use UNION ALL cleverly to reduce the number of jobs
2) Use the same join key fields across multiple tables to reduce the number of jobs
1) Use UNION ALL cleverly to reduce the number of jobs
Consider a scenario like the following: we need to count the number of rows in each of several tables.
The first option is to write a separate SQL statement for each table, but that is inefficient (and not worth pursuing).
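For reference, that per-table version would look roughly like this, with one statement and therefore one separate job per table:
SELECT COUNT(1) AS num FROM datacube_salary_basic_aggr;
SELECT COUNT(1) AS num FROM datacube_salary_company_aggr;
SELECT COUNT(1) AS num FROM datacube_salary_dep_aggr;
SELECT COUNT(1) AS num FROM datacube_salary_total_aggr;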
Alternatively, we can merge the individual results with UNION ALL, but that is still fairly inefficient.
For example:
SELECT
'a' AS type
,COUNT(1) AS num
FROM datacube_salary_basic_aggr AS a
UNION ALL
SELECT
'b' AS type
,COUNT(1) AS num
FROM datacube_salary_company_aggr AS b
UNION ALL
SELECT
'c' AS type
,COUNT(1) AS num
FROM datacube_salary_dep_aggr AS c
UNION ALL
SELECT
'd' AS type
,COUNT(1) AS num
FROM datacube_salary_total_aggr AS d
;
A better approach is to read the data of all the tables in first, tag each row with its source table, and then do a single aggregation.
Since multiple tables serve as input here, this job will use multiple mappers.
Example:
SELECT
     type
    ,COUNT(1)
FROM
(
    SELECT
         'a' AS type
        ,total_salary
    FROM datacube_salary_basic_aggr AS a
    UNION ALL
    SELECT
         'b' AS type
        ,total_salary
    FROM datacube_salary_company_aggr AS b
    UNION ALL
    SELECT
         'c' AS type
        ,total_salary
    FROM datacube_salary_dep_aggr AS c
    UNION ALL
    SELECT
         'd' AS type
        ,total_salary
    FROM datacube_salary_total_aggr AS d
) AS tmp
GROUP BY
    type
;
Let's run EXPLAIN to see exactly how these two approaches differ.
EXPLAIN output of the first approach:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4, Stage-5
Stage-3 is a root stage
Stage-4 is a root stage
Stage-5 is a root stage
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'a' (type: string), _col0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Union
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
Union
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
Union
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
Union
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 4 Data size: 372 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: b
Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'b' (type: string), _col0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: c
Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'c' (type: string), _col0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-5
Map Reduce
Map Operator Tree:
TableScan
alias: d
Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
sort order:
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col0 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'd' (type: string), _col0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
As you can see, the query is split into 6 stages. Each SELECT '' AS type, COUNT(1) FROM xxx; block is its own stage; since there are 4 tables, 4 such jobs are created (Stage-1, Stage-3, Stage-4 and Stage-5), and Stage-2 is the UNION job that gathers their results.
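Note, incidentally, that Stage-1, Stage-3, Stage-4 and Stage-5 are all independent root stages, so one way to soften the cost of this first version is to let Hive run those stages concurrently instead of one after another. A minimal sketch, assuming the standard parallel-execution settings of Hive on MapReduce:
-- Let independent stages of the same query run at the same time
SET hive.exec.parallel=true;
-- Upper bound on how many stages may run concurrently
SET hive.exec.parallel.thread.number=8;
This does not reduce the number of jobs, it only overlaps them, so the rewrite below is still the better fix.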
Now let's look at the EXPLAIN output of the optimized version:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 7 Data size: 2086 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'a' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 7 Data size: 595 Basic stats: COMPLETE Column stats: COMPLETE
Union
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint)
TableScan
alias: b
Statistics: Num rows: 2 Data size: 400 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'b' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 2 Data size: 170 Basic stats: COMPLETE Column stats: COMPLETE
Union
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint)
TableScan
alias: c
Statistics: Num rows: 4 Data size: 1160 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'c' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 4 Data size: 340 Basic stats: COMPLETE Column stats: COMPLETE
Union
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint)
TableScan
alias: d
Statistics: Num rows: 1 Data size: 112 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: 'd' (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 85 Basic stats: COMPLETE Column stats: COMPLETE
Union
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14 Data size: 1190 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: count(1)
keys: _col0 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
As you can see, this version runs in only two stages: Stage-0 and Stage-1.
Why does it end up with fewer stages than the previous version? Because we UNION ALL all the data first and only then aggregate, the rows of all the tables are scanned in a single pass, with the type column marking which table each row came from, which is why it is more efficient. The trade-off is that this stage needs more mappers, since it reads 4 tables at once.
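If the larger number of mappers ever becomes a problem on bigger inputs, the usual lever is to combine small input splits so that fewer map tasks are launched. A rough sketch, assuming the commonly used CombineHiveInputFormat and an example split size:
-- Combine small files/splits into fewer map tasks (the size value is only an example)
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapreduce.input.fileinputformat.split.maxsize=256000000;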
Let's also compare the execution times of the two approaches (on a small data set):
Less efficient version (the first SQL): 138.285 seconds
More efficient version: 38.545 seconds
The second version is clearly faster: fewer stages mean fewer rounds of resource allocation, so the total elapsed time drops.
2) Use the same join key across multiple tables to reduce the number of jobs
Here we want to fetch all of a user's information, and there are two ways to write the query.
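The join_multi_a / join_multi_b / join_multi_c tables are not defined in this post; the following is a hypothetical DDL sketch, with column types inferred from the EXPLAIN output further down, just so the join examples are easier to follow:
-- Hypothetical layouts, inferred from the query plans shown later
CREATE TABLE join_multi_a (user_id BIGINT, mobile STRING, sex BIGINT);
CREATE TABLE join_multi_b (user_id BIGINT, mobile STRING, user_name STRING);
CREATE TABLE join_multi_c (user_id BIGINT, mobile STRING, type STRING);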
The inefficient version: since both user_id and mobile can serve as join keys,
1. we first join on user_id, and then join on mobile:
EXPLAIN
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON b.mobile = c.mobile
;
2. Join on user_id both times:
SELECT
a.user_id
,a.mobile
,a.sex
,b.user_name
,c.type
FROM join_multi_a AS a
LEFT JOIN join_multi_b AS b
ON a.user_id = b.user_id
LEFT JOIN join_multi_c AS c
ON a.user_id = c.user_id
;
Let's look at the EXPLAIN output of each of these two statements with map join disabled.
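For reproducibility, this is roughly how automatic map-join conversion is switched off before running the EXPLAIN statements (a minimal sketch using the standard Hive setting):
-- Keep both joins as common (shuffle) joins rather than map joins
SET hive.auto.convert.join=false;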
Joining first on user_id, then on mobile:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
value expressions: mobile (type: string), sex (type: bigint)
TableScan
alias: b
Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE
value expressions: mobile (type: string), user_name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 user_id (type: bigint)
1 user_id (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col7, _col8
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col7 (type: string)
sort order: +
Map-reduce partition columns: _col7 (type: string)
Statistics: Num rows: 3 Data size: 33 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string)
TableScan
alias: c
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: mobile (type: string)
sort order: +
Map-reduce partition columns: mobile (type: string)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
value expressions: type (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 _col7 (type: string)
1 mobile (type: string)
outputColumnNames: _col0, _col1, _col2, _col8, _col14
Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Joining on user_id both times:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
value expressions: mobile (type: string), sex (type: bigint)
TableScan
alias: b
Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 3 Data size: 42 Basic stats: COMPLETE Column stats: NONE
value expressions: user_name (type: string)
TableScan
alias: c
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: user_id (type: bigint)
sort order: +
Map-reduce partition columns: user_id (type: bigint)
Statistics: Num rows: 3 Data size: 30 Basic stats: COMPLETE Column stats: NONE
value expressions: type (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
Left Outer Join0 to 2
keys:
0 user_id (type: bigint)
1 user_id (type: bigint)
2 user_id (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col8, _col14
Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: string), _col2 (type: bigint), _col8 (type: string), _col14 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 66 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
As you can see, joining first on user_id and then on mobile produces one extra stage.
When both joins use user_id, the join keys are identical, so the three-table join is carried out in a single stage.
Now let's look at the execution times of the two versions:
Join on user_id first, then on mobile: 62.167 seconds
Join on user_id both times: 32.123 seconds
Joining on the same key is clearly more efficient than the SQL that joins on different keys.
To summarize:
With joins on the same key, there are fewer stages, which reduces the number of MR jobs, so the query runs faster.
With joins on different keys, there are more stages; the extra MR job adds both computation time and resource-allocation time, so the query is slower.
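As a side note, when the join keys genuinely have to differ (as in the user_id / mobile case above), one common mitigation, assuming the smaller tables fit in memory, is to let Hive convert the joins into map joins so the extra shuffle stage becomes map-side work. A hedged sketch with example thresholds:
-- Re-enable automatic map-join conversion (it was disabled above for the demo)
SET hive.auto.convert.join=true;
-- Total size of small tables allowed for a single map-join task (example value)
SET hive.auto.convert.join.noconditionaltask.size=10000000;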
Statement: this is an original article by CSDN blogger 「高达一号」, licensed under CC 4.0 BY-SA; please include the original source link and this statement when reposting.
Original link: https://blog.csdn.net/u010003835/article/details/105493938