执行计划例子:
insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage /根
Stage-0 depends on stages: Stage-1 /0依赖1
Stage-2 depends on stages: Stage-0 /2依赖0
STAGE PLANS:
Stage: Stage-1
Map Reduce//这个阶段是一个mapreduce作业 Alias -> Map Operator Tree: //map操作树,对应map阶段
t1
TableScan //扫描表获取数据 from加载表,描述中有行数和大小等
alias: t1 //表别名
Reduce Output Operator //这里描述map的输出,也就是reduce的输入。比如key,partition,sort等信息
key expressions: //t1表输出到reduce阶段的key信息
expr: bar
type: string
sort order: + //一个排序字段,这个排序字段是key=bar,多个排序字段多个+
Map-reduce partition columns: //partition的信息,由此也可以看出hive在join的时候会以join on后的列作为partition的列,以保证具有相同此列的值的行被分到同一个reduce中去
expr: bar
type: string
tag: 0 //对t1表打标签
value expressions: //t1表输出到reduce阶段的value信息
expr: foo
type: int
expr: bar
type: string
t2
TableScan
alias: t2
Reduce Output Operator
key expressions:
expr: bar
type: string
sort order: +
Map-reduce partition columns:
expr: bar
type: string
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree://reduce操作树,相当于reduce阶段Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} //对应前面t1.bar, t1.foo
1 {VALUE._col0} //对应前面t2.foo
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col5
Select Operator //筛选列,描述中有列名、类型,输出类型、大小等。
expressions:
expr: _col1
type: string
expr: _col0
type: int
expr: _col5
type: int
outputColumnNames: _col0, _col1, _col2 //为临时结果字段按规则起的临时字段名
File Output Operator //输出结果到临时文件中,描述介绍了压缩格式、输出文件格式。
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
Stage: Stage-0
Move Operator //Stage-0简单把结果从临时目录,移动到表lpx相关的目录。
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
Stage: Stage-2
Stats-Aggr Operator
========================================
========================================
从信息头:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
从这里可以看出Plan计划的Job任务结构,整个任务会分为3个Job执行,第一个Job将由Stage-1构成;
第二个Job处理由Stage-0构成,Stage-0的处理必须依赖Stage-1阶段的结果;
第三个Job处理由Stage-2构成,Stage-2的处理必须依赖Stage-0阶段的结果。
下面分别解释Stage-1和Stage-0,执行SQL可以分成两步:(1)、SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
(2)
、insert overwrite TABLE lpx;
Stage: Stage-1对应一次完整的Map Reduce任务,包括:Map Operator Tree和Reduce Operator Tree两步操作,Map Operator Tree对应Map任务,Reduce Operator Tree对应Reduce任务。从Map Operator Tree阶段可以看出进行了两个并列的操作t1和t2,分别SELECT t1.bar, t1.foo FROM t1;和SELECT t2.foo FROM t2;而且两个Map任务分别产生了Reduce阶段的输入[Reduce Output Operator]。从Reduce Operator Tree分析可以看到如下信息,条件连接Map的输出以及通过预定义的输出格式生成符合default.lpx的存储格式的数据存储到HDFS中。在我们创建lpx表的时候,没有指定该表的存储格式,默认会以Text为存储格式,输入输出会以TextInputFormat与TextOutputFormat进行读写:
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
input format的值对应org.apache.hadoop.mapred.TextInputFormat,这是因为在开始的Map阶段产生的临时输出文件是以TextOutputFormat格式保存的,自然Reduce的读取是由TextInputFormat格式处理读入数据。这些是由Hadoop的MapReduce处理细节来控制,而Hive只需要指定处理格式即可。
Serde值为org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe类,这时这个对象的保存的值为_col0, _col1, _col2,也就是我们预期要查询的t1.bar, t1.foo, t2.foo,这个值具体的应该为_col0+表lpx设置的列分割符+_col1+表lpx设置的列分割符+_col2。outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat可以知道output的处理是使用该类来处理的。
Stage-0对应上面提到的第二步操作。这时stage-1产生的临时处理文件举例如tmp,需要经过stage-0阶段操作处理到lpx表中。Move Operator代表了这并不是一个MapReduce任务,只需要调用MoveTask的处理就行,在处理之前会去检查输入文件是否符合lpx表的存储格式。
hive执行计划作用
分析作业执行过程,优化作业执行流程,提升作业执行效率;例如,数据过滤条件从reduce端提前到map端,有效减少map/reduce间shuffle数据量,提升作业执行效率;
提前过滤数据数据集,减少不必要的读取操作;例如: hive join操作先于where条件顾虑,将分区条件放入on语句中,能够有效减少输入数据集;
执行计划分析问题hql
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b join mds_ad_algo_feed_monitor_data_table a
where a.dt<=20140101 and a.dt<=20140108 and key='deliver_id_bucket_id' and a.dt=b.dt and a.key_slice=b.deliver_id
==========================================================================
==========================================================================
执行计划:
抽象语法树:
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME ods_ad_bid_deliver_info) b) (TOK_TABREF (TOK_TABNAME mds_ad_algo_feed_monitor_data_table) a))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_ALLCOLREF (TOK_TABNAME a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) cust_uid))) (TOK_WHERE (and (and (and (and (<= (. (TOK_TABLE_OR_COL a) dt) 20140101) (<= (. (TOK_TABLE_OR_COL a) dt) 20140108)) (= (TOK_TABLE_OR_COL key) 'deliver_id_bucket_id')) (= (. (TOK_TABLE_OR_COL a) dt) (. (TOK_TABLE_OR_COL b) dt))) (= (. (TOK_TABLE_OR_COL a) key_slice) (. (TOK_TABLE_OR_COL b) deliver_id))))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id') //按key指定值在map阶段过滤
type: boolean
Reduce Output Operator
sort order:
tag: 1
value expressions: //select *导致输出到reduce的数据是全部的列信息
expr: key
type: string
expr: key_slice
type: string
expr: billing_mode_slice
type: string
expr: bucket_id
type: string
expr: ctr
type: string
expr: ecpm
type: string
expr: auc
type: string
expr: pctr
type: string
expr: pctr_ctr
type: string
expr: total_pv
type: string
expr: total_click
type: string
expr: dt
type: string
b
TableScan
alias: b
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: deliver_id
type: string
expr: cust_uid
type: string
expr: dt
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col6} {VALUE._col35}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
handleSkewJoin: false
outputColumnNames: _col0, _col6, _col35, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
Filter Operator
predicate:
expr: (((((_col49 <= 20140101) and (_col49 <= 20140108)) and (_col38 = 'deliver_id_bucket_id')) and (_col49 = _col35)) and (_col39 = _col0))
type: boolean
Select Operator
expressions:
expr: _col38
type: string
expr: _col39
type: string
expr: _col40
type: string
expr: _col41
type: string
expr: _col42
type: string
expr: _col43
type: string
expr: _col44
type: string
expr: _col45
type: string
expr: _col46
type: string
expr: _col47
type: string
expr: _col48
type: string
expr: _col49
type: string
expr: _col6
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
优化之后hql:
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b
join mds_ad_algo_feed_monitor_data_table a
on(a.dt<=20140101 and a.dt<=20140108 and a.dt=b.dt and a.key_slice=b.deliver_id and a.key='deliver_id_bucket_id')
=================================================================
=================================================================
执行计划:
抽象语法树:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')
type: boolean
Filter Operator
predicate:
expr: (dt <= 20140101) //分区过滤条件在map端生效
type: boolean
Filter Operator
predicate:
expr: (dt <= 20140108) //分区过滤条件在map端生效
type: boolean
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')
type: boolean
Reduce Output Operator
key expressions:
expr: dt
type: string
expr: key_slice
type: string
sort order: ++
Map-reduce partition columns:
expr: dt
type: string
expr: key_slice
type: string
tag: 1
value expressions:
expr: key
type: string
expr: key_slice
type: string
expr: billing_mode_slice
type: string
expr: bucket_id
type: string
expr: ctr
type: string
expr: ecpm
type: string
expr: auc
type: string
expr: pctr
type: string
expr: pctr_ctr
type: string
expr: total_pv
type: string
expr: total_click
type: string
expr: dt
type: string
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: dt
type: string
expr: deliver_id
type: string
sort order: ++
Map-reduce partition columns:
expr: dt
type: string
expr: deliver_id
type: string
tag: 0
value expressions:
expr: cust_uid
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col6}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
handleSkewJoin: false
outputColumnNames: _col6, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
Select Operator
expressions:
expr: _col38
type: string
expr: _col39
type: string
expr: _col40
type: string
expr: _col41
type: string
expr: _col42
type: string
expr: _col43
type: string
expr: _col44
type: string
expr: _col45
type: string
expr: _col46
type: string
expr: _col47
type: string
expr: _col48
type: string
expr: _col49
type: string
expr: _col6
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
例子:
select * from emp e
left join dept d on e.deptno=d.deptno
where d.dt='2018-06-04';
花费时间:Time taken: 44.401 seconds, Fetched: 17 row(s)
执行计划:
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work /本地执行
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
Statistics: Num rows: 1 Data size: 168 Basic stats: PARTIAL Column stats: PARTIAL
HashTable Sink Operator/ReduceSinkOperator将Map端的字段组合序列化为Reduce Key/value, Partition Key,只可能出现在Map阶段,同时也标志着Hive生成的MapReduce程序中Map阶段的结束。
keys:
0 deptno (type: string)
1 deptno (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 deptno (type: string)
1 deptno (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col15 = '2018-06-04') (type: boolean)
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: str
ing), _col12 (type: string), _col13 (type: string), _col14 (type: string), '2018-06-04' (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
select * from emp e
left join dept d on (e.deptno=d.deptno and d.dt='2018-06-04');
花费时间:Time taken: 23.804 seconds, Fetched: 17 row(s)
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
filterExpr: (dt = '2018-06-04') (type: boolean)
Statistics: Num rows: 1 Data size: 84 Basic stats: PARTIAL Column stats: PARTIAL
HashTable Sink Operator
keys:
0 deptno (type: string)
1 deptno (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 deptno (type: string)
1 deptno (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: strin
g), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink