HQL:
insert overwrite TABLE lpx SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage                    // root stage
Stage-0 depends on stages: Stage-1         // Stage-0 depends on Stage-1
Stage-2 depends on stages: Stage-0         // Stage-2 depends on Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce                     // this stage is one MapReduce job
Alias -> Map Operator Tree:    // the map operator tree, corresponding to the map phase
t1
TableScan                      // scan the table to read data; for the FROM-side table, the description includes row counts, sizes, etc.
alias: t1                      // table alias
Reduce Output Operator         // describes the map output, i.e. the reduce input: key, partition, sort, and so on
key expressions:               // the key that table t1 emits to the reduce phase
expr: bar
type: string
sort order: +                  // one sort field (here the key is bar); multiple sort fields would show multiple +'s
Map-reduce partition columns:  // partition info; this shows that for a join Hive partitions on the columns in the ON clause, so rows with equal values in those columns reach the same reducer
expr: bar
type: string
tag: 0                         // tag marking rows from table t1
value expressions:             // the values that table t1 emits to the reduce phase
expr: foo
type: int
expr: bar
type: string
t2
TableScan
alias: t2
Reduce Output Operator
key expressions:
expr: bar
type: string
sort order: +
Map-reduce partition columns:
expr: bar
type: string
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:          // the reduce operator tree, corresponding to the reduce phase
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}  // corresponds to t1.bar, t1.foo above
1 {VALUE._col0}                // corresponds to t2.foo above
handleSkewJoin: false          // whether to handle a skewed join
outputColumnNames: _col0, _col1, _col5
Select Operator                // selects columns; the description includes column names, types, output type and size, etc.
expressions:
expr: _col1
type: string
expr: _col0
type: int
expr: _col5
type: int
outputColumnNames: _col0, _col1, _col2   // temporary column names assigned by rule for the intermediate result
File Output Operator           // writes the result to a temporary file; the description covers the compression setting and the output file format.
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
Stage: Stage-0
Move Operator                  // Stage-0 simply moves the result from the temporary directory into the directory of table lpx.
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
Stage: Stage-2
Stats-Aggr Operator
From the header:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
This shows the job structure of the plan: the whole task is split into three jobs. The first job consists of Stage-1. The second job consists of Stage-0, whose processing depends on the result of Stage-1. The third job consists of Stage-2, whose processing depends on the result of Stage-0.
Stage-1 and Stage-0 are explained below. Executing the SQL can be split into two steps:
(1) SELECT t1.bar, t1.foo, t2.foo FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar);
(2) insert overwrite TABLE lpx;
Stage: Stage-1 corresponds to one complete MapReduce job, consisting of two parts: the Map Operator Tree, corresponding to the map task, and the Reduce Operator Tree, corresponding to the reduce task.
The Map Operator Tree shows two parallel operations, on t1 and t2, roughly equivalent to SELECT t1.bar, t1.foo FROM t1; and SELECT t2.foo FROM t2; respectively, and each map task produces the input for the reduce phase [Reduce Output Operator].
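The tagged shuffle-and-join mechanics this plan describes can be sketched as a small simulation (illustrative only, not Hive internals; the sample rows are hypothetical):

```python
# Each map side emits (join_key, (tag, value)); rows are partitioned by the
# join key so equal keys meet in one reducer, which combines tag-0 and tag-1
# rows -- exactly what the Reduce Output Operator / Join Operator pair does.
from collections import defaultdict

t1 = [("b1", 1), ("b2", 2)]    # (bar, foo) from pokes   -> tag 0
t2 = [("b1", 10), ("b3", 30)]  # (bar, foo) from invites -> tag 1

def map_phase():
    for bar, foo in t1:
        yield bar, (0, (bar, foo))   # tag 0: t1 rows (value: bar, foo)
    for bar, foo in t2:
        yield bar, (1, (foo,))       # tag 1: t2 rows (value: foo)

def reduce_phase(shuffled):
    out = []
    for key, rows in shuffled.items():
        left = [v for tag, v in rows if tag == 0]
        right = [v for tag, v in rows if tag == 1]
        for l in left:               # "Inner Join 0 to 1"
            for r in right:
                out.append(l + r)    # -> (t1.bar, t1.foo, t2.foo)
    return out

# "Map-reduce partition columns: bar" -- group by the join key
shuffled = defaultdict(list)
for key, tagged in map_phase():
    shuffled[key].append(tagged)

print(reduce_phase(shuffled))        # only bar='b1' appears on both sides
```

Only the key `b1` exists in both inputs, so the inner join emits a single row.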
The Reduce Operator Tree shows that the reduce phase joins the map outputs on the join condition and, through the predefined output format, writes data to HDFS in the storage format of default.lpx. When we created table lpx we did not specify a storage format, so it defaults to Text, read and written with TextInputFormat and TextOutputFormat:
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: default.lpx
The input format is org.apache.hadoop.mapred.TextInputFormat because the temporary output files produced in the map phase are saved in TextOutputFormat, so the reduce side naturally reads them back with TextInputFormat. These details are handled by Hadoop's MapReduce machinery; Hive only needs to specify the formats. The serde is the org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe class, and the values it serializes here are _col0, _col1, _col2, i.e. the t1.bar, t1.foo, t2.foo we set out to query. Concretely, the stored value is _col0 + (the field delimiter configured for table lpx) + _col1 + (the field delimiter) + _col2. From outputformat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat we know the output is handled by that class.
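A minimal sketch of that delimited row layout (assuming lpx uses Hive's default field delimiter, the control character '\001'; a custom delimiter would replace it):

```python
# LazySimpleSerDe-style serialization: column values joined by the table's
# field delimiter, as described above (_col0 + delim + _col1 + delim + _col2).
FIELD_DELIM = "\x01"   # Hive's default field delimiter ('\001')

def serialize_row(cols):
    return FIELD_DELIM.join(str(c) for c in cols)

row = serialize_row(["b1", 1, 10])   # (t1.bar, t1.foo, t2.foo)
print(repr(row))
```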
Stage-0 corresponds to the second step above. The temporary files produced by Stage-1 (say, tmp) are processed by Stage-0 into table lpx. The Move Operator indicates that this is not a MapReduce task: it only invokes MoveTask, which checks that the input files match lpx's storage format before moving them.
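What MoveTask does can be sketched as a plain file move (paths are illustrative, not real Hive staging paths):

```python
# Stage-0 sketch: rename/move the Stage-1 temporary output directory into the
# table's directory. "replace: true" means old table contents are overwritten.
import os
import shutil
import tempfile

warehouse = tempfile.mkdtemp()
tmp_dir = os.path.join(warehouse, ".hive-staging", "tmp")  # Stage-1 output (hypothetical path)
table_dir = os.path.join(warehouse, "lpx")                 # destination table directory
os.makedirs(tmp_dir)
with open(os.path.join(tmp_dir, "000000_0"), "w") as f:
    f.write("b1\x011\x0110\n")     # one delimited result row

shutil.move(tmp_dir, table_dir)    # the "move" -- no MapReduce involved
print(os.listdir(table_dir))
```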
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b join mds_ad_algo_feed_monitor_data_table a
where a.dt<=20140101 and a.dt<=20140108 and key='deliver_id_bucket_id' and a.dt=b.dt and a.key_slice=b.deliver_id
Execution plan:
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME ods_ad_bid_deliver_info) b) (TOK_TABREF (TOK_TABNAME mds_ad_algo_feed_monitor_data_table) a))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_ALLCOLREF (TOK_TABNAME a))) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) cust_uid))) (TOK_WHERE (and (and (and (and (<= (. (TOK_TABLE_OR_COL a) dt) 20140101) (<= (. (TOK_TABLE_OR_COL a) dt) 20140108)) (= (TOK_TABLE_OR_COL key) 'deliver_id_bucket_id')) (= (. (TOK_TABLE_OR_COL a) dt) (. (TOK_TABLE_OR_COL b) dt))) (= (. (TOK_TABLE_OR_COL a) key_slice) (. (TOK_TABLE_OR_COL b) deliver_id))))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')   // filter on the given key value in the map phase
type: boolean
Reduce Output Operator
sort order:
tag: 1
value expressions:   // select * causes all columns to be emitted to the reduce phase
expr: key
type: string
expr: key_slice
type: string
expr: billing_mode_slice
type: string
expr: bucket_id
type: string
expr: ctr
type: string
expr: ecpm
type: string
expr: auc
type: string
expr: pctr
type: string
expr: pctr_ctr
type: string
expr: total_pv
type: string
expr: total_click
type: string
expr: dt
type: string
b
TableScan
alias: b
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: deliver_id
type: string
expr: cust_uid
type: string
expr: dt
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col6} {VALUE._col35}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
handleSkewJoin: false
outputColumnNames: _col0, _col6, _col35, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
Filter Operator
predicate:
expr: (((((_col49 <= 20140101) and (_col49 <= 20140108)) and (_col38 = 'deliver_id_bucket_id')) and (_col49 = _col35)) and (_col39 = _col0))
type: boolean
Select Operator
expressions:
expr: _col38
type: string
expr: _col39
type: string
expr: _col40
type: string
expr: _col41
type: string
expr: _col42
type: string
expr: _col43
type: string
expr: _col44
type: string
expr: _col45
type: string
expr: _col46
type: string
expr: _col47
type: string
expr: _col48
type: string
expr: _col49
type: string
expr: _col6
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
After the rewrite:
select a.*, b.cust_uid
from ods_ad_bid_deliver_info b
join mds_ad_algo_feed_monitor_data_table a
on (a.dt<=20140101 and a.dt<=20140108 and a.dt=b.dt and a.key_slice=b.deliver_id and a.key='deliver_id_bucket_id')
Execution plan:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')
type: boolean
Filter Operator
predicate:
expr: (dt <= 20140101)   // the partition filter takes effect on the map side
type: boolean
Filter Operator
predicate:
expr: (dt <= 20140108)   // the partition filter takes effect on the map side
type: boolean
Filter Operator
predicate:
expr: (key = 'deliver_id_bucket_id')
type: boolean
Reduce Output Operator
key expressions:
expr: dt
type: string
expr: key_slice
type: string
sort order: ++
Map-reduce partition columns:
expr: dt
type: string
expr: key_slice
type: string
tag: 1
value expressions:
expr: key
type: string
expr: key_slice
type: string
expr: billing_mode_slice
type: string
expr: bucket_id
type: string
expr: ctr
type: string
expr: ecpm
type: string
expr: auc
type: string
expr: pctr
type: string
expr: pctr_ctr
type: string
expr: total_pv
type: string
expr: total_click
type: string
expr: dt
type: string
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: dt
type: string
expr: deliver_id
type: string
sort order: ++
Map-reduce partition columns:
expr: dt
type: string
expr: deliver_id
type: string
tag: 0
value expressions:
expr: cust_uid
type: string
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col6}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11}
handleSkewJoin: false
outputColumnNames: _col6, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49
Select Operator
expressions:
expr: _col38
type: string
expr: _col39
type: string
expr: _col40
type: string
expr: _col41
type: string
expr: _col42
type: string
expr: _col43
type: string
expr: _col44
type: string
expr: _col45
type: string
expr: _col46
type: string
expr: _col47
type: string
expr: _col48
type: string
expr: _col49
type: string
expr: _col6
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
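The difference between the two plans above can be sketched as a simulation (illustrative sample rows, not real table data): with no ON clause, the join condition sits in WHERE and is only evaluated after a cross join in the reducer; with the predicates in ON, rows are filtered on the map side and joined on the composite key.

```python
# (dt, key_slice/deliver_id, payload) rows standing in for tables a and b.
a = [("20140101", "d1", "k-ok"), ("20140109", "d2", "k-ok")]
b = [("20140101", "d1", "cust1"), ("20140101", "d9", "cust9")]

# Plan 1: cross join everything, then apply the WHERE predicates (reducer-side)
plan1 = [(ra, rb) for ra in a for rb in b
         if ra[0] <= "20140108" and ra[0] == rb[0] and ra[1] == rb[1]]

# Plan 2: filter on the map side, then join on (dt, key_slice)=(dt, deliver_id)
a_filtered = [ra for ra in a if ra[0] <= "20140108"]
index = {(rb[0], rb[1]): rb for rb in b}
plan2 = [(ra, index[(ra[0], ra[1])])
         for ra in a_filtered if (ra[0], ra[1]) in index]

# Same result, but plan 2 shuffles far less data to the reducers.
print(plan1 == plan2)
```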
Example:
select * from emp e
left join dept d on e.deptno=d.deptno
where d.dt='2018-06-04';
Elapsed: Time taken: 44.401 seconds, Fetched: 17 row(s)
执行计划:
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
// 1. A local MR task is launched to read the small table
Map Reduce Local Work   // runs locally
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
Statistics: Num rows: 1 Data size: 168 Basic stats: PARTIAL Column stats: PARTIAL
// 2. The small table is written into a hash-table file
HashTable Sink Operator
keys:
0 deptno (type: string)
1 deptno (type: string)
// 3. The hash table is uploaded to the HDFS distributed cache (this step does not appear in the explain plan; the line below is from the job log)
2018-06-07 11:04:40 Uploaded 1 File to: file:/tmp/dw/fd4d06b7-523f-41a0-a8ff-ee4b13995b7e/hive_2018-06-07_11-04-35_149_2329359575876874876-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile01--.hashtable (583 bytes)
// 4. A map job reads the large table and joins it against the cached small table
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 deptno (type: string)
1 deptno (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col15 = '2018-06-04') (type: boolean)
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), '2018-06-04' (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
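The map join described by Stage-4 and Stage-3 can be sketched as follows (illustrative sample rows; not Hive internals). The small table is loaded into an in-memory hash table, then the large table is streamed through a map-only job and probed against it, so no reduce phase is needed:

```python
# (deptno, dname, dt) rows for the small table, (empno, deptno) for the large one.
dept = [("10", "SALES", "2018-06-04"), ("20", "HR", "2018-06-03")]
emp  = [("e1", "10"), ("e2", "30")]

# Stage-4 equivalent: HashTable Sink Operator builds {deptno -> dept row}
hash_table = {d[0]: d for d in dept}

# Stage-3 equivalent: Map Join Operator, "Left Outer Join 0 to 1" --
# every emp row survives; unmatched rows get None for the dept side.
joined = [(e, hash_table.get(e[1])) for e in emp]

# WHERE d.dt='2018-06-04' is applied AFTER the outer join, so it also drops
# the NULL-extended rows -- which is why moving it into ON changes the plan.
filtered = [(e, d) for e, d in joined if d is not None and d[2] == "2018-06-04"]
print(filtered)
```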
-- after the rewrite
select * from emp e
left join dept d on (e.deptno=d.deptno and d.dt='2018-06-04');
Elapsed: Time taken: 23.804 seconds, Fetched: 17 row(s)
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
d
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
d
TableScan
alias: d
filterExpr: (dt = '2018-06-04') (type: boolean)
Statistics: Num rows: 1 Data size: 84 Basic stats: PARTIAL Column stats: PARTIAL
HashTable Sink Operator
keys:
0 deptno (type: string)
1 deptno (type: string)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: e
Statistics: Num rows: 1 Data size: 757 Basic stats: PARTIAL Column stats: PARTIAL
Map Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 deptno (type: string)
1 deptno (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col12, _col13, _col14, _col15
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col15 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 832 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Sources:
https://tech.meituan.com/hive-sql-to-mapreduce.html (Meituan tech blog)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain (Hive documentation)
Teacher Qin