Official documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
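An execution plan generally has two parts:
STAGE DEPENDENCIES: the dependencies between the stages
STAGE PLANS: the execution plan of each stage
A stage is not necessarily a MapReduce job; it can also be a Fetch Operator or a Move Operator.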
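The STAGE DEPENDENCIES section is regular enough to be read by a small script. The following is a minimal sketch, not part of Hive; the function name parse_stage_dependencies is made up for illustration, and it assumes the line shapes that appear in the plans in this post ("is a root stage", "depends on stages:", "consists of").

```python
import re

def parse_stage_dependencies(lines):
    """Map each stage name to the list of stages it depends on."""
    deps = {}
    for line in lines:
        line = line.strip()
        match = re.match(r"(Stage-\d+)", line)
        if not match:
            continue  # not a stage line
        stage = match.group(1)
        parents = []
        if "depends on stages:" in line:
            tail = line.split("depends on stages:", 1)[1]
            # Ignore a trailing "consists of ..." clause; it lists children, not parents.
            tail = tail.split("consists of", 1)[0]
            parents = re.findall(r"Stage-\d+", tail)
        deps[stage] = parents
    return deps

# Lines copied from the first plan in this post.
plan = [
    "Stage-1 is a root stage",
    "Stage-0 depends on stages: Stage-1",
]
print(parse_stage_dependencies(plan))
```

A stage with an empty list is a root stage; every other stage can only run after the stages in its list.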
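The plan of a MapReduce stage has two parts:
Map Operator Tree: the plan of the map side
Reduce Operator Tree: the plan of the reduce side
Some common operators:
TableScan: reads the data; common attribute: alias
Select Operator: projects columns
Group By Operator: grouping and aggregation; common attributes: aggregations, mode. When there is no keys attribute, there is only one group.
Reduce Output Operator: sends rows to the reduce side; common attribute: sort order
Fetch Operator: the client fetches the result; common attribute: limit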
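Values of the common attributes and their meaning:
aggregations: used in Group By Operator
count(): counting
mode: used in Group By Operator
hash: partial aggregation on the map side, held in an in-memory hash table
mergepartial: merges the partial aggregation results
final: merges partial aggregation results into the final result
sort order: used in Reduce Output Operator
+ : ascending
(empty): no sorting
++ : ascending on both columns, when there are two key columns
+- : ascending on the first column and descending on the second, when there are two key columns
- : descending
and so on, one character per key column.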
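The sort order string can be read one character at a time, one character per key column. A small sketch of that reading follows; decode_sort_order is a name invented here, not a Hive function.

```python
def decode_sort_order(sort_order):
    """Translate a 'sort order' string into one direction per key column."""
    directions = {"+": "ascending", "-": "descending"}
    return [directions[ch] for ch in sort_order.strip()]

print(decode_sort_order(""))    # no key columns, so nothing is sorted
print(decode_sort_order("+"))   # one key column, ascending
print(decode_sort_order("+-"))  # first key ascending, second key descending
```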
Below are the execution plans of some typical operations.
hive> explain select count(*) from t_data1 ;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
# the lines above describe the dependencies between the stages
STAGE PLANS: # execution plan of each stage
Stage: Stage-1
Map Reduce # this stage is a MapReduce job
Map Operator Tree: # operator tree of the map phase
TableScan # scans the table to read the data
alias: t_data1 # alias of the scanned table
Statistics: Num rows: 1 Data size: 43835224 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator # projection
Statistics: Num rows: 1 Data size: 43835224 Basic stats: COMPLETE Column stats: COMPLETE
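Group By Operator # grouping and aggregation; no key is given, so there is a single group
aggregations: count() # aggregate function
mode: hash # partial aggregation on the map side, held in a hash table
outputColumnNames: _col0 # output column name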
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator # sends the rows to the reduce side
sort order: # empty: no sorting
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
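value expressions: _col0 (type: bigint) # value expression
Reduce Operator Tree: # operator tree of the reduce phase
Group By Operator # grouping and aggregation
aggregations: count(VALUE._col0) # aggregate function
mode: mergepartial # merges the partial results contributed by each map task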
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator # writes the result to a file
compressed: false # output is not compressed
Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0 # Stage-0, which depends on Stage-1
Fetch Operator # fetches the result for the client
limit: -1 # no limit
Processor Tree:
ListSink
The plan above is the execution plan of a simple count(*).
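The two Group By Operators in this plan split the count into two phases: mode hash on the map side and mode mergepartial on the reduce side. The sketch below imitates that split in plain Python. It is only an illustration under the assumption that each map task handles one input split; the function names and the sample rows are made up.

```python
def map_side_count(rows):
    """mode: hash. Each map task keeps a running partial count for its split."""
    partial = 0
    for _ in rows:
        partial += 1
    return partial

def reduce_side_merge(partials):
    """mode: mergepartial. The reducer adds up the partial counts."""
    return sum(partials)

# Three made-up input splits, one per map task.
splits = [["r1", "r2", "r3"], ["r4", "r5"], ["r6"]]
partials = [map_side_count(split) for split in splits]
total = reduce_side_merge(partials)
print(partials, total)
```

Only one small value per map task crosses the shuffle, which matches the "Num rows: 1 Data size: 8" statistics on the Reduce Output Operator.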
hive> explain select count(distinct sid) from t_data1 ;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_data1
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: sid (type: bigint) # selects sid
outputColumnNames: sid
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Group By Operator # grouping and aggregation
aggregations: count(DISTINCT sid) # aggregate function
keys: sid (type: bigint) # grouping key
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator # sends the rows to the reduce side
key expressions: _col0 (type: bigint) # key expression
sort order: + # ascending
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator # grouping and aggregation
aggregations: count(DISTINCT KEY._col0:0._col0)
mode: mergepartial # merges the partial aggregation results
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
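In the count(distinct sid) plan, the map side groups by sid and the Reduce Output Operator sorts on that key without listing any "Map-reduce partition columns", which suggests that all keys end up at a single reducer. A rough Python imitation of that flow is sketched below; it is not Hive's implementation, and the function names and sample values are invented.

```python
import heapq

def map_side_distinct(sids):
    """Map-side Group By with keys: sid. Duplicates within one split collapse."""
    return sorted(set(sids))

def reduce_side_count_distinct(sorted_streams):
    """The reducer sees the keys in sorted order and counts each change of key."""
    count = 0
    previous = object()  # sentinel that equals no real key
    for sid in heapq.merge(*sorted_streams):
        if sid != previous:
            count += 1
            previous = sid
    return count

# Three made-up input splits with overlapping sid values.
splits = [[3, 1, 3, 2], [2, 5, 5], [1, 4]]
streams = [map_side_distinct(split) for split in splits]
print(reduce_side_count_distinct(streams))
```

Unlike the plain count(*), the amount of shuffled data grows with the number of distinct keys, because every distinct sid of every split has to reach the reducer.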
hive> explain select applove_date , count(*) from t_data1 group by applove_date ;
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_data1
Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: applove_date (type: timestamp)
outputColumnNames: applove_date
Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count()
keys: applove_date (type: timestamp) # grouping key (also the partition key)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: timestamp)
sort order: +
Map-reduce partition columns: _col0 (type: timestamp)
Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: timestamp) # grouping key
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 547940 Data size: 21917612 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 547940 Data size: 21917612 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
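With a GROUP BY key, the plan adds "Map-reduce partition columns" so that all partial counts of one key meet at the same reducer. The sketch below imitates that routing. The crc32-based partition function is only a stand-in chosen for this example (Hive's real partitioner is not shown in the plan), and the function names and dates are made up.

```python
from collections import Counter, defaultdict
import zlib

def map_side_group_by(rows):
    """mode: hash. Partial count per key for one split."""
    return Counter(rows)

def partition(key, num_reducers):
    """Stand-in for 'Map-reduce partition columns': same key, same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_group_by(splits, num_reducers=2):
    reducer_inputs = defaultdict(list)
    for split in splits:
        for key, partial in map_side_group_by(split).items():
            reducer_inputs[partition(key, num_reducers)].append((key, partial))
    result = {}
    for pairs in reducer_inputs.values():
        merged = Counter()  # mode: mergepartial, per reducer
        for key, partial in pairs:
            merged[key] += partial
        result.update(merged)
    return result

# Two made-up input splits.
splits = [["2015-01-01", "2015-01-02", "2015-01-01"], ["2015-01-02", "2015-01-03"]]
print(sorted(run_group_by(splits).items()))
```

Because a key is always routed to the same reducer, each reducer can finish its groups without talking to the others.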
hive> explain select sid , rn from (select sid , row_number()over(order by sid ) rn from t_data1 ) t1 where rn < 10 ;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_data1
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: 0 (type: int), sid (type: bigint)
sort order: ++
Map-reduce partition columns: 0 (type: int)
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey1 (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
PTF Operator
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_wcol0 < 10) (type: boolean)
Statistics: Num rows: 1826467 Data size: 14611736 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _wcol0 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1826467 Data size: 14611736 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1826467 Data size: 14611736 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
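In the row_number() plan, the partition column is the constant 0, so every row falls into the same partition, and the Filter Operator sits after the PTF Operator, so the rn < 10 condition is applied only after all rows have been sorted and numbered. A small Python sketch of that order of operations, with an invented function name and sample data:

```python
def row_number_filter(sids, limit_rn):
    """Sort on sid, number the rows from 1, then keep rows with rn < limit_rn."""
    ordered = sorted(sids)  # Reduce Output Operator: sort on sid
    numbered = [(sid, rn) for rn, sid in enumerate(ordered, start=1)]  # PTF Operator
    return [(sid, rn) for sid, rn in numbered if rn < limit_rn]  # Filter Operator

print(row_number_filter([50, 10, 40, 20, 30], 4))
```

The full sort happens regardless of how few rows survive the filter, which is worth keeping in mind when comparing this plan with the order by ... limit plan.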
hive> explain select sid from t_data1 order by sid limit 10 ;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_data1
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: sid (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
TopN Hash Memory Usage: 0.1
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: bigint)
outputColumnNames: _col0
Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 10
Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 10
Processor Tree:
ListSink
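The "TopN Hash Memory Usage" line on the Reduce Output Operator appears to indicate that the map side keeps only the best N candidates instead of shuffling every row; this reading is an assumption, since the plan does not explain it. The sketch below shows the general top-N idea in Python with made-up names and data; it is not Hive's code.

```python
import heapq

def map_side_top_n(sids, n):
    """Keep only the n smallest keys of one split, in ascending order."""
    return heapq.nsmallest(n, sids)

def reduce_side_limit(candidates_per_map, n):
    """Merge the sorted candidates and stop after n rows (Limit operator)."""
    merged = heapq.merge(*candidates_per_map)
    return [sid for _, sid in zip(range(n), merged)]

# Three made-up input splits.
splits = [[9, 3, 7, 1], [8, 2, 6], [5, 4, 0]]
candidates = [map_side_top_n(split, 3) for split in splits]
print(reduce_side_limit(candidates, 3))
```

Each map task sends at most n rows, so the reducer merges a few short sorted lists rather than the whole table.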
hive> explain select a.sid , b.b_name from t_bin a join t_data1 b on(a.sid = b.sid ) where a.sid < 10000 ;
OK
STAGE DEPENDENCIES:
Stage-5 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-5
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (sid < 10000) (type: boolean)
Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: b
Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (sid < 10000) (type: boolean)
Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
value expressions: b_name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 sid (type: bigint)
1 sid (type: bigint)
outputColumnNames: _col0, _col32
Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col32 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
hive> explain select a.sid , b.b_name from t_bin a join t_data1 b on(a.sid = b.sid and a.sid < 10000) ;
OK
STAGE DEPENDENCIES:
Stage-5 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-5
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (sid < 10000) (type: boolean)
Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: b
Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (sid < 10000) (type: boolean)
Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
value expressions: b_name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 sid (type: bigint)
1 sid (type: bigint)
outputColumnNames: _col0, _col32
Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col32 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
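The two inner-join plans above are identical: whether a.sid < 10000 is written in WHERE or in ON, a Filter Operator with that predicate shows up on both table scans before the join. For an inner join on sid this is safe, as the following Python sketch tries to show on made-up rows (function names are invented for the example).

```python
def inner_join_filter_in_where(a_rows, b_rows, bound):
    """Join first, then apply the condition sid < bound to the joined rows."""
    joined = [(a, name) for a in a_rows for (b, name) in b_rows if a == b]
    return [(sid, name) for sid, name in joined if sid < bound]

def inner_join_filter_pushed_down(a_rows, b_rows, bound):
    """Filter both inputs on sid < bound before joining, as both plans do."""
    a_small = [a for a in a_rows if a < bound]
    b_small = [(b, name) for (b, name) in b_rows if b < bound]
    return [(a, name) for a in a_small for (b, name) in b_small if a == b]

# Made-up sample data: a holds sid values, b holds (sid, b_name) pairs.
a_rows = [1, 2, 20000]
b_rows = [(1, "x"), (2, "y"), (20000, "z"), (3, "w")]
assert (inner_join_filter_in_where(a_rows, b_rows, 10000)
        == inner_join_filter_pushed_down(a_rows, b_rows, 10000))
print(inner_join_filter_pushed_down(a_rows, b_rows, 10000))
```

Since a.sid = b.sid, a bound on a.sid is also a bound on b.sid, which is why the predicate can be applied to table b as well.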
hive> explain select a.sid , b.b_name from t_bin a full outer join t_data1 b on(a.sid = b.sid and a.sid < 10000) ;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: b
Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
value expressions: b_name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
filter predicates:
0 {(KEY.reducesinkkey0 < 10000)}
1
keys:
0 sid (type: bigint)
1 sid (type: bigint)
outputColumnNames: _col0, _col32
Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col32 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
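In the full outer join plan there is no Filter Operator on the map side; the condition a.sid < 10000 appears as "filter predicates" inside the Join Operator instead. A plausible reading is that the predicate only decides whether two rows match, while unmatched rows from either side must still be output. The Python sketch below imitates that behaviour on made-up rows; the function name is invented and the code is not Hive's join implementation.

```python
def full_outer_join(a_rows, b_rows, bound):
    """FULL OUTER JOIN ON a.sid = b.sid AND a.sid < bound."""
    out = []
    matched_b = set()
    for a in a_rows:
        partners = [(i, name) for i, (b, name) in enumerate(b_rows)
                    if a == b and a < bound]
        if partners:
            for i, name in partners:
                matched_b.add(i)
                out.append((a, name))
        else:
            out.append((a, None))      # unmatched row of a is kept
    for i, (b, name) in enumerate(b_rows):
        if i not in matched_b:
            out.append((None, name))   # unmatched row of b is kept
    return out

# Made-up sample data: a holds sid values, b holds (sid, b_name) pairs.
a_rows = [1, 20000]
b_rows = [(1, "x"), (20000, "z"), (3, "w")]
print(full_outer_join(a_rows, b_rows, 10000))
```

The row with sid 20000 fails the predicate, so it is not joined, yet it still appears once from each side with the other side set to None. Filtering before the join would have dropped those rows, which is why the plan differs from the inner-join case.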
hive> explain select a.sid , b.b_name from t_bin a left outer join t_data1 b on(a.sid = b.sid) ;
OK
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: b
Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: sid (type: bigint)
sort order: +
Map-reduce partition columns: sid (type: bigint)
Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
value expressions: b_name (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 sid (type: bigint)
1 sid (type: bigint)
outputColumnNames: _col0, _col32
Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col32 (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink