For the two SQL statements below, most people assume the second one runs faster. After actually running them, the first took 149s and the second took 160s, and their execution plans turn out to be nearly identical. If you are interested, take a look at the abstract syntax tree and stage plans below.
Note: for an equi-join (inner join) there is essentially no difference between the two forms. For an outer join, however, only the second form guarantees correct results, because the outer join is executed first and the WHERE predicates are only applied afterwards, in the reduce phase (see the outer-join sketch after hql2 below).
hql1:
hive -e "use acorn_3g;
explain select t.app_id,count(distinct t.uid)
from tmp_user_info t
join tmp_user_info n
on t.uid = n.uid
where t.log_date='2013-11-05' and n.log_date='2013-12-05'
group by t.app_id;"
hql2:
hive -e "use acorn_3g;
explain select t.app_id,count(distinct t.uid)
from ( select app_id,uid
from tmp_user_info
where log_date='2013-11-05') t
join ( select uid
from tmp_user_info
where log_date='2013-12-05') n
on t.uid = n.uid group by t.app_id;"
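To illustrate the note above, here is a minimal sketch of the outer-join case, reusing tmp_user_info from the queries above. In the first form, the WHERE filter on n.log_date runs after the LEFT OUTER JOIN, so the NULL-extended rows are discarded and the query silently degenerates into an inner join; filtering inside the subqueries (the hql2 form) keeps the outer-join semantics.
-- form 1: the n.log_date filter in WHERE drops the NULL-extended rows -> effectively an inner join
select t.app_id, count(distinct t.uid)
from tmp_user_info t
left outer join tmp_user_info n on t.uid = n.uid
where t.log_date='2013-11-05' and n.log_date='2013-12-05'
group by t.app_id;
-- form 2: filters pushed into the subqueries -> users absent on 2013-12-05 are still counted
select t.app_id, count(distinct t.uid)
from ( select app_id,uid from tmp_user_info where log_date='2013-11-05') t
left outer join ( select uid from tmp_user_info where log_date='2013-12-05') n on t.uid = n.uid
group by t.app_id;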
hql1 abstract syntax tree:
ABSTRACT SYNTAX TREE:
(TOK_QUERY
(TOK_FROM
(TOK_JOIN
(TOK_TABREF
(TOK_TABNAME tmp_user_info) t
)
(TOK_TABREF
(TOK_TABNAME tmp_user_info) n
)
( =
(. (TOK_TABLE_OR_COL t) uid)
(. (TOK_TABLE_OR_COL n) uid)
)
)
)
(TOK_INSERT
(TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)
)
(TOK_SELECT
(TOK_SELEXPR
(. (TOK_TABLE_OR_COL t) app_id)
)
(TOK_SELEXPR
(TOK_FUNCTIONDI count
(. (TOK_TABLE_OR_COL t) uid)
)
)
)
(TOK_WHERE
( and
(= (. (TOK_TABLE_OR_COL t) log_date) '2013-11-05')
(= (. (TOK_TABLE_OR_COL n) log_date) '2013-12-05')
))
(TOK_GROUPBY (. (TOK_TABLE_OR_COL t) app_id))
)
)
hql1 stage plan:
STAGE DEPENDENCIES:
Stage-6 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-6
Conditional Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
n
TableScan //scan the table
alias: n //table alias
Reduce Output Operator //emit key/value to the reducer: (uid, null)
key expressions:
expr: uid
type: bigint
sort order: +
Map-reduce partition columns: //hadoop partition
expr: uid
type: bigint
tag: 1 //tag 1 marks table n
t
TableScan //scan the table
alias: t //table alias
Reduce Output Operator //emit key/value to the reducer: key = uid, value = (uid, app_id)
key expressions:
expr: uid
type: bigint
sort order: +
Map-reduce partition columns: //hadoop partition
expr: uid
type: bigint
tag: 0
value expressions:
expr: uid
type: bigint
expr: app_id
type: string
Reduce Operator Tree: //reduce-side join
Join Operator
condition map:
Inner Join 0 to 1 //table t (tag 0) joined with table n (tag 1)
condition expressions:
0 {VALUE._col1} {VALUE._col3} //from the tag-0 table (t): uid, app_id
1
handleSkewJoin: false
outputColumnNames: _col1, _col3
Select Operator //project columns
expressions:
expr: _col3 //t.app_id
type: string
expr: _col1 //t.uid
type: bigint
outputColumnNames: _col3, _col1 //projected output columns
Group By Operator //partial aggregation in the reduce phase of the first MR job
aggregations:
expr: count(DISTINCT _col1) //aggregate function
bucketGroup: false
keys:
expr: _col3 //app_id
type: string
expr: _col1 //uid
type: bigint
mode: hash
outputColumnNames: _col0, _col1, _col2
File Output Operator //write to an intermediate file; the first MR job's output is materialized here
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2 //second MR job
Map Reduce
Alias -> Map Operator Tree:
hdfs://YZSJHL18-25.opi.com/tmp/hive-dp/hive_2013-12-09_21-06-05_895_5853839692383923149/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0 //app_id
type: string
expr: _col1 //uid
type: bigint
sort order: ++
Map-reduce partition columns: //hadoop partition app_id
expr: _col0
type: string
tag: -1
value expressions:
expr: _col2
type: bigint
Reduce Operator Tree: //final aggregation on the reduce side
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col1:0._col0) //aggregate function; the distinct uses (app_id, uid) as the composite shuffle key
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial
outputColumnNames: _col0, _col1
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator //write the final output file; the job finishes here
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
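Before reading hql2's plan, it may help to restate what hql1's two MR jobs compute. The following is only an illustrative HiveQL paraphrase (not anything Hive generates): Stage-1 does the reduce-side join plus a hash-mode partial aggregation keyed on (app_id, uid) and writes it to the intermediate sequence file; Stage-2 re-shuffles that output on app_id and finishes the distinct count.
-- illustrative paraphrase of hql1's two stages
select app_id, count(distinct uid)                -- Stage-2: mergepartial group-by on app_id
from ( select t.app_id, t.uid                     -- Stage-1: reduce-side join on uid,
       from tmp_user_info t                       --          then hash-mode partial aggregation
       join tmp_user_info n on t.uid = n.uid      --          keyed on (app_id, uid)
       where t.log_date='2013-11-05' and n.log_date='2013-12-05'
       group by t.app_id, t.uid ) s
group by app_id;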
hql2 stage plan:
STAGE DEPENDENCIES:
Stage-6 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-6
Conditional Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
n:tmp_user_info //subquery alias
TableScan //scan the table
alias: tmp_user_info
Select Operator //project columns
expressions:
expr: uid
type: bigint
outputColumnNames: _col0
Reduce Output Operator //emit key/value to the reducer: (uid, null)
key expressions:
expr: _col0
type: bigint
sort order: +
Map-reduce partition columns:
expr: _col0
type: bigint
tag: 1 //tag 1 marks subquery n
t:tmp_user_info //subquery alias
TableScan //scan the table
alias: tmp_user_info
Select Operator
expressions:
expr: app_id
type: string
expr: uid
type: bigint
outputColumnNames: _col0, _col1 //map output: app_id, uid
Reduce Output Operator //key = uid, value = (app_id, uid)
key expressions:
expr: _col1
type: bigint
sort order: +
Map-reduce partition columns: //hadoop partition
expr: _col1
type: bigint
tag: 0 //tag 0 marks subquery t
value expressions:
expr: _col0
type: string
expr: _col1
type: bigint
Reduce Operator Tree: //reduce-side join
Join Operator
condition map: //common reduce join
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1
handleSkewJoin: false
outputColumnNames: _col0, _col1
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
Group By Operator
aggregations:
expr: count(DISTINCT _col1)
bucketGroup: false
keys:
expr: _col0
type: string
expr: _col1
type: bigint
mode: hash
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://YZSJHL18-25.opi.com/tmp/hive-dp/hive_2013-12-09_21-05-50_461_737142336422281010/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: bigint
sort order: ++
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col2
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col1:0._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial
outputColumnNames: _col0, _col1
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
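The wiki page above documents the full EXPLAIN syntax. If you want to see where the log_date predicates went (for example, whether they became partition pruning, assuming log_date is a partition column of tmp_user_info), EXPLAIN EXTENDED prints extra details such as the selected paths/partitions that the plain EXPLAIN output above omits. For example, hql1 with the keyword added:
hive -e "use acorn_3g;
explain extended select t.app_id,count(distinct t.uid)
from tmp_user_info t
join tmp_user_info n
on t.uid = n.uid
where t.log_date='2013-11-05' and n.log_date='2013-12-05'
group by t.app_id;"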