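Because our logs were growing rapidly, the statistics jobs we originally ran on MySQL were struggling more and more, so the company decided to migrate the statistics workload to Hadoop. While cross-checking the data, I ran into a pitfall in Hive. Here is the query, followed by its execution plan: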
select a.* from default.t_softuser a
left join
t_softuser b on
a.hid=b.hid and a.corp=b.corp and a.softid=b.softid
and a.statdate='2015-01-27' and b.statdate='2015-01-27'
where b.hid is null
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 878083830 Data size: 60095442437 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: hid (type: string), corp (type: string), softid (type: int)
sort order: +++
Map-reduce partition columns: hid (type: string), corp (type: string), softid (type: int)
Statistics: Num rows: 878083830 Data size: 60095442437 Basic stats: COMPLETE Column stats: NONE
value expressions: hid (type: string), corp (type: string), softid (type: int), install_time (type: string), lastvisit_time (type: string), active_day (type: int), state (type: int), statdate (type: string)
TableScan
alias: b
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: hid (type: string), corp (type: string), softid (type: int)
sort order: +++
Map-reduce partition columns: hid (type: string), corp (type: string), softid (type: int)
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
value expressions: hid (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7}
1 {VALUE._col0}
filter predicates:
0 {(VALUE._col7 = '2015-01-27')}
1
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col10
Statistics: Num rows: 965892224 Data size: 66104987648 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: _col10 is null (type: boolean)
Statistics: Num rows: 482946112 Data size: 33052493824 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: int), _col7 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 482946112 Data size: 33052493824 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 482946112 Data size: 33052493824 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
select a.* from
(select * from default.t_softuser where statdate='2015-01-27') a
left join
(select * from t_softuser where statdate='2015-01-27') b on
a.hid=b.hid and a.corp=b.corp and a.softid=b.softid
Its execution plan:
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: t_softuser
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: hid (type: string), corp (type: string), softid (type: int), install_time (type: string), lastvisit_time (type: string), active_day (type: int), state (type: int), statdate (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: int)
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: int), _col7 (type: string)
TableScan
alias: t_softuser
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: hid (type: string), corp (type: string), softid (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: int)
Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7}
1
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 35671064 Data size: 2441117184 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: int), _col7 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
Statistics: Num rows: 35671064 Data size: 2441117184 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 35671064 Data size: 2441117184 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
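Compare the two plans closely. In the first, the TableScan on a covers 878083830 rows, and the condition a.statdate='2015-01-27' is applied only as a filter predicate inside the Join Operator. In the second, each TableScan covers just 32428238 rows. The difference in execution efficiency goes without saying.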
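
The two forms differ in more than speed. Under standard SQL outer-join semantics, which Hive also follows, a condition on the left table placed in the ON clause of a left join only decides whether a row finds a match; it never removes rows of the left table. The following is a minimal sketch of that behaviour using Python's built-in sqlite3 module instead of Hive, so it illustrates the join semantics only and says nothing about Hive's planner. The three sample rows are made up, the table is cut down to four columns, and `where b.hid is null` is added to the subquery form (the second query above omits it) so that the two results can be compared directly.

```python
import sqlite3

# In-memory stand-in for t_softuser (hypothetical, reduced to four columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_softuser (hid TEXT, corp TEXT, softid INTEGER, statdate TEXT)"
)
conn.executemany(
    "INSERT INTO t_softuser VALUES (?, ?, ?, ?)",
    [
        ("h1", "c1", 1, "2015-01-26"),  # made-up sample rows
        ("h1", "c1", 1, "2015-01-27"),
        ("h2", "c1", 1, "2015-01-26"),
    ],
)

# Form 1: the date filter on the left table a sits in the ON clause.
on_clause_sql = """
select a.hid, a.statdate
from t_softuser a
left join t_softuser b
  on a.hid = b.hid and a.corp = b.corp and a.softid = b.softid
 and a.statdate = '2015-01-27' and b.statdate = '2015-01-27'
where b.hid is null
order by a.hid, a.statdate
"""

# Form 2: both sides are filtered in subqueries before the join.
subquery_sql = """
select a.hid, a.statdate
from (select * from t_softuser where statdate = '2015-01-27') a
left join (select * from t_softuser where statdate = '2015-01-27') b
  on a.hid = b.hid and a.corp = b.corp and a.softid = b.softid
where b.hid is null
order by a.hid, a.statdate
"""

rows_on_clause = conn.execute(on_clause_sql).fetchall()
rows_subquery = conn.execute(subquery_sql).fetchall()

# Rows of a from other dates never match, so they come back with b as NULL.
print(rows_on_clause)  # [('h1', '2015-01-26'), ('h2', '2015-01-26')]
# Only the 2015-01-27 row enters the join, and it matches itself.
print(rows_subquery)   # []
```

With the filter in ON, the rows of a dated 2015-01-26 go through the join unmatched and then pass the `b.hid is null` test. This is consistent with the first plan scanning all of a, and it is one way the two forms can give different counts when results are compared.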