For the test data, see the previous blog post; each table holds 5 million rows.
explain
select s_age,s_score
from student_tb_seq
where s_age=20;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: student_tb_seq |
| filterExpr: (s_age = 20) (type: boolean) |
| Statistics: Num rows: 5000000 Data size: 5478827365 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (s_age = 20) (type: boolean) |
| Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: 20 (type: bigint), s_score (type: bigint) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------+--+
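Execution flow:
Map Operator Tree:
A simple select-from-where query has only a map phase. The filter runs on the map side and everything is computed locally, with no shuffle and no reduce step, so it should run about as efficiently as it would on the Spark engine.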
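To make the map-only flow concrete, here is a minimal Python sketch of what each map task does for this query. It is an illustration only, not Hive's implementation: the function name `map_only_task` and the sample `splits` are made up. It also mirrors one detail visible in the plan: because the predicate is `s_age = 20`, the Select Operator emits the constant 20 instead of reading the column.

```python
def map_only_task(split):
    """Filter and project one input split; nothing is sent to a reducer."""
    for row in split:
        if row["s_age"] == 20:            # Filter Operator: predicate (s_age = 20)
            yield (20, row["s_score"])    # Select Operator: s_age folded to the constant 20


# Hypothetical input: two splits, each handled by its own map task.
splits = [
    [{"s_age": 20, "s_score": 88}, {"s_age": 19, "s_score": 60}],
    [{"s_age": 20, "s_score": 75}],
]

# The final result is just the concatenation of every mapper's output.
result = [out for split in splits for out in map_only_task(split)]
print(result)
```

Since no mapper needs to see another mapper's rows, the job scales with the number of splits and never pays for a shuffle.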
explain
select
nvl(s_no,'undefine') sno,
case when s_score>20 then '高级评分' else '低级评分' end level
from student_tb_orc
where s_age in (18,19,20);
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: student_tb_orc |
| filterExpr: (s_age) IN (18, 19, 20) (type: boolean) |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (s_age) IN (18, 19, 20) (type: boolean) |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: if s_no is null returns'undefine' (type: string), CASE WHEN ((s_score > 20)) THEN ('高级评分') ELSE ('低级评分') END (type: string) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------+--+
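Execution flow:
Map Operator Tree:
The plan for a query that uses ordinary (row-level) functions is the same as for a simple select-from-where query: there is only a map phase, the filter runs on the map side, and everything is computed locally.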
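The reason row-level functions do not change the shape of the plan is that they only ever look at one row at a time. A rough Python sketch, with hypothetical helper names `nvl` and `level` standing in for the Hive expressions, and made-up sample rows:

```python
def nvl(value, default):
    """Stand-in for Hive's nvl(): return default when value is NULL (None)."""
    return default if value is None else value


def level(s_score):
    """Stand-in for the CASE WHEN expression; a NULL score falls into the ELSE branch."""
    return "高级评分" if s_score is not None and s_score > 20 else "低级评分"


def map_task(split):
    for row in split:
        if row["s_age"] in (18, 19, 20):                                # Filter Operator
            yield (nvl(row["s_no"], "undefine"), level(row["s_score"]))  # Select Operator


rows = [
    {"s_no": "no_1", "s_age": 18, "s_score": 95},
    {"s_no": None, "s_age": 20, "s_score": 10},
    {"s_no": "no_3", "s_age": 25, "s_score": 50},
]
output = list(map_task(rows))
print(output)
```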
explain
select count(1)
from student_tb_seq
where s_age=20;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: student_tb_seq |
| filterExpr: (s_age = 20) (type: boolean) |
| Statistics: Num rows: 5000000 Data size: 5478827365 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (s_age = 20) (type: boolean) |
| Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(1) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| sort order: |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col0 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| mode: mergepartial |
| outputColumnNames: _col0 |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------+--+
Map Operator Tree:
Reduce Operator Tree:
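If the query only counts rows (count(*)), the mapper can emit context.write(null, count); that is, the key is allowed to be null. This matches the plan above, where the Reduce Output Operator has an empty sort order and no key expressions.
If hive.map.aggr=false, there is no Group By Operator on the map side, so the rows are not pre-aggregated before being sent to the reducer.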
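The difference between the two settings can be sketched in Python as follows. This is a simplification under stated assumptions: the function names and the sample splits (lists of `s_age` values) are invented, and emitting a literal 1 per row stands in for whatever Hive actually ships to the reducer when map-side aggregation is off.

```python
def map_with_aggr(split):
    """hive.map.aggr=true: count inside the mapper, emit one record with an empty key."""
    count = sum(1 for s_age in split if s_age == 20)
    yield (None, count)


def map_without_aggr(split):
    """hive.map.aggr=false (simplified): emit one record per matching row."""
    for s_age in split:
        if s_age == 20:
            yield (None, 1)


def reduce_count(records):
    """Reduce side: merge the partial counts into the final count."""
    return sum(value for _key, value in records)


splits = [[20, 20, 18, 20], [20, 19]]

aggr_records = [rec for split in splits for rec in map_with_aggr(split)]
raw_records = [rec for split in splits for rec in map_without_aggr(split)]

print(len(aggr_records), reduce_count(aggr_records))
print(len(raw_records), reduce_count(raw_records))
```

Both variants produce the same count; what changes is how many records cross the network, one per mapper versus one per matching row.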
explain
select s_age,avg(s_score) avg_score
from student_tb_orc
where s_age<20
group by s_age;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: student_tb_orc |
| filterExpr: (s_age < 20) (type: boolean) |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: (s_age < 20) (type: boolean) |
| Statistics: Num rows: 1666666 Data size: 2534165653 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: avg(s_score) |
| keys: s_age (type: bigint) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1666666 Data size: 2534165653 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: bigint) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: bigint) |
| Statistics: Num rows: 1666666 Data size: 2534165653 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: struct) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: avg(VALUE._col0) |
| keys: KEY._col0 (type: bigint) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 833333 Data size: 1267082826 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 833333 Data size: 1267082826 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------+--+
Map Operator Tree:
Reduce Operator Tree:
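When group by computes an average, the map-side aggregation cannot emit a finished average. Each mapper sees only part of a group, and averaging those partial averages again on the reduce side would give a wrong result whenever the mappers saw different numbers of rows.
So the map side emits an intermediate result that the reducer can merge exactly (the running count and sum), which is why the plan shows the map output value with type struct.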
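A small Python sketch shows why the partial result has to carry a count and a sum. The helper names and the two sample splits are hypothetical, and a plain `(count, sum)` tuple stands in for Hive's struct.

```python
from collections import defaultdict


def map_partial_avg(split):
    """Map side (mode: hash): per s_age, accumulate (count, sum) rather than an average."""
    partial = defaultdict(lambda: [0, 0])
    for s_age, s_score in split:
        if s_age < 20:                      # Filter Operator
            partial[s_age][0] += 1
            partial[s_age][1] += s_score
    return {age: tuple(cs) for age, cs in partial.items()}


def reduce_avg(partials):
    """Reduce side (mode: mergepartial): merge counts and sums, then divide once."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for age, (count, total) in partial.items():
            merged[age][0] += count
            merged[age][1] += total
    return {age: total / count for age, (count, total) in merged.items()}


split_a = [(18, 90), (18, 60), (18, 60)]   # mapper 1 sees three rows for age 18
split_b = [(18, 100), (21, 50)]            # mapper 2 sees one row for age 18

correct = reduce_avg([map_partial_avg(split_a), map_partial_avg(split_b)])
naive = (sum([90, 60, 60]) / 3 + 100 / 1) / 2   # average of the two per-mapper averages

print(correct[18], naive)
```

With these rows the merged result is 310 / 4 = 77.5, while the average of averages is (70 + 100) / 2 = 85, so the shortcut is simply wrong rather than slightly imprecise.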
explain
select s_no,row_number() over(partition by s_age order by s_score) rk
from student_tb_orc;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: student_tb_orc |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: s_age (type: bigint), s_score (type: bigint) |
| sort order: ++ |
| Map-reduce partition columns: s_age (type: bigint) |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| value expressions: s_no (type: string) |
| Execution mode: vectorized |
| Reduce Operator Tree: |
| Select Operator |
| expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
| outputColumnNames: _col0, _col3, _col5 |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| PTF Operator |
| Function definitions: |
| Input definition |
| input alias: ptf_0 |
| output shape: _col0: string, _col3: bigint, _col5: bigint |
| type: WINDOWING |
| Windowing table definition |
| input alias: ptf_1 |
| name: windowingtablefunction |
| order by: _col5 |
| partition by: _col3 |
| raw input shape: |
| window functions: |
| window function definition |
| alias: _wcol0 |
| name: row_number |
| window function: GenericUDAFRowNumberEvaluator |
| window frame: PRECEDING(MAX)~FOLLOWING(MAX) |
| isPivotResult: true |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: string), _wcol0 (type: int) |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------+--+
Map Operator Tree:
Reduce Operator Tree:
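The map side emits the two columns named inside row_number() over(...) as the key (s_age from partition by, s_score from order by) and s_no as the value.
Both key columns are sorted (sort order: ++), but only s_age is used as the partition column that decides which reducer a row goes to.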
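The following Python sketch imitates that flow: a composite key for sorting, partitioning on the first key column only, and a reducer that restarts the counter for each s_age. All names and the sample rows are invented for illustration, and the modulo partitioner is an assumption rather than Hive's actual hash function.

```python
from itertools import groupby

rows = [("n1", 18, 80), ("n2", 19, 70), ("n3", 18, 60), ("n4", 19, 90), ("n5", 18, 70)]


def map_emit(rows):
    """Map side: key = (s_age, s_score), value = s_no."""
    for s_no, s_age, s_score in rows:
        yield ((s_age, s_score), s_no)


def shuffle(pairs, num_reducers):
    """Partition on s_age only, then sort each partition by the full key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[key[0] % num_reducers].append((key, value))
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
    return buckets


def reduce_row_number(bucket):
    """Reduce side: number the rows 1, 2, 3... within each s_age group."""
    for _age, group in groupby(bucket, key=lambda kv: kv[0][0]):
        for rk, (_key, s_no) in enumerate(group, start=1):
            yield (s_no, rk)


ranked = {}
for bucket in shuffle(map_emit(rows), num_reducers=2):
    ranked.update(reduce_row_number(bucket))
print(ranked)
```

Because the shuffle already delivers rows sorted by (s_age, s_score), the reducer never has to buffer and sort a partition itself; it only counts.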
explain
select a.s_no,a.s_score,b.s_score
from student_tb_orc a
inner join student_tb_orc b
on a.s_no=b.s_no;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-5 is a root stage , consists of Stage-1 |
| Stage-1 |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-5 |
| Conditional Operator |
| |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: a |
| filterExpr: s_no is not null (type: boolean) |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: s_no is not null (type: boolean) |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: s_no (type: string) |
| sort order: + |
| Map-reduce partition columns: s_no (type: string) |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| value expressions: s_score (type: bigint) |
| TableScan |
| alias: b |
| filterExpr: s_no is not null (type: boolean) |
| Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
| Filter Operator |
| predicate: s_no is not null (type: boolean) |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: s_no (type: string) |
| sort order: + |
| Map-reduce partition columns: s_no (type: string) |
| Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
| value expressions: s_score (type: bigint) |
| Reduce Operator Tree: |
| Join Operator |
| condition map: |
| Inner Join 0 to 1 |
| keys: |
| 0 s_no (type: string) |
| 1 s_no (type: string) |
| outputColumnNames: _col0, _col5, _col15 |
| Statistics: Num rows: 2750000 Data size: 4181375090 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: string), _col5 (type: bigint), _col15 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 2750000 Data size: 4181375090 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2750000 Data size: 4181375090 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
+----------------------------------------------------+--+
Map Operator Tree:
Reduce Operator Tree:
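On the map side, rows from both tables that share the same key are sent to the same reducer. The reducer cannot tell from a row alone whether it came from table a or table b, so the two tables are tagged 0 and 1 to tell them apart.
Running set hive.auto.convert.join; shows the current value of this parameter, which defaults to true.
If one of the tables is smaller than a size threshold, Hive caches it in memory and joins on the map side. This is how MapJoin works, and in that case no join happens in the reduce phase.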
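Both join strategies can be sketched in a few lines of Python. This is only a model of the idea: the function names and the tiny sample tables are hypothetical, and an in-memory dict stands in for the cached small table.

```python
from collections import defaultdict

table_a = [("n1", 80), ("n2", 70), (None, 10)]
table_b = [("n1", 85), ("n3", 60), ("n2", 75)]


def map_tagged(rows, tag):
    """Map side of a reduce join: drop NULL keys and tag each row with its table (0 or 1)."""
    for s_no, s_score in rows:
        if s_no is not None:                  # filterExpr: s_no is not null
            yield (s_no, (tag, s_score))


def reduce_join(pairs):
    """Reduce side: use the tag to split rows per key, then pair table 0 with table 1."""
    groups = defaultdict(lambda: ([], []))
    for s_no, (tag, s_score) in pairs:
        groups[s_no][tag].append(s_score)
    for s_no in sorted(groups):
        from_a, from_b = groups[s_no]
        for a_score in from_a:
            for b_score in from_b:
                yield (s_no, a_score, b_score)


def map_join(big, small):
    """MapJoin: load the small table into a hash table and probe it while scanning the big one."""
    lookup = defaultdict(list)
    for s_no, s_score in small:
        if s_no is not None:
            lookup[s_no].append(s_score)
    for s_no, a_score in big:
        for b_score in lookup.get(s_no, []):
            yield (s_no, a_score, b_score)


shuffled = list(map_tagged(table_a, 0)) + list(map_tagged(table_b, 1))
reduce_result = list(reduce_join(shuffled))
map_result = list(map_join(table_a, table_b))
print(reduce_result)
print(map_result)
```

The two functions return the same rows; the MapJoin version gets there without tagging or shuffling anything, which is the saving Hive is after when one side is small enough to cache.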