Hive Performance Tuning (Part 2): Understanding the HiveSQL Execution Plan

The test data is described in the previous post in this series; the table contains 5 million rows.

Contents

  • 1. Execution plan of a simple SQL query
  • 2. Execution plan of SQL with ordinary functions
  • 3. Execution plan of SQL with aggregate functions
  • 4. Execution plan of SQL with window functions
  • 5. Execution plan of SQL with a table join

1. Execution plan of a simple SQL query

explain 
select s_age,s_score
from student_tb_seq
where s_age=20;
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: student_tb_seq                  |
|             filterExpr: (s_age = 20) (type: boolean) |
|             Statistics: Num rows: 5000000 Data size: 5478827365 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (s_age = 20) (type: boolean) |
|               Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: 20 (type: bigint), s_score (type: bigint) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.TextInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
+----------------------------------------------------+--+

Execution flow:

Map Operator Tree:

TableScan
Filter Operator
Select Operator
File Output Operator
Fetch Operator

A simple select-from-where query like this has only a Map stage: the filtering happens on the map side and all computation stays local to the mappers, so for this kind of query Hive's execution efficiency is not inferior to the Spark engine.
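A related knob worth checking with EXPLAIN is fetch-task conversion. The sketch below is hedged: the parameters exist in mainstream Hive releases, but defaults and eligibility rules vary by version. For simple select/filter/limit queries over a small enough input, Hive can skip launching a MapReduce job entirely and answer the query with the Stage-0 Fetch Operator alone.

-- Hedged sketch: allow simple SELECT/FILTER/LIMIT queries to run as a fetch task.
-- Whether conversion actually happens also depends on the input-size cap
-- hive.fetch.task.conversion.threshold (1 GB by default in recent versions).
set hive.fetch.task.conversion=more;

explain
select s_age, s_score
from student_tb_seq
where s_age = 20
limit 10;
-- When conversion applies, the plan contains only Stage-0 (Fetch Operator)
-- and no Map Reduce stage at all.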

2. Execution plan of SQL with ordinary functions

explain 
select 
  nvl(s_no,'undefine') sno,
  case when s_score>20 then '高级评分' else '低级评分' end level
from student_tb_orc
where s_age in (18,19,20);
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: student_tb_orc                  |
|             filterExpr: (s_age) IN (18, 19, 20) (type: boolean) |
|             Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (s_age) IN (18, 19, 20) (type: boolean) |
|               Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 expressions: if s_no is null returns'undefine' (type: string), CASE WHEN ((s_score > 20)) THEN ('高级评分') ELSE ('低级评分') END (type: string) |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|                 File Output Operator               |
|                   compressed: false                |
|                   Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|                   table:                           |
|                       input format: org.apache.hadoop.mapred.TextInputFormat |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
+----------------------------------------------------+--+

Execution flow:

Map Operator Tree:

TableScan
Filter Operator
Select Operator
File Output Operator
Fetch Operator

The execution plan of SQL with ordinary (scalar) functions is the same as that of a simple select-from-where query: there is only a Map stage, the filtering is done on the map side, and all computation is local.
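As a hedged sketch using only standard Hive built-ins, piling more scalar functions onto the select list or the predicate still yields the same single map-only plan; only aggregations, joins, sorts, or window functions introduce a reduce phase:

explain
select
  concat(nvl(s_no, 'undefine'), '_', cast(s_age as string)) sno_age,
  upper(case when s_score > 20 then 'HIGH' else 'LOW' end) level
from student_tb_orc
where s_age in (18, 19, 20) and length(s_no) > 0;
-- Same shape as above: TableScan -> Filter Operator -> Select Operator
-- -> File Output Operator, all inside the Map Operator Tree.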

3. Execution plan of SQL with aggregate functions

explain
select count(1)
from student_tb_seq
where s_age=20;
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: student_tb_seq                  |
|             filterExpr: (s_age = 20) (type: boolean) |
|             Statistics: Num rows: 5000000 Data size: 5478827365 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (s_age = 20) (type: boolean) |
|               Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
|               Select Operator                      |
|                 Statistics: Num rows: 2500000 Data size: 2739413682 Basic stats: COMPLETE Column stats: NONE |
|                 Group By Operator                  |
|                   aggregations: count(1)           |
|                   mode: hash                       |
|                   outputColumnNames: _col0         |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                   Reduce Output Operator           |
|                     sort order:                    |
|                     Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|                     value expressions: _col0 (type: bigint) |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: count(VALUE._col0)         |
|           mode: mergepartial                       |
|           outputColumnNames: _col0                 |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.TextInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
+----------------------------------------------------+--+

Execution flow:

Map Operator Tree:

TableScan
Filter Operator
Select Operator
Group By Operator
Reduce Output Operator

Reduce Operator Tree:

Group By Operator
File Output Operator

If we are only counting rows, as with count(*), the map side can emit context.write(null, count); in other words, the map output key is allowed to be null (note the empty "sort order:" in the Reduce Output Operator above).

If hive.map.aggr=false, there is no Group By Operator on the map side.
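A minimal sketch to verify this yourself (same query as above, assuming you can change session settings):

-- Hedged sketch: turn off map-side partial aggregation and re-check the plan.
-- The Map Operator Tree should then end with a Reduce Output Operator right
-- after the Select Operator, and the only Group By Operator appears on the
-- reduce side.
set hive.map.aggr=false;

explain
select count(1)
from student_tb_seq
where s_age = 20;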

explain
select s_age,avg(s_score) avg_score
from student_tb_orc
where s_age<20
group by s_age;
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: student_tb_orc                  |
|             filterExpr: (s_age < 20) (type: boolean) |
|             Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: (s_age < 20) (type: boolean) |
|               Statistics: Num rows: 1666666 Data size: 2534165653 Basic stats: COMPLETE Column stats: NONE |
|               Group By Operator                    |
|                 aggregations: avg(s_score)         |
|                 keys: s_age (type: bigint)         |
|                 mode: hash                         |
|                 outputColumnNames: _col0, _col1    |
|                 Statistics: Num rows: 1666666 Data size: 2534165653 Basic stats: COMPLETE Column stats: NONE |
|                 Reduce Output Operator             |
|                   key expressions: _col0 (type: bigint) |
|                   sort order: +                    |
|                   Map-reduce partition columns: _col0 (type: bigint) |
|                   Statistics: Num rows: 1666666 Data size: 2534165653 Basic stats: COMPLETE Column stats: NONE |
|                   value expressions: _col1 (type: struct) |
|       Execution mode: vectorized                   |
|       Reduce Operator Tree:                        |
|         Group By Operator                          |
|           aggregations: avg(VALUE._col0)           |
|           keys: KEY._col0 (type: bigint)           |
|           mode: mergepartial                       |
|           outputColumnNames: _col0, _col1          |
|           Statistics: Num rows: 833333 Data size: 1267082826 Basic stats: COMPLETE Column stats: NONE |
|           File Output Operator                     |
|             compressed: false                      |
|             Statistics: Num rows: 833333 Data size: 1267082826 Basic stats: COMPLETE Column stats: NONE |
|             table:                                 |
|                 input format: org.apache.hadoop.mapred.TextInputFormat |
|                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
+----------------------------------------------------+--+

Execution flow:

Map Operator Tree:

TableScan
Filter Operator
Group By Operator
Reduce Output Operator

Reduce Operator Tree:

Group By Operator
File Output Operator

When computing an average with GROUP BY, the map-side partial aggregation cannot compute the average itself: averaging per-mapper averages would introduce errors once the same group is split unevenly across mappers, and the reduce side could not recover the true average from them.

That is why the plan shows the map output value as a struct: it carries the partial sum and the partial count, which the reduce side merges before dividing once at the end.
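A minimal sketch of the same idea expressed directly in SQL: carry the sum and the count separately and divide only once at the end, which is what the two-phase plan does internally.

select s_age,
       sum(s_score) / count(s_score) avg_score
from student_tb_orc
where s_age < 20
group by s_age;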

4. Execution plan of SQL with window functions

explain
select s_no,row_number() over(partition by s_age order by s_score) rk
from student_tb_orc;
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-1 is a root stage                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: student_tb_orc                  |
|             Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|             Reduce Output Operator                 |
|               key expressions: s_age (type: bigint), s_score (type: bigint) |
|               sort order: ++                       |
|               Map-reduce partition columns: s_age (type: bigint) |
|               Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|               value expressions: s_no (type: string) |
|       Execution mode: vectorized                   |
|       Reduce Operator Tree:                        |
|         Select Operator                            |
|           expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: bigint), KEY.reducesinkkey1 (type: bigint) |
|           outputColumnNames: _col0, _col3, _col5   |
|           Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|           PTF Operator                             |
|             Function definitions:                  |
|                 Input definition                   |
|                   input alias: ptf_0               |
|                   output shape: _col0: string, _col3: bigint, _col5: bigint |
|                   type: WINDOWING                  |
|                 Windowing table definition         |
|                   input alias: ptf_1               |
|                   name: windowingtablefunction     |
|                   order by: _col5                  |
|                   partition by: _col3              |
|                   raw input shape:                 |
|                   window functions:                |
|                       window function definition   |
|                         alias: _wcol0              |
|                         name: row_number           |
|                         window function: GenericUDAFRowNumberEvaluator |
|                         window frame: PRECEDING(MAX)~FOLLOWING(MAX) |
|                         isPivotResult: true        |
|             Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|             Select Operator                        |
|               expressions: _col0 (type: string), _wcol0 (type: int) |
|               outputColumnNames: _col0, _col1      |
|               Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|               File Output Operator                 |
|                 compressed: false                  |
|                 Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|                 table:                             |
|                     input format: org.apache.hadoop.mapred.TextInputFormat |
|                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
+----------------------------------------------------+--+

Execution flow:

Map Operator Tree:

TableScan
Reduce Output Operator

Reduce Operator Tree:

Select Operator
PTF Operator
Select Operator
File Output Operator

The map side emits the two columns referenced inside row_number() over (...) — s_age and s_score — as the key, and s_no as the value.

Both of those columns are sorted (sort order: ++): s_age also serves as the Map-reduce partition column, while s_score determines the ordering within each partition.
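As a usage sketch built on this plan shape (plain HiveQL, no extra assumptions), the common top-N-per-group pattern wraps the window function in a subquery; the outer filter is evaluated only after the PTF Operator has assigned row numbers on the reduce side:

select s_no, s_age, s_score
from (
  select s_no, s_age, s_score,
         row_number() over(partition by s_age order by s_score desc) rk
  from student_tb_orc
) t
where rk <= 3;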

5. Execution plan of SQL with a table join

explain
select a.s_no,a.s_score,b.s_score
from student_tb_orc a
inner join student_tb_orc b
on a.s_no=b.s_no;
+----------------------------------------------------+--+
|                      Explain                       |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES:                                |
|   Stage-5 is a root stage , consists of Stage-1    |
|   Stage-1                                          |
|   Stage-0 depends on stages: Stage-1               |
|                                                    |
| STAGE PLANS:                                       |
|   Stage: Stage-5                                   |
|     Conditional Operator                           |
|                                                    |
|   Stage: Stage-1                                   |
|     Map Reduce                                     |
|       Map Operator Tree:                           |
|           TableScan                                |
|             alias: a                               |
|             filterExpr: s_no is not null (type: boolean) |
|             Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: s_no is not null (type: boolean) |
|               Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: s_no (type: string) |
|                 sort order: +                      |
|                 Map-reduce partition columns: s_no (type: string) |
|                 Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: s_score (type: bigint) |
|           TableScan                                |
|             alias: b                               |
|             filterExpr: s_no is not null (type: boolean) |
|             Statistics: Num rows: 5000000 Data size: 7602500000 Basic stats: COMPLETE Column stats: NONE |
|             Filter Operator                        |
|               predicate: s_no is not null (type: boolean) |
|               Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|               Reduce Output Operator               |
|                 key expressions: s_no (type: string) |
|                 sort order: +                      |
|                 Map-reduce partition columns: s_no (type: string) |
|                 Statistics: Num rows: 2500000 Data size: 3801250000 Basic stats: COMPLETE Column stats: NONE |
|                 value expressions: s_score (type: bigint) |
|       Reduce Operator Tree:                        |
|         Join Operator                              |
|           condition map:                           |
|                Inner Join 0 to 1                   |
|           keys:                                    |
|             0 s_no (type: string)                  |
|             1 s_no (type: string)                  |
|           outputColumnNames: _col0, _col5, _col15  |
|           Statistics: Num rows: 2750000 Data size: 4181375090 Basic stats: COMPLETE Column stats: NONE |
|           Select Operator                          |
|             expressions: _col0 (type: string), _col5 (type: bigint), _col15 (type: bigint) |
|             outputColumnNames: _col0, _col1, _col2 |
|             Statistics: Num rows: 2750000 Data size: 4181375090 Basic stats: COMPLETE Column stats: NONE |
|             File Output Operator                   |
|               compressed: false                    |
|               Statistics: Num rows: 2750000 Data size: 4181375090 Basic stats: COMPLETE Column stats: NONE |
|               table:                               |
|                   input format: org.apache.hadoop.mapred.TextInputFormat |
|                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
|                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|                                                    |
|   Stage: Stage-0                                   |
|     Fetch Operator                                 |
|       limit: -1                                    |
|       Processor Tree:                              |
|         ListSink                                   |
+----------------------------------------------------+--+

Execution flow:

Map Operator Tree:

TableScan (alias a)
Filter Operator
Reduce Output Operator
TableScan (alias b)
Filter Operator
Reduce Output Operator

Reduce Operator Tree:

Join Operator
File Output Operator

Rows from the two tables that share the same join key are sent to the same reducer. The reducer itself cannot tell whether a given row came from table a or table b, so the tables are tagged 0 and 1 (see "Inner Join 0 to 1" in the Join Operator) to tell them apart.

set hive.auto.convert.join; — this parameter defaults to true.

If one of the tables is smaller than the small-table threshold, Hive caches it in memory on the map side — that is the MapJoin mechanism — and the join is no longer performed in the reduce phase.
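A hedged sketch of the map-join path (age_dim below is a hypothetical small lookup table, not part of the test data; the parameter names exist in mainstream Hive, but the defaults vary by release):

-- Hedged sketch: with auto conversion on and the small table under the
-- threshold, Hive builds a hash table from age_dim, ships it to the mappers,
-- and the plan shows a Map Join Operator with no reduce-side join.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- small-table threshold in bytes

explain
select a.s_no, a.s_score, d.age_desc
from student_tb_orc a
join age_dim d  -- hypothetical small dimension table
  on a.s_age = d.s_age;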
