Understanding Hive Execution Plans

Official documentation
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain

A brief overview of Hive execution plans

An execution plan generally has two parts:
stage dependencies: the dependencies between the stages
stage plans: the execution plan of each stage

A stage is not necessarily a MapReduce job; it may also be a Fetch Operator or a Move Operator.
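
Because the stage dependencies section is plain text, it is easy to turn into a small graph when a plan has many stages. The following is only a rough sketch: the function name parse_stage_dependencies is made up for illustration, and it handles only the "depends on stages" wording that appears in the plans in this post.

```python
import re

def parse_stage_dependencies(text):
    """Map each stage name to the list of stages it depends on (hypothetical helper)."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"(Stage-\d+)", line)
        if not match:
            continue  # skips the "STAGE DEPENDENCIES:" header and blank lines
        stage = match.group(1)
        depends = re.search(r"depends on stages: (.+)", line)
        deps[stage] = re.findall(r"Stage-\d+", depends.group(1)) if depends else []
    return deps

sample = """STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
"""
deps = parse_stage_dependencies(sample)
print(deps)
```

A root stage ends up with an empty dependency list, so the stages that can start first are the ones whose list is empty.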

The execution plan of a MapReduce job has two parts:
Map Operator Tree: the execution plan of the map side
Reduce Operator Tree: the execution plan of the reduce side

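Some common operators:
TableScan: reads data from a table; common attribute: alias

Select Operator: projects (selects) columns
Group By Operator: grouping and aggregation; common attributes: aggregations, mode. When there is no keys attribute, all rows form a single group.
Reduce Output Operator: sends output to the reducers; common attribute: sort order
Fetch Operator: the client fetches the result; common attribute: limit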

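Common attribute values and what they mean:
aggregations: used in the Group By Operator
count(): counts rows

mode: used in the Group By Operator
hash: partial aggregation on the map side, backed by a hash table
mergepartial: merges partial aggregation results
final: turns partial aggregates into the final result

sort order: used in the Reduce Output Operator
+ ascending
(empty) no sorting
++ two key columns, both ascending
+- two key columns, the first ascending and the second descending
- descending
and so on, one character per key column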
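
To make the sort order notation concrete, here is a minimal sketch that expands a sort order value into one direction per key column. The helper name describe_sort_order is an assumption for illustration, not something Hive provides.

```python
def describe_sort_order(sort_order):
    """Return one direction per key column for a 'sort order' value (hypothetical helper)."""
    names = {"+": "ascending", "-": "descending"}
    return [names[ch] for ch in sort_order.strip()]

print(describe_sort_order("+-"))
print(describe_sort_order(""))
```

An empty value yields an empty list, which matches the "no sorting" case above.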

Below are the execution plans of some typical operations

First, a simple execution plan

hive> explain select count(*) from t_data1 ;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
#describes the dependencies between the stages

STAGE PLANS:  #the execution plan of each stage
  Stage: Stage-1
    Map Reduce  #this stage is a MapReduce job
      Map Operator Tree:  #operator tree of the map phase
          TableScan  #scans the table to read data
            alias: t_data1  #alias of the scanned table
            Statistics: Num rows: 1 Data size: 43835224 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator  #projection
              Statistics: Num rows: 1 Data size: 43835224 Basic stats: COMPLETE Column stats: COMPLETE
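              Group By Operator   #grouping and aggregation; no key is given, so there is a single group
                aggregations: count()  #aggregate function
                mode: hash    #map-side partial aggregation using a hash table
                outputColumnNames: _col0  #output column name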
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator   #sends output to the reducer
                  sort order:         #no sorting
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
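                  value expressions: _col0 (type: bigint)   #value expression
      Reduce Operator Tree:   #operator tree of the reduce phase
        Group By Operator    #grouping and aggregation
          aggregations: count(VALUE._col0)   #aggregate function
          mode: mergepartial      #merges the partial results contributed by each map task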
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator   #writes the output to a file
            compressed: false    #not compressed
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0   #Stage-0, which depends on Stage-1
    Fetch Operator  #fetches the data
      limit: -1     #no limit
      Processor Tree:  
        ListSink

That is the execution plan of a simple count(*)
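
The two Group By Operators in this plan can be read as a two-phase count: each map task produces one partial count (mode: hash), and the reducer adds them up (mode: mergepartial, count(VALUE._col0)). Below is a small simulation of that idea in Python; the splits and function names are invented for illustration and say nothing about how Hive implements it internally.

```python
def map_side_count(split):
    """mode: hash with no keys -> one partial count per map task."""
    return len(split)

def reduce_side_merge(partial_counts):
    """mode: mergepartial -> the partial counts are summed into the final count."""
    return sum(partial_counts)

# Hypothetical input splits, one list of rows per map task.
splits = [["r1", "r2", "r3"], ["r4"], ["r5", "r6"]]
partials = [map_side_count(s) for s in splits]
total = reduce_side_merge(partials)
print(partials, total)
```

Only one small value per map task crosses the network, which is why the Reduce Output Operator statistics show a single row of 8 bytes.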

Next, the execution plan of a count(distinct)

hive> explain select count(distinct sid) from t_data1 ;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t_data1
            Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sid (type: bigint)  #selects sid
              outputColumnNames: sid
              Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
              Group By Operator   #grouping and aggregation
                aggregations: count(DISTINCT sid)  #aggregate function
                keys: sid (type: bigint)   #grouping key
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator  #sends output to the reducer
                  key expressions: _col0 (type: bigint)  #key expression
                  sort order: +     #ascending
                  Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:  
        Group By Operator   #grouping and aggregation
          aggregations: count(DISTINCT KEY._col0:0._col0)
          mode: mergepartial   #merges the partial aggregation results
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
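
One way to read this plan: the map-side Group By (keys: sid, mode: hash) lets each map task emit every sid at most once, the shuffle delivers the keys in ascending order (sort order: +), and since there is no "Map-reduce partition columns" line the keys all reach one reducer, which counts how often the key changes. The sketch below simulates that reading; the data and function names are made up, and it is an approximation of the behaviour rather than Hive's actual code path.

```python
import heapq

def map_side_distinct(split):
    """keys: sid, mode: hash -> each map task emits each sid once, in sorted order."""
    return sorted(set(split))

def reduce_side_count_distinct(sorted_streams):
    """Merge the sorted map outputs and count the number of key changes."""
    count, previous = 0, object()  # sentinel that never equals a real key
    for key in heapq.merge(*sorted_streams):
        if key != previous:
            count += 1
            previous = key
    return count

splits = [[3, 1, 3], [1, 2], [5, 5]]  # hypothetical sid values per map task
map_outputs = [map_side_distinct(s) for s in splits]
distinct_count = reduce_side_count_distinct(map_outputs)
print(map_outputs, distinct_count)
```

Unlike count(*), the map side cannot shrink its output to a single number, because the same sid may appear in several map tasks.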

A GROUP BY aggregation example

explain select applove_date , count(*) from t_data1 group by applove_date ;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t_data1
            Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: applove_date (type: timestamp)
              outputColumnNames: applove_date
              Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                aggregations: count()
                keys: applove_date (type: timestamp)   #grouping key (also used as the partition key)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: timestamp)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: timestamp)
                  Statistics: Num rows: 1095880 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col1 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: timestamp)  #grouping key
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 547940 Data size: 21917612 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 547940 Data size: 21917612 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
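
Compared with the plain count(*), this plan adds keys and a "Map-reduce partition columns" line: partial counts are kept per key on the map side, rows with the same key are routed to the same reducer, and each reducer merges the partial counts of its keys. A rough Python simulation follows; the date strings, the reducer count and the use of Python's hash() as the partitioner are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def map_side_group_count(split):
    """mode: hash with keys -> one partial count per key per map task."""
    return Counter(split)

def shuffle(partials, num_reducers):
    """Route every (key, partial count) pair to the reducer chosen from the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % num_reducers][key].append(count)
    return buckets

def reduce_side_merge(bucket):
    """mode: mergepartial -> sum the partial counts of each key."""
    return {key: sum(counts) for key, counts in bucket.items()}

splits = [["2015-01-01", "2015-01-02", "2015-01-01"], ["2015-01-02", "2015-01-03"]]
partials = [map_side_group_count(s) for s in splits]
result = {}
for bucket in shuffle(partials, num_reducers=2):
    result.update(reduce_side_merge(bucket))
print(result)
```

Because a key always lands in exactly one bucket, merging the reducers' outputs never has to combine counts for the same key twice.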

A window function example

hive> explain select sid , rn from (select sid , row_number()over(order by sid ) rn from t_data1 ) t1  where rn < 10 ;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t_data1
            Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: 0 (type: int), sid (type: bigint)
              sort order: ++
              Map-reduce partition columns: 0 (type: int)
              Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey1 (type: bigint)
          outputColumnNames: _col0
          Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (_wcol0 < 10) (type: boolean)
              Statistics: Num rows: 1826467 Data size: 14611736 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: bigint), _wcol0 (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1826467 Data size: 14611736 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1826467 Data size: 14611736 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
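
Note what the plan implies: the partition column is the constant 0, so every row goes to the same reducer, is sorted by sid, gets its row number in the PTF Operator, and only then is filtered by rn < 10. The sketch below mimics that order of operations in Python; the function name and sample values are invented for illustration.

```python
def row_number_then_filter(sids, limit):
    """Number all rows first, filter afterwards, as the reduce side of the plan does."""
    ordered = sorted(sids)  # sort order: ++ on (0, sid)
    numbered = [(sid, rn) for rn, sid in enumerate(ordered, start=1)]  # PTF Operator
    return [(sid, rn) for sid, rn in numbered if rn < limit]  # Filter Operator

sids = [30, 10, 20, 50, 40, 70, 60, 90, 80, 110, 100, 120]  # hypothetical sid values
rows = row_number_then_filter(sids, limit=10)
print(rows)
```

All rows are sorted and numbered even though only nine survive the filter, which is the cost to keep in mind when reading this plan.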

Another way to get the top N rows

hive> explain select sid from t_data1 order by sid limit 10 ;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t_data1
            Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: sid (type: bigint)
              outputColumnNames: _col0
              Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
                TopN Hash Memory Usage: 0.1
      Reduce Operator Tree:
        Select Operator
          expressions: KEY.reducesinkkey0 (type: bigint)
          outputColumnNames: _col0
          Statistics: Num rows: 5479403 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
          Limit
            Number of rows: 10
            Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 10 Data size: 80 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink
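
The "TopN Hash Memory Usage" line on the Reduce Output Operator suggests that each map task forwards only its own smallest keys instead of every row, and the reducer then applies the Limit to the merged stream. A minimal sketch of that idea, with made-up data and function names and no claim about Hive's internal data structures:

```python
import heapq
import itertools

def map_side_top_n(split, n):
    """Keep only the n smallest keys of one map task, in ascending order."""
    return heapq.nsmallest(n, split)

def reduce_side_limit(map_outputs, n):
    """Merge the sorted map outputs and stop after n rows (Limit operator)."""
    return list(itertools.islice(heapq.merge(*map_outputs), n))

splits = [[9, 4, 7, 1], [8, 2, 6], [5, 3, 10]]  # hypothetical sid values per map task
n = 3
map_outputs = [map_side_top_n(s, n) for s in splits]
top_rows = reduce_side_limit(map_outputs, n)
print(map_outputs, top_rows)
```

Compared with the row_number() version, far fewer rows need to cross the shuffle, because a row that is not in its own map task's top N cannot be in the global top N.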

A join; note the WHERE condition

hive> explain select a.sid , b.b_name from t_bin a join t_data1 b on(a.sid = b.sid ) where a.sid < 10000 ;
OK
STAGE DEPENDENCIES:
  Stage-5 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-5
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (sid < 10000) (type: boolean)
              Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: sid (type: bigint)
                sort order: +
                Map-reduce partition columns: sid (type: bigint)
                Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (sid < 10000) (type: boolean)
              Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: sid (type: bigint)
                sort order: +
                Map-reduce partition columns: sid (type: bigint)
                Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
                value expressions: b_name (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 sid (type: bigint)
            1 sid (type: bigint)
          outputColumnNames: _col0, _col32
          Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col32 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

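Now note the ON condition. Compared with the plan above there is no difference, which shows Hive's predicate pushdown: in both cases the filter runs on the map side before the join, and it is applied to b as well as to a.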

hive> explain select a.sid , b.b_name from t_bin a join t_data1 b on(a.sid = b.sid and a.sid < 10000) ;
OK
STAGE DEPENDENCIES:
  Stage-5 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-5
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (sid < 10000) (type: boolean)
              Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: sid (type: bigint)
                sort order: +
                Map-reduce partition columns: sid (type: bigint)
                Statistics: Num rows: 1415706 Data size: 11325648 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (sid < 10000) (type: boolean)
              Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: sid (type: bigint)
                sort order: +
                Map-reduce partition columns: sid (type: bigint)
                Statistics: Num rows: 135293 Data size: 14611669 Basic stats: COMPLETE Column stats: NONE
                value expressions: b_name (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 sid (type: bigint)
            1 sid (type: bigint)
          outputColumnNames: _col0, _col32
          Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col32 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1557276 Data size: 12458213 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
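
Why can Hive filter both inputs before an inner join without changing the result? Since a.sid = b.sid must hold for every output row, a row of b with sid >= 10000 could only pair with a row of a that the filter would drop anyway. The following sketch checks this on a tiny made-up data set; it assumes sid is unique in b purely to keep the join helper short.

```python
def inner_join(a_sids, b_rows):
    """Inner join on sid; b_rows maps sid -> b_name (sid assumed unique in b)."""
    return [(sid, b_rows[sid]) for sid in a_sids if sid in b_rows]

a_sids = [5, 42, 10000, 20000]
b_rows = {5: "five", 10000: "ten-k", 20000: "twenty-k", 99: "x"}

# Filter after the join, the way the WHERE clause is written.
filter_after = [(sid, name) for sid, name in inner_join(a_sids, b_rows) if sid < 10000]

# Filter both inputs first, the way the two map-side Filter Operators do.
a_filtered = [sid for sid in a_sids if sid < 10000]
b_filtered = {sid: name for sid, name in b_rows.items() if sid < 10000}
filter_before = inner_join(a_filtered, b_filtered)

print(filter_after, filter_before)
```

Both variants return the same rows, but the second one sends fewer rows through the shuffle.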

Full outer join

hive> explain select a.sid , b.b_name from t_bin a full outer join t_data1 b on(a.sid = b.sid and a.sid < 10000) ;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: sid (type: bigint)
              sort order: +
              Map-reduce partition columns: sid (type: bigint)
              Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: sid (type: bigint)
              sort order: +
              Map-reduce partition columns: sid (type: bigint)
              Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
              value expressions: b_name (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(KEY.reducesinkkey0 < 10000)}
            1 
          keys:
            0 sid (type: bigint)
            1 sid (type: bigint)
          outputColumnNames: _col0, _col32
          Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col32 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
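
Here the a.sid < 10000 condition is not pushed to the map side; it shows up as "filter predicates" on the Join Operator. In a full outer join a row that fails the ON condition is not discarded, it simply fails to match and is emitted with NULLs on the other side. The sketch below imitates that semantics in Python, using None for NULL; the data is invented and sid is assumed unique on both sides to keep it short.

```python
def full_outer_join(a_sids, b_rows, a_filter):
    """Full outer join on sid with an extra ON-clause filter on a.

    b_rows maps sid -> b_name; a_filter mirrors 'filter predicates: 0 {...}'.
    Returns (a.sid, b.b_name) pairs with None standing in for NULL.
    """
    result, matched_b = [], set()
    for sid in a_sids:
        if a_filter(sid) and sid in b_rows:
            result.append((sid, b_rows[sid]))
            matched_b.add(sid)
        else:
            result.append((sid, None))  # a row kept, but without a match
    for sid, name in b_rows.items():
        if sid not in matched_b:
            result.append((None, name))  # unmatched b row
    return result

a_sids = [1, 20000]
b_rows = {1: "x", 20000: "y", 7: "z"}
joined = full_outer_join(a_sids, b_rows, lambda sid: sid < 10000)
print(joined)
```

The row with sid 20000 appears twice, once from each side, which is exactly why the filter cannot be applied before the join here.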

Left outer join

hive> explain select a.sid , b.b_name from t_bin a left outer join t_data1 b on(a.sid = b.sid) ;
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: sid (type: bigint)
              sort order: +
              Map-reduce partition columns: sid (type: bigint)
              Statistics: Num rows: 4247118 Data size: 33976944 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: b
            Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: sid (type: bigint)
              sort order: +
              Map-reduce partition columns: sid (type: bigint)
              Statistics: Num rows: 405881 Data size: 43835224 Basic stats: COMPLETE Column stats: NONE
              value expressions: b_name (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 sid (type: bigint)
            1 sid (type: bigint)
          outputColumnNames: _col0, _col32
          Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col32 (type: string)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 4671829 Data size: 37374639 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
