Hive Development Notes: A Pitfall in LEFT JOIN

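Because our logs were growing rapidly, running the statistics on MySQL became more and more of a strain, so the company decided to migrate the statistics workload to Hadoop.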

While cross-checking the data, I ran into a pitfall in Hive:

select a.* from default.t_softuser a
left join
t_softuser b on
a.hid=b.hid and a.corp=b.corp and a.softid=b.softid and a.statdate='2015-01-27' and b.statdate='2015-01-27'
where b.hid is null

The EXPLAIN output for this query:

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 878083830 Data size: 60095442437 Basic stats: PARTIAL Column stats: NONE
            Reduce Output Operator
              key expressions: hid (type: string), corp (type: string), softid (type: int)
              sort order: +++
              Map-reduce partition columns: hid (type: string), corp (type: string), softid (type: int)
              Statistics: Num rows: 878083830 Data size: 60095442437 Basic stats: COMPLETE Column stats: NONE
              value expressions: hid (type: string), corp (type: string), softid (type: int), install_time (type: string), lastvisit_time (type: string), active_day (type: int), state (type: int), statdate (type: string)
          TableScan
            alias: b
            Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: hid (type: string), corp (type: string), softid (type: int)
              sort order: +++
              Map-reduce partition columns: hid (type: string), corp (type: string), softid (type: int)
              Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
              value expressions: hid (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7}
            1 {VALUE._col0}
          filter predicates:
            0 {(VALUE._col7 = '2015-01-27')}
            1 
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col10
          Statistics: Num rows: 965892224 Data size: 66104987648 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: _col10 is null (type: boolean)
            Statistics: Num rows: 482946112 Data size: 33052493824 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: int), _col7 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 482946112 Data size: 33052493824 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 482946112 Data size: 33052493824 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

After rewriting the HQL as follows:

select a.* from 
(select * from default.t_softuser where statdate='2015-01-27') a 
left join 
(select * from t_softuser where statdate='2015-01-27') b on 
a.hid=b.hid and a.corp=b.corp and a.softid=b.softid

the EXPLAIN output becomes:

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-4
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: t_softuser
            Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: hid (type: string), corp (type: string), softid (type: int), install_time (type: string), lastvisit_time (type: string), active_day (type: int), state (type: int), statdate (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
                sort order: +++
                Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: int)
                Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: int), _col7 (type: string)
          TableScan
            alias: t_softuser
            Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: hid (type: string), corp (type: string), softid (type: int)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int)
                sort order: +++
                Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: int)
                Statistics: Num rows: 32428238 Data size: 2219197485 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7}
            1 
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
          Statistics: Num rows: 35671064 Data size: 2441117184 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: int), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col6 (type: int), _col7 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 35671064 Data size: 2441117184 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 35671064 Data size: 2441117184 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1


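Compare the two plans, in particular the Statistics line of the first TableScan. In the first plan, the scan on alias a reports Num rows: 878083830 (Data size: 60095442437), while the scan on alias b reports only 32428238 rows. The condition a.statdate='2015-01-27' shows up under "filter predicates" of the Join Operator, so it is evaluated during the join instead of being applied before table a is read. In the second plan, both scans report 32428238 rows, because each subquery filters on statdate before the join. The difference in execution efficiency speaks for itself.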
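The same pitfall also affects the result, not only the cost. In a LEFT JOIN, a condition on the left table inside ON does not remove left rows; it only decides whether a row finds a match. The following is a minimal sketch of that behaviour. It uses Python's sqlite3 module as a stand-in for Hive, and the three sample rows are made up for illustration. It shows standard outer-join semantics only and says nothing about how Hive plans the query. Note also that the rewritten query above is shown without the where b.hid is null filter; the sketch keeps the filter in both variants so that they can be compared directly.

```python
import sqlite3

# In-memory toy table; the column subset and the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_softuser (hid TEXT, corp TEXT, softid INTEGER, statdate TEXT)"
)
conn.executemany(
    "INSERT INTO t_softuser VALUES (?, ?, ?, ?)",
    [
        ("h1", "c1", 1, "2015-01-26"),  # h1 on another day
        ("h1", "c1", 1, "2015-01-27"),  # h1 on the target day
        ("h2", "c1", 1, "2015-01-26"),  # h2 only on another day
    ],
)

# Variant 1: date filters inside ON. Left rows from other days fail the ON
# condition, get NULLs for b, and therefore pass "b.hid IS NULL".
on_filter_sql = """
SELECT a.hid, a.statdate
FROM t_softuser a
LEFT JOIN t_softuser b
  ON a.hid = b.hid AND a.corp = b.corp AND a.softid = b.softid
 AND a.statdate = '2015-01-27' AND b.statdate = '2015-01-27'
WHERE b.hid IS NULL
ORDER BY a.hid, a.statdate
"""

# Variant 2: each side is filtered in a subquery before the join.
subquery_filter_sql = """
SELECT a.hid, a.statdate
FROM (SELECT * FROM t_softuser WHERE statdate = '2015-01-27') a
LEFT JOIN (SELECT * FROM t_softuser WHERE statdate = '2015-01-27') b
  ON a.hid = b.hid AND a.corp = b.corp AND a.softid = b.softid
WHERE b.hid IS NULL
ORDER BY a.hid, a.statdate
"""

rows_on_filter = conn.execute(on_filter_sql).fetchall()
rows_subquery_filter = conn.execute(subquery_filter_sql).fetchall()

print("filter in ON:       ", rows_on_filter)
print("filter in subquery: ", rows_subquery_filter)
```

With the filter in ON, the rows from 2015-01-26 leak into the result. With the subquery form, only rows from the target day take part in the join.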

