Map join in Hive

I. How to enable it:

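  • set hive.auto.convert.join = true;  -- whether to convert joins to map joins automatically
  • set hive.mapjoin.smalltable.filesize = 25000000;  -- size threshold, in bytes, below which a table counts as "small" for map join (25000000 is the default)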

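Set both together. Pick the filesize according to what your cluster can handle; the default, as far as I remember, is about 25 MB. You can ignore the odd syntax that shows up in many posts, such as /*+ mapjoin(A)*/. It is ancient, and unless your Hive version is very old you will never need it.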

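There is also set hive.ignore.mapjoin.hint=true; (when true, Hive ignores the mapjoin hint and relies on automatic conversion). I do not think setting it is necessary. The cluster has its own parameter settings, which the ops team chose deliberately, so even if you set it, it may not take effect. Just go the regular route and adjust hive.mapjoin.smalltable.filesize as appropriate. The setting is meant for small tables, but size is relative: if you join a 500 GB table with a 50 GB "small" table, loading the latter into memory is not necessarily a good idea. Personally, I would consider map join for tables under 1 GB; anything much larger is not worth it.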
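Since the threshold is given in bytes, it is easy to get the number of zeros wrong. Here is a minimal sketch in Python that builds the two SET statements from a size in MB. The function name mapjoin_settings and the max_mb guard are my own illustration; the guard only mirrors the "under 1 GB" rule of thumb above and is not a limit enforced by Hive.

```python
def mapjoin_settings(small_table_mb, max_mb=1000):
    """Build the two SET statements for a small-table threshold given in MB.

    max_mb=1000 mirrors the "under 1 GB" rule of thumb from the text; it is a
    personal guideline, not a Hive limit.
    """
    if small_table_mb <= 0 or small_table_mb > max_mb:
        raise ValueError(
            f"threshold should be in (0, {max_mb}] MB, got {small_table_mb}"
        )
    # hive.mapjoin.smalltable.filesize is expressed in bytes (decimal MB here).
    size_bytes = int(small_table_mb * 1000 * 1000)
    return [
        "set hive.auto.convert.join=true;",
        f"set hive.mapjoin.smalltable.filesize={size_bytes};",
    ]


if __name__ == "__main__":
    for statement in mapjoin_settings(100):
        print(statement)
```

Paste the printed statements at the top of the Hive session before the query.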

II. How do we confirm that the Hive map join actually took effect?

  • The inner join case only, for now

Check the log; it is the most direct way:

2021-12-10 12:05:41	Starting to launch local task to process map join;	maximum memory = 954728448
2021-12-10 12:05:44	Processing rows:	200000	Hashtable size:	199999	Memory usage:	135058920	percentage:	0.141
2021-12-10 12:05:44	Dump the side-table into file: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile10--.hashtable
2021-12-10 12:05:44	Uploaded 1 File to: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile10--.hashtable (3517 bytes)
2021-12-10 12:05:44	Dump the side-table into file: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile12--.hashtable
2021-12-10 12:05:44	Uploaded 1 File to: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile12--.hashtable (8683158 bytes)
2021-12-10 12:05:44	End of local task; Time Taken: 3.034 sec.
Execution completed successfully

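Key points:

  1. Starting to launch local task to process map join; this one is about as explicit as it gets.
  2. Uploaded 1 File to: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile10--.hashtable; the small table has been dumped to a hashtable file and uploaded.
  3. End of local task; the local task has finished.
  4. In short: Hive launches a local task that loads the small table into a hashtable.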
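If you would rather not scan the log by eye, the key points above can be checked with a few lines of Python. This is only a rough sketch: the function name summarize_mapjoin_log is made up for illustration, and the marker strings are copied from the log excerpt above, so other Hive versions may word them differently.

```python
import re

# Markers copied from the log excerpt above; other versions may differ.
START_MARKER = "Starting to launch local task to process map join"
END_MARKER = "End of local task"
UPLOAD_RE = re.compile(r"Uploaded \d+ File to: (\S+\.hashtable) \((\d+) bytes\)")


def summarize_mapjoin_log(log_text):
    """Return the map-join evidence found in a Hive client log."""
    uploads = [(path, int(size)) for path, size in UPLOAD_RE.findall(log_text)]
    return {
        "local_task_started": START_MARKER in log_text,
        "local_task_finished": END_MARKER in log_text,
        "hashtable_uploads": uploads,  # list of (file path, size in bytes)
    }


if __name__ == "__main__":
    # Three lines taken from the log shown above.
    sample = "\n".join([
        "2021-12-10 12:05:41\tStarting to launch local task to process map join;"
        "\tmaximum memory = 954728448",
        "2021-12-10 12:05:44\tUploaded 1 File to: file:/tmp/hive_2021-12-10_11-47-34"
        "_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/"
        "MapJoin-mapfile10--.hashtable (3517 bytes)",
        "2021-12-10 12:05:44\tEnd of local task; Time Taken: 3.034 sec.",
    ])
    print(summarize_mapjoin_log(sample))
```

If all three pieces of evidence are present, the join was very likely executed as a map join; the upload sizes also give a quick feel for how big the hashtable files were.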

An addendum:

I found that a left join will also go through map join when the conditions are met. Here is the execution plan of such a query:

STAGE DEPENDENCIES:
  Stage-9 is a root stage , consists of Stage-11, Stage-1
  Stage-11 has a backup stage: Stage-1
  Stage-8 depends on stages: Stage-11
  Stage-7 depends on stages: Stage-1, Stage-8 , consists of Stage-10, Stage-2
  Stage-10 has a backup stage: Stage-2
  Stage-6 depends on stages: Stage-10
  Stage-3 depends on stages: Stage-2, Stage-6
  Stage-2
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-9
    Conditional Operator

  Stage: Stage-11
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t_2:temp_sjs_interact_cf_top10_t1 --21.3m
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t_2:temp_sjs_interact_cf_top10_t1
          TableScan
            alias: temp_sjs_interact_cf_top10_t1
            Filter Operator
              predicate: sjs_r is not null (type: boolean)
              Select Operator
                expressions: uid (type: string), inter_type (type: string), sjs_r (type: string), level_cf (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3
                HashTable Sink Operator
                  condition expressions:
                    0 {_col0} {_col1} {_col2} {_col3}
                    1 {_col2}
                  keys:
                    0 _col0 (type: string), _col2 (type: string), _col1 (type: string)
                    1 _col0 (type: string), _col3 (type: string), _col1 (type: string)

  Stage: Stage-8
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: temp_sjs_interact_cf_top10_t2
            Select Operator
              expressions: uid (type: string), inter_type (type: string), level_cf (type: string), cnt (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                condition expressions:
                  0 {_col0} {_col1} {_col2} {_col3}
                  1 {_col2}
                keys:
                  0 _col0 (type: string), _col2 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col3 (type: string), _col1 (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col6
                File Output Operator
                  compressed: false
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-7
    Conditional Operator

  Stage: Stage-10
    Map Reduce Local Work
      Alias -> Map Local Tables:
        t_3:ods_user_base_info
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        t_3:ods_user_base_info
          TableScan
            alias: ods_user_base_info
            Select Operator
              expressions: uid (type: string), nick (type: string)
              outputColumnNames: _col0, _col1
              HashTable Sink Operator
                condition expressions:
                  0 {_col6} {_col0} {_col1} {_col2} {_col3}
                  1 {_col1}
                keys:
                  0 _col0 (type: string)
                  1 _col0 (type: string)

  Stage: Stage-6
    Map Reduce
      Map Operator Tree:
          TableScan
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              condition expressions:
                0 {_col6} {_col0} {_col1} {_col2} {_col3}
                1 {_col1}
              keys:
                0 _col0 (type: string)
                1 _col0 (type: string)
              outputColumnNames: _col2, _col4, _col5, _col6, _col7, _col9
              Select Operator
                expressions: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
                outputColumnNames: _col4, _col9, _col5, _col6, _col7, _col2
                Group By Operator
                  keys: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                  File Output Operator
                    compressed: false
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string)
              sort order: ++++++
              Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string)
              Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
          Statistics: Num rows: 799473280 Data size: 159894650880 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
            Statistics: Num rows: 799473280 Data size: 159894650880 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 799473280 Data size: 159894650880 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              key expressions: _col0 (type: string)
              sort order: +
              Map-reduce partition columns: _col0 (type: string)
              Statistics: Num rows: 30685 Data size: 12274456 Basic stats: COMPLETE Column stats: NONE
              value expressions: _col6 (type: string), _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string)
          TableScan
            alias: ods_user_base_info
            Statistics: Num rows: 1453587694 Data size: 290717538880 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: uid (type: string), nick (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 1453587694 Data size: 290717538880 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 1453587694 Data size: 290717538880 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col2} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7}
            1 {VALUE._col1}
          outputColumnNames: _col2, _col4, _col5, _col6, _col7, _col9
          Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
            outputColumnNames: _col4, _col9, _col5, _col6, _col7, _col2
            Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
            Group By Operator
              keys: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
              mode: hash
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
              Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: temp_sjs_interact_cf_top10_t2
            Statistics: Num rows: 5 Data size: 2336 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: uid (type: string), inter_type (type: string), level_cf (type: string), cnt (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3
              Statistics: Num rows: 5 Data size: 2336 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string), _col2 (type: string), _col1 (type: string)
                sort order: +++
                Map-reduce partition columns: _col0 (type: string), _col2 (type: string), _col1 (type: string)
                Statistics: Num rows: 5 Data size: 2336 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string)
          TableScan
            alias: temp_sjs_interact_cf_top10_t1
            Statistics: Num rows: 55792 Data size: 22317192 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: sjs_r is not null (type: boolean)
              Statistics: Num rows: 27896 Data size: 11158596 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: uid (type: string), inter_type (type: string), sjs_r (type: string), level_cf (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 27896 Data size: 11158596 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col3 (type: string), _col1 (type: string)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: string), _col3 (type: string), _col1 (type: string)
                  Statistics: Num rows: 27896 Data size: 11158596 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3}
            1 {VALUE._col2}
          outputColumnNames: _col0, _col1, _col2, _col3, _col6
          Statistics: Num rows: 30685 Data size: 12274456 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

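The execution plan makes the point clearly: Stage-8 and Stage-6 both run a Map Join Operator with a Left Outer Join condition, while Stage-1 and Stage-2 are the backup stages that fall back to an ordinary Join Operator.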
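For long plans like this one, a small script can pull out just the join operators. The following is a rough sketch, assuming the EXPLAIN output has the same text layout as the plan above; the function name join_operators is my own.

```python
import re

STAGE_RE = re.compile(r"Stage: (Stage-\d+)$")
# Matches condition lines such as "Left Outer Join0 to 1" or "Inner Join 0 to 1".
CONDITION_RE = re.compile(r"Join\s*\d+ to \d+$")


def join_operators(plan_text):
    """List (stage, operator, join condition) for each join in EXPLAIN text."""
    results = []
    stage = None
    pending = None  # a join operator was seen and its condition line is awaited
    for raw in plan_text.splitlines():
        line = raw.strip()
        stage_match = STAGE_RE.match(line)
        if stage_match:
            stage = stage_match.group(1)
            pending = None
        elif line in ("Map Join Operator", "Join Operator"):
            pending = line
        elif pending and CONDITION_RE.search(line):
            results.append((stage, pending, line))
            pending = None
    return results


if __name__ == "__main__":
    # A fragment of the plan above, trimmed to the relevant lines.
    fragment = """
  Stage: Stage-8
    Map Reduce
      Map Operator Tree:
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
  Stage: Stage-1
    Map Reduce
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
"""
    for entry in join_operators(fragment):
        print(entry)
```

Keep in mind that a conditional plan lists both the map join stages and their backup stages, so seeing a plain Join Operator in the output does not by itself mean the map join was skipped.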
