[Hive] Performance Tuning - Map JOIN

Hive version: hive-3.1.3

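Map-side aggregation vs. Map JOIN

  1. Map-side aggregation is pre-aggregation: rows are partially aggregated in the map stage, so the data that reaches the reducers may no longer be skewed. It is sometimes confused with a map-side JOIN, but it is a different optimization.
  2. Map JOIN (also called map-side JOIN, the term used in the rest of this post) caches the small table in memory and joins in the map stage, with no shuffle and no reduce.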
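
To make the distinction concrete, here is a minimal sketch of the pre-aggregation side. It assumes the standard hive.map.aggr setting (to my knowledge enabled by default) and reuses the order_detail table and its province_id column from the test section of this post; the query itself is only an illustration.

```sql
-- Map-side (partial) aggregation is controlled by hive.map.aggr.
set hive.map.aggr = true;

-- With map-side aggregation on, the map side of the plan should contain
-- a Group By Operator that pre-aggregates rows before the shuffle.
explain
select province_id, count(*)
from order_detail
group by province_id;
```

This setting affects GROUP BY processing only. It does not decide whether a join is executed as a Map JOIN.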

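Overview
When one of the tables in a join is very small, Hive can load it entirely into memory and complete the join on the map side while the large table passes through the mappers. This is the so-called map-side JOIN.
A map-side JOIN skips the usual reduce phase, which improves Hive's efficiency.
Hive has three parameters related to map-side JOIN: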

Parameter                                        Default value
hive.auto.convert.join                           true (Hive 0.11.0+)
hive.auto.convert.join.noconditionaltask         true
hive.auto.convert.join.noconditionaltask.size    10000000 (about 10 MB)

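To use map-side JOIN, the parameter to focus on is hive.auto.convert.join.noconditionaltask.size. As a rule of thumb, if the small table is larger than this value (about 10 MB by default), Hive will not automatically convert the join into a map-side JOIN; for an n-way join, the combined size of the n-1 smaller tables is what is compared against it. The value can be tuned to fit the memory actually available on the nodes.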
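The official Hive documentation explains these parameters as follows:
[Figure: screenshot of the parameter descriptions in the Hive documentation]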
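
As a quick way to work with these three parameters, the sketch below prints their current values and then overrides the size threshold for the current session. The 50000000 value is only an illustrative assumption, not a recommendation; pick a value that fits the memory of your own nodes.

```sql
-- "set <name>;" without a value prints the current setting.
set hive.auto.convert.join;
set hive.auto.convert.join.noconditionaltask;
set hive.auto.convert.join.noconditionaltask.size;

-- Override the threshold for this session only (value in bytes, illustrative).
set hive.auto.convert.join.noconditionaltask.size = 50000000;
```

A session-level set does not change hive-site.xml, so the override disappears when the session ends.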
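Note:

A map-side JOIN streams only one table (the large one) through the mappers and holds the other in memory. A full outer join must also output the unmatched rows of both tables, and a single mapper sees only part of the large table, so it cannot tell which rows of the in-memory table are unmatched. For this reason the optimization cannot currently be applied to full outer joins.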
Test

Below is a simple test of hive.auto.convert.join.noconditionaltask.size.
The test environment has two tables: order_detail (1.09 GB) and province_info (369 B).
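
Since the conversion decision depends on table size, it helps to check what Hive itself knows about the two tables. The following is a sketch that assumes both tables are unpartitioned; for partitioned tables the analyze statement may need a partition clause.

```sql
-- Refresh basic table statistics (row count, size).
analyze table province_info compute statistics;
analyze table order_detail compute statistics;

-- Inspect numRows, rawDataSize and totalSize under "Table Parameters".
describe formatted province_info;
describe formatted order_detail;
```

The sizes reported here can differ from the "Data size" estimates printed in an execution plan, so treat both as approximate.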

1. Set hive.auto.convert.join.noconditionaltask.size to 1 byte

Both tables are larger than this value (1 byte), so Hive performs a reduce-side JOIN rather than a map-side JOIN.

set hive.auto.convert.join.noconditionaltask.size = 1;

explain 
select t1.id from province_info t1 
join
order_detail t2
on t1.id = t2.province_id
limit 10;

The execution plan shows that Hive still runs a reduce stage (the Join Operator sits in Reducer 2):

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 92), Map 3 (PARTITION-LEVEL SORT, 92)
      DagName: atguigu_20230602210922_012167eb-c3f8-4e7b-a3a9-df1e56e64ff3:7
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: id (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
            Execution mode: vectorized
        Map 3
            Map Operator Tree:
                TableScan
                  alias: t2
                  Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: province_id is not null (type: boolean)
                    Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: province_id (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
            Execution mode: vectorized
        Reducer 2
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 14373455 Data size: 12936109554 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 10
                  Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

Time taken: 0.369 seconds, Fetched: 74 row(s)

2. Set hive.auto.convert.join.noconditionaltask.size to 20 MB

province_info (369 B) is smaller than hive.auto.convert.join.noconditionaltask.size, so Hive automatically converts the join into a map-side JOIN.

set hive.auto.convert.join.noconditionaltask.size = 20000000;

explain 
select t1.id from province_info t1 
join
order_detail t2
on t1.id = t2.province_id
limit 10;

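The execution plan shows only map vertices and no reducer; the join is performed by the Map Join Operator in Map 2 (non-essential parts are omitted):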

Explain
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: atguigu_20230602193520_04e371da-7c56-4d2f-8946-f46024760b34:6
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t1
                  Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: id (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 _col0 (type: string)
                          1 _col0 (type: string)
            Execution mode: vectorized
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: atguigu_20230602193520_04e371da-7c56-4d2f-8946-f46024760b34:5
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: t2
                  Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: province_id is not null (type: boolean)
                    Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: province_id (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: string)
                          1 _col0 (type: string)
                        outputColumnNames: _col0
                        input vertices:
                          0 Map 1
                        Statistics: Num rows: 14373455 Data size: 12936109554 Basic stats: COMPLETE Column stats: NONE
                        Limit
                          Number of rows: 10
                          Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
                          File Output Operator
                            compressed: false
                            Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
                            table:
                                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Execution mode: vectorized
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: 10
      Processor Tree:
        ListSink

Time taken: 0.357 seconds, Fetched: 76 row(s)

3. Full Outer Join does not use map-side JOIN

After changing the join type to Full Outer Join, Hive still uses a reduce-side JOIN rather than a map-side JOIN, even with hive.auto.convert.join.noconditionaltask.size set to 20 MB.

set hive.auto.convert.join.noconditionaltask.size = 20000000;

explain 
select t1.id from province_info t1 
full join
order_detail t2
on t1.id = t2.province_id
limit 10;

[Figure: screenshot of the execution plan for the full outer join]
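
For comparison, a one-sided outer join may still qualify. My understanding, which this post did not test, is that a left outer join can be converted when the large table is on the preserved (left) side and the small table is on the right, because every row of the streamed table is seen by exactly one mapper. The query below is a hypothetical variant of the test query; check your own plan to see whether a Map Join Operator appears.

```sql
set hive.auto.convert.join.noconditionaltask.size = 20000000;

-- Large table (order_detail) is preserved; small table (province_info)
-- is the candidate for the in-memory hash table.
explain
select t2.province_id, t1.id
from order_detail t2
left join
province_info t1
on t1.id = t2.province_id
limit 10;
```

If the plan still shows a Reducer with a Join Operator, the conversion did not happen in your environment and the assumption above does not hold there.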

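Summary

A map-side JOIN skips the reduce stage and thereby improves Hive's efficiency.
Since Hive 0.11.0, automatic map-side JOIN conversion is enabled by default; what remains is to set hive.auto.convert.join.noconditionaltask.size to a reasonable value.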
