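Hive version: hive-3.1.3
Overview
If one of the tables in a join is small, Hive can load it entirely into memory while the large table passes through the mappers, and complete the join on the map side. This is known as a map-side JOIN.
A map-side JOIN skips the reduce phase of a regular join, which improves Hive's efficiency.
Hive has three parameters related to map-side JOIN: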
Parameter | Default |
---|---|
hive.auto.convert.join | true (Hive 0.11.0+) |
hive.auto.convert.join.noconditionaltask | true |
hive.auto.convert.join.noconditionaltask.size | 10000000 (10 MB) |
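To use map-side JOIN, the parameter to focus on is hive.auto.convert.join.noconditionaltask.size, whose value is given in bytes. As a rule of thumb, if the small table is larger than this value (10 MB by default), Hive will not automatically convert the join to a map-side JOIN. The value can be adjusted to suit the memory actually available on the nodes.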
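As a minimal sketch of session-level tuning, the parameters can be inspected and set as shown below. The 20000000 value is only an illustration (it is the value used in the test later in this article); a suitable value depends on the memory available to your tasks.

```sql
-- Print the current value of a parameter ("set <name>;" with no value shows it).
set hive.auto.convert.join.noconditionaltask.size;

-- Make sure automatic conversion is on (already the default since Hive 0.11.0).
set hive.auto.convert.join = true;
set hive.auto.convert.join.noconditionaltask = true;

-- Raise the small-table threshold to 20,000,000 bytes (about 20 MB); illustrative value.
set hive.auto.convert.join.noconditionaltask.size = 20000000;
```

Settings made with set apply only to the current session, so they can be tried out on a single query without affecting other users.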
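The official Hive documentation describes hive.auto.convert.join.noconditionaltask.size roughly as follows: it has no effect when hive.auto.convert.join.noconditionaltask is off; when that parameter is on and the combined size of n-1 of the tables/partitions in an n-way join is smaller than this value, the join is converted directly to a map join, without a conditional task.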
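Note:
A map-side JOIN holds one table in memory and streams the other, so only the streamed table can reliably keep its unmatched rows. A full outer join must keep unmatched rows from both tables, so this optimization cannot currently be applied to it.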
Testing
Below is a simple test of hive.auto.convert.join.noconditionaltask.size.
The test environment has two tables: order_detail (1.09 GB) and province_info (369 B).
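The article does not say how these sizes were measured. One way to check them, sketched here under the assumption that both tables are non-partitioned managed tables, is to refresh the basic statistics and read them from the table metadata:

```sql
-- Recompute basic statistics (row count, total size) for each table.
analyze table province_info compute statistics;
analyze table order_detail compute statistics;

-- The "Table Parameters" section of the output lists numRows, rawDataSize
-- and totalSize (in bytes).
describe formatted province_info;
describe formatted order_detail;
```

The Data size figures that appear in an execution plan are Hive's own estimates and need not equal the on-disk size; in the plans in this article, province_info is reported with a Data size of 3690.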
Because both order_detail and province_info are larger than the parameter value set here (1 byte), Hive performs a reduce-side JOIN rather than a map-side JOIN.
set hive.auto.convert.join.noconditionaltask.size = 1;
explain
select t1.id from province_info t1
join
order_detail t2
on t1.id = t2.province_id
limit 10;
The execution plan shows that Hive still runs a reduce phase (note the Reducer 2 vertex and its Join Operator):
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 92), Map 3 (PARTITION-LEVEL SORT, 92)
DagName: atguigu_20230602210922_012167eb-c3f8-4e7b-a3a9-df1e56e64ff3:7
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Map 3
Map Operator Tree:
TableScan
alias: t2
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: province_id is not null (type: boolean)
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: province_id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reducer 2
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0
Statistics: Num rows: 14373455 Data size: 12936109554 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 10
Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: 10
Processor Tree:
ListSink
Time taken: 0.369 seconds, Fetched: 74 row(s)
Because province_info (369 B) is smaller than the value of hive.auto.convert.join.noconditionaltask.size set here, Hive automatically converts the join to a map-side JOIN.
set hive.auto.convert.join.noconditionaltask.size = 20000000;
explain
select t1.id from province_info t1
join
order_detail t2
on t1.id = t2.province_id
limit 10;
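The execution plan shows that Hive runs only map work, with no reduce phase (non-essential parts omitted). province_info is written to a hash table by the Spark HashTable Sink Operator, and order_detail is joined against it by the Map Join Operator: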
Explain
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-1 depends on stages: Stage-2
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Spark
DagName: atguigu_20230602193520_04e371da-7c56-4d2f-8946-f46024760b34:6
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: t1
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 3690 Basic stats: COMPLETE Column stats: NONE
Spark HashTable Sink Operator
keys:
0 _col0 (type: string)
1 _col0 (type: string)
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
DagName: atguigu_20230602193520_04e371da-7c56-4d2f-8946-f46024760b34:5
Vertices:
Map 2
Map Operator Tree:
TableScan
alias: t2
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: province_id is not null (type: boolean)
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: province_id (type: string)
outputColumnNames: _col0
Statistics: Num rows: 13066777 Data size: 11760099340 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col0
input vertices:
0 Map 1
Statistics: Num rows: 14373455 Data size: 12936109554 Basic stats: COMPLETE Column stats: NONE
Limit
Number of rows: 10
Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 10 Data size: 9000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: 10
Processor Tree:
ListSink
Time taken: 0.357 seconds, Fetched: 76 row(s)
After the join type is changed to Full Outer Join, Hive still uses a reduce-side JOIN rather than a map-side JOIN, even with hive.auto.convert.join.noconditionaltask.size set to 20 MB.
set hive.auto.convert.join.noconditionaltask.size = 20000000;
explain
select t1.id from province_info t1
full join
order_detail t2
on t1.id = t2.province_id
limit 10;
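For contrast, a one-sided outer join can still qualify in principle, because only the large, streamed table has to keep its unmatched rows. The following is an untested sketch that reuses the article's tables with order_detail as the preserved (left) side. Its plan was not part of the original test, so look for a Map Join Operator in your own explain output rather than assuming the result.

```sql
-- Illustrative only: same threshold as the test above.
set hive.auto.convert.join.noconditionaltask.size = 20000000;
-- Left outer join: the large table is on the preserved side,
-- the small table province_info is the one that would be held in memory.
explain
select t2.province_id from order_detail t2
left join
province_info t1
on t2.province_id = t1.id
limit 10;
```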
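A map-side JOIN eliminates the reduce phase and thereby improves Hive's efficiency.
Since Hive 0.11.0, automatic map-side JOIN conversion is enabled by default; what remains is to choose a sensible value for hive.auto.convert.join.noconditionaltask.size.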