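Set both of them together. Pick the filesize value according to what your cluster can handle; as far as I remember the default is about 25 MB. You can also ignore the odd syntax that appears in a lot of posts, for example /*+ mapjoin(A)*/. That is ancient stuff, and unless your Hive version is very old you will never need it.
There is one more: set hive.ignore.mapjoin.hint=true; I don't think this one is necessary. The cluster has its own parameter settings, and the ops team chose them for a reason. Even if you set it, it will not necessarily take effect. Just go the normal route and adjust hive.mapjoin.smalltable.filesize to a suitable value. The parameter is meant for small tables, but size is relative: if you join a 500 GB table with a 50 GB "small" table, loading the smaller one into memory is not necessarily a good idea. My personal suggestion is to consider it for tables under 1 GB; anything much larger is not worth it.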
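As a rough sketch of the "normal route" described above, the session settings might look like this. Which two parameters "both" refers to is not stated here, so pairing hive.auto.convert.join with the size threshold is an assumption, and the value shown is just the stock default rather than a recommendation.

```sql
-- Assumption: "both" means automatic join conversion plus the small-table threshold.
set hive.auto.convert.join=true;
-- Threshold in bytes; 25000000 (about 25 MB) is the default value.
-- Raise it only as far as the memory of your local task allows.
set hive.mapjoin.smalltable.filesize=25000000;
```

Running `set hive.mapjoin.smalltable.filesize;` with no value prints the setting currently in effect, which is a quick way to see what the cluster default actually is.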
Look at the log; it is the most direct evidence:
2021-12-10 12:05:41 Starting to launch local task to process map join; maximum memory = 954728448
2021-12-10 12:05:44 Processing rows: 200000 Hashtable size: 199999 Memory usage: 135058920 percentage: 0.141
2021-12-10 12:05:44 Dump the side-table into file: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile10--.hashtable
2021-12-10 12:05:44 Uploaded 1 File to: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile10--.hashtable (3517 bytes)
2021-12-10 12:05:44 Dump the side-table into file: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile12--.hashtable
2021-12-10 12:05:44 Uploaded 1 File to: file:/tmp/hive_2021-12-10_11-47-34_913_2061727660300134431-1/-local-10007/HashTable-Stage-13/MapJoin-mapfile12--.hashtable (8683158 bytes)
2021-12-10 12:05:44 End of local task; Time Taken: 3.034 sec.
Execution completed successfully
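The "percentage" figure in the log appears to be simply the memory usage divided by the maximum memory. A quick check with the numbers from the log, written as a FROM-less SELECT (this assumes a Hive version that accepts one):

```sql
-- Memory usage / maximum memory, both copied from the log lines.
-- Result: 0.141, matching "percentage: 0.141".
select round(135058920 / 954728448, 3);
-- Maximum memory of the local task expressed in MiB.
-- Result: 910.5
select 954728448 / 1024 / 1024;
```

So the hash tables built here used roughly 14% of the memory available to the local task.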
Key points:
One more note:
I found that a left join will also go through mapjoin when the conditions are met.
STAGE DEPENDENCIES:
Stage-9 is a root stage , consists of Stage-11, Stage-1
Stage-11 has a backup stage: Stage-1
Stage-8 depends on stages: Stage-11
Stage-7 depends on stages: Stage-1, Stage-8 , consists of Stage-10, Stage-2
Stage-10 has a backup stage: Stage-2
Stage-6 depends on stages: Stage-10
Stage-3 depends on stages: Stage-2, Stage-6
Stage-2
Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-9
Conditional Operator
Stage: Stage-11
Map Reduce Local Work
Alias -> Map Local Tables:
t_2:temp_sjs_interact_cf_top10_t1 -- 21.3 MB
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t_2:temp_sjs_interact_cf_top10_t1
TableScan
alias: temp_sjs_interact_cf_top10_t1
Filter Operator
predicate: sjs_r is not null (type: boolean)
Select Operator
expressions: uid (type: string), inter_type (type: string), sjs_r (type: string), level_cf (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
HashTable Sink Operator
condition expressions:
0 {_col0} {_col1} {_col2} {_col3}
1 {_col2}
keys:
0 _col0 (type: string), _col2 (type: string), _col1 (type: string)
1 _col0 (type: string), _col3 (type: string), _col1 (type: string)
Stage: Stage-8
Map Reduce
Map Operator Tree:
TableScan
alias: temp_sjs_interact_cf_top10_t2
Select Operator
expressions: uid (type: string), inter_type (type: string), level_cf (type: string), cnt (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Map Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {_col0} {_col1} {_col2} {_col3}
1 {_col2}
keys:
0 _col0 (type: string), _col2 (type: string), _col1 (type: string)
1 _col0 (type: string), _col3 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col6
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
Stage: Stage-7
Conditional Operator
Stage: Stage-10
Map Reduce Local Work
Alias -> Map Local Tables:
t_3:ods_user_base_info
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
t_3:ods_user_base_info
TableScan
alias: ods_user_base_info
Select Operator
expressions: uid (type: string), nick (type: string)
outputColumnNames: _col0, _col1
HashTable Sink Operator
condition expressions:
0 {_col6} {_col0} {_col1} {_col2} {_col3}
1 {_col1}
keys:
0 _col0 (type: string)
1 _col0 (type: string)
Stage: Stage-6
Map Reduce
Map Operator Tree:
TableScan
Map Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {_col6} {_col0} {_col1} {_col2} {_col3}
1 {_col1}
keys:
0 _col0 (type: string)
1 _col0 (type: string)
outputColumnNames: _col2, _col4, _col5, _col6, _col7, _col9
Select Operator
expressions: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
outputColumnNames: _col4, _col9, _col5, _col6, _col7, _col2
Group By Operator
keys: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Local Work:
Map Reduce Local Work
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string)
sort order: ++++++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string)
Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 799473280 Data size: 159894650880 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 799473280 Data size: 159894650880 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 799473280 Data size: 159894650880 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 30685 Data size: 12274456 Basic stats: COMPLETE Column stats: NONE
value expressions: _col6 (type: string), _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string)
TableScan
alias: ods_user_base_info
Statistics: Num rows: 1453587694 Data size: 290717538880 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: uid (type: string), nick (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1453587694 Data size: 290717538880 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1453587694 Data size: 290717538880 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col2} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7}
1 {VALUE._col1}
outputColumnNames: _col2, _col4, _col5, _col6, _col7, _col9
Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
outputColumnNames: _col4, _col9, _col5, _col6, _col7, _col2
Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
Group By Operator
keys: _col4 (type: string), _col9 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col2 (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
Statistics: Num rows: 1598946560 Data size: 319789301760 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: temp_sjs_interact_cf_top10_t2
Statistics: Num rows: 5 Data size: 2336 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: uid (type: string), inter_type (type: string), level_cf (type: string), cnt (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 5 Data size: 2336 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col2 (type: string), _col1 (type: string)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col2 (type: string), _col1 (type: string)
Statistics: Num rows: 5 Data size: 2336 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string)
TableScan
alias: temp_sjs_interact_cf_top10_t1
Statistics: Num rows: 55792 Data size: 22317192 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: sjs_r is not null (type: boolean)
Statistics: Num rows: 27896 Data size: 11158596 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: uid (type: string), inter_type (type: string), sjs_r (type: string), level_cf (type: string)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 27896 Data size: 11158596 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string), _col3 (type: string), _col1 (type: string)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col3 (type: string), _col1 (type: string)
Statistics: Num rows: 27896 Data size: 11158596 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: string)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3}
1 {VALUE._col2}
outputColumnNames: _col0, _col1, _col2, _col3, _col6
Statistics: Num rows: 30685 Data size: 12274456 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-0
Fetch Operator
limit: -1
The execution plan already makes the point quite clearly.
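To check this on your own query, prefix it with EXPLAIN and look at the join operators. The statement below is only a hypothetical, simplified reconstruction: the table and column names are taken from the plan, but the actual query is not shown in this post, and the aliases are made up for illustration.

```sql
-- Hypothetical query; only the names come from the plan above.
explain
select t_1.uid, t_1.inter_type, t_1.level_cf, t_1.cnt, t_2.sjs_r
from temp_sjs_interact_cf_top10_t2 t_1
left join (
  select uid, inter_type, sjs_r, level_cf
  from temp_sjs_interact_cf_top10_t1
  where sjs_r is not null
) t_2
  on  t_1.uid = t_2.uid
  and t_1.level_cf = t_2.level_cf
  and t_1.inter_type = t_2.inter_type;
```

In the output, a `Map Join Operator` whose condition map reads `Left Outer Join0 to 1` means the left join was converted to a mapjoin, while a plain `Join Operator` under a `Reduce Operator Tree` is the ordinary shuffle join kept as the backup stage. Note that in a left join it is the right-hand table that gets loaded into the hash table.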