Hive: speeding up a join between a small table and a big table


Problem description: a small table of 1,000 rows joined against a big table of 600,000 rows.

Approach 1:

When the query ran, Hive automatically converted the join into a map join.

I expected it to be fast, but only a single mapper was launched, and the join is effectively a cartesian product: 1,000 × 600,000 = 600 million distance computations on one machine. The job sat at map 0% for a long time and finally finished in about 1800 s.
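For context, the automatic conversion is governed by the settings below: any table smaller than hive.mapjoin.smalltable.filesize (25 MB by default) qualifies as the broadcast side, and a 1,000-row table easily fits. These are stock Hive defaults, not values taken from the original run:

set hive.auto.convert.join=true;                -- auto-convert qualifying joins to map joins
set hive.mapjoin.smalltable.filesize=25000000;  -- small-table threshold, 25 MB by default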


Approach 2:

Disable the automatic map join:
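The exact command is not shown above; the standard switch for this in Hive is:

set hive.auto.convert.join=false;  -- fall back to a common (shuffle) join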

But it was still slow. Why? Because the number of mappers was still too small; the log below shows only 2 mappers and 1 reducer, so there was hardly any parallelism.

Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1492598920618_36034, Tracking URL = http://qing-hadoop-master-srv1:8088/proxy/application_1492598920618_36034/
Kill Command = /data/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/bin/hadoop job  -kill job_1492598920618_36034
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-06-23 13:30:34,807 Stage-1 map = 0%,  reduce = 0%
2017-06-23 13:30:38,949 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.3 sec
2017-06-23 13:30:39,975 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.45 sec
2017-06-23 13:30:50,349 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 15.63 sec
2017-06-23 13:31:32,703 Stage-1 map = 100%,  reduce = 68%, Cumulative CPU 62.93 sec
2017-06-23 13:32:33,304 Stage-1 map = 100%,  reduce = 68%, Cumulative CPU 125.97 sec
2017-06-23 13:32:47,645 Stage-1 map = 100%,  reduce = 69%, Cumulative CPU 141.24 sec
2017-06-23 13:33:48,111 Stage-1 map = 100%,  reduce = 69%, Cumulative CPU 204.34 sec
2017-06-23 13:33:57,326 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 213.84 sec
2017-06-23 13:34:57,940 Stage-1 map = 100%,  reduce = 70%, Cumulative CPU 276.55 sec
2017-06-23 13:35:04,081 Stage-1 map = 100%,  reduce = 71%, Cumulative CPU 282.85 sec
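As an aside, the log itself lists the knobs for reduce-side parallelism. Had the single reducer been the bottleneck, it could have been raised along these lines (illustrative values, not from the original run):

set mapreduce.job.reduces=8;                        -- pin the reducer count directly
set hive.exec.reducers.bytes.per.reducer=67108864;  -- or lower per-reducer load to ~64 MB and let Hive derive the count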


Approach 3:

Keep the map join, but raise the map-side parallelism. Enable map join, then set a suitable split size so that an appropriate number of mappers is launched:


set mapred.max.split.size=1000;                 -- max split size in bytes; tiny splits force many mappers
set mapred.min.split.size.per.node=1000;
set mapred.min.split.size.per.rack=1000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set hive.ignore.mapjoin.hint=false;             -- honor /*+ MAPJOIN */ hints (do not ignore them)
set hive.auto.convert.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask.size=100000000;
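(A side note of mine, not from the original post: mapred.* are the legacy property names. On Hadoop 2.x the canonical equivalents are the following, and CDH 5.7 accepts both spellings:)

set mapreduce.input.fileinputformat.split.maxsize=1000;
set mapreduce.input.fileinputformat.split.minsize.per.node=1000;
set mapreduce.input.fileinputformat.split.minsize.per.rack=1000;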
 
 
drop table if exists tmp_table.table20170622_2;
create table tmp_table.table20170622_2 as
select a.address,
       -- distance_lat_lng is a custom UDF; count the hotels within 2 km of each address
       sum(case when distance_lat_lng(if(a.lat <> ' ', a.lat, 0),
                                      if(a.lng <> ' ', a.lng, 0),
                                      b.lat, b.lng) < 2
                then 1 else 0 end) as cnt
from (
    -- lnglat is stored as "lng|lat"; split it into separate columns
    select address,
           split(lnglat, '\\|')[1] as lat,
           split(lnglat, '\\|')[0] as lng
    from tmp_table.address_sample_latlng
) a
left join tmp_table.hotel_location b
-- no ON clause: the cartesian product (1,000 x 600,000 pairs) is intentional here
where b.lat > 10 and b.lng > 10
group by a.address;

Run log:

2017-06-23 11:30:45     Starting to launch local task to process map join;      maximum memory = 2022178816
2017-06-23 11:30:47     Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/hdfs/4c51d439-1dac-4b0d-9476-b03afba927f1/hive_2017-06-23_11-30-41_940_1742320364386420907-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile31--.hashtable
2017-06-23 11:30:47     Uploaded 1 File to: file:/tmp/hdfs/4c51d439-1dac-4b0d-9476-b03afba927f1/hive_2017-06-23_11-30-41_940_1742320364386420907-1/-local-10004/HashTable-Stage-5/MapJoin-mapfile31--.hashtable (16285494 bytes)
2017-06-23 11:30:47     End of local task; Time Taken: 1.882 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1492598920618_35989, Tracking URL = http://qing-hadoop-master-srv1:8088/proxy/application_1492598920618_35989/
Kill Command = /data/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/bin/hadoop job  -kill job_1492598920618_35989
Hadoop job information for Stage-5: number of mappers: 64; number of reducers: 0



Log analysis:

Hadoop job information for Stage-5: number of mappers: 64; number of reducers: 0 

Checking the file in HDFS shows the small table is only about 64 KB (in the output below, the second figure, 190.1 K, is the space consumed including 3-way replication):

hive> dfs -du -s -h /user/hive/warehouse/tmp_table.db/address_sample_latlng ;

63.4 K  190.1 K  /user/hive/warehouse/tmp_table.db/address_sample_latlng 



The maximum split size above was set to 1000 bytes (about 1 KB), so the 63.4 KB file is carved into roughly 63.4 KB ÷ 1 KB ≈ 64 splits, which is why 64 mappers were launched.

Interestingly, the map join still fires, and the big table (the broadcast hash side) is not split at all; the mappers stream over the small table, so the real optimization lever turns out to be splitting that small file.
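Since hive.ignore.mapjoin.hint=false makes Hive honor map-join hints again, the conversion could also be forced explicitly rather than left to the optimizer. A minimal sketch against the same tables (simplified select list; only the hint matters here):

select /*+ MAPJOIN(b) */ a.address, b.lat, b.lng
from tmp_table.address_sample_latlng a
left join tmp_table.hotel_location b
where b.lat > 10 and b.lng > 10;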


After this optimization, 64 mappers were launched and the job finished in 105 s.

I did not experiment further with tuning the maximum split size, but compared with the first join (1800 s) this is already roughly a 17x speedup.

