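1. Null key filtering
Sometimes a join times out because certain keys have too many rows. All rows with the same key are sent to the same reducer, which can then run out of memory. In that case, analyze the abnormal keys carefully. In many cases the rows for those keys are bad data and should be filtered out in the SQL statement. For example, when the key field is null, proceed as follows.
Hands-on example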
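(1) Configure the history server
Configure mapred-site.xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
</property>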
Start the history server
[victor@hadoop102 hadoop]$ sbin/mr-jobhistory-daemon.sh start historyserver
View the job history
http://192.168.1.102:19888/jobhistory
(2) Create the original data table, the null-id table, and the table that holds the joined result
create table ori(id bigint, time bigint, uid string, keyword string, url_rank int,
click_num int, click_url string)
row format delimited fields terminated by '\t';
create table nullidtable(id bigint, time bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
create table jointable(id bigint, time bigint, uid string, keyword string,
url_rank int, click_num int, click_url string)
row format delimited fields terminated by '\t';
(3) Load the original data and the null-id data into their respective tables
hive (default)> load data local inpath '/opt/module/datas/ori'
into table ori;
hive (default)> load data local inpath '/opt/module/datas/nullid'
into table nullidtable;
(4) Test without filtering null ids
hive (default)> insert overwrite table jointable
select n.* from nullidtable n left join ori o on n.id = o.id;
Time taken: 42.038 seconds
(5) Test with null ids filtered out
hive (default)> insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null ) n
left join ori o on n.id = o.id;
Time taken: 31.725 seconds
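Before filtering, it can help to confirm which keys are actually skewed. The query below is a minimal sketch of one way to do that on the nullidtable table from this example; the alias cnt and the limit of 10 are arbitrary choices made for illustration.

```sql
-- Count rows per join key and list the heaviest keys first.
-- Rows whose id is NULL are grouped together and appear as a single NULL row.
select id, count(*) as cnt
from nullidtable
group by id
order by cnt desc
limit 10;
```

If the NULL group dominates the output, the null key is the likely cause of the skew, and the filter in step (5) applies.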
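2. Null key conversion
Sometimes a null key has many rows, but those rows are not bad data and must appear in the join result. In that case we can assign a random value to the null key field in table a, so that the rows are spread randomly and evenly across different reducers. For example:
Hands-on example
Without randomly distributing null values
(1) Set the number of reducers to 5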
hive (default)> set mapreduce.job.reduces = 5;
(2) Join the two tables
hive (default)> insert overwrite table jointable
select n.* from nullidtable n
left join ori b on n.id = b.id;
Result: data skew is visible. Some reducers consume far more resources than the others.
With randomly distributed null values
(1) Set the number of reducers to 5
hive (default)> set mapreduce.job.reduces = 5;
(2) Join the two tables
hive (default)> insert overwrite table jointable
select n.* from nullidtable n full join ori o on
case when n.id is null then concat('hive', rand())
else n.id end = o.id;
Result: the data skew is eliminated, and resource consumption is balanced across the reducers.
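The join condition above mixes a string (the salted value) with a bigint (n.id) inside one CASE expression, and how Hive resolves that can depend on the version. The variant below is a sketch that makes the types explicit by casting both sides to string. The casts are an assumption added for illustration and are not part of the original example.

```sql
-- Same join as above, with both branches of the CASE and the right-hand key cast to string.
-- A salted key such as 'hive0.37...' never equals a numeric id, so null rows match nothing
-- but still hash to different reducers.
insert overwrite table jointable
select n.*
from nullidtable n
full join ori o
on case when n.id is null then concat('hive', cast(rand() as string))
        else cast(n.id as string) end = cast(o.id as string);
```

Comparing the keys as strings also avoids an implicit numeric conversion on large bigint ids, although the extra casts add some cost per row.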