select t1.id
from table_a t1
left join table_b t2
on t1.id = t2.id
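If the join key t1.id of the driving table
contains too many NULL values, data skew may result: all NULL keys are sent to the same reducer, even though NULL never matches anything in the join.
The fix is as follows (replace NULL with a random value, which spreads those rows across reducers without changing the result):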
select t1.id
from table_a t1
left join table_b t2
on nvl(t1.id, rand()) = t2.id
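Note that rand() returns a DOUBLE, so the expression above mixes a DOUBLE into what may be an integer key. A minimal sketch of an alternative, assuming t1.id is a BIGINT and table_b contains no negative ids (both are assumptions, not stated above): map each NULL to a random negative integer, so the key type stays the same and the NULL rows are spread over up to 1000 distinct keys. The bucket count 1000 is an arbitrary illustrative value.

```sql
-- Assumption: ids are BIGINT and table_b has no negative ids,
-- so the substitute keys -1 .. -1000 never match a real row.
select t1.id
from table_a t1
left join table_b t2
on nvl(t1.id, -1 - cast(floor(rand() * 1000) as bigint)) = t2.id
```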
select
/*+ MAPJOIN(t2)*/
t1.id
from table_a t1
left join table_b t2
on t1.id = t2.id
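PS. If the map join does not take effect, see the article Hive MapJoin. Note that in a LEFT JOIN Hive streams the left table and can only load the right-hand table into memory, so the MAPJOIN hint has to name the small table on the right.
hive.auto.convert.join=false (disable automatic MAPJOIN conversion)
hive.ignore.mapjoin.hint=false (do not ignore the MAPJOIN hint)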
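As a rough illustration of how the two settings and the hint fit together in one session, here is a sketch that assumes table_b is the small table and fits in memory (an assumption made for the example):

```sql
-- Session-level settings: honor the hint instead of relying on auto conversion.
set hive.auto.convert.join=false;
set hive.ignore.mapjoin.hint=false;

-- table_b (assumed small) is loaded into memory; table_a is streamed.
select
/*+ MAPJOIN(t2)*/
t1.id
from table_a t1
left join table_b t2
on t1.id = t2.id;
```

Running EXPLAIN on the query should show a Map Join Operator in the plan if the hint was applied.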
-- Example
select
count(1)
from table_a t1
inner join table_b t2
on t1.id = t2.id
-- Fix: aggregate table_a by id first, so far fewer rows reach the join
select
sum(t1.pv)
from (
select
id,
count(1) pv
from table_a
group by id
) t1
inner join table_b t2
on t1.id = t2.id
-- If table_a has too many rows with id=0, and id=0 does `not` exist in table_b
select t1.*
from table_a t1
left join table_b t2
on if(t1.id=0, rand(), t1.id) = t2.id
-- If table_a has too many rows with id=0, and id=0 `does` exist in table_b
select
t1.*
from table_a t1
left join table_b t2
on t1.id = t2.id
where t1.id <> 0
union all
select
/*+ MAPJOIN(t2)*/
t1.*
from table_a t1
left join table_b t2
on t1.id = t2.id
where t1.id = 0
PS. Joining on keys of different data types can also cause data skew. Run EXPLAIN on the SQL statement first to check the data types of the join keys.
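A minimal sketch of that check and of one possible fix, assuming for illustration that table_a.id is BIGINT while table_b.id is STRING (hypothetical types, not stated in this post):

```sql
-- Inspect the plan to see the join key expressions and any implicit conversion.
explain
select t1.id
from table_a t1
left join table_b t2
on t1.id = t2.id;

-- Make the conversion explicit so both sides are compared as STRING.
select t1.id
from table_a t1
left join table_b t2
on cast(t1.id as string) = t2.id;
```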
select
column_1,
count(distinct column_2)
from table_a
group by column_1
If the (column_1, column_2) combination contains a large number of duplicate rows, deduplicate first and then GROUP BY.
The fix is as follows:
select
column_1,
count(column_2) -- count(distinct column_2) ignores NULL, so count only non-NULL values
from
(
select
distinct column_1, column_2
from table_a
distribute by column_1, column_2
) t
group by column_1
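For example, in user behavior logs, abnormal IDFA values (all zeros or empty) can account for a large share of the data. If dropping them does not affect the final statistics, simply filter them out.
PS. The same applies to JOIN conditions.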
select
idfa,
count(1)
from app_user_log
where idfa <> '00000000-0000-0000-0000-000000000000'
and idfa <> ''
group by idfa
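The JOIN case mentioned in the PS can be sketched the same way. The table app_user_profile and its column user_id are hypothetical names introduced only for this example; the idea is to drop the abnormal IDFA values before the rows reach the join.

```sql
-- app_user_profile and user_id are hypothetical, used only for illustration.
select
t1.idfa,
t2.user_id
from (
select idfa
from app_user_log
where idfa <> '00000000-0000-0000-0000-000000000000'
and idfa <> ''
) t1
inner join app_user_profile t2
on t1.idfa = t2.idfa
```

Filtering in a subquery keeps the abnormal rows out of the shuffle entirely. This only works when those rows are not needed in the result.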
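See the earlier blog post "Getting the list of longest-running jobs on YARN", and check whether data skew is present.
Reference article: Data skew in Hive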