Row explosion (duplicated data) in Hive joins

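What is row explosion during a join?
For example: we want to tag every user in the left table as a new user or not. The left table has 100 users, but after joining it to the right table the result contains 200 rows or even more.
What causes this explosion?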

Let's look at an example:
spark-sql> with test1 as
         > (select '10001' as uid,'xiaomi' as qid
         > union all
         > select '10002' as uid,'huawei' as qid
         > union all
         > select 'null' as uid,'oppo' as qid
         > union all
         > select 'null' as uid,'appstore' as qid),
         > 
         > test2 as
         > (select '10001' as uid
         > union all
         > select '10002' as uid
         > union all
         > select 'null' as uid
         > union all
         > select 'null' as uid)
         > 
         > select 
         >   t1.uid,
         >   t1.qid,
         >   case when t2.uid is not null then 'new' else 'old' end as user_type
         > from 
         > (select 
         >   uid,
         >   qid
         > from test1) t1
         > 
         > left join
         > (select 
         >   uid
         > from test2) t2
         > on t1.uid=t2.uid;
10001   	xiaomi  	new                                                             
10002		huawei		new
null		oppo		new
null		oppo		new
null		appstore	new
null		appstore	new
Time taken: 16.026 seconds, Fetched 6 row(s)

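We expected 4 rows, but the query actually returned 6.
What causes this? Both tables contain the key 'null' more than once. Note that in this example it is the string 'null', not a SQL NULL; a real NULL never satisfies '=' in a join condition, so it would not match at all. Each of the 2 'null' rows on the left matches both 'null' rows on the right, so that key yields 2 x 2 = 4 rows instead of 2. This is a Cartesian product within the duplicated key, and it is what inflates the row count.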
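
One way to confirm this diagnosis is to count how often each join key occurs on each side. The query below is only a sketch against the same test1 and test2 data as the example; the aliases l and r and the column names left_cnt, right_cnt, and joined_rows are made up for illustration.

```sql
-- Sketch: find join keys that are duplicated on the right side and
-- estimate how many rows each one produces after the join.
with test1 as
(select '10001' as uid, 'xiaomi' as qid
 union all
 select '10002' as uid, 'huawei' as qid
 union all
 select 'null' as uid, 'oppo' as qid
 union all
 select 'null' as uid, 'appstore' as qid),

test2 as
(select '10001' as uid
 union all
 select '10002' as uid
 union all
 select 'null' as uid
 union all
 select 'null' as uid)

select
  l.uid,
  l.left_cnt,
  r.right_cnt,
  -- a key seen m times on the left and n times on the right yields m * n rows
  l.left_cnt * r.right_cnt as joined_rows
from
(select uid, count(*) as left_cnt from test1 group by uid) l
join
(select uid, count(*) as right_cnt from test2 group by uid) r
on l.uid = r.uid
where r.right_cnt > 1;
```

For this data the only key returned is 'null', with left_cnt = 2, right_cnt = 2, and joined_rows = 4, which accounts for the two extra rows in the result.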

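Solutions:
1. Exclude the null values from both tables; how to exclude them depends on the actual application scenario.
2. Deduplicate the right table so that at most one null value remains; the join then no longer multiplies rows.
3. In practice, values such as '', ' ', 'null', 'NULL', and null can be filtered out before the join, depending on the actual application scenario.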
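
The options above could be combined roughly as follows. This is a minimal sketch on the same test data, not the only correct form: it deduplicates the right table (option 2) and also drops placeholder keys from it (option 3). Which placeholder values to exclude is an assumption here and should follow the real data.

```sql
-- Sketch: deduplicate the right table and drop placeholder keys before joining.
with test1 as
(select '10001' as uid, 'xiaomi' as qid
 union all
 select '10002' as uid, 'huawei' as qid
 union all
 select 'null' as uid, 'oppo' as qid
 union all
 select 'null' as uid, 'appstore' as qid),

test2 as
(select '10001' as uid
 union all
 select '10002' as uid
 union all
 select 'null' as uid
 union all
 select 'null' as uid)

select
  t1.uid,
  t1.qid,
  case when t2.uid is not null then 'new' else 'old' end as user_type
from test1 t1

left join
(select distinct uid              -- one row per key, so no multiplication
 from test2
 where uid is not null            -- real SQL NULL
   and trim(uid) <> ''            -- '' and ' '
   and lower(trim(uid)) <> 'null' -- 'null' and 'NULL' stored as strings
) t2
on t1.uid = t2.uid;
```

With this data the query returns 4 rows, one per row of test1. Because the placeholder keys are removed from the right side, the two 'null' rows no longer match and are labeled 'old'. If they should still count as matched, keep only the select distinct and drop the where clause.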
