HIVE去重问题

代码1:查询重复记录

SELECT * FROM tb a WHERE (a.account) IN
(SELECT account FROM tb GROUP BY account HAVING COUNT(*) > 1) LIMIT 2;

代码2:查询记录数大于2的记录数目

SELECT COUNT(mobile) AS duplicatetotal,mobile FROM db.tb
GROUP BY mobile HAVING COUNT(*)>1 ORDER BY duplicatetotal DESC;

代码3:去重操作

DROP TABLE IF EXISTS db.tb_new;
CREATE TABLE IF NOT EXISTS db.tb_new AS
SELECT * FROM
(SELECT *, row_number() OVER (PARTITION BY mobile ORDER BY created_at DESC) AS num FROM
(SELECT * FROM db.tb_old WHERE mobile REGEXP "^[1][34578][0-9]{9}$") tmp1 ) t
WHERE t.num=1;

说明:mobile为除重依据, created_at 为排序依据; DESC留下最近的一条记录。

代码4:查询电话数目为3的记录

SELECT account, COUNT(account) AS duplicatetotal FROM db.tb 
GROUP BY account HAVING COUNT(*)=3 LIMIT 10;

代码5:

hiveContext.sql("SELECT COUNT(*) AS count_sum, user_profile_id FROM stage_data.temp_table GROUP BY user_profile_id HAVING COUNT(*) > 1").show()

代码6:

hiveContext.sql("SELECT COUNT(*) FROM stage_data.temp_table WHERE LENGTH(TRIM(user_profile_id)) = 0 ").show()

去除user_profile_id两边的空格之后,计算长度。若长度为0,则表明该字段为空。


后记:

  1. left join之后,发现左表的记录数多出来了,一般是由于左表partition之后,NULL行也被当做单独一行,导致左表记录数增加。查询该记录用
SELECT * FROM temp_table WHERE user_profile_id IS NULL;

去除该记录的方法是用原来的表筛选用 IS NOT NULL;

  1. join表之前一般不要进行去重操作,join完之后再进行去重操作,效率会提高。
  2. 关联表之前需要检查左右表关联字段是否有重复字段,若存在重复字段则可能产生一对多问题。

完。

你可能感兴趣的:(HIVE去重问题)