Converting count(distinct) to group by in Hive

In Hive, we try to avoid count(distinct) because it can cause data skew, and we usually replace it with group by. This post gives a brief walkthrough.

  • Test table schema
CREATE TABLE IF NOT EXISTS test_01 (
school_id int COMMENT 'school id',
name string COMMENT 'name',
level string COMMENT 'overall rating',
class_name string COMMENT 'class'
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  • Test table data
hive (test)> select * from test_01;
OK
1	Lucy	s	c_01
1	Marry	s	c_01
1	Jack	a	c_02
1	Tom	a	c_02
1	Rose	a	c_03
1	Curry	b	c_03
1	Tonny	c	c_03
  • Count the distinct overall ratings and classes per school, using count(distinct):
hive (test)> select 
school_id,
count(distinct level) ,
count(distinct class_name) 
from test_01 group by school_id;

OK
1	4	3
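
For a single column, the usual rewrite is a two-level group by: the inner query deduplicates the (school_id, level) pairs and the outer query counts them. The query below is a minimal sketch against test_01 and was not part of the original test run; the alias level_cnt is made up for illustration. On the sample data above it should return 1 and 4.

```sql
-- Inner query: one row per distinct (school_id, level) pair.
-- Outer query: count those rows per school.
-- count(level) is used instead of count(*) so that a NULL level is skipped,
-- which matches how count(distinct level) treats NULLs.
select
  school_id,
  count(level) as level_cnt
from (
  select school_id, level
  from test_01
  group by school_id, level
) t
group by school_id;
```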
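  • Result using group by. Here we trade space for time to turn count(distinct) into a group by query: each column is tagged with a type and stacked into a single column col, union removes the duplicate rows, and the outer group by counts what remains for each type.
-- Trade space for time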
hive (test)> select 
school_id,
count(case when type = 'a' then 1 else null end) as num1,
count(case when type = 'b' then 1 else null end) as num2 from (
select school_id,level as col,'a' as type from test_01
union 
select school_id,class_name as col,'b' as type from test_01) t
group by school_id;

OK
1	4	3

This matches the count(distinct) result, as expected.

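Note: do not use union all here. It keeps duplicate rows, so every input row gets counted and the result no longer matches count(distinct):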

select 
school_id,
count(case when type = 'a' then 1 else null end) as num1,
count(case when type = 'b' then 1 else null end) as num2 from (
select school_id,level as col,'a' as type from test_01
union all
select school_id,class_name as col,'b' as type from test_01) t
group by school_id;

OK
1	7	7
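
If union all has to be used (Hive versions before 1.2.0 only support union all), one possible workaround is to deduplicate each branch with group by before stacking them. This is a sketch under the same assumptions as the queries above, not something from the original test run; on the sample data it should again return 1, 4, 3.

```sql
-- Each branch is deduplicated with its own group by,
-- so union all no longer reintroduces duplicate rows.
select
  school_id,
  count(case when type = 'a' then col end) as num1,  -- distinct levels
  count(case when type = 'b' then col end) as num2   -- distinct classes
from (
  select school_id, level as col, 'a' as type
  from test_01
  group by school_id, level
  union all
  select school_id, class_name as col, 'b' as type
  from test_01
  group by school_id, class_name
) t
group by school_id;
```

Counting col instead of a constant also skips NULL values, which is how count(distinct) behaves.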
