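In Hive, to group rows by one field and merge the values of another field, use collect_list() or collect_set().
collect_set() and collect_list() are aggregate functions that turn multiple rows into one value: each returns the values of a column within a group as a single array. To get a delimited string instead of an array, they are often combined with concat_ws().
Difference between collect_set() and collect_list():
collect_list() keeps duplicate values and collect_set() removes them, much like a list versus a set in Python.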
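The list-versus-set analogy can be made concrete with a few lines of Python. This is only an illustration of the semantics, not Hive code; the sample values are taken from the rule column used later, and the sorted() call is there purely to make the printout stable, since neither a Python set nor collect_set() promises any particular order.

```python
# Values of one column inside a single group, in arrival order.
rules = ["511", "512", "511"]

# Like collect_list(): every value is kept, duplicates included.
as_list = list(rules)

# Like collect_set(): duplicates are dropped.
# sorted() is used only so that the printed order is predictable.
as_set = sorted(set(rules))

print(as_list)  # ['511', '512', '511']
print(as_set)   # ['511', '512']
```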
Create the test table and insert data
drop table if exists test_1;
create table test_1(
id string,
cur_day string,
rule string
)
row format delimited fields terminated by ',';
insert into test_1 values
('a','20230809','501'),('a','20230811','502'),('a','20230812','503'),('a','20230812','501'),('a','20230813','512'),
('b','20230809','511'),('b','20230811','512'),('b','20230812','513'),('b','20230812','511'),('b','20230813','512'),('b','20230809','511'),
('c','20230811','512'),('c','20230812','513'),('c','20230812','511'),('c','20230813','512');
Example 1: group by id and cur_day and collect every rule in each group (duplicates kept).
select id,cur_day,collect_list(rule) as rule_total from test_1 group by id,cur_day order by id,cur_day;
Example 2: group by id and cur_day and collect every rule in each group (duplicates removed).
select id,cur_day,collect_set(rule) as rule_total from test_1 group by id,cur_day order by id,cur_day;
-- Take the first element of each array by index.
select id,cur_day,collect_list(rule)[0] as rule_one from test_1 group by id,cur_day order by id,cur_day;
select id,cur_day,collect_set(rule)[0] as rule_one from test_1 group by id,cur_day order by id,cur_day;
-- Join each array into a '|'-delimited string with concat_ws().
select id,cur_day,concat_ws('|',collect_list(rule)) as rule_total from test_1 group by id,cur_day order by id,cur_day;
select id,cur_day,concat_ws('|',collect_set(rule)) as rule_total from test_1 group by id,cur_day order by id,cur_day;
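To see what the grouping, indexing and concat_ws() queries do, here is a rough Python sketch over four rows of test_1. The helpers concat_ws and dedupe are hypothetical stand-ins written for this illustration, not Hive APIs, and the element order shown is simply Python's insertion order; Hive itself gives no such guarantee.

```python
from collections import defaultdict

# Four rows of test_1 as (id, cur_day, rule) tuples.
rows = [
    ("a", "20230812", "503"),
    ("a", "20230812", "501"),
    ("b", "20230809", "511"),
    ("b", "20230809", "511"),
]

# group by id, cur_day: collect the rule values of each group.
groups = defaultdict(list)
for id_, cur_day, rule in rows:
    groups[(id_, cur_day)].append(rule)


def concat_ws(sep, values):
    """Stand-in for concat_ws(sep, array<string>)."""
    return sep.join(values)


def dedupe(values):
    """Stand-in for collect_set(): drop duplicates, keep first-seen order."""
    return list(dict.fromkeys(values))


# For each group: first element, joined list, joined de-duplicated list.
summary = {}
for key in sorted(groups):
    collected = groups[key]
    summary[key] = (
        collected[0],
        concat_ws("|", collected),
        concat_ws("|", dedupe(collected)),
    )
    print(key, summary[key])
```

For the ('b', '20230809') group the list form gives 511|511 while the set form gives 511, which is the whole difference between the two functions.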
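New requirement: group strictly by id, sort the rules by time in ascending order, and use collect_list() to output cur_day and rule so that both follow ascending time order and correspond one to one.
1. Source data: the rows of test_1 inserted above.
2. Required output: grouped by id, with cur_day and rule each collected into a list ordered by cur_day ascending, so that the n-th cur_day and the n-th rule come from the same row.
3. Approach: number the rows with row_number() over(partition by id order by cur_day asc), then aggregate with collect_list() (or collect_set() when duplicates should be dropped).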
drop table if exists test_2;
create table test_2 as
select id,collect_list(cur_day) as cur_day_list,collect_list(rule) as rule_list
from (
select t.id,t.cur_day,t.rule,row_number() over(partition by id order by cur_day asc) rn from test_1 t
)t group by id ;
select * from test_2 order by id;
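4. Problem found: the dates in the cur_day array are not in ascending order.
5. Cause: row_number() only adds a sequence number to each row. It does not control the order in which rows reach collect_list(), and collect_list() makes no ordering guarantee of its own, so the array simply follows the order in which rows arrive at the aggregation.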
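One way to picture the problem is the Python sketch below. It assumes the rows of id 'c' have already been numbered, and it uses a made-up arrival order to stand in for whatever order the rows happen to reach the aggregation in; the rn values for the two 20230812 rows are likewise an assumption, because ties on cur_day can be numbered either way.

```python
# (cur_day, rule, rn) for id 'c'; rn is what row_number() might assign.
numbered = [
    ("20230811", "512", 1),
    ("20230812", "513", 2),
    ("20230812", "511", 3),
    ("20230813", "512", 4),
]

# Hypothetical arrival order at the aggregation step.
arrival = [numbered[2], numbered[0], numbered[3], numbered[1]]

# What a collect_list(cur_day) over these rows would contain.
collected_days = [day for day, _rule, _rn in arrival]
print(collected_days)
print(collected_days == sorted(collected_days))  # False: rn was never used

# Sorting by rn before collecting restores the intended order.
ordered_days = [day for day, _rule, _rn in sorted(arrival, key=lambda r: r[2])]
print(ordered_days)
print(ordered_days == sorted(ordered_days))      # True
```

The sequence number is present on every row, but nothing orders the rows by it unless the query asks for that explicitly.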
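6. Solutions: the approaches below either sort the rows before aggregating or sort the array after aggregating.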
-- Approach 1: sort the numbered rows by rn in a subquery before aggregating.
drop table if exists test_2;
create table test_2 as
select id,collect_list(cur_day) as cur_day_list,collect_list(rule) as rule_list
from (
select t.* from(
select t.id,t.cur_day,t.rule,row_number() over(partition by id order by cur_day asc) rn from test_1 t
) t order by rn
)t group by id ;
select * from test_2 order by id;
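-- Approach 2: distribute the rows by id and sort them by cur_day within each reducer before aggregating.
select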
id,collect_list(cur_day),collect_list(rule)
from(
select
t.id,t.cur_day,t.rule
,row_number()over(partition by id order by cur_day asc) as rn
from(
select
t.id,t.cur_day,t.rule
from test_1 t
distribute by id sort by cur_day asc
)t
)t
group by id order by id;
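-- Approach 3: prefix each rule with its zero-padded rn, sort the array with sort_array(), then strip the prefix.
-- cur_day is a fixed-width yyyymmdd string, so sort_array() alone puts it in chronological order.
select
id,concat_ws(',',sort_array(collect_list(cur_day))) as cur_day_list,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws('|' ,lpad(cast(rn as string),2,'0') ,rule)))),'\\d+\\|','') as rule_list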
from(
select t.*
from(
select
id,cur_day,rule,
row_number()over(partition by id order by cur_day asc) as rn
from test_1
)t order by rn
)t group by id order by id;
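Notes on the functions used above:
lpad(str,len,pad): pads the sequence number rn (not rule) on the left. sort_array() compares strings character by character, so '11' sorts before '2'; padding 1,2,3,4 to 01,02,03,04 makes the string order match the numeric order. A width of 2 covers at most 99 rows per id, so use a larger len when a group can hold more rows.
regexp_replace(strA,strB,strC): replaces every part of strA that matches the Java regular expression strB with strC. Before sorting, the sequence number is joined to the rule with '|'; after sorting, the pattern '\\d+\\|' removes the sequence number together with the '|'.
sort_array(expr[, ascendingOrder]): sorts the array in ascending order by default. Where the optional second argument is supported (it is part of the Spark SQL signature; Hive's built-in sort_array may accept only the array, depending on the version), true sorts ascending and false sorts descending.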
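The interplay of the three functions is easier to follow on a tiny example. The Python sketch below is an approximation, not Hive code: lpad here is a hypothetical helper that handles a single-character pad only, sorted() stands in for sort_array(), re.sub() stands in for regexp_replace(), and the (rn, rule) pairs are invented for the illustration.

```python
import re


def lpad(s, length, pad):
    """Rough stand-in for lpad() with a one-character pad.

    Like the SQL function, it shortens s when s is longer than length.
    """
    if len(s) >= length:
        return s[:length]
    return pad * (length - len(s)) + s


# Hypothetical (rn, rule) pairs; rn 11 is there to show the sorting issue.
pairs = [(1, "501"), (2, "502"), (11, "503")]

# Without padding, string order differs from numeric order.
unpadded = sorted(f"{rn}|{rule}" for rn, rule in pairs)
print(unpadded)  # ['11|503', '1|501', '2|502']

# With padding, string order matches numeric order.
padded = sorted(f"{lpad(str(rn), 2, '0')}|{rule}" for rn, rule in pairs)
print(padded)    # ['01|501', '02|502', '11|503']

# Join the sorted array, then strip each "<digits>|" prefix.
joined = ",".join(padded)
stripped = re.sub(r"\d+\|", "", joined)
print(stripped)  # 501,502,503

# Descending order, as a second sort_array() argument of false would give
# where that argument is supported.
descending = sorted(padded, reverse=True)
print(descending)  # ['11|503', '02|502', '01|501']
```

The unpadded result shows why the prefix needs a fixed width: '11|503' ends up ahead of '1|501' and '2|502'.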
select id
,concat_ws(',',sort_array(collect_list(concat_ws('|' ,lpad(cast(rn as string),2,'0') ,rule)))) as middle_value -- intermediate value
,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws('|' ,lpad(cast(rn as string),2,'0') ,rule)))),'\\d+\\|','') as result_values -- final result
from(
select t.*
from(
select
id,cur_day,rule,
row_number()over(partition by id order by cur_day asc) as rn
from test_1
)t order by rn
)t group by id order by id;
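As a cross-check of what the last query is meant to produce, the following Python sketch replays the same steps on the test_1 data. It is a simulation under one explicit assumption: rows that share a cur_day are numbered in insertion order (Python's sorted() is stable), whereas Hive may number such ties either way, so the position of rules with the same cur_day can differ on a real cluster.

```python
import re
from itertools import groupby

# The rows inserted into test_1, as (id, cur_day, rule).
rows = [
    ("a", "20230809", "501"), ("a", "20230811", "502"), ("a", "20230812", "503"),
    ("a", "20230812", "501"), ("a", "20230813", "512"),
    ("b", "20230809", "511"), ("b", "20230811", "512"), ("b", "20230812", "513"),
    ("b", "20230812", "511"), ("b", "20230813", "512"), ("b", "20230809", "511"),
    ("c", "20230811", "512"), ("c", "20230812", "513"), ("c", "20230812", "511"),
    ("c", "20230813", "512"),
]


def number_rows(group_rows):
    """Mimic row_number() over(partition by id order by cur_day asc).

    Ties on cur_day keep insertion order here; that is an assumption.
    """
    ordered = sorted(group_rows, key=lambda r: r[1])
    return [(rule, rn) for rn, (_id, _day, rule) in enumerate(ordered, start=1)]


result = {}
by_id = sorted(rows, key=lambda r: r[0])
for id_, grp in groupby(by_id, key=lambda r: r[0]):
    numbered = number_rows(list(grp))
    # lpad(rn, 2, '0') + '|' + rule, then sort_array().
    tagged = sorted(f"{rn:02d}|{rule}" for rule, rn in numbered)
    middle_value = ",".join(tagged)
    # regexp_replace(..., '\\d+\\|', '')
    result_values = re.sub(r"\d+\|", "", middle_value)
    result[id_] = (middle_value, result_values)
    print(id_, middle_value, result_values)
```

Under that tie-breaking assumption, id a yields the intermediate value 01|501,02|502,03|503,04|501,05|512 and the final value 501,502,503,501,512.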