由collect_list形成的列表经过concat_ws拼接后顺序具有随机性,要保证列表有序只需要在生成列表后使用sort_array函数进行排序即可。sort_array就是对array进行排序,且只能升序。
我在这里举一个完整的例子和代码:
如果,我们有如下的数据集(借助了参考文献1的数据),我们希望对memberid进行分组,依照legcount的顺序,对airways进行行转列。
memberid | airways | legcount |
1 | A | 3 |
1 | B | 2 |
1 | D | 1 |
1 | C | 4 |
2 | C | 4 |
2 | D | 3 |
如果我们直接这么写:
SELECT
memberid,
collect_list(cast(airways as string)),
concat_ws(',',collect_list(cast(airways as string)))
from
(
select 1 as memberid, 'A' as airways, 3 as legcount
union ALL
select 1 as memberid, 'B' as airways, 2 as legcount
union ALL
select 1 as memberid, 'D' as airways, 1 as legcount
union ALL
select 1 as memberid, 'C' as airways, 4 as legcount
union ALL
select 2 as memberid, 'C' as airways, 4 as legcount
union ALL
select 2 as memberid, 'D' as airways, 3 as legcount
) as t
group by
memberid
结果如下:
memberid | _c1 | _c2 |
1 | ["A","B","D","C"] | A,B,D,C |
2 | ["C","D"] | C,D |
产生这个问题的根本原因自然在MapReduce,如果启动了多于一个mapper/reducer来处理数据,select出来的数据顺序就几乎肯定与原始顺序不同了。考虑把mapper数固定成1比较麻烦),也不现实,所以要迂回地解决问题:把legcount加进来再进行一次排序,拼接完之后把legcount去掉。如下:
完成的代码如下:
SELECT
memberid,
regexp_replace(concat_ws('-', sort_array(collect_list(concat_ws(':',
cast(legcount as string), airways)))), '\\d\:', '') c5
from
(
select 1 as memberid, 'A' as airways, 3 as legcount
union ALL
select 1 as memberid, 'B' as airways, 2 as legcount
union ALL
select 1 as memberid, 'D' as airways, 1 as legcount
union ALL
select 1 as memberid, 'C' as airways, 4 as legcount
union ALL
select 2 as memberid, 'C' as airways, 4 as legcount
union ALL
select 2 as memberid, 'D' as airways, 3 as legcount
) as t
group by
memberid
结果为:
memberid | c5 |
1 | D-B-A-C |
2 | D-C |
大家肯定对结果比较懵逼,我们拆开代码 ,看看中间都输出了什么:
SELECT
memberid,
collect_list(concat_ws(':', cast(legcount as string), airways)) c2,
sort_array(collect_list(concat_ws(':', cast(legcount as string), airways))) c3,
concat_ws('-', sort_array(collect_list(concat_ws(':', cast(legcount as string), airways)))) c4,
regexp_replace(concat_ws('-', sort_array(collect_list(concat_ws(':', cast(legcount as string), airways)))), '\\d\:', '') c5
from
(
select 1 as memberid, 'A' as airways, 3 as legcount
union ALL
select 1 as memberid, 'B' as airways, 2 as legcount
union ALL
select 1 as memberid, 'D' as airways, 1 as legcount
union ALL
select 1 as memberid, 'C' as airways, 4 as legcount
union ALL
select 2 as memberid, 'C' as airways, 4 as legcount
union ALL
select 2 as memberid, 'D' as airways, 3 as legcount
) as t
group by
memberid
结果如下:
memberid | c2 | c3 | c4 | c5 |
1 | ["1:D","2:B","3:A","4:C"] | ["1:D","2:B","3:A","4:C"] | 1:D-2:B-3:A-4:C | D-B-A-C |
2 | ["3:D","4:C"] | ["3:D","4:C"] | 3:D-4:C | D-C |
本质上:
我们将legcount加入到了airways里,进行了一次数组的排序。
需要注意的是:rank列必须要在高位补足够的0对齐,因为排序的是字符串而不是数字,如果不补0的话,按字典序排序就会变成1, 10, 11, 12, 13, 2, 3, 4...
,又不对了。
【1】Hive | 用sort_array函数解决collet_list列表排序混乱问题
【2】HiveQL collect_list保持顺序小记