Fixing scrambled collect_list ordering in Hive: sort_array

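A list built by collect_list and then joined with concat_ws comes out in an unpredictable order. To guarantee an ordered list, apply sort_array to the list right after it is built. sort_array sorts an array, and it can only sort in ascending order.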
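As a quick illustration of what sort_array does on its own, here is a minimal sketch. The literal array and the column alias `sorted` are made up for this example and do not come from the dataset used below.

```sql
-- sort_array returns a copy of the array sorted in ascending natural order.
-- For strings this means lexicographic (dictionary) order.
SELECT sort_array(array('B', 'D', 'A', 'C')) AS sorted;
-- expected: ["A","B","C","D"]
```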


Here is a complete example with code:

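Suppose we have the following dataset (borrowed from reference [1]). We want to group by memberid and, following the order of legcount, collapse the airways rows of each group into a single delimited value.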

memberid airways legcount
1 A 3
1 B 2
1 D 1
1 C 4
2 C 4
2 D 3

If we simply write:

SELECT
	memberid,
    collect_list(cast(airways as string)),
	concat_ws(',',collect_list(cast(airways as string)))
from
	(
		select 1 as memberid, 'A' as airways, 3 as legcount
		
		union ALL
		
		select 1 as memberid, 'B' as airways, 2 as legcount
		
		union ALL
		
		select 1 as memberid, 'D' as airways, 1 as legcount
		
		union ALL
		
		select 1 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'D' as airways, 3 as legcount
	) as t
group by
	memberid

the result is:

memberid _c1 _c2
1 ["A","B","D","C"] A,B,D,C
2 ["C","D"] C,D

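The root cause is, unsurprisingly, MapReduce: once more than one mapper/reducer processes the data, the order of the selected rows will almost certainly differ from the original order. Pinning the number of mappers to 1 is cumbersome and unrealistic, so we take a detour instead: prepend legcount to each element, sort the array, and strip legcount off again after concatenation.

The complete query: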

SELECT
	memberid,
	regexp_replace(concat_ws('-', sort_array(collect_list(concat_ws(':',
                         cast(legcount as string), airways)))), '\\d+:', '') c5
from
	(
		select 1 as memberid, 'A' as airways, 3 as legcount
		
		union ALL
		
		select 1 as memberid, 'B' as airways, 2 as legcount
		
		union ALL
		
		select 1 as memberid, 'D' as airways, 1 as legcount
		
		union ALL
		
		select 1 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'D' as airways, 3 as legcount
	) as t
group by
	memberid

The result is:

memberid c5
1 D-B-A-C
2 D-C

The result may look puzzling at first, so let's break the query apart and look at what each intermediate step outputs:

SELECT
	memberid,
	collect_list(concat_ws(':', cast(legcount as string), airways)) c2,
	sort_array(collect_list(concat_ws(':', cast(legcount as string), airways))) c3,
	concat_ws('-', sort_array(collect_list(concat_ws(':', cast(legcount as string), airways)))) c4,
	regexp_replace(concat_ws('-', sort_array(collect_list(concat_ws(':', cast(legcount as string), airways)))), '\\d+:', '') c5
from
	(
		select 1 as memberid, 'A' as airways, 3 as legcount
		
		union ALL
		
		select 1 as memberid, 'B' as airways, 2 as legcount
		
		union ALL
		
		select 1 as memberid, 'D' as airways, 1 as legcount
		
		union ALL
		
		select 1 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'C' as airways, 4 as legcount
		
		union ALL
		
		select 2 as memberid, 'D' as airways, 3 as legcount
	) as t
group by
	memberid

The result is:

memberid c2 c3 c4 c5
1 ["1:D","2:B","3:A","4:C"] ["1:D","2:B","3:A","4:C"] 1:D-2:B-3:A-4:C D-B-A-C
2 ["3:D","4:C"] ["3:D","4:C"] 3:D-4:C D-C

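In essence:

we prepend legcount to airways as a sort key (c2), sort the array of combined strings (c3), join it (c4), and finally strip the key off with regexp_replace (c5).

One caveat: the sort-key column (legcount here) must be left-padded with enough zeros to give every key the same width, because what gets sorted is strings, not numbers. Without the padding, lexicographic order yields 1, 10, 11, 12, 13, 2, 3, 4..., which is wrong again.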
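The zero-padding described above can be sketched with lpad. This is only a sketch under assumptions: the sample rows (including a two-digit legcount of 10), the padding width of 5 and the alias `airways_by_legcount` are invented for illustration. Pick a width at least as large as the longest key, since lpad truncates values longer than the requested length.

```sql
SELECT
	memberid,
	regexp_replace(
		concat_ws('-',
			sort_array(
				collect_list(
					-- pad the key to a fixed width so string order matches numeric order
					concat_ws(':', lpad(cast(legcount as string), 5, '0'), airways)
				)
			)
		),
		'\\d+:', ''   -- strip the whole padded key, not just a single digit
	) as airways_by_legcount
from
	(
		select 1 as memberid, 'A' as airways, 2 as legcount

		union ALL

		select 1 as memberid, 'B' as airways, 10 as legcount

		union ALL

		select 1 as memberid, 'C' as airways, 1 as legcount
	) as t
group by
	memberid
```

With these rows the padded keys sort as 00001:C, 00002:A, 00010:B, so the query should return C-A-B. Without lpad the keys would sort as 10:B, 1:C, 2:A. Note that the pattern `\d+:` also removes any digits directly followed by a colon inside the airways values themselves, so this sketch assumes the values contain no such sequence.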

 

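References

[1] Hive | 用sort_array函数解决collet_list列表排序混乱问题 (Hive: using sort_array to fix scrambled collect_list ordering)

[2] HiveQL collect_list保持顺序小记 (A note on preserving order with collect_list in HiveQL)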

 



 

 
