At work I ran into data stored in Map columns.
For example, the following (Data Table 1):
col1 | col2 |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} |
{13245:r3} | {34567:r1,87654:r3} |
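The task is to extract every key (24235, 98766, 65432, 35689, ...), count how many times each key occurs, and sum the values attached to each key (r2, r3, r1, ...) grouped by key.
The data is stored in a Hive table, and the scripts are written in Spark SQL or Hive SQL.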
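1. Spark SQL / Hive SQL provide three complex data types: Array, Map, and Struct.
Their values are accessed as follows. Array: by zero-based index (0, 1, ...). Map (key-value pairs): by ['key name']. Struct: fields are accessed with a dot (.).
This, however, assumes regularly structured data, as in Data Table 2: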
col1-array | col2-map | col3-struct |
[5,30,40] | {math:90,english:95,language:98} | {province:zhejiang,city:hangzhou,county:xihu,zip:310000} |
[40,50,60] | {math:80,english:90,language:90} | {province:beijing,city:beijing,county:chaoyang,zip:100000} |
select `col1-array`[0], `col1-array`[1], `col2-map`['math'], `col2-map`['english'], `col3-struct`.province, `col3-struct`.city from table ;
# Result:
5,30,90,95,zhejiang,hangzhou
40,50,80,90,beijing,beijing
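As a rough sketch of how a table shaped like Data Table 2 might be declared and queried: the table name demo_complex_types and the column names scores, grades, and address are hypothetical and chosen only for illustration. The built-in functions size, map_keys, and map_values are shown because they are handy for inspecting a map before exploding it.

```sql
-- Hypothetical table mirroring Data Table 2; every name here is illustrative.
CREATE TABLE IF NOT EXISTS demo_complex_types (
  scores  ARRAY<INT>,
  grades  MAP<STRING, INT>,
  address STRUCT<province:STRING, city:STRING, county:STRING, zip:STRING>
);

SELECT scores[0]          AS first_score,   -- array: zero-based index
       grades['math']     AS math_grade,    -- map: lookup by key
       address.province   AS province,      -- struct: dot notation
       size(grades)       AS entry_count,   -- number of entries in the map
       map_keys(grades)   AS subject_names, -- array of the map's keys
       map_values(grades) AS subject_scores -- array of the map's values
FROM demo_complex_types;
```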
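2. The Map columns in Data Table 1 have no fixed structure: the keys differ from row to row, and so does the number of entries in each map.
So the most convenient way to extract the keys and values is explode, or LATERAL VIEW explode.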
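(1) In Hive, explode used directly in the SELECT list must be the only expression there; selecting other columns alongside it raises an error.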
select col1 , explode(col1) from table;
The statement above fails with: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
select explode(col1) from table;
This statement runs and returns:
key value
24235 r2
98766 r3
13245 r3
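A small sketch for trying explode on a map without any table: the literal map below is made up from the sample keys, and the aliases map_key and map_value are hypothetical names that replace the default key and value column names. It should return one row per map entry.

```sql
-- Literal map used only for illustration; map_key and map_value are hypothetical aliases.
SELECT explode(map(24235, 'r2', 98766, 'r3')) AS (map_key, map_value);
```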
(2) To select other columns alongside the exploded keys and values, use LATERAL VIEW explode.
select col1,
col2,
col1_key,
col1_value,
col2_key,
col2_value
from table
lateral VIEW explode(col1) col1s AS col1_key,col1_value
lateral VIEW explode(col2) col2s AS col2_key,col2_value
where ....
Output (each input row yields one row per combination of a col1 entry and a col2 entry):
col1 | col2 | col1_key | col1_value | col2_key | col2_value |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} | 24235 | r2 | 65432 | r1 |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} | 98766 | r3 | 65432 | r1 |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} | 24235 | r2 | 35689 | r2 |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} | 98766 | r3 | 35689 | r2 |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} | 24235 | r2 | 24577 | r3 |
{24235:r2,98766:r3} | {65432:r1,35689:r2,24577:r3} | 98766 | r3 | 24577 | r3 |
{13245:r3} | {34567:r1,87654:r3} | 13245 | r3 | 34567 | r1 |
{13245:r3} | {34567:r1,87654:r3} | 13245 | r3 | 87654 | r3 |
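For the original requirement (how often each key occurs and the total of its values), chaining two LATERAL VIEW clauses would count each key once per combination, so one possible approach is to explode each column separately and stack the results with UNION ALL before grouping. This is only a sketch under assumptions: map_table is a hypothetical table name, and the map values are assumed to be numeric (r1, r2, r3 stand in for numbers).

```sql
-- Sketch only: map_table is a hypothetical name; values are assumed numeric.
SELECT k        AS map_key,
       COUNT(*) AS occurrences,  -- how many times the key appears across both columns
       SUM(v)   AS value_total   -- sum of the values attached to that key
FROM (
  -- explode col1 on its own, so its entries are not multiplied by col2's entries
  SELECT col1_key AS k, col1_value AS v
  FROM map_table
  LATERAL VIEW explode(col1) col1s AS col1_key, col1_value
  UNION ALL
  -- explode col2 on its own
  SELECT col2_key AS k, col2_value AS v
  FROM map_table
  LATERAL VIEW explode(col2) col2s AS col2_key, col2_value
) kv
GROUP BY k;
```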
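Note: combining LATERAL VIEW explode with a JOIN, written as below, raises an error.
The error is reported at the ON clause. I have not confirmed the cause; a likely explanation is that the LATERAL VIEW clauses sit between the joined table and its ON condition, a position the parser does not accept.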
select col1,
col2,
col1_key,
col1_value,
col2_key,
col2_value
from
table1 b
JOIN
table2 a
lateral VIEW explode(col1) col1s AS col1_key,col1_value
lateral VIEW explode(col2) col2s AS col2_key,col2_value
ON a.id =b.id
where ....
So, for now, the workaround is to do the explode inside a subquery and join against that:
select col1,
col2,
col1_key,
col1_value,
col2_key,
col2_value
from
table1 b
JOIN
(
select ...
from
table2
lateral VIEW explode(col1) col1s AS col1_key,col1_value
lateral VIEW explode(col2) col2s AS col2_key,col2_value
) a
ON a.id =b.id
where ....
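Another form that may work, at least in Spark SQL, is to finish the join (including its ON condition) first and put the LATERAL VIEW clauses after it, because Spark SQL's grammar places lateral views after all joined relations in the FROM clause. This is a sketch that assumes col1 and col2 belong to table2, as in the subquery above; I have not verified it on every Hive version.

```sql
-- Sketch assuming Spark SQL; col1 and col2 are assumed to be columns of table2 (alias a).
SELECT a.col1,
       a.col2,
       col1_key,
       col1_value,
       col2_key,
       col2_value
FROM table1 b
JOIN table2 a
  ON a.id = b.id                 -- complete the join first
LATERAL VIEW explode(a.col1) col1s AS col1_key, col1_value
LATERAL VIEW explode(a.col2) col2s AS col2_key, col2_value;
```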