Using lateral view with explode, and the collect_set function, in Hive

Original article: https://blog.csdn.net/guodong2k/article/details/79459282

    Outline:
    1. Overview
    2. explode usage example
    3. Why lateral view is needed
    4. explode with lateral view, example 1
    5. explode with lateral view, example 2
    6. collect_set() example
    7. substr() example
    8. concat_ws() example

1. Overview

Strictly speaking, explode and lateral view have no place in a relational database: they exist to operate on data that violates first normal form (every attribute should be atomic), which runs against relational design principles in both transactional systems and data warehouses. With the spread of big data, however, metrics such as pv and uv are often kept in non-relational stores, very commonly as JSON. When such data is loaded directly into a Hive-based warehouse, the ETL process has to parse it, and that is where explode and lateral view shine.

2. explode usage example

explode fans the elements of an array (or the entries of a map) out into rows. Hive does have native map, struct, and array column types, but they require the element types to be declared up front, so this example stores everything as plain strings instead:

2.1. Create the table:

drop table if exists explode_lateral_view;
create table explode_lateral_view(
`area` string,
`goods_id` string,
`sale_info` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS textfile;

The contents of one.txt:

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

2.2. Load and inspect the data:

hive -e "load data local inpath '/home/hdp-credit/workdir/yaoyingzhe/test_shell/one.txt' into table explode_lateral_view;"

or, from the Hive CLI:

hive>  load data local inpath '/home/hdp-credit/workdir/yaoyingzhe/test_shell/one.txt' into table explode_lateral_view;
Loading data to table hdp_credit.explode_lateral_view
Table hdp_credit.explode_lateral_view stats: [numFiles=1, totalSize=253]
OK
Time taken: 1.392 seconds

The table now contains:

hive> select * from explode_lateral_view;
OK
a:shandong,b:beijing,c:hebei    1,2,3,4,5,6,7,8,9    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Time taken: 0.384 seconds, Fetched: 1 row(s)
hive> desc explode_lateral_view;
OK
area                            string                                      
goods_id                string                                      
sale_info               string                                      
Time taken: 0.435 seconds, Fetched: 3 row(s)

2.3. Using explode:
To fan out the array-like field only, run select explode(split(goods_id,',')) as goods_id from explode_lateral_view;
For comparison, split alone returns the whole array in a single row:

hive> select split(goods_id,',') as goods_id from explode_lateral_view;
OK
["1","2","3","4","5","6","7","8","9"]

Time taken: 3.351 seconds, Fetched: 1 row(s)

hive> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;
OK
1
2
3
4
5
6
7
8
9
Time taken: 5.492 seconds, Fetched: 9 row(s)

2.4. Fanning out the map-like field:
Run select explode(split(area,',')) as area from explode_lateral_view;
Again, split alone returns one row holding the whole array, while explode produces one row per entry:

hive> select split(area,',') as area from explode_lateral_view;
OK
["a:shandong","b:beijing","c:hebei"]
hive> select explode(split(area,',')) as area from explode_lateral_view;
OK
a:shandong
b:beijing
c:hebei
Time taken: 0.268 seconds, Fetched: 3 row(s)

2.5. Fanning out the JSON field:
This time explode needs help from get_json_object.
We want every monthSales value. Step one is to break the JSON array into a list of per-object fragments and fan them out into rows:

hive> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as  sale_info from explode_lateral_view;
OK
"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"
"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"
"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"
Time taken: 0.22 seconds, Fetched: 3 row(s)

hive> select split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{') from explode_lateral_view;
OK
["\"source\":\"7fresh\",\"monthSales\":4900,\"userCount\":1900,\"score\":\"9.9\"","\"source\":\"jd\",\"monthSales\":2090,\"userCount\":78981,\"score\":\"9.8\"","\"source\":\"jdmart\",\"monthSales\":6987,\"userCount\":1600,\"score\":\"9.0\""]
Time taken: 0.219 seconds, Fetched: 1 row(s)
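The regexp_replace/split trick above can be sketched outside Hive. This is a minimal Python model (the sale_info literal is copied from the sample row in one.txt) showing how stripping `[{` and `}]` and then splitting on `},{` carves the JSON array into per-object fragments:

```python
import re

# Sample sale_info value, copied from the one.txt row above.
sale_info = ('[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},'
             '{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},'
             '{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]')

# Mirror the two regexp_replace calls: strip the leading '[{' and trailing '}]'.
stripped = sale_info.replace('[{', '').replace('}]', '')
# Mirror split(..., '},\\{'): each fragment is one JSON object minus its braces.
fragments = re.split(r'},\{', stripped)
for f in fragments:
    print(f)
```

Each printed fragment matches one output row of the explode query above.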

3. Why lateral view is needed

3.1. Suppose we wrap the exploded fragments in get_json_object to pick out the monthSales key:

hive> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') from explode_lateral_view;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

The query fails with the error above: a UDTF such as explode cannot be nested inside another function, nor used anywhere outside the SELECT clause.
3.2. The following also fails:

hive> select explode(split(area,',')) as area,good_id from explode_lateral_view;
FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'

When a UDTF appears in the SELECT clause, it must be the only expression there; selecting any other column alongside it (here good_id) triggers this error.

This single-expression restriction on UDTFs is exactly what LATERAL VIEW solves.

4. explode with lateral view, example 1

Using LATERAL VIEW:
A lateral view pairs with explode (or any other UDTF) so that a single statement can produce a result set in which each source row has been fanned out into multiple rows.


hive> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;
OK
1    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
2    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
3    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
4    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
5    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
6    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
7    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
8    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
9    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Time taken: 0.233 seconds, Fetched: 9 row(s)

Here LATERAL VIEW explode(split(goods_id,','))goods acts as a virtual table that is joined back to the base table explode_lateral_view, row by row, as a cartesian product.

Lateral views can also be stacked:

hive> select goods_id2,sale_info,area2 from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2;
OK
1    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
1    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
1    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
2    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
2    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
2    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
3    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
3    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
3    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
4    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
4    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
4    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
5    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
5    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
5    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
6    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
6    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
6    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
7    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
7    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
7    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
8    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
8    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
8    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
9    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    a:shandong
9    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    b:beijing
9    [{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]    c:hebei
Time taken: 0.235 seconds, Fetched: 27 row(s)
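The stacked lateral views behave like a per-row nested loop. A minimal Python sketch of that semantics, using the single source row from one.txt (each list comprehension level stands in for one LATERAL VIEW):

```python
# One source row, with values copied from the one.txt sample data.
row = {"area": "a:shandong,b:beijing,c:hebei",
       "goods_id": "1,2,3,4,5,6,7,8,9"}

# Each LATERAL VIEW explode(...) adds one level of per-row fan-out;
# stacking two of them yields the cross product of the two exploded lists.
result = [(g, a)
          for g in row["goods_id"].split(",")
          for a in row["area"].split(",")]

print(len(result))   # 9 goods_ids x 3 areas = 27 rows
print(result[0])     # ('1', 'a:shandong')
```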

We can now solve the earlier problem: extract every monthSales value from sale_info, one per row.

hive> select get_json_object(concat('{',sale_info_r,'}'),'$.monthSales') as monthSales from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_r;
OK
4900
2090
6987
Time taken: 0.323 seconds, Fetched: 3 row(s)

Finally, the following statement unpacks the one-row JSON value into a proper two-dimensional table:

hive> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales,get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;
OK
7fresh    4900    1900    9.9
jd    2090    78981    9.8
jdmart    6987    1600    9.0
Time taken: 1.106 seconds, Fetched: 3 row(s) 
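The concat('{', …, '}') plus get_json_object step can be mimicked with Python's json module. A hedged sketch (sale_info copied from the sample data; note json.loads parses numbers as ints, whereas get_json_object returns strings, but the field values are the same):

```python
import json
import re

# Sample sale_info value, copied from the one.txt row above.
sale_info = ('[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},'
             '{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},'
             '{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]')

# Explode the array into brace-less fragments, as the Hive query does...
fragments = re.split(r'},\{', sale_info.replace('[{', '').replace('}]', ''))

# ...then restore the braces (concat('{', f, '}')) and parse each object.
rows = [(obj['source'], obj['monthSales'], obj['userCount'], obj['score'])
        for obj in (json.loads('{' + f + '}') for f in fragments)]

for r in rows:
    print(r)
```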

5. explode with lateral view, example 2

explode turns a column of array values into rows, one output row per element:

SELECT xd, pt1, COUNT(1) AS num 
FROM 
(
    SELECT CASE
        WHEN SUBSTR(answer, 1, 1) IN ('A','B','C','D','E','F') THEN '小学' 
        WHEN SUBSTR(answer, 1, 1) IN ('G','H','I') THEN '初中' 
        WHEN SUBSTR(answer, 1, 1) IN ('J','K','L') THEN '高中' END xd, u.userId 
    FROM userdata AS u 
    WHERE u.questionId = 29 
) a 
INNER JOIN
(
    SELECT u.userId, CASE 
        WHEN pt='A' THEN '利用互联网查找资料,以更好地备课' 
        WHEN pt='B' THEN '给学生布置预习类的学习任务'
        WHEN pt='C' THEN '利用互联网学习平台提升课堂教学互动效果'
        WHEN pt='D' THEN '从教的视角出发准备数字资源,促进教学中更好地实现以学生为中心的教学活动实施'
        WHEN pt='E' THEN '在教学流程中课前、课中和课后连续使用网络平台,改变课堂教与学的方式,如翻转课堂'
        WHEN pt='F' THEN '从学的视角出发准备数字资源,支持学生依据需要进行自主、个性化学习'
        WHEN pt='G' THEN '对学生学习过程进行监督和管理,保障教学的高效开展'
        WHEN pt='H' THEN '对学生学习效果进行诊断和评价'
    END pt1 
    FROM answer u
    lateral view explode(split(u.answer, '')) c AS pt 
    WHERE u.questionId = 37 AND pt IN ('A','B','C','D','E','F','G','H')
) b 
ON a.userId = b.userId 
GROUP BY xd, pt1 
ORDER BY xd, num;

Note that in lateral view explode(split(u.answer, '')) c AS pt, the c is the alias of the virtual table, and it is mandatory. This SQL counts, for one survey question, how many respondents in each school stage (xd) chose each option; because question 37 is multi-select, the answer string must first be split into individual options and only then exploded into rows.
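The split-then-explode step for multi-select answers can be illustrated in Python. The answer strings below are made-up examples, not the survey's real data:

```python
from collections import Counter

# Hypothetical multi-select answer strings, one per respondent
# (made-up data, not the survey's real answers).
answers = ["AB", "ACD", "B", "AB"]

# split(u.answer, '') followed by explode() amounts to iterating
# over every character of every answer string.
exploded = [opt for ans in answers for opt in ans]

# GROUP BY pt ... COUNT(1) amounts to counting occurrences per option.
counts = Counter(exploded)
print(counts["A"], counts["B"])  # 3 3
```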

6. collect_set() example

No discussion of explode() is complete without its counterpart, collect_set(). collect_set(col) accepts only primitive types; it deduplicates the values of a column within each group and gathers them into an array-typed field. For example, to gather the score values under each no, group by no and apply collect_set to score. A concrete run against a real table:

hive> desc yyz_hive_callback_user_info;
OK
user_no                 string                  user_no             
click_id                string                  click_id            
date_str                string                  date_str            
rule_code               string                  rule_code           
Time taken: 0.424 seconds, Fetched: 4 row(s)
hive> select collect_set(rule_code) from yyz_hive_callback_user_info;
OK
["R00004","E00001","E00002","C00012","C00002","C00014"]
Time taken: 73.458 seconds, Fetched: 1 row(s)
hive> select date_str,collect_set(rule_code) from yyz_hive_callback_user_info group by date_str;
OK
2019-10-10    ["R00003","E00032","E00033","C00037"]
2019-10-11    ["E00024","C00005","E00026","C00033","E00022"]
2019-10-12    ["R00008","C00018","C00031","E00015"]

select no,collect_set(score) from tablss group by no;

This gathers multiple rows into a single array column. Note that it only works on a single column of a primitive type: the function takes exactly one column argument.
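collect_set's dedup-per-group behavior can be modeled in Python. Hive does not guarantee the order of elements in the resulting array, so this sketch (with made-up rows) keeps first-seen order only for readability:

```python
# Hypothetical (date_str, rule_code) rows mimicking the table above.
rows = [("2019-10-10", "R00003"), ("2019-10-10", "E00032"),
        ("2019-10-10", "R00003"), ("2019-10-11", "E00024")]

# GROUP BY date_str + collect_set(rule_code): dedup within each group.
groups = {}
for date_str, rule_code in rows:
    bucket = groups.setdefault(date_str, [])
    if rule_code not in bucket:   # the "set" part: duplicates are dropped
        bucket.append(rule_code)

print(groups["2019-10-10"])  # ['R00003', 'E00032'] - duplicate R00003 collapsed
```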

7. substr() example

substr() extracts a substring. Its signature is substr(string A, int start, int len) and it returns a string: the portion of A beginning at position start, of length len. Note that positions are 1-based.

hive> select substr(goods_id,1,3),goods_id from explode_lateral_view;
OK
1,2    1,2,3,4,5,6,7,8,9
Time taken: 0.219 seconds, Fetched: 1 row(s)
hive> select substr(goods_id,1,4),goods_id from explode_lateral_view;
OK
1,2,    1,2,3,4,5,6,7,8,9
Time taken: 0.349 seconds, Fetched: 1 row(s)
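substr's 1-based indexing maps onto Python's 0-based slicing as A[start-1 : start-1+len]. A quick sketch reproducing the two queries above:

```python
def hive_substr(a: str, start: int, length: int) -> str:
    """Model Hive substr(A, start, len): 1-based start position."""
    return a[start - 1:start - 1 + length]

goods_id = "1,2,3,4,5,6,7,8,9"
print(hive_substr(goods_id, 1, 3))  # 1,2
print(hive_substr(goods_id, 1, 4))  # 1,2,
```

(Hive's substr also accepts a negative start to count from the end; that case is omitted here.)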

8. concat_ws() example

Merge all question texts (WTNR) that share the same phone number (LDHM), separated by colons:

SELECT B.LDHM, concat_ws(':',collect_set(b.WTNR))
FROM (
        SELECT A.LDHM, A.DJRQ, A.WTNR
        FROM TEST1_12366 A
        WHERE A.LDHM IS NOT NULL AND LENGTH(A.LDHM) > 5
        ORDER BY A.LDHM, A.DJRQ
     ) B
GROUP BY B.LDHM;
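The GROUP BY + collect_set + concat_ws pattern can be modeled in Python. The LDHM/WTNR values below are invented placeholders, not data from the real TEST1_12366 table:

```python
# Hypothetical (LDHM, WTNR) rows: phone number and question text
# (placeholder values, not real data).
rows = [("13800000001", "invoice question"),
        ("13800000001", "refund question"),
        ("13800000002", "login question"),
        ("13800000001", "invoice question")]  # duplicate: dropped by collect_set

merged = {}
for ldhm, wtnr in rows:
    bucket = merged.setdefault(ldhm, [])
    if wtnr not in bucket:               # collect_set: dedup per phone number
        bucket.append(wtnr)

# concat_ws(':', collect_set(...)): join the deduped texts with ':'
result = {ldhm: ":".join(items) for ldhm, items in merged.items()}
print(result["13800000001"])  # invoice question:refund question
```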

References:
https://blog.csdn.net/guodong2k/article/details/79459282
https://blog.csdn.net/gdkyxy2013/article/details/78683165
