[TOC]
用Hive处理数据的时候经常会遇到行列互相转换的需求,总结并记录一下行列转行的常见场景和操作语法
其中所有的操作都可以直接复制语句去自己的hive执行,查看结果
前置知识
执行hive或beeline进入,执行
desc function explode;
查看函数说明;
explode(a) - separates the elements of array a into multiple rows, or the elements of a map into multiple rows and columns
可以将数组炸开成多行,或者将map炸开成多行多列,是Hive内置的UDTF
split(str, regex) - Splits str around occurances that match regex
按照正则规则去切割字符串
collect_list(x) - Returns a list of objects with duplicates
返回不去重的集合
collect_set(x) - Returns a set of objects with duplicate elements eliminated
返回一个去重的集合
concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator
返回一个特定分隔符的拼接字符串
max(expr) - Returns the maximum value of expr
返回表达式的最大值
准备数据
创建一个name,subject,score的期末考试成绩表,代表每个同学每个学科的成绩
导入数据也可以自定义文本然后load也行,sql语句一步到位,测试简单方便
- 用于行专列的数据,后称表1
create table school_final_test as
select 'jack' as name, 'english' as subject, 70 as score union all
select 'jack' as name, 'math' as subject, 80 as score union all
select 'jack' as name, 'chinese' as subject, 90 as score union all
select 'tim' as name, 'english' as subject, 10 as score union all
select 'tim' as name, 'math' as subject, 20 as score union all
select 'tim' as name, 'chinese' as subject, 30 as score;
表1数据:
name | subject | score |
---|---|---|
jack | english | 70 |
jack | math | 80 |
jack | chinese | 90 |
tim | english | 10 |
tim | math | 20 |
tim | chinese | 30 |
- 用于列转行的数据,后称表2
create table school_final_test1 as
select 'jack' as name, 70 as english,80 as math, 90 as chinese union all
select 'tim' as name, 10 as english,20 as math, 30 as chinese;
表2数据:
name | english | math | chinese |
---|---|---|---|
jack | 70 | 80 | 90 |
tim | 10 | 20 | 30 |
开始测试
需求1
多行转多列,数据源是表1
结果表:
name | english | math | chinese |
---|---|---|---|
jack | 70 | 80 | 90 |
tim | 10 | 20 | 30 |
-
group by + max + case when
语法 - 巧妙的使用了数值型和0之间的大小关系
select name,
max(case subject when 'english' then score else 0 end) as english,
max(case subject when 'math' then score else 0 end) as math,
max(case subject when 'chinese' then score else 0 end) as chinese
from school_final_test
group by name;
- 疑问: 如果需要多行转多列的不是数值型,而是str类型是否仍可行呢?
select max(str) from
(select 'str' as str union all
select 'sts' as str union all
select null as str)t1; -- result : sts
- 很明显,如果是字符串,可以使用null作为默认值,那么max仍然可行
需求2
多行转单列,数据源是表1
结果表:
name | scores |
---|---|
jack | english:70,math:80,chinese:90 |
tim | english:10,math:20,chinese:30 |
-
group by + collect_list + concat_ws
语法 - 注意:函数中要求的类型
select name,concat_ws(',',
collect_list(
concat_ws(':',subject,cast(score as string))
)
) as scores
from school_final_test
group by name;
需求3
多列转多行,数据源是表2
结果表:
name | subject | score |
---|---|---|
jack | english | 70 |
jack | math | 80 |
jack | chinese | 90 |
tim | english | 10 |
tim | math | 20 |
tim | chinese | 30 |
- 将每列需要的数据单独提取出来合并即可
- 注意: union和union all的区别是是否去重
select name,'english' as subject,english as score from school_final_test1
union all
select name,'math' as subject,math as score from school_final_test1
union all
select name,'chinese' as subject,chinese as score from school_final_test1;
需求4
单行转多列
数据源:
create table school_final_test2 as
select name,concat_ws(',',
collect_list(
concat_ws(':',subject,cast(score as string))
)
) as scores
from school_final_test
group by name;
name | scores |
---|---|
jack | english:70,math:80,chinese:90 |
tim | english:10,math:20,chinese:30 |
结果表:
name | scores |
---|---|
jack | english:70 |
jack | math:80 |
jack | chinese:90 |
tim | english:10 |
tim | math:20 |
tim | chinese:30 |
-
split + explode
语法
Lateral View是一个连接用户自定义函数(UDFs)的连接词,可以有多个;如果炸裂的列中有null值但是需要显示可以在view后面加outer,类似表的左外连接;
- 其中table1表示开窗得到的新表别名,必须写但可不使用(不使用时必须保证没有重复的列名)
select name,table1.scores as scores
from school_final_test2
lateral view explode(split(scores,',')) table1 as scores;
- 如果需要将结果表转换成表1的形式,可以继续对scores字段按照
:
进行拆分,然后用数组角标方式{{{arrya[index]
}}}获取对应字段
select name,split(scores,':')[0] as subject,split(scores,':')[1] as score from (
select name,table1.scores as scores
from school_final_test2
lateral view explode(split(scores,',')) table1 as scores
)t1;
需求5
将字符串"keyl=valuel&key2=value2...keyn=valuen"进行分割放到一个字段里面,可以查出任意一个keyx对应的valuex值
- 核心思路就是将字符串内数据疯转在map结构中,可以使用内置函数
str_to_map
,如果有其他更复杂的操作可自定义UDF; - str_to_map(text, delimiter1, delimiter2) - Creates a map by parsing text; Split text into key-value pairs using two delimiters.
The first delimiter seperates pairs, and the second delimiter sperates key and value
. If only one parameter is given, default delimiters are used: ',' as delimiter1 and '=' as delimiter2.*
select result['key1'] from
(select str_to_map('key1=value1&key2=value2&keyn=valuen','&','=') as result)t1;