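Strictly speaking, lateral view and explode should not need to exist in a database, because they violate first normal form (every attribute must be atomic). In practice, however, some scenarios, big data workloads in particular, store data that is rarely used but must not be lost as JSON. In those cases the JSON has to be parsed, and the arrays and multi-key values inside it have to be expanded into rows.
Initialize a sample dataset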
data = [
    {
        "id": 1,
        "name": "XiaoHua",
        "age": 12,
        "interests": "game,read,tv",
        "interests_socre": {'game': 8, 'read': 7, 'tv': 8},
        "scores": {
            "scores": [{
                "subject": "math",
                "score": 80
            }, {
                "subject": "language",
                "score": 90
            }, {
                "subject": "sports",
                "score": 70
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}, {"subject": "sports", "score": 70}]'
    },
    {
        "id": 2,
        "name": "QiangQiang",
        "age": 13,
        "interests": "game,read,fishing,pingpong",
        "interests_socre": {'game': 8, 'read': 7, 'fishing': 8, 'pingpong': 9},
        "scores": {
            "scores": [{
                "subject": "math",
                "score": 85
            }, {
                "subject": "language",
                "score": 92
            }, {
                "subject": "sports",
                "score": 73
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 85}, {"subject": "language", "score": 92}, {"subject": "sports", "score": 73}]'
    },
    {
        "id": 3,
        "name": "YuanYuan",
        "age": 12,
        "interests": "read,dance",
        "interests_socre": {'read': 7, 'dance': 9},
        "scores": {
            "scores": [{
                "subject": "math",
                "score": 82
            }, {
                "subject": "language",
                "score": 94
            }, {
                "subject": "sports",
                "score": 78
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 82}, {"subject": "language", "score": 94}, {"subject": "sports", "score": 78}]'
    }
]
df = spark.createDataFrame(data)
df.createOrReplaceTempView('df')
df.cache()
/usr/lib/spark/python/pyspark/sql/session.py:346: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
warnings.warn("inferring schema from dict is deprecated,"
DataFrame[age: bigint, id: bigint, interests: string, interests_socre: map<string,bigint>, name: string, scores: map<string,array<map<string,string>>>, scores_str: string]
print(df.schema)
StructType(List(StructField(age,LongType,true),StructField(id,LongType,true),StructField(interests,StringType,true),StructField(interests_socre,MapType(StringType,LongType,true),true),StructField(name,StringType,true),StructField(scores,MapType(StringType,ArrayType(MapType(StringType,StringType,true),true),true),true),StructField(scores_str,StringType,true)))
df.toPandas().head()
| | age | id | interests | interests_socre | name | scores | scores_str |
|---|---|---|---|---|---|---|---|
| 0 | 12 | 1 | game,read,tv | {'tv': 8, 'game': 8, 'read': 7} | XiaoHua | {'count': None, 'scores': [{'score': '80', 'su... | [{"subject": "math", "score": 80}, {"subject":... |
| 1 | 13 | 2 | game,read,fishing,pingpong | {'game': 8, 'read': 7, 'pingpong': 9, 'fishing... | QiangQiang | {'count': None, 'scores': [{'score': '85', 'su... | [{"subject": "math", "score": 85}, {"subject":... |
| 2 | 12 | 3 | read,dance | {'dance': 9, 'read': 7} | YuanYuan | {'count': None, 'scores': [{'score': '82', 'su... | [{"subject": "math", "score": 82}, {"subject":... |
Using explode
spark.sql("""
    select
        id
        ,name
        ,explode(split(interests, ',')) as interest
    from df
    order by id
""").toPandas()
| | id | name | interest |
|---|---|---|---|
| 0 | 1 | XiaoHua | game |
| 1 | 1 | XiaoHua | read |
| 2 | 1 | XiaoHua | tv |
| 3 | 2 | QiangQiang | game |
| 4 | 2 | QiangQiang | read |
| 5 | 2 | QiangQiang | fishing |
| 6 | 2 | QiangQiang | pingpong |
| 7 | 3 | YuanYuan | read |
| 8 | 3 | YuanYuan | dance |
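To make the row multiplication concrete, here is a minimal plain-Python sketch of what explode(split(interests, ',')) does to each input row. It does not use Spark, the helper name explode_split is hypothetical, and only two of the three sample rows are reproduced.

```python
# Plain-Python illustration of explode(split(col, ',')); this is not Spark code.
rows = [
    {"id": 1, "name": "XiaoHua", "interests": "game,read,tv"},
    {"id": 3, "name": "YuanYuan", "interests": "read,dance"},
]

def explode_split(rows, column, sep=","):
    """Yield one output row per element of row[column].split(sep)."""
    for row in rows:
        for item in row[column].split(sep):
            # The other selected columns are repeated for every element.
            yield {"id": row["id"], "name": row["name"], "interest": item}

exploded = list(explode_split(rows, "interests"))
for r in exploded:
    print(r)
```

One difference to keep in mind: Spark's split treats its second argument as a regular expression, while Python's str.split takes a literal separator. For a plain comma the two behave the same.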
spark.sql("""
    select
        id
        ,name
        ,explode(interests_socre) as (key, value)
    from df
    order by id
""").toPandas()
| | id | name | key | value |
|---|---|---|---|---|
| 0 | 1 | XiaoHua | tv | 8 |
| 1 | 1 | XiaoHua | game | 8 |
| 2 | 1 | XiaoHua | read | 7 |
| 3 | 2 | QiangQiang | game | 8 |
| 4 | 2 | QiangQiang | read | 7 |
| 5 | 2 | QiangQiang | pingpong | 9 |
| 6 | 2 | QiangQiang | fishing | 8 |
| 7 | 3 | YuanYuan | dance | 9 |
| 8 | 3 | YuanYuan | read | 7 |
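When explode is applied to a map, each entry becomes one row with two columns. The following plain-Python sketch shows the idea, with a dict standing in for the map column; explode_map is a hypothetical helper and no Spark is involved.

```python
# Plain-Python illustration of explode(MAP m); this is not Spark code.
rows = [
    {"id": 3, "name": "YuanYuan", "interests_socre": {"read": 7, "dance": 9}},
]

def explode_map(rows, column):
    """Yield one output row per (key, value) entry of the map in row[column]."""
    for row in rows:
        for key, value in row[column].items():
            yield {"id": row["id"], "name": row["name"], "key": key, "value": value}

exploded = list(explode_map(rows, "interests_socre"))
for r in exploded:
    print(r)
```

A Python dict keeps insertion order, but the Spark output above lists the entries in a different order than the input literal, so it is probably safest not to rely on the order of exploded map entries.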
spark.sql("""
    select
        id
        ,name
        ,score.subject
        ,score.score
    from (
        select
            id
            ,name
            ,explode(scores.scores) as score
        from df
    ) as base
""").toPandas()
| | id | name | subject | score |
|---|---|---|---|---|
| 0 | 1 | XiaoHua | math | 80 |
| 1 | 1 | XiaoHua | language | 90 |
| 2 | 1 | XiaoHua | sports | 70 |
| 3 | 2 | QiangQiang | math | 85 |
| 4 | 2 | QiangQiang | language | 92 |
| 5 | 2 | QiangQiang | sports | 73 |
| 6 | 3 | YuanYuan | math | 82 |
| 7 | 3 | YuanYuan | language | 94 |
| 8 | 3 | YuanYuan | sports | 78 |
lateral view
Combining explode with lateral view
- lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
- fromClause: FROM baseTable (lateralView)*
Table-generating functions (UDTFs) that can be used with lateral view
- explode(ARRAY a)
- explode(MAP m)
- posexplode(ARRAY a)
- inline(ARRAY a)
- stack(int r,T1 V1,…,Tn/r Vn)
- json_tuple(string jsonStr,string k1,…,string kn)
- parse_url_tuple(string urlStr,string p1,…,string pn)
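Two of the functions listed above, posexplode and inline, are not used elsewhere in this post. As a rough illustration of their semantics, here is a plain-Python sketch written as generators. It is not Spark code, and the field list passed to inline is an assumption of the sketch (in SQL the fields come from the struct type).

```python
# Plain-Python illustration of posexplode(ARRAY a) and inline(ARRAY a).

def posexplode(array):
    """Like explode, but each element is paired with its 0-based position."""
    for pos, item in enumerate(array):
        yield (pos, item)

def inline(array_of_structs, fields):
    """Turn an array of structs into rows, one column per struct field."""
    for struct in array_of_structs:
        yield tuple(struct[f] for f in fields)

interests = "game,read,tv".split(",")
scores = [{"subject": "math", "score": 80}, {"subject": "sports", "score": 70}]

pos_rows = list(posexplode(interests))
inline_rows = list(inline(scores, ["subject", "score"]))
print(pos_rows)
print(inline_rows)
```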
spark.sql("""
    select
        id
        ,name
        ,sc.subject
        ,sc.score
    from df
    lateral view explode(scores.scores) t as sc
""").toPandas()
| | id | name | subject | score |
|---|---|---|---|---|
| 0 | 1 | XiaoHua | math | 80 |
| 1 | 1 | XiaoHua | language | 90 |
| 2 | 1 | XiaoHua | sports | 70 |
| 3 | 2 | QiangQiang | math | 85 |
| 4 | 2 | QiangQiang | language | 92 |
| 5 | 2 | QiangQiang | sports | 73 |
| 6 | 3 | YuanYuan | math | 82 |
| 7 | 3 | YuanYuan | language | 94 |
| 8 | 3 | YuanYuan | sports | 78 |
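Conceptually, a lateral view joins every base row with the rows that the table-generating function produces from that same row. The plain-Python sketch below models this; lateral_view and explode are hypothetical helpers, and the row with id 4 is invented for illustration and is not part of the sample data.

```python
# Plain-Python model of: FROM rows LATERAL VIEW explode(scores) t AS sc
rows = [
    {"id": 1, "name": "XiaoHua",
     "scores": [{"subject": "math", "score": 80},
                {"subject": "language", "score": 90}]},
    {"id": 4, "name": "NoScores", "scores": []},  # hypothetical row with an empty array
]

def explode(array):
    """Generate one single-column row per array element."""
    for item in array:
        yield (item,)

def lateral_view(rows, udtf, aliases):
    """Join each input row with every row the UDTF generates from it."""
    for row in rows:
        for generated in udtf(row):
            yield {**row, **dict(zip(aliases, generated))}

joined = list(lateral_view(rows, lambda r: explode(r["scores"]), ["sc"]))
result = [(r["id"], r["name"], r["sc"]["subject"], r["sc"]["score"]) for r in joined]
print(result)
```

In this model the row with the empty array yields no output rows at all. Plain LATERAL VIEW behaves the same way; Hive and Spark SQL provide LATERAL VIEW OUTER for keeping such rows with NULL in the generated columns.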
json_tuple can extract several fields in a single call, whereas get_json_object extracts only one field per call.
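The difference can be sketched in plain Python with the standard json module. The two functions below only imitate the Spark functions of the same name under simplifying assumptions: only top-level keys, only the simple '$.key' path form, and only scalar values, which are returned as strings.

```python
import json

def get_json_object(json_str, path):
    """Return a single field as a string; only the '$.key' path form is handled."""
    key = path[2:]  # strip the leading "$."
    value = json.loads(json_str).get(key)
    return None if value is None else str(value)

def json_tuple(json_str, *keys):
    """Parse the string once and return several top-level fields as strings."""
    obj = json.loads(json_str)
    return tuple(None if obj.get(k) is None else str(obj[k]) for k in keys)

sc = '{"subject": "math", "score": 80}'

# Two separate calls, each parsing the string again.
subject_2 = get_json_object(sc, "$.subject")
score_2 = get_json_object(sc, "$.score")

# One call that returns both fields.
subject_3, score_3 = json_tuple(sc, "subject", "score")
print(subject_2, score_2, subject_3, score_3)
```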
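Because scores_str holds the array as a plain string, it first has to be split into individual JSON objects before the fields can be extracted:
- 1st: regexp_replace(scores_str, ' ', '') removes the spaces from the string
- 2nd: regexp_extract(1st, '^\\\\[(.+)\\\\]$', 1) strips the surrounding square brackets '[]'
- 3rd: regexp_replace(2nd, '\\\\}\\\\,\\\\{', '\\\\}\\\\|\\\\|\\\\{') turns "},{" into "}||{"
- 4th: split(3rd, '\\\\|\\\\|') splits the array into individual dicts
- 5th: extract the elements of each dict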
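The five steps above can be traced in plain Python with the re and json modules. This is only a sketch of the same string manipulation, not the Spark execution. The patterns need a single backslash here because Python raw strings are used, whereas the SQL version is unescaped once by Python and, under Spark's default settings, once more by the SQL string-literal parser.

```python
import json
import re

scores_str = ('[{"subject": "math", "score": 80}, '
              '{"subject": "language", "score": 90}, '
              '{"subject": "sports", "score": 70}]')

step1 = re.sub(r" ", "", scores_str)              # 1st: remove spaces
step2 = re.search(r"^\[(.+)\]$", step1).group(1)  # 2nd: strip the outer [ ]
step3 = re.sub(r"\},\{", "}||{", step2)           # 3rd: "},{" => "}||{"
step4 = re.split(r"\|\|", step3)                  # 4th: split into objects
step5 = [json.loads(item) for item in step4]      # 5th: read fields of each dict

for item in step5:
    print(item["subject"], item["score"])
```

For this sample the result is the same as parsing the whole string with json.loads(scores_str). The regex approach relies on the data being simple: removing every space would also alter values that contain spaces, and a "},{" sequence inside a nested value would be split in the wrong place.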
spark.sql("""
    select
        id
        ,name
        -- json_tuple through lateral view (alias v2)
        ,v2.subject
        ,v2.score
        -- get_json_object, one call per field
        ,get_json_object(sc, '$.subject') as subject_2
        ,get_json_object(sc, '$.score') as score_2
        -- json_tuple called directly in the select list
        ,json_tuple(t.sc,'subject','score') as (subject_3, score_3)
    from (
        select
            id
            ,name
            ,split(
                regexp_replace(
                    regexp_extract(regexp_replace(scores_str, ' ', ''),'^\\\\[(.+)\\\\]$', 1),
                    '\\\\}\\\\,\\\\{',
                    '\\\\}\\\\|\\\\|\\\\{'
                ), '\\\\|\\\\|') as scores
        from df
    ) as base
    lateral view explode(base.scores) t as sc
    lateral view json_tuple(t.sc,'subject','score') v2 as subject,score
""").toPandas()
| | id | name | subject | score | subject_2 | score_2 | subject_3 | score_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | XiaoHua | math | 80 | math | 80 | math | 80 |
| 1 | 1 | XiaoHua | language | 90 | language | 90 | language | 90 |
| 2 | 1 | XiaoHua | sports | 70 | sports | 70 | sports | 70 |
| 3 | 2 | QiangQiang | math | 85 | math | 85 | math | 85 |
| 4 | 2 | QiangQiang | language | 92 | language | 92 | language | 92 |
| 5 | 2 | QiangQiang | sports | 73 | sports | 73 | sports | 73 |
| 6 | 3 | YuanYuan | math | 82 | math | 82 | math | 82 |
| 7 | 3 | YuanYuan | language | 94 | language | 94 | language | 94 |
| 8 | 3 | YuanYuan | sports | 78 | sports | 78 | sports | 78 |