SparkSQL | Table-Generating Functions

Strictly speaking, `lateral view` and the `explode` function shouldn't exist in a relational database, since they violate first normal form (every attribute must be atomic). In practice, however, especially in big-data scenarios, low-frequency but indispensable data often ends up stored as JSON. In that case you need to parse the JSON and expand its arrays and multi-key maps into rows.

Initializing the data

# Randomly fabricated sample data; the values themselves are meaningless
data = [
    {
        "id": 1,
        "name": "XiaoHua",
        "age": 12,
        "interests": "game,read,tv",
        "interests_socre": {'game': 8, 'read': 7, 'tv': 8},
        "scores": {
             "scores": [{
                    "subject": "math",
                    "score": 80
                }, {
                    "subject": "language",
                    "score": 90
                }, {
                    "subject": "sports",
                    "score": 70
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}, {"subject": "sports", "score": 70}]'
    },

    {
        "id": 2,
        "name": "QiangQiang",
        "age": 13,
        "interests": "game,read,fishing,pingpong",
        "interests_socre": {'game': 8, 'read': 7, 'fishing': 8, 'pingpong': 9},
        "scores": {
             "scores": [{
                    "subject": "math",
                    "score": 85
                }, {
                    "subject": "language",
                    "score": 92
                }, {
                    "subject": "sports",
                    "score": 73
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 85}, {"subject": "language", "score": 92}, {"subject": "sports", "score": 73}]'
    },    

    {
        "id": 3,
        "name": "YuanYuan",
        "age": 12,
        "interests": "read,dance",
        "interests_socre": {'read': 7, 'dance': 9},
        "scores": {
             "scores": [{
                    "subject": "math",
                    "score": 82
                }, {
                    "subject": "language",
                    "score": 94
                }, {
                    "subject": "sports",
                    "score": 78
            }],
            "count": 3
        },
        "scores_str": '[{"subject": "math", "score": 82}, {"subject": "language", "score": 94}, {"subject": "sports", "score": 78}]'
    }      
]
df = spark.createDataFrame(data)
df.createOrReplaceTempView('df')
df.cache()
/usr/lib/spark/python/pyspark/sql/session.py:346: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"

DataFrame[age: bigint, id: bigint, interests: string, interests_socre: map<string,bigint>, name: string, scores: map<string,array<map<string,string>>>, scores_str: string]
print(df.schema)
StructType(List(StructField(age,LongType,true),StructField(id,LongType,true),StructField(interests,StringType,true),StructField(interests_socre,MapType(StringType,LongType,true),true),StructField(name,StringType,true),StructField(scores,MapType(StringType,ArrayType(MapType(StringType,StringType,true),true),true),true),StructField(scores_str,StringType,true)))
df.toPandas().head()
age id interests interests_socre name scores scores_str
0 12 1 game,read,tv {'tv': 8, 'game': 8, 'read': 7} XiaoHua {'count': None, 'scores': [{'score': '80', 'su... [{"subject": "math", "score": 80}, {"subject":...
1 13 2 game,read,fishing,pingpong {'game': 8, 'read': 7, 'pingpong': 9, 'fishing... QiangQiang {'count': None, 'scores': [{'score': '85', 'su... [{"subject": "math", "score": 85}, {"subject":...
2 12 3 read,dance {'dance': 9, 'read': 7} YuanYuan {'count': None, 'scores': [{'score': '82', 'su... [{"subject": "math", "score": 82}, {"subject":...

Using explode

# Array
spark.sql("""
select 
    id
    ,name
    ,explode(split(interests, ',')) as interest
from df
order by id
""").toPandas()
id name interest
0 1 XiaoHua game
1 1 XiaoHua read
2 1 XiaoHua tv
3 2 QiangQiang game
4 2 QiangQiang read
5 2 QiangQiang fishing
6 2 QiangQiang pingpong
7 3 YuanYuan read
8 3 YuanYuan dance
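Conceptually, `explode(split(interests, ','))` turns each input row into one output row per array element. A minimal pure-Python sketch of the same semantics (using a hypothetical two-row subset of the table above):

```python
# Pure-Python sketch of explode(split(col, ',')): one output row per element.
rows = [
    {"id": 1, "name": "XiaoHua", "interests": "game,read,tv"},
    {"id": 3, "name": "YuanYuan", "interests": "read,dance"},
]

exploded = [
    {"id": r["id"], "name": r["name"], "interest": i}
    for r in rows
    for i in r["interests"].split(",")
]
# Each input row yields len(split) output rows, mirroring explode.
```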
# Map
spark.sql("""
select 
    id
    ,name    
    ,explode(interests_socre) as (key, value)
from df
order by id
""").toPandas()
id name key value
0 1 XiaoHua tv 8
1 1 XiaoHua game 8
2 1 XiaoHua read 7
3 2 QiangQiang game 8
4 2 QiangQiang read 7
5 2 QiangQiang pingpong 9
6 2 QiangQiang fishing 8
7 3 YuanYuan dance 9
8 3 YuanYuan read 7
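`explode` on a map emits one `(key, value)` row per entry. The same idea in plain Python (note that dicts iterate in insertion order in Python 3.7+, whereas Spark makes no ordering guarantee for map entries):

```python
# Pure-Python sketch of explode(map) -> one (key, value) pair per entry.
row = {"id": 3, "name": "YuanYuan",
       "interests_socre": {"read": 7, "dance": 9}}

pairs = [
    {"id": row["id"], "name": row["name"], "key": k, "value": v}
    for k, v in row["interests_socre"].items()
]
```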
# struct
spark.sql("""
SELECT
    id
    ,name
    ,score.subject
    ,score.score
FROM(
    select 
        id
        ,name
        ,explode(scores.scores) as score
    from df
) as base
""").toPandas()
id name subject score
0 1 XiaoHua math 80
1 1 XiaoHua language 90
2 1 XiaoHua sports 70
3 2 QiangQiang math 85
4 2 QiangQiang language 92
5 2 QiangQiang sports 73
6 3 YuanYuan math 82
7 3 YuanYuan language 94
8 3 YuanYuan sports 78

lateral view

Combining explode with lateral view

  • lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
  • fromClause: FROM baseTable (lateralView)*

udtf

  • explode(ARRAY a)
  • explode(MAP m)
  • posexplode(ARRAY a)
  • inline(ARRAY a)
  • stack(int r,T1 V1,…,Tn/r Vn)
  • json_tuple(string jsonStr,string k1,…,string kn)
  • parse_url_tuple(string urlStr,string p1,…,string pn)
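Of these, `posexplode` behaves like `explode` but additionally emits each element's position (starting at 0). A quick Python sketch of the `(pos, val)` pairs it would produce for one array:

```python
# posexplode(array) -> one (pos, val) row per element, pos starting at 0,
# which in Python is exactly enumerate().
interests = "game,read,tv".split(",")
pos_val = list(enumerate(interests))
```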
# The struct example above can use lateral view to avoid the nested subquery
spark.sql("""
select 
    id
    ,name
    ,sc.subject
    ,sc.score
from df
lateral view explode(scores.scores) t as sc
""").toPandas()
id name subject score
0 1 XiaoHua math 80
1 1 XiaoHua language 90
2 1 XiaoHua sports 70
3 2 QiangQiang math 85
4 2 QiangQiang language 92
5 2 QiangQiang sports 73
6 3 YuanYuan math 82
7 3 YuanYuan language 94
8 3 YuanYuan sports 78
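A lateral view is effectively a correlated join: for each base row, the UDTF runs over that row's value, and the base columns are joined against every row the UDTF generates. In Python terms (hypothetical one-row subset):

```python
# LATERAL VIEW explode(scores) t AS sc, as a nested loop:
# each base row is joined with every element generated from it.
base = [
    {"id": 1, "name": "XiaoHua",
     "scores": [{"subject": "math", "score": 80},
                {"subject": "language", "score": 90}]},
]

result = [
    {"id": b["id"], "name": b["name"],
     "subject": sc["subject"], "score": sc["score"]}
    for b in base
    for sc in b["scores"]
]
```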

`json_tuple` can parse multiple fields in one call, whereas `get_json_object` extracts only one field at a time.

  • 1st: regexp_replace(scores_str, ' ', '') removes the spaces from the string
  • 2nd: regexp_extract(1st, '^\\\\[(.+)\\\\]$', 1) strips the outer brackets '[]'
  • 3rd: regexp_replace(2nd, '\\\\}\\\\,\\\\{', '\\\\}\\\\|\\\\|\\\\{') rewrites "},{" as "}||{"
  • 4th: split(3rd, '\\\\|\\\\|') splits the string into one JSON object per array element
  • 5th: extract the fields from each JSON object
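The steps above can be checked outside Spark. Here is a sketch with Python's `re` and `json` on a shortened sample string (the SQL literals need doubled escaping like '\\\\[' because the backslash must survive both the SQL string parser and the regex engine, while Python raw strings need only one):

```python
import json
import re

scores_str = '[{"subject": "math", "score": 80}, {"subject": "language", "score": 90}]'

s = scores_str.replace(" ", "")           # 1st: drop spaces
s = re.match(r"^\[(.+)\]$", s).group(1)   # 2nd: strip the outer [ ]
s = s.replace("},{", "}||{")              # 3rd: mark element boundaries with ||
parts = s.split("||")                     # 4th: one JSON object string per element
records = [json.loads(p) for p in parts]  # 5th: extract the fields
```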
# Parsing the JSON string
spark.sql("""
select
    id
    ,name
    -- json_tuple
    ,v2.subject
    ,v2.score
    -- get_json_object
    ,get_json_object(sc, '$.subject') as subject_2
    ,get_json_object(sc, '$.score') as score_2
    -- json_tuple
    ,json_tuple(t.sc,'subject','score') as (subject_3, score_3)
from(
    select 
        id
        ,name
        ,split(
            regexp_replace(
                regexp_extract(regexp_replace(scores_str, ' ', ''),'^\\\\[(.+)\\\\]$', 1),
                '\\\\}\\\\,\\\\{',
                '\\\\}\\\\|\\\\|\\\\{'
            ), '\\\\|\\\\|') as scores
    from df
) as base
lateral view explode(base.scores) t as sc
lateral view json_tuple(t.sc,'subject','score') v2 as subject,score
""").toPandas()
id name subject score subject_2 score_2 subject_3 score_3
0 1 XiaoHua math 80 math 80 math 80
1 1 XiaoHua language 90 language 90 language 90
2 1 XiaoHua sports 70 sports 70 sports 70
3 2 QiangQiang math 85 math 85 math 85
4 2 QiangQiang language 92 language 92 language 92
5 2 QiangQiang sports 73 sports 73 sports 73
6 3 YuanYuan math 82 math 82 math 82
7 3 YuanYuan language 94 language 94 language 94
8 3 YuanYuan sports 78 sports 78 sports 78

References

  • Using lateral view with the explode function in Hive
  • Parsing JSON arrays in Hive
