Hive解析带有数组的Json字符串

要处理的数据如下:

{
    "IP": "192.168.1.1",
    "appName": "sichuan_yunyingyong",
    "customEvent": [
        {
            "eventName": "xx1",
            "du": "xx",
            "timestamp": "1480521763049",
            "eventParams": {
                "ContentID": "yixiuge",
                "account": "13856976635",
                "networkType": "WIFI",
                "result": "0",
                "type": "11"
            }
        },
        {
            "eventName": "xx2",
            "du": "xx",
            "timestamp": "1480521763049",
            "eventParams": {
                "ContentID": "yixiuge",
                "account": "13856976636",
                "networkType": "WIFI",
                "result": "0",
                "type": "11"
            }
        }
    ]
}

创建一个表tab_json,表中包含一个类型为String的字段json对应该字符串。
接下来要用到几个函数,先介绍一下lateral view,一般配合UDTF使用,能将一行变成多行输出。
具体可参考:https://blog.csdn.net/sunnyyoona/article/details/62894761
                      https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
JSON_TUPLE(str, path1, path2 ..., pathN)根据输入的参数提取字符串

posexplode(array)把数组变成(pos, json) 的键值对

select
j1.j1_ip,
j1.j1_appName,
j2.j2_customEvent_json
FROM tab_json s
lateral view json_tuple(s.json, 'IP', 'appName', 'customEvent') j1 as j1_ip, j1_appName, j1_customEvent

lateral view posexplode(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')) j2 as j2_customEvents_pos, j2_customEvent_json

执行结果为:

192.168.1.1     sichuan_yunyingyong     {"eventName":"xx1","du":"xx","timestamp":"1480521763049","eventParams":{"ContentID":"yixiuge","account":"13856976635","networkType":"WIFI","result":"0","type":"11"}}

192.168.1.1     sichuan_yunyingyong     {"eventName":"xx2","du":"xx","timestamp":"1480521763049","eventParams":{"ContentID":"yixiuge","account":"13856976636","networkType":"WIFI","result":"0","type":"11"}}

这里把json array的格式通过替换变成了 {json1} || {json2} , 再去掉数组的括号,最后根据 || 来拆开,形成了一个有两个元素的数组,接着 posexplode 在把数组变成(pos, json) 的键值对,pos记录了元素的位置,json就是实际的json数据,这样一条数据就变成了两条了

那么现在要获取IP , appName , account 就很简单了:

select
j1.j1_ip,
j1.j1_appName,
j4.j4_account
FROM tab_json s
lateral view json_tuple(s.json, 'IP', 'appName', 'customEvent') j1 as j1_ip, j1_appName, j1_customEvent
lateral view posexplode(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')) j2 as j2_customEvents_pos, j2_customEvent_json
lateral view json_tuple(j2.j2_customEvent_json, 'eventParams') j3 as j3_eventParams
lateral view json_tuple(j3.j3_eventParams, 'account') j4 as j4_account
结果如下:
192.168.1.1     sichuan_yunyingyong     13856976635

192.168.1.1     sichuan_yunyingyong     13856976636


上面的例子json array有两个元素,如果你只关注其中一个元素,那么可以如下操作
lateral view posexplode(array(split(regexp_replace(regexp_replace(j1.j1_customEvent,'\\}\\,\\{','\\}\\|\\|\\{'),'\\[|\\]',''), '\\|\\|')[1])) j2 as j2_customEvents_pos, j2_customEvent_json
split 跟上数组下标,就能取出某个元素,由于posexplode只接受 array类型的参数,可以使用array函数转换成对应的数组,这样就只有一条数据了,结果如下
192.168.1.1     sichuan_yunyingyong     13856976636

参考:https://blog.csdn.net/lfq1532632051/article/details/63262519


你可能感兴趣的:(Hive)