json是常见的数据接口形式,实际中使用也很广泛,下面我们看看怎么在hive中解析json格式。
hive中常用的解析json格式的函数有两个:
先看看get_json_object
> desc function extended get_json_object;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| tab_name |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| get_json_object(json_txt, path) - Extract a json object from path |
| Extract json object from a json string based on json path specified, and return json string of the extracted json object. It will return null if the input json string is invalid. |
| A limited version of JSONPath supported: |
| $ : Root object |
| . : Child operator |
| [] : Subscript operator for array |
| * : Wildcard for [] |
| Syntax not supported that's worth noticing: |
| '' : Zero length string as key |
| .. : Recursive descent |
| @ : Current object/element |
| () : Script expression |
| ?() : Filter (script) expression. |
| [,] : Union operator |
| [start:end:step] : array slice operator |
| |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
16 rows selected (0.579 seconds)
从上面可以看出,get_json_object输入的参数是两个,json_txt与path。
其中,json_txt就是我们需要解析的json字符串,而path表示json中的"字段"。
再看看json_tuple方法
> desc function extended json_tuple;
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| tab_name |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string. |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
1 row selected (0.549 seconds)
从上面的解释不难看出,json_tuple与get_json_object的用法很类似,唯一的区别是,json_tuple输入的是json字符串中的多个"字段",返回一个元组,元组中的所有数据类型为string。
假设有如下一条测试数据
{"age":18, "name": "lili", "gender": "female"}
如果想要解析age字段
> select get_json_object('{"age":18, "name": "lili", "gender": "female"}', "$.age");
+------+--+
| _c0 |
+------+--+
| 18 |
+------+--+
如果使用json_tuple方法,也可以。
> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age");
+-----+--+
| c0 |
+-----+--+
| 18 |
+-----+--+
如果json串里是个数组,可以采取如下提取的方式
> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0,1,2].age") as age;
+-------------+--+
| age |
+-------------+--+
| [18,19,15] |
+-------------+--+
如果只是想查询数组里第一个对象的age值,可以用下面的提取方法
> select get_json_object('[{"age":18, "name": "lili", "gender": "female"}, {"age":19, "name": "lucy", "gender": "female"}, {"age":15, "name": "mike", "gender": "male"}]', "$.[0].age");
+------+--+
| _c0 |
+------+--+
| 18 |
+------+--+
如果想同时提取所有字段,可以用json_tuple方法。
> select json_tuple('{"age":18, "name": "lili", "gender": "female"}', "age", "name", "gender");
+-----+-------+---------+--+
| c0 | c1 | c2 |
+-----+-------+---------+--+
| 18 | lili | female |
+-----+-------+---------+--+
上面的例子,都是解析单个或者确定个数的json对象。但如果不知道json数组有多少个,或者数组长度不确定时,我们可以采用下面迂回的方式来解析。
还是以上面的数组为例
[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]
如果我们想要解析出所有的age字段,该怎么办?
第一步,将"},{”中的,替换成; ,同时将数组的"[]"符号也去除:
> select regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', '');
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
| _c0 |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"};{"age":19, "name": "lucy", "gender": "female"};{"age":15, "name": "mike", "gender": "male"} |
+---------------------------------------------------------------------------------------------------------------------------------------------+--+
第二部,将字符串用split方法分割,得到一个数组
> select split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;");
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| _c0 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| ["{\"age\":18, \"name\": \"lili\", \"gender\": \"female\"}","{\"age\":19, \"name\": \"lucy\", \"gender\": \"female\"}","{\"age\":15, \"name\": \"mike\", \"gender\": \"male\"}"] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
第三步,使用explode方法将数据变成一列
> select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;"));
+-------------------------------------------------+--+
| col |
+-------------------------------------------------+--+
| {"age":18, "name": "lili", "gender": "female"} |
| {"age":19, "name": "lucy", "gender": "female"} |
| {"age":15, "name": "mike", "gender": "male"} |
+-------------------------------------------------+--+
第四步,使用get_json_object解析即可。
> select get_json_object(json_data, "$.age") from (select explode(split(regexp_replace(regexp_replace('[{"age":18, "name": "lili", "gender": "female"},{"age":19, "name": "lucy", "gender": "female"},{"age":15, "name": "mike", "gender": "male"}]', '\\}\\,\\{','\\}\\;\\{'), '\\[|\\]', ''), "\\;")) as json_data) virtual_table;
+------+--+
| _c0 |
+------+--+
| 18 |
| 19 |
| 15 |
+------+--+
以上步骤,就完成了对json对象数组的解析。