hive与presto解析json数组并拆分为多行

案例需求

现有如下json字符串数组,需要把其拆分多行

[{
	"tabid": "新人专享",
	"id": 100015129300,
	"channelid": "APP",
	"ass_rule": {
		"lv1": "ass",
		"lv2": "new1"
	}
}, {
	"tabid": "新人专享",
	"id": 100006377942,
	"channelid": "APP",
	"uas_rule": {
		"lv1": "ass",
		"lv2": "new2"
	}
}, {
	"tabid": "新人专享",
	"id": 100014352353,
	"channelid": "APP",
	"ass_rule": {
		"lv1": "ass",
		"lv2": "new3"
	}
}]

一、hive用法

1、操作分析 

 分析:没有别的好办法,肯定提取“[]”之间的json字符串之后、后进行分割,逗号分割不行,逗号太多,而我们只想两个大括号之间的逗号做分割“ },{”,所以要把逗号替换成json数组中不存在的其它字符,之后再进行分割,如下,五步走:

1、regexp_extract提取[]内的json字符串

2、regexp_replace替换为逗号为自定义分隔符

3、split根据自定义分隔符分割

4、explode行转列函数

5、get_json_object解析json取字段值

2、完整SQL

SELECT
	get_json_object(str_json, '$.id') AS sid
FROM
	(
		SELECT
			event_param_json,
			split(regexp_replace(regexp_extract(event_param_json, '(\\[)(.*?)(\\])', 2), '\\},\\{', '\\}#\\{'), '\\#') AS json_list
		FROM
			abm.abm_wireless_exposure_log
		WHERE
			dt = '2020-09-06'
			AND event_id = 'ProExpo'
			AND event_param_json LIKE '%ass%'
	)
	a lateral VIEW explode(json_list) list_tab AS str_json

或者如下SQL都可以(两种SQL只是提取中括号[]间json字符串的正则不一样)

SELECT
	GET_JSON_OBJECT(json_str, '$.id') AS sid
FROM
	(
		SELECT
			json_str
		FROM
			(
				SELECT
					split(regexp_replace(regexp_extract(event_param_json -- 获取data数组,格式[{json},{json}]
					, '^\\[(.+)\\]$', 1) -- 删除字符串前后的[],格式{json},{json}
					, '\\}\\,\\{', '\\}\\|\\|\\{') -- 将josn字符串中的分隔符代换成||,格式{json}||{json}
					, '\\|\\|') AS json_list
				FROM
				abm.abm_wireless_exposure_log
				WHERE
					dt = '2020-09-06'
					AND page_id = 'ManChannel'
					-- and event_id = 'RecProExpo'
					AND event_param_json LIKE '%ass%'
			)
			a lateral VIEW explode(json_list) list_tab AS json_str
	)
	t

如图,运行中间sql,已经分割为多行单个json:

hive与presto解析json数组并拆分为多行_第1张图片

 

二、presto用法

 1、操作分析

是(replace(replace(replace(event_param_json, '[', ''), ']', ''), '},{', '}#{'), '#')

 分析:上面hive的正则在presto提取不到,所以json提起就换种方法。两次替换掉json中括号,后进行分割,逗号分割不行,逗号太多,而我们只想两个大括号之间的逗号做分割“ },{”,所以要把逗号替换成json数组中不存在的其它字符,之后再进行分割,如下,五步走:

1、 两次replace替换掉左右中括号[]

2、replace替换为逗号为自定义分隔符

3、split根据自定义分隔符分割

4、unnest行转列函数

5、json_extract_scalar解析json取字段值

 2、完整SQL

SELECT
    str_json,
	json_extract_scalar(str_json, '$.id') AS sid
FROM
	(
		SELECT
			event_param_json
		FROM
			abm.abm_wireless_exposure_log
		WHERE
			dt = '2020-09-06'
			AND event_id = 'NecProExpo'
			AND event_param_json LIKE '%ass%'
	)
CROSS JOIN unnest(SPLIT(REPLACE(REPLACE(REPLACE(event_param_json, '[', ''), ']', ''), '},{', '}#{'), '#')) AS t(str_json)

 运行结果如下图,json数组就分割为单个json,提取其中某个值了

hive与presto解析json数组并拆分为多行_第2张图片

 

参考:SELECT — Presto 0.273.3 Documentation

你可能感兴趣的:(Hive,Presto,hive,json,big,data)