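I recently needed to parse one column of a table in our data warehouse. The column holds JSON, and the JSON is nested several levels deep. The data looks like this:
Because this is production data, some values have been replaced with xxx. This does not affect how the SQL is written.
Sample: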
[{"categoryId":"9","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]},{"categoryId":"5","categoryName":"xxx","brandList":[{"brandId":"597","brandName":"xxx"}]},{"categoryId":"10","categoryName":"xxx","brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}]}]
[{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"516","brandName":"xxx"},{"brandId":"650","brandName":"xxx"},{"brandId":"586","brandName":"xxx"},{"brandId":"630","brandName":"xxx"}],"categoryId":"542","categoryName":"xxx"},{"brandList":[{"brandId":"752","brandName":"xxx"},{"brandId":"650","brandName":"xxx"}],"categoryId":"7","categoryName":"xxx"},{"brandList":[{"brandId":"529","brandName":"xxx","seriesList":[{"seriesId":"22","seriesName":"xxx"}]}],"categoryId":"10","categoryName":"xxx"}]
The requirement is to extract every brandName from each of these JSON values.
Approach 1:
Use Hive's built-in get_json_object function:
select get_json_object(brand_control, "$[0].brandList"),
       get_json_object(get_json_object(brand_control, "$[0].brandList"), "$[0].brandName")
from db_name.table_name
where dayid = '20190729'
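After trying this, it turned out that the query can only return a single brandName, not all of them: each path uses a fixed array index ([0]), so each call picks out exactly one element.
Approach 2:
Since the built-in JSON functions are not enough, write a custom UDF. The idea is simple: parse each incoming JSON value and, in a for loop, pull out the brandName values one by one.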
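The core of such a UDF might look roughly like the sketch below. It is only an illustration under stated assumptions: the class name BrandNameExtractor and the method extractBrandNames are made up for this example, the Hive UDF wrapper is left out, and a regular expression over the raw string stands in for a real JSON parser so that the code runs with the JDK alone. It does not handle escaped quotes inside a brand name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the logic a brandName-extracting UDF could delegate to.
class BrandNameExtractor {
    // Matches "brandName":"<value>" and captures <value>.
    // Assumption: values contain no escaped double quotes.
    private static final Pattern BRAND_NAME =
            Pattern.compile("\"brandName\"\\s*:\\s*\"([^\"]*)\"");

    // Returns every brandName value in the order it appears in the input.
    static List<String> extractBrandNames(String json) {
        List<String> names = new ArrayList<>();
        if (json == null) {
            return names; // a NULL column yields an empty list
        }
        Matcher m = BRAND_NAME.matcher(json);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "[{\"categoryId\":\"9\",\"brandList\":["
                + "{\"brandId\":\"597\",\"brandName\":\"A\"},"
                + "{\"brandId\":\"529\",\"brandName\":\"B\"}]}]";
        System.out.println(extractBrandNames(sample));
    }
}
```

Because the pattern matches the key wherever it occurs, the nesting depth of brandList does not matter. A production UDF would more likely parse the JSON properly and walk the arrays.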
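Approach 3:
In fact, a single Hive SQL statement can do the job, so there is no need to write a UDF. The idea is to split the JSON string on the literal brandName, explode the resulting fragments into rows with lateral view, take the quoted value at the start of each fragment, and collect the values per seller_id.
The SQL is as follows: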
select
    seller_id,
    collect_set(split(split(brand_name, '":"')[1], '"')[0]) as brand_name
from
    (select
        seller_id,
        brand_name,
        brand_control
    from db_name.table_name
    lateral view
        explode(
            split(brand_control, 'brandName')
        ) adTable as brand_name
    where dayid = '20190729'
    ) a
where (split(split(brand_name, '":"')[1], '"')[0] REGEXP '[^0-9.]') != 0 -- drop rows where the extracted brand_name is purely numeric
group by seller_id
This way, all the brandName values belonging to a seller_id are extracted and gathered into a single list.
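To see what the nested split calls do to a single brand_control value, the following Java sketch traces the same steps outside Hive. It is an approximation, not Hive itself: the class SplitTrace and the method brandNames are hypothetical names, a LinkedHashSet stands in for collect_set (Hive makes no promise about element order), and a missing array element is skipped where Hive would produce NULL.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical trace of the split / explode / filter steps for one row.
class SplitTrace {
    // Same pattern as the REGEXP filter: at least one char that is not a digit or dot.
    private static final Pattern NON_NUMERIC = Pattern.compile("[^0-9.]");

    static Set<String> brandNames(String brandControl) {
        Set<String> result = new LinkedHashSet<>();
        // split(brand_control, 'brandName') followed by explode: one fragment per row
        for (String fragment : brandControl.split("brandName", -1)) {
            // split(brand_name, '":"')[1]
            String[] parts = fragment.split("\":\"", -1);
            if (parts.length < 2) {
                continue; // Hive would return NULL, which the where clause drops
            }
            // split(..., '"')[0]
            String value = parts[1].split("\"", -1)[0];
            // where ... REGEXP '[^0-9.]': keep values that are not purely numeric
            if (NON_NUMERIC.matcher(value).find()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "[{\"brandList\":[{\"brandId\":\"752\",\"brandName\":\"A\"},"
                + "{\"brandId\":\"516\",\"brandName\":\"B\"}],"
                + "\"categoryId\":\"542\",\"categoryName\":\"xxx\"}]";
        System.out.println(brandNames(sample));
    }
}
```

The trace also shows why the numeric filter is needed. The fragment before the first brandName contains no brand at all, and for data shaped like the samples the value extracted from it is an ID such as 752 or 9. The filter removes that row, at the cost of also dropping any brand whose name consists only of digits and dots.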