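Recently at work I ran into JSON strings nested several levels deep. They contain not only dictionary-style nesting (like Python dicts) but also nested arrays.
The examples are below. I list three here, but the real job processed about 5 million records. The nesting is fairly deep, though for a professional crawler engineer this would surely be trivial. Each JSON string describes the rules attached to one company, and the company id comes first (for confidentiality, the ids have been altered). I work at Didi, so the rules are naturally car-usage rules. Also for confidentiality, I won't explain what the rules mean; all you need to know is that one key-value pair in the JSON string corresponds to one rule. The requirement is to remove all the nesting, lay the rules out flat, and count the number of rules.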
1. "company_id":103619980061540 {"use_car_time":[],"use_car_position":[],"use_car_srv":[{"use_car_type":201,"require_level":[100,400,200]},{"use_car_type":301,"require_level":[600]},{"use_car_type":401,"require_level":[]},{"use_car_type":501,"require_level":[1000]}]}
2. "company_id":82920293004 {"use_car_position":{"cities_on":[{"id":5,"name":"\u676d\u5dde"}],"cities_off":[{"id":5,"name":"\u676d\u5dde"}],"cross_city":0},"use_car_srv":[{"use_car_type":"204","require_level":[100]},{"use_car_type":"203","require_level":[100]},{"use_car_type":"205","require_level":[100]},{"use_car_type":"206","require_level":[100]},{"use_car_type":"301","require_level":[600,900]},{"use_car_type":"305","require_level":[600,900]},{"use_car_type":"306","require_level":[600,900]}],"use_car_time":[]}
3. "company_id":9019188294800 {"use_car_position":[],"use_car_srv":[{"use_car_type":201,"require_level":[100]},{"use_car_type":301,"require_level":[600]}],"use_car_time":{"public":{"work":{"start_time":"00:00","end_time":"23:59"},"holiday":[]}}}
Here is the code, followed by an explanation.
import json
import csv
import pandas as pd


def flatten(jsonObj, result):
    """Walk a parsed JSON value and append one 'key:value' string per rule to result."""
    # Case 1: the current node is a dict
    if isinstance(jsonObj, dict):
        for key in jsonObj:
            value = jsonObj[key]
            # 'id' fields are not rules, so skip them
            if key == 'id':
                continue
            # The value is an array
            elif isinstance(value, list):
                if len(value) > 0:
                    if isinstance(value[0], dict):
                        # Array of dicts: recurse into it
                        flatten(value, result)
                    else:
                        # Array of scalars: keep the whole array as one rule
                        result.append(key + ':' + str(value))
                # Empty arrays carry no rule and are skipped
            # The value is a dict: recurse into it
            elif isinstance(value, dict):
                flatten(value, result)
            # The value is a scalar: record it as a rule
            else:
                result.append(key + ':' + str(value))
    # Case 2: the current node is an array, so flatten each element
    else:
        for item in jsonObj:
            flatten(item, result)
def parse(fpath):
    """Read (company_id, json) rows from a CSV file and return one row per company with its deduplicated rules."""
    id_list = []
    json_list = []
    with open(fpath, newline='') as csvfile:
        next(csvfile)  # skip the header line
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in reader:
            result = []
            try:
                flatten(json.loads(row[1]), result)
            except Exception:
                # Skip rows whose JSON cannot be parsed or traversed
                print('error')
                continue
            id_list.append(row[0])
            json_list.append(result)
    my_frame = pd.DataFrame({'company_id': id_list, 'json_info': json_list})
    # The code relies on sum concatenating the per-row lists, which merges all rules of one company
    my_frame2 = my_frame.groupby('company_id', as_index=False).sum()
    # Deduplicate the merged rules ('guize' is pinyin for 'rules')
    my_frame2['guize'] = my_frame2['json_info'].map(set)
    my_frame3 = my_frame2.drop(['json_info'], axis=1)
    # Number of distinct rules per company
    my_frame3['cnt'] = my_frame3['guize'].map(len)
    print(my_frame3)
    return my_frame3


my_frame3 = parse(r'C:\Users\wzywangzhongyuan_i\Desktop\hebing\all.csv')
my_frame3.to_csv(r'C:\Users\wzywangzhongyuan_i\Desktop\jsonfile\jsonbbb.csv', sep=',', header=True, index=True)
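Judging from the code, the input CSV has a header line, the company id in the first column and the JSON string in the second. The sketch below builds such a file in memory and reads it back the same way. The header names and the id 1001 are made up for illustration, since the real file is not shown.

```python
import csv
import io
import json

# Build a tiny in-memory CSV; csv.writer quotes the JSON field for us.
# The header names and the company id are hypothetical.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["company_id", "rule_json"])
writer.writerow(["1001", '{"use_car_srv":[{"use_car_type":201}]}'])

# Read it back: skip the header line, then parse column 1 as JSON.
buf.seek(0)
next(buf)
parsed = [(row[0], json.loads(row[1])) for row in csv.reader(buf)]
print(parsed)
```

The JSON string contains commas and double quotes, so it only survives as a single CSV field if it is quoted and its inner quotes are doubled, which csv.writer does by default.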
I solved this in Python. Since the requirement is to count the key-value pairs and also lay them out flat, the approach is to flatten first and then count.
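To make the "flatten first, then count" idea concrete, here is a compact sketch of the flattening step applied to example 3. It follows the same rules as the code in this post (skip id, skip empty arrays, recurse into dicts and arrays of dicts), but returning the list is a convenience I added for this illustration.

```python
import json

def flatten(obj, result):
    """Collect one 'key:value' string per leaf of a nested JSON value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == 'id':
                continue  # ids are not rules
            if isinstance(value, dict):
                flatten(value, result)
            elif isinstance(value, list):
                if value and isinstance(value[0], dict):
                    flatten(value, result)  # array of dicts: recurse
                elif value:
                    result.append(key + ':' + str(value))  # array of scalars
                # empty arrays are skipped
            else:
                result.append(key + ':' + str(value))
    else:
        for item in obj:
            flatten(item, result)
    return result  # added for convenience in this sketch

sample = ('{"use_car_position":[],"use_car_srv":[{"use_car_type":201,'
          '"require_level":[100]},{"use_car_type":301,"require_level":[600]}],'
          '"use_car_time":{"public":{"work":{"start_time":"00:00",'
          '"end_time":"23:59"},"holiday":[]}}}')
rules = flatten(json.loads(sample), [])
print(rules)
print(len(rules))  # 6
```

Example 3 flattens to six rules: two use_car_type entries, two require_level entries, start_time and end_time. The empty use_car_position and holiday arrays contribute nothing.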
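Flattening is a traversal of the parsed JSON that combines type checks with loops and recursion. Counting is done with a pandas DataFrame. Why? Because among the 5 million records, many share the same id but carry different JSON strings, so the rules for the same id have to be merged and the duplicate rules among them removed. The code uses groupby and sum to concatenate the rule lists of each id, deduplicates them with set, and then writes the DataFrame to a CSV file.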
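The merge-and-deduplicate step can also be illustrated without pandas. The sketch below is not the code from this post. It swaps in a plain defaultdict of sets to show what groupby, sum and set achieve together, and the company ids and rules are made-up sample values.

```python
from collections import defaultdict

# (company_id, flattened rules) pairs; the ids and rules are hypothetical.
rows = [
    ("1001", ["use_car_type:201", "require_level:[100]"]),
    ("1001", ["use_car_type:201", "require_level:[600]"]),
    ("1002", ["start_time:00:00"]),
]

# Merge the rules of each company into one set, which removes duplicates.
merged = defaultdict(set)
for company_id, rules in rows:
    merged[company_id].update(rules)

# Count the distinct rules per company.
counts = {company_id: len(rules) for company_id, rules in merged.items()}
print(counts)  # {'1001': 3, '1002': 1}
```

Company 1001 appears twice with one rule in common, so it ends up with three distinct rules rather than four.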