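Requirement: a supplier provided a txt data file with lines like the one below, and we want to parse it with Python. Each line has 10 columns.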
"CN","823923424","SET","KG","01/01/2011","优美的小调(钢琴曲)","塑料,制品",,"","个人梳妆旅行用具"
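Difficulties:
1. Fields are separated by commas, but a field's own content may also contain a comma, e.g. "塑料,制品"
2. Most fields are wrapped in double quotes, but some empty fields are not quoted at all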
Incorrect approaches that were tried:
1. Using replace and split
data = '"CN","823923424","SET","KG","01/01/2011","优美的小调(钢琴曲)","塑料,制品",,"","个人梳妆旅行用具"'
line_data = data.replace('"', '').split(',')
print(line_data)
Output:
['CN', '823923424', 'SET', 'KG', '01/01/2011', '优美的小调(钢琴曲)', '塑料', '制品', '', '', '个人梳妆旅行用具']
The comma inside the field "塑料,制品" was also treated as a separator, so the result has 11 columns instead of 10. Wrong!
2. Parsing with shlex
import shlex
data = '"CN","823923424","SET","KG","01/01/2011","优美的小调(钢琴曲)","塑料,制品",,"","个人梳妆旅行用具"'
# Named lexer rather than str, to avoid shadowing the built-in str type
lexer = shlex.shlex(data, posix=True)
lexer.whitespace = ','
tokens = list(lexer)
print(tokens)
Output:
['CN', '823923424', 'SET', 'KG', '01/01/2011', '优美的小调(钢琴曲)', '塑料,制品', '', '个人梳妆旅行用具']
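Only 9 columns were parsed: because the comma is treated as whitespace, the two consecutive commas collapse into one separator and the unquoted empty field is lost. Wrong!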
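The behaviour can be reproduced on a much smaller input. The following is a minimal sketch, not part of the original attempt; the helper name shlex_split_on_commas is made up for illustration. It suggests that shlex keeps a quoted empty field ("") but skips an unquoted one (,,).

```python
import shlex

def shlex_split_on_commas(text):
    # Hypothetical helper: same configuration as the attempt above,
    # with the comma as the only "whitespace" character.
    lexer = shlex.shlex(text, posix=True)
    lexer.whitespace = ','
    return list(lexer)

# Unquoted empty field: the two commas are skipped like repeated spaces.
print(shlex_split_on_commas('"a",,"b"'))

# Quoted empty field: the empty string survives as its own token.
print(shlex_split_on_commas('"a","","b"'))
```

So shlex is only usable here if every empty field is guaranteed to be quoted, which is not the case for this data.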
Final solutions:
1. Parsing with pyparsing
import pyparsing as pp
data = '"CN","823923424","SET","KG","01/01/2011","优美的小调(钢琴曲)","塑料,制品",,"","个人梳妆旅行用具"'
print(pp.commaSeparatedList.parseString(data))
csv_line = pp.commaSeparatedList.copy().addParseAction(pp.tokenMap(lambda s: s.strip('"')))
print(csv_line.parseString(data).asList())
Output (the first line keeps the surrounding quotes, the second has them stripped):
['"CN"', '"823923424"', '"SET"', '"KG"', '"01/01/2011"', '"优美的小调(钢琴曲)"', '"塑料,制品"', '', '""', '"个人梳妆旅行用具"']
['CN', '823923424', 'SET', 'KG', '01/01/2011', '优美的小调(钢琴曲)', '塑料,制品', '', '', '个人梳妆旅行用具']
2. Using the csv library
import csv
data = '"CN","823923424","SET","KG","01/01/2011","优美的小调(钢琴曲)","塑料,制品",,"","个人梳妆旅行用具"'
print(next(csv.reader(data.splitlines(), skipinitialspace=True)))
Output:
['CN', '823923424', 'SET', 'KG', '01/01/2011', '优美的小调(钢琴曲)', '塑料,制品', '', '', '个人梳妆旅行用具']
Solution 2 (csv) is more efficient than solution 1 (pyparsing), and csv is part of the standard library.
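For a whole file rather than a single string, the csv approach could be extended roughly as follows. This is only a sketch under a few assumptions: the function name parse_rows and the constant EXPECTED_COLUMNS are made up for illustration, the sample is fed in through io.StringIO instead of a real file, and the real file's path and encoding are not known here.

```python
import csv
import io

EXPECTED_COLUMNS = 10  # column count stated in the requirement

def parse_rows(lines, expected_columns=EXPECTED_COLUMNS):
    """Parse an iterable of text lines and check each record's column count."""
    rows = []
    for record_no, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        if len(row) != expected_columns:
            raise ValueError(
                f"record {record_no}: expected {expected_columns} columns, got {len(row)}"
            )
        rows.append(row)
    return rows

# Stand-in for the supplier file. With a real file one would pass the object
# returned by open(path, newline='', encoding=...), where path and encoding
# are assumptions that depend on the actual data.
sample = io.StringIO(
    '"CN","823923424","SET","KG","01/01/2011","优美的小调(钢琴曲)","塑料,制品",,"","个人梳妆旅行用具"\n'
)
rows = parse_rows(sample)
print(len(rows), len(rows[0]))
```

Checking the column count per record makes a malformed line fail loudly instead of silently shifting fields into the wrong columns.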