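Today I had to parse a very long JSON string and ran into all sorts of problems along the way, so here is a summary of everything to watch out for.
First, I have a string. The original was very long; a trimmed-down version looks like this: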
>>> s="{'product': u'\\u62c9\\u52fe\\u7f51', 'downtime': 3.128,
'monitors': [{'use': 100, 'monitorurl': u'http://oss.lagou.com','monitorweight': 10L,
'monitorname': u'\\u804c\\u4f4d\\u641c\\u7d22'}]}"
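This string evidently did not come from a proper json.dumps() call but from str(). The original data structure is built from dicts, lists, strings and long integers, and it also contains Chinese text stored as Unicode strings. That is: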
>>> origin={"product": u"\\u62c9\\u52fe\\u7f51", "downtime": 3.128,
"monitors": [{"use": 100, "monitorurl": u"http://oss.lagou.com","monitorweight": 10L,
"monitorname": u"\\u804c\\u4f4d\\u641c\\u7d22"}]}
>>> json.dumps(origin)
'{"product": "\\\\u62c9\\\\u52fe\\\\u7f51",
"monitors": [{"use": 100, "monitorweight": 10,
"monitorname": "\\\\u804c\\\\u4f4d\\\\u641c\\\\u7d22",
"monitorurl": "http://oss.lagou.com/"}], "downtime": 3.1280000000000001}'
>>> str(origin)
"{'product': u'\\\\u62c9\\\\u52fe\\\\u7f51',
'monitors': [{'use': 100, 'monitorweight': 10L,
'monitorname': u'\\\\u804c\\\\u4f4d\\\\u641c\\\\u7d22',
'monitorurl': u'http://oss.lagou.com'}], 'downtime': 3.128}"
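The difference between the two outputs above can be checked directly. The following is a minimal sketch written for Python 3 (the transcripts in this post are from Python 2.6); the sample data is a hypothetical, much smaller stand-in for the real structure.

```python
import json

# Hypothetical sample data, far smaller than the original structure.
origin = {"product": "lagou", "downtime": 3.128, "monitors": [{"use": 100}]}

as_json = json.dumps(origin)   # valid JSON: double-quoted keys and strings
as_repr = str(origin)          # Python repr: single-quoted keys and strings

# The JSON text round-trips cleanly.
restored = json.loads(as_json)

# The repr text is rejected by the JSON parser.
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:   # json.JSONDecodeError is a subclass of ValueError
    repr_is_json = False
```

Only the json.dumps() output survives the round trip; the str() output is Python syntax that merely looks similar to JSON.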
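Had the string been produced by json.dumps(origin), a plain json.loads() would turn it straight back into an object. A string produced by str() runs into several problems, summarized below: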
- Keys and string values inside the string must use double quotes, not single quotes. Single quotes raise: Expecting property name: line 1 column 1 (char 1)
>>> s1="{'a':'a'}";s2='{"a":"a"}'
>>> json.loads(s1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
obj, end = self._scanner.iterscan(s, **kw).next()
File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib64/python2.6/json/decoder.py", line 171, in JSONObject
raise ValueError(errmsg("Expecting property name", s, end))
ValueError: Expecting property name: line 1 column 1 (char 1)
>>> json.loads(s2)
{u'a': u'a'}
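- After str(), keys and values end up in single quotes whether they were originally written with single or double quotes, and the repr of the resulting string is wrapped in double quotes. So the single quotes have to be replaced with double quotes.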
>>> s={"a":"a"};str(s)
"{'a': 'a'}"
>>> s={'a':'a'};str(s)
"{'a': 'a'}"
>>> s={'a':'a'};s1=str(s)
>>> s1
"{'a': 'a'}"
>>> s2=s1.replace('\'','\"')
>>> s2
'{"a": "a"}'
>>> json.loads(s2)
{u'a': u'a'}
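A blanket quote replacement works for the data in this post, but it is worth knowing where it can break. The sketch below (Python 3, with a made-up value containing an apostrophe) shows one such case: repr switches to double quotes for that value, and the replacement then corrupts it.

```python
import json

# Hypothetical value containing an apostrophe.
data = {"note": "it's down"}

text = str(data)                   # {'note': "it's down"}
swapped = text.replace("'", '"')   # {"note": "it"s down"}

# The apostrophe has become a stray double quote, so parsing fails.
try:
    json.loads(swapped)
    parsed_ok = True
except ValueError:
    parsed_ok = False
```

If the values can contain quote characters, plain str.replace() is not enough and a real parser for Python literals is the safer route.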
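- Unicode strings keep their u prefix after str(), and it has to be removed. (In the session below the parser actually stops at the first single quote, but the u prefix is rejected just the same once the quotes have been fixed.)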
>>> s={'a':u'拉勾网'}
>>> s
{'a': u'\u62c9\u52fe\u7f51'}
>>> s1=str(s)
>>> s1
"{'a': u'\\u62c9\\u52fe\\u7f51'}"
>>> json.loads(s1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
obj, end = self._scanner.iterscan(s, **kw).next()
File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib64/python2.6/json/decoder.py", line 171, in JSONObject
raise ValueError(errmsg("Expecting property name", s, end))
ValueError: Expecting property name: line 1 column 1 (char 1)
>>>
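To isolate the effect of the u prefix from the quote problem, the sketch below (Python 3) starts from a string that already uses double quotes. It also shows a caveat of the literal replace('u"', '"') approach, using a made-up key that happens to end in the letter u.

```python
import json

# Double quotes are already correct here; only the u prefix remains.
with_prefix = '{"a": u"\\u62c9\\u52fe\\u7f51"}'
try:
    json.loads(with_prefix)
    prefix_ok = True
except ValueError:
    prefix_ok = False

# Dropping the prefix yields valid JSON; the \uXXXX escapes are decoded.
without_prefix = with_prefix.replace('u"', '"')
decoded = json.loads(without_prefix)

# Caveat: the same replace also eats a real trailing "u" in a key or value.
damaged = '{"menu": 1}'.replace('u"', '"')   # becomes {"men": 1}
```

The replacement is therefore only safe when no key or string value ends in the letter u.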
- Long integers keep their L suffix after str(), and that needs to be handled as well.
>>> s={"a":10L}
>>> s1=str(s)
>>> s1
"{'a': 10L}"
>>> s
{'a': 10L}
>>> s1='{"a":10L}'
>>> json.loads(s1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
obj, end = self._scanner.iterscan(s, **kw).next()
File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
rval, next_pos = action(m, context)
File "/usr/lib64/python2.6/json/decoder.py", line 193, in JSONObject
raise ValueError(errmsg("Expecting , delimiter", s, end - 1))
ValueError: Expecting , delimiter: line 1 column 7 (char 7)
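The L suffix can be removed with a regular expression that only touches an L directly following digits. This is a heuristic sketch for Python 3 with made-up sample text; it would also alter a string value that itself contains digits followed by L.

```python
import json
import re

# Hypothetical text: quotes are already fixed, only the L suffixes remain.
text = '{"a": 10L, "b": [100L, 3.5], "c": "XL"}'

# Drop an L only when it directly follows digits and ends the token.
cleaned = re.sub(r'(\d+)L\b', r'\1', text)
parsed = json.loads(cleaned)
```

The value "XL" is left alone because its L does not follow a digit.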
Finally, back to the complex string from the beginning.
>>> s="{'product': u'\\u62c9\\u52fe\\u7f51', 'downtime': 3.128, 'monitors': [{'use': 100L, 'monitorurl': u'http://oss.lagou.com','monitorweight': 10L,'monitorname': u'\\u804c\\u4f4d\\u641c\\u7d22'}]}"
>>> # Replace single quotes with double quotes
>>> s1=s.replace('\'','\"')
>>> s1
'{"product": u"\\u62c9\\u52fe\\u7f51", "downtime": 3.128, "monitors": [{"use": 100L, "monitorurl": u"http://oss.lagou.com","monitorweight": 10L,"monitorname": u"\\u804c\\u4f4d\\u641c\\u7d22"}]}'
>>> # Remove the unicode prefix u
>>> s2=s1.replace('u\"','\"')
>>> s2
'{"product": "\\u62c9\\u52fe\\u7f51", "downtime": 3.128, "monitors": [{"use": 100L, "monitorurl": "http://oss.lagou.com","monitorweight": 10L,"monitorname": "\\u804c\\u4f4d\\u641c\\u7d22"}]}'
>>> # First attempt: str.replace() matches literal text, not patterns, so nothing changes
>>> s3=s2.replace('..L','')
>>> s3
'{"product": "\\u62c9\\u52fe\\u7f51", "downtime": 3.128, "monitors": [{"use": 100L, "monitorurl": "http://oss.lagou.com","monitorweight": 10L,"monitorname": "\\u804c\\u4f4d\\u641c\\u7d22"}]}'
>>> # Remove the L suffix from long integers with a regular expression
>>> import re
>>> s3=re.sub(r'(\d+)L',r'\g<1>',s2)
>>> s3
'{"product": "\\u62c9\\u52fe\\u7f51", "downtime": 3.128, "monitors": [{"use": 100, "monitorurl": "http://oss.lagou.com","monitorweight": 10,"monitorname": "\\u804c\\u4f4d\\u641c\\u7d22"}]}'
>>> # Now json.loads() finally works.
>>> json.loads(s3)
{u'product': u'\u62c9\u52fe\u7f51', u'monitors': [{u'use': 100, u'monitorweight': 10, u'monitorname': u'\u804c\u4f4d\u641c\u7d22', u'monitorurl': u'http://oss.lagou.com'}], u'downtime': 3.1280000000000001}
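As an alternative to chaining string replacements, the standard library's ast.literal_eval() can parse Python literals directly, including single quotes and the u prefix. The sketch below is written for Python 3, where the L suffix is no longer valid syntax, so it is stripped first with the same heuristic regular expression. parse_py2_repr is a hypothetical helper name, not something from the original workflow.

```python
import ast
import json
import re

def parse_py2_repr(text):
    """Parse the str() of a Python 2 dict/list built from plain literals."""
    # Python 3 rejects the L suffix, so strip it first.
    # Heuristic: this would also alter digits followed by L inside a string.
    text = re.sub(r'\b(\d+)L\b', r'\1', text)
    # literal_eval understands single quotes and the u prefix natively.
    return ast.literal_eval(text)

s = ("{'product': u'\\u62c9\\u52fe\\u7f51', 'downtime': 3.128, "
     "'monitors': [{'use': 100L, 'monitorurl': u'http://oss.lagou.com',"
     "'monitorweight': 10L,'monitorname': u'\\u804c\\u4f4d\\u641c\\u7d22'}]}")

obj = parse_py2_repr(s)
as_json = json.dumps(obj)   # now a genuine JSON string
```

Because literal_eval() only evaluates literals, it avoids the quote and prefix pitfalls of str.replace() without executing arbitrary code the way eval() would.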