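Converting between XML, JSON, dict, and string in Python with xmltodict

xmltodict: converting between XML and JSON

Source: https://github.com/martinblech/xmltodict

Development work often calls for converting between string, XML, JSON, and dict objects. xmltodict, together with the methods described here, covers all of these conversions.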

XML file conversion workflow

Note: the code below only illustrates the logic and cannot be run as-is.

import json
import os
import time
import lxml
from lxml import etree
import xmltodict, sys, gc

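# Parse the XML file incrementally (osmfile, fast_iter, process_element and maxline are not defined in this excerpt)
context = etree.iterparse(osmfile, tag=["node", "way", "relation"])
fast_iter(context, process_element, maxline)
...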

# Convert the XML element to a string
elem_data = etree.tostring(elem)

# Build a dict from the string
elem_dict = xmltodict.parse(elem_data)

# Produce a JSON string from the dict
elem_jsonStr = json.dumps(elem_dict)

# Load the JSON string back into a Python object
json_obj = json.loads(elem_jsonStr)
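
The workflow above calls fast_iter and process_element without defining them. Here is a minimal sketch of what such helpers might look like, assuming the usual pattern of handling each element as soon as it is complete and then clearing it to release memory. To stay runnable without extra dependencies it uses the standard library's xml.etree.ElementTree.iterparse rather than lxml's, so the tag filtering is done by hand. The sample data, the helper signatures, and the max_items parameter are all assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a large OSM file.
SAMPLE = b"""<osm>
  <node id="1" lat="52.5" lon="13.4"/>
  <way id="10"><nd ref="1"/></way>
  <bounds minlat="52.0"/>
  <relation id="100"/>
</osm>"""

WANTED_TAGS = {"node", "way", "relation"}


def fast_iter(context, func, max_items=None):
    """Call func on each wanted element, then clear it to free memory."""
    count = 0
    for _event, elem in context:
        if elem.tag not in WANTED_TAGS:
            # Leave other elements alone: they may be children of a wanted one.
            continue
        func(elem)
        count += 1
        # The element has been handled, so drop its children and attributes.
        elem.clear()
        if max_items is not None and count >= max_items:
            break
    return count


seen = []


def process_element(elem):
    # Record tag, id and child count; real code would convert the element here.
    seen.append((elem.tag, elem.get("id"), len(elem)))


context = ET.iterparse(io.BytesIO(SAMPLE), events=("end",))
processed = fast_iter(context, process_element)
print(processed, seen)
```

Only "end" events are requested, so each element is complete, children included, when process_element sees it.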

Iterative XML parsing

Reading XML data iteratively with etree (low resource usage): http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Library for converting XML strings to JSON objects: https://github.com/martinblech/xmltodict

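By default, xmltodict.parse() prefixes attribute names with @ and stores element text under the #text key. These characters cause problems in Spark queries, so they need to be removed. The following settings do that: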

xmltodict.parse(elem_data,attr_prefix="",cdata_key="")
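
To make the effect of these two parameters concrete, the sketch below parses one small element three ways. The sample XML is made up for illustration. Note that with cdata_key="" the element text ends up under an empty-string key. The third call uses "text" instead, which is an assumption on my part, in case empty field names are also unwelcome downstream.

```python
import xmltodict

# Hypothetical sample element with one attribute and some text.
xml = '<plus a="complex">element as well</plus>'

# Default keys: "@" prefix for attributes, "#text" for the text content.
default = xmltodict.parse(xml)

# Settings from the text above: no prefix, empty-string key for the text.
stripped = xmltodict.parse(xml, attr_prefix="", cdata_key="")

# Variation (assumption): give the text a plain, non-empty key.
named = xmltodict.parse(xml, attr_prefix="", cdata_key="text")

print(default)
print(stripped)
print(named)
```

With attr_prefix="", an attribute and a child element that share a name are stored under the same key, so check your data for such collisions.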

Encoding and recovery of malformed XML files

As follows:

from io import BytesIO
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
# your_xml_string is a str; the parser is fed bytes (an open file object also works)
tree = etree.parse(BytesIO(your_xml_string.encode('utf-8')), magical_parser)
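
As a small illustration of what recover=True changes, the sketch below feeds the same malformed bytes to a strict parser and to a recovering one. The broken document is a made-up example. The exact shape of the recovered tree is decided by libxml2, so inspect the result rather than assume it.

```python
from lxml import etree

# Hypothetical malformed input: the <b> element is never closed.
broken = b"<root><a>1</a><b>2</root>"

# A default (strict) parser rejects the document.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A recovering parser builds whatever tree it can from the same bytes.
recovering_parser = etree.XMLParser(encoding="utf-8", recover=True)
root = etree.fromstring(broken, recovering_parser)

print(strict_ok)
print(etree.tostring(root))
```

Recovery is best-effort and may silently drop or restructure content, so it is worth logging which files needed it.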

First convert the element to a string, then build a dict from it, and finally use json.dumps() to produce the JSON string.

elem_data = etree.tostring(elem)
elem_dict = xmltodict.parse(elem_data)
elem_jsonStr = json.dumps(elem_dict)

json.loads(elem_jsonStr) then turns the JSON string back into a Python object (nested dicts and lists) that can be manipulated in code.
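
The whole chain can be tried end to end with a small element. This sketch uses the standard library's xml.etree.ElementTree in place of lxml so that xmltodict is the only third-party dependency. The sample element is invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

import xmltodict

# Hypothetical element, similar in shape to an OSM node.
elem = ET.fromstring('<node id="1" lat="52.5"><tag k="name" v="Berlin"/></node>')

elem_data = ET.tostring(elem)           # element -> bytes
elem_dict = xmltodict.parse(elem_data)  # bytes -> dict
elem_json_str = json.dumps(elem_dict)   # dict -> JSON string
json_obj = json.loads(elem_json_str)    # JSON string -> dicts and lists

print(elem_json_str)
print(json_obj["node"]["tag"]["@v"])
```

All values come back as strings, because XML itself carries no type information. Convert numeric fields explicitly if needed.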

Using xmltodict

xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":

>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}

Namespace support

By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:

>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """

>>> xmltodict.parse(xml, process_namespaces=True) == {
...     'http://defaultns.com/:root': {
...         'http://defaultns.com/:x': '1',
...         'http://a.com/:y': '2',
...         'http://b.com/:z': '3',
...     }
... }
True

It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:

>>> namespaces = {
...     'http://defaultns.com/': None, # skip this namespace
...     'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
...     'root': {
...         'x': '1',
...         'ns_a:y': '2',
...         'http://b.com/:z': '3',
...     },
... }
True

Streaming mode

xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:

>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
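
The Discogs dump is not at hand for most readers, so here is a small self-contained sketch of the same streaming pattern on an in-memory document. The sample XML and the names list are assumptions for illustration. The callback receives the path to the item and the item itself, and it must return a true value for parsing to continue.

```python
import io

import xmltodict

# Hypothetical stand-in for a large dump: items of interest sit at depth 2.
xml = b"""<artists>
  <artist><name>A Perfect Circle</name></artist>
  <artist><name>King Crimson</name></artist>
</artists>"""

names = []


def handle_artist(path, artist):
    # path lists (tag, attributes) pairs from the root down to this item.
    names.append(artist["name"])
    return True  # a falsy return value would stop the parse


xmltodict.parse(io.BytesIO(xml), item_depth=2, item_callback=handle_artist)
print(names)
```

Each item is handed to the callback as soon as its closing tag is read, so the callback is where the results should be collected or written out.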

It can also be used from the command line to pipe objects to a script like this:

import sys, marshal
while True:
    # marshal reads binary data, so use the underlying byte stream of stdin
    _, article = marshal.load(sys.stdin.buffer)
    print(article['title'])

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...

Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:

$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz

And you reuse the dicts with every script that needs them:

$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...

Roundtripping

You can also convert in the other direction, using the unparse() function:

>>> mydict = {
...     'response': {
...             'status': 'good',
...             'last_updated': '2014-02-16T23:10:12Z',
...     }
... }

>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
    <status>good</status>
    <last_updated>2014-02-16T23:10:12Z</last_updated>
</response>

Text values for nodes can be specified with the cdata_key key in the Python dict, while node attributes can be specified with the attr_prefix prefixed to the key name in the Python dict. The default value for attr_prefix is @ and the default value for cdata_key is #text.

>>> import xmltodict
>>> 
>>> mydict = {
...     'text': {
...         '@color':'red',
...         '@stroke':'2',
...         '#text':'This is a test'
...     }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
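
A quick way to check a roundtrip is to unparse a dict and parse the result back. The sketch below does that with the dict from above, then adds a second made-up dict with an integer value to show one caveat: XML has no value types, so non-string values come back as strings.

```python
import xmltodict

mydict = {
    "text": {
        "@color": "red",
        "@stroke": "2",
        "#text": "This is a test",
    }
}

# dict -> XML string -> dict
xml = xmltodict.unparse(mydict)
roundtripped = xmltodict.parse(xml)
print(roundtripped == mydict)

# Hypothetical second example: the integer 2 is written out as text.
typed = xmltodict.parse(xmltodict.unparse({"count": 2}))
print(typed)
```

If exact types matter after a roundtrip, convert the values explicitly once parsing is done.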

Ok, how do I get it?

Using PyPI

You just need to:

$ pip install xmltodict

RPM-based distro (Fedora, RHEL, …)

There is an official Fedora package for xmltodict.

$ sudo yum install python-xmltodict

Arch Linux

There is an official Arch Linux package for xmltodict.

$ sudo pacman -S python-xmltodict

Debian-based distro (Debian, Ubuntu, …)

There is an official Debian package for xmltodict.

$ sudo apt install python-xmltodict

