Reposted from: http://blog.csdn.net/xia7139/article/details/10195849
Running the following commands installed lxml successfully:
apt-get install python2.6-dev
apt-get install libxml2-dev
apt-get install libxslt1-dev
easy_install lxml
In addition, to install Python IDLE:
apt-get install idle
- Example: dblp.xml (a fragment of the DBLP data)
- <?xml version='1.0' encoding='utf-8'?>
- <dblp>
- <article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
- <author>Stefano Bellucci</author>
- <author>Sergio Ferrara</author>
- <author>Alessio Marrani</author>
- <author>Armen Yeranyan</author>
- <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
- <pages>507-555</pages>
- <year>2008</year>
- <volume>10</volume>
- <journal>Entropy</journal>
- <number>4</number>
- <ee>http://dx.doi.org/10.3390/e10040507</ee>
- <url>db/journals/entropy/entropy10.html#BellucciFMY08</url>
- </article>
- <article mdate="2013-03-04" key="journals/entropy/Knuth13">
- <author>Kevin H. Knuth</author>
- <title><i>Entropy</i> Best Paper Award 2013.</title>
- <pages>698-699</pages>
- <year>2013</year>
- <volume>15</volume>
- <journal>Entropy</journal>
- <number>2</number>
- <ee>http://dx.doi.org/10.3390/e15020698</ee>
- <url>db/journals/entropy/entropy15.html#Knuth13</url>
- </article>
- </dblp>
1. Parse the XML into a tree and get the root of the tree.
To parse the XML into a tree structure and get its root, do the following:
- from lxml import etree
- tree = etree.parse("dblp.xml")
- root = tree.getroot()
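When the XML is already in memory rather than in a file, the same tree can be built from a byte string. The following is a minimal sketch, not part of the original post: the SAMPLE data is a shortened, illustrative copy of the fragment above, and the try/except import is an assumption that falls back to the standard library's ElementTree (which offers the same calls used here) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is unavailable.
    import xml.etree.ElementTree as etree

# Illustrative sample, shortened from the dblp.xml fragment above.
SAMPLE = b"""<?xml version='1.0' encoding='utf-8'?>
<dblp>
<article mdate="2013-03-04" key="journals/entropy/Knuth13">
<author>Kevin H. Knuth</author>
<year>2013</year>
</article>
</dblp>
"""

# fromstring() parses the bytes and returns the root element directly,
# so no separate getroot() call is needed.
root = etree.fromstring(SAMPLE)

print(root.tag)            # name of the root element
print(len(root))           # number of direct children of the root
print(root[0].get("key"))  # attribute of the first child
```

Passing bytes rather than a text string matters here: lxml refuses a text string that carries an encoding declaration.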
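Also, if the XML data contains a DTD declaration (as in the example below), the lxml parser must be told to load the DTD, so that entities defined in the DTD can be resolved.
- Example of an XML file that contains a DTD declaration: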
- hadoop@hadoop:~/20130722dblpxml$ head -15 dblp.xml
- <?xml version="1.0" encoding="ISO-8859-1"?>
- <!DOCTYPE dblp SYSTEM "dblp.dtd">
- <dblp>
- <article mdate="2002-01-03" key="persons/Codd71a">
- <author>E. F. Codd</author>
- <title>Further Normalization of the Data Base Relational Model.</title>
- <journal>IBM Research Report, San Jose, California</journal>
- <volume>RJ909</volume>
- <month>August</month>
- <year>1971</year>
- <cdrom>ibmTR/rj909.pdf</cdrom>
- <ee>db/labs/ibm/RJ909.html</ee>
- </article>
- </dblp>
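In this case, to parse the XML into a tree and get its root, create a parser that loads the DTD and pass it to etree.parse:
- from lxml import etree
- # load_dtd=True tells the parser to read the DTD referenced by the document
- parser = etree.XMLParser(load_dtd=True)
- tree = etree.parse("dblp.xml", parser)
- root = tree.getroot()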
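2. Traverse the tree to get each element's attributes and child elements.
- for article in root:
-     print "Element name:", article.tag
-     # Each child element is one field of the article (author, title, ...)
-     for field in article:
-         print field.tag, ":", field.text
-     # Attributes are read with get()
-     mdate = article.get("mdate")
-     key = article.get("key")
-     print "mdate:", mdate
-     print "key:", key
-     print ""
With this much, simple XML data can already be parsed.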
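As a small variation on the traversal loop above, the following sketch collects each article into a dictionary instead of printing it, which is often more convenient for later processing. The helper name article_to_record and the inline SAMPLE data are illustrative assumptions, not part of the original post. The import falls back to the standard library's ElementTree if lxml is not installed, and print is called with a single argument so the code behaves the same under Python 2 and Python 3.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is unavailable.
    import xml.etree.ElementTree as etree

# Illustrative sample, shortened from the dblp.xml fragment.
SAMPLE = b"""<dblp>
<article mdate="2012-11-28" key="journals/entropy/BellucciFMY08">
<author>Stefano Bellucci</author>
<author>Sergio Ferrara</author>
<year>2008</year>
</article>
</dblp>
"""


def article_to_record(article):
    """Hypothetical helper: turn one <article> element into a dict."""
    # Attributes are read with get(), exactly as in the loop above.
    record = {"mdate": article.get("mdate"), "key": article.get("key")}
    for field in article:
        # A tag such as <author> can repeat, so every field maps to a list.
        record.setdefault(field.tag, []).append(field.text)
    return record


root = etree.fromstring(SAMPLE)
records = [article_to_record(article) for article in root]

for record in records:
    print("%s: %s" % (record["key"], ", ".join(record["author"])))
```

Storing every field as a list keeps repeated tags such as author without special-casing them. This sketch assumes no child element is itself named mdate or key, which holds for the DBLP fragment shown here.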
3. An example of parsing XML data
The following code parses the dblp.xml data shown at the beginning of this article.
- from lxml import etree
- tree = etree.parse("dblp.xml")
- root = tree.getroot()
-
- for article in root:
-     print "Element name:", article.tag
-     for field in article:
-         print field.tag, ":", field.text
-     mdate = article.get("mdate")
-     key = article.get("key")
-     print "mdate:", mdate
-     print "key:", key
-     print ""
It produces the following output:
- Element name: article
- author : Stefano Bellucci
- author : Sergio Ferrara
- author : Alessio Marrani
- author : Armen Yeranyan
- title : ES
- pages : 507-555
- year : 2008
- volume : 10
- journal : Entropy
- number : 4
- ee : http://dx.doi.org/10.3390/e10040507
- url : db/journals/entropy/entropy10.html#BellucciFMY08
- mdate: 2012-11-28
- key: journals/entropy/BellucciFMY08
-
- Element name: article
- author : Kevin H. Knuth
- title : None
- pages : 698-699
- year : 2013
- volume : 15
- journal : Entropy
- number : 2
- ee : http://dx.doi.org/10.3390/e15020698
- url : db/journals/entropy/entropy15.html#Knuth13
- mdate: 2013-03-04
- key: journals/entropy/Knuth13
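4. Handling elements that contain both sub-elements and text
As the example above shows, the content printed for the title elements is wrong. Each title element contains both a sub-element and text (see below), and .text returns only the text that precedes the first sub-element, so it cannot return the full title. In the example above, the first article's title came back as just ES, and the second article's title came back as None.
- <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
- <title><i>Entropy</i> Best Paper Award 2013.</title>
Because the sub-elements in this example are simple, the workaround used here is to print the sub-elements together with the text. The code is as follows: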
- from lxml import etree
- tree = etree.parse("dblp.xml")
- root = tree.getroot()
-
- for article in root:
-     print "Element name:", article.tag
-     for field in article:
-         if field.tag == "title":
-             # Serialize the whole title element, including its sub-elements
-             print field.tag, ":", etree.tostring(field, encoding='utf-8', pretty_print=False)
-         else:
-             print field.tag, ":", field.text
-     mdate = article.get("mdate")
-     key = article.get("key")
-     print "mdate:", mdate
-     print "key:", key
-     print ""
The output is as follows:
- Element name: article
- author : Stefano Bellucci
- author : Sergio Ferrara
- author : Alessio Marrani
- author : Armen Yeranyan
- title : <title>ES<sup>2</sup>: A cloud data storage system for supporting both OLTP and OLAP.</title>
-
- pages : 507-555
- year : 2008
- volume : 10
- journal : Entropy
- number : 4
- ee : http://dx.doi.org/10.3390/e10040507
- url : db/journals/entropy/entropy10.html#BellucciFMY08
- mdate: 2012-11-28
- key: journals/entropy/BellucciFMY08
-
- Element name: article
- author : Kevin H. Knuth
- title : <title><i>Entropy</i> Best Paper Award 2013.</title>
-
- pages : 698-699
- year : 2013
- volume : 15
- journal : Entropy
- number : 2
- ee : http://dx.doi.org/10.3390/e15020698
- url : db/journals/entropy/entropy15.html#Knuth13
- mdate: 2013-03-04
- key: journals/entropy/Knuth13
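Admittedly, this is a clumsy way to solve the problem: the tags and other unwanted parts of the title content still have to be stripped out afterwards with string processing. What is really needed is a simple way to get an element's text and its sub-elements separately. If you know a better solution, suggestions are welcome.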
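One possible answer to the question above, offered as a sketch rather than as the original author's method: in the ElementTree data model that lxml follows, text before the first sub-element is stored in the parent's .text, and text that follows a sub-element's closing tag is stored in that sub-element's .tail. The itertext() method walks all of these pieces in document order, so joining them yields the title without any tags. The helper name full_text is an illustrative assumption, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is unavailable.
    import xml.etree.ElementTree as etree


def full_text(element):
    """Hypothetical helper: all text inside element, with sub-element tags dropped."""
    return "".join(element.itertext())


# The two title elements from the dblp.xml fragment.
title1 = etree.fromstring(
    b"<title>ES<sup>2</sup>: A cloud data storage system "
    b"for supporting both OLTP and OLAP.</title>"
)
title2 = etree.fromstring(b"<title><i>Entropy</i> Best Paper Award 2013.</title>")

# .text holds only the text that precedes the first sub-element.
print(repr(title1.text))
print(repr(title2.text))

# Text after a sub-element's closing tag lives in that sub-element's .tail.
sup = title1[0]
print(repr(sup.text))
print(repr(sup.tail))

# itertext() yields every text piece in document order.
print(full_text(title1))
print(full_text(title2))
```

In the traversal loop, printing full_text(field) in place of field.text would then handle title and the plain fields in the same way, since itertext() on an element without sub-elements yields just its own text.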