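First, SAX parsing is the most intuitive approach. It can also tolerate some errors in the XML file, in the sense that the handler still receives every event that precedes the malformed part before the parser reports the error.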
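To make the point about errors concrete, here is a minimal sketch, assuming Python 3 and the standard xml.sax module. The TagRecorder class and the BROKEN_XML string are hypothetical names invented for this illustration; they are not part of the example that follows. The sketch only shows that events seen before the malformed spot still reach the handler, and that the parser then raises SAXParseException.

```python
import xml.sax
import xml.sax.handler


class TagRecorder(xml.sax.handler.ContentHandler):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attributes):
        self.seen.append(name)


# Deliberately malformed: the second <book> is never closed.
BROKEN_XML = (
    b"<catalog>"
    b"<book><title>A</title></book>"
    b"<book><title>B</title>"
    b"</catalog>"
)

recorder = TagRecorder()
error_message = None
try:
    xml.sax.parseString(BROKEN_XML, recorder)
except xml.sax.SAXParseException as exc:
    # The default error handler raises on fatal (well-formedness) errors.
    error_message = exc.getMessage()

print(recorder.seen)   # ['catalog', 'book', 'title', 'book', 'title']
print(error_message)   # the parser's description of the error
```

So the tolerance is limited: the parse does stop at the first well-formedness error, but whatever the handler collected up to that point is still usable.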
Start with an XML file, book.xml:
<catalog>
    <book isbn="0-596-00128-2">
        <title>Python &amp; XML</title>
        <author>Jones, Drake</author>
    </book>
    <book isbn="0-596-00085-5">
        <title>Programming Python</title>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-00281-5">
        <title>Learning Python</title>
        <author>Lutz, Ascher</author>
    </book>
    <book isbn="0-596-00797-3">
        <title>Python Cookbook</title>
        <author>Martelli, Ravenscroft, Ascher</author>
    </book>
    <!-- imagine more entries here -->
</catalog>
Then write a BookHandler, as follows:
# -*- coding: utf-8 -*-
import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = 0                           # handle XML parser events
        self.mapping = {}                          # a state machine model

    def startElement(self, name, attributes):
        if name == "book":                         # on start book tag
            self.buffer = ""                       # save ISBN for dict key
            self.isbn = attributes["isbn"]
        elif name == "title":                      # on start title tag
            self.inTitle = 1                       # save title text to follow

    def characters(self, data):
        if self.inTitle:                           # on text within tag
            self.buffer += data                    # save text if in title

    def endElement(self, name):
        if name == "title":
            self.inTitle = 0                       # on end title tag
            self.mapping[self.isbn] = self.buffer  # store title text in dict

import xml.sax
import pprint

parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('book.xml')
pprint.pprint(handler.mapping)
The result is as follows:
{u'0-596-00085-5': u'Programming Python',
 u'0-596-00128-2': u'Python & XML',
 u'0-596-00281-5': u'Learning Python',
 u'0-596-00797-3': u'Python Cookbook'}
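That said, this is a fairly simple case. Also note that every key and value in the result is a unicode string (the u'' prefix), which is how Python 2 represents the text handed back by the parser.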
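For comparison, here is a minimal sketch of the same idea, assuming Python 3.7 or later, where every string is already Unicode and the u'' prefix no longer shows up. The in-memory SAMPLE_XML document is a shortened stand-in for book.xml, made up for this illustration, and xml.sax.parseString replaces the file-based parser so the snippet runs on its own.

```python
import xml.sax
import xml.sax.handler


class BookHandler(xml.sax.handler.ContentHandler):
    """Maps each book's ISBN to its title, as in the handler above."""

    def __init__(self):
        super().__init__()
        self.inTitle = False
        self.mapping = {}

    def startElement(self, name, attributes):
        if name == "book":
            self.buffer = ""                  # reset the title text
            self.isbn = attributes["isbn"]    # remember the dict key
        elif name == "title":
            self.inTitle = True

    def characters(self, data):
        # May be called several times for one text node (for example
        # around "&amp;"), so the pieces are accumulated.
        if self.inTitle:
            self.buffer += data

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
            self.mapping[self.isbn] = self.buffer


# Assumption: a two-entry excerpt standing in for book.xml.
SAMPLE_XML = b"""<catalog>
  <book isbn="0-596-00128-2"><title>Python &amp; XML</title></book>
  <book isbn="0-596-00085-5"><title>Programming Python</title></book>
</catalog>"""

handler = BookHandler()
xml.sax.parseString(SAMPLE_XML, handler)
print(handler.mapping)
# {'0-596-00128-2': 'Python & XML', '0-596-00085-5': 'Programming Python'}
```

The keys and values here are plain str objects, so nothing needs to be converted before printing or comparing them.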