lxml parse 方法读取整个文档并在内存中构建一个树。相对于 cElementTree,lxml 树的开销要高一些,因为它保持了更多有关节点上下文的信息,包括对其父节点的引用。使用这种方法解析一个 2G 的文档时,会使一个具有 2G RAM 的机器进入交换,这会大大影响性能。
lxml的iterparse方法是ElementTree API的扩展。 iterparse返回用于所选元素上下文的Python迭代器。 它接受两个有用的参数:要监视的事件的元组和标记名。
“首先访问外层 elements” 或“首先访问内层 elements”。
<level-1>
<level-2-1>
<level-3-1>level-3-1>
<level-3-2>level-3-2>
level-2-1>
<level-2-2>
<level-3-3>level-3-3>
<level-3-4>level-3-4>
level-2-2>
level-1>
from lxml.etree import iterparse
with open('foo.xml', 'r') as xml:
for event, element in iterparse(xml, events=['start']):
print(element.tag)
-----------------------------------------
level-1
level-2-1
level-3-1
level-3-2
level-2-2
level-3-3
level-3-4
with open('foo.xml', 'r') as xml:
for event, element in iterparse(xml, events=['end']):
print(element.tag)
---------------------------------------------
level-3-1
level-3-2
level-2-1
level-3-3
level-3-4
level-2-2
level-1
<?xml version="1.0"?>
<data>
<country name="china">
<rank updated="yes">2</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="hangkong">
<rank updated="yes">5</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank updated="yes">69</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
------------------------------------------------------------
('start', <Element 'data' at 0x5d11828>
('start', <Element 'data' at 0x5d116a0>)
('start', <Element 'country' at 0x5d11c88>)
('start', <Element 'rank' at 0x5d11cc0>)
('end', <Element 'rank' at 0x5d11cc0>)
('start', <Element 'year' at 0x5d11d30>)
('end', <Element 'year' at 0x5d11d30>)
('start', <Element 'gdppc' at 0x5d11d68>)
('end', <Element 'gdppc' at 0x5d11d68>)
('start', <Element 'neighbor' at 0x5d11da0>)
('end', <Element 'neighbor' at 0x5d11da0>)
('start', <Element 'neighbor' at 0x5d11dd8>)
('end', <Element 'neighbor' at 0x5d11dd8>)
('end', <Element 'country' at 0x5d11c88>)
('start', <Element 'country' at 0x5d11e10>)
('start', <Element 'rank' at 0x5d11e48>)
('end', <Element 'rank' at 0x5d11e48>)
('start', <Element 'year' at 0x5d11e80>)
('end', <Element 'year' at 0x5d11e80>)
('start', <Element 'gdppc' at 0x5d11eb8>)
('end', <Element 'gdppc' at 0x5d11eb8>)
('start', <Element 'neighbor' at 0x5d11ef0>)
('end', <Element 'neighbor' at 0x5d11ef0>)
('end', <Element 'country' at 0x5d11e10>)
('start', <Element 'country' at 0x5d11f28>)
('start', <Element 'rank' at 0x5d11f60>)
('end', <Element 'rank' at 0x5d11f60>)
('start', <Element 'year' at 0x5d11f98>)
('end', <Element 'year' at 0x5d11f98>)
('start', <Element 'gdppc' at 0x5d11fd0>)
('end', <Element 'gdppc' at 0x5d11fd0>)
('start', <Element 'neighbor' at 0x43a3390>)
('end', <Element 'neighbor' at 0x43a3390>)
('start', <Element 'neighbor' at 0x43a3cc0>)
('end', <Element 'neighbor' at 0x43a3cc0>)
('end', <Element 'country' at 0x5d11f28>)
('end', <Element 'data' at 0x5d116a0>)
start就是一个标签的开始,end就是一个标签的结尾