lxml Revisited

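Common abbreviations

  1. API: Application Programming Interface
  2. DOM: Document Object Model
  3. HTML: HyperText Markup Language
  4. SAX: Simple API for XML
  5. XML: Extensible Markup Language
  6. XPath: XML Path Language
  7. XSLT: Extensible Stylesheet Language Transformations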

Using the iterparse method

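The lxml parse method reads the entire document and builds a tree in memory. Compared with cElementTree, an lxml tree costs somewhat more memory, because it keeps more information about each node's context, including a reference to its parent node. Parsing a 2 GB document this way on a machine with 2 GB of RAM pushes the machine into swap, which severely degrades performance.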
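As a small illustration of the whole-tree approach, the sketch below parses an in-memory document with etree.parse and follows the parent reference that lxml keeps on each node. The sample XML and the variable names are made up for this example.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample document; any well-formed XML would do.
xml_bytes = b"<root><item id='1'>a</item><item id='2'>b</item></root>"

# parse() reads the entire input and builds the full tree in memory.
tree = etree.parse(BytesIO(xml_bytes))
root = tree.getroot()

# Each lxml element keeps a reference to its parent node.
first = root[0]
parent_tag = first.getparent().tag

# With the whole tree in memory, any element can be reached at any time.
item_ids = [item.get("id") for item in root.iter("item")]
print(parent_tag, item_ids)
```

This is convenient for small inputs; the memory cost grows with the size of the document.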

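lxml's iterparse method is an extension of the ElementTree API. It returns a Python iterator that yields (event, element) pairs for the selected elements. It accepts two useful arguments: a tuple of events to monitor, and a tag name.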
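A minimal sketch of those two arguments, assuming a small made-up document held in memory: events selects which parser events are reported, and tag restricts the reported elements to one tag name. Calling clear() after each element is a common way to release memory that is no longer needed; it is an addition here, not something the text above prescribes.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample data for illustration only.
xml_bytes = (
    b"<data>"
    b"<country name='A'><year>2008</year></country>"
    b"<country name='B'><year>2011</year></country>"
    b"</data>"
)

names = []
# Report only 'end' events, and only for <country> elements.
for event, elem in etree.iterparse(BytesIO(xml_bytes), events=("end",), tag="country"):
    # At the 'end' event the element's children have been parsed completely.
    names.append((elem.get("name"), elem.findtext("year")))
    # Drop the element's children, text and attributes once it has been used.
    elem.clear()
print(names)
```

The attributes are read before clear() is called, because clear() removes them along with the children.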

Using the start and end values of the iterparse events parameter

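The choice between them decides whether outer elements are visited first (start) or inner elements are visited first (end). Take the following document, foo.xml: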

<level-1>
    <level-2-1>
        <level-3-1></level-3-1>
        <level-3-2></level-3-2>
    </level-2-1>
    <level-2-2>
        <level-3-3></level-3-3>
        <level-3-4></level-3-4>
    </level-2-2>
</level-1>
  1. start
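from lxml.etree import iterparse

# lxml's iterparse reads bytes from file objects, so open in binary mode
with open('foo.xml', 'rb') as xml:
    # 'start' fires when an opening tag is seen: outer elements come first
    for event, element in iterparse(xml, events=('start',)):
        print(element.tag)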
-----------------------------------------
level-1
level-2-1
level-3-1
level-3-2
level-2-2
level-3-3
level-3-4
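  2. end
# lxml's iterparse reads bytes from file objects, so open in binary mode
with open('foo.xml', 'rb') as xml:
    # 'end' fires when a closing tag is seen: inner elements come first
    for event, element in iterparse(xml, events=('end',)):
        print(element.tag)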
---------------------------------------------
level-3-1
level-3-2
level-2-1
level-3-3
level-3-4
level-2-2
level-1
  3. start and end
<?xml version="1.0"?>
<data>
    <country name="china">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="hangkong">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
------------------------------------------------------------
('start', <Element 'data' at 0x5d116a0>)
('start', <Element 'country' at 0x5d11c88>)
('start', <Element 'rank' at 0x5d11cc0>)
('end', <Element 'rank' at 0x5d11cc0>)
('start', <Element 'year' at 0x5d11d30>)
('end', <Element 'year' at 0x5d11d30>)
('start', <Element 'gdppc' at 0x5d11d68>)
('end', <Element 'gdppc' at 0x5d11d68>)
('start', <Element 'neighbor' at 0x5d11da0>)
('end', <Element 'neighbor' at 0x5d11da0>)
('start', <Element 'neighbor' at 0x5d11dd8>)
('end', <Element 'neighbor' at 0x5d11dd8>)
('end', <Element 'country' at 0x5d11c88>)
('start', <Element 'country' at 0x5d11e10>)
('start', <Element 'rank' at 0x5d11e48>)
('end', <Element 'rank' at 0x5d11e48>)
('start', <Element 'year' at 0x5d11e80>)
('end', <Element 'year' at 0x5d11e80>)
('start', <Element 'gdppc' at 0x5d11eb8>)
('end', <Element 'gdppc' at 0x5d11eb8>)
('start', <Element 'neighbor' at 0x5d11ef0>)
('end', <Element 'neighbor' at 0x5d11ef0>)
('end', <Element 'country' at 0x5d11e10>)
('start', <Element 'country' at 0x5d11f28>)
('start', <Element 'rank' at 0x5d11f60>)
('end', <Element 'rank' at 0x5d11f60>)
('start', <Element 'year' at 0x5d11f98>)
('end', <Element 'year' at 0x5d11f98>)
('start', <Element 'gdppc' at 0x5d11fd0>)
('end', <Element 'gdppc' at 0x5d11fd0>)
('start', <Element 'neighbor' at 0x43a3390>)
('end', <Element 'neighbor' at 0x43a3390>)
('start', <Element 'neighbor' at 0x43a3cc0>)
('end', <Element 'neighbor' at 0x43a3cc0>)
('end', <Element 'country' at 0x5d11f28>)
('end', <Element 'data' at 0x5d116a0>)

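A start event fires when a tag opens, and an end event fires when it closes. An element's text and children are only guaranteed to be complete at its end event.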
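The listing for the combined case does not include the code that produced it, so here is a minimal sketch of how both events could be requested at once. The shortened sample document, the depth counter and the trace list are assumptions made for illustration; the import falls back to the standard library's ElementTree, whose iterparse accepts the same events argument, in case lxml is not installed.

```python
from io import BytesIO

try:
    from lxml.etree import iterparse
except ImportError:
    # Assumption: fall back to the stdlib parser if lxml is unavailable.
    from xml.etree.ElementTree import iterparse

# Hypothetical, shortened version of the sample document.
xml_bytes = b"<data><country><rank>2</rank></country></data>"

trace = []
depth = 0
for event, elem in iterparse(BytesIO(xml_bytes), events=("start", "end")):
    if event == "start":
        # Opening tag: record it, then go one level deeper.
        trace.append("  " * depth + "start " + elem.tag)
        depth += 1
    else:
        # Closing tag: come back up one level, then record it.
        depth -= 1
        trace.append("  " * depth + "end " + elem.tag)
print("\n".join(trace))
```

The indentation makes the nesting visible: every start line is matched by an end line at the same depth.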

Related reading: Python: parsing XML with event-driven SAX
