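Python -- Learning xml.etree.ElementTree
ElementTree is a lightweight, DOM-style XML parser in the Python standard library; it parses quickly and uses little memory.
The core of ElementTree is the Element class, a data structure designed to store hierarchical tag data.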
----------------------------------------------------------------------------------------------------------------------------
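1. First, the object being parsed, i.e. the structure of XML:
a. tag: the tag name, a string
b. attributes: the tag's attributes, stored as a dictionary
c. text: the tag's text value
d. child elements: the nested sub-tags
Element instances are created with the Element constructor or with SubElement. An ElementTree can hold many Elements; it can be serialized to XML and can be built by parsing XML.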
ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree.
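Building an XML document entirely by hand:
import xml.etree.ElementTree as ET
a = ET.Element('a')
b = ET.SubElement(a, 'b')
c = ET.SubElement(a, 'c')
d = ET.SubElement(c, 'd')
ET.dump(a)  # writes the serialized tree to stdout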
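The four parts listed above (tag, attributes, text, child elements) can be read back from any Element. The following is only a minimal sketch: the tag names, the `id` attribute and the text value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical names and values, chosen only for illustration.
a = ET.Element('a', {'id': '1'})      # a tag plus an attribute dictionary
b = ET.SubElement(a, 'b')
b.text = 'hello'                      # the text value of <b>
c = ET.SubElement(a, 'c')

print(a.tag)                          # the tag name, a str
print(a.attrib)                       # the attributes, a dict
print(b.text)                         # the text of <b>
print([child.tag for child in a])     # iterating an Element yields its direct children

# Serialize the tree back to an XML string.
xml_string = ET.tostring(a, encoding='unicode')
print(xml_string)
```

Passing `encoding='unicode'` makes `tostring` return a `str` rather than `bytes`, which is usually more convenient for printing.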
----------------------------------------------------------------------------------------------------------------------------
2. Steps for parsing XML:
Take the following country_xml as an example:
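<!-- element and attribute names follow the country_data.xml sample in the Python documentation -->
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
    </country>
</data>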
------------------------------------------------------------------------------------------------------------------------
import xml.etree.ElementTree as ET
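1. Loading XML data ---------- directly from an XML file:
tree = ET.parse("country.xml")  # the whole XML tree, an ElementTree object
root = tree.getroot()  # the root node, an Element object
Loading XML data ---------- from an XML string, which returns the root node directly:
root = ET.fromstring(country_as_string)  # country_as_string holds the XML text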
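As a rough illustration of the string route, the sketch below parses an inline document and wraps the resulting root in an ElementTree. The XML here is a cut-down, hypothetical document in the shape of the country example, not the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened data in the shape of the country example.
country_as_string = """<data>
    <country name="Liechtenstein">
        <year>2008</year>
    </country>
    <country name="Singapore">
        <year>2011</year>
    </country>
</data>"""

root = ET.fromstring(country_as_string)   # returns the root Element, not an ElementTree
tree = ET.ElementTree(root)               # wrap it when the tree-level API (e.g. write) is needed

print(root.tag)
print(len(root))                          # number of direct children
print(tree.getroot() is root)
```

`ET.parse` returns an ElementTree and needs a `getroot()` call, whereas `ET.fromstring` hands back the root Element directly, so the wrapping step is only needed when something like `tree.write(...)` is wanted.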
------------------------------------------------------------------------------------------------------------------------
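2. Finding data
The lookup methods are Element.iter('tag'), Element.findall('tag') and Element.find('tag'), where the argument is a tag name:
iter(): searches recursively, visiting the current node, its children, their children, and so on
findall(): given a plain tag name, searches only the direct children of the current node
find(): like findall(), but returns only the first match (or None if there is none); on the returned node, get('attribute_name') reads an attribute value
Example: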
#!/usr/bin/env python3
__author__ = 'JackZhous'

import sys
import xml.etree.ElementTree as ET


def script(xml_path, mode):
    tree = ET.parse(xml_path)
    node_root = tree.getroot()
    iter_mode = '1'
    if iter_mode == mode:
        # iter() matches <country> elements at any depth below the root
        for node in node_root.iter('country'):
            name = node.get('name')
            year = node.find('year').text
            print('name = ', name, 'year = ', year)
    else:
        # findall() matches only <country> elements that are direct children of the root
        for node in node_root.findall('country'):
            name = node.get('name')
            year = node.find('year').text
            print('name = ', name, 'year = ', year)


if __name__ == '__main__':
    if len(sys.argv) != 3:
        sys.exit('usage: %s <xml_path> <mode>' % sys.argv[0])
    print('script name:', sys.argv[0])
    print('argument 1:', sys.argv[1])
    print('argument 2:', sys.argv[2])
    script(sys.argv[1], sys.argv[2])
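The difference between the three lookup methods only shows up when matching elements sit at different depths. The sketch below uses a small hypothetical document (the `region` wrapper and the names `A` and `B` are invented) in which one `country` is nested one level deeper than the other.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the second <country> is nested inside a <region> element.
root = ET.fromstring("""<data>
    <country name="A"><year>2008</year></country>
    <region>
        <country name="B"><year>2011</year></country>
    </region>
</data>""")

# iter() walks the whole subtree in document order.
all_levels = [c.get('name') for c in root.iter('country')]

# findall() with a plain tag name looks at direct children only.
direct_only = [c.get('name') for c in root.findall('country')]

# find() returns the first direct-child match, or None when nothing matches.
first = root.find('country')
missing = root.find('city')

print(all_levels)
print(direct_only)
print(first.get('name'), first.find('year').text)
print(missing)
```

Because `find()` can return None, chaining `.text` or `.get(...)` directly onto it raises AttributeError when the element is absent, so checking the result first is the safer habit.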
------------------------------------------------------------------------------------------------------------------------
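3. Modifying XML data
After locating the data you are interested in as described in the previous step, you can change a node's text (element.text), add or change an attribute with set('attribute', 'value'), or delete a node with remove(element), which must be called on that node's direct parent. The last step is to write the result to a file with tree.write('country.xml'), where tree is the ElementTree object.
if name == 'Jackzhous':
    node1.remove(node)  # node1 is the direct parent of node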
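The three kinds of modification can be sketched together as follows. The document, the `updated` attribute and the "add one to the year" rule are all invented for illustration, and the result is serialized to a string instead of a file so that the sketch has no side effects.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
root = ET.fromstring(
    '<data>'
    '<country name="A"><year>2008</year></country>'
    '<country name="Jackzhous"><year>2011</year></country>'
    '</data>'
)

# findall() returns a list, so removing children while looping over it is safe.
for country in root.findall('country'):
    year = country.find('year')
    year.text = str(int(year.text) + 1)   # change the text value
    country.set('updated', 'yes')         # add or change an attribute
    if country.get('name') == 'Jackzhous':
        root.remove(country)              # remove() is called on the direct parent

result = ET.tostring(root, encoding='unicode')
print(result)

# To save to disk instead: ET.ElementTree(root).write('country.xml')
```

Removing elements while looping over `root.iter(...)` or over `root` itself would modify the sequence being iterated, which is why the sketch loops over the list returned by `findall()`.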
------------------------------------------------------------------------------------------------------------------------
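4. Handling XML that uses namespaces. For example, an Android manifest declares xmlns:android="http://schemas.android.com/apk/res/android"
A namespace groups a set of tag and attribute names so that they do not clash with identically named ones defined elsewhere.
The namespace can be supplied as a dictionary or as a plain string, e.g. dictionary = {'android': 'http://schemas.android.com/apk/res/android'}, or android_name = 'http://schemas.android.com/apk/res/android'
When searching, the former is passed along with a prefixed name, find('android:name', dictionary); the latter is written in braces in front of the local name, find('{' + android_name + '}name')
Searching on a namespaced attribute uses this brace notation, as follows:
android_name = 'http://schemas.android.com/apk/res/android'
To find the activity whose name="a.b.activity" in that namespace, use:
tree.find("./application/activity[@{" + android_name + "}name='a.b.activity']")
. stands for the current node, application/activity descends through those two nodes in turn, and the expression inside [] is the condition the element must satisfy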
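Both notations can be tried on small inline documents. In the sketch below the manifest is a tiny hypothetical one, and the second document (prefix `x`, URI `http://example.com/x`) is invented purely to show a namespaced element, since the manifest only has namespaced attributes.

```python
import xml.etree.ElementTree as ET

android_name = 'http://schemas.android.com/apk/res/android'

# A tiny, hypothetical manifest used only for this sketch.
manifest = ET.fromstring(
    '<manifest xmlns:android="http://schemas.android.com/apk/res/android">'
    '<application>'
    '<activity android:name="a.b.activity"/>'
    '<activity android:name="a.b.other"/>'
    '</application>'
    '</manifest>'
)

# After parsing, namespaced names are stored as "{uri}localname".
first = manifest.find('./application/activity')
print(first.attrib)

# Attribute predicate written with the brace notation.
query = "./application/activity[@{" + android_name + "}name='a.b.activity']"
hit = manifest.find(query)
print(hit.get('{' + android_name + '}name'))

# Invented document with a namespaced element, to compare the two lookups.
doc = ET.fromstring('<root xmlns:x="http://example.com/x"><x:item>1</x:item></root>')
ns = {'x': 'http://example.com/x'}             # prefix -> URI dictionary
by_prefix = doc.find('x:item', ns)             # dictionary form
by_brace = doc.find('{http://example.com/x}item')  # string (brace) form
print(by_prefix is by_brace, by_prefix.text)
```

The prefix used in the dictionary is local to the search call; it does not have to match the prefix used in the document, only the URI does.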
If the expression above is unclear, see the supported path syntax:
tag
    Selects all child elements with the given tag. For example, spam selects all child elements named spam, and spam/egg selects all grandchildren named egg in all children named spam.
*
    Selects all child elements. For example, */egg selects all grandchildren named egg.
.
    Selects the current node. This is mostly useful at the beginning of the path, to indicate that it’s a relative path.
//
    Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.
..
    Selects the parent element.
[@attrib]
    Selects all elements that have the given attribute.
[@attrib='value']
    Selects all elements for which the given attribute has the given value. The value cannot contain quotes.
[tag]
    Selects all elements that have a child named tag. Only immediate children are supported.
[tag='text']
    Selects all elements that have a child named tag whose complete text content, including descendants, equals the given text.
[position]
    Selects all elements that are located at the given position. The position can be either an integer (1 is the first position), the expression last() (for the last position), or a position relative to the last position (e.g. last()-1).
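Several of the expressions above can be tried against a small document. This is only a sketch: the data (countries `A` and `B`, an unnamed third country, and a `city` child) and the helper `names` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the third <country> deliberately has no name attribute.
root = ET.fromstring("""<data>
    <country name="A"><year>2008</year><city>X</city></country>
    <country name="B"><year>2011</year></country>
    <country><year>2011</year></country>
</data>""")


def names(path):
    """Return the name attribute of every element matched by path."""
    return [e.get('name') for e in root.findall(path)]


print(names('country'))                  # tag: all direct children named country
print(names('country[@name]'))           # only those that have a name attribute
print(names("country[@name='B']"))       # attribute with a given value
print(names('country[city]'))            # those with a child named city
print(names("country[year='2011']"))     # those whose year child has text 2011
print(names('country[1]'))               # first position
print(names('country[last()]'))          # last position
print(names('.//city/..'))               # parent of any city element
print([e.text for e in root.findall('.//year')])   # all levels beneath root
print([e.tag for e in root.findall('*/city')])     # grandchildren named city
```

`findall()` accepts these path expressions directly; `find()` and `iterfind()` take the same syntax and differ only in returning the first match or an iterator.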
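Using a for loop, with the Android manifest file as an example:
Finding the name of the main activity
import xml.etree.ElementTree as ET

android = 'http://schemas.android.com/apk/res/android'
path = 'AndroidManifest.xml'  # assumed location of the manifest; adjust as needed

# register_namespace only affects serialization: it keeps the android: prefix when writing
ET.register_namespace('android', android)
tree = ET.parse(path)
root = tree.find('application')
for activity in root.findall('activity'):
    # look for an intent-filter whose action is android.intent.action.MAIN
    target = activity.find("./intent-filter/action[@{" + android + "}name='android.intent.action.MAIN']")
    if target is None:
        print('activity has no MAIN action')
        continue
    main_activity = activity.get("{%s}name" % android)
    print('got the main activity ' + main_activity)
    break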
Note: for details, see: https://docs.python.org/2/library/xml.etree.elementtree.html?highlight=elementtree