dadadachai

Working with XML Files Using xml.etree.ElementTree

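What Is XML

XML stands for eXtensible Markup Language, a markup language for storing and transporting data. Like HTML, it describes data with tags, but XML tags are not predefined: you define your own as needed. XML is widely used in web services, data exchange, and configuration files.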


<library>
  <book id="12345">
    <title>Harry Potter and the Philosopher's Stone</title>
    <author>J.K. Rowling</author>
    <year>1997</year>
  </book>
  <book>
    <title>The Catcher in the Rye</title>
    <author>J.D. Salinger</author>
    <year>1951</year>
  </book>
</library>

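An XML document is made up of parts called elements, each delimited by a start tag and an end tag. A tag is a markup construct that begins with < and ends with >. The characters between the start tag and the end tag, if any, are the element's content. Content may itself contain markup, including other elements, which are called child elements.

In the example above, <book> is a start tag and </book> is an end tag; <title>, <author>, and <year> are elements whose parent element is <book>.
The outermost element is called the root and contains all other elements. In the example above, <library> is the root element and has two <book> elements as its children.
An attribute is a name-value pair that appears inside a start tag or an empty-element tag. An attribute has a single value, and each attribute name may appear at most once per element. In the example above, the first <book> element has an attribute named id whose value is 12345. Attributes can identify an element or carry extra information about it. Attributes can only be written in start tags (or empty-element tags), never in end tags.

Working with XML

Python can read and parse XML files with the xml.etree.ElementTree module, which is conventionally imported as ET:

import xml.etree.ElementTree as ET

Parsing XML

parse() & getroot()

Parse an XML file and get its root element.

# Read the XML file
tree = ET.parse('example.xml')
root = tree.getroot()

fromstring()

Parse an XML string and return the root element directly (an Element, not an ElementTree).

xml_string = '''
<book id="12345">
  <title>Harry Potter and the Philosopher's Stone</title>
  <author>J.K. Rowling</author>
  <year>1997</year>
</book>
'''
# Parse the XML string
root = ET.fromstring(xml_string)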
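As a quick check of the two entry points above, the following sketch parses the library document from a string rather than from a file, so it runs without example.xml. The names LIBRARY_XML, root, and tree are illustrative choices, not part of the original example.

```python
import xml.etree.ElementTree as ET

# The sample library document, inlined as a string (assumption: same content as example.xml)
LIBRARY_XML = """<library>
  <book id="12345">
    <title>Harry Potter and the Philosopher's Stone</title>
    <author>J.K. Rowling</author>
    <year>1997</year>
  </book>
  <book>
    <title>The Catcher in the Rye</title>
    <author>J.D. Salinger</author>
    <year>1951</year>
  </book>
</library>"""

# fromstring() returns the root Element directly
root = ET.fromstring(LIBRARY_XML)

# Wrap the root in an ElementTree when tree-level methods such as write() are needed
tree = ET.ElementTree(root)

print(root.tag)                # library
print(len(root))               # 2 (number of direct children)
print(tree.getroot() is root)  # True
```

Wrapping the root in ET.ElementTree is only needed for tree-level operations; searching and editing work on the Element itself.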

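Element Object Attributes and Methods

tag

The element's tag name, as a string.
In the example above, library, book, title, author, and year are all element names, that is, tags. Because XML element names are user-defined, a tag carries no built-in meaning; in HTML, by contrast, tag names come from a fixed vocabulary that determines the element's type.

# Iterate over the direct children of the root element
for child in root:
    print(child.tag)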

attrib

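The element's attributes, as a dictionary.
In the example above, id is an attribute of the first book element, with the value 12345. The element's attrib attribute returns its attributes as a dictionary: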

book = root.find('book')
print(book.attrib)

## output：
{'id': '12345'}

text

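The element's text content.
In the example above, the text content of the first title, author, and year elements is Harry Potter and the Philosopher's Stone, J.K. Rowling, and 1997 respectively. The element's text attribute returns it:

# title is a grandchild of the root, so give the path through book
title = root.find('book/title')
print(title.text)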

## output：
Harry Potter and the Philosopher's Stone

tail

The text that follows the element's end tag, up to the next tag.
In the XML below, the a element has None for both its text and tail attributes, the b element has text "1" and tail "4", the c element has text "2" and tail None, and the d element has text None and tail "3".

<a><b>1<c>2<d/>3</c></b>4</a>
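The text and tail values described above can be verified directly. This is a minimal sketch that parses the same one-line document and prints both attributes for each element; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a><b>1<c>2<d/>3</c></b>4</a>")
b = a.find("b")
c = b.find("c")
d = c.find("d")

# Print tag, text, and tail for each element in the chain
for elem in (a, b, c, d):
    print(elem.tag, repr(elem.text), repr(elem.tail))
# a None None
# b '1' '4'
# c '2' None
# d None '3'
```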

keys()

Returns a list of all attribute names of the element, in no guaranteed order.

items()

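Returns all of the element's attributes as (name, value) pairs.
Differences between items() and attrib:

items() returns a sequence of tuples, each holding an attribute name and its value, while attrib is a dictionary that maps attribute names to values.
Both items() and attrib belong to Element objects; an ElementTree object has neither, so call tree.getroot() first to reach an element.
If the element has no attributes, items() returns an empty sequence and attrib is an empty dictionary.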
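To make the comparison concrete, here is a small sketch that calls keys(), items(), and attrib on one element with attributes and one without. The lang attribute is a hypothetical addition so that there is more than one attribute to list.

```python
import xml.etree.ElementTree as ET

# 'lang' is an illustrative attribute, not part of the library example
book = ET.fromstring('<book id="12345" lang="en"/>')
empty = ET.fromstring('<book/>')

print(sorted(book.keys()))   # ['id', 'lang']
print(list(book.items()))    # (name, value) tuples
print(book.attrib)           # the same data as a dictionary

# With no attributes, both views are simply empty
print(list(empty.items()), empty.attrib)  # [] {}
```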

get()

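get(key, default=None) returns the value of the given attribute. The first argument is the attribute name; the second is the value to return if the attribute does not exist, which defaults to None.
In the example above, we use get() to read the id and category attributes of the first book element:

book = root.find('book')
book_id = book.get('id', 'unknown')  # avoid shadowing the built-in id()
category = book.get('category', 'unknown')
print(book_id)
print(category)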

## output：
12345
unknown

find()

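find(match, namespaces=None) returns the first child element that matches the given tag name or path, or None if there is no match.

# Find the first book element
book = root.find('book')
print(book.attrib['id'])
# Find the first author element under the first book element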
author = book.find('author')
print(author.text)

## output：
12345
J.K. Rowling

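Note: with a plain tag name, find() searches only the direct children of the element, not grandchildren or deeper descendants. To reach deeper elements, pass a path such as 'book/title' or an XPath expression such as './/title'.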
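The difference between a plain tag name and a path is easy to see on a cut-down version of the library document. This is a sketch with illustrative variable names.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book id='12345'>"
    "<title>Harry Potter and the Philosopher's Stone</title>"
    "<author>J.K. Rowling</author>"
    "</book>"
    "</library>"
)

direct = root.find("title")        # title is not a direct child of library
by_path = root.find("book/title")  # explicit path through book
anywhere = root.find(".//author")  # any depth below the root

print(direct)         # None
print(by_path.text)   # Harry Potter and the Philosopher's Stone
print(anywhere.text)  # J.K. Rowling
```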

findall()

findall(match, namespaces=None) returns a list of all child elements that match the given tag name or path, or an empty list if nothing matches.
Both find() and findall() accept a limited subset of XPath, which allows more flexible searches of the XML tree.

# Use an XPath expression to find the book elements that have an id attribute
books = tree.findall(".//book[@id]")

# For each book element with an id attribute, read the title, author, and year
for book in books:
    title = book.find('title').text
    author = book.find('author').text
    year = book.find('year').text
    print(f'Title: {title}, Author: {author}, Year: {year}')

## output：
Title: Harry Potter and the Philosopher's Stone, Author: J.K. Rowling, Year: 1997
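Beyond [@id], the XPath subset in ElementTree supports a few other predicates. The sketch below tries some of them on a compact copy of the library document; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<library>
  <book id="12345"><title>Harry Potter and the Philosopher's Stone</title><year>1997</year></book>
  <book><title>The Catcher in the Rye</title><year>1951</year></book>
</library>""")

with_id = root.findall("book[@id]")              # books that have an id attribute
by_attr = root.findall("book[@id='12345']")      # books whose id equals 12345
by_child_text = root.findall("book[year='1951']")  # books with a year child whose text is 1951
second = root.findall("book[2]")                 # the second book child (positions start at 1)

print(len(with_id), len(by_attr))           # 1 1
print(by_child_text[0].findtext("title"))   # The Catcher in the Rye
print(second[0] is by_child_text[0])        # True
```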

findtext()

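findtext(match, default=None, namespaces=None) finds the first matching element and returns its text content. If no element matches, it returns the value of the default argument.

# Find the text content of the first year element
print(root.findtext('.//year'))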

## output：
1997
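The default argument and the empty-element case are worth seeing side by side. In this sketch the publisher and note tags are hypothetical, chosen only to show a missing element and an element without text.

```python
import xml.etree.ElementTree as ET

# 'note' is an empty element; 'publisher' does not exist at all (both are illustrative)
book = ET.fromstring("<book><title>XML for Beginners</title><note/></book>")

print(book.findtext("title"))                          # XML for Beginners
print(book.findtext("publisher"))                      # None
print(book.findtext("publisher", default="unknown"))   # unknown
print(repr(book.findtext("note")))                     # '' (found, but no text)
```

An empty string therefore means the element exists but has no text, while the default value means nothing matched.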

iter()

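iter(tag=None) is available on both Element and ElementTree objects. It returns an iterator over the element itself and all of its descendants in document order; if tag is given, only elements with that tag are yielded. You can loop over the iterator to read or modify each element.

# Iterate over all title elements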
for title in root.iter('title'):
    print(title.text)

## output：
Harry Potter and the Philosopher's Stone
The Catcher in the Rye

# Modify the document and write it back to the file
for book in root.iter('book'):
    if book.get('id') == '12345':
        book.find('year').text = '1998'

tree.write('example.xml')

iterfind()

iterfind(match, namespaces=None) finds the elements that match a path and returns them as an iterator. Unlike iter(), which filters by tag name only, iterfind() accepts XPath expressions.

for title in root.iterfind(".//book[@id='12345']/title"):
    print(title.text)

## output：
Harry Potter and the Philosopher's Stone

itertext()

Iterates over the text content of the element and all of its descendants, in document order.

for text in root.find(".//book[@id='12345']").itertext():
    print(text)

## output：

Harry Potter and the Philosopher's Stone
    
J.K. Rowling
      
1997

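In the output, each piece of text is printed on its own line, with blank lines in between.
This happens because the whitespace between elements in the XML document (newlines, indentation, and so on) is stored as text and tail content, and itertext() yields it like any other text. Each whitespace-only string is printed between the real values, which produces the blank lines.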
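If only the meaningful values are wanted, one option is to strip each string and drop the ones that become empty. This is a sketch of that filtering step, not something itertext() does on its own.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring("""<book id="12345">
    <title>Harry Potter and the Philosopher's Stone</title>
    <author>J.K. Rowling</author>
    <year>1997</year>
</book>""")

# itertext() yields whitespace-only strings between the child elements too
raw = list(book.itertext())

# Keep only the strings that still contain something after stripping
values = [t.strip() for t in raw if t.strip()]

print(len(raw))   # 7
print(values)     # ["Harry Potter and the Philosopher's Stone", 'J.K. Rowling', '1997']
```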

clear()

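Removes all child elements and attributes of the element and sets its text and tail to None. This is useful for emptying an element before refilling it.

# Clear the children and attributes of every book element in the document
for book in tree.iter('book'):
    book.clear()
# Print the cleared document; tostring() takes an Element, so pass the root
print(ET.tostring(tree.getroot()).decode())

## output：
<library>
  <book /><book /></library>

Because clear() also resets tail, the line breaks and indentation that followed each book element are gone.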

remove()

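Removes a child element from its parent. remove() is a method of the parent Element, not of ElementTree.

# Remove every book element from the root
root = tree.getroot()
for book in root.findall('book'):
    root.remove(book)
# Print the document after removal
print(ET.tostring(root).decode())

## output：
<library>
  </library>

The root keeps its whitespace-only text, so it is not serialized as the empty element <library />.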
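In practice, removal is usually conditional. The following sketch removes only the books published before a cutoff year; the cutoff of 1990 and the compact test document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<library>"
    "<book id='12345'><year>1997</year></book>"
    "<book><year>1951</year></book>"
    "</library>"
)

# findall() returns a list, so removing from root inside the loop is safe;
# looping over root itself while removing could skip elements.
for book in root.findall("book"):
    if int(book.findtext("year")) < 1990:  # hypothetical cutoff year
        root.remove(book)

remaining = [b.get("id") for b in root]
print(remaining)  # ['12345']
print(ET.tostring(root, encoding="unicode"))
# <library><book id="12345"><year>1997</year></book></library>
```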

tostring()

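Serializes an element and its subtree to a string. Commonly used parameters:

element: the Element to serialize. It is required; to serialize a whole document, pass its root element.
encoding: the output encoding. The default is "us-ascii", which returns bytes; pass encoding="unicode" to get a str.
xml_declaration: whether to include the XML declaration in the output (available since Python 3.8). The default is None, which adds the declaration only when the encoding is not US-ASCII, UTF-8, or "unicode".
method: the serialization method, one of "xml" (the default), "html", or "text".

ET.tostring(root, encoding='utf-8', xml_declaration=True)

## output：
b'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<library>\n  <book id="12345">\n    <title>Harry Potter and the Philosopher\'s Stone</title>\n    <author>J.K. Rowling</author>\n    <year>1997</year>\n  </book>\n  <book>\n    <title>The Catcher in the Rye</title>\n    <author>J.D. Salinger</author>\n    <year>1951</year>\n  </book>\n</library>\n'

The b prefix in the output indicates a bytes object. To convert it to a string, call its decode() method.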
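The effect of the encoding and xml_declaration arguments can be compared on a single small element. This sketch assumes Python 3.8 or later, where tostring() accepts xml_declaration; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

e = ET.Element("book", id="12345")

as_bytes = ET.tostring(e)                     # bytes, no declaration
as_text = ET.tostring(e, encoding="unicode")  # str instead of bytes
with_decl = ET.tostring(e, encoding="utf-8", xml_declaration=True)

print(as_bytes)   # b'<book id="12345" />'
print(as_text)    # <book id="12345" />
print(with_decl.decode("utf-8"))
# <?xml version='1.0' encoding='utf-8'?>
# <book id="12345" />
```

Using encoding="unicode" avoids the extra decode() step when the result is only going to be printed or logged.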

set()

set(key, value) sets an attribute on the element. Given an Element object e, e.set(name, value) sets the attribute name to value.

# Create an XML element
e = ET.Element('book')

# Set attribute values
e.set('id', '12345')
e.set('isbn', '978-3-16-148410-0')

# Print the result
print(ET.tostring(e))

## output：
b'<book id="12345" isbn="978-3-16-148410-0" />'

Besides set(), attributes can also be set through the attrib dictionary, for example e.attrib[name] = value.

append()

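append(subelement) adds a child element to the end of the element's children.

# Create the parent element
root = ET.Element('library')

# Create the child elements
book1 = ET.Element('book')
book2 = ET.Element('book')

# Add the child elements to the parent
root.append(book1)
root.append(book2)

# Print the result
print(ET.tostring(root))

## output：
b'<library><book /><book /></library>'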

insert()

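insert(index, subelement) inserts a child element at position index, counting from 0.

# Create the parent element
root = ET.Element('library')

# Create the child elements
book1 = ET.Element('book')
book2 = ET.Element('book')
div = ET.Element('div')

# Add the child elements to the parent
root.append(book1)
root.append(book2)

# Insert a child element at the given position
root.insert(1, div)

# Print the result
print(ET.tostring(root))

## output：
b'<library><book /><div /><book /></library>'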

extend()

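extend(subelements) adds several child elements at once, taken from a sequence of elements.

# Create the parent element
root = ET.Element('library')

# Create a list of child elements
children = [
    ET.Element('book'),
    ET.Element('book'),
    ET.Element('book')
]

# Add all child elements to the parent
root.extend(children)

# Print the result
print(ET.tostring(root))

## output：
b'<library><book /><book /><book /></library>'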

SubElement()

SubElement(parent, tag, attrib={}, **extra) creates an element with the given tag name and appends it to the parent element in one step. It requires the parent element object and the tag name of the new child.

# Create a new element
new_element = ET.Element('book')

# Add an attribute to the element
new_element.set('id', '12345')

# Add child elements to the element
title = ET.SubElement(new_element, 'title')
title.text = "Harry Potter and the Philosopher's Stone"

author = ET.SubElement(new_element, 'author')
author.text = 'J.K. Rowling'

year = ET.SubElement(new_element, 'year')
year.text = '1997'

# Insert the new element into the document
root = ET.parse('example.xml').getroot()
root.append(new_element)

# Save the modified document
ET.ElementTree(root).write('example.xml')
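SubElement() also accepts attributes, either as a dictionary or as keyword arguments, which can shorten code like the above. The sketch below builds a small tree entirely in memory; the lang attribute and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

library = ET.Element("library")

# The new child is created and attached to its parent in one call;
# attributes can come from a dict and from keyword arguments together
book = ET.SubElement(library, "book", {"id": "12345"}, lang="en")
ET.SubElement(book, "title").text = "Harry Potter and the Philosopher's Stone"

print(library[0] is book)  # True: no separate append() was needed
print(ET.tostring(library, encoding="unicode"))
# <library><book id="12345" lang="en"><title>Harry Potter and the Philosopher's Stone</title></book></library>
```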

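Namespace URIs

In XML, a namespace is defined by a URI, which identifies it uniquely.

A URI (Uniform Resource Identifier) is a string that identifies a resource. URIs have two commonly mentioned subsets: URLs (Uniform Resource Locators) and URNs (Uniform Resource Names).
A URL is a URI that identifies a resource by its location. For example, https://www.example.com/index.html is a URL in which https is the scheme, www.example.com is the host name, and /index.html is the resource path.
A URN is a URI that identifies a resource by name. Unlike a URL, a URN carries no location information. For example, urn:isbn:0451450523 is a URN in which urn is the scheme and isbn:0451450523 names the resource.
To create an XML document whose elements belong to a namespace, write the namespace URI in curly braces {} in front of the tag name when creating each element: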

# Create the root element in the namespace
root = ET.Element('{http://www.example.com/ns1}library')

# Create child elements and add them to the root
book1 = ET.SubElement(root, '{http://www.example.com/ns1}book', id='12345', published='2020-01-01')
book2 = ET.SubElement(root, '{http://www.example.com/ns1}book')

# Create child elements and add them to book1
title1 = ET.SubElement(book1, '{http://www.example.com/ns1}title', lang='en')
author1 = ET.SubElement(book1, '{http://www.example.com/ns1}author')
genre1 = ET.SubElement(book1, '{http://www.example.com/ns1}genre')

# Create child elements and add them to book2
# (title2 is created without a namespace)
title2 = ET.SubElement(book2, 'title', lang='en')
author2 = ET.SubElement(book2, '{http://www.example.com/ns1}author')
genre2 = ET.SubElement(book2, '{http://www.example.com/ns1}genre')

# Set the text content of the elements
title1.text = 'XML for Beginners'
author1.text = 'John Doe'
genre1.text = 'Fiction'

title2.text = 'Advanced XML Techniques'
author2.text = 'Jane Smith'
genre2.text = 'Fiction'

# Print the XML document
ET.dump(root)

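In a serialized document, namespaces are expressed by adding a prefix to element and attribute names. For example, if two elements share the same name but belong to different namespaces, their prefixes tell them apart. Each prefix is bound to a URI, and it is the URI that uniquely identifies the namespace:


<ns1:library xmlns:ns1="http://www.example.com/ns1">
    <ns1:book id="12345" published="2020-01-01">
        <ns1:title lang="en">XML for Beginners</ns1:title>
        <ns1:author>John Doe</ns1:author>
        <ns1:genre>Fiction</ns1:genre>
    </ns1:book>
    <ns1:book>
        <title lang="en">Advanced XML Techniques</title>
        <ns1:author>Jane Smith</ns1:author>
        <ns1:genre>Fiction</ns1:genre>
    </ns1:book>
</ns1:library>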

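In the XML above, the URI http://www.example.com/ns1 identifies the namespace and is bound to the prefix ns1. Every element written with the ns1 prefix, such as library, book, author, and genre, belongs to the namespace http://www.example.com/ns1.
A prefix declared on an element is in scope for that element and all of its descendants, but it applies only to names that actually use the prefix. Only a default namespace declaration (xmlns="...") is inherited by unprefixed child elements, and even then it never applies to unprefixed attributes. In the example above no default namespace is declared, so the second title element and the lang attributes, which carry no prefix, belong to no namespace.
For this reason a search path must name each element exactly as it is qualified: ns1:title matches the title of the first book, while the title of the second book is matched by the unprefixed name title.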

# Find the title element that belongs to the namespace
for title in root.findall('.//ns1:book[@id="12345"]/ns1:title', namespaces={'ns1': 'http://www.example.com/ns1'}):
    print(title.text)

## output：
XML for Beginners
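How prefixed and default namespace declarations affect unprefixed names can be checked directly. The sketch below parses two small documents, one using the ns1 prefix and one using a default namespace, and inspects the resulting tags; the documents and variable names are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"ns1": "http://www.example.com/ns1"}

# Prefixed declaration: only names written with ns1: are in the namespace
prefixed = ET.fromstring(
    '<ns1:library xmlns:ns1="http://www.example.com/ns1">'
    '<ns1:book><title lang="en">Advanced XML Techniques</title></ns1:book>'
    '</ns1:library>'
)
plain_title = prefixed.find("ns1:book/title", NS)

print(prefixed.tag)        # {http://www.example.com/ns1}library
print(plain_title.tag)     # title (no namespace)
print(plain_title.attrib)  # {'lang': 'en'} (unprefixed attribute, no namespace)
print(prefixed.find("ns1:book/ns1:title", NS))  # None

# Default declaration: unprefixed elements inherit it, attributes do not
defaulted = ET.fromstring(
    '<library xmlns="http://www.example.com/ns1">'
    '<book><title lang="en">XML for Beginners</title></book>'
    '</library>'
)
inherited_title = defaulted.find("ns1:book/ns1:title", NS)

print(inherited_title.tag)     # {http://www.example.com/ns1}title
print(inherited_title.attrib)  # {'lang': 'en'}
```

The prefix used in the search path only has to match the namespaces mapping passed to find(); it does not need to match the prefix used in the document.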

An Example

The goal is to extract a table from an XML file and convert it to a DataFrame for further data processing.
The XML file:



<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">
 <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
  <Title>Account statements</Title>
  <Subject>Account statements</Subject>
  <Author>faktura.ru</Author>
  <LastAuthor>Microsoft Office User</LastAuthor>
  <Created>2023-07-28T06:26:25Z</Created>
  <LastSaved>2023-07-28T06:43:14Z</LastSaved>
  <Version>16.00</Version>
 </DocumentProperties>
 <OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">
  <AllowPNG/>
 </OfficeDocumentSettings>
 <ExcelWorkbook xmlns="urn:schemas-microsoft-com:office:excel">
  <WindowHeight>22400</WindowHeight>
  <WindowWidth>32767</WindowWidth>
  <WindowTopX>32767</WindowTopX>
  <WindowTopY>32767</WindowTopY>
  <ProtectStructure>False</ProtectStructure>
  <ProtectWindows>False</ProtectWindows>
  <DisplayInkNotes>False</DisplayInkNotes>
 </ExcelWorkbook>
 <Styles>