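Consider using lxml's xpath() on the nodes and lxml's parse() to read directly from the file. The XPath loop iteratively appends to list and dictionary containers, which are then cast to a dataframe. Note also that the desired output does not actually align with the node values:
import pandas as pd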
from lxml import etree
import re
pd.set_option('display.width', 1000)
NSMAP = {'row': 'http://www.row.com',
         'row3': 'http://www.row3.com',
         'row1': 'http://www.row1.com',
         'xs': 'http://www.xs.com',
         'row2': 'http://www.row2.com'}
xmldata = etree.parse('RowAgent.xml')
data = []
inner = {}
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:                          # PARSE CHILDREN
        inner[i.tag] = i.text
        if len(i) > 0:                    # PARSE GRANDCHILDREN (len() counts child elements)
            for subi in i:
                inner[subi.tag] = subi.text
    data.append(inner)
    inner = {}
df = pd.DataFrame(data)
# REGEX TO REMOVE NAMESPACE URIs IN COL NAMES
df.columns = [re.sub(r'\{.*\}', '', col) for col in df.columns]
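As a quick illustration of the namespace clean-up: lxml reports namespaced tags in Clark notation ({uri}localname), and the substitution strips the braced prefix. The tag strings below are made-up samples, and the lazy `.*?` is an optional tightening, not something the code above requires.

```python
import re

# Hypothetical sample tags in Clark notation, as lxml reports namespaced elements.
tags = ['{http://www.row.com}col1', '{http://www.row2.com}col8_1', 'plain']

# Escape the braces and match lazily so only the leading {uri} part is removed.
clean = [re.sub(r'\{.*?\}', '', t) for t in tags]
print(clean)
```

Tags without a namespace pass through unchanged, so the same expression is safe on mixed column names.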
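To parse child elements at unlimited depth, use XPath's descendant::* in place of the nested child loops, iterating over el.xpath('descendant::*') inside the top_col loop.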
Output:
print(df)
#    col11_1       col11_2   col8_1  col8_2      col1      col10  col12  col13_1  col2  col3  col4  col5  col6  col7  col9
# 0     2010  AB 20/SEC001     2010    2016  00032000  test_name    pqr   000330     N     0     3     N     I    AA     N
# 1  2016026    rty-qwe-01     2000   26000     03985      temp2  perrl  0117203     N     0     3     N     a   9AA     N
# 2     8965  147A-254-044     7896     NaN     00985       mjkl  rtyyu    45612     N     0     3     N   NaN  yuio     N
# 3    52369   ui 247/mh45  145ghg7     NaN     78965     ghyuio  trwer     9874     N     0     5     N   NaN  23rt     N
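The descendant::* approach can be sketched on a small inline document. The XML below is a hypothetical miniature of the input (only the namespace URIs are taken from NSMAP), and keeping only leaf elements is an assumption made here so that container nodes such as col8 do not become whitespace-only columns.

```python
from lxml import etree

# Hypothetical miniature of the input; the namespace URIs reuse those in NSMAP.
xml = b"""<root xmlns:xs="http://www.xs.com" xmlns:row="http://www.row.com">
  <xs:top_col>
    <row:col1>00032000</row:col1>
    <row:col8><row:col8_1>2010</row:col8_1><row:col8_2>2016</row:col8_2></row:col8>
  </xs:top_col>
  <xs:top_col>
    <row:col1>03985</row:col1>
    <row:col8><row:col8_1>2000</row:col8_1></row:col8>
  </xs:top_col>
</root>"""

NSMAP = {'xs': 'http://www.xs.com', 'row': 'http://www.row.com'}
root = etree.fromstring(xml)

data = []
for el in root.xpath('//xs:top_col', namespaces=NSMAP):
    inner = {}
    # descendant::* returns every element below el, at any depth
    for node in el.xpath('descendant::*'):
        if len(node) == 0:  # assumption: keep leaf elements only
            inner[etree.QName(node).localname] = node.text
    data.append(inner)

print(data)
```

The resulting list of dictionaries can be passed to pd.DataFrame() exactly as in the first example; keys missing from a record become NaN.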
Because of the performance cost of descendant::*, consider a recursive call that first walks all descendants, followed by a second call that captures the parent/child/grandchild names for the dataframe columns. Be sure to use OrderedDict now so that the column order is preserved:
from collections import OrderedDict
#... same as above XML setup ... #
data = []
inner = OrderedDict()
def recursiveParse(curr_elem, curr_inner):
    # Record each child's text and attributes, then descend into that child
    if len(curr_elem) > 0:
        for child_elem in curr_elem:
            curr_inner[child_elem.tag] = child_elem.text
            for attrib in child_elem.attrib:
                curr_inner[attrib] = child_elem.attrib[attrib]
            recursiveParse(child_elem, curr_inner)
    return curr_inner
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:
        inner[i.tag] = i.text
        for attrib in i.attrib:
            inner[attrib] = i.attrib[attrib]
        recursiveParse(i, inner)
    data.append(inner)
    inner = OrderedDict()
df = pd.DataFrame(data)
colnames = []
def recursiveNames(curr_elem, curr_inner, num):
    # colnames[num-1] is the parent's name at the time of the call
    if len(curr_elem) > 0:
        for child_elem in curr_elem:
            tmp = re.sub(r'\{.*\}', '', child_elem.tag)
            curr_inner.append(colnames[num-1] + '.' + tmp)
            for attrib in child_elem.attrib:
                curr_inner.append(curr_inner[-1] + '.' + attrib)
            recursiveNames(child_elem, curr_inner, len(colnames))
    return curr_inner
# Names are taken from the first top_col only
for el in xmldata.xpath('//xs:top_col[1]', namespaces=NSMAP):
    for i in el:
        tmp = re.sub(r'\{.*\}', '', i.tag)
        colnames.append(tmp)
        recursiveNames(i, colnames, len(colnames))
df.columns = colnames
Output:
print(df)
#        col1  col2  col3  col4  col5  col6  col7  col8  col8.col8_1  col8.col8_1.sName  col8.col8_2  col9      col10  col11  col11.col11_1  col11.col11_2  col12  col13  col13.col13_1
# 0  00032000     N     0     3     N     I    AA    \n         2010              pqrst         2016     N  test_name     \n           2010   AB 20/SEC001    pqr     \n         000330
# 1     03985     N     0     3     N     a   9AA    \n         2000                NaN        26000     N      temp2     \n        2016026     rty-qwe-01  perrl     \n        0117203
# 2     00985     N     0     3     N   NaN  yuio    \n         7896                NaN          NaN     N       mjkl     \n           8965   147A-254-044  rtyyu     \n          45612
# 3     78965     N     0     5     N   NaN  23rt    \n      145ghg7                NaN          NaN     N     ghyuio     \n          52369    ui 247/mh45  trwer     \n           9874
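One way to avoid matching a separate list of column names by position is to build the dotted name while recursing, so that each value is stored under its full path. This is an alternative sketch, not the code above; the `flatten` helper and the sample XML are hypothetical, and it assumes the elements contain no XML comments.

```python
from lxml import etree

def flatten(elem, prefix, out):
    """Store text and attributes of elem's descendants under dotted path keys."""
    for child in elem:
        name = etree.QName(child).localname      # tag without the namespace URI
        key = prefix + '.' + name if prefix else name
        out[key] = child.text
        for attr, value in child.attrib.items():
            out[key + '.' + attr] = value
        flatten(child, key, out)                  # descend with the extended path
    return out

# Hypothetical single record modelled on the column names in the output above.
xml = b"""<top_col xmlns:row="http://www.row.com">
  <row:col7>AA</row:col7>
  <row:col8><row:col8_1 sName="pqrst">2010</row:col8_1><row:col8_2>2016</row:col8_2></row:col8>
</top_col>"""

record = flatten(etree.fromstring(xml), '', {})
print(list(record))
```

Because every key already carries its parent path, records whose elements differ between top_col nodes still land in the right columns when the list of records is passed to pd.DataFrame().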
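Finally, integrate this processing and the original XML parsing into a loop that iterates over all XML files in a directory. Be sure, though, to save every dataframe in a list of dataframes and then use pd.concat() to append/stack them.
import os
# ... other modules ...
dfList = []
for f in os.listdir('/path/to/XML/files'):
    # f is a bare file name; join it with the directory path before parsing
    #...xml parse... (passing in f for file name in parse())
    #...dataframe build with recursive calls...
    dfList.append(df)
finaldf = pd.concat(dfList)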
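A self-contained sketch of the directory loop, under stated assumptions: `parse_file` is a hypothetical stand-in for the parsing steps above, the two sample files are written to a temporary directory purely for demonstration, and `ignore_index=True` is a choice made here to get a fresh 0..n-1 row index after stacking.

```python
import glob
import os
import tempfile

import pandas as pd
from lxml import etree

def parse_file(path):
    """Hypothetical stand-in for the parsing steps: one row per <rec> element."""
    tree = etree.parse(path)
    rows = [{child.tag: child.text for child in rec} for rec in tree.xpath('//rec')]
    return pd.DataFrame(rows)

with tempfile.TemporaryDirectory() as tmpdir:
    # Write two tiny sample files so the loop has something to read.
    for name, val in [('a.xml', '1'), ('b.xml', '2')]:
        with open(os.path.join(tmpdir, name), 'w') as fh:
            fh.write('<root><rec><col1>%s</col1><col2>N</col2></rec></root>' % val)

    # glob returns full paths, so no os.path.join is needed inside the loop.
    dfList = [parse_file(f) for f in sorted(glob.glob(os.path.join(tmpdir, '*.xml')))]

finaldf = pd.concat(dfList, ignore_index=True)
print(finaldf)
```

Collecting the frames in a list and concatenating once avoids repeatedly copying a growing dataframe inside the loop.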