Parsing XML into pandas in Python

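Consider using lxml's xpath() on the nodes, and lxml's parse() to read the file directly. The XPath loop iteratively appends to list and dictionary containers, which are then cast to a dataframe. Also, note that the desired output does not actually line up with the node values:

import pandas as pd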

from lxml import etree
import re

pd.set_option('display.width', 1000)

NSMAP = {'row': 'http://www.row.com',
         'row3': 'http://www.row3.com',
         'row1': 'http://www.row1.com',
         'xs': 'http://www.xs.com',
         'row2': 'http://www.row2.com'}

xmldata = etree.parse('RowAgent.xml')

data = []
inner = {}
for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:                          # PARSE CHILDREN
        inner[i.tag] = i.text
        if len(i) > 0:                    # PARSE GRANDCHILDREN
            for subi in i:
                inner[subi.tag] = subi.text
    data.append(inner)
    inner = {}

df = pd.DataFrame(data)

# REGEX TO REMOVE NAMESPACE URIs IN COL NAMES
df.columns = [re.sub(r'\{.*?\}', '', col) for col in df.columns]
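Since RowAgent.xml itself is not shown, the following is a minimal, self-contained sketch of the same child/grandchild loop run against a small inline document. The sample XML and the helper names `local_name` and `rows_from_xml` are assumptions made for illustration. Unlike the code above, this variant skips the whitespace text of container elements and strips namespaces while building each row.

```python
import re

import pandas as pd
from lxml import etree

# Hypothetical sample standing in for RowAgent.xml (the real file is not shown).
SAMPLE = b"""<root xmlns:xs="http://www.xs.com" xmlns:row="http://www.row.com">
  <xs:top_col>
    <row:col1>00032000</row:col1>
    <row:col8><row:col8_1>2010</row:col8_1><row:col8_2>2016</row:col8_2></row:col8>
  </xs:top_col>
  <xs:top_col>
    <row:col1>03985</row:col1>
    <row:col8><row:col8_1>2000</row:col8_1></row:col8>
  </xs:top_col>
</root>"""

NSMAP = {'xs': 'http://www.xs.com', 'row': 'http://www.row.com'}


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an lxml tag."""
    return re.sub(r'^\{.*?\}', '', tag)


def rows_from_xml(root):
    """Build one dict per top_col from its leaf children and its grandchildren."""
    rows = []
    for el in root.xpath('//xs:top_col', namespaces=NSMAP):
        row = {}
        for child in el:
            if len(child) == 0:          # leaf child: keep its text
                row[local_name(child.tag)] = child.text
            for grandchild in child:     # one level deeper
                row[local_name(grandchild.tag)] = grandchild.text
        rows.append(row)
    return rows


df = pd.DataFrame(rows_from_xml(etree.fromstring(SAMPLE)))
print(df)
```

Keys that are missing from a record (col8_2 in the second record here) become NaN in the resulting dataframe.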

To parse child elements nested to any depth, use XPath's descendant::* axis.
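As a rough sketch of how descendant::* might be applied (not necessarily the exact code intended here), the example below collects every leaf element under each top_col, however deeply nested. The sample XML, the element name `deep`, and the function name `rows_from_descendants` are hypothetical.

```python
import pandas as pd
from lxml import etree

# Hypothetical sample with three levels of nesting under top_col.
SAMPLE = b"""<root xmlns:xs="http://www.xs.com" xmlns:row="http://www.row.com">
  <xs:top_col>
    <row:col1>78965</row:col1>
    <row:col8>
      <row:col8_1>
        <row:deep>145ghg7</row:deep>
      </row:col8_1>
    </row:col8>
  </xs:top_col>
</root>"""

NSMAP = {'xs': 'http://www.xs.com'}


def rows_from_descendants(root):
    """Build one dict per top_col from all leaf descendants, at any depth."""
    rows = []
    for el in root.xpath('//xs:top_col', namespaces=NSMAP):
        row = {}
        for node in el.xpath('descendant::*'):
            if len(node) == 0:  # keep only elements without child elements
                row[etree.QName(node).localname] = node.text
        rows.append(row)
    return rows


rows = rows_from_descendants(etree.fromstring(SAMPLE))
df = pd.DataFrame(rows)
print(df)
```

Because keys are bare local names, two leaves with the same name under one top_col would overwrite each other in this sketch.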

Output for RowAgent.xml:

print(df)
#    col11_1       col11_2   col8_1  col8_2      col1      col10  col12  col13_1  col2  col3  col4  col5  col6  col7  col9
# 0     2010  AB 20/SEC001     2010    2016  00032000  test_name    pqr   000330     N     0     3     N     I    AA     N
# 1  2016026    rty-qwe-01     2000   26000     03985      temp2  perrl  0117203     N     0     3     N     a   9AA     N
# 2     8965  147A-254-044     7896     NaN     00985       mjkl  rtyyu    45612     N     0     3     N   NaN  yuio     N
# 3    52369   ui 247/mh45  145ghg7     NaN     78965     ghyuio  trwer     9874     N     0     5     N   NaN  23rt     N

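Because descendant::* can be slow, consider instead a recursive function that first walks all descendants, followed by a second recursive function that captures the parent/child/grandchild names for the dataframe columns. Be sure to use OrderedDict this time, so that the collected values keep the same order as the generated column names:

from collections import OrderedDict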

#... same as above XML setup ... #

data = []
inner = OrderedDict()

def recursiveParse(curr_elem, curr_inner):
    # Store each child's text and attributes, then descend into that child
    if len(curr_elem) > 0:
        for child_elem in curr_elem:
            curr_inner[child_elem.tag] = child_elem.text
            for attrib in child_elem.attrib:
                curr_inner[attrib] = child_elem.attrib[attrib]
            recursiveParse(child_elem, curr_inner)
    return curr_inner

for el in xmldata.xpath('//xs:top_col', namespaces=NSMAP):
    for i in el:
        inner[i.tag] = i.text
        for attrib in i.attrib:
            inner[attrib] = i.attrib[attrib]
        recursiveParse(i, inner)
    data.append(inner)
    inner = OrderedDict()

df = pd.DataFrame(data)

colnames = []

def recursiveNames(curr_elem, curr_inner, num):
    # num is the 1-based position of curr_elem's own name in curr_inner
    parent_name = curr_inner[num-1]
    if len(curr_elem) > 0:
        for child_elem in curr_elem:
            tmp = re.sub(r'\{.*?\}', '', child_elem.tag)
            child_name = parent_name + '.' + tmp
            curr_inner.append(child_name)
            child_num = len(curr_inner)
            for attrib in child_elem.attrib:
                curr_inner.append(child_name + '.' + attrib)
            recursiveNames(child_elem, curr_inner, child_num)
    return curr_inner

# Column names are taken from the first top_col only
for el in xmldata.xpath('//xs:top_col[1]', namespaces=NSMAP):
    for i in el:
        tmp = re.sub(r'\{.*?\}', '', i.tag)
        colnames.append(tmp)
        num = len(colnames)
        for attrib in i.attrib:
            colnames.append(tmp + '.' + attrib)
        recursiveNames(i, colnames, num)

df.columns = colnames

Output for RowAgent.xml:

print(df)
#        col1  col2  col3  col4  col5  col6  col7  col8  col8.col8_1  col8.col8_1.sName  col8.col8_2  col9      col10  col11  col11.col11_1  col11.col11_2  col12  col13  col13.col13_1
# 0  00032000     N     0     3     N     I    AA    \n         2010              pqrst         2016     N  test_name     \n           2010   AB 20/SEC001    pqr     \n         000330
# 1     03985     N     0     3     N     a   9AA    \n         2000                NaN        26000     N      temp2     \n        2016026     rty-qwe-01  perrl     \n        0117203
# 2     00985     N     0     3     N   NaN  yuio    \n         7896                NaN          NaN     N       mjkl     \n           8965   147A-254-044  rtyyu     \n          45612
# 3     78965     N     0     5     N   NaN  23rt    \n      145ghg7                NaN          NaN     N     ghyuio     \n          52369    ui 247/mh45  trwer     \n           9874
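The two recursive passes can also be folded into a single pass that builds the dotted names and the values together, which avoids relying on the two traversals staying in step. This is a variation on the approach above, not the original method; the sample XML and the function name `flatten` are assumptions made for illustration.

```python
from collections import OrderedDict

import pandas as pd
from lxml import etree

# Hypothetical sample: one nested element carries an sName attribute.
SAMPLE = (
    b'<root xmlns:xs="http://www.xs.com" xmlns:row="http://www.row.com">'
    b'<xs:top_col><row:col1>00032000</row:col1>'
    b'<row:col8><row:col8_1 sName="pqrst">2010</row:col8_1>'
    b'<row:col8_2>2016</row:col8_2></row:col8></xs:top_col>'
    b'<xs:top_col><row:col1>03985</row:col1>'
    b'<row:col8><row:col8_1>2000</row:col8_1>'
    b'<row:col8_2>26000</row:col8_2></row:col8></xs:top_col>'
    b'</root>'
)

NSMAP = {'xs': 'http://www.xs.com'}


def flatten(elem, row, prefix=''):
    """Recursively store text and attributes under dotted parent.child names."""
    for child in elem:
        name = prefix + etree.QName(child).localname
        row[name] = child.text
        for attr, value in child.attrib.items():
            row[name + '.' + attr] = value
        flatten(child, row, name + '.')
    return row


root = etree.fromstring(SAMPLE)
rows = [flatten(el, OrderedDict())
        for el in root.xpath('//xs:top_col', namespaces=NSMAP)]
df = pd.DataFrame(rows)
print(df)
```

Because every key already carries its full path, same-named attributes or elements under different parents do not overwrite each other, and no separate column-renaming step is needed.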

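Finally, integrate this processing and the original XML parsing into a loop that iterates over all XML files in a directory. Be sure, however, to save every dataframe in a list of dataframes and then append/stack them with pd.concat():

import os
# ... plus the modules imported above ...

dfList = []
for f in os.listdir('/path/to/XML/files'):
    #...xml parse... (passing in f for file name in parse(); os.listdir() returns bare names, so join f with the directory path)
    #...dataframe build with recursive calls...
    dfList.append(df)

finaldf = pd.concat(dfList)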
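The loop above could be fleshed out roughly as follows. This is only a sketch: `frame_from_file` is a hypothetical stand-in for the per-file parsing and dataframe build, the two sample files are written to a temporary directory purely for demonstration, and glob is used here in place of os.listdir() so that only *.xml files are picked up with their full paths.

```python
import glob
import os
import tempfile

import pandas as pd
from lxml import etree

NSMAP = {'xs': 'http://www.xs.com'}


def frame_from_file(path):
    """Hypothetical per-file step: one row per top_col, leaf elements only."""
    tree = etree.parse(path)
    rows = []
    for el in tree.xpath('//xs:top_col', namespaces=NSMAP):
        rows.append({etree.QName(node).localname: node.text
                     for node in el.xpath('descendant::*')
                     if len(node) == 0})
    return pd.DataFrame(rows)


def frames_from_dir(directory):
    """Parse every *.xml file in directory and stack the resulting dataframes."""
    df_list = []
    for path in sorted(glob.glob(os.path.join(directory, '*.xml'))):
        df_list.append(frame_from_file(path))
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(df_list, ignore_index=True)


# Demonstration with two small, made-up files.
TEMPLATE = ('<root xmlns:xs="http://www.xs.com"><xs:top_col>'
            '<col1>{col1}</col1><col2>{col2}</col2>'
            '</xs:top_col></root>')

with tempfile.TemporaryDirectory() as tmp:
    for name, col1 in [('a.xml', '00032000'), ('b.xml', '03985')]:
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write(TEMPLATE.format(col1=col1, col2='N'))
    finaldf = frames_from_dir(tmp)

print(finaldf)
```

Collecting the frames in a list and calling pd.concat() once is generally cheaper than growing a dataframe inside the loop. Note that pd.concat() raises a ValueError if the list is empty, so a directory without XML files would need separate handling.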
