[Python Regular Expressions] Matching XML Tags in a String

  Suppose we have the following requirement. Given data like this:

0-0-0 0:0:0 #### the 68th annual golden globe awards ####  the king s speech earns 7 nominations  ####  <LOCATION>LOS ANGELES</LOCATION> <ORGANIZATION>Dec Xinhua Kings Speech</ORGANIZATION> historical drama British king stammer beat competitors Tuesday grab seven nominations Golden Globe Awards nominations included best film drama nod contested award organizers said films competing best picture <ORGANIZATION>Social Network Black Swan Fighter Inception Kings Speech</ORGANIZATION> earned nominations best performance actor olin <PERSON>Firth</PERSON> best performance actress <PERSON>Helena Bonham</PERSON> arter best supporting actor <PERSON>Geoffrey Rush</PERSON> best director <PERSON>Tom Hooper</PERSON> best screenplay <PERSON>David Seidler</PERSON> best movie score <ORGANIZATION>Alexandre Desplat Social Network Fighter</ORGANIZATION> earned nods apiece Black Swan Inception Kids Right tied place movie race nominations best motion picture comedy musical category <ORGANIZATION>Alice Wonderland Burlesque Kids Right Red Tourist</ORGANIZATION> compete Nominated best actor motion picture olin <ORGANIZATION>Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network</ORGANIZATION> best actress motion picture nominees <PERSON>Halle Berry Frankie Alice Nicole Kidman</PERSON> Rabbit Hole <PERSON>Jennifer Lawrence</PERSON> <ORGANIZATION>Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV</ORGANIZATION> categories Glee nominee nods followed Rock Boardwalk Empire Dexter Good Wife Mad Men Modern Family Pillars Earth Temple <PERSON>Grandin</PERSON> tied nods apiece awards announced Jan 

 

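  The task is to process the data line by line: replace the spaces in the text inside each <> tag with underscores (_), and convert the tagged text to a new form. For example, A B C tagged as X becomes A_B_C/X.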

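  Regex quantifiers are greedy by default, meaning they match as far to the right as possible, which makes this awkward. Adding ? to make a quantifier lazy is not enough by itself either: a lazy match stops at the first closing tag it finds, which is not necessarily the one that belongs to the opening tag. So the pattern needs to give the previously matched string (the tag name) a name, so that it can be referred to again later in the same pattern.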
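
To make the difference concrete, here is a minimal sketch comparing a greedy pattern, a lazy pattern, and a pattern that uses a named group with a backreference. The sample string, the tag names A and B, and all three patterns are illustrative assumptions rather than the author's code, and the sketch assumes closing tags are written as `</TAG>`.

```python
import re

# Hypothetical sample: a <B> element nested inside an <A> element,
# followed by a second <A> element.
text = "x <A>one <B>two</B> three</A> y <A>four</A> z"

# Greedy: .+ runs on to the last closing tag in the line.
greedy = re.findall(r"<\w+>.+</\w+>", text)

# Lazy: .+? stops at the first closing tag, whatever its name is.
lazy = re.findall(r"<\w+>.+?</\w+>", text)

# Named group + backreference: the closing tag must repeat the opening name.
named = [m.group(0)
         for m in re.finditer(r"<(?P<label>\w+)>.+?</(?P=label)>", text)]

print(greedy)  # ['<A>one <B>two</B> three</A> y <A>four</A>']
print(lazy)    # ['<A>one <B>two</B>', '<A>four</A>']
print(named)   # ['<A>one <B>two</B> three</A>', '<A>four</A>']
```

Only the third pattern returns each element with its own closing tag, which is why naming the tag is useful here.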

In Python, the final regex looks like this:

LABEL_PATTERN = re.compile('(<(?P')

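  Next, we need to find the corresponding substrings in the original string and note where they are, then precompute what each one should be replaced with. One more regex is enough for that: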

LABEL_CONTENT_PATTERN = re.compile('<(?P')
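
Both patterns above are cut off after `(?P`. As a reading aid only, the sketch below shows one possible shape for them. It assumes the named group is called `label`, that closing tags are written `</TAG>`, and that tag names consist of word characters. These reconstructions are guesses, chosen so that `findall` returns tuples whose indices line up with how the processing code indexes them; they should not be taken as the author's exact patterns.

```python
import re

# HYPOTHETICAL reconstructions; the original patterns are truncated.
# Group 1 = the whole tagged span, named group "label" = the tag name.
LABEL_PATTERN = re.compile(r'(<(?P<label>\w+)>.+?</(?P=label)>)')
# Named group "label" = the tag name, group 2 = the text between the tags.
LABEL_CONTENT_PATTERN = re.compile(r'<(?P<label>\w+)>(.+?)</(?P=label)>')

line = 'a <LOCATION>LOS ANGELES</LOCATION> b <PERSON>Tom Hooper</PERSON> c'

# When a pattern has several groups, findall returns one tuple per match,
# with the groups in numbering order (named groups are numbered too).
spans = LABEL_PATTERN.findall(line)
contents = LABEL_CONTENT_PATTERN.findall(line)

print(spans)
# [('<LOCATION>LOS ANGELES</LOCATION>', 'LOCATION'),
#  ('<PERSON>Tom Hooper</PERSON>', 'PERSON')]
print(contents)
# [('LOCATION', 'LOS ANGELES'), ('PERSON', 'Tom Hooper')]
```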

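  Map over the whole collection of strings, run both patterns on each string, and zip the two sets of match results together. This yields tuples pairing each original substring with its replacement, roughly like this: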

('LOS ANGELES', 'LOS_ANGELES/LOCATION')
('Dec Xinhua Kings Speech', 'Dec_Xinhua_Kings_Speech/ORGANIZATION')
('Social Network Black Swan Fighter Inception Kings Speech', 'Social_Network_Black_Swan_Fighter_Inception_Kings_Speech/ORGANIZATION')
('Firth', 'Firth/PERSON')
('Helena Bonham', 'Helena_Bonham/PERSON')
('Geoffrey Rush', 'Geoffrey_Rush/PERSON')
('Tom Hooper', 'Tom_Hooper/PERSON')
('David Seidler', 'David_Seidler/PERSON')
('Alexandre Desplat Social Network Fighter', 'Alexandre_Desplat_Social_Network_Fighter/ORGANIZATION')
('Alice Wonderland Burlesque Kids Right Red Tourist', 'Alice_Wonderland_Burlesque_Kids_Right_Red_Tourist/ORGANIZATION')
('Firth Kings Speech James Franco Hours Ryan Gosling Blue Valentine Mark Wahlberg Fighter Jesse Eisenberg Social Network', 'Firth_Kings_Speech_James_Franco_Hours_Ryan_Gosling_Blue_Valentine_Mark_Wahlberg_Fighter_Jesse_Eisenberg_Social_Network/ORGANIZATION')
('Halle Berry Frankie Alice Nicole Kidman', 'Halle_Berry_Frankie_Alice_Nicole_Kidman/PERSON')
('Jennifer Lawrence', 'Jennifer_Lawrence/PERSON')
('Winters Bone Natalie Portman Black Swan Michelle Williams Blue Valentine TV', 'Winters_Bone_Natalie_Portman_Black_Swan_Michelle_Williams_Blue_Valentine_TV/ORGANIZATION')
('Grandin', 'Grandin/PERSON')
('BEIJING', 'BEIJING/LOCATION')
('Xinhua Sanlu Group', 'Xinhua_Sanlu_Group/ORGANIZATION')
('Gansu', 'Gansu/LOCATION')
('Sanlu', 'Sanlu/ORGANIZATION')
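
If the positions of the substrings are also wanted, a variant is to use a single pattern with `finditer`, whose match objects expose `start()` and `end()`. This is a different technique from the two-pattern zip described above, sketched under the assumption that closing tags are written `</TAG>`. The pattern, the group names, and the helper name `tagged_spans` are all hypothetical.

```python
import re

# HYPOTHETICAL pattern: assumes closing tags look like </TAG>.
TAG = re.compile(r'<(?P<label>\w+)>(?P<content>.+?)</(?P=label)>')

def tagged_spans(line):
    """Return (start, end, original, replacement) for each tagged span."""
    out = []
    for m in TAG.finditer(line):
        # Spaces become underscores, then '/' and the tag name are appended.
        replacement = m.group('content').replace(' ', '_') + '/' + m.group('label')
        out.append((m.start(), m.end(), m.group(0), replacement))
    return out

line = '<PERSON>Tom Hooper</PERSON> best screenplay'
print(tagged_spans(line))
# [(0, 27, '<PERSON>Tom Hooper</PERSON>', 'Tom_Hooper/PERSON')]
```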

  The processing code is as follows:

def read_file(path):
    # Return the file's lines, or an empty list if the file does not exist.
    if not os.path.exists(path):
        print("path: '" + path + "' not found.")
        return []
    # The with statement closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        return fp.read().split('\n')

def get_label(each):
    # Whole tagged spans found in this line (first group of each match).
    spans = LABEL_PATTERN.findall(each)
    # Replacement text: spaces become underscores, then '/' and the tag name.
    replacements = [m[1].replace(' ', '_') + '/' + m[0]
                    for m in LABEL_CONTENT_PATTERN.findall(each)]
    # Pair each original span with its replacement.
    return [(span[0], repl) for span, repl in zip(spans, replacements)]

src = read_file(FILE_PATH)
pattern = [get_label(line) for line in src]

 

  After that, a simple pass finishes the job:

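for i in range(len(src)):
    for pat in pattern[i]:
        # Plain string replacement: the matched text is treated literally,
        # so regex metacharacters in it cannot change the result.
        src[i] = src[i].replace(pat[0], pat[1])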
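
As an alternative to collecting pairs and then looping, `re.sub` accepts a function as the replacement, so each tagged span can be rewritten in a single pass. This is a different technique from the loop above, sketched under the assumption that closing tags are written `</TAG>`. The pattern and the helper names `to_slash_form` and `convert_line` are hypothetical.

```python
import re

# HYPOTHETICAL pattern: assumes closing tags look like </TAG>.
TAG = re.compile(r'<(?P<label>\w+)>(?P<content>.+?)</(?P=label)>')

def to_slash_form(match):
    # Called once per match; its return value replaces the whole tagged span.
    return match.group('content').replace(' ', '_') + '/' + match.group('label')

def convert_line(line):
    return TAG.sub(to_slash_form, line)

line = 'nominations <PERSON>Geoffrey Rush</PERSON> best director <PERSON>Tom Hooper</PERSON>'
print(convert_line(line))
# nominations Geoffrey_Rush/PERSON best director Tom_Hooper/PERSON
```

Because the matched text is never reused as a pattern, any regex metacharacters inside the tagged content cause no trouble.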

 

  Full code:

# -*- coding: utf-8 -*-
import re
import os

# FILE_PATH = '/home/kirai/workspace/sina_news_process/disworded_sina_news_attr_handled.txt'
FILE_PATH = '/home/kirai/workspace/sina_news_process/test.txt'
LABEL_PATTERN = re.compile('(<(?P')
LABEL_CONTENT_PATTERN = re.compile('<(?P')

def read_file(path):
    # Return the file's lines, or an empty list if the file does not exist.
    if not os.path.exists(path):
        print("path: '" + path + "' not found.")
        return []
    # The with statement closes the file, so no explicit close() is needed.
    with open(path, 'r') as fp:
        return fp.read().split('\n')

def get_label(each):
    # Whole tagged spans found in this line (first group of each match).
    spans = LABEL_PATTERN.findall(each)
    # Replacement text: spaces become underscores, then '/' and the tag name.
    replacements = [m[1].replace(' ', '_') + '/' + m[0]
                    for m in LABEL_CONTENT_PATTERN.findall(each)]
    # Pair each original span with its replacement.
    return [(span[0], repl) for span, repl in zip(spans, replacements)]

src = read_file(FILE_PATH)
pattern = [get_label(line) for line in src]

for i in range(len(src)):
    for pat in pattern[i]:
        # Plain string replacement: the matched text is treated literally.
        src[i] = src[i].replace(pat[0], pat[1])

 

Reposted from: https://www.cnblogs.com/kirai/p/6189611.html
