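I am working as an RA for a professor and need to do research involving text matching, so I am teaching myself Python regular expressions and keeping these study notes. A regular expression is a special sequence of characters that makes it easy to check whether a string matches a given pattern.
Screenshot source: http://c.biancheng.net/view/7768.html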
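As a quick illustration of "checking whether a string matches a pattern", here is a minimal sketch using the standard-library `re` module. The two sample strings and the variable names are made up for illustration.

```python
import re

# Pattern: one or two digits, a colon, then exactly two digits (e.g. 7:00 or 11:15)
time_pattern = r'\d?\d:\d\d'

# re.search returns a match object if the pattern occurs anywhere, otherwise None
has_time = bool(re.search(time_pattern, "Be back home by 11:15 pm"))
no_time = bool(re.search(time_pattern, "No time mentioned here"))

print(has_time)  # True
print(no_time)   # False
```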
This example comes from the Coursera course "Applied Text Mining in Python" from the University of Michigan.
import pandas as pd
time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
"Tuesday: The dentist's appointment is at 11:30 am.",
"Wednesday: At 7:00pm, there is a basketball game!",
"Thursday: Be back home by 11:15 pm at the latest.",
"Friday: Take the train at 08:10 am, arrive at 09:00am."]
df = pd.DataFrame(time_sentences, columns=['text'])
df
Output
text
0 Monday: The doctor's appointment is at 2:45pm.
1 Tuesday: The dentist's appointment is at 11:30...
2 Wednesday: At 7:00pm, there is a basketball game!
3 Thursday: Be back home by 11:15 pm at the latest.
4 Friday: Take the train at 08:10 am, arrive at ...
Find the number of characters for each string in df['text']
df['text'].str.len()
Output
0 46
1 50
2 49
3 49
4 54
Name: text, dtype: int64
Find the number of tokens for each string in df['text']
df['text'].str.split().str.len()
Output
0 7
1 8
2 8
3 10
4 10
Name: text, dtype: int64
Find which entries contain the word ‘appointment’
df['text'].str.contains('appointment')
Output
0 True
1 True
2 False
3 False
4 False
Name: text, dtype: bool
Find how many times a digit occurs in each string
df['text'].str.count(r'\d')
Output
0 3
1 4
2 3
3 4
4 8
Name: text, dtype: int64
Find all occurrences of the digits
df['text'].str.findall(r'\d')
Output
0 [2, 4, 5]
1 [1, 1, 3, 0]
2 [7, 0, 0]
3 [1, 1, 1, 5]
4 [0, 8, 1, 0, 0, 9, 0, 0]
Name: text, dtype: object
Group and find the hours and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')
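() marks the start and end of a subexpression (a group). (\d?\d) is the first group and matches a one- or two-digit number; (\d\d) is the second group and matches exactly two digits.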
0 [(2, 45)]
1 [(11, 30)]
2 [(7, 00)]
3 [(11, 15)]
4 [(08, 10), (09, 00)]
Name: text, dtype: object
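The effect of grouping can also be seen with the standard-library `re.findall`, which behaves similarly for a single string. A small sketch (the variable names are illustrative only):

```python
import re

sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# With two groups, findall returns one (hour, minute) tuple per match
pairs = re.findall(r'(\d?\d):(\d\d)', sentence)
print(pairs)  # [('08', '10'), ('09', '00')]

# Without groups, findall returns each whole match as a plain string
whole = re.findall(r'\d?\d:\d\d', sentence)
print(whole)  # ['08:10', '09:00']
```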
Replace weekdays with ‘???’
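df['text'].str.replace(r'\w+day\b', '???', regex=True)
This regular expression matches any word that ends in "day" and replaces it with "???". regex=True is needed in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.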
0 ???: The doctor's appointment is at 2:45pm.
1 ???: The dentist's appointment is at 11:30 am.
2 ???: At 7:00pm, there is a basketball game!
3 ???: Be back home by 11:15 pm at the latest.
4 ???: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
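Note that \w+day\b matches any word ending in "day", not only weekday names, while the \b word boundary keeps it from matching inside longer words. A minimal sketch with `re.sub` on a made-up sentence:

```python
import re

# Made-up sentence: "today" ends in "day", "holidays" does not
text = "Monday: holidays start today."

replaced = re.sub(r'\w+day\b', '???', text)
print(replaced)  # ???: holidays start ???.
```

If only real weekday names should be replaced, a stricter pattern that lists them explicitly would be one option.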
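Replace weekdays with 3 letter abbreviations
df['text'].str.replace(r'(\w+day\b)', lambda x: x.groups()[0][:3], regex=True)
The argument x passed to the lambda is the match object for each matched weekday, not the plain string. x.groups() returns a tuple of all captured groups; this pattern has only one group, so the tuple has a single element, which is why [0] is used. Slicing that string with [:3] keeps its first three characters, which then replace the full weekday name in the original string. regex=True is needed because a callable replacement is only supported when the pattern is treated as a regular expression.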
0 Mon: The doctor's appointment is at 2:45pm.
1 Tue: The dentist's appointment is at 11:30 am.
2 Wed: At 7:00pm, there is a basketball game!
3 Thu: Be back home by 11:15 pm at the latest.
4 Fri: Take the train at 08:10 am, arrive at 09:...
Name: text, dtype: object
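The same idea can be sketched with `re.sub`, which also accepts a function that receives a match object. The helper name `abbreviate` is hypothetical.

```python
import re

def abbreviate(match):
    # match is an re.Match object; group(1) is the text captured by the first ()
    return match.group(1)[:3]

text = "Wednesday: At 7:00pm, there is a basketball game!"

# groups() returns a tuple with one entry per capturing group
captured = re.search(r'(\w+day\b)', text).groups()
print(captured)  # ('Wednesday',)

shortened = re.sub(r'(\w+day\b)', abbreviate, text)
print(shortened)  # Wed: At 7:00pm, there is a basketball game!
```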
Create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')
extract pulls the groups out of the first match in each string and returns a DataFrame with one column per group.
    0   1
0   2  45
1  11  30
2   7  00
3  11  15
4  08  10
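For a single string, "first match only" corresponds roughly to `re.search`. A small sketch; the second sample string is made up to show the no-match case, where pandas would put NaN in the row instead.

```python
import re

pattern = r'(\d?\d):(\d\d)'

# Only the first time in the sentence is returned, even though there are two
first = re.search(pattern, "Friday: Take the train at 08:10 am, arrive at 09:00am.")
first_groups = first.groups()
print(first_groups)  # ('08', '10')

# When nothing matches, re.search returns None
missing = re.search(pattern, "No time mentioned here")
print(missing)  # None
```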
Extract the entire time, the hours, the minutes, and the period
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')
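This regular expression contains four groups, each delimited by (). Groups are numbered by the position of their opening parenthesis, so the outermost pair, which wraps the whole expression, comes first (column 0, the entire time), followed by the hour, the minute, and the period. [ap]m matches either "am" or "pm".
extractall returns every match, not just the first, as a row of a DataFrame with one column per group; the match level of the index numbers the matches within each string.
The ":" has no special meaning here; it simply matches a literal colon.
" ?" is a space followed by a question mark, which matches an optional space. Some times in the text have a space before am/pm and some do not, so writing it this way ensures that every time is found.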
                0   1   2   3
  match
0 0        2:45pm   2  45  pm
1 0      11:30 am  11  30  am
2 0        7:00pm   7  00  pm
3 0      11:15 pm  11  15  pm
4 0      08:10 am  08  10  am
  1       09:00am  09  00  am
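The group ordering and the optional space can be checked on one sentence with `re.finditer`, which, like extractall, visits every match. A sketch with illustrative variable names:

```python
import re

pattern = re.compile(r'((\d?\d):(\d\d) ?([ap]m))')
sentence = "Friday: Take the train at 08:10 am, arrive at 09:00am."

# One tuple per match: (entire time, hour, minute, period)
rows = [m.groups() for m in pattern.finditer(sentence)]
print(rows)
# [('08:10 am', '08', '10', 'am'), ('09:00am', '09', '00', 'am')]
```

Both the spaced form "08:10 am" and the unspaced form "09:00am" are found, which is the job of " ?" in the pattern.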
Extract the entire time, the hours, the minutes, and the period with group names
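df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')
Named groups tidy up the output: ?P<name> at the start of each group gives it a name, which becomes the corresponding DataFrame column name.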
             time hour minute period
  match
0 0        2:45pm    2     45     pm
1 0      11:30 am   11     30     am
2 0        7:00pm    7     00     pm
3 0      11:15 pm   11     15     pm
4 0      08:10 am   08     10     am
  1       09:00am   09     00     am
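Named groups work the same way in the standard-library `re` module, where `groupdict()` returns the captured text keyed by group name. This sketch assumes the pattern is the one from the previous step, with the group names time, hour, minute and period taken from the column headers above.

```python
import re

# Assumed pattern: the earlier four-group expression with a name on each group
pattern = re.compile(
    r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))'
)

m = pattern.search("Monday: The doctor's appointment is at 2:45pm.")
named = m.groupdict()
print(named)
# {'time': '2:45pm', 'hour': '2', 'minute': '45', 'period': 'pm'}
```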