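In data analysis and modeling, a large share of the time goes into data preparation: loading, cleaning, transforming, and rearranging. This work is often reported to take up 80% or more of an analyst's time.
This chapter covers the pandas tools for handling missing values, duplicates, string operations, and other analytical data transformations.
All descriptive statistics on pandas objects exclude missing values by default. For numeric data, pandas uses the floating-point value NaN to represent missing values, so NaN serves as a sentinel value that is easy to detect.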
a. Calling dropna on a Series returns only the non-null values together with their index labels:
data = pd.Series([1, np.nan, 3.5, np.nan, 7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
cleaned = data.dropna() # the original Series is unchanged; equivalent to data[data.notnull()]
cleaned
0 1.0
2 3.5
4 7.0
dtype: float64
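The relationship between dropna and notnull can be checked with a small sketch. notnull only builds a boolean mask, and indexing with that mask is what removes the missing entries. The variable names mask, filtered, and same are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

# notnull() only builds a boolean mask; it does not drop anything by itself.
mask = data.notnull()

# Indexing with the mask keeps the entries where the mask is True.
filtered = data[mask]

# Both routes give the same values and the same index labels.
same = filtered.equals(data.dropna())
print(mask.tolist())
print(same)
```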
b. For a DataFrame, several options control what gets dropped:
from numpy import nan as NA # use NA as shorthand for np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
data.dropna() # by default, drops every row that contains a missing value
0 1 2
0 1.0 6.5 3.0
data.dropna(how='all') # drops only the rows in which every value is NA
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4] = NA
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
data.dropna(axis=1, how='all') # axis=1 drops columns instead of rows
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
0 1 2
0 -0.049375 NaN NaN
1 -0.557540 NaN NaN
2 -0.079590 NaN 1.016858
3 0.035582 NaN 0.562665
4 1.023047 0.485505 -0.086212
5 0.315884 -1.314378 0.916748
6 0.479718 -0.704622 -1.252934
df.dropna(thresh=2) # keep only the rows with at least 2 non-null values
0 1 2
2 -0.079590 NaN 1.016858
3 0.035582 NaN 0.562665
4 1.023047 0.485505 -0.086212
5 0.315884 -1.314378 0.916748
6 0.479718 -0.704622 -1.252934
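Because the example above uses random data, a small deterministic sketch may make thresh easier to follow. The frame and its column names are made up for illustration. The subset argument, which the notes above do not use, restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen so the result is easy to verify by eye.
frame = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, 3.0, 5.0],
    'c': [7.0, np.nan, 9.0, np.nan],
})

# Number of non-null values per row: this is what thresh is compared against.
non_null_per_row = frame.notnull().sum(axis=1)

# thresh=2 keeps the rows with at least two non-null values.
at_least_two = frame.dropna(thresh=2)

# subset only looks at the listed columns when deciding what to drop.
a_present = frame.dropna(subset=['a'])

print(non_null_per_row.tolist())
print(at_least_two.index.tolist())
print(a_present.index.tolist())
```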
Besides filtering missing values out, you can fill them in with the fillna method.
a. Replace missing values with a constant:
df
0 1 2
0 -0.049375 NaN NaN
1 -0.557540 NaN NaN
2 -0.079590 NaN 1.016858
3 0.035582 NaN 0.562665
4 1.023047 0.485505 -0.086212
5 0.315884 -1.314378 0.916748
6 0.479718 -0.704622 -1.252934
df.fillna(0) # the original data is unchanged
0 1 2
0 -0.049375 0.000000 0.000000
1 -0.557540 0.000000 0.000000
2 -0.079590 0.000000 1.016858
3 0.035582 0.000000 0.562665
4 1.023047 0.485505 -0.086212
5 0.315884 -1.314378 0.916748
6 0.479718 -0.704622 -1.252934
b. Pass a dict to use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})
0 1 2
0 -0.049375 0.500000 0.000000
1 -0.557540 0.500000 0.000000
2 -0.079590 0.500000 1.016858
3 0.035582 0.500000 0.562665
4 1.023047 0.485505 -0.086212
5 0.315884 -1.314378 0.916748
6 0.479718 -0.704622 -1.252934
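c. fillna returns a new object and leaves the original untouched; pass inplace=True to modify the existing object instead:
df.fillna(0, inplace=True) # modifies the original data
d. The interpolation-style fill methods (forward and backward fill) are also supported: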
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
0 1 2
0 0.951068 0.878024 0.837684
1 0.231418 -1.218248 -0.986771
2 1.115496 NaN 0.536248
3 -1.295515 NaN -1.474961
4 0.101957 NaN NaN
5 -2.015696 NaN NaN
df.fillna(method='ffill') # forward fill; use 'bfill' to fill backward
0 1 2
0 0.951068 0.878024 0.837684
1 0.231418 -1.218248 -0.986771
2 1.115496 -1.218248 0.536248
3 -1.295515 -1.218248 -1.474961
4 0.101957 -1.218248 -1.474961
5 -2.015696 -1.218248 -1.474961
df.fillna(method='ffill', limit=2) # forward fill at most two consecutive missing values
0 1 2
0 0.951068 0.878024 0.837684
1 0.231418 -1.218248 -0.986771
2 1.115496 -1.218248 0.536248
3 -1.295515 -1.218248 -1.474961
4 0.101957 NaN -1.474961
5 -2.015696 NaN -1.474961
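As far as I know, recent pandas releases (2.1 and later) deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods. The sketch below shows the same propagation written that way. The frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps of different lengths.
frame = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan, 5.0],
                      'y': [np.nan, 2.0, np.nan, 4.0, np.nan]})

# Propagate the last valid observation forward.
forward = frame.ffill()

# Fill at most one consecutive missing value per gap.
limited = frame.ffill(limit=1)

# Propagate the next valid observation backward.
backward = frame.bfill()

print(forward)
print(limited)
print(backward)
```

A leading NaN has nothing before it to propagate, so it survives a forward fill; the same holds for a trailing NaN and a backward fill.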
e. You can also fill with a statistic such as the mean or the median:
data = pd.Series([1., NA, 3.5, NA, 7])
data
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
a. The DataFrame method duplicated returns a boolean Series indicating, for each row, whether an identical row has already appeared earlier:
data = pd.DataFrame({
'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
b. drop_duplicates returns a DataFrame containing the rows for which duplicated is False, that is, with the duplicate rows removed:
data.drop_duplicates() # remove duplicate rows
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
c. Duplicates can also be detected on a subset of columns:
data['v1'] = range(7)
data
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
5 two 4 5
6 two 4 6
data.drop_duplicates(['k1'])
k1 k2 v1
0 one 1 0
1 two 1 1
d. Both duplicated and drop_duplicates keep the first observed row by default; pass keep='last' to keep the last one instead:
data.drop_duplicates(['k1', 'k2'], keep='last')
k1 k2 v1
0 one 1 0
1 two 1 1
2 one 2 2
3 two 3 3
4 one 3 4
6 two 4 6
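The effect of keep is easier to see side by side. The sketch below uses a shortened, made-up version of the frame above and also shows keep=False, which is not covered in the notes and marks every member of a duplicated group.

```python
import pandas as pd

# Shortened, hypothetical version of the frame used above.
frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two', 'two'],
                      'k2': [1, 1, 2, 4, 4],
                      'v1': range(5)})
keys = ['k1', 'k2']

# Default: the first row of each duplicated group survives.
first_kept = frame.drop_duplicates(subset=keys)

# keep='last': the last row of each duplicated group survives.
last_kept = frame.drop_duplicates(subset=keys, keep='last')

# keep=False: flag every row that has a duplicate, including the first one.
all_marked = frame.duplicated(subset=keys, keep=False)

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(all_marked.tolist())
```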
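The following transformations operate on the arrays, columns, or values of a DataFrame.
a. Mapping over an array: the Series map method accepts a function or a dict-like object that holds the mapping: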
data = pd.DataFrame({
'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
meat_to_animal = {
'bacon' : 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
lowercased = data['food'].str.lower() # normalize case
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
b. Alternatively, pass map a single function that does all of the work:
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
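The two approaches differ when a value has no entry in the mapping. A minimal sketch, assuming a hypothetical value 'tofu' that the dict does not contain: mapping with the dict produces NaN, while a lambda that indexes the dict raises KeyError unless it uses dict.get with a default.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['bacon', 'Pastrami', 'tofu'])  # 'tofu' has no mapping

# Dict mapping: unmapped values become NaN.
by_dict = food.str.lower().map(meat_to_animal)

# Function mapping with a default for unmapped values.
by_get = food.map(lambda x: meat_to_animal.get(x.lower(), 'unknown'))

# Function mapping with plain indexing fails on the unmapped value.
try:
    food.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

print(by_dict.tolist())
print(by_get.tolist())
print(raised)
```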
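Filling missing values with fillna is a special case of general value replacement. Sentinel values that are not NA but still stand for missing data may also need handling, and the replace method does that. Note that it differs from str.replace, which performs substring replacement element-wise on strings.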
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
data.replace(-999, NA) # the original data is unchanged unless inplace=True is passed
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], NA) # pass a list and one replacement to replace several values at once
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0]) # a different replacement for each value
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0}) # the same replacement expressed as a dict
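The difference between Series.replace and str.replace noted above can be illustrated with a short sketch. The sample strings are made up. replace swaps whole values that match, while str.replace substitutes the substring inside each element.

```python
import pandas as pd

# Hypothetical string codes.
codes = pd.Series(['N/A', 'N/A-1', 'ok'])

# Series.replace: only elements equal to 'N/A' as a whole are replaced.
whole = codes.replace('N/A', 'missing')

# str.replace: the substring is replaced wherever it occurs in each element.
partial = codes.str.replace('N/A', 'missing', regex=False)

print(whole.tolist())
print(partial.tolist())
```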
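Axis labels can be transformed by a function or a mapping of some form to produce a new object with different labels. The axes can also be modified in place, without creating a new data structure.
a. Transforming the axis index: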
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
New York 8 9 10 11
data.index.map(lambda x: x[:4].upper()) # the index of the original data is unchanged
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(lambda x: x[:4].upper()) # assigning to index modifies the original DataFrame
data
one two three four
OHIO 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
b. The rename method creates a transformed version of the dataset:
data.rename(index=str.title, columns=str.upper) # does not change the original dataset
ONE TWO THREE FOUR
Ohio 0 1 2 3
Colo 4 5 6 7
New 8 9 10 11
data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'}) # row and column labels can be changed in one call
one two peekaboo four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
data.rename(index={'OHIO': 'INDIANA'}, inplace=True) # modifies the existing dataset
data
one two three four
INDIANA 0 1 2 3
COLO 4 5 6 7
NEW 8 9 10 11
a. Continuous values often need to be discretized, or separated into "bins", for analysis. The pandas cut function does this:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
pandas returns a special Categorical object.
type(cats)
pandas.core.arrays.categorical.Categorical
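You can treat it as an array of strings naming the bins. Internally it holds a categories array with the distinct category names, and a codes attribute that labels each value in ages with the bin it belongs to: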
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
value_counts counts the number of values in each bin:
pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
By default the intervals are open on the left and closed on the right; pass right=False to change which side is closed:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
b. Pass a list or array to the labels option to give the bins custom names:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
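Which side of an interval is closed matters most for values that sit exactly on a bin edge. The sketch below uses made-up ages placed on the edges and compares the resulting codes. A code of -1 means the value fell outside every bin. include_lowest is an extra option not used in the notes above.

```python
import pandas as pd

# Hypothetical ages that sit exactly on, or right next to, the bin edges.
ages = [18, 25, 26, 35, 60, 61]
bins = [18, 25, 35, 60, 100]

# Default: intervals are (left, right], so 18 belongs to no bin.
right_closed = pd.cut(ages, bins)

# include_lowest=True also admits the lowest edge into the first bin.
with_lowest = pd.cut(ages, bins, include_lowest=True)

# right=False: intervals are [left, right), so 25 moves up to the second bin.
left_closed = pd.cut(ages, bins, right=False)

print(right_closed.codes.tolist())
print(with_lowest.codes.tolist())
print(left_closed.codes.tolist())
```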
c. You can also pass the number of bins explicitly, and pandas computes equal-length bins from the minimum and maximum of the data:
data = np.random.rand(20)
data
array([0.62558948, 0.49247275, 0.25681885, 0.25178935, 0.14436841,
0.63673486, 0.23204162, 0.797068 , 0.03570436, 0.30070254,
0.30990067, 0.17416387, 0.02437491, 0.74195102, 0.56268857,
0.31056704, 0.84019179, 0.22255447, 0.95149209, 0.3665864 ])
pd.cut(data, 4, precision=2)
[(0.49, 0.72], (0.49, 0.72], (0.26, 0.49], (0.023, 0.26], (0.023, 0.26], ..., (0.26, 0.49], (0.72, 0.95], (0.023, 0.26], (0.72, 0.95], (0.26, 0.49]]
Length: 20
Categories (4, interval[float64]): [(0.023, 0.26] < (0.26, 0.49] < (0.49, 0.72] < (0.72, 0.95]]
pd.value_counts(pd.cut(data, 4, precision=2)) # precision=2 limits the decimal precision to two digits
(0.023, 0.26] 7
(0.26, 0.49] 5
(0.72, 0.95] 4
(0.49, 0.72] 4
dtype: int64
d. The qcut function bins data based on sample quantiles:
data = np.random.randn(1000) # normally distributed
cats = pd.qcut(data, 4) # cut into quartiles
cats
[(0.68, 2.953], (-3.303, -0.638], (-0.638, -0.018], (-0.638, -0.018], (-0.638, -0.018], ..., (-3.303, -0.638], (0.68, 2.953], (0.68, 2.953], (-0.018, 0.68], (0.68, 2.953]]
Length: 1000
Categories (4, interval[float64]): [(-3.303, -0.638] < (-0.638, -0.018] < (-0.018, 0.68] < (0.68, 2.953]]
pd.value_counts(cats)
(0.68, 2.953] 250
(-0.018, 0.68] 250
(-0.638, -0.018] 250
(-3.303, -0.638] 250
dtype: int64
e. qcut also accepts custom quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(1.278, 2.953], (-1.163, -0.018], (-1.163, -0.018], (-1.163, -0.018], (-1.163, -0.018], ..., (-1.163, -0.018], (1.278, 2.953], (-0.018, 1.278], (-0.018, 1.278], (-0.018, 1.278]]
Length: 1000
Categories (4, interval[float64]): [(-3.303, -1.163] < (-1.163, -0.018] < (-0.018, 1.278] < (1.278, 2.953]]
pd.value_counts(pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.]))
(-0.018, 1.278] 400
(-1.163, -0.018] 400
(1.278, 2.953] 100
(-3.303, -1.163] 100
dtype: int64
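To contrast the two functions on skewed data, here is a minimal sketch using the squares of 1 to 100 as a made-up sample. cut produces equal-width bins with uneven counts, while qcut produces bins of equal size. labels=False returns the integer bin number of each value instead of an interval.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample: the squares 1, 4, 9, ..., 10000.
values = np.arange(1, 101) ** 2

# Equal-width bins between the minimum and the maximum.
width_codes = pd.cut(values, 4, labels=False)

# Quantile-based bins: each bin receives the same number of values.
quantile_codes = pd.qcut(values, 4, labels=False)

width_counts = np.bincount(width_codes).tolist()
quantile_counts = np.bincount(quantile_codes).tolist()

print(width_counts)
print(quantile_counts)
```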
a. Filtering and transforming outliers:
data = pd.DataFrame(np.random.randn(1000, 4))
data[np.abs(data[2]) > 3] # select the rows where column 2 exceeds 3 in absolute value
0 1 2 3
8 0.250961 1.072362 3.296707 -1.078753
115 -0.761032 0.401280 -3.030210 0.403297
406 0.217815 0.025579 3.377316 -0.192798
data[(np.abs(data) > 3).any(axis=1)] # select the rows containing any value above 3 or below -3; any(axis=1) is True when at least one column matches
0 1 2 3
8 0.250961 1.072362 3.296707 -1.078753
66 1.055522 0.034082 -0.128774 3.022063
115 -0.761032 0.401280 -3.030210 0.403297
203 3.108667 0.105787 0.763269 0.917179
331 -0.865157 1.298979 1.368561 -3.061785
406 0.217815 0.025579 3.377316 -0.192798
685 3.248309 1.246953 0.036534 0.682988
706 3.344908 -0.466498 -0.369035 -1.406542
766 1.149687 3.503083 -0.801795 -0.069251
777 -0.305337 -0.312032 2.006442 -3.775696
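b. np.sign(data) yields 1 or -1 according to the sign of each value, which makes it possible to cap values that fall outside the interval from -3 to 3: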
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
0 1 2 3
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.023892 0.049617 -0.029047 -0.026467
std 1.003102 0.956176 0.986469 0.967800
min -2.957964 -2.796303 -3.000000 -3.000000
25% -0.690697 -0.558802 -0.692625 -0.641618
50% -0.025567 0.036595 -0.030208 -0.006524
75% 0.645585 0.681651 0.626127 0.623537
max 3.000000 3.000000 3.000000 3.000000
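The capping step can be checked on a small deterministic frame. The sketch below repeats the np.sign approach on made-up values and compares it with DataFrame.clip, an alternative that the notes do not use but that should give the same result for symmetric bounds.

```python
import numpy as np
import pandas as pd

# Hypothetical values, some of them outside the interval [-3, 3].
frame = pd.DataFrame({'a': [0.5, 3.7, -4.2], 'b': [-3.0, 2.9, 10.0]})

# Approach from the notes: overwrite outliers with +3 or -3 by sign.
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip bounds the values directly.
clipped = frame.clip(lower=-3, upper=3)

print(capped)
print(clipped)
```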
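a. np.random.permutation can be used to permute (randomly reorder) a Series or the rows of a DataFrame, which is widely used in random sampling. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering: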
df = pd.DataFrame(np.arange(20).reshape(5, 4))
df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
sampler = np.random.permutation(5) # sampler is a NumPy array
sampler
array([4, 2, 1, 3, 0])
The resulting integer array can be used in iloc-based indexing or with the take function:
df.take(sampler) # same result as df.iloc[sampler]; reorders the rows according to sampler
0 1 2 3
4 16 17 18 19
2 8 9 10 11
1 4 5 6 7
3 12 13 14 15
0 0 1 2 3
b. The sample method selects a random subset without replacement:
df.sample(n=3) # the selected rows are random
0 1 2 3
4 16 17 18 19
3 12 13 14 15
1 4 5 6 7
c. To generate a sample with replacement (allowing repeated choices), pass replace=True:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(10, replace=True)
draws
1 7
2 -1
4 4
4 4
2 -1
4 4
4 4
3 6
1 7
1 7
dtype: int64
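The outputs above change on every run. If reproducible results are needed, one option is to seed the random source. The sketch below is an illustration with arbitrary seed values: it uses NumPy's default_rng for the permutation and the random_state argument of sample.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(5, 4))

# Seeded generator, so the permutation is the same on every run.
rng = np.random.default_rng(0)
order = rng.permutation(len(frame))
shuffled = frame.take(order)

# random_state makes sample repeatable as well.
subset_a = frame.sample(n=3, random_state=42)
subset_b = frame.sample(n=3, random_state=42)

# Sampling with replacement can draw more items than the Series holds.
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(10, replace=True, random_state=0)

print(shuffled)
print(subset_a.equals(subset_b))
print(len(draws))
```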
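a. Dummy (indicator) variables are commonly used to encode categorical variables. If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns of 1s and 0s. The pandas get_dummies function does this: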
df = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
pd.get_dummies(df['key'])
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
You can add a prefix to the columns of the indicator DataFrame and then merge it with the other data:
dummies = pd.get_dummies(df['key'], prefix='key')
dummies
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
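The outputs above show 0 and 1. In pandas 2.0 and later, get_dummies returns boolean columns by default, so reproducing the integer output requires the dtype argument. A minimal sketch, using a shortened version of the frame above:

```python
import pandas as pd

frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c'], 'data1': range(4)})

# dtype=int requests 0/1 integers regardless of the pandas default.
dummies = pd.get_dummies(frame['key'], prefix='key', dtype=int)

# Join the indicator columns back onto the remaining data.
combined = frame[['data1']].join(dummies)

print(combined)
```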
b. When a row of a DataFrame belongs to several categories, things get more involved, as with movie genre data:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', engine='python', sep='::',
header=None, names=mnames)
movies[:5]
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
First, extract the list of distinct genres from the dataset:
all_genres = []
for x in movies.genres:
    all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
Build an all-zeros DataFrame with one row per movie and one column per genre:
zero_matrix = np.zeros((len(movies), len(genres)))
zero_matrix
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
dummies = pd.DataFrame(zero_matrix, columns=genres)
Set the genre columns for each movie to 1:
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
Join the result with movies:
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic[:5]
movie_id title genres Genre_Animation Genre_Children's Genre_Comedy Genre_Adventure Genre_Fantasy Genre_Romance Genre_Drama ... Genre_Crime Genre_Thriller Genre_Horror Genre_Sci-Fi Genre_Documentary Genre_War Genre_Musical Genre_Mystery Genre_Film-Noir Genre_Western
0 1 Toy Story (1995) Animation|Children's|Comedy 1.0 1.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 Jumanji (1995) Adventure|Children's|Fantasy 0.0 1.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 Grumpier Old Men (1995) Comedy|Romance 0.0 0.0 1.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4 Waiting to Exhale (1995) Comedy|Drama 0.0 0.0 1.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 Father of the Bride Part II (1995) Comedy 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 21 columns
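For delimiter-separated categories like these, the str.get_dummies method offers a shorter route than the explicit loop. This is an alternative sketch, not the approach used in the notes. It builds a three-row frame by hand from the rows shown above instead of reading movies.dat, and the resulting columns come out in alphabetical order rather than in order of first appearance.

```python
import pandas as pd

# Three rows copied from the output above, built by hand for illustration.
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy",
               'Comedy|Romance'],
})

# Split each string on '|' and build one 0/1 indicator column per genre.
dummies = movies['genres'].str.get_dummies('|')

# Prefix the indicator columns and attach them to the original frame.
result = movies.join(dummies.add_prefix('Genre_'))

print(dummies.columns.tolist())
print(result['Genre_Comedy'].tolist())
```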
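c. Combining get_dummies with a discretization function such as cut is widely used in statistical analysis; it classifies the values and lets you count them per bin: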
np.random.seed(12345)
values = np.random.rand(10)
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))[:4]
(0.0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1.0]
0 0 0 0 0 1
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
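The difference between the string methods find and index is that index raises an exception when the substring is not found, whereas find returns -1.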
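A short sketch of that difference, using a made-up string. The exception that index raises is ValueError.

```python
text = 'a,b, guido'

# find returns -1 when the substring is absent.
found = text.find(':')

# index raises ValueError in the same situation.
try:
    text.index(':')
    error = None
except ValueError as exc:
    error = type(exc).__name__

# When the substring is present, both return its first position.
position = text.index(',')

print(found, error, position)
```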
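a. Python's built-in re module applies regular expressions to strings. Its functions fall into three categories: pattern matching, substitution, and splitting.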
import re
text = 'foo bar\t baz \tqux'
re.split(r'\s+', text) # regular expression matching one or more whitespace characters
['foo', 'bar', 'baz', 'qux']
b. re.compile builds a reusable regular expression object that can be applied to many strings:
regex = re.compile(r'\s+')
regex.split(text)
c. findall returns every substring that matches the regular expression:
regex.findall(text)
[' ', '\t ', ' \t']
d. findall returns all matches in a string; search returns only the first match; match is stricter and matches only at the beginning of the string:
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE) # re.IGNORECASE makes the regular expression case-insensitive
regex.findall(text)
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
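search returns a match object for the first email address in the text. For this pattern, the match object can only tell us the start and end positions of the match in the string: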
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='[email protected]'>
text[m.start():m.end()] # slice to get the matched string
'[email protected]'
match matches only when the pattern occurs at the start of the string, and returns None otherwise:
print(regex.match(text))
None
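The three matching functions, plus sub for substitution, can be compared on one small input. The text and the pattern below are made up for illustration. When a pattern contains a single group, findall returns the content of that group for each match.

```python
import re

# Hypothetical text and pattern.
text = 'first id=42 then id=7'
regex = re.compile(r'id=(\d+)')

# findall: every match; with one group, only the group contents are returned.
all_ids = regex.findall(text)

# search: the first match anywhere in the string.
first = regex.search(text)

# match: only succeeds if the pattern matches at the very start.
at_start = regex.match(text)

# sub: replace every match with a new string.
masked = regex.sub('id=?', text)

print(all_ids)
print(first.span(), first.group(1))
print(at_start)
print(masked)
```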
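Regular expressions involve much more than can be covered here; see the separate chapter.
Calling methods through a Series' str attribute applies string operations across the whole array while skipping NA values; see elsewhere for details.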
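A minimal sketch of the NA-skipping behavior, using made-up addresses on the reserved example.com and example.org domains. Each str method leaves the missing entry as NA instead of raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing entry.
emails = pd.Series({'Dave': 'dave@example.com',
                    'Rob': 'rob@example.org',
                    'Wes': np.nan})

# Plain substring test; the missing entry stays missing.
has_org = emails.str.contains('.org', regex=False)

# Element-wise upper-casing, again skipping the missing entry.
upper = emails.str.upper()

# Split on '@' and take the second piece of each resulting list.
domains = emails.str.split('@').str[1]

print(has_org)
print(upper)
print(domains)
```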