Sometimes our data comes as strings, and pandas handles that conveniently too.
I. String handling
In [25]: s = pd.Series(list('ABCDEF'))
In [26]: s
Out[26]:
0 A
1 B
2 C
3 D
4 E
5 F
dtype: object
(1) Case conversion: s.str.lower() and s.str.upper()
In [27]: s.str.lower()
Out[27]:
0 a
1 b
2 c
3 d
4 e
5 f
dtype: object
In [28]: s.str.upper()
Out[28]:
0 A
1 B
2 C
3 D
4 E
5 F
dtype: object
(2) String length: s.str.len()
In [30]: s.str.len()
Out[30]:
0 1
1 1
2 1
3 1
4 1
5 1
dtype: int64
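The three calls above can be reproduced as a standalone script. This is a minimal sketch; the variable names `lower`, `upper` and `lengths` are only illustrative and do not appear in the session.

```python
import pandas as pd

s = pd.Series(list('ABCDEF'))

lower = s.str.lower()    # element-wise lowercase
upper = s.str.upper()    # element-wise uppercase
lengths = s.str.len()    # number of characters in each string

print(lower.tolist())
print(upper.tolist())
print(lengths.tolist())
```

All three return a new Series and leave `s` unchanged.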
(3) Splitting strings:
In [31]: s1 = pd.Series(['A_B_C','D E F',np.nan,'I_J K'])
In [32]: s1
Out[32]:
0 A_B_C
1 D E F
2 NaN
3 I_J K
dtype: object
split() turns each string into a list
In [42]: s1.str.split('_')
Out[42]:
0 [A, B, C]
1 [D E F]
2 NaN
3 [I, J K]
dtype: object
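str.get(i) returns the element at position i (counting from 0) of each list, and .str[i] gives the same result. Where that element does not exist, the result is NaN.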
In [43]: s1.str.split('_').str.get(1)
Out[43]:
0 B
1 NaN
2 NaN
3 J K
dtype: object
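As a self-contained sketch of the split-then-pick pattern, the script below rebuilds `s1` and compares `.str.get(1)` with `.str[1]`. The names `parts`, `second` and `second_alt` are illustrative only.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['A_B_C', 'D E F', np.nan, 'I_J K'])

parts = s1.str.split('_')     # each non-missing string becomes a list
second = parts.str.get(1)     # element at position 1, NaN when absent
second_alt = parts.str[1]     # indexing shorthand for the same lookup

print(second.tolist())
print(second.equals(second_alt))
```

Row 1 has no underscore, so its list has a single element and position 1 yields NaN, just like the genuinely missing row 2.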
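(4) Replacing substrings with replace: the first argument is the pattern to search for and the second is the replacement string. Whether the pattern is treated as a regular expression depends on the regex parameter, whose default differs between pandas versions (older releases default to a regular expression, pandas 2.0 and later to a literal string).
Replace spaces with underscores: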
In [53]: s1.str.replace(' ','_')
Out[53]:
0 A_B_C
1 D_E_F
2 NaN
3 I_J_K
dtype: object
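Because the default for `regex` depends on the pandas version, it may be safer to pass it explicitly. The sketch below shows one literal and one regular-expression replacement; the second pattern and the names `literal` and `pattern` are illustrative additions, not part of the original session.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['A_B_C', 'D E F', np.nan, 'I_J K'])

# Literal replacement: every space becomes an underscore.
literal = s1.str.replace(' ', '_', regex=False)

# Regular-expression replacement: any run of spaces or underscores becomes '-'.
pattern = s1.str.replace(r'[ _]+', '-', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Missing values pass through both calls as NaN.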
Extracting digits with extract (the definition of s2 is not shown in this session):
In [58]: s2.str.extract('[ab](\d)')
/usr/bin/ipython:1: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
#!/usr/bin/python
Out[58]:
0 1
1 2
2 1
3 2
4 NaN
5 NaN
dtype: object
To extract several pieces at once, use several capture groups (pairs of parentheses):
In [59]: s2.str.extract('([abc])(\d)')
/usr/bin/ipython:1: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
#!/usr/bin/python
Out[59]:
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 NaN NaN
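Use the question mark flexibly: it marks the preceding group as optional. With the digit group optional, the pattern below also matches an entry that consists of the letter "c" alone: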
In [60]: s2.str.extract('([abc])(\d)?')
/usr/bin/ipython:1: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
#!/usr/bin/python
Out[60]:
0 1
0 a 1
1 a 2
2 b 1
3 b 2
4 c 1
5 c NaN
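The three extract calls can be tried in one script. The session never shows how `s2` was built, so the data below is an assumption chosen to be consistent with the printed outputs. Passing `expand` explicitly also avoids the FutureWarning seen above on older pandas versions.

```python
import pandas as pd

# Assumed data: the original session does not show the definition of s2.
s2 = pd.Series(['a1', 'a2', 'b1', 'b2', 'c1', 'c'])

# One capture group with expand=False returns a Series.
digits = s2.str.extract(r'[ab](\d)', expand=False)

# Two capture groups return a DataFrame with one column per group.
both = s2.str.extract(r'([abc])(\d)', expand=True)

# Making the digit group optional lets the bare 'c' match as well.
optional = s2.str.extract(r'([abc])(\d)?', expand=True)

print(digits.tolist())
print(both)
print(optional)
```

Rows where the whole pattern fails to match come back as NaN in every column, while an optional group that did not participate is NaN only in its own column.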
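(5) Testing whether a string contains or matches a pattern: contains and match. contains succeeds if the pattern is found anywhere in the string, whereas match is stricter and requires the pattern to match at the start of the string.
The na parameter determines whether NaN entries are reported as True or False.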
In [70]: s
Out[70]:
0 a123
1 BCD
2 c5C
3 NaN
4 abc123
dtype: object
In [71]: s.str.contains('[a-z][0-9]',na=False)
Out[71]:
0 True
1 False
2 True
3 False
4 True
dtype: bool
In [72]: s.str.match('[a-z][0-9]',na=False)
Out[72]:
0 True
1 False
2 True
3 False
4 False
dtype: bool
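The difference between the two methods shows up on the last entry, 'abc123'. The sketch below rebuilds the Series from the output above; `found_anywhere` and `found_at_start` are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a123', 'BCD', 'c5C', np.nan, 'abc123'])

# True when a lowercase letter followed by a digit occurs anywhere.
found_anywhere = s.str.contains(r'[a-z][0-9]', na=False)

# True only when the string begins with a lowercase letter followed by a digit.
found_at_start = s.str.match(r'[a-z][0-9]', na=False)

print(found_anywhere.tolist())
print(found_at_start.tolist())
```

'abc123' contains 'c1', so contains reports True, but it begins with 'ab', so match reports False.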
Find entries that start with the letter a: s.str.startswith('a',na=False) or s.str.contains('^a',na=False)
Find entries that end with the letter a: s.str.endswith('a',na=False) or s.str.contains('a$',na=False)
In [67]: s
Out[67]:
0 a123
1 BCD
2 c5C
3 NaN
4 abc123
dtype: object
In [68]: s.str.startswith('a',na=False)
Out[68]:
0 True
1 False
2 False
3 False
4 True
dtype: bool
In [69]: s.str.contains('^a',na=False)
Out[69]:
0 True
1 False
2 False
3 False
4 True
dtype: bool
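The session only demonstrates the prefix check. As a sketch of the suffix check as well, the script below tests for a trailing '3' (an illustrative choice, since no entry in this Series ends with 'a') and confirms that the plain and regex forms agree.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a123', 'BCD', 'c5C', np.nan, 'abc123'])

starts = s.str.startswith('a', na=False)     # plain prefix test
starts_re = s.str.contains(r'^a', na=False)  # same test as an anchored regex

ends = s.str.endswith('3', na=False)         # plain suffix test
ends_re = s.str.contains(r'3$', na=False)    # same test as an anchored regex

print(starts.tolist(), ends.tolist())
```

startswith and endswith take literal strings, so they are the simpler choice when no pattern syntax is needed.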
II. Handling missing values
(1) Filling missing values: df.fillna(0) or df.fillna('abc'). NaN can be replaced with any number or string.
In [72]: df1
Out[72]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 NaN 44 99.0 70
2 51 61.0 58 NaN 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
In [73]: df1.fillna(0)
Out[73]:
0 1 2 3 4
0 77 2.0 5 0.0 76
1 32 0.0 44 99.0 70
2 51 61.0 58 0.0 10
3 19 52.0 20 68.0 54
4 89 0.0 81 75.0 98
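Replace NaN with the previous value: method='pad'; with the next value: method='bfill'.
Both refer to the previous or next value within the same column, so a NaN with no such neighbor stays NaN. Use limit to cap how many consecutive NaN values are filled: df1.fillna(method='bfill',limit=1). The same results are also available as df1.ffill() and df1.bfill().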
In [74]: df1
Out[74]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 NaN 44 99.0 70
2 51 61.0 58 NaN 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
In [75]: df1.fillna(method='pad')
Out[75]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 2.0 44 99.0 70
2 51 61.0 58 99.0 10
3 19 52.0 20 68.0 54
4 89 52.0 81 75.0 98
In [76]: df1.fillna(method='bfill')
Out[76]:
0 1 2 3 4
0 77 2.0 5 99.0 76
1 32 61.0 44 99.0 70
2 51 61.0 58 68.0 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
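The fills above can be reproduced with the dedicated `ffill()` and `bfill()` methods, which recent pandas releases favor over `fillna(method=...)`. This is a sketch: `df1` is rebuilt from the printed output, and the short Series `gaps` is an illustrative addition to make the effect of `limit` visible.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    [77, 2.0, 5, np.nan, 76],
    [32, np.nan, 44, 99.0, 70],
    [51, 61.0, 58, np.nan, 10],
    [19, 52.0, 20, 68.0, 54],
    [89, np.nan, 81, 75.0, 98],
])

zero_filled = df1.fillna(0)   # constant fill
forward = df1.ffill()         # take the previous value in the same column
backward = df1.bfill()        # take the next value in the same column

# limit caps how many consecutive NaN values are filled.
gaps = pd.Series([1.0, np.nan, np.nan, 4.0])
limited = gaps.bfill(limit=1)

print(forward)
print(backward)
print(limited.tolist())
```

In `limited`, only the NaN directly before 4.0 is filled; the one further away stays NaN.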
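Besides a fixed value, NaN can also be replaced with the mean or another descriptive statistic,
and you can choose which columns to fill. Below, df.mean()[1:2] keeps only the mean of column 1, so only that column is filled: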
In [82]: df
Out[82]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 NaN 44 99.0 70
2 51 61.0 58 NaN 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
In [83]: df.fillna(df.mean()[1:2])
Out[83]:
0 1 2 3 4
0 77 2.000000 5 NaN 76
1 32 38.333333 44 99.0 70
2 51 61.000000 58 NaN 10
3 19 52.000000 20 68.0 54
4 89 38.333333 81 75.0 98
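A dictionary keyed by column label is another way to restrict the fill to chosen columns. The sketch below rebuilds `df` from the printed output; `all_filled` and `only_col1` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [77, 2.0, 5, np.nan, 76],
    [32, np.nan, 44, 99.0, 70],
    [51, 61.0, 58, np.nan, 10],
    [19, 52.0, 20, 68.0, 54],
    [89, np.nan, 81, 75.0, 98],
])

# Fill every column with its own mean.
all_filled = df.fillna(df.mean())

# Fill only column 1; column 3 keeps its NaN values.
only_col1 = df.fillna({1: df[1].mean()})

print(all_filled)
print(only_col1)
```

Swapping `df.mean()` for `df.median()` fills with the column medians instead.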
Flagging missing values with a boolean mask:
In [88]: df
Out[88]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 NaN 44 99.0 70
2 51 61.0 58 NaN 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
In [89]: pd.isnull(df)
Out[89]:
0 1 2 3 4
0 False False False True False
1 False True False False False
2 False False False True False
3 False False False False False
4 False True False False False
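A boolean mask like this is often the starting point for counting or selecting missing data. The sketch below uses `df.isna()`, which is equivalent to `pd.isnull(df)`; the names `per_column` and `rows_with_missing` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [77, 2.0, 5, np.nan, 76],
    [32, np.nan, 44, 99.0, 70],
    [51, 61.0, 58, np.nan, 10],
    [19, 52.0, 20, 68.0, 54],
    [89, np.nan, 81, 75.0, 98],
])

mask = df.isna()                           # True where a value is missing
per_column = mask.sum()                    # number of NaN values per column
rows_with_missing = df[mask.any(axis=1)]   # rows that contain at least one NaN

print(per_column.tolist())
print(rows_with_missing.index.tolist())
```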
(2) Dropping missing values
You can drop either the rows or the columns that contain NaN; both use df.dropna().
In [84]: df.dropna(axis=0)
Out[84]:
0 1 2 3 4
3 19 52.0 20 68.0 54
In [85]: df
Out[85]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 NaN 44 99.0 70
2 51 61.0 58 NaN 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
In [86]: df.dropna(axis=1)
Out[86]:
0 2 4
0 77 5 76
1 32 44 70
2 51 58 10
3 19 20 54
4 89 81 98
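As a sketch of the two calls above plus one variation, the script below also uses the `subset` parameter, which is not part of the original session, to drop rows only when a chosen column is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [77, 2.0, 5, np.nan, 76],
    [32, np.nan, 44, 99.0, 70],
    [51, 61.0, 58, np.nan, 10],
    [19, 52.0, 20, 68.0, 54],
    [89, np.nan, 81, 75.0, 98],
])

complete_rows = df.dropna(axis=0)     # keep rows without any NaN
complete_cols = df.dropna(axis=1)     # keep columns without any NaN
col1_present = df.dropna(subset=[1])  # drop rows only where column 1 is NaN

print(complete_rows.index.tolist())
print(complete_cols.columns.tolist())
print(col1_present.index.tolist())
```

None of these calls modifies `df`; each returns a new DataFrame.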
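Estimating missing values by interpolation: interpolate() assumes the values lie on a straight line, so a single missing value between two known values becomes their average. As the output shows, a trailing NaN takes the last known value and a leading NaN is left unchanged.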
In [94]: df
Out[94]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 NaN 44 99.0 70
2 51 61.0 58 NaN 10
3 19 52.0 20 68.0 54
4 89 NaN 81 75.0 98
In [95]: df.interpolate()
Out[95]:
0 1 2 3 4
0 77 2.0 5 NaN 76
1 32 31.5 44 99.0 70
2 51 61.0 58 83.5 10
3 19 52.0 20 68.0 54
4 89 52.0 81 75.0 98
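The averaging behavior can be checked directly. The sketch below rebuilds `df`, runs the default linear interpolation, and additionally passes `limit_direction='both'` (not used in the original session) to show how the leading NaN can be filled too.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [77, 2.0, 5, np.nan, 76],
    [32, np.nan, 44, 99.0, 70],
    [51, 61.0, 58, np.nan, 10],
    [19, 52.0, 20, 68.0, 54],
    [89, np.nan, 81, 75.0, 98],
])

linear = df.interpolate()                        # default: linear, forward only
both = df.interpolate(limit_direction='both')    # also fill leading NaN values

print(linear)
print(both)
```

In column 1, the gap between 2.0 and 61.0 becomes (2.0 + 61.0) / 2 = 31.5; in column 3, the gap between 99.0 and 68.0 becomes 83.5.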
III. Replacing values
1. Replacing a single value
In [100]: s=pd.Series([0,1,2,3,4])
In [101]: s.replace(1,6)
Out[101]:
0 0
1 6
2 2
3 3
4 4
dtype: int64
2. Replacing a list of values
In [102]: s.replace([2,3,4],[4,3,2])
Out[102]:
0 0
1 1
2 4
3 3
4 2
dtype: int64
3. Replacing with a dictionary mapping
In [103]: s.replace({1:100,2:200})
Out[103]:
0 0
1 100
2 200
3 3
4 4
dtype: int64
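The three forms of Series.replace shown so far can be run together. This is a minimal sketch with illustrative result names.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])

single = s.replace(1, 6)                      # one value for another
pairwise = s.replace([2, 3, 4], [4, 3, 2])    # lists are matched position by position
mapped = s.replace({1: 100, 2: 200})          # dictionary of old -> new

print(single.tolist())
print(pairwise.tolist())
print(mapped.tolist())
```

The list form replaces against the original values in one pass, so the 2 that becomes 4 is not turned back into 2 by the later 4 -> 2 pair.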
4. Replacing a value across a DataFrame
In [109]: df
Out[109]:
A B
0 11 22
1 33 44
2 55 66
3 77 88
In [110]: df.replace(11,99)
Out[110]:
A B
0 99 22
1 33 44
2 55 66
3 77 88
5. If only one column needs its values replaced, operate on that column alone.
Several columns work too: df[['A','B']].replace([11,55],[22,44])
In [111]: df
Out[111]:
A B
0 11 22
1 33 44
2 55 66
3 77 88
In [112]: df['A'].replace([11,55],[22,44])
Out[112]:
0 22
1 33
2 44
3 77
Name: A, dtype: int64
6. Replacing different values in different columns: pass a dictionary keyed by column name, followed by the replacement value
In [115]: df
Out[115]:
A B
0 11 22
1 33 44
2 55 66
3 77 88
In [116]: df.replace({'A':77,'B':66},np.nan)
Out[116]:
A B
0 11.0 22.0
1 33.0 44.0
2 55.0 NaN
3 NaN 88.0
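The DataFrame variants above can be compared side by side. The sketch below rebuilds `df` from the printed output; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 33, 55, 77], 'B': [22, 44, 66, 88]})

whole = df.replace(11, 99)                             # search every column
col_a = df['A'].replace([11, 55], [22, 44])            # search one column
two_cols = df[['A', 'B']].replace([11, 55], [22, 44])  # search selected columns

# Column-keyed dictionary: 77 is replaced only in A, 66 only in B.
per_col = df.replace({'A': 77, 'B': 66}, np.nan)

print(whole)
print(col_a.tolist())
print(per_col)
```

Introducing NaN turns the integer columns into floats, which is why the last result prints 11.0 rather than 11.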
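7. Fill-style replacement: instead of giving a replacement value, pass the method parameter (for example 'pad') so that each matched value takes the value before it. Multiple columns are not supported.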
In [122]: df
Out[122]:
A B
0 11 22
1 33 44
2 55 66
3 77 88
In [123]: df['A'].replace([33,77],method='pad')
Out[123]:
0 11
1 11
2 55
3 55
Name: A, dtype: int64
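Recent pandas releases have deprecated the `method` argument of `replace`, so the call above may warn or fail depending on the installed version. One possible substitute, sketched below, is to blank out the unwanted values with `mask` and then forward-fill; this is a swapped-in technique rather than the call used in the session, and it also works on several columns at once.

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 33, 55, 77], 'B': [22, 44, 66, 88]})

a = df['A']

# Turn 33 and 77 into NaN, then carry the previous value forward.
padded = a.mask(a.isin([33, 77])).ffill().astype(a.dtype)

# The same idea applied to the whole DataFrame.
padded_df = df.mask(df.isin([33, 77])).ffill()

print(padded.tolist())
print(padded_df)
```

The `astype(a.dtype)` step restores the integer type, which is safe here because no NaN remains after the fill.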