Data Cleaning and Preparation
Revised: code and comments added
xiaoyao
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)
import warnings
warnings.filterwarnings('ignore')
Handling Missing Data
All of the descriptive statistics on pandas objects exclude missing data by default. pandas uses the floating-point value NaN (Not a Number) to represent missing data.
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
In pandas, we adopt a convention used in the R language and refer to missing data as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example).
- When cleaning up data for analysis, it is often useful to analyze the missing data itself, to identify data collection problems or potential biases caused by the missing data.
- The built-in Python None value is also treated as NA in object arrays:
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
Some functions for handling missing data:

Method | Description
------ | -----------
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'
isnull | Return boolean values indicating which values are missing/NA; the result has the same type as the source
notnull | Negation of isnull
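As a quick combined illustration of these four methods (a minimal sketch with made-up values, not from the original notebook):

s = pd.Series([1.0, np.nan, 3.0])
s.isnull()   # False, True, False -- flags the missing value
s.notnull()  # True, False, True -- the negation
s.dropna()   # keeps only 1.0 and 3.0 (with their index labels)
s.fillna(0)  # replaces the NaN with 0.0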
Filtering Out Missing Data
There are a number of ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and boolean indexing, dropna can be more practical. On a Series, it returns the Series with only the non-null data and index values:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, things are a bit different: by default, dropna drops any row containing a missing value.
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0

Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
To drop columns in the same way, pass axis=1:
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
A related way to filter out DataFrame rows concerns time series data. Suppose you want to keep only rows containing at least a certain number of observations; you can indicate this with the thresh argument:
df = pd.DataFrame(np.random.randn(7, 3))
df
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.000940  1.343810
3 -0.713544 -0.831154 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df[0]
0 0.476985
1 -0.577087
2 0.523772
3 -0.713544
4 -1.860761
5 -1.265934
6 0.332883
Name: 0, dtype: float64
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
          0         1         2
0  0.476985       NaN       NaN
1 -0.577087       NaN       NaN
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.dropna()
          0         1         2
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.dropna(thresh=2)
          0         1         2
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
Filling In Missing Data
Rather than filtering out missing data, you may want to fill in the "holes" in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:
df
          0         1         2
0  0.476985       NaN       NaN
1 -0.577087       NaN       NaN
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df.fillna(0)
          0         1         2
0  0.476985  0.000000  0.000000
1 -0.577087  0.000000  0.000000
2  0.523772  0.000000  1.343810
3 -0.713544  0.000000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
Calling fillna with a dict lets you use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})
          0         1         2
0  0.476985  0.500000  0.000000
1 -0.577087  0.500000  0.000000
2  0.523772  0.500000  1.343810
3 -0.713544  0.500000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
fillna returns a new object by default, but you can modify the existing object in-place:
_ = df.fillna(0, inplace=True)
df
          0         1         2
0  0.476985  0.000000  0.000000
1 -0.577087  0.000000  0.000000
2  0.523772  0.000000  1.343810
3 -0.713544  0.000000 -2.370232
4 -1.860761 -0.860757  0.560145
5 -1.265934  0.119827 -1.063512
6  0.332883 -2.359419 -0.199543
df = pd.DataFrame(np.random.randn(6, 3))
df
          0         1         2
0  0.862580 -0.010032  0.050009
1  0.670216  0.852965 -0.955869
2 -0.023493 -2.304234 -0.652469
3 -1.218302 -1.332610  1.074623
4  0.723642  0.690002  1.001543
5 -0.503087 -0.622274 -0.921169
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
          0         1         2
0  0.862580 -0.010032  0.050009
1  0.670216  0.852965 -0.955869
2 -0.023493       NaN -0.652469
3 -1.218302       NaN  1.074623
4  0.723642       NaN       NaN
5 -0.503087       NaN       NaN
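The interpolation methods available for reindexing can also be used with fillna; a minimal sketch using the df above (method='ffill' was the common spelling when this notebook was written; newer pandas also offers df.ffill()):

df.fillna(method='ffill')  # propagate the last valid value in each column downward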
With fillna you can do lots of other things, such as simple data imputation using the mean or median statistics:
data = pd.Series([1., NA, 3.5, NA, 7])
data.mean()
3.8333333333333335
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
fillna function arguments:

Argument | Description
-------- | -----------
value | Scalar value or dict-like object to use to fill missing values
method | Interpolation method; 'ffill' is used by default if nothing else is specified
axis | Axis to fill on; default axis=0
inplace | Modify the calling object without producing a copy
limit | For forward and backward filling, the maximum number of consecutive periods to fill
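To make the method and limit arguments concrete, here is a small sketch with illustrative values (not from the original notebook):

s = pd.Series([1., NA, NA, NA, 5.])
s.fillna(method='bfill')           # backward fill: all three NaNs become 5.0
s.fillna(method='ffill', limit=2)  # forward fill at most 2 consecutive NaNs; index 3 stays NaN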
Data Transformation
Everything up to this point has been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicates
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
data.duplicated()
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
data['v1'] = range(7)
data
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6
Suppose we had an additional column of values and wanted to filter duplicates based only on the 'k1' column:
data.drop_duplicates(['k1'])
    k1  k2  v1
0  one   1   0
1  two   1   1
duplicated and drop_duplicates by default keep the first observed value combination; passing keep='last' returns the last one:
data.drop_duplicates(['k1', 'k2'], keep='last')
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6
Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in an array, Series, or DataFrame column. Consider the following hypothetical data collected about various kinds of meat:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
Suppose you wanted to add a column indicating the type of animal that each food came from. Let's write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
"""
Some of the meats are capitalized and others are not,
so we first convert each value to lowercase using the str.lower Series method:
"""
lowercased = data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
We could also have passed a function that does all the work; here, an anonymous (lambda) function:
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Replacing Values
Filling in missing data with the fillna method can be seen as a special case of more general value replacement. As you have already seen, map can be used to modify a subset of values in an object, but replace provides a simpler and more flexible way to do so.
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data
0 1.0
1 -999.0
2 2.0
3 -999.0
4 -1000.0
5 3.0
dtype: float64
The -999 values might be sentinel values for missing data. To replace these with NA values that pandas understands, use replace, producing a new Series:
data.replace(-999, np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
data.replace([-999, -1000], np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
data.replace({-999: np.nan, -1000: 0})
0 1.0
1 NaN
2 2.0
3 NaN
4 0.0
5 3.0
dtype: float64
The data.replace method is distinct from data.str.replace, which performs string substitution element-wise.
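A small sketch of the contrast (illustrative values):

s = pd.Series(['foo', 'boo'])
s.replace('foo', 'bar')    # whole-value replacement -> ['bar', 'boo']
s.str.replace('oo', 'ee')  # substring replacement inside each element -> ['fee', 'bee']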
Renaming Axis Indexes
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
Like values in a Series, axis labels can be similarly transformed by a function or some mapping to produce new, differently labeled objects. The axes can also be modified in-place without creating a new data structure.
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11
transform = lambda x: x[:4].upper()
data.index.map(transform)
Index(['OHIO', 'COLO', 'NEW '], dtype='object')
data.index = data.index.map(transform)
data
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11
If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:
data.rename(index=str.title, columns=str.upper)
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11
Notably, rename can be used in conjunction with a dict-like object, providing new values for a subset of the axis labels:
data.rename(index={'OHIO': 'INDIANA'},
            columns={'three': 'peekaboo'})
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11
rename saves you from the chore of copying the DataFrame manually; to modify a dataset in-place, pass inplace=True:
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
data
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11
Discretization and Binning
Continuous data is often discretized or otherwise split into "bins" for analysis. For example, suppose you have data about a group of people in a study and want to group them into discrete age buckets:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
pandas returns a special Categorical object. The output describes the bins computed by pandas.cut; you can treat it like an array of strings indicating the bin names. Internally, it contains a categories array specifying the distinct category names, along with labels for the ages data in the codes attribute:
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
closed='right',
dtype='interval[int64]')
pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Consistent with mathematical interval notation, a parenthesis means the side is open (exclusive), while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
You can also pass your own bin names via the labels option:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]
If you pass an integer number of bins to cut instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)
[(0.49, 0.72], (0.02, 0.26], (0.02, 0.26], (0.49, 0.72], (0.49, 0.72], ..., (0.49, 0.72], (0.49, 0.72], (0.26, 0.49], (0.72, 0.96], (0.49, 0.72]]
Length: 20
Categories (4, interval[float64]): [(0.02, 0.26] < (0.26, 0.49] < (0.49, 0.72] < (0.72, 0.96]]
The precision=2 option limits the displayed bin-edge precision to two decimal digits.
qcut is a closely related function that bins the data based on sample quantiles, so each bin contains roughly the same number of data points:
data = np.random.randn(1000)
cats = pd.qcut(data, 4)
cats
[(-0.0453, 0.604], (-2.9499999999999997, -0.686], (-0.0453, 0.604], (-0.0453, 0.604], (-2.9499999999999997, -0.686], ..., (-0.686, -0.0453], (0.604, 3.928], (0.604, 3.928], (-0.0453, 0.604], (-0.686, -0.0453]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -0.686] < (-0.686, -0.0453] < (-0.0453, 0.604] < (0.604, 3.928]]
pd.value_counts(cats)
(0.604, 3.928] 250
(-0.0453, 0.604] 250
(-0.686, -0.0453] 250
(-2.9499999999999997, -0.686] 250
dtype: int64
Similar to cut, you can pass your own quantiles (numbers between 0 and 1, inclusive):
pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
[(-0.0453, 1.289], (-1.191, -0.0453], (-0.0453, 1.289], (-0.0453, 1.289], (-2.9499999999999997, -1.191], ..., (-1.191, -0.0453], (1.289, 3.928], (1.289, 3.928], (-0.0453, 1.289], (-1.191, -0.0453]]
Length: 1000
Categories (4, interval[float64]): [(-2.9499999999999997, -1.191] < (-1.191, -0.0453] < (-0.0453, 1.289] < (1.289, 3.928]]
Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.043288     0.046433     0.026352    -0.010204
std       0.998391     0.999185     1.010005     0.992779
min      -3.428254    -3.645860    -3.184377    -3.745356
25%      -0.740152    -0.599807    -0.612162    -0.699863
50%      -0.085000     0.043663    -0.008168    -0.031732
75%       0.625698     0.746527     0.690847     0.692355
max       3.366626     2.653656     3.525865     2.735527
Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:
col = data[2]
col[np.abs(col) > 3]
50 3.260383
225 -3.056990
312 -3.184377
772 3.525865
Name: 2, dtype: float64
To select all rows having a value exceeding 3 or -3, use the any method on a boolean DataFrame:
data[(np.abs(data) > 3).any(1)]
            0         1         2         3
31  -2.315555  0.457246 -0.025907 -3.399312
50   0.050188  1.951312  3.260383  0.963301
126  0.146326  0.508391 -0.196713 -3.745356
225 -0.293333 -0.242459 -3.056990  1.918403
249 -3.428254 -0.296336 -0.439938 -0.867165
312  0.275144  1.179227 -3.184377  1.369891
534 -0.362528 -3.548824  1.553205 -2.186301
626  3.366626 -2.372214  0.851010  1.332846
772 -0.658090 -0.207434  3.525865  0.283070
793  0.599947 -3.645860  0.255475 -0.549574
Values can be set based on these criteria. Here is code to cap values outside the interval -3 to 3:
data[np.abs(data) > 3] = np.sign(data) * 3
data.describe()
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.043227     0.047628     0.025807    -0.009059
std       0.995841     0.995170     1.006769     0.988960
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.740152    -0.599807    -0.612162    -0.699863
50%      -0.085000     0.043663    -0.008168    -0.031732
75%       0.625698     0.746527     0.690847     0.692355
max       3.000000     2.653656     3.000000     2.735527
The statement np.sign(data) produces 1 and -1 values based on whether the values in data are positive or negative:
np.sign(data).head()
     0    1    2    3
0 -1.0 -1.0 -1.0 -1.0
1 -1.0  1.0 -1.0 -1.0
2  1.0 -1.0 -1.0  1.0
3  1.0  1.0  1.0 -1.0
4  1.0  1.0  1.0  1.0
Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows in a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
sampler = np.random.permutation(5)
sampler
array([2, 0, 3, 4, 1])
df
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19
That array can then be used in iloc-based indexing or the equivalent take function:
df.take(sampler)
    0   1   2   3
2   8   9  10  11
0   0   1   2   3
3  12  13  14  15
4  16  17  18  19
1   4   5   6   7
To select a random subset without replacement, use the sample method on Series and DataFrame:
df.sample(n=3)
    0   1   2   3
2   8   9  10  11
1   4   5   6   7
0   0   1   2   3
To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n=10, replace=True)
draws
4 4
4 4
1 7
3 6
4 4
3 6
4 4
4 4
3 6
2 -1
dtype: int64
Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix.
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
pd.get_dummies(df['key'])
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
df
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5
If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing all 1s and 0s.
"""
In some cases, you may want to add a prefix to the columns of the indicator
DataFrame so that it can be merged with other data;
get_dummies has a prefix argument for doing this:
"""
dummies = pd.get_dummies(df['key'], prefix='key')
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0
If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Consider the MovieLens 1M movies dataset:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('datasets/movielens/movies.dat', sep='::',
header=None, names=mnames)
movies[:10]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
Adding indicator variables for each genre requires a little wrangling. First, extract the list of unique genres in the dataset:
all_genres = []
for x in movies.genres:
all_genres.extend(x.split('|'))
genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
One way to construct the indicator DataFrame is to start with a DataFrame of all zeros:
zero_matrix = np.zeros((len(movies), len(genres)))
dummies = pd.DataFrame(zero_matrix, columns=genres)
gen = movies.genres[0]
gen.split('|')
dummies.columns.get_indexer(gen.split('|'))
array([0, 1, 2], dtype=int64)
Then, iterate through each movie, setting entries in each row of dummies to 1:
for i, gen in enumerate(movies.genres):
indices = dummies.columns.get_indexer(gen.split('|'))
dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[0]
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
Genre_Animation 1
Genre_Children's 1
...
Genre_War 0
Genre_Musical 0
Genre_Mystery 0
Genre_Film-Noir 0
Genre_Western 0
Name: 0, Length: 21, dtype: object
For much larger data, this method of constructing indicator variables with multiple membership is not especially fast. It is better to write a lower-level function that writes directly into a NumPy array and then wrap the result in a DataFrame.
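As a sketch of one such approach (an illustration of the idea, not code from the original): build a genre-to-column dict once and write the 1s straight into a preallocated NumPy array, avoiding the per-row get_indexer calls:

genre_to_col = dict(zip(genres, range(len(genres))))  # lookup table, built once
indicator = np.zeros((len(movies), len(genres)), dtype=np.uint8)
for i, gen in enumerate(movies.genres):
    for g in gen.split('|'):
        indicator[i, genre_to_col[g]] = 1             # direct NumPy assignment
dummies_fast = pd.DataFrame(indicator, columns=genres, index=movies.index)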
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
np.random.seed(12345)
values = np.random.rand(10)
values
array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677, 0.5955, 0.9645, 0.6532,
0.7489, 0.6536])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
   (0.0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1.0]
0           0           0           0           0           1
1           0           1           0           0           0
2           1           0           0           0           0
3           0           1           0           0           0
4           0           0           1           0           0
5           0           0           1           0           0
6           0           0           0           0           1
7           0           0           0           1           0
8           0           0           0           1           0
9           0           0           0           1           0
String Manipulation
Python has long been a popular language for string and text processing. For more complex pattern matching and text manipulation, regular expressions are needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.
String Object Methods
val = 'a,b, guido'
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first + '::' + second + '::' + third
'a::b::guido'
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
The difference between find and index is that index raises an exception if the string is not found (rather than returning -1):
val.index(':')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 val.index(':')

ValueError: substring not found
val.count(',')
2
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
Python built-in string methods:

Method | Description
------ | -----------
count | Return the number of non-overlapping occurrences of the substring in the string
endswith | Returns True if the string ends with the suffix
startswith | Returns True if the string starts with the prefix
find, rfind | Return the position of the first character of the first occurrence of the substring, or -1 if not found; rfind returns the position of the last occurrence
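Continuing with val from above, a brief sketch of the remaining methods in the table:

val.startswith('a')    # True
val.endswith('guido')  # True
val.rfind(',')         # 3, the position of the last comma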
Regular Expressions
The functions in the re module fall into three categories: pattern matching, substitution, and splitting.
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)
['foo', 'bar', 'baz', 'qux']
When you call re.split(r'\s+', text), the regular expression is first compiled, and then its split method is called on the passed text:
regex = re.compile(r'\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
To avoid unwanted escaping with \ in a regular expression, use raw string literals like
r'C:\x' instead of the equivalent 'C:\\x'.
If you intend to apply the same regular expression to many strings, creating a regex object with re.compile is highly recommended; doing so will save CPU cycles.
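For instance (a small sketch), the escaped and raw spellings produce the same result:

re.split('\\s+', text) == re.split(r'\s+', text)  # True: both split on runs of whitespace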
findall returns all of the matches in a string, while search returns only the first match. match is more strict, matching only at the beginning of the string.
text = """Dave [email protected]
Steve [email protected]
Rob [email protected]
Ryan [email protected]
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)
['[email protected]', '[email protected]', '[email protected]', '[email protected]']
m = regex.search(text)
m
<re.Match object; span=(5, 20), match='[email protected]'>
text[m.start():m.end()]
'[email protected]'
print(regex.match(text))
None
print(regex.sub('REDACTED', text))
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('[email protected]')
m.groups()
('wesm', 'bright', 'net')
regex.findall(text)
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
Vectorized String Functions in pandas
Cleaning up a messy dataset for analysis often requires a lot of string munging; to complicate matters, a column containing strings will sometimes have missing data:
data = {'Dave': '[email protected]', 'Steve': '[email protected]',
'Rob': '[email protected]', 'Wes': np.nan}
data = pd.Series(data)
data
Dave [email protected]
Steve [email protected]
Rob [email protected]
Wes NaN
dtype: object
data.isnull()
Dave False
Steve False
Rob False
Wes True
dtype: bool
String and regular expression methods can be applied to each value through the Series str attribute, which skips over NA values:
data.str.contains('gmail')
Dave False
Steve True
Rob True
Wes NaN
dtype: object
pattern
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)]
Steve [(steve, gmail, com)]
Rob [(rob, gmail, com)]
Wes NaN
dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave True
Steve True
Rob True
Wes NaN
dtype: object
matches.str[0]
data.str[:5]
pd.options.display.max_rows = PREVIOUS_MAX_ROWS
Conclusion