Contents
- Chapter 7 Data Cleaning and Preparation
- 7.1 Handling Missing Data
- 1 Filtering Out Missing Data
- 2 Filling In Missing Data
Chapter 7 Data Cleaning and Preparation
It is often said that about 80% of the time in data analysis is spent on data preparation: loading, cleaning, transforming, and rearranging. pandas is well suited to these tasks.
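7.1 Handling Missing Data
The way pandas represents missing data is somewhat imperfect, but it works well enough for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value, and it can be easily detected: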
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
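pandas adopts a convention from the R language and refers to missing data as NA (not available). In statistical applications, NA data may be either data that does not exist or data that exists but was not observed. When cleaning data, it is important to analyze the missing data itself, to determine whether it points to a data collection problem or whether it could introduce bias.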
The built-in Python None value is also treated as NA:
string_data[0] = None
string_data.isnull()
0 True
1 False
2 True
3 False
dtype: bool
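As a small sketch of how this detection is typically used, the boolean result can be summed to count missing entries, or inverted to keep only the values that are present. `isna` and `notna` are pandas aliases for `isnull` and `notnull`; the variable names `mask`, `n_missing`, and `present` are only illustrative.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data[0] = None

# isna() is an alias of isnull(); both NaN and None are flagged as missing
mask = string_data.isna()

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(mask.sum())

# notna() is the inverse mask; use it to keep only the present values
present = string_data[string_data.notna()]

print(n_missing)
print(present)
```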
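1 Filtering Out Missing Data
There are several ways to filter out missing data. You can always do it by hand with pandas.isnull and boolean indexing, but dropna is usually more convenient. On a Series, it returns only the non-null data and their index values: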
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
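With a DataFrame, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those that contain any NA. By default, dropna drops every row that contains a missing value: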
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
data
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
cleaned = data.dropna()
cleaned
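The result of this call can be checked with a short sketch like the one below. It rebuilds the same `data` frame (using `np.nan` directly rather than the `NA` alias) and relies on the default being `how='any'`, so only rows with no missing values survive.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# The default is how='any': a row is dropped if any of its values is NA
cleaned = data.dropna()

# Only row 0 has no missing values, so it is the only row kept
print(cleaned)

# dropna returns a new object; the original frame still has 4 rows
print(data.shape)
```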
Passing how='all' drops only the rows that are all NA:
data.dropna(how='all')
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
To drop columns in the same way, pass axis=1:
data[4] = NA
data
|   | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
data.dropna(axis=1, how='all')
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
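A related way to filter DataFrame rows tends to come up with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations (non-NA values). You can specify this with the thresh argument: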
df = pd.DataFrame(np.random.randn(7, 3))
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.986575 | 0.487466 | -0.251823 |
| 1 | 2.008704 | -0.177133 | 1.827761 |
| 2 | 2.240856 | -0.587865 | 0.273062 |
| 3 | 0.777182 | -0.629568 | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
df.iloc[:4, 1] = NA
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.986575 | NaN | -0.251823 |
| 1 | 2.008704 | NaN | 1.827761 |
| 2 | 2.240856 | NaN | 0.273062 |
| 3 | 0.777182 | NaN | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
df.iloc[:2, 2] = NA
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.986575 | NaN | NaN |
| 1 | 2.008704 | NaN | NaN |
| 2 | 2.240856 | NaN | 0.273062 |
| 3 | 0.777182 | NaN | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
df.dropna()
|   | 0 | 1 | 2 |
|---|---|---|---|
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
df.dropna(thresh=2)
|   | 0 | 1 | 2 |
|---|---|---|---|
| 2 | 2.240856 | NaN | 0.273062 |
| 3 | 0.777182 | NaN | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
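One way to read thresh is as a minimum count of non-NA values per row. The sketch below is an illustration rather than how pandas implements it internally: it computes that count by hand (the names `non_na_counts`, `by_thresh`, and `by_mask` are made up for this example) and checks that filtering on it selects the same rows as `dropna(thresh=2)`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Number of non-NA values in each row
non_na_counts = df.notna().sum(axis=1)

# thresh=2 keeps the rows that have at least 2 non-NA values
by_thresh = df.dropna(thresh=2)

# The same selection written as a boolean mask
by_mask = df[non_na_counts >= 2]

print(non_na_counts.tolist())
print(by_thresh.index.tolist())
```

Rows 0 and 1 have only one observed value each, so they are the ones dropped.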
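2 Filling In Missing Data
Rather than dropping missing data, you may want to fill in the holes with some value. For most purposes, fillna is the method to use. Calling fillna with a constant replaces the missing values with that constant: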
df.fillna(0)
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.986575 | 0.000000 | 0.000000 |
| 1 | 2.008704 | 0.000000 | 0.000000 |
| 2 | 2.240856 | 0.000000 | 0.273062 |
| 3 | 0.777182 | 0.000000 | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
Calling fillna with a dict lets you use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.986575 | 0.500000 | 0.000000 |
| 1 | 2.008704 | 0.500000 | 0.000000 |
| 2 | 2.240856 | 0.500000 | 0.273062 |
| 3 | 0.777182 | 0.500000 | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
fillna returns a new object, but you can modify the existing object in place by passing inplace=True:
_ = df.fillna(0, inplace=True)
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -0.986575 | 0.000000 | 0.000000 |
| 1 | 2.008704 | 0.000000 | 0.000000 |
| 2 | 2.240856 | 0.000000 | 0.273062 |
| 3 | 0.777182 | 0.000000 | -0.220044 |
| 4 | 0.327522 | 0.781662 | -0.651949 |
| 5 | 1.454611 | -0.170581 | -1.740959 |
| 6 | -0.711897 | 0.074983 | 1.343807 |
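The difference between the two calling styles can be seen in a minimal sketch like the following. The small frame and its column names `a` and `b` are hypothetical, chosen only to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

# Without inplace, fillna returns a new object and df keeps its NaN values
filled = df.fillna(0)
missing_before = int(df.isna().sum().sum())

# With inplace=True, df itself is modified
df.fillna(0, inplace=True)
missing_after = int(df.isna().sum().sum())

print(missing_before, missing_after)  # prints: 2 0
```

Reassigning the result, as in `df = df.fillna(0)`, has the same effect on `df` and is often easier to follow in longer scripts.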
The same interpolation methods that are available for reindexing can also be used with fillna:
df = pd.DataFrame(np.random.randn(6, 3))
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.151508 | 1.185176 | -1.766933 |
| 1 | 0.544729 | -0.807814 | 0.696087 |
| 2 | -1.461950 | 0.448852 | 0.189045 |
| 3 | 0.559766 | 0.341335 | 1.469807 |
| 4 | -0.362789 | 1.117338 | -0.383870 |
| 5 | -0.452329 | -0.282040 | -0.541759 |
df.iloc[2:, 1] = NA
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.151508 | 1.185176 | -1.766933 |
| 1 | 0.544729 | -0.807814 | 0.696087 |
| 2 | -1.461950 | NaN | 0.189045 |
| 3 | 0.559766 | NaN | 1.469807 |
| 4 | -0.362789 | NaN | -0.383870 |
| 5 | -0.452329 | NaN | -0.541759 |
df.iloc[4:, 2] = NA
df
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.151508 | 1.185176 | -1.766933 |
| 1 | 0.544729 | -0.807814 | 0.696087 |
| 2 | -1.461950 | NaN | 0.189045 |
| 3 | 0.559766 | NaN | 1.469807 |
| 4 | -0.362789 | NaN | NaN |
| 5 | -0.452329 | NaN | NaN |
df.fillna(method='ffill')
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.151508 | 1.185176 | -1.766933 |
| 1 | 0.544729 | -0.807814 | 0.696087 |
| 2 | -1.461950 | -0.807814 | 0.189045 |
| 3 | 0.559766 | -0.807814 | 1.469807 |
| 4 | -0.362789 | -0.807814 | 1.469807 |
| 5 | -0.452329 | -0.807814 | 1.469807 |
df.fillna(method='ffill', limit=2)
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | -1.151508 | 1.185176 | -1.766933 |
| 1 | 0.544729 | -0.807814 | 0.696087 |
| 2 | -1.461950 | -0.807814 | 0.189045 |
| 3 | 0.559766 | -0.807814 | 1.469807 |
| 4 | -0.362789 | NaN | 1.469807 |
| 5 | -0.452329 | NaN | 1.469807 |
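Newer pandas releases emit a deprecation warning for the method argument of fillna, so depending on your installed version the calls above may warn or fail. A sketch of the same forward fill using the dedicated `ffill` method, which also accepts `limit`, might look like this (the names `forward` and `limited` are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = np.nan
df.iloc[4:, 2] = np.nan

# Propagate the last valid value forward down each column
forward = df.ffill()

# Fill at most 2 consecutive missing values per gap
limited = df.ffill(limit=2)

print(forward)
print(limited)
```

With `limit=2`, column 1 still has missing values in rows 4 and 5, because its gap is four rows long, while the two-row gap in column 2 is filled completely.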
With a little creativity, fillna can do many other things. For example, you can pass the mean or median value of a Series:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
dtype: float64
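The same idea extends to the median and to a whole DataFrame. In the sketch below, the frame `df` and its column names `x` and `y` are hypothetical. Passing `df.mean()` to fillna works because the resulting Series is indexed by column label, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

# Series: fill with the median of the observed values (1, 3.5, 7)
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
by_median = data.fillna(data.median())

# DataFrame: df.mean() is a Series keyed by column label,
# so each column is filled with its own mean
df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 10.0, 20.0]})
by_col_mean = df.fillna(df.mean())

print(by_median)
print(by_col_mean)
```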