pandas Tutorial: Handling Missing Data

Contents

  • Chapter 7 Data Cleaning and Preparation
  • 7.1 Handling Missing Data
  • 1 Filtering Out Missing Data
  • 2 Filling In Missing Data

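Chapter 7 Data Cleaning and Preparation

A large share of the time spent on data analysis (80% is a commonly quoted figure) goes into data preparation: loading, cleaning, transforming, and rearranging. pandas is well suited to these tasks.

7.1 Handling Missing Data

The way pandas represents missing data has some shortcomings, but it works well enough for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. This is called a sentinel value, and it can be detected easily: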

import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool

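pandas follows a convention from the R language and refers to missing data as NA (not available). In statistical applications, NA data is either data that does not exist, or data that exists but was not observed. When cleaning data, it is important to analyze the missing values themselves, to determine whether they point to a data collection problem and whether they could introduce bias.

The built-in Python None value is also treated as NA: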

string_data[0] = None
string_data.isnull()
0     True
1    False
2     True
3    False
dtype: bool
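
As a small illustration of why a detectable sentinel is useful, the boolean mask can be summed or averaged to measure how much data is missing. This is a minimal sketch; the Series `s` and the variable names are made up for the example, and `isna` is used as the equivalent spelling of `isnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series holding both kinds of missing marker: None and NaN.
s = pd.Series(['aardvark', None, np.nan, 'avocado'])

missing_mask = s.isna()                # same result as s.isnull()
n_missing = int(missing_mask.sum())    # each True counts as 1
share_missing = float(missing_mask.mean())  # fraction of entries that are missing

print(n_missing)       # 2
print(share_missing)   # 0.5
```

Counting missing values like this before dropping or filling anything is a cheap way to check whether the gaps are large enough to bias the analysis.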

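1 Filtering Out Missing Data

There are several ways to filter out missing data. You can do it by hand with pandas.isnull and boolean indexing, but dropna is usually more convenient. On a Series, it returns only the non-null data together with their index values: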

from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64

This is equivalent to:

data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64

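With a DataFrame, things are a bit more complex. You may want to drop the rows or the columns that contain NA. By default, dropna drops every row that contains a missing value: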

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned = data.dropna()
cleaned
     0    1    2
0  1.0  6.5  3.0

Passing how='all' drops only the rows in which every value is NA:

data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

Columns are dropped in the same way by passing axis=1:

data[4] = NA
data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
data.dropna(axis=1, how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
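
Besides how and axis, dropna also accepts a subset argument that restricts the NA check to particular labels. The sketch below is an addition to the original walkthrough; it rebuilds the same four-row frame and assumes you only care whether column 0 is present.

```python
from numpy import nan as NA
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

# Drop a row only when column 0 is missing; NA in other columns is tolerated.
kept = data.dropna(subset=[0])

print(kept.index.tolist())   # [0, 1]
```

Row 1 survives here even though two of its three values are NA, which the default dropna() call would not allow.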

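A related way of filtering DataFrame rows comes up mostly with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations (non-NA values). You can express this with the thresh argument: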

df = pd.DataFrame(np.random.randn(7, 3))
df
          0         1         2
0 -0.986575  0.487466 -0.251823
1  2.008704 -0.177133  1.827761
2  2.240856 -0.587865  0.273062
3  0.777182 -0.629568 -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807
df.iloc[:4, 1] = NA
df
          0         1         2
0 -0.986575       NaN -0.251823
1  2.008704       NaN  1.827761
2  2.240856       NaN  0.273062
3  0.777182       NaN -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807
df.iloc[:2, 2] = NA
df
          0         1         2
0 -0.986575       NaN       NaN
1  2.008704       NaN       NaN
2  2.240856       NaN  0.273062
3  0.777182       NaN -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807
df.dropna()
          0         1         2
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807
df.dropna(thresh=2)
          0         1         2
2  2.240856       NaN  0.273062
3  0.777182       NaN -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807
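
To make the meaning of thresh concrete: it is the minimum number of non-NA values a row must have to be kept. The sketch below checks this against an explicit per-row count. It uses a deterministic frame built with np.arange instead of random numbers (an assumption made so the result is reproducible), with the same NA pattern as above.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 7x3 frame, same NA pattern.
df = pd.DataFrame(np.arange(21, dtype=float).reshape(7, 3))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Non-NA values per row: rows 0-1 have 1, rows 2-3 have 2, rows 4-6 have 3.
non_na_per_row = df.count(axis=1)

by_thresh = df.dropna(thresh=2)      # keep rows with at least 2 non-NA values
by_mask = df[non_na_per_row >= 2]    # the same filter written by hand

print(by_thresh.index.tolist())      # [2, 3, 4, 5, 6]
print(by_thresh.equals(by_mask))     # True
```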

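2 Filling In Missing Data

Rather than dropping missing values, you may prefer to fill the holes with some value. For most purposes, fillna is the method to use. Calling fillna with a constant replaces every missing value with that constant: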

df.fillna(0)
          0         1         2
0 -0.986575  0.000000  0.000000
1  2.008704  0.000000  0.000000
2  2.240856  0.000000  0.273062
3  0.777182  0.000000 -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807

Calling fillna with a dict lets you use a different fill value for each column:

df.fillna({1: 0.5, 2: 0})
          0         1         2
0 -0.986575  0.500000  0.000000
1  2.008704  0.500000  0.000000
2  2.240856  0.500000  0.273062
3  0.777182  0.500000 -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807

fillna returns a new object, but you can pass inplace=True to modify the existing object directly:

_ = df.fillna(0, inplace=True)
df
          0         1         2
0 -0.986575  0.000000  0.000000
1  2.008704  0.000000  0.000000
2  2.240856  0.000000  0.273062
3  0.777182  0.000000 -0.220044
4  0.327522  0.781662 -0.651949
5  1.454611 -0.170581 -1.740959
6 -0.711897  0.074983  1.343807
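
The difference between the two calling styles can be checked directly. This is a minimal sketch on a small made-up frame (the column names `a` and `b` are hypothetical): the plain call leaves the original untouched, while inplace=True changes it.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with two missing cells.
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

filled = df.fillna(0)                       # returns a new, filled object
na_in_original = int(df.isna().sum().sum())     # original still has its NaNs
na_in_filled = int(filled.isna().sum().sum())

df.fillna(0, inplace=True)                  # now the original itself is modified
na_after_inplace = int(df.isna().sum().sum())

print(na_in_original, na_in_filled, na_after_inplace)   # 2 0 0
```

Assigning the result back (`df = df.fillna(0)`) has the same visible effect as inplace=True and is often easier to follow in a chain of operations.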

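The same interpolation methods that are available for reindexing, such as forward fill (ffill), can also be used to fill missing values: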

df = pd.DataFrame(np.random.randn(6, 3))
df
          0         1         2
0 -1.151508  1.185176 -1.766933
1  0.544729 -0.807814  0.696087
2 -1.461950  0.448852  0.189045
3  0.559766  0.341335  1.469807
4 -0.362789  1.117338 -0.383870
5 -0.452329 -0.282040 -0.541759
df.iloc[2:, 1] = NA
df
          0         1         2
0 -1.151508  1.185176 -1.766933
1  0.544729 -0.807814  0.696087
2 -1.461950       NaN  0.189045
3  0.559766       NaN  1.469807
4 -0.362789       NaN -0.383870
5 -0.452329       NaN -0.541759
df.iloc[4:, 2] = NA
df
          0         1         2
0 -1.151508  1.185176 -1.766933
1  0.544729 -0.807814  0.696087
2 -1.461950       NaN  0.189045
3  0.559766       NaN  1.469807
4 -0.362789       NaN       NaN
5 -0.452329       NaN       NaN
# fillna(method='ffill') is deprecated in recent pandas (2.1 and later); ffill() is the equivalent call
df.ffill()
          0         1         2
0 -1.151508  1.185176 -1.766933
1  0.544729 -0.807814  0.696087
2 -1.461950 -0.807814  0.189045
3  0.559766 -0.807814  1.469807
4 -0.362789 -0.807814  1.469807
5 -0.452329 -0.807814  1.469807
# limit=2 fills at most two consecutive missing values per gap (older spelling: fillna(method='ffill', limit=2))
df.ffill(limit=2)
          0         1         2
0 -1.151508  1.185176 -1.766933
1  0.544729 -0.807814  0.696087
2 -1.461950 -0.807814  0.189045
3  0.559766 -0.807814  1.469807
4 -0.362789       NaN  1.469807
5 -0.452329       NaN  1.469807
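
Forward fill has a mirror image, backward fill, and both accept limit. The sketch below compares them on a small made-up Series with a three-element gap; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of three missing values in the middle.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()          # carry the last valid value forward
backward = s.bfill()         # pull the next valid value backward
limited = s.ffill(limit=1)   # fill at most one value per gap

print(forward.tolist())      # [1.0, 1.0, 1.0, 1.0, 5.0]
print(backward.tolist())     # [1.0, 5.0, 5.0, 5.0, 5.0]
print(limited.isna().tolist())   # [False, False, True, True, False]
```

A limit is worth considering for time series, where carrying a stale value across a long gap can be more misleading than leaving the gap visible.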

With fillna you can do some fairly creative things. For example, you can pass the mean or the median of a Series:

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
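
The same idea extends to a DataFrame: passing a Series of per-column statistics to fillna fills each column with its own value, because the Series is aligned on column labels. A minimal sketch, with made-up column names `x` and `y`:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, one missing value per column.
df = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                   'y': [np.nan, 10.0, 20.0]})

col_means = df.mean()            # NaN is skipped: x -> 2.0, y -> 15.0
filled = df.fillna(col_means)    # each column is filled with its own mean

print(filled['x'].tolist())      # [1.0, 2.0, 3.0]
print(filled['y'].tolist())      # [15.0, 10.0, 20.0]
```

Replacing `df.mean()` with `df.median()` gives a fill value that is less sensitive to outliers.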
