Data Processing

1 Handling Missing Values

Function	Description
dropna	Filter axis labels based on whether values for each label are missing, with a threshold for how much missing data to tolerate
fillna	Fill missing data with a value or with an interpolation method such as 'ffill' or 'bfill'
isnull	Return boolean values indicating which values are missing
notnull	The negation of isnull
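
As a quick illustration of isnull and notnull (a minimal sketch, not one of the examples that follow):

import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3.5])
s.isnull()     # 0: False, 1: True, 2: False -- True marks a missing value
s.notnull()    # the element-wise negation: 0: True, 1: False, 2: True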

1.1 Filtering Out Missing Values

(1) Calling dropna on a Series returns the Series with only the non-null data and their index values.

import pandas as pd
from numpy import nan as NA

data = pd.Series([1,NA,3.5,NA,7])
data.dropna()

This is equivalent to:

data[data.notnull()]

Output:

0    1.0
2    3.5
4    7.0
dtype: float64

(2) With a DataFrame you may want to drop rows or columns that are all NA, or only those containing any NA. By default, dropna drops any row that contains a missing value.

data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3]])
cleaned = data.dropna()
cleaned

Output:

	0	1	2
0	1.0	6.5	3.0

Passing axis=1 drops columns instead:

data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[1,NA,NA],[2,6.5,3]])
cleaned = data.dropna(axis=1)
cleaned

Output:

	0
0	1.0
1	1.0
2	1.0
3	2.0

(3) Passing how='all' drops only the rows in which every value is NA:

data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data.dropna(how='all')

Output:

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
3	NaN	6.5	3.0

Similarly, to drop the columns in which every value is NA, also pass axis=1:

data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[1,NA,NA],[2,6.5,3]])
data[4] = NA
data

Output:

0	1	2	4
0	1.0	6.5	3.0	NaN
1	1.0	NaN	NaN	NaN
2	1.0	NaN	NaN	NaN
3	2.0	6.5	3.0	NaN

data.dropna(axis=1,how = 'all')

Output:

	0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	1.0	NaN	NaN
3	2.0	6.5	3.0

(4) To keep only rows containing at least a certain number of observations, use the thresh parameter.
Format: df.dropna(thresh=n) keeps a row only if it has at least n non-NA values.

import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[:4,1]=NA
data.iloc[:2,2]=NA
data

Output:

	0	1	2
0	-0.214321	NaN	NaN
1	0.626029	NaN	NaN
2	-0.794404	NaN	0.494591
3	0.687121	NaN	-0.842619
4	-1.035153	1.106299	-2.506364
5	-0.557088	-0.311989	0.184533
6	1.435716	0.677662	1.430981

data.dropna(thresh=2)

Output:


0	1	2
2	-0.794404	NaN	0.494591
3	0.687121	NaN	-0.842619
4	-1.035153	1.106299	-2.506364
5	-0.557088	-0.311989	0.184533
6	1.435716	0.677662	1.430981

1.2 Filling In Missing Values

(1) Calling fillna with a constant replaces missing values with that value.

import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[:4,1]=NA
data.iloc[:2,2]=NA
data.fillna(0)

Output:

	0	1	2
0	0.561395	0.000000	0.000000
1	-0.375632	0.000000	0.000000
2	1.797813	0.000000	0.147763
3	2.225626	0.000000	-0.607822
4	0.091317	-0.253953	1.292929
5	1.039400	-0.462940	-0.301816
6	-1.298559	0.299697	1.154967

(2) Calling fillna with a dict lets you specify a different fill value for each column.

import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[:4,1]=NA
data.iloc[:2,2]=NA
data.fillna({1:0.5,2:0})

Output:


0	1	2
0	0.409984	0.500000	0.000000
1	-2.511078	0.500000	0.000000
2	1.312398	0.500000	0.870725
3	0.862810	0.500000	1.427994
4	-1.141758	-0.356768	-0.325600
5	0.904226	0.998484	-0.637515
6	1.227163	-0.409949	-0.100507

(3) fillna also supports the interpolation methods used for reindexing via the method argument, such as 'ffill' (forward fill) and 'bfill' (backward fill; see the sketch after the ffill example).

import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[2:,1]=NA
data.iloc[4:,2]=NA
data

Output:

	0	1	2
0	2.267295	0.807200	0.634113
1	0.654534	-0.434101	-0.579048
2	0.316548	NaN	0.903257
3	1.546708	NaN	0.013961
4	-0.544912	NaN	NaN
5	-1.338441	NaN	NaN
6	0.800708	NaN	NaN

data.fillna(method = 'ffill')

Output:

0	1	2
0	2.267295	0.807200	0.634113
1	0.654534	-0.434101	-0.579048
2	0.316548	-0.434101	0.903257
3	1.546708	-0.434101	0.013961
4	-0.544912	-0.434101	0.013961
5	-1.338441	-0.434101	0.013961
6	0.800708	-0.434101	0.013961
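
For comparison, 'bfill' propagates values backward instead; a minimal sketch on the same DataFrame (output not shown here):

data.fillna(method = 'bfill')   # each NaN is filled with the next valid value below it; trailing NaNs stay missing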

(4) limit: the maximum number of consecutive values to fill when filling forward or backward.

data.fillna(method = 'ffill',limit=2)

Output:

0	1	2
0	2.267295	0.807200	0.634113
1	0.654534	-0.434101	-0.579048
2	0.316548	-0.434101	0.903257
3	1.546708	-0.434101	0.013961
4	-0.544912	NaN	0.013961
5	-1.338441	NaN	0.013961
6	0.800708	NaN	NaN

(5) You can also get creative with the fill value, for example using the mean or median of a Series to fill its missing values:

data = pd.Series([1,NA,3.5,NA])
data.fillna(data.mean())

Output:

0    1.00
1    2.25
2    3.50
3    2.25
dtype: float64
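
Filling with the median works the same way; a minimal sketch using the same Series:

data.fillna(data.median())   # the median of the non-missing values 1.0 and 3.5 is 2.25, so the result is identical here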

2 Data Transformation

2.1 Removing Duplicates

Duplicate rows can appear in a DataFrame:

import pandas as pd
data = pd.DataFrame({'k1':['one','two']*3+['two'],
                    'k2':[1,1,2,3,3,4,4]})
data

Output:

	k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4
6	two	4

(1) The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate of a row seen earlier.

data.duplicated()

Output:

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

(2) drop_duplicates returns a DataFrame with duplicate rows removed, i.e. the rows for which duplicated returned False.

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters

  • subset: column label or sequence of labels, optional; the columns to consider when identifying duplicates (default: all columns)
  • keep: {'first', 'last', False}, default 'first'
    • first: drop duplicates except for the first occurrence.
    • last: drop duplicates except for the last occurrence.
    • False: drop all duplicates.
  • inplace: boolean, default False; whether to drop duplicates in place or return a copy
  • Returns: DataFrame with the duplicates removed

Each of the calls below is applied to this original DataFrame:

data = pd.DataFrame({'A':['a','b','c','c'],'B':[1,1,2,2]})

	A	B
0	a	1
1	b	1
2	c	2
3	c	2

data.drop_duplicates(subset=None,keep='first',inplace=True)

	A	B
0	a	1
1	b	1
2	c	2

data.drop_duplicates(subset=['B'],keep='first',inplace=True)

	A	B
0	a	1
2	c	2

data.drop_duplicates(subset=['B'],keep='last',inplace=True)


A	B
1	b	1
3	c	2

subset=None means all columns are considered: rows whose values match across every column are deduplicated. None is the default. subset=['B'] means only column 'B' is considered: rows with the same value in B are deduplicated.

keep='first', the default, keeps the first occurrence of each group of duplicates. The other two values, 'last' and False, keep the last occurrence and drop all duplicates, respectively.

inplace=True removes the duplicates directly from the original DataFrame, while the default False returns a modified copy.
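
keep=False drops every member of a duplicated group; a minimal sketch starting again from the original DataFrame:

data = pd.DataFrame({'A':['a','b','c','c'],'B':[1,1,2,2]})
data.drop_duplicates(keep=False)   # rows 2 and 3 are identical across all columns, so both are removed, leaving rows 0 and 1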

2.2 Transforming Data Using a Function or Mapping

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data


food	ounces
0	bacon	4.0
1	pulled pork	3.0
2	bacon	12.0
3	Pastrami	6.0
4	corned beef	7.5
5	Bacon	8.0
6	pastrami	3.0
7	honey ham	5.0
8	nova lox	6.0

Suppose you want to add a column indicating the animal each kind of meat comes from. First, write down a mapping from meat to animal:

meat_to_animal = {            # mapping used to add a column giving the source animal of each meat
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

Some of the meat names are capitalized and others are not, so first use the str.lower() method to convert every value to lowercase:

lowercased=data['food'].str.lower()       
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

The Series map method accepts a function or a dict-like object containing the mapping:

data['animal']=lowercased.map(meat_to_animal)    
data

	food	ounces	animal
0	bacon	4.0	pig
1	pulled pork	3.0	pig
2	bacon	12.0	pig
3	Pastrami	6.0	cow
4	corned beef	7.5	cow
5	Bacon	8.0	pig
6	pastrami	3.0	cow
7	honey ham	5.0	pig
8	nova lox	6.0	salmon

Alternatively, pass a lambda that does the lookup and the lowercasing in one step:

data['animal'] = data['food'].map(lambda x:meat_to_animal[x.lower()])
data

food	ounces	animal
0	bacon	4.0	pig
1	pulled pork	3.0	pig
2	bacon	12.0	pig
3	Pastrami	6.0	cow
4	corned beef	7.5	cow
5	Bacon	8.0	pig
6	pastrami	3.0	cow
7	honey ham	5.0	pig
8	nova lox	6.0	salmon

2.3 Replacing Values

data = pd.Series([1,-999,2,-999,-1000,3])
data

0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

To replace every -999 with NA:

data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

To replace several values at once, pass a list of the values to replace, followed by the substitute:

data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different substitute for each value, pass a list of substitutes as well:

data.replace([-999,-1000],[np.nan,555])

0      1.0
1      NaN
2      2.0
3      NaN
4    555.0
5      3.0
dtype: float64

The arguments can also be passed as a dict:

data.replace({-999:np.nan,-1000:555})

0      1.0
1      NaN
2      2.0
3      NaN
4    555.0
5      3.0
dtype: float64

2.4 Renaming Axis Indexes

import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['beijing','shanghai','hangzhou'],
                   columns = ['one','two','three','four'])
data

	one	two	three	four
beijing	0	1	2	3
shanghai	4	5	6	7
hangzhou	8	9	10	11

transform = lambda x:x.upper()
data.index.map(transform)

Index(['BEIJING', 'SHANGHAI', 'HANGZHOU'], dtype='object')

data.index = data.index.map(transform)
data

one	two	three	four
BEIJING	0	1	2	3
SHANGHAI	4	5	6	7
HANGZHOU	8	9	10	11

The rename method creates a transformed version of the dataset without modifying the original:

data.rename(index = str.title ,columns =str.upper)

ONE	TWO	THREE	FOUR
Beijing	0	1	2	3
Shanghai	4	5	6	7
Hangzhou	8	9	10	11

rename can be used together with a dict-like object to provide new values for a subset of the axis labels:

data.rename(index = {'BEIJING':'WUHAN'},
           columns = {'three':3})

	one	two	3	four
WUHAN	0	1	2	3
SHANGHAI	4	5	6	7
HANGZHOU	8	9	10	11

data    # the original dataset is unchanged

one	two	three	four
BEIJING	0	1	2	3
SHANGHAI	4	5	6	7
HANGZHOU	8	9	10	11

To modify the original dataset in place, pass inplace=True:

data.rename(index = {'BEIJING':'wuhan'},
           columns = {'three':3},inplace = True)
data

	one	two	3	four
wuhan	0	1	2	3
SHANGHAI	4	5	6	7
HANGZHOU	8	9	10	11

2.5 Discretization and Binning

age = [20,22,25,27,21,23,37,31,61,45,41,32]

Divide the ages into groups: 18-25, 26-35, 36-60, and over 60:

age = [20,22,25,27,21,23,37,31,61,45,41,32]
bins=[18,25,35,60,100]
cats = pd.cut(age,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

The parenthesis side of each interval is open (exclusive) and the square-bracket side is closed (inclusive); pass right=False to switch which side is closed.
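
A minimal sketch of right=False using the same age and bins defined above:

pd.cut(age,bins,right=False)   # intervals become [18, 25), [25, 35), [35, 60), [60, 100): closed on the left, open on the right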
