Function | Description |
---|---|
dropna | Filter axis labels based on whether the values for each label have missing data, with varying thresholds for how much missing data to tolerate |
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill' |
isnull | Return boolean values indicating which values are missing (NA) |
notnull | Negation of isnull |
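isnull and notnull are not demonstrated separately below, so here is a minimal sketch of what they return (the Series is made up for illustration):
import pandas as pd
from numpy import nan as NA
s = pd.Series([1, NA, 3.5])
s.isnull()    # 0 False, 1 True, 2 False  -- True marks the missing value
s.notnull()   # 0 True, 1 False, 2 True   -- the boolean negation of isnull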
(1) Calling dropna on a Series returns only the non-null data and its index values.
import pandas as pd
from numpy import nan as NA
data = pd.Series([1,NA,3.5,NA,7])
data.dropna()
This is equivalent to:
data[data.notnull()]
Output:
0 1.0
2 3.5
4 7.0
dtype: float64
(2) With DataFrame objects, you may want to drop rows or columns that are all NA or that contain any NA. By default, dropna drops any row containing a missing value.
data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3]])
cleaned = data.dropna()
cleaned
Output:
0 1 2
0 1.0 6.5 3.0
Passing axis=1 drops the columns that contain missing values instead:
data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[1,NA,NA],[2,6.5,3]])
cleaned = data.dropna(axis=1)
cleaned
Output:
0
0 1.0
1 1.0
2 1.0
3 2.0
(3) Passing how='all' drops only the rows whose values are all NA:
data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3]])
data.dropna(how='all')
Output:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
To drop columns that are all NA in the same way, first add a column consisting entirely of NA:
data =pd.DataFrame([[1.,6.5,3.],[1.,NA,NA],[1,NA,NA],[2,6.5,3]])
data[4] = NA
data
Output:
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 1.0 NaN NaN NaN
3 2.0 6.5 3.0 NaN
data.dropna(axis=1,how = 'all')
Output:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 1.0 NaN NaN
3 2.0 6.5 3.0
(4) To keep only rows containing at least a certain number of observations, use the thresh argument.
Format: df.dropna(thresh=n) keeps a row only if, after excluding NA values, it still contains at least n values.
import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[:4,1]=NA
data.iloc[:2,2]=NA
data
Output:
0 1 2
0 -0.214321 NaN NaN
1 0.626029 NaN NaN
2 -0.794404 NaN 0.494591
3 0.687121 NaN -0.842619
4 -1.035153 1.106299 -2.506364
5 -0.557088 -0.311989 0.184533
6 1.435716 0.677662 1.430981
data.dropna(thresh=2)
Output:
0 1 2
2 -0.794404 NaN 0.494591
3 0.687121 NaN -0.842619
4 -1.035153 1.106299 -2.506364
5 -0.557088 -0.311989 0.184533
6 1.435716 0.677662 1.430981
(1) When calling fillna, you can replace missing values with a constant.
import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[:4,1]=NA
data.iloc[:2,2]=NA
data.fillna(0)
Output:
0 1 2
0 0.561395 0.000000 0.000000
1 -0.375632 0.000000 0.000000
2 1.797813 0.000000 0.147763
3 2.225626 0.000000 -0.607822
4 0.091317 -0.253953 1.292929
5 1.039400 -0.462940 -0.301816
6 -1.298559 0.299697 1.154967
(2) Calling fillna with a dict lets you use a different fill value for each column.
import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[:4,1]=NA
data.iloc[:2,2]=NA
data.fillna({1:0.5,2:0})
Output:
0 1 2
0 0.409984 0.500000 0.000000
1 -2.511078 0.500000 0.000000
2 1.312398 0.500000 0.870725
3 0.862810 0.500000 1.427994
4 -1.141758 -0.356768 -0.325600
5 0.904226 0.998484 -0.637515
6 1.227163 -0.409949 -0.100507
(3) The interpolation methods available for reindexing can also be used with fillna; pass the method argument, for example 'ffill' (forward fill) or 'bfill' (backward fill).
import numpy as np
data = pd.DataFrame(np.random.randn(7,3))
data.iloc[2:,1]=NA
data.iloc[4:,2]=NA
data
Output:
0 1 2
0 2.267295 0.807200 0.634113
1 0.654534 -0.434101 -0.579048
2 0.316548 NaN 0.903257
3 1.546708 NaN 0.013961
4 -0.544912 NaN NaN
5 -1.338441 NaN NaN
6 0.800708 NaN NaN
data.fillna(method = 'ffill')
Output:
0 1 2
0 2.267295 0.807200 0.634113
1 0.654534 -0.434101 -0.579048
2 0.316548 -0.434101 0.903257
3 1.546708 -0.434101 0.013961
4 -0.544912 -0.434101 0.013961
5 -1.338441 -0.434101 0.013961
6 0.800708 -0.434101 0.013961
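The text above also mentions 'bfill', which propagates the next valid observation backward. Backward filling the frame shown here would leave its trailing NaNs untouched (there is no later valid value to copy), so here is a minimal self-contained sketch on a made-up Series:
s = pd.Series([NA, 2.0, NA, 4.0])
s.fillna(method='bfill')
# 0    2.0   <- filled from the next valid value below
# 1    2.0
# 2    4.0
# 3    4.0
# dtype: float64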
(4) limit sets the maximum number of consecutive values to fill forward or backward:
data.fillna(method = 'ffill',limit=2)
Output:
0 1 2
0 2.267295 0.807200 0.634113
1 0.654534 -0.434101 -0.579048
2 0.316548 -0.434101 0.903257
3 1.546708 -0.434101 0.013961
4 -0.544912 NaN 0.013961
5 -1.338441 NaN 0.013961
6 0.800708 NaN NaN
(5) You can also define the fill value yourself, for example using the mean or median of a Series to fill its missing values:
data = pd.Series([1,NA,3.5,NA])
data.fillna(data.mean())
Output:
0 1.00
1 2.25
2 3.50
3 2.25
dtype: float64
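The median works in exactly the same way; a minimal sketch with a made-up Series in which the mean and median actually differ:
s = pd.Series([1, NA, 3.5, NA, 100])
s.fillna(s.median())
# The non-missing values are 1, 3.5 and 100, so the median 3.5 is filled in at
# positions 1 and 3 (the mean, roughly 34.8, would give a very different result).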
Duplicate rows may appear in a DataFrame:
import pandas as pd
data = pd.DataFrame({'k1':['one','two']*3+['two'],
'k2':[1,1,2,3,3,4,4]})
data
Output:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
(1) The duplicated method of a DataFrame returns a boolean Series indicating whether each row is a duplicate of a row seen earlier.
data.duplicated()
Output:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
(2) drop_duplicates returns a DataFrame with the duplicated rows removed, i.e. the rows for which duplicated returns False.
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
The parameters are illustrated with a small example and explained below.
data = pd.DataFrame({'A':['a','b','c','c'],'B':[1,1,2,2]})
A B
0 a 1
1 b 1
2 c 2
3 c 2
data.drop_duplicates(subset=None, keep='first')   # the default arguments; returns a new DataFrame, data itself is unchanged
A B
0 a 1
1 b 1
2 c 2
data.drop_duplicates(subset=['B'], keep='first')
A B
0 a 1
2 c 2
data.drop_duplicates(subset=['B'], keep='last')
A B
1 b 1
3 c 2
subset=None (the default) considers all columns, so rows are treated as duplicates only when every column value matches. subset=['B'] considers only column 'B', so rows with the same value in 'B' are deduplicated.
keep='first' (the default) keeps the first occurrence of each set of duplicated rows. The other possible values are 'last' and False, which keep the last occurrence and drop all duplicated rows, respectively.
inplace=True drops the duplicates from the original DataFrame directly, while the default False returns a new copy.
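keep=False is not demonstrated above; a minimal sketch using the same four-row frame defined earlier (rows 2 and 3 duplicate each other across all columns, so both are dropped):
data.drop_duplicates(keep=False)
#    A  B
# 0  a  1
# 1  b  1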
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
'Pastrami', 'corned beef', 'Bacon',
'pastrami', 'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose you want to add a column indicating the type of animal each food came from. First write a mapping of each distinct meat to the kind of animal:
meat_to_animal = {  # mapping used to add a column showing the source of each meat
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
Some of the meats are capitalized while others are not, so use str.lower() to convert all the values to lowercase.
lowercased=data['food'].str.lower()
lowercased
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
The map method on a Series accepts a function or a dict-like object containing a mapping.
data['animal']=lowercased.map(meat_to_animal)
data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
Alternatively, use an anonymous (lambda) function that does the lookup and lowercasing in one step:
data['animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])
data
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
The replace method substitutes specified values in a Series. Start with a Series containing some sentinel values:
data = pd.Series([1,-999,2,-999,-1000,3])
data
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: int64
Replace every -999 with NA:
data.replace(-999,np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 -1000.0
5 3.0
dtype: float64
To replace multiple values at once, pass a list of the values to replace:
data.replace([-999,-1000],np.nan)
0 1.0
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
dtype: float64
To use a different replacement for each value, pass a list of replacements as well:
data.replace([-999,-1000],[np.nan,555])
0 1.0
1 NaN
2 2.0
3 NaN
4 555.0
5 3.0
dtype: float64
The argument can also be passed as a dict:
data.replace({-999:np.nan,-1000:555})
0 1.0
1 NaN
2 2.0
3 NaN
4 555.0
5 3.0
dtype: float64
Axis index labels can also be transformed, either with the index's map method or with rename:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(12).reshape((3,4)),
index = ['beijing','shanghai','hangzhou'],
columns = ['one','two','three','four'])
data
one two three four
beijing 0 1 2 3
shanghai 4 5 6 7
hangzhou 8 9 10 11
transform = lambda x:x.upper()
data.index.map(transform)
Index(['BEIJING', 'SHANGHAI', 'HANGZHOU'], dtype='object')
data.index = data.index.map(transform)
data
one two three four
BEIJING 0 1 2 3
SHANGHAI 4 5 6 7
HANGZHOU 8 9 10 11
Use rename to create a transformed version of the dataset without modifying the original:
data.rename(index = str.title ,columns =str.upper)
ONE TWO THREE FOUR
Beijing 0 1 2 3
Shanghai 4 5 6 7
Hangzhou 8 9 10 11
rename can be used with a dict-like object to provide new values for a subset of the axis labels:
data.rename(index = {'BEIJING':'WUHAN'},
columns = {'three':3})
one two 3 four
WUHAN 0 1 2 3
SHANGHAI 4 5 6 7
HANGZHOU 8 9 10 11
data   # the original dataset is unchanged
one two three four
BEIJING 0 1 2 3
SHANGHAI 4 5 6 7
HANGZHOU 8 9 10 11
To modify the original dataset, pass inplace=True:
data.rename(index = {'BEIJING':'wuhan'},
columns = {'three':3},inplace = True)
data
one two 3 four
wuhan 0 1 2 3
SHANGHAI 4 5 6 7
HANGZHOU 8 9 10 11
Discretization and binning: suppose we want to divide a list of ages into groups of 18 to 25, 26 to 35, 36 to 60, and 61 and older.
age = [20,22,25,27,21,23,37,31,61,45,41,32]
bins=[18,25,35,60,100]
cats = pd.cut(age,bins)
cats
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
pd.value_counts(cats)
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
A parenthesis means that side of the interval is open (exclusive), while a square bracket means it is closed (inclusive). Pass right=False to change which side is closed.
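A minimal sketch of right=False on the same age data (the exact dtype shown in the Categories line may vary with the pandas version):
pd.cut(age, bins, right=False)
# Categories (4, ...): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
# With right=False, the value 25 now falls into the [25, 35) bin rather than (18, 25].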