pandas uses the floating-point value NaN to represent missing data; it is simply a sentinel that can be detected easily.
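Method | Description |
---|---|
dropna | Filters out missing data; a threshold can be used to tune how much missing data is tolerated |
fillna | Fills missing data with a specified value or an interpolation method |
isnull | Returns boolean values indicating which values are NaN |
notnull | The negation of isnull |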
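As a minimal sketch of the two detection methods in the table, the following builds a small Series (the values are made up for illustration) and checks that notnull is the element-wise negation of isnull.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing entry in the middle.
s = pd.Series([1.0, np.nan, 3.0])

mask = s.isnull()        # True where the value is NaN
present = s.notnull()    # True where the value is present

print(mask.tolist())     # [False, True, False]
print(present.tolist())  # [True, False, True]

# Summing the boolean mask counts the missing values.
n_missing = int(mask.sum())
print(n_missing)         # 1
```

Because the mask is an ordinary boolean Series, it can also be used directly for filtering, for example `s[s.notnull()]`.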
Examples
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Drop the columns where all elements are NaN:
df.dropna(axis=1, how='all')
A B D
0 NaN 2.0 0
1 3.0 4.0 1
2 NaN NaN 5
Drop the columns where any of the elements is NaN:
df.dropna(axis=1, how='any')
D
0 0
1 1
2 5
Drop the rows where all of the elements are NaN (there is no row to drop, so df stays the same):
df.dropna(axis=0, how='all')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Keep only the rows with at least 2 non-NA values:
df.dropna(thresh=2)
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
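To make the meaning of thresh concrete, here is a small sketch that reproduces `df.dropna(thresh=2)` by hand: it counts the non-NA values in each row and keeps the rows that reach the threshold. The variable names `non_na_per_row` and `manual` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

# Count the non-NA values in every row.
non_na_per_row = df.notnull().sum(axis=1)
print(non_na_per_row.tolist())   # [2, 3, 1]

# Keep the rows that have at least 2 non-NA values.
manual = df[non_na_per_row >= 2]
same = manual.equals(df.dropna(thresh=2))
print(same)                      # True
```

Row 2 has only one non-NA value, which is why it is the only row removed.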
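fillna parameters
Parameter | Description |
---|---|
value | Fill value: a scalar or a dict |
method | Interpolation method: ffill or bfill |
axis | Axis along which to fill |
inplace | Modify the original object instead of returning a new one |
limit | Maximum number of values to fill |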
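The effect of inplace is easy to miss, so the following sketch contrasts the default behaviour (a new object is returned) with `inplace=True` (the original object is modified). The two-column frame and the counter variables are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1.0], 'B': [2.0, np.nan]})

# Default: fillna returns a new DataFrame and leaves df untouched.
filled = df.fillna(0)
missing_before = int(df.isnull().sum().sum())
print(missing_before)   # 2

# inplace=True: df itself is modified.
df.fillna(0, inplace=True)
missing_after = int(df.isnull().sum().sum())
print(missing_after)    # 0
```

Assigning the result of the default call, as in `df = df.fillna(0)`, is usually the easier form to read and chain.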
Examples
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Replace all NaN elements with 0s:
df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward:
df.fillna(method='ffill') # equivalent to df.ffill()
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
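The example above only shows forward propagation. As a sketch of the backward direction, the following uses `DataFrame.ffill` and `DataFrame.bfill` on the same data; these methods are used here because recent pandas releases deprecate the `method=` argument of fillna.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

forward = df.ffill()    # carry the last valid value downward
backward = df.bfill()   # pull the next valid value upward

print(forward['B'].tolist())    # [2.0, 4.0, 4.0, 3.0]
print(backward['B'].tolist())   # [2.0, 4.0, 3.0, 3.0]

# Leading NaNs survive ffill and trailing NaNs survive bfill,
# because there is no valid neighbour to copy from.
print(forward['A'].isnull().tolist())    # [True, False, False, False]
print(backward['A'].isnull().tolist())   # [False, False, True, True]
```

Column C stays entirely NaN in both directions because it contains no valid value to propagate.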
Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively:
values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 2.0 1
2 0.0 1.0 2.0 5
3 0.0 3.0 2.0 4
Only replace the first NaN element in each column:
df.fillna(value=values, limit=1)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 NaN 1
2 NaN 1.0 NaN 5
3 NaN 3.0 NaN 4
Hierarchical indexing is an important pandas feature: it lets you have multiple index levels on a single axis.
Example: Series
from pandas import Series, DataFrame
data = Series(np.random.rand(10),
              index=[['a','a','a','b','b','b','c','c','d','d'],
                     ['1','2','3','1','2','3','1','2','1','2']])
data
#a 1 0.974478
2 0.638362
3 0.101788
b 1 0.713843
2 0.106504
3 0.175605
c 1 0.608555
2 0.399577
d 1 0.102047
2 0.726674
dtype: float64
data.index
#MultiIndex(levels=[['a', 'b', 'c', 'd'], ['1', '2', '3']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
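data.unstack() # the output below comes from a separate random draw, so its values differ from the Series printed above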
# 1 2 3
a 0.972446 0.058712 0.758507
b 0.648166 0.169893 0.423814
c 0.655640 0.869214 NaN
d 0.141091 0.251400 NaN
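The following sketch shows two things a hierarchical index makes convenient: selecting by one level only (partial indexing) and moving between the long and wide layouts with unstack and stack. It uses `np.arange` instead of random numbers so that the printed values are reproducible; that substitution is an assumption made for the example.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.0),
                 index=[list('aaabbbccdd'),
                        ['1', '2', '3', '1', '2', '3', '1', '2', '1', '2']])

# Partial indexing on the outer level returns the inner Series.
print(data.loc['b'].tolist())        # [3.0, 4.0, 5.0]

# Selecting on the inner level across all outer keys.
print(data.loc[:, '2'].tolist())     # [1.0, 4.0, 7.0, 9.0]

# unstack pivots the inner level into columns; missing
# combinations such as ('c', '3') become NaN.
wide = data.unstack()
print(wide.shape)                    # (4, 3)

# stack reverses the pivot; dropna removes the NaN rows in case
# the installed pandas version keeps them.
restored = wide.stack().dropna()
print(restored.tolist() == data.tolist())   # True
```

The explicit `dropna()` is there because the treatment of NaN in stack differs between pandas versions.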
For a DataFrame, either axis can have a hierarchical index.
frame = DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['green', 'red', 'green']])
frame
# Ohio Colorado
green red green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
#
state Ohio Colorado
color green red green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame['Ohio']
# color green red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10
frame.swaplevel('key1', 'key2') # swap the two index levels, given by name or by number
#
state Ohio Colorado
color green red green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
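frame.sort_index(level=1) # sort by the second index level
frame.swaplevel('key1', 'key2').sort_index(level=0) # swap the levels, then sort by the first level
frame.sum(axis=0, level=1) # sum the rows grouped by the second index level (the level argument was removed in pandas 2.0)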
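Since the level argument of sum is not available in pandas 2.0 and later, a per-level summary can be sketched with groupby instead. The code below rebuilds the same frame and aggregates by an index level and by a column level; `by_key2` and `by_color` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['green', 'red', 'green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()
print(by_key2.loc['1'].tolist())   # [6, 8, 10]
print(by_key2.loc['2'].tolist())   # [12, 14, 16]

# Sum the columns that share the same color: transpose, group
# the (now row) level, and transpose back.
by_color = frame.T.groupby(level='color').sum().T
print(by_color['green'].tolist())  # [2, 8, 14, 20]
print(by_color['red'].tolist())    # [1, 4, 7, 10]
```

Grouping by level name rather than by position keeps the code valid even if the levels are later swapped.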
Columns can also be used as the row index, via the set_index function.
frame1 = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['a','a','a','b','b','b','b'],'d':[0,1,2,0,1,2,3]})
frame1
# a b c d
0 0 7 a 0
1 1 6 a 1
2 2 5 a 2
3 3 4 b 0
4 4 3 b 1
5 5 2 b 2
6 6 1 b 3
frame2 = frame1.set_index(['c','d'])
frame2
# a b
c d
a 0 0 7
1 1 6
2 2 5
b 0 3 4
1 4 3
2 5 2
3 6 1
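frame1.set_index(['c', 'd'], drop=False) # keep the columns that are used as the index
frame1.set_index(['c', 'd']) # default drop=True: the columns are moved into the index
frame2.reset_index() # move the index levels back into columns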
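As a short sketch of how set_index and reset_index mirror each other, the following rebuilds frame1, moves two columns into the index, selects through the resulting hierarchical index, and then restores the flat layout. The name `restored` is illustrative.

```python
import pandas as pd

frame1 = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                       'c': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
                       'd': [0, 1, 2, 0, 1, 2, 3]})

# Move columns c and d into a two-level row index.
frame2 = frame1.set_index(['c', 'd'])
print(frame2.loc[('b', 3), 'a'])     # 6
print(frame2.loc['a', 'b'].tolist()) # [7, 6, 5]

# reset_index turns the levels back into ordinary columns,
# placing them first.
restored = frame2.reset_index()
print(list(restored.columns))        # ['c', 'd', 'a', 'b']

# After reordering the columns, the original frame is recovered.
round_trip = restored[list(frame1.columns)].equals(frame1)
print(round_trip)                    # True
```

Only the column order changes in the round trip, which is why the comparison reorders the columns first.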