[Python Data Analysis and Visualization] (6) Handling Missing Data and Hierarchical Indexing

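Handling Missing Data

pandas uses the floating-point value NaN to represent missing data. NaN is simply a sentinel value that is easy to detect.

Method    Description
dropna    Filters out missing data; a threshold can adjust how many missing values are tolerated
fillna    Fills missing data with a specified value or a fill method (ffill/bfill)
isnull    Returns boolean values marking which entries are NaN
notnull   The inverse of isnull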
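
The examples that follow cover dropna and fillna, so here is a minimal sketch of isnull and notnull. The Series values are made up purely for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
s = pd.Series([1.0, np.nan, 3.0, None])

mask = s.isnull()        # True where the value is missing (None is stored as NaN here)
present = s.notnull()    # element-wise opposite of isnull

n_missing = int(mask.sum())   # True counts as 1, so this is the number of missing values
kept = s[present]             # boolean indexing keeps only the non-missing entries
```

Filtering with s[s.notnull()] gives the same values as s.dropna() for a Series like this one.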

Examples

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
Drop the columns where all elements are nan:

df.dropna(axis=1, how='all')
     A    B  D
0  NaN  2.0  0
1  3.0  4.0  1
2  NaN  NaN  5
Drop the columns where any of the elements is nan:

df.dropna(axis=1, how='any')
   D
0  0
1  1
2  5
Drop the rows where all of the elements are nan (there is no row to drop, so df stays the same):

df.dropna(axis=0, how='all')
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
Keep only the rows with at least 2 non-na values:

df.dropna(thresh=2)
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
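
To see how the defaults and the threshold interact, the sketch below reuses the same three-row frame. The variable names (default, strict, cols) are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))

# Defaults are axis=0 and how='any': every row has a NaN in column C,
# so no row survives.
default = df.dropna()

# Require at least 3 non-NaN values per row: only row 1 qualifies.
strict = df.dropna(thresh=3)

# The threshold also works on columns: B has 2 non-NaN values and D has 3.
cols = df.dropna(axis=1, thresh=2)
```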

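fillna parameters

Parameter  Description
value      Fill value: a scalar or a dict
method     ffill or bfill (newer pandas versions prefer the separate ffill() and bfill() methods)
axis       Axis along which to fill
inplace    Modify the original object instead of returning a new one
limit      Maximum number of values to fill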

Examples

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4
Replace all NaN elements with 0s.

df.fillna(0)
     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5
3  0.0  3.0  0.0  4
We can also propagate non-null values forward or backward.

df.fillna(method='ffill')  # same result as df.ffill(), which newer pandas versions prefer
     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  3.0  4.0 NaN  5
3  3.0  3.0 NaN  4
Replace all NaN elements in columns 'A', 'B', 'C', and 'D' with 0, 1, 2, and 3 respectively.

values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4
Only replace the first NaN element in each column.

df.fillna(value=values, limit=1)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  NaN  1
2  NaN  1.0  NaN  5
3  NaN  3.0  NaN  4
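
The parameter table also lists axis and inplace, and mentions bfill, none of which appear in the examples above. The following is a minimal sketch on the same four-row frame. It uses the ffill() and bfill() methods rather than fillna(method=...), on the assumption that they behave the same way. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

back = df.bfill()            # each NaN takes the next valid value below it in its column
across = df.ffill(axis=1)    # propagate along each row, from left to right
limited = df.ffill(limit=1)  # fill at most one consecutive NaN after each valid value

filled = df.copy()
filled.fillna(0, inplace=True)   # changes `filled` itself; `df` is left untouched
```

Column C stays entirely NaN under bfill and column-wise ffill because it contains no valid value to propagate.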

Hierarchical Indexing

Hierarchical indexing is an important pandas feature: it lets you have multiple index levels on a single axis.
Example: Series

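from pandas import Series, DataFrame

data = Series(np.random.rand(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     ['1', '2', '3', '1', '2', '3', '1', '2', '1', '2']])
data
# a  1    0.974478
#    2    0.638362
#    3    0.101788
# b  1    0.713843
#    2    0.106504
#    3    0.175605
# c  1    0.608555
#    2    0.399577
# d  1    0.102047
#    2    0.726674
# dtype: float64
data.index
# MultiIndex(levels=[['a', 'b', 'c', 'd'], ['1', '2', '3']],
#            labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
# (repr from an older pandas; newer versions call `labels` `codes` and print the index as a list of tuples)
data.unstack()  # pivot the inner level into columns (output is from a separate random run)
#           1         2         3
# a  0.972446  0.058712  0.758507
# b  0.648166  0.169893  0.423814
# c  0.655640  0.869214       NaN
# d  0.141091  0.251400       NaN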
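
A practical benefit of a hierarchical index is partial indexing, which the example above does not show. The sketch below uses fixed values instead of random ones so the results are reproducible. Everything in it is illustrative rather than taken from the original post.

```python
import numpy as np
import pandas as pd

# Fixed values 0.0 .. 9.0 instead of random numbers, for reproducibility.
data = pd.Series(np.arange(10, dtype=float),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        ['1', '2', '3', '1', '2', '3', '1', '2', '1', '2']])

outer = data['b']               # everything under outer label 'b', indexed by the inner level
inner = data.loc[:, '2']        # entries whose inner label is '2', indexed by the outer level
single = data.loc[('c', '1')]   # one scalar, addressed by a full (outer, inner) tuple

wide = data.unstack()           # inner level becomes columns; missing pairs become NaN
```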

For a DataFrame, either axis can have a hierarchical index.

frame = DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['green', 'red', 'green']])
frame
#      Ohio     Colorado
#     green red    green
# a 1     0   1        2
#   2     3   4        5
# b 1     6   7        8
#   2     9  10       11
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
# state      Ohio     Colorado
# color     green red    green
# key1 key2
# a    1        0   1        2
#      2        3   4        5
# b    1        6   7        8
#      2        9  10       11
frame['Ohio']
# color      green  red
# key1 key2
# a    1         0    1
#      2         3    4
# b    1         6    7
#      2         9   10
frame.swaplevel('key1', 'key2')  # swap two index levels, given by name or by position
# state      Ohio     Colorado
# color     green red    green
# key2 key1
# 1    a        0   1        2
# 2    a        3   4        5
# 1    b        6   7        8
# 2    b        9  10       11
frame.sort_index(level=1)  # sort by the second index level
frame.swaplevel('key1', 'key2').sort_index(level=0)  # swap levels, then sort by the first level
frame.groupby(level=1).sum()  # sum rows by the second index level; older pandas wrote frame.sum(axis=0, level=1)
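
The last few calls above are shown without output, so here is a small self-contained sketch that rebuilds the same frame and records what they return. Selecting with full tuples is added as an illustration and was not part of the original example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], ['1', '2', '1', '2']],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['green', 'red', 'green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Add up the rows that share the same key2 label.
by_key2 = frame.groupby(level='key2').sum()

# A full tuple addresses exactly one column or one row.
ohio_red = frame[('Ohio', 'red')]
row_a2 = frame.loc[('a', '2'), :]

# Swap the row levels, then sort so that key2 becomes the outer, ordered level.
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)
```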

Columns can also be used as the row index, via the set_index function.

frame1 = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                    'c': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'd': [0, 1, 2, 0, 1, 2, 3]})
frame1
#    a  b  c  d
# 0  0  7  a  0
# 1  1  6  a  1
# 2  2  5  a  2
# 3  3  4  b  0
# 4  4  3  b  1
# 5  5  2  b  2
# 6  6  1  b  3
frame2 = frame1.set_index(['c', 'd'])
frame2
#      a  b
# c d
# a 0  0  7
#   1  1  6
#   2  2  5
# b 0  3  4
#   1  4  3
#   2  5  2
#   3  6  1
frame1.set_index(['c', 'd'], drop=False)  # keep the indexed columns as regular columns too
frame1.set_index(['c', 'd'])  # default drop=True removes them from the columns
frame2.reset_index()  # move the index levels back into ordinary columns
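
As a quick check of the round trip, the sketch below rebuilds frame1 and inspects the results of the three calls. The names restored and kept are illustrative.

```python
import pandas as pd

frame1 = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                       'c': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
                       'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame1.set_index(['c', 'd'])             # c and d leave the columns
restored = frame2.reset_index()                   # c and d come back as the leading columns
kept = frame1.set_index(['c', 'd'], drop=False)   # c and d are both index levels and columns

# The data survives the round trip, but the column order changes to c, d, a, b.
same_data = restored[['a', 'b', 'c', 'd']].values.tolist() == frame1.values.tolist()
```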
