Chapter 7 Data Wrangling: Clean, Transform, Merge, Reshape (3)

Combining Overlapping Data

numpy.where()

A vectorized if-else.

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
Out[8]: 
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

b = Series(np.arange(len(a)),dtype=np.float64,index=['f','e','d','c','b','a'])
b.iloc[-1]=np.nan  # set the last element by position; b[-1] on a label index is deprecated in recent pandas
Out[10]: 
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

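np.where(pd.isnull(a),b,a)  # element-wise: b where a is null, otherwise a
# Returns an ndarray (called with only a condition, np.where returns a tuple of ndarrays instead)
Out[11]: array([ 0. ,  2.5,  2. ,  3.5,  4.5,  nan])
### Wrap the array in a Series to get the output shown in the book
Series(np.where(pd.isnull(a),b,a),index=['f','e','d','c','b','a'])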
Out[13]: 
f    0.0
e    2.5
d    2.0
c    3.5
b    4.5
a    NaN
dtype: float64
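
The point that np.where hands back a bare ndarray can be checked with a short sketch. It rebuilds a and b as above and compares the result with Series.where, which keeps the index. The names raw, patched and same are made up for this illustration.

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)
b.iloc[-1] = np.nan

# np.where works on the underlying values only, so the labels are lost
raw = np.where(a.isna(), b, a)

# Reattach the labels by hand ...
patched = pd.Series(raw, index=a.index)

# ... or let pandas keep them: keep a where it is not null, otherwise take b
same = a.where(a.notna(), b)

print(type(raw).__name__)
print(patched.equals(same))
```

Both routes give the same values here because a and b share one index in the same order; np.where itself never looks at the labels.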

combine_first()

Does the same job as where, and additionally aligns the data by index label.

In [14]:b[:-2]
Out[14]: 
f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64
In [15]:a[2:]
Out[15]: 
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
### Element-wise: a[2:] where b[:-2] is null, otherwise b[:-2]
### Note the alignment: the result is indexed by the union of both indexes
b[:-2].combine_first(a[2:])
Out[16]: 
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
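
As a rough check of the claim that combine_first does what where does plus alignment, the sketch below compares the two on the full a and b, then repeats the partially overlapping call from above. The variable names by_where, by_combine and partial are illustrative only.

```python
import numpy as np
import pandas as pd

labels = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)
b.iloc[-1] = np.nan

# Same index on both sides: combine_first matches the np.where result
by_where = pd.Series(np.where(a.isna(), b, a), index=a.index)
by_combine = a.combine_first(b).reindex(a.index)
print(by_combine.equals(by_where))

# Different, partially overlapping indexes: values are matched by label,
# and the result covers the union of the labels
partial = b.iloc[:-2].combine_first(a.iloc[2:])
print(partial)
```

The reindex call only pins the row order for the comparison; it does not change any values.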


##### DataFrame works the same way: think of it as using the data in the argument object to patch the missing data in the calling object
df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':range(2,18,4)})
Out[18]: 
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})
Out[20]: 
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

df1.combine_first(df2)
Out[21]: 
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
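
The "patching" reading of combine_first can be spelled out by hand. The sketch below assumes that the patch amounts to reindexing both frames to the union of their rows and columns and then filling the holes of the caller from the argument; it is an illustration of the idea, not pandas' actual implementation. The names rows, cols, left, right and manual are invented here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, np.nan, 5, np.nan],
                    'b': [np.nan, 2, np.nan, 6],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5, 4, np.nan, 3, 7],
                    'b': [np.nan, 3, 4, 6, 8]})

patched = df1.combine_first(df2)

# Manual version: align both frames on the union of labels ...
rows = df1.index.union(df2.index)
cols = df1.columns.union(df2.columns)
left = df1.reindex(index=rows, columns=cols)
right = df2.reindex(index=rows, columns=cols)

# ... then keep df1's value wherever it exists, else fall back to df2
manual = left.where(left.notna(), right)

print(np.allclose(patched.to_numpy(), manual.to_numpy(), equal_nan=True))
```

Cells missing on both sides, such as row 0 of column b, stay NaN, and column c turns into floats because row 4 has no value for it.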

Reshaping with Hierarchical Indexing

  • stack: rotates columns into rows
  • unstack: rotates rows into columns
data = DataFrame(np.arange(6).reshape(2,3),index = pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
Out[23]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

### stack rotates the columns into rows and yields a Series
result = data.stack()
Out[25]: 
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

#### Likewise, unstack rearranges a hierarchically indexed Series back into a DataFrame
result.unstack()
Out[26]: 
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

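###### By default stack/unstack act on the innermost (lowest) level, and the rotated level also ends up innermost on the other axis
##### Pass a level number or name to unstack/stack a different level
result.unstack(0)  ## equivalent to result.unstack('state')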
Out[28]: 
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

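In detail:
data

number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

data.stack() rotates the columns into rows. The columns of data have a single level, 'number', so that is the level rotated. Take the 'Ohio' row: it holds three columns, one:0, two:1, three:2. Rotated into rows, they become the nested structure {'Ohio':{'one':0,'two':1,'three':2}}, as in the table below:

Ohio      one       0
          two       1
          three     2

The 'Colorado' row is handled the same way, so the result of data.stack() is

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5

The column axis name, number, is kept and becomes the name of a row index level. By default stack takes the innermost (lowest) column level and places it as the innermost (lowest) row level, which is why 'number' sits inside 'state'. The result is a Series with a two-level hierarchical index, numbered state=0 and number=1; the smaller the number, the higher (more outer) the level.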
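
The nested-dictionary picture and the level numbering can be checked directly. In this sketch the dictionary is built with xs and to_dict purely for illustration (nested is a made-up name); pandas does not store the data this way.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()

# One inner dict per state, keyed by the former column labels
nested = {state: result.xs(state).to_dict() for state in data.index}
print(nested['Ohio'])

# Level names in order: position 0 is the outer level, 1 the inner level
print(list(result.index.names))
print(result.loc[('Colorado', 'two')])
```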

Handling Missing Values

s1 = Series([0,1,2,3],index=['a','b','c','d'])
Out[34]: 
a    0
b    1
c    2
d    3
dtype: int64

s2 = Series([4,5,6],index=['c','d','e'])
Out[35]: 
c    4
d    5
e    6
dtype: int64

data2 = pd.concat([s1,s2],keys=['one','two'])  ##### concatenate; keys marks which part of the result came from s1 and which from s2
Out[37]: 
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack() #### two index levels; by default level 1, the innermost, is rotated
##### NaN is introduced where a value is missing
Out[38]: 
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

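##### stack filters out missing data by default, so the operation is invertible
##### Set the dropna parameter to change how missing values are handled
##### The default is dropna=True
data2.unstack().stack()
### The rotated level is inserted as the innermost (lowest) level by default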
Out[39]: 
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

data2.unstack().stack(dropna=False)
### The rotated level is inserted as the innermost (lowest) level by default
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
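
The round trip can be sketched in a way that does not depend on how stack treats NaN by default. As far as I know, newer pandas releases are reworking stack and its dropna parameter, so this sketch drops the missing values explicitly with dropna() rather than relying on the default; that choice is an assumption of the example, not something the notes above require. The names wide and long_again are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

# unstack has to invent cells such as ('one', 'e'), which become NaN,
# and the integer values are upcast to float as a side effect
wide = data2.unstack()
print(wide.isna().sum().sum())

# Stacking back and dropping the NaN cells recovers the original labels
long_again = wide.stack().dropna()
print(list(long_again.index) == list(data2.index))
```

The values come back as floats, so the round trip restores the labels and numbers but not the original int64 dtype.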

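DataFrames are handled similarly: by default both the level operated on and the place where the result is inserted are the innermost level, that is, the lowest level with the largest number. For example: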

df = DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
Out[42]: 
side             left  right
state    number             
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

#### The columns have a single level, side; the rows have two levels, 'state'=0 and 'number'=1

df.unstack('state')  #### rotate the row level 'state' into the columns, inserted as the innermost (lowest) level
Out[43]: 
side   left          right         
state  Ohio Colorado  Ohio Colorado
number                             
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10

df.unstack('state').stack('side')  ### starting from the previous result, rotate the column level 'side' into the rows, inserted as the innermost level
Out[44]: 
state         Colorado  Ohio
number side                 
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
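
To see where each level lands, the sketch below rebuilds df and prints the level names after each rotation. Values are looked up by label, so the checks do not depend on the order in which a given pandas version lists the states. The names wide and tall are chosen for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

# 'state' leaves the rows and becomes the inner column level
wide = df.unstack('state')
print(list(wide.columns.names))

# 'side' leaves the columns and becomes the inner row level
tall = wide.stack('side')
print(list(tall.index.names))
print(tall.loc[('three', 'right'), 'Ohio'])
```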
