A vectorized if-else
import pandas as pd
import numpy as np
from pandas import Series,DataFrame
a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
Out[8]:
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
b = Series(np.arange(len(a)),dtype=np.float64,index=['f','e','d','c','b','a'])
b.iloc[-1]=np.nan # positional assignment; b[-1] is ambiguous on a string index
Out[10]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
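np.where(pd.isnull(a),b,a) # elementwise: take b where a is null, otherwise a
# With three arguments np.where returns a plain ndarray, so the index labels are lost
# (only the one-argument form returns a tuple of ndarrays)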
Out[11]: array([ 0. , 2.5, 2. , 3.5, 4.5, nan])
### Wrap the array in a Series to get the output shown in the book
Series(np.where(pd.isnull(a),b,a),index=['f','e','d','c','b','a'])
Out[13]:
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64
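Because np.where discards the index, the result above has to be wrapped in a Series again. As a minimal sketch, the same selection can be written with pandas' own `Series.where` or `Series.fillna`, both of which keep the labels. The variable names `raw`, `kept` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

idx = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=idx)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=idx)
b.iloc[-1] = np.nan  # last value set to NaN by position

# np.where works on the underlying values and returns a bare ndarray.
raw = np.where(pd.isnull(a), b, a)

# Series.where keeps a where the condition holds and takes b elsewhere;
# the result is still a Series carrying the original index.
kept = a.where(a.notnull(), b)

# fillna with a Series fills the holes in a from the matching labels of b.
filled = a.fillna(b)
```

All three produce the same values; only `raw` needs the index reattached by hand.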
combine_first does the same job as where, and it also aligns the data on the index
In [14]:b[:-2]
Out[14]:
f 0.0
e 1.0
d 2.0
c 3.0
dtype: float64
In [15]:a[2:]
Out[15]:
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
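### elementwise: take a[2:] where b[:-2] is null, otherwise b[:-2]
### Note the alignment: the result is indexed by the sorted union of both indexes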
b[:-2].combine_first(a[2:])
Out[16]:
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
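To make the alignment step explicit, the sketch below reproduces the combine_first result by hand: reindex both Series to the union of their indexes, then patch the holes. The helper names `left`, `right`, `union` and `manual` are illustrative, and the manual version is only an approximation of what the method does, not pandas' actual implementation.

```python
import numpy as np
import pandas as pd

idx = ['f', 'e', 'd', 'c', 'b', 'a']
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=idx)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=idx)
b.iloc[-1] = np.nan

left = b.iloc[:-2]   # labels f, e, d, c
right = a.iloc[2:]   # labels d, c, b, a

patched = left.combine_first(right)

# Manual equivalent: align both sides on the union of the labels,
# then take the right-hand value wherever the left-hand one is missing.
union = left.index.union(right.index)
left_u = left.reindex(union)
right_u = right.reindex(union)
manual = left_u.where(left_u.notnull(), right_u)
```

Label 'c' exists on both sides, so the caller's value 3.0 wins; label 'b' exists only on the right, so 4.5 is filled in.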
##### DataFrame supports the same operation; think of it as the argument's data patching the missing values in the calling object
df1 = DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,6],'c':range(2,18,4)})
Out[18]:
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2 = DataFrame({'a':[5,4,np.nan,3,7],'b':[np.nan,3,4,6,8]})
Out[20]:
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
Out[21]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN
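The "patching" rule can be checked cell by cell. In this small sketch, built from the same two frames, values already present in df1 are never overwritten, holes are filled from df2 when df2 has a value, and the result covers the union of both row and column labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, np.nan, 5, np.nan],
                    'b': [np.nan, 2, np.nan, 6],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5, 4, np.nan, 3, 7],
                    'b': [np.nan, 3, 4, 6, 8]})

patched = df1.combine_first(df2)

kept_value = patched.loc[0, 'a']     # df1 has 1.0, df2's 5.0 is ignored
filled_value = patched.loc[1, 'a']   # df1 has NaN, df2 supplies 4.0
still_missing = patched.loc[0, 'b']  # NaN in both frames stays NaN
new_row_gap = patched.loc[4, 'c']    # row 4 exists only in df2, which has no 'c'
```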
data = DataFrame(np.arange(6).reshape(2,3),index = pd.Index(['Ohio','Colorado'],name='state'),columns=pd.Index(['one','two','three'],name='number'))
Out[23]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
### stack pivots the columns into rows and returns a Series
result = data.stack()
Out[25]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
#### Conversely, unstack rearranges a hierarchically indexed Series back into a DataFrame
result.unstack()
Out[26]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
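###### By default stack/unstack operate on the innermost (lowest) level, and the pivoted level also ends up as the innermost level on the other axis
##### Pass a level number or name to stack/unstack a different level
result.unstack(0)## equivalent to result.unstack('state')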
Out[28]:
state Ohio Colorado
number
one 0 3
two 1 4
three 2 5
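A short sketch confirming that a level can be addressed by number or by name with the same result. It rebuilds `data` and `result` from the transcript; `by_number` and `by_name` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

# Level 0 and the level named 'state' are the same level.
by_number = result.unstack(0)
by_name = result.unstack('state')

# With no argument the innermost level ('number') is pivoted instead.
default = result.unstack()
```

Unstacking the outer level puts 'state' in the columns, so the frame comes back transposed relative to `data`.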
Detailed explanation:
data
number | one | two | three |
---|---|---|---|
state | | | |
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
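data.stack() pivots the columns into rows. The columns of data have a single level, 'number', so that is the level that gets pivoted. Each row, for example 'Ohio', holds three columns (one: 0, two: 1, three: 2); pivoting them into rows gives the nested structure {'Ohio':{'one':0,'two':1,'three':2}},
as in the table below:
Ohio | one | 0 |
---|---|---|
 | two | 1 |
 | three | 2 |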
The 'Colorado' row is handled the same way, so the result of data.stack() is
state | number | |
---|---|---|
Ohio | one | 0 |
 | two | 1 |
 | three | 2 |
Colorado | one | 3 |
 | two | 4 |
 | three | 5 |
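The name of the original column axis, number, is kept and becomes the name of an index level. By default stack pivots the innermost (lowest) level and inserts the result as the innermost (lowest) level as well, which is why 'number' sits inside 'state'. The result is a Series with a two-level hierarchical index whose levels are state=0 and number=1; the smaller the number, the higher (outer) the level.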
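The level numbering described above can be inspected directly. This is a minimal sketch that rebuilds the same data and looks at the index of the stacked result; it uses no names beyond those in the transcript.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

result = data.stack()

# Level 0 is the outer level ('state'), level 1 the inner one ('number').
level_names = list(result.index.names)

# Selecting on the outer level returns the inner level as a plain Series,
# which matches the nested-dict picture {'Ohio': {'one': 0, ...}}.
ohio = result.loc['Ohio'].to_dict()

# A full (outer, inner) key picks out a single value.
ohio_two = result.loc[('Ohio', 'two')]
```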
Handling missing values
s1 = Series([0,1,2,3],index=['a','b','c','d'])
Out[34]:
a 0
b 1
c 2
d 3
dtype: int64
s2 = Series([4,5,6],index=['c','d','e'])
Out[35]:
c 4
d 5
e 6
dtype: int64
data2 = pd.concat([s1,s2],keys=['one','two'])##### concatenation; keys marks which part of the result came from s1 and which from s2
Out[37]:
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack() #### two index levels; by default the level pivoted is level 1, the innermost
##### NaN is introduced wherever a value is missing
Out[38]:
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
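##### stack drops missing data by default, so it inverts the unstack above
##### Set the dropna parameter to change how missing values are handled
##### The default is dropna=True (in the pandas version used here; the newer stack implementation selected by future_stack=True does not drop NaN)
data2.unstack().stack()
### The pivoted level is inserted as the innermost (lowest) level by default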
Out[39]:
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data2.unstack().stack(dropna=False)
### The pivoted level is again inserted as the innermost (lowest) level; with dropna=False the NaN entries are kept
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
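How stack treats NaN depends on the pandas version: the transcript above relies on the older behaviour (dropna=True by default), while the newer implementation keeps NaN and does not accept dropna. As a hedged, version-neutral sketch, calling `.dropna()` explicitly after `stack()` gives the same round trip under both behaviours.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

# unstack introduces NaN for (row, column) pairs that never existed.
wide = data2.unstack()
n_missing = int(wide.isna().sum().sum())

# Dropping NaN explicitly makes the round trip independent of the
# stack implementation in use.
round_trip = wide.stack().dropna()
```

The values come back as float64 rather than int64, because the intermediate frame had to hold NaN.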
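DataFrame data is handled in a similar way: by default both the level operated on and the level inserted are the innermost one, i.e. the lowest level with the largest level number. For example: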
df = DataFrame({'left':result,'right':result+5},columns=pd.Index(['left','right'],name='side'))
Out[42]:
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10
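#### The columns have a single level, side; the rows have two levels, 'state'=0 and 'number'=1
df.unstack('state')#### pivot the row level 'state' into the columns, inserted as the innermost (lowest) column level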
Out[43]:
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
df.unstack('state').stack('side')### starting from the previous result, pivot the column level 'side' into the rows, inserted as the innermost row level
Out[44]:
state Colorado Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7
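The two pivots above can be followed by watching the axis names move. This sketch rebuilds the frame; as an assumption made for robustness, the column name 'side' is set explicitly after construction rather than through the constructor. `wide` and `back` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()

df = pd.DataFrame({'left': result, 'right': result + 5})
df.columns.name = 'side'  # name the column axis so it can be pivoted by name

# 'state' leaves the row index and becomes the inner column level.
wide = df.unstack('state')

# 'side' leaves the columns and becomes the inner row level.
back = wide.stack('side')
```

After both steps the rows are indexed by (number, side) and the columns by state, which is the layout shown in Out[44].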