reindex: creates a new object with the data conformed to a new index
import pandas as pd
import numpy as np
t1 = pd.Series(np.arange(4),index=list('cdab'))
print(t1)
c 0
d 1
a 2
b 3
dtype: int32
print(t1.reindex(list('abcd')))
a 2
b 3
c 0
d 1
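dtype: int32
Calling reindex on a Series rearranges the data according to the new index. If an index value was not already present, a missing value (NaN) is introduced; the fill_value argument substitutes a different value:
print(t1.reindex(list('abcde')))
print(t1.reindex(list('abcde'),fill_value = 0))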
a 2.0
b 3.0
c 0.0
d 1.0
e NaN
dtype: float64
a 2
b 3
c 0
d 1
e 0
dtype: int32
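The behaviour above can be checked with a minimal sketch. The variable names (s, with_nan, with_zero) are illustrative and not from the original notes. It shows that reindex leaves the original Series untouched, and that introducing NaN converts integer data to float while fill_value=0 keeps an integer dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=list("cdab"))

# reindex returns a new Series; the original keeps its own order.
with_nan = s.reindex(list("abcde"))
with_zero = s.reindex(list("abcde"), fill_value=0)

print(s.index.tolist())           # original index order is unchanged
print(with_nan.isna().tolist())   # only the new label 'e' is missing
print(with_nan.dtype, with_zero.dtype)
```

The dtype change happens because NaN is a floating-point value, so an integer Series cannot hold it without conversion.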
Value of the reindex method option | Description |
---|---|
ffill or pad | Fill (or carry) values forward |
bfill or backfill | Fill (or carry) values backward |
t2 = pd.Series(['blue','purple','yellow'],index=[0,2,4])
print(t2)
print(t2.reindex(range(7),method='ffill'))
print(t2.reindex(range(7),method='pad'))
print(t2.reindex(range(7),method='bfill'))
0 blue
2 purple
4 yellow
dtype: object
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
6 yellow
dtype: object
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
6 yellow
dtype: object
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
6 NaN
dtype: object
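As a small sketch of the two fill directions, the example below repeats the reindex calls and adds the limit argument of reindex, which caps how many consecutive new labels may be filled. The names colors, forward, backward and limited are illustrative.

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill: each new label takes the value of the closest earlier label.
forward = colors.reindex(range(7), method="ffill")

# Backward fill: labels after the last original label have nothing to copy.
backward = colors.reindex(range(7), method="bfill")

# limit=1 fills at most one consecutive new label after each original one,
# so label 5 is filled from label 4 but label 6 stays missing.
limited = colors.reindex(range(7), method="ffill", limit=1)

print(forward.tolist())
print(backward.isna().tolist())
print(limited.isna().tolist())
```

Filling with method requires the original index to be sorted (monotonic), which is the case for [0, 2, 4].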
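For a DataFrame, reindex can change the (row) index, the columns, or both. A sequence passed positionally reindexes the rows; the columns keyword reindexes the columns.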
t3 = pd.DataFrame(np.arange(9).reshape((3,3)),index=list('acb'),columns=['W','X','Y'])
print(t3)
W X Y
a 0 1 2
c 3 4 5
b 6 7 8
print(t3.reindex(list('abcd')))
print(t3.reindex(columns=list('QXY')))
W X Y
a 0.0 1.0 2.0
b 6.0 7.0 8.0
c 3.0 4.0 5.0
d NaN NaN NaN
Q X Y
a NaN 1 2
c NaN 4 5
b NaN 7 8
To forward-fill the gaps left by reindexing a DataFrame, chain the calls as df.reindex(...).ffill():
print(t3.reindex(index=list('abcd'),columns=list('QXY')).ffill())
Q X Y
a NaN 1.0 2.0
b NaN 7.0 8.0
c NaN 4.0 5.0
d NaN 4.0 5.0
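The three ways of reindexing a DataFrame can be sketched together as follows. The frame mirrors t3 above; the result names (rows, cols, both, filled) are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=list("acb"), columns=["W", "X", "Y"])

rows = frame.reindex(list("abcd"))           # positional argument: rows
cols = frame.reindex(columns=list("QXY"))    # keyword argument: columns
both = frame.reindex(index=list("abcd"), columns=list("QXY"))

# ffill copies the last valid value down each column; column Q has no
# valid value at all, so it stays entirely NaN.
filled = both.ffill()
print(filled)
```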
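Deprecated: the .ix indexer was deprecated and then removed in pandas 1.0. The equivalent label-based selection uses .loc:
print(t3.loc[['a'],['W']])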
W
a 0
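Because .ix is not available in current pandas releases, here is a minimal sketch of the same selection written with .loc (labels) and .iloc (integer positions). The frame mirrors t3; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=list("acb"), columns=["W", "X", "Y"])

# Label-based selection: row 'a', column 'W'.
by_label = frame.loc[["a"], ["W"]]

# Position-based selection: first row, first column.
by_position = frame.iloc[[0], [0]]

print(by_label)
print(by_label.equals(by_position))
```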
Dropping entries from an axis with drop. Series:
import pandas as pd
import numpy as np
t1 = pd.Series(np.arange(5),index=list('abcde'))
print(t1)
print(t1.drop('c'))
print(t1.drop(['a','b']))
a 0
b 1
c 2
d 3
e 4
dtype: int32
a 0
b 1
d 3
e 4
dtype: int32
c 2
d 3
e 4
dtype: int32
DataFrame:
t2 = pd.DataFrame(np.arange(9).reshape((3,3)),index=list('abc'),columns=list('wxy'))
print(t2)
print(t2.drop(['a','b']))
print(t2.drop(['w','x'],axis=1))
w x y
a 0 1 2
b 3 4 5
c 6 7 8
w x y
c 6 7 8
y
a 2
b 5
c 8
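A short sketch of drop on a DataFrame, using the index and columns keywords as an alternative to axis=1. It also shows that drop returns a new object and leaves the original unchanged. The names frame, no_rows and no_cols are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=list("abc"), columns=list("wxy"))

no_rows = frame.drop(index=["a", "b"])     # same as frame.drop(['a', 'b'])
no_cols = frame.drop(columns=["w", "x"])   # same as frame.drop(['w', 'x'], axis=1)

print(no_rows)
print(no_cols)
print(frame.shape)   # the original still has all rows and columns
```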
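When objects are added together, if any index pairs differ, the index of the result is the union of those index pairs. Automatic data alignment introduces NaN at the labels that do not overlap.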
Series:
t1 = pd.Series(np.arange(3),index = list('abe'))
t2 = pd.Series([3,4,5],index = list('abd'))
print(t1+t2)
a 3.0
b 5.0
d NaN
e NaN
dtype: float64
DataFrame:
t3 = pd.DataFrame(np.arange(9).reshape((3,3)),index=list('abc'),columns=list('wxy'))
t4 = pd.DataFrame(np.arange(9).reshape((3,3)),index=list('bcd'),columns=list('xyz'))
print(t3+t4)
w x y z
a NaN NaN NaN NaN
b NaN 4.0 6.0 NaN
c NaN 10.0 12.0 NaN
d NaN NaN NaN NaN
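The alignment rule can be verified with a minimal sketch on two Series (left, right and total are illustrative names): the result index is the union of both indexes, and only labels present on both sides get a numeric result.

```python
import numpy as np
import pandas as pd

left = pd.Series(np.arange(3), index=list("abe"))
right = pd.Series([3, 4, 5], index=list("abd"))

total = left + right

print(total.index.tolist())   # union of both indexes
print(total["a"], total["b"])  # labels found on both sides are added
print(total[["d", "e"]].isna().tolist())  # labels on one side only become NaN
```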
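Filling values in arithmetic methods
When an axis label in one object is not found in the other, a special value (such as 0) can be filled in for it. A cell stays NaN only when the label is missing from both objects: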
print(t4.add(t3,fill_value=0))
w x y z
a 0.0 1.0 2.0 NaN
b 3.0 4.0 6.0 2.0
c 6.0 10.0 12.0 5.0
d NaN 6.0 7.0 8.0
When reindexing a Series or DataFrame, a fill value can be specified as well:
print(t4.reindex(columns=t3.columns,fill_value=0))
w x y
b 0 0 1
c 0 3 4
d 0 6 7
Method | Description |
---|---|
add | Method for addition (+) |
sub | Method for subtraction (-) |
div | Method for division (/) |
mul | Method for multiplication (*) |
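A minimal sketch of the four methods in the table, each called with fill_value. The two Series and their values are made up for illustration. Note that a sensible fill value depends on the operation: 0 is neutral for add and sub, 1 for mul and div.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=list("abc"))
b = pd.Series([10.0, 20.0], index=list("bc"))   # label 'a' is missing here

added = a.add(b, fill_value=0)       # a: 1 + 0,  b: 2 + 10, c: 3 + 20
subtracted = a.sub(b, fill_value=0)  # a: 1 - 0,  b: 2 - 10, c: 3 - 20
multiplied = a.mul(b, fill_value=1)  # a: 1 * 1,  b: 2 * 10, c: 3 * 20
divided = a.div(b, fill_value=1)     # a: 1 / 1,  b: 2 / 10, c: 3 / 20

print(added.tolist())
print(subtracted.tolist())
print(multiplied.tolist())
print(divided.tolist())
```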
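Operations between DataFrame and Series
As a NumPy analogy, consider the difference between a two-dimensional array and one of its rows. The row is subtracted from every row of the array (broadcasting):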
t5 = np.arange(12).reshape((3,4))
print(t5)
print(t5[0])
print(t5-t5[0])
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[0 1 2 3]
[[0 0 0 0]
[4 4 4 4]
[8 8 8 8]]
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame, then broadcasts down the rows.
series = t3.loc['a']
print(series)
print(t3 - series)
w 0
x 1
y 2
Name: a, dtype: int32
w x y
a 0 0 0
b 3 3 3
c 6 6 6
If an index value is not found in either the DataFrame's columns or the Series' index, both objects are reindexed to form the union:
series2 = pd.Series(np.arange(3),index=list('wxz'))
print(t3 + series2)
w x y z
a 0.0 2.0 NaN NaN
b 3.0 5.0 NaN NaN
c 6.0 8.0 NaN NaN
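Both cases (matching labels, and labels that only partly overlap) can be sketched side by side. The frame mirrors t3 as defined in the arithmetic section; first_row, other, centered and combined are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=list("abc"), columns=list("wxy"))

# The Series index (w, x, y) matches the columns, so row 'a' is
# subtracted from every row.
first_row = frame.loc["a"]
centered = frame - first_row

# Here only w and x overlap; y and z exist on one side only, so the
# result has the union of the labels and NaN in columns y and z.
other = pd.Series(np.arange(3), index=list("wxz"))
combined = frame + other

print(centered)
print(combined)
```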
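If you want to match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods and pass axis=0: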
series3 = t3['w']
print(series3)
print(t3.sub(series3,axis=0))
a 0
b 3
c 6
Name: w, dtype: int32
w x y
a 0 1 2
b 0 1 2
c 0 1 2
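To see why the method form is needed, the sketch below contrasts the plain operator with sub(..., axis=0). The frame mirrors t3; column_w, naive and by_rows are illustrative names. With the plain operator the Series labels (a, b, c) are compared with the column labels (w, x, y), nothing matches, and every cell is NaN.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=list("abc"), columns=list("wxy"))
column_w = frame["w"]

# Default alignment: Series index vs. DataFrame columns. No label
# overlaps, so the result is all NaN with the union of the labels.
naive = frame - column_w

# axis=0 aligns the Series index with the row index instead.
by_rows = frame.sub(column_w, axis=0)

print(naive.shape)
print(by_rows)
```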
t1 = pd.DataFrame(np.random.randn(4,3),columns=list('xyz'),index=list('abcd'))
print(t1)
print(np.abs(t1))
x y z
a 0.087261 -0.315335 -0.344275
b 0.478766 -2.687121 -0.948351
c 0.864202 -0.188504 -0.024666
d 2.655114 1.365731 2.040676
x y z
a 0.087261 0.315335 0.344275
b 0.478766 2.687121 0.948351
c 0.864202 0.188504 0.024666
d 2.655114 1.365731 2.040676
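A function can also be applied to the one-dimensional arrays formed by each column or each row.
This is what the DataFrame apply method does. By default the function receives each column; with axis=1 it receives each row: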
f = lambda x:x.max() - x.min()
print(t1.apply(f))
print(t1.apply(f,axis=1))
x 2.567853
y 4.052853
z 2.989026
dtype: float64
a 0.431536
b 3.165887
c 1.052705
d 1.289382
dtype: float64
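Since t1 above holds random numbers, here is a small deterministic sketch of the same idea. The data values and the names frame, spread, per_column and per_row are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 4.0, 2.0], "y": [-3.0, 0.5, 5.0]},
                     index=list("abc"))

spread = lambda values: values.max() - values.min()

per_column = frame.apply(spread)          # function sees each column
per_row = frame.apply(spread, axis=1)     # function sees each row

print(per_column.tolist())   # x: 4 - 1, y: 5 - (-3)
print(per_row.tolist())      # a: 1 - (-3), b: 4 - 0.5, c: 5 - 2
```

For a simple statistic like this, frame.max() - frame.min() gives the same column-wise result without apply.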
Besides scalar values, the function passed to apply can also return a Series made up of several values:
def f(x):
    return pd.Series([x.min(), x.max()],index=['min','max'])
print(t1.apply(f))
x y z
min 0.087261 -2.687121 -0.948351
max 2.655114 1.365731 2.040676
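A deterministic sketch of a function that returns a Series: the labels of the returned Series become the row labels of the result. The data and the names frame, min_max and summary are illustrative. The comparison with agg is added only as a cross-check.

```python
import pandas as pd

frame = pd.DataFrame({"x": [1.0, 4.0, 2.0], "y": [-3.0, 0.5, 5.0]},
                     index=list("abc"))

def min_max(column):
    # One Series per column; its index labels the rows of the result.
    return pd.Series([column.min(), column.max()], index=["min", "max"])

summary = frame.apply(min_max)
print(summary)

# The built-in aggregation produces the same table.
print(frame.agg(["min", "max"]))
```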
format = lambda x:'%.2f' % x
t1.applymap(format)
x y z
a 0.09 -0.32 -0.34
b 0.48 -2.69 -0.95
c 0.86 -0.19 -0.02
d 2.66 1.37 2.04
t1['x'].map(format)
a 0.09
b 0.48
c 0.86
d 2.66
Name: x, dtype: object
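The two calls above apply a Python function to every element: applymap for a whole DataFrame and map for a single Series. Recent pandas releases (2.1 and later) deprecate DataFrame.applymap in favour of DataFrame.map, so the sketch below sidesteps the naming difference by mapping each column through Series.map. The data and the names frame, fmt, formatted and one_column are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"x": [0.087261, 0.478766],
                      "y": [-0.315335, -2.687121]}, index=list("ab"))

fmt = lambda value: "%.2f" % value   # format one number as a string

# Element-wise over the whole frame, column by column.
formatted = frame.apply(lambda column: column.map(fmt))

# Element-wise over one column only.
one_column = frame["x"].map(fmt)

print(formatted)
print(one_column.tolist())
```

The results hold strings, not numbers, so this kind of formatting is best kept for display.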
sort_index()
t1 = pd.Series(np.arange(4),index=list('dcab'))
print(t1)
print(t1.sort_index())
d 0
c 1
a 2
b 3
dtype: int32
a 2
b 3
c 1
d 0
dtype: int32
A DataFrame can be sorted by the index on either axis:
t2 = pd.DataFrame(np.arange(8).reshape((2,4)),index=list('ba'),columns=list('dcba'))
print(t2.sort_index())
print(t2.sort_index(axis=1))
d c b a
a 4 5 6 7
b 0 1 2 3
a b c d
b 3 2 1 0
a 7 6 5 4
Data is sorted in ascending order by default; pass ascending=False for descending order:
print(t2.sort_index(axis=1,ascending = False))
d c b a
b 0 1 2 3
a 4 5 6 7
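The three sort_index calls can be checked with a short sketch. The frame mirrors t2 from this section; by_rows, by_cols and descending are illustrative names. Sorting returns a new object, and the data moves together with its labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=list("ba"), columns=list("dcba"))

by_rows = frame.sort_index()                       # sort row labels
by_cols = frame.sort_index(axis=1)                 # sort column labels
descending = frame.sort_index(axis=1, ascending=False)

print(by_rows.index.tolist())
print(by_cols.columns.tolist(), by_cols.loc["b"].tolist())
print(descending.columns.tolist())
print(frame.index.tolist())   # the original is left as it was
```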
To sort a Series by its values, use sort_values(). Missing values are placed at the end by default:
obj = pd.Series([-1,6,3,9])
print(obj.sort_values())
obj = pd.Series([np.nan,-1,np.nan,6,3,9])
print(obj.sort_values())
0 -1
2 3
1 6
3 9
dtype: int64
1 -1.0
4 3.0
3 6.0
5 9.0
0 NaN
2 NaN
dtype: float64
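As a sketch of how missing values are handled, the example below uses the na_position argument of sort_values to move NaN to the front, and shows that NaN stays at the end even for a descending sort. The result names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([np.nan, -1, np.nan, 6, 3, 9])

default = obj.sort_values()                        # NaN last
nan_first = obj.sort_values(na_position="first")   # NaN first
descending = obj.sort_values(ascending=False)      # NaN still last

print(default.dropna().tolist())
print(nan_first.isna().tolist())
print(descending.dropna().tolist())
```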
t2 = pd.DataFrame({
'a':[2,7,5,1],'d':[9,6,10,-1]})
print(t2)
print(t2.sort_values(by=['a','d']))
print(t2.sort_values(by='d'))
a d
0 2 9
1 7 6
2 5 10
3 1 -1
a d
3 1 -1
0 2 9
2 5 10
1 7 6
a d
3 1 -1
1 7 6
0 2 9
2 5 10
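In the example above every value in column a is unique, so the second key in by=['a','d'] never comes into play. The sketch below uses made-up data with ties in column a to show the second key at work, and passes a list to ascending to set the direction per column. All names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"a": [2, 1, 2, 1], "d": [9, 6, 10, -1]})

# Sort by 'a' first; ties in 'a' are broken by 'd' (both ascending).
by_a_then_d = frame.sort_values(by=["a", "d"])

# Same keys, but 'd' in descending order within each value of 'a'.
mixed = frame.sort_values(by=["a", "d"], ascending=[True, False])

print(by_a_then_d.index.tolist())
print(mixed.index.tolist())
```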