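reindex is an important method on pandas objects; it creates a new object whose data conforms to a new index. Calling reindex on a Series rearranges the data according to the new index, introducing missing values for any index labels that were not already present: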
In [1]: import numpy as np
   ...: import pandas as pd
In [2]: obj = pd.Series([4.5, 5.3, -8.2, 4.9], index=['a', 's', 'q', 'f'])
In [3]: obj
Out[3]:
a 4.5
s 5.3
q -8.2
f 4.9
dtype: float64
In [6]: obj2 = obj.reindex(['a', 's', 'f', 'q', 'e'])
In [7]: obj2
Out[7]:
a 4.5
s 5.3
f 4.9
q -8.2
e NaN
dtype: float64
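As a minimal sketch of the behaviour described above, the snippet below rebuilds the same Series and checks that reindex leaves the original untouched. The fill_value argument is not used in the text; it is shown here only as one way to avoid the NaN, and the variable name obj_filled is an assumption made for illustration.

```python
import pandas as pd

# Same Series as in the transcript above.
obj = pd.Series([4.5, 5.3, -8.2, 4.9], index=['a', 's', 'q', 'f'])

# reindex returns a new object; 'e' is not in the original index, so it becomes NaN.
obj2 = obj.reindex(['a', 's', 'f', 'q', 'e'])

# fill_value supplies a replacement for labels that would otherwise be NaN.
obj_filled = obj.reindex(['a', 's', 'f', 'q', 'e'], fill_value=0.0)

print(obj2)
print(obj_filled)
print(list(obj.index))  # the original index order is unchanged
```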
The optional method argument lets us fill in values while reindexing, using an option such as ffill, which forward-fills values:
In [15]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [16]: obj3
Out[16]:
0 blue
2 purple
4 yellow
dtype: object
In [17]: obj3.reindex(range(6), method='ffill')
Out[17]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
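For comparison, here is a small sketch that puts forward filling next to backward filling on the same data. The 'bfill' option is not mentioned in the text and is included only to make the direction of 'ffill' easier to see; the names forward and backward are illustrative.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# 'ffill' carries the last known value forward into the new labels.
forward = obj3.reindex(range(6), method='ffill')

# 'bfill' takes the next known value instead; label 5 has none, so it stays missing.
backward = obj3.reindex(range(6), method='bfill')

print(forward)
print(backward)
```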
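With a DataFrame, reindex can change the row index, the columns, or both. Given only a sequence, it reindexes the rows:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])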
print(frame)
'''
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
'''
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
'''
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
'''
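The example above reindexes rows only. As a sketch of the column case, the snippet below passes the columns keyword of reindex; the list states and the name by_columns are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Hypothetical column list: 'Utah' is new, 'Ohio' is left out.
states = ['Texas', 'Utah', 'California']

# The columns keyword reindexes along the column axis instead of the rows.
by_columns = frame.reindex(columns=states)

print(by_columns)
```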
The drop method returns a new object with the indicated value or values deleted from an axis:
In [25]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [26]: obj
Out[26]:
a 0.0
b 1.0
c 2.0
d 3.0
e 4.0
dtype: float64
In [27]: new_obj = obj.drop('c')
In [28]: new_obj
Out[28]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [29]: obj.drop(['d', 'c'])
Out[29]:
a 0.0
b 1.0
e 4.0
dtype: float64
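With a DataFrame, index values can be deleted from either axis. By default drop removes row labels; pass axis=1 or axis='columns' to remove columns: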
In [32]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
...: index=['Ohio', 'Colorado', 'Utah', 'New York'],
...: columns=['one', 'two', 'three', 'four'])
In [33]: data
Out[33]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [34]: data.drop(['Colorado', 'Ohio'])
Out[34]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
In [35]: data.drop('two', axis=1)
Out[35]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [36]: data.drop(['two', 'four'], axis='columns')
Out[36]:
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
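Many methods that change the size or shape of a Series or DataFrame, drop among them, return a new object by default. Passing inplace=True makes them modify the original object directly instead of returning a new one: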
In [39]: obj.drop('c', inplace=True)
In [40]: obj
Out[40]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
Be careful with the inplace argument: it permanently discards the dropped data from the original object.
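The difference between the two modes can be checked directly. This is a small sketch under the same setup as the transcript; the name new_obj follows the text, everything else is illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default behaviour: a new Series comes back and obj keeps label 'c'.
new_obj = obj.drop('c')
still_there = 'c' in obj.index

# With inplace=True the original object itself loses the label.
obj.drop('c', inplace=True)
gone_now = 'c' not in obj.index

print(still_there, gone_now)
```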
In [41]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [42]: obj
Out[42]:
a 0.0
b 1.0
c 2.0
d 3.0
dtype: float64
In [43]: obj['b']
Out[43]: 1.0
In [44]: obj[2]
Out[44]: 2.0
In [45]: obj[2 : 4]
Out[45]:
c 2.0
d 3.0
dtype: float64
In [46]: obj[obj < 2]
Out[46]:
a 0.0
b 1.0
dtype: float64
In [48]: obj[[1, 3]]
Out[48]:
b 1.0
d 3.0
dtype: float64
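The integer lookups above, such as obj[2] and obj[[1, 3]], rely on plain [] falling back to positions when the index holds strings. To my knowledge, recent pandas releases warn about that fallback, so the sketch below spells out the same selections with the explicit loc and iloc indexers. It also shows that a label slice includes its end label, which a positional slice does not.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

third = obj.iloc[2]          # explicit positional lookup
picked = obj.iloc[[1, 3]]    # positions 1 and 3

by_label = obj.loc['b':'c']  # label slice: both 'b' and 'c' are included
by_position = obj.iloc[1:3]  # positional slice: position 3 is excluded

print(third)
print(picked)
print(by_label.equals(by_position))
```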
In [54]: data  # defined above
Out[54]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [55]: data['three']
Out[55]:
Ohio 2
Colorado 6
Utah 10
New York 14
Name: three, dtype: int32
In [56]: data[['three', 'four']]
Out[56]:
three four
Ohio 2 3
Colorado 6 7
Utah 10 11
New York 14 15
In [57]: data[ : 2]
Out[57]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
In [59]: data[data['three'] > 5]
Out[59]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [60]: data < 5  # returns a DataFrame of booleans
Out[60]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [61]: data[data < 5] = 0  # set every value below 5 to 0
In [62]: data
Out[62]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
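The boolean assignment above changes data itself. If the original values are still needed, one alternative is the DataFrame mask method, which returns a new object. This is only a sketch of that alternative, not something the text uses; the name zeroed is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# mask replaces values where the condition is True and returns a new DataFrame.
zeroed = data.mask(data < 5, 0)

print(zeroed)
print(data.loc['Colorado', 'one'])  # the original value is still in data
```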
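For label-based indexing on the rows of a DataFrame, use the special indexing attributes loc and iloc. They let you select a subset of rows and columns from a DataFrame with NumPy-style syntax, using either axis labels (loc) or integer positions (iloc).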
In [68]: data
Out[68]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [69]: data.loc['Colorado', ['two', 'three']]
Out[69]:
two 5
three 6
Name: Colorado, dtype: int32
In [70]: data.iloc[2, [3, 0, 1]]
Out[70]:
four 11
one 8
two 9
Name: Utah, dtype: int32
In [71]: data.iloc[2]
Out[71]:
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
In [72]: data.iloc[[1,2], [3,0,1]]
Out[72]:
four one two
Colorado 7 4 5
Utah 11 8 9
In [74]: data.loc[:'Utah', 'two']
Out[74]:
Ohio 1
Colorado 5
Utah 9
Name: two, dtype: int32
In [76]: data.iloc[:, :3][data.three > 5]
Out[76]:
one two three
Colorado 4 5 6
Utah 8 9 10
New York 12 13 14
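To make the label/position distinction concrete, the sketch below writes two of the selections above both ways and checks that they agree. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# A label slice includes 'Utah'; the positional slice needs :3 to reach it.
by_label = data.loc[:'Utah', 'two']
by_position = data.iloc[:3, 1]

# Chained selection from the text, and the same thing in a single loc call.
chained = data.iloc[:, :3][data.three > 5]
single = data.loc[data['three'] > 5, ['one', 'two', 'three']]

print(by_label.equals(by_position))
print(chained.equals(single))
```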
Plain [] indexing with integers is ambiguous when the index itself holds integers, and it can fail:
In [77]: ser = pd.Series(np.arange(3.))
In [78]: ser
Out[78]:
0 0.0
1 1.0
2 2.0
dtype: float64
In [79]: ser[-1]  # integer index: -1 is looked up as a label, so this raises
---------------------------------------------------------------------------
KeyError
In [80]: ser
Out[80]:
0 0.0
1 1.0
2 2.0
dtype: float64
In [81]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
In [82]: ser2
Out[82]:
a 0.0
b 1.0
c 2.0
dtype: float64
In [83]: ser2[-1]  # non-integer index: no error
Out[83]: 2.0
In [102]: ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])  # a non-integer index is unambiguous
In [103]: ser2[-1]
Out[103]: 2.0
In [104]: ser[:1]
Out[104]:
0 0.0
dtype: float64
In [105]: ser.loc[:1]
Out[105]:
0 0.0
1 1.0
dtype: float64
In [106]: ser.iloc[:1]
Out[106]:
0 0.0
dtype: float64
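The ambiguity goes away once the indexer states the intent. The following sketch repeats the integer-indexed case with loc and iloc, and wraps the failing ser[-1] lookup in a try block so the script keeps running; the flag raised is an illustrative name.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

# With an integer index, ser[-1] is a label lookup and -1 is not a label.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]         # position-based: always the last element
up_to_label = ser.loc[:1]   # label-based slice: label 1 is included
up_to_pos = ser.iloc[:1]    # position-based slice: position 1 is excluded

print(raised, last, len(up_to_label), len(up_to_pos))
```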
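When you add objects together, if any index pairs are not the same, the index of the result is the union of the index pairs. Internal data alignment introduces missing values at the label locations that do not overlap, and those missing values then propagate through later arithmetic operations.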
In [108]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5],
...: index=['a', 'b', 'd', 'e'])
In [109]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e','f', 'g'])
In [110]: s1
Out[110]:
a 7.3
b -2.5
d 3.4
e 1.5
dtype: float64
In [111]: s2
Out[111]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
In [112]: s1 + s2
Out[112]:
a 5.2
b NaN
c NaN
d NaN
e 0.0
f NaN
g NaN
dtype: float64
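In the DataFrame example below, alignment happens on both the rows and the columns. Because the 'c' and 'e' columns do not appear in both DataFrame objects, they come out entirely missing in the result. The same holds for rows whose labels are not common to both objects.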
In [113]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
...: index=['Ohio', 'Texas', 'Colorado'])
In [116]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [117]: df1
Out[117]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
In [118]: df2
Out[118]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [119]: df1 + df2
Out[119]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
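When the NaN values produced by alignment are not wanted, the add method accepts a fill_value that stands in for a label missing on one side. The text does not cover this; the sketch below is one possible way to do it, shown on the two Series from the earlier example.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'b', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# Plain addition: labels found on only one side become NaN.
plain = s1 + s2

# add with fill_value=0 treats a label missing on one side as 0.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```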
NumPy ufuncs (element-wise array methods) also work with pandas objects:
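In [120]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])  # assumed definition: random values, yours will differ
In [121]: frame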
Out[121]:
b d e
Utah -1.349812 -0.962359 -0.947875
Ohio -0.226425 0.601588 0.045817
Texas 0.594069 0.205601 1.024613
Oregon 0.566535 0.249397 1.449775
In [122]: np.abs(frame)
Out[122]:
b d e
Utah 1.349812 0.962359 0.947875
Ohio 0.226425 0.601588 0.045817
Texas 0.594069 0.205601 1.024613
Oregon 0.566535 0.249397 1.449775
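A quick way to confirm this is to apply a ufunc and inspect what comes back. The small frame below uses made-up values chosen for illustration so the result is reproducible; it is not the random frame from the transcript.

```python
import numpy as np
import pandas as pd

# Illustrative values, not the random data shown above.
small = pd.DataFrame([[-1.5, 2.0], [3.0, -4.0]],
                     index=['Utah', 'Ohio'],
                     columns=['b', 'd'])

# The ufunc is applied element-wise and the labels are kept.
result = np.abs(small)

print(type(result).__name__)
print(result)
```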
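Another frequent operation is applying a function to the one-dimensional arrays that make up each column or row. The DataFrame apply method does exactly this: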
In [123]: f = lambda x : x.max() - x.min()
In [124]: frame.apply(f)
Out[124]:
b 1.943881
d 1.563947
e 2.397650
dtype: float64
In [125]: frame.apply(f, axis='columns')
Out[125]:
Utah 0.401937
Ohio 0.828013
Texas 0.819013
Oregon 1.200378
dtype: float64
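Here the function f, which computes the difference between the maximum and minimum of a Series, is called once for each column of frame. The result is a Series with the columns of frame as its index. Passing axis='columns' calls f once per row instead, so the result is indexed by the row labels.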
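The same two calls can be reproduced on deterministic data so the results are easy to verify by hand. In this sketch the frame is filled with np.arange rather than random numbers, and value_range is an illustrative named version of f.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame used in the text.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def value_range(x):
    """Difference between the largest and smallest value of a Series."""
    return x.max() - x.min()

per_column = frame.apply(value_range)               # called once per column
per_row = frame.apply(value_range, axis='columns')  # called once per row

print(per_column)
print(per_row)
```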
To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:
In [126]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [127]: obj.sort_index()
Out[127]:
a 1
b 2
c 3
d 0
dtype: int32
In [129]: frame = pd.DataFrame(np.arange(8).reshape((2,4)),
...: index=['three', 'one'],
...: columns=['d', 'a', 'b', 'c'])
In [130]: frame
Out[130]:
d a b c
three 0 1 2 3
one 4 5 6 7
In [131]: frame.sort_index()
Out[131]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [132]: frame.sort_index(axis=1)
Out[132]:
a b c d
three 1 2 3 0
one 5 6 7 4
In [133]: frame.sort_index(axis=1, ascending=False)
Out[133]:
d c b a
three 0 3 2 1
one 4 7 6 5
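Because sort_index returns a new object, calls can be chained to sort both axes, and the original frame keeps its order. A short sketch with the same frame follows; sorted_both is an illustrative name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])

# Sort the rows first, then the columns; each call returns a new DataFrame.
sorted_both = frame.sort_index().sort_index(axis=1)

print(sorted_both)
print(list(frame.index), list(frame.columns))  # original order is untouched
```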