1. Reindexing
    from pandas import Series, DataFrame
    import numpy as np

    obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

Output:

    a    1
    b    2
    c    3
    d    4
A Series has a reindex method that returns a new Series arranged according to the given index, so the order of the elements can change.
    # fill_value is the value used for labels that have no entry in the original index
    obj.reindex(['a', 'c', 'd', 'b', 'e'], fill_value=0)

Output:

    a    1
    c    3
    d    4
    b    2
    e    0
    obj2 = Series(['red', 'blue'], index=[0, 4])

Output:

    0     red
    4    blue

    # method='ffill' fills each gap with the previous valid value (forward fill)
    obj2.reindex(range(6), method='ffill')

Output:

    0     red
    1     red
    2     red
    3     red
    4    blue
    5    blue
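The Series examples above can be collected into one self-contained sketch. The names `reordered`, `filled`, `colors`, and `forward_filled` are illustrative and do not come from the original notes; the sketch assumes a reasonably recent pandas installation.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# reindex returns a new Series; obj itself is left unchanged.
# "e" is not in obj, so without fill_value it becomes NaN.
reordered = obj.reindex(["a", "c", "d", "b", "e"])

# With fill_value, the missing label gets the given value instead of NaN.
filled = obj.reindex(["a", "c", "d", "b", "e"], fill_value=0)

# Forward fill: each new label takes the last valid value before it.
colors = pd.Series(["red", "blue"], index=[0, 4])
forward_filled = colors.reindex(range(6), method="ffill")

print(reordered)
print(filled)
print(forward_filled)
```

Note that `method="ffill"` only makes sense when the original index is sorted, as it is here.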
For a DataFrame, reindex can change the rows (the index), the columns, or both.
    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      index=['a', 'c', 'd'],
                      columns=['Ohio', 'Texas', 'California'])

Output:

       Ohio  Texas  California
    a     0      1           2
    c     3      4           5
    d     6      7           8

    # Passing a single list of labels reindexes the rows
    frame2 = frame.reindex(['a', 'b', 'c', 'd'])

Output:

       Ohio  Texas  California
    a   0.0    1.0         2.0
    b   NaN    NaN         NaN
    c   3.0    4.0         5.0
    d   6.0    7.0         8.0

    # states is assumed to be the list below, as implied by the output that follows
    states = ['Texas', 'Utah', 'California']
    # Use the columns keyword to reindex the columns
    frame4 = frame.reindex(columns=states)

Output:

       Texas  Utah  California
    a      1   NaN           2
    c      4   NaN           5
    d      7   NaN           8

    # Reindex rows and columns at the same time
    frame5 = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

Output:

       Texas  Utah  California
    a    1.0   NaN         2.0
    b    NaN   NaN         NaN
    c    4.0   NaN         5.0
    d    7.0   NaN         8.0
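A runnable sketch of the DataFrame case follows. The list `states` is an assumption inferred from the output shown above, and the names `rows_only`, `cols_only`, and `both` are illustrative. It also makes visible a side effect that is easy to miss: introducing NaN into an integer column converts that column to floating point.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# Assumed value: inferred from the printed output, not defined in the notes.
states = ["Texas", "Utah", "California"]

rows_only = frame.reindex(["a", "b", "c", "d"])   # new row "b" is all NaN
cols_only = frame.reindex(columns=states)         # new column "Utah" is all NaN
both = frame.reindex(index=["a", "b", "c", "d"], columns=states)

# Columns that received NaN become float; untouched columns keep their dtype.
print(rows_only.dtypes)
print(cols_only.dtypes)
print(both)
```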
2. Dropping entries from an axis
    obj = Series(np.arange(3.), index=['a', 'b', 'c'])

Output:

    a    0.0
    b    1.0
    c    2.0

    obj.drop(['a', 'b'])

Output:

    c    2.0
    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      index=['a', 'c', 'd'],
                      columns=['Ohio', 'Texas', 'California'])

Output:

       Ohio  Texas  California
    a     0      1           2
    c     3      4           5
    d     6      7           8

    frame.drop(['a'])  # drop a row

Output:

       Ohio  Texas  California
    c     3      4           5
    d     6      7           8

    frame.drop(['Ohio'], axis=1)  # drop a column

Output:

       Texas  California
    a      1           2
    c      4           5
    d      7           8
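One point the examples above do not spell out is that drop returns a new object and leaves the original untouched. The sketch below shows this, along with the `columns=` keyword as an alternative spelling of `axis=1`; the names `without_row`, `without_col`, and `same_col` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

without_row = frame.drop(["a"])             # axis=0 is the default: drops a row label
without_col = frame.drop(["Ohio"], axis=1)  # axis=1 drops a column label
same_col = frame.drop(columns=["Ohio"])     # keyword form of the line above

# The original frame still has all three rows and columns.
print(frame.shape)
print(without_row)
print(without_col)
```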
3. Indexing, selection, and filtering
Series indexing works much like NumPy array indexing, except that the index values of a Series need not be integers.
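    obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

Output:

    a    1
    b    2
    c    3
    d    4

    obj['b']  # select by label

Output:

    2

    obj.iloc[1]  # select by position (older pandas versions also accepted obj[1])

Output:

    2

    obj.iloc[0:3]  # positional slice, end excluded

Output:

    a    1
    b    2
    c    3

    obj.iloc[[0, 3]]  # list of positions

Output:

    a    1
    d    4

    obj[obj < 2]  # boolean filter

Output:

    a    1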
Slicing with labels differs from ordinary Python slicing: the endpoint is included, so the interval is closed.
    obj['b':'d']

Output:

    b    2
    c    3
    d    4
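To make the difference between label slices and positional slices concrete, here is a minimal sketch using the explicit `loc` and `iloc` indexers. The names `by_label`, `by_position`, and `masked` are illustrative.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Label slice: both endpoints are included, so this yields b, c and d.
by_label = obj.loc["b":"d"]

# Positional slice: the end is excluded, as in plain Python, so this yields b and c.
by_position = obj.iloc[1:3]

# Boolean mask: keeps only the entries where the condition is True.
masked = obj[obj < 2]

print(by_label)
print(by_position)
print(masked)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about whether an integer is meant as a label or as a position.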
DataFrame indexing: indexing into a DataFrame retrieves one or more columns:
    frame = DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Ohio', 'Colorado', 'Utah', 'New York'],
                      columns=['one', 'two', 'three', 'four'])

Output:

              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15

    frame['two']

Output:

    Ohio         1
    Colorado     5
    Utah         9
    New York    13

    frame[:2]  # slicing selects rows, not columns

Output:

              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
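The following sketch gathers the common DataFrame selection forms in one place. The boolean filter and the `loc` lookup go slightly beyond the examples above and are included as illustrations only; all result names are hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

one_col = frame["two"]                 # a single column, returned as a Series
two_cols = frame[["one", "three"]]     # a list of names returns a DataFrame
first_rows = frame[:2]                 # a slice selects rows by position
filtered = frame[frame["three"] > 5]   # a boolean Series selects matching rows
cell = frame.loc["Utah", "two"]        # loc selects by row label and column label

print(one_col)
print(two_cols)
print(first_rows)
print(filtered)
print(cell)
```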
4. Arithmetic and data alignment
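One of the most important features of pandas is that it can perform arithmetic on objects with different indexes. When two objects are added and their indexes differ, the index of the result is the union of the two indexes; labels present in only one operand produce NaN.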
    s1 = Series([1, 2, 3], ['a', 'b', 'c'])
    s2 = Series([4, 5, 6], ['b', 'c', 'd'])
    s1 + s2

Output:

    a    NaN
    b    6.0
    c    8.0
    d    NaN
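When NaN for non-overlapping labels is not what you want, the arithmetic methods accept a `fill_value`. The sketch below contrasts the `+` operator with `Series.add`; this goes a step beyond the example above, and the names `plain` and `filled` are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([4, 5, 6], index=["b", "c", "d"])

# Operator form: labels found in only one Series give NaN.
plain = s1 + s2

# Method form: a label missing from one side is treated as 0 before adding.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

With `fill_value=0`, "a" keeps its value from `s1` and "d" keeps its value from `s2`, while the overlapping labels are summed as before.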
For a DataFrame, alignment happens on both the rows and the columns:
    df1 = DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
    df2 = DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))
    df1

Output:

         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0

    df2

Output:

          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   6.0   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0

    df1 + df2

Output:

          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
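The same `fill_value` idea applies to DataFrames through `DataFrame.add`. This is a sketch that extends the example above rather than something stated in the notes; `plain` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=list("abcde"))

# Operator form: any cell missing from either frame becomes NaN.
plain = df1 + df2

# Method form: a cell missing from one frame is treated as 0, so the
# other frame's value passes through unchanged.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

A cell stays NaN under `fill_value` only when it is missing from both frames; here `df2` covers every row and column of the result, so no NaN remains.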
Now consider how arithmetic between a DataFrame and a Series works:
    arr = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
    arr

Output:

         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0

    # arr[0] would raise a KeyError, because [] looks up a column label;
    # select a row with iloc (by position) or loc (by label)
    series = arr.iloc[0]

Output:

    a    0.0
    b    1.0
    c    2.0
    d    3.0

    # By default, DataFrame/Series arithmetic matches the Series index
    # against the DataFrame columns and broadcasts down the rows
    arr - series

Output:

         a    b    c    d
    0  0.0  0.0  0.0  0.0
    1  4.0  4.0  4.0  4.0
    2  8.0  8.0  8.0  8.0

    series2 = Series(range(3), index=list('cdf'))

Output:

    c    0
    d    1
    f    2

    # By the same rule, columns without a match become NaN
    arr + series2

Output:

        a   b     c     d   f
    0 NaN NaN   2.0   4.0 NaN
    1 NaN NaN   6.0   8.0 NaN
    2 NaN NaN  10.0  12.0 NaN

    series3 = arr['d']

Output:

    0     3.0
    1     7.0
    2    11.0

    # To match on the rows and broadcast across the columns instead,
    # use an arithmetic method. The axis argument names the axis to match on:
    # axis=0 matches the Series index against the DataFrame row index.
    arr.sub(series3, axis=0)

Output:

         a    b    c    d
    0 -3.0 -2.0 -1.0  0.0
    1 -3.0 -2.0 -1.0  0.0
    2 -3.0 -2.0 -1.0  0.0
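The two broadcasting directions can be put side by side in a short sketch. The string forms `axis="columns"` and `axis="index"` are equivalent to `axis=1` and `axis=0`; the names `first_row`, `col_d`, `by_columns`, and `by_rows` are illustrative.

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))

first_row = arr.iloc[0]   # Series indexed by the column labels a, b, c, d
col_d = arr["d"]          # Series indexed by the row labels 0, 1, 2

# Default: match the Series index to the columns, broadcast down the rows.
by_columns = arr - first_row

# axis="index": match the Series index to the rows, broadcast across the columns.
by_rows = arr.sub(col_d, axis="index")

print(by_columns)
print(by_rows)
```

The operator form `arr - first_row` behaves like `arr.sub(first_row, axis="columns")`, so the method is only needed when matching on rows.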
5. Function application and mapping