表8-2 merge函数的参数
indicator 添加特殊的列_merge,它可以指明每个行的来源,它的值有left_only、right_only或
both,根据每行的合并数据的来源。
索引上的合并
有时候,DataFrame中的连接键位于其索引中。在这种情况下,你可以传入left_index=True或
right_index=True(或两个都传)以说明索引应该被用作连接键:
In [56]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
....: 'value': range(6)})
In [57]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
In [58]: left1
Out[58]:
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
In [59]: right1
Out[59]:
group_val
a 3.5
b 7.0
In [60]: pd.merge(left1, right1, left_on='key', right_index=True)
Out[60]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
由于默认的merge方法是求取连接键的交集,因此你可以通过外连接的方式得到它们的并集:
In [61]: pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
Out[61]:
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
5 c 5 NaN
对于层次化索引的数据,事情就有点复杂了,因为索引的合并默认是多键合并:
In [62]: lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
....: 'Nevada', 'Nevada'],
....: 'key2': [2000, 2001, 2002, 2001, 2002],
....: 'data': np.arange(5.)})
In [63]: righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
....: index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
....: 'Ohio', 'Ohio'],
....: [2001, 2000, 2000, 2000, 2001, 2002]],
....: columns=['event1', 'event2'])
In [64]: lefth
Out[64]:
data key1 key2
0 0.0 Ohio 2000
1 1.0 Ohio 2001
2 2.0 Ohio 2002
3 3.0 Nevada 2001
4 4.0 Nevada 2002
In [65]: righth
Out[65]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
这种情况下,你必须以列表的形式指明用作合并键的多个列(注意用how='outer'对重复索引值的
处理):
In [66]: pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
Out[66]:
data key1 key2 event1 event2
0 0.0 Ohio 2000 4 5
0 0.0 Ohio 2000 6 7
1 1.0 Ohio 2001 8 9
2 2.0 Ohio 2002 10 11
3 3.0 Nevada 2001 0 1
In [67]: pd.merge(lefth, righth, left_on=['key1', 'key2'],
....: right_index=True, how='outer')
Out[67]:
data key1 key2 event1 event2
0 0.0 Ohio 2000 4.0 5.0
0 0.0 Ohio 2000 6.0 7.0
1 1.0 Ohio 2001 8.0 9.0
2 2.0 Ohio 2002 10.0 11.0
3 3.0 Nevada 2001 0.0 1.0
4 4.0 Nevada 2002 NaN NaN
4 NaN Nevada 2000 2.0 3.0
同时使用合并双方的索引也没问题:
In [68]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
....: index=['a', 'c', 'e'],
....: columns=['Ohio', 'Nevada'])
In [69]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
....: index=['b', 'c', 'd', 'e'],
....: columns=['Missouri', 'Alabama'])
In [70]: left2
Out[70]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
In [71]: right2
Out[71]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
In [72]: pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Out[72]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
DataFrame还有一个便捷的join实例方法,它能更为方便地实现按索引合并。它还可用于合并多个
带有相同或相似索引的DataFrame对象,但要求没有重叠的列。在上面那个例子中,我们可以编
写:
In [73]: left2.join(right2, how='outer')
Out[73]:
Ohio Nevada Missouri Alabama
a 1.0 2.0 NaN NaN
b NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0
d NaN NaN 11.0 12.0
e 5.0 6.0 13.0 14.0
因为一些历史版本的遗留原因,DataFrame的join方法默认使用的是左连接,保留左边表的行索
引。它还支持在调用的DataFrame的列上,连接传递的DataFrame索引:
In [74]: left1.join(right1, on='key')
Out[74]:
key value group_val
0 a 0 3.5
1 b 1 7.0
2 a 2 3.5
3 a 3 3.5
4 b 4 7.0
5 c 5 NaN
最后,对于简单的索引合并,你还可以向join传入一组DataFrame,下一节会介绍更为通用的
concat函数,也能实现此功能:
In [75]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
....: index=['a', 'c', 'e', 'f'],
....: columns=['New York',
'Oregon'])
In [76]: another
Out[76]:
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
In [77]: left2.join([right2, another])
Out[77]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
In [78]: left2.join([right2, another], how='outer')
Out[78]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
b NaN NaN 7.0 8.0 NaN NaN
c 3.0 4.0 9.0 10.0 9.0 10.0
d NaN NaN 11.0 12.0 NaN NaN
e 5.0 6.0 13.0 14.0 11.0 12.0
f NaN NaN NaN NaN 16.0 17.0
轴向连接
另一种数据合并运算也被称作连接(concatenation)、绑定(binding)或堆叠(stacking)。
NumPy的concatenation函数可以用NumPy数组来做:
In [79]: arr = np.arange(12).reshape((3, 4))
In [80]: arr
Out[80]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [81]: np.concatenate([arr, arr], axis=1)
Out[81]:
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
对于pandas对象(如Series和DataFrame),带有标签的轴使你能够进一步推广数组的连接运
算。具体点说,你还需要考虑以下这些东西:
如果对象在其它轴上的索引不同,我们应该合并这些轴的不同元素还是只使用交集?
连接的数据集是否需要在结果对象中可识别?
连接轴中保存的数据是否需要保留?许多情况下,DataFrame默认的整数标签最好在连接时删
掉。
pandas的concat函数ᨀ 供了一种能够解决这些问题的可靠方式。我将给出一些例子来讲解其使用
方式。假设有三个没有重叠索引的Series:
In [82]: s1 = pd.Series([0, 1], index=['a', 'b'])
In [83]: s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
In [84]: s3 = pd.Series([5, 6], index=['f', 'g'])
对这些对象调用concat可以将值和索引粘合在一起:
In [85]: pd.concat([s1, s2, s3])
Out[85]:
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
默认情况下,concat是在axis=0上工作的,最终产生一个新的Series。如果传入axis=1,则结果就
会变成一个DataFrame(axis=1是列):
In [86]: pd.concat([s1, s2, s3], axis=1)
Out[86]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
这种情况下,另外的轴上没有重叠,从索引的有序并集(外连接)上就可以看出来。传入
join='inner'即可得到它们的交集: I
n [87]: s4 = pd.concat([s1, s3])
In [88]: s4
Out[88]:
a 0
b 1
f 5
g 6
dtype: int64
In [89]: pd.concat([s1, s4], axis=1)
Out[89]:
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
In [90]: pd.concat([s1, s4], axis=1, join='inner')
Out[90]:
0 1
a 0 0
b 1 1
在这个例子中,f和g标签消失了,是因为使用的是join='inner'选项。
你可以通过join_axes指定要在其它轴上使用的索引:
In [91]: pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
Out[91]:
0 1
a 0.0 0.0
c NaN NaN
b 1.0 1.0
e NaN NaN
不过有个问题,参与连接的片段在结果中区分不开。假设你想要在连接轴上创建一个层次化索引。
使用keys参数即可达到这个目的:
In [92]: result = pd.concat([s1, s1, s3], keys=['one','two', 'three'])
In [93]: result
Out[93]:
one a 0
b 1
two a 0
b 1
three f 5
g 6
dtype: int64
In [94]: result.unstack()
Out[94]:
a b f g
one 0.0 1.0 NaN NaN
two 0.0 1.0 NaN NaN
three NaN NaN 5.0 6.0
如果沿着axis=1对Series进行合并,则keys就会成为DataFrame的列头:
In [95]: pd.concat([s1, s2, s3], axis=1, keys=['one','two', 'three'])
Out[95]:
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
同样的逻辑也适用于DataFrame对象:
In [96]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
....: columns=['one', 'two'])
In [97]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
....: columns=['three', 'four'])
In [98]: df1
Out[98]:
one two
a 0 1
b 2 3
c 4 5
In [99]: df2
Out[99]:
three four
a 5 6
c 7 8
In [100]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
Out[100]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
如果传入的不是列表而是一个字典,则字典的键就会被当做keys选项的值:
In [101]: pd.concat({'level1': df1, 'level2': df2}, axis=1)
Out[101]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
此外还有两个用于管理层次化索引创建方式的参数(参见表8-3)。举个例子,我们可以用names
参数命名创建的轴级别:
In [102]: pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
.....: names=['upper', 'lower'])
Out[102]:
upper level1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
最后一个关于DataFrame的问题是,DataFrame的行索引不包含任何相关数据:
In [103]: df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
In [104]: df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
In [105]: df1
Out[105]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
In [106]: df2
Out[106]:
b d a
0 0.476985 3.248944 -1.021228
1 -0.577087 0.124121 0.302614
在这种情况下,传入ignore_index=True即可:
In [107]: pd.concat([df1, df2], ignore_index=True)
Out[107]:
a b c d
0 1.246435 1.007189 -1.296221 0.274992
1 0.228913 1.352917 0.886429 -2.001637
2 -0.371843 1.669025 -0.438570 -0.539741
3 -1.021228 0.476985 NaN 3.248944
4 0.302614 -0.577087 NaN 0.124121
合并重叠数据
还有一种数据组合问题不能用简单的合并(merge)或连接(concatenation)运算来处理。比如
说,你可能有索引全部或部分重叠的两个数据集。举个有启发性的例子,我们使用NumPy的where
函数,它表示一种等价于面向数组的if-else:
In [108]: a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
.....: index=['f', 'e', 'd', 'c', 'b', 'a'])
In [109]: b = pd.Series(np.arange(len(a), dtype=np.float64),
.....: index=['f', 'e', 'd', 'c', 'b', 'a'])
In [110]: b[-1] = np.nan
In [111]: a
Out[111]:
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
In [112]: b
Out[112]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
In [113]: np.where(pd.isnull(a), b, a)
Out[113]: array([ 0. , 2.5, 2. , 3.5, 4.5, nan])
Series有一个combine_first方法,实现的也是一样的功能,还带有pandas的数据对齐:
In [114]: b[:-2].combine_first(a[2:])
Out[114]:
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
对于DataFrame,combine_first自然也会在列上做同样的事情,因此你可以将其看做:用传递对象
中的数据为调用对象的缺失数据“打补丁”:
In [115]: df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
.....: 'b': [np.nan, 2., np.nan, 6.],
.....: 'c': range(2, 18, 4)})
In [116]: df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
.....: 'b': [np.nan, 3., 4., 6., 8.]})
In [117]: df1
Out[117]:
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
In [118]: df2
Out[118]:
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
In [119]: df1.combine_first(df2)
Out[119]:
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN