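High-performance, in-memory join and merge operations are one of the core features of Pandas.
The pd.merge() function implements three types of join: one-to-one, many-to-one, and many-to-many. All three are invoked through the same pd.merge() interface; which one is performed depends on the structure of the input data.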
In [1] :import pandas as pd
In [2] :df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],'hire_date': [2004, 2008, 2012, 2014]})
In [3] :df1
Out[3] :
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
In [4] :df2
Out[4] :
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
Both DataFrames share the column 'employee'. To combine them into a single DataFrame, use pd.merge(), which automatically uses the shared column as the join key:
In [5] :df3 = pd.merge(df1,df2)
df3
Out[5] :
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
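Merging the two DataFrames produces a new DataFrame. Note that when merging on columns, pd.merge() discards the original row index by default and generates a new integer index.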
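The index behaviour described above can be checked with a small sketch. The frames `left` and `right` and their custom index labels are hypothetical, chosen only to make the reset visible:

```python
import pandas as pd

# Hypothetical inputs with non-default row labels.
left = pd.DataFrame({'employee': ['Bob', 'Jake'],
                     'group': ['Accounting', 'Engineering']},
                    index=[10, 20])
right = pd.DataFrame({'employee': ['Jake', 'Bob'],
                      'hire_date': [2012, 2008]},
                     index=[7, 8])

# Merging on the shared 'employee' column drops both original indexes.
merged = pd.merge(left, right)
print(merged.index.tolist())
```

Neither 10/20 nor 7/8 survives in the result; if the original labels matter, one option is to merge on the index instead.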
A many-to-one join is one in which one of the two key columns contains duplicate values. The resulting DataFrame preserves those duplicates.
In [6] :df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],'supervisor': ['Carly', 'Guido', 'Steve']})
df4
Out[6] :
group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve
In [7] :pd.merge(df3,df4)
Out[7] :
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
Because the group column of df3 contains 'Engineering' twice, the matching supervisor value from df4 is repeated for each of those rows.
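If a merge is expected to be many-to-one, the `validate` parameter of pd.merge() can assert that assumption. The following is a minimal sketch; the names `employees`, `supervisors`, and `bad` are illustrative and not part of the session above:

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# Passes: 'group' repeats on the left but is unique on the right.
checked = pd.merge(employees, supervisors, validate='many_to_one')

# Hypothetical bad input: 'Engineering' now appears twice on the right.
bad = pd.concat([supervisors, supervisors.iloc[[1]]], ignore_index=True)
try:
    pd.merge(employees, bad, validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(len(checked), raised)
```

Without `validate`, the second merge would silently become a many-to-many join and return more rows than expected.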
Likewise, if the key column contains duplicate values in both the left and the right input, the result is a many-to-many join.
In [8] :df5 = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 'Engineering', 'HR', 'HR'],'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
df5
Out[8] :
group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization
In [9] :pd.merge(df1,df5)
Out[9] :
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
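The row count of a many-to-many result can be predicted: each key contributes (occurrences on the left) × (occurrences on the right) rows. A small sketch that rebuilds df1 and df5 from above and checks this; the helper names `left_counts`, `right_counts`, and `expected_rows` are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

merged = pd.merge(df1, df5)

# Count how often each key occurs on each side, then multiply per key.
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()
# Keys present on only one side become NaN and are dropped (inner join).
expected_rows = int((left_counts * right_counts).dropna().sum())

print(expected_rows, len(merged))
```

Here the counts are 1×2 for Accounting, 2×2 for Engineering, and 1×2 for HR, which gives the eight rows shown above.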
These three types of join can be combined with other Pandas tools to implement a wide range of functionality.
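All of the merges above join two DataFrames that share a column name. In practice, however, it is more common to merge two DataFrames whose key columns mean the same thing but are named differently. pd.merge() provides several parameters to handle this.
The simplest option is to set the on parameter to a column name or a list of column names. This works only when both DataFrames contain the specified column(s):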
In [10] :pd.merge(df1, df2, on='employee')
Out[10] :
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
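The text mentions that `on` also accepts a list of column names, which the session does not demonstrate. A minimal sketch with made-up data (`sales`, `targets`, and their values are hypothetical):

```python
import pandas as pd

# Hypothetical data: a row is identified by the (year, region) pair.
sales = pd.DataFrame({'year': [2020, 2020, 2021],
                      'region': ['East', 'West', 'East'],
                      'revenue': [10, 20, 30]})
targets = pd.DataFrame({'year': [2020, 2021, 2021],
                        'region': ['East', 'East', 'West'],
                        'target': [12, 28, 25]})

# Rows match only when both 'year' and 'region' agree.
both = pd.merge(sales, targets, on=['year', 'region'])
print(both)
```

Only (2020, East) and (2021, East) appear in both frames, so the inner join keeps two rows.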
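Sometimes you need to merge two datasets with different column names, for example when the employee name column is called "name" rather than "employee". In that case, use the left_on and right_on parameters to specify the two column names: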
In [11] :df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'salary': [70000, 80000, 120000, 90000]})
In [12] :pd.merge(df1, df3, left_on="employee", right_on="name")
Out[12] :
employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000
Besides merging on columns, you may also need to merge on the index, as with the following data:
In [13] :df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
In [14] :df1a
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
In [14] :df2a
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
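In [15] :pd.merge(df1a,df2a,left_index=True,right_index=True) # both indexes serve as keys; setting only one flag without a key for the other side (left_on/right_on) raises an error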
Out[15] :
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
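The two index flags do not have to be used together: an index on one side can be matched against a column on the other. A sketch under that assumption, where `salaries` is an illustrative frame that keeps its key in a regular column:

```python
import pandas as pd

df1a = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}
                    ).set_index('employee')
salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

# Left key: the index of df1a. Right key: the 'name' column.
mixed = pd.merge(df1a, salaries, left_index=True, right_on='name')

# Giving a key for only one side is what triggers the error.
try:
    pd.merge(df1a, salaries, left_index=True)
    raised = False
except pd.errors.MergeError:
    raised = True
print(mixed.columns.tolist(), raised)
```

When both keys are indexes, `df1a.join(df2a)` is a shorter spelling of the index-on-index merge shown above.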
When merging two DataFrames, pd.merge() by default keeps only the keys found in both inputs (the intersection). This is called an inner join.
In [15] :df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],'food': ['fish', 'beans', 'bread']},columns=['name', 'food'])
In [16] :df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],'drink': ['wine', 'beer']},columns=['name', 'drink'])
In [17] :df6
Out[17] :
name food
0 Peter fish
1 Paul beans
2 Mary bread
In [18] :df7
Out[18] :
name drink
0 Mary wine
1 Joseph beer
In [19] :pd.merge(df6,df7)
Out[19] :
name food drink
0 Mary bread wine
The join type can be set with the how parameter, which defaults to 'inner':
In [20] :pd.merge(df6, df7, how='inner')
Out[20] :
name food drink
0 Mary bread wine
An outer join (how='outer') returns the union of the keys and fills missing values with NaN:
In [21] :pd.merge(df6,df7,how="outer")
Out[21] :
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
A left join (how='left') keeps every row of the left DataFrame and only the matching rows of the right:
In [22] :pd.merge(df6,df7,how="left")
Out[22] :
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
A right join (how='right') keeps every row of the right DataFrame and only the matching rows of the left:
In [23] :pd.merge(df6,df7,how="right")
Out[23] :
name food drink
0 Mary bread wine
1 Joseph NaN beer
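To see which input each row of a join came from, pd.merge() accepts `indicator=True`, which adds a `_merge` column. A short sketch using df6 and df7 from above; the names `audit` and `origin` are illustrative:

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

# '_merge' is 'left_only', 'right_only', or 'both' for each row.
audit = pd.merge(df6, df7, how='outer', indicator=True)
origin = audit.set_index('name')['_merge'].astype(str).to_dict()
print(origin)
```

Filtering on `_merge == 'both'` reproduces the inner join, while the rows marked 'left_only' or 'right_only' are the ones an inner join discards. The row order of an outer join can differ between pandas versions, so the sketch looks values up by name rather than by position.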
Finally, the two input DataFrames may have conflicting column names. Consider the following example:
In [24] :df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],'rank': [3, 1, 4, 2]})
In [25] :df8
Out[25] :
name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4
In [25] :df9
Out[25] :
name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2
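# When every column name is shared, pd.merge() uses all of them as keys, so only rows that match on every column are kept; specify the key explicitly with on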
In [26] :pd.merge(df8,df9)
Out[26] :
name rank
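The empty result follows from the default key selection: with no `on` argument, pd.merge() joins on every column name the two frames share. A sketch that makes this explicit; `shared`, `implicit`, and `explicit` are illustrative names:

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

# Column names present in both frames.
shared = df8.columns.intersection(df9.columns).tolist()

implicit = pd.merge(df8, df9)             # no key given
explicit = pd.merge(df8, df9, on=shared)  # same keys spelled out

print(shared, implicit.empty, explicit.empty)
```

No (name, rank) pair occurs in both frames, so both merges come back empty.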
In [27] :pd.merge(df8,df9,on="name")
Out[27] :
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
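Because the output would contain two conflicting column names, pd.merge() automatically appends the suffixes _x and _y to them. Custom suffixes can be set with the suffixes parameter: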
In [28] :pd.merge(df8,df9,on="name",suffixes=["_L","_R"])
Out[28] :
name rank_L rank_R
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2