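A one-to-one join with pd.merge() automatically uses the column that the two inputs share ('employee' in this example) as the key. Even when the entries of that column appear in a different order in the two inputs, merge accounts for this and lines up the corresponding rows.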
>>>df1 = pd.DataFrame({'employee':['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
>>>df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
>>>df3 = pd.merge(df1, df2)
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
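The matching above is done on key values, not on row position or row labels. A minimal sketch of that, assuming custom row labels ('a'..'d' and 'w'..'z') that are not in the original example and are added here only to make the effect visible:

```python
import pandas as pd

# Same data as df1 and df2, but with made-up row labels (an assumption for
# this illustration) so we can see what happens to them during the merge.
left = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']},
                    index=['a', 'b', 'c', 'd'])
right = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                      'hire_date': [2004, 2008, 2012, 2014]},
                     index=['w', 'x', 'y', 'z'])

merged = pd.merge(left, right)

# Rows are paired by the value in 'employee', whatever their order was.
hire_by_employee = dict(zip(merged['employee'], merged['hire_date']))
print(hire_by_employee)

# The input row labels are not carried over to the result.
print(list(merged.index))
```

When merging on columns like this, the result gets a fresh integer index, so any information stored in the original index has to be moved into a column first if it should survive.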
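If the key column of one input contains duplicate values, the DataFrame produced by the many-to-one join keeps those duplicates. In the example below, the supervisor is repeated for every employee in the same group.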
>>>df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve
>>>pd.merge(df3, df4)
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
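If it matters that a join really is many-to-one (or one-to-one), pd.merge() can check this through its validate parameter. The sketch below is one possible way to use it; the variable names employees and supervisors are chosen here for illustration and hold the same kind of data as the example above.

```python
import pandas as pd
from pandas.errors import MergeError

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                          'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                            'supervisor': ['Carly', 'Guido', 'Steve']})

# 'group' repeats on the left and is unique on the right, so the
# many-to-one check passes and the merge proceeds as usual.
checked = pd.merge(employees, supervisors, validate='many_to_one')
print(checked)

# Declaring the same join one-to-one is wrong ('Engineering' appears twice
# on the left), so pandas raises MergeError instead of merging.
try:
    pd.merge(employees, supervisors, validate='one_to_one')
    one_to_one_ok = True
except MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

This is mostly useful as a guard: an unexpected duplicate in the key column then fails loudly instead of silently multiplying rows.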
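If the key column contains duplicate values in both the left and the right input, the result is a many-to-many join.
Put simply, every left row is paired with every right row that has the same key, and the result gets a new integer index, so all the data from both inputs ends up, in order, in one new DataFrame object.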
>>>df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR','HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
>>>pd.merge(df1, df5)
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
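The size of a many-to-many result can be predicted: for each key, the number of matching left rows is multiplied by the number of matching right rows. A small sketch of that arithmetic, reusing the df1 and df5 data from above (the helper names left_counts, right_counts, and expected_rows are introduced here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

# How often each key occurs on each side.
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()

# Per key: left occurrences times right occurrences, then summed.
expected_rows = int((left_counts * right_counts).sum())

merged = pd.merge(df1, df5)
print(expected_rows, len(merged))
```

Here Accounting contributes 1 x 2 rows, Engineering 2 x 2, and HR 1 x 2, which accounts for the eight rows in the output above.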
The on parameter names the key column explicitly. It can only be used when both DataFrame objects have a column with that name.
>>>pd.merge(df1, df2, on='employee')
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
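When the key column has a different name in each of the two objects, use the left_on and right_on parameters to name the column to merge on in each input.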
>>>df3 = pd.DataFrame({'name':['Bob', 'Jake', 'Lisa', 'Sue'],
'salary':[70000, 80000, 120000, 90000]})
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
>>>df1
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
>>>pd.merge(df1, df3, left_on='employee', right_on='name')
employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000
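The result above keeps both key columns, so 'employee' and 'name' carry the same values. If that redundancy is unwanted, one option is to drop one of them with DataFrame.drop(). A minimal sketch, with the salary table stored under the illustrative name salaries:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
salaries = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                         'salary': [70000, 80000, 120000, 90000]})

# Merge on differently named key columns, then remove the duplicate key.
merged = pd.merge(df1, salaries, left_on='employee', right_on='name')
merged = merged.drop('name', axis=1)
print(merged)
```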
To merge on the index instead of a column, set left_index and/or right_index to True.
>>>df1a = df1.set_index('employee')
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
>>>df2a = df2.set_index('employee')
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
>>>pd.merge(df1a, df2a, left_index=True, right_index=True)
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
You can also use the DataFrame.join() method, which merges on the index by default:
>>>df1a.join(df2a)
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
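The two calls give the same table here because every key is present on both sides. They differ in their defaults, though: join() performs a left join unless told otherwise, while pd.merge() performs an inner join. The sketch below illustrates the difference by removing one row from the right input; the name partial is an assumption introduced only for this example.

```python
import pandas as pd

df1a = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']}
                    ).set_index('employee')
df2a = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                     'hire_date': [2004, 2008, 2012, 2014]}
                    ).set_index('employee')

# With identical key sets, join and an index-based merge agree.
joined = df1a.join(df2a)
merged = pd.merge(df1a, df2a, left_index=True, right_index=True)
same = joined.to_dict() == merged.to_dict()
print(same)

# Remove 'Sue' from the right side to expose the different defaults.
partial = df2a.drop('Sue')
joined_partial = df1a.join(partial)    # left join: Sue is kept
merged_partial = pd.merge(df1a, partial,
                          left_index=True, right_index=True)  # inner: Sue is dropped
print(len(joined_partial), len(merged_partial))
```

Passing how='inner' to join(), or how='left' to pd.merge(), makes the two behave alike again.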
The two styles can also be mixed, using the index on one side and a column on the other:
>>>pd.merge(df1a, df3, left_index=True, right_on='name')
group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000
By default, the object returned by merge contains only the intersection of the keys in the two inputs. This is called an inner join.
>>>df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
name food
0 Peter fish
1 Paul beans
2 Mary bread
>>>df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
name drink
0 Mary wine
1 Joseph beer
>>>pd.merge(df6, df7)
name food drink
0 Mary bread wine
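The how parameter sets the join type; the default is 'inner'. With how='outer' the result contains the union of the keys, and missing values are filled with NaN.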
>>>pd.merge(df6, df7, how='outer')
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
how also accepts 'left' and 'right', which keep every key of the left or the right input respectively.
>>>pd.merge(df6, df7, how='left')
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
>>>pd.merge(df6, df7, how='right')
name food drink
0 Mary bread wine
1 Joseph NaN beer
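When comparing join types it can help to see which input each row came from. pd.merge() has an indicator parameter for this, which adds a '_merge' column to the result. A short sketch using the df6 and df7 data from above (the dictionary name origin is introduced here for illustration):

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'.
both_sides = pd.merge(df6, df7, how='outer', indicator=True)

# Map each name to the side(s) it was found on.
origin = dict(zip(both_sides['name'], both_sides['_merge'].astype(str)))
print(origin)
```

Rows marked 'both' are the ones an inner join would return; a left join keeps 'both' and 'left_only', and a right join keeps 'both' and 'right_only'.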
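When the two inputs have conflicting column names, merge automatically appends a suffix to each of them (_x and _y by default). The suffixes can also be set with the suffixes parameter, for example suffixes=["_L", "_R"].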
>>>df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank':[1, 2, 3, 4]})
>>>df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank':[3, 1, 4, 2]})
>>>pd.merge(df8, df9, on='name')
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
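For comparison with the default _x and _y output above, here is the same merge with custom suffixes, as a minimal sketch using the same df8 and df9 data:

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

# The first suffix is applied to the left input's conflicting columns,
# the second to the right input's.
merged = pd.merge(df8, df9, on='name', suffixes=['_L', '_R'])
print(list(merged.columns))
```

Only the conflicting column 'rank' is renamed; the key column 'name' is shared and keeps its name.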