Merging and joining data with pd.merge() in pandas

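I. Types of joins

1. One-to-one joins

When two DataFrames share a column name, pd.merge() automatically uses that column (here, employee) as the join key. The key values do not have to appear in the same order in both inputs: merge matches rows by key value, not by position.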

>>> import pandas as pd
>>> df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
...                     'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
>>> df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
...                     'hire_date': [2004, 2008, 2012, 2014]})
>>> df3 = pd.merge(df1, df2)
>>> df3
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
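
If you want pandas to confirm that a merge really is one-to-one, the validate parameter of pd.merge() can check it. The sketch below is illustrative only: validate is not used in the original text, and the names checked, df2_dup, and raised are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# validate='one_to_one' asks pandas to check that the key is unique in both inputs
checked = pd.merge(df1, df2, validate='one_to_one')

# Hypothetical bad input: repeat the first row of df2 so 'Lisa' appears twice
df2_dup = pd.concat([df2, df2.iloc[[0]]], ignore_index=True)
try:
    pd.merge(df1, df2_dup, validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    # The duplicated key breaks the one-to-one assumption
    raised = True
```

Without validate, the second merge would succeed silently and return an extra row for Lisa.
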
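
2. Many-to-one joins

If the key column in one of the two inputs contains duplicate values, the join is many-to-one, and the resulting DataFrame keeps those duplicates. In the example below, the supervisor Guido is repeated for both Engineering employees.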

>>> df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
...                     'supervisor': ['Carly', 'Guido', 'Steve']})
>>> df4
         group supervisor
0   Accounting      Carly
1  Engineering      Guido
2           HR      Steve
>>> pd.merge(df3, df4)
  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
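
3. Many-to-many joins

If the key column contains duplicate values in both the left and the right input, the result is a many-to-many join.
Put simply, every left row is paired with every right row that has the same key, and the result gets a new integer index, so each entry from both inputs ends up in the new DataFrame.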

>>> df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
...                     'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
>>> pd.merge(df1, df5)
  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
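
One way to see where the eight rows come from: for each key, the merge produces (number of left rows) x (number of right rows). The following is a rough sketch of that count; the variable names left_counts, right_counts, and expected_rows are illustrative and not part of the original example.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})

merged = pd.merge(df1, df5)

# Count how often each key appears on each side
left_counts = df1['group'].value_counts()
right_counts = df5['group'].value_counts()

# Per-key product, summed over all keys (the two Series align on the key)
expected_rows = int((left_counts * right_counts).sum())
```

Here the per-key products are 1 x 2 for Accounting, 2 x 2 for Engineering, and 1 x 2 for HR, which matches the length of the merged result.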

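II. Merge parameters

1. The on parameter

on names the key column explicitly. It can only be used when both DataFrames contain a column with that name.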

>>> pd.merge(df1, df2, on='employee')
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
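
To illustrate the restriction, here is a small sketch of what happens when the column named in on exists in only one of the two inputs. The flag name raised is just for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# 'employee' exists in both inputs, so this works
ok = pd.merge(df1, df2, on='employee')

# 'group' exists only in df1, so pandas cannot find the key in df2
try:
    pd.merge(df1, df2, on='group')
    raised = False
except KeyError:
    raised = True
```

When the key columns are named differently on each side, left_on and right_on are the parameters to reach for instead.
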
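
2. The left_on and right_on parameters

When the key columns have different names in the two DataFrames, use left_on and right_on to name the key column on each side.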

>>> df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
...                     'salary': [70000, 80000, 120000, 90000]})
>>> df3
   name  salary
0   Bob   70000
1  Jake   80000
2  Lisa  120000
3   Sue   90000
>>> df1
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
>>> pd.merge(df1, df3, left_on='employee', right_on='name')
  employee        group  name  salary
0      Bob   Accounting   Bob   70000
1     Jake  Engineering  Jake   80000
2     Lisa  Engineering  Lisa  120000
3      Sue           HR   Sue   90000
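
The merged result carries both key columns (employee and name), which hold the same values. If the duplicate is not wanted, one option is to drop it afterwards. This is a minimal sketch; the variable name merged is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Merge on differently named key columns, then drop the redundant right-hand key
merged = (pd.merge(df1, df3, left_on='employee', right_on='name')
            .drop(columns='name'))
```

Renaming the column before merging (for example with DataFrame.rename) would be another way to reach the same result.
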
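
3. The left_index and right_index parameters

Set left_index=True and/or right_index=True to use the index as the merge key instead of a column.
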
>>> df1a = df1.set_index('employee')
>>> df1a
                group
employee
Bob        Accounting
Jake      Engineering
Lisa      Engineering
Sue                HR
>>> df2a = df2.set_index('employee')
>>> df2a
          hire_date
employee
Lisa           2004
Bob            2008
Jake           2012
Sue            2014
>>> pd.merge(df1a, df2a, left_index=True, right_index=True)
                group  hire_date
employee
Bob        Accounting       2008
Jake      Engineering       2012
Lisa      Engineering       2004
Sue                HR       2014

The DataFrame.join() method also merges on the index by default:

>>> df1a.join(df2a)
                group  hire_date
employee
Bob        Accounting       2008
Jake      Engineering       2012
Lisa      Engineering       2004
Sue                HR       2014

Index and column keys can also be combined, for example left_index with right_on:

>>> pd.merge(df1a, df3, left_index=True, right_on='name')
         group  name  salary
0   Accounting   Bob   70000
1  Engineering  Jake   80000
2  Engineering  Lisa  120000
3           HR   Sue   90000
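
As a quick check that the index-based merge and join() agree on this data, the two results can be compared directly. This is only a sketch: the names via_merge, via_join, and same are illustrative, and both results are sorted by index first so the comparison does not depend on row order.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

# Merge on the index of both inputs (inner join by default)
via_merge = pd.merge(df1a, df2a, left_index=True, right_index=True)

# join() also matches on the index, but defaults to a left join
via_join = df1a.join(df2a)

# Every key appears on both sides here, so inner and left joins give the same rows
same = via_merge.sort_index().equals(via_join.sort_index())
```

The two would differ if some employee were missing from df2a: join() would keep that row with NaN, while the default merge would drop it.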

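III. Specifying set arithmetic for joins

By default, the result of merge contains only the intersection of the keys in the two inputs. This is called an inner join.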

>>> df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
...                     'food': ['fish', 'beans', 'bread']},
...                    columns=['name', 'food'])
>>> df6
    name   food
0  Peter   fish
1   Paul  beans
2   Mary  bread
>>> df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
...                     'drink': ['wine', 'beer']},
...                    columns=['name', 'drink'])
>>> df7
     name drink
0    Mary  wine
1  Joseph  beer
>>> pd.merge(df6, df7)
   name   food drink
0  Mary  bread  wine

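The how parameter sets the join type; the default is 'inner'.

1. how='outer' is an outer join: the result contains the union of the keys in the two inputs, and missing values are filled with NaN.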
>>> pd.merge(df6, df7, how='outer')
     name   food drink
0   Peter   fish   NaN
1    Paul  beans   NaN
2    Mary  bread  wine
3  Joseph    NaN  beer

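how also accepts 'left' and 'right'.

2. how='left' is a left join: the result keeps every key from the left input, and columns that come from the right input are NaN where there is no match: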
>>> pd.merge(df6, df7, how='left')
    name   food drink
0  Peter   fish   NaN
1   Paul  beans   NaN
2   Mary  bread  wine
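
3. how='right' is a right join: the result keeps every key from the right input, and columns that come from the left input are NaN where there is no match: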
>>> pd.merge(df6, df7, how='right')
     name   food drink
0    Mary  bread  wine
1  Joseph    NaN  beer
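
To see which input each row of a join came from, pd.merge() offers an indicator parameter that adds a _merge column. It is not part of the original walkthrough; the sketch below uses it only to make the four join types easier to compare, and the names outer and origin are illustrative.

```python
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
outer = pd.merge(df6, df7, how='outer', indicator=True)

# Map each name to the side(s) it was found on
origin = dict(zip(outer['name'], outer['_merge'].astype(str)))
```

Rows marked both are the ones an inner join would keep; a left join keeps both plus left_only, and a right join keeps both plus right_only.
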
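
4. Overlapping column names: the suffixes parameter

When the two inputs have conflicting non-key column names, merge automatically appends the suffixes _x and _y. You can set your own with the suffixes parameter, for example suffixes=["_L", "_R"].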

>>> df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
...                     'rank': [1, 2, 3, 4]})
>>> df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
...                     'rank': [3, 1, 4, 2]})
>>> pd.merge(df8, df9, on='name')
   name  rank_x  rank_y
0   Bob       1       3
1  Jake       2       1
2  Lisa       3       4
3   Sue       4       2
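
For comparison, here is a short sketch of the same merge with custom suffixes. The variable name merged is illustrative.

```python
import pandas as pd

df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})

# The first suffix is applied to the left input's column, the second to the right's
merged = pd.merge(df8, df9, on='name', suffixes=['_L', '_R'])
```

The result has the columns name, rank_L, and rank_R instead of the default rank_x and rank_y.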
