27. Pandas的数据拼接-merge函数
concat函数可以实现内外连接,而pandas的merge函数可以真正实现数据库的内外连接,且外连接还可以有左右连接的特性。
merge函数默认拼接数据是inner join即内连接。下面以学生选课为例,设计两个DataFrame通过merge函数来拼接合并。
import pandas as pd
import numpy as np
col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"],["CS", 101, "WangMa"],["CAD", 102, "Liping"]]
val2 = [[100, 20181115],[100, 20181116],[101, 20181117]]
course = pd.DataFrame(val1, columns = col1)
print "***course", "*" * 38
print course
choose = pd.DataFrame(val2, columns = col2)
print "***choose", "*" * 38
print choose
print "***course merge choose", "*" * 25
print course.merge(choose)
print "***choose merge course", "*" * 25
print choose.merge(course)
程序的执行结果:
***course **************************************
class_name class_id class_lecturer
0 IT 100 Wangli
1 CS 101 WangMa
2 CAD 102 Liping
***choose **************************************
class_id stu_id
0 100 20181115
1 100 20181116
2 101 20181117
***course merge choose *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 CS 101 WangMa 20181117
***choose merge course *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
merge的outer连接方式。结果是两个DataFrame均输出,未匹配上的用NaN填充。
import pandas as pd
import numpy as np
col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"],["CS", 101, "WangMa"],["CAD", 102, "Liping"], ["ME", 103, "Wufang"],["IT", 104, "Xiaomin"]]
val2 = [[100, 20181115],[100, 20181116],[101, 20181117]]
course = pd.DataFrame(val1, columns = col1)
print "***course", "*" * 38
print course
choose = pd.DataFrame(val2, columns = col2)
print "***choose", "*" * 38
print choose
print "***course merge choose in inner", "*" * 25
print course.merge(choose, how = "inner")
print "***course merge choose in outer", "*" * 25
print course.merge(choose, how = "outer")
print "***choose merge course in inner", "*" * 25
print choose.merge(course, how = "inner")
print "***choose merge course in outer", "*" * 25
print choose.merge(course, how = "outer")
程序执行结果:
***course **************************************
class_name class_id class_lecturer
0 IT 100 Wangli
1 CS 101 WangMa
2 CAD 102 Liping
3 ME 103 Wufang
4 IT 104 Xiaomin
***choose **************************************
class_id stu_id
0 100 20181115
1 100 20181116
2 101 20181117
***course merge choose in inner *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 CS 101 WangMa 20181117
***course merge choose in outer *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 CS 101 WangMa 20181117
3 CAD 102 Liping NaN
4 ME 103 Wufang NaN
5 IT 104 Xiaomin NaN
***choose merge course in inner *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
***choose merge course in outer *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
3 102 NaN CAD Liping
4 103 NaN ME Wufang
5 104 NaN IT Xiaomin
merge的左右连接,这里调用merge的Dataframe是“左表”而连接即作为形参的是DataFrame是“右表”。左连接左表全输出而右表能匹配的输出,匹配不上的填充NaN,同理右连接时“右表”全输出,而左表匹配上输出,匹配不上填充NaN。
import pandas as pd
import numpy as np
col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"],["CS", 101, "WangMa"],["CAD", 102, "Liping"], ["ME", 103, "Wufang"],["IT", 104, "Xiaomin"]]
val2 = [[100, 20181115],[100, 20181116],[101, 20181117],[100, 20181118],[101, 20181119], [200, 20181120]]
course = pd.DataFrame(val1, columns = col1)
print "***course", "*" * 38
print course
choose = pd.DataFrame(val2, columns = col2)
print "***choose", "*" * 38
print choose
print "***course merge choose in left", "*" * 25
print course.merge(choose, how = "left")
print "***course merge choose in right", "*" * 25
print course.merge(choose, how = "right")
print "***choose merge course in left", "*" * 25
print choose.merge(course, how = "left")
print "***choose merge course in right", "*" * 25
print choose.merge(course, how = "right")
程序执行结果:
***course **************************************
class_name class_id class_lecturer
0 IT 100 Wangli
1 CS 101 WangMa
2 CAD 102 Liping
3 ME 103 Wufang
4 IT 104 Xiaomin
***choose **************************************
class_id stu_id
0 100 20181115
1 100 20181116
2 101 20181117
3 100 20181118
4 101 20181119
5 200 20181120
***course merge choose in left *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 IT 100 Wangli 20181118
3 CS 101 WangMa 20181117
4 CS 101 WangMa 20181119
5 CAD 102 Liping NaN
6 ME 103 Wufang NaN
7 IT 104 Xiaomin NaN
***course merge choose in right *************************
class_name class_id class_lecturer stu_id
0 IT 100 Wangli 20181115
1 IT 100 Wangli 20181116
2 IT 100 Wangli 20181118
3 CS 101 WangMa 20181117
4 CS 101 WangMa 20181119
5 NaN 200 NaN 20181120
***choose merge course in left *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 101 20181117 CS WangMa
3 100 20181118 IT Wangli
4 101 20181119 CS WangMa
5 200 20181120 NaN NaN
***choose merge course in right *************************
class_id stu_id class_name class_lecturer
0 100 20181115 IT Wangli
1 100 20181116 IT Wangli
2 100 20181118 IT Wangli
3 101 20181117 CS WangMa
4 101 20181119 CS WangMa
5 102 NaN CAD Liping
6 103 NaN ME Wufang
7 104 NaN IT Xiaomin
请注意[200, 20181120]这条选课数据,课程id为200在course里并不存在。而["CAD", 102, "Liping"], ["ME", 103, "Wufang"],["IT", 104, "Xiaomin"]这三门课没有学生选。
由此可见,merge函数的left join、right join和数据库的表的left join、right join的概念完全匹配。