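I have recently been learning data analysis with Pandas and am turning my study notes into a series of articles. This section covers merging DataFrames in Pandas.
Pandas' merge is the counterpart of SQL's JOIN: it combines different tables into a single table by matching them on a key.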
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
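The signature lists an indicator parameter that the rest of these notes do not exercise. As a minimal sketch, assuming two small made-up frames (the names left, right and the values are illustrative only), it can be used to see which side each result row came from:

```python
import pandas as pd

# Made-up sample frames for illustration; K2 exists only on the left, K3 only on the right.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K1", "K3"], "B": ["B0", "B1", "B3"]})

# indicator=True adds a "_merge" column whose values are
# "both", "left_only" or "right_only".
merged = left.merge(right, how="outer", on="key", indicator=True)
print(merged)
```

Filtering on the "_merge" column is a convenient way to find keys that failed to match on one side.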
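Movie rating dataset (MovieLens 1M, the ml-1m directory loaded below)
It is a good dataset for recommender-system research and contains three files: ratings.dat, users.dat, and movies.dat.
Read the data first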
import pandas as pd
# A multi-character separator such as "::" is interpreted by pandas as a regular expression,
# which the default C engine does not support. Passing engine="python" selects the
# Python parser explicitly and avoids the fallback warning.
df_ratings = pd.read_csv(
    "./datas/ml-1m/ratings.dat",
    sep="::",
    engine="python",
    names="UserID::MovieID::Rating::Timestamp".split("::"),
)
df_ratings.head()
df_users = pd.read_csv(
    "./datas/ml-1m/users.dat",
    sep="::",
    engine="python",
    names="UserID::Gender::Age::Occupation::Zip-code".split("::"),
)
df_users.head()
df_movies = pd.read_csv(
    "./datas/ml-1m/movies.dat",
    sep="::",
    engine="python",
    names="MovieID::Title::Genres".split("::"),
)
df_movies.head()
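For readers without the dataset on disk, the same read_csv call can be tried on an in-memory string. This is only a sketch: the two rows below are invented values in the ratings.dat layout, not real records.

```python
import io

import pandas as pd

# Invented rows in the "UserID::MovieID::Rating::Timestamp" layout (not real data).
sample = io.StringIO("1::10::5::1000000000\n1::20::3::1000000500\n")

# sep="::" is longer than one character, so it is parsed as a regular
# expression and needs the Python engine.
df_sample = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["UserID", "MovieID", "Rating", "Timestamp"],
)
print(df_sample)
```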
df_ratings_users = pd.merge(
    df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)
df_ratings_users.head()
df_ratings_users_movies = pd.merge(
    df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
)
df_ratings_users_movies.head(10)
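In the merges above the key column has the same name on both sides, so on="UserID" would work equally well. left_on and right_on become necessary when the names differ. A minimal sketch, assuming two tiny invented tables (the column name user_id is hypothetical and not part of the dataset):

```python
import pandas as pd

# Invented miniature tables; "user_id" is a hypothetical column name for illustration.
ratings = pd.DataFrame({"UserID": [1, 1, 2], "Rating": [5, 3, 4]})
users = pd.DataFrame({"user_id": [1, 2], "Gender": ["F", "M"]})

# Differently named key columns are matched with left_on / right_on.
merged = pd.merge(ratings, users, left_on="UserID", right_on="user_id", how="inner")

# Both key columns are kept in the result, so drop the redundant one.
merged = merged.drop(columns="user_id")
print(merged)
```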
The following key relationships need to be understood correctly: one-to-one, one-to-many, and many-to-many.
# 2.1 One-to-one relationship
left = pd.DataFrame({
    'sno': [11, 12, 13, 14],
    'name': ['a', 'b', 'c', 'd']
})
left
right = pd.DataFrame({
    'sno': [11, 12, 13, 14],
    'age': ['21', '22', '23', '24']
})
right
pd.merge(left, right, on='sno')
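# 2.2 One-to-many relationship
# Note: rows from the "one" side are duplicated
left = pd.DataFrame({
    'sno': [11, 12, 13, 14],
    'name': ['a', 'b', 'c', 'd']
})
left
right = pd.DataFrame({
    'sno': [11, 11, 11, 12, 12, 13],
    'grade': ['Chinese88', 'Math90', 'English75', 'Chinese66', 'Math55', 'English29']
})
right
# Each left row is repeated once per matching right row;
# sno 14 has no match and is dropped by the default inner join
pd.merge(left, right, on='sno')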
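The validate parameter from the signature can assert the relationship you expect instead of leaving it to inspection. The following is a sketch on the same student data (grade strings translated to English for readability), not a required step:

```python
import pandas as pd

left = pd.DataFrame({"sno": [11, 12, 13, 14], "name": ["a", "b", "c", "d"]})
right = pd.DataFrame({
    "sno": [11, 11, 11, 12, 12, 13],
    "grade": ["Chinese88", "Math90", "English75", "Chinese66", "Math55", "English29"],
})

# Passes: keys are unique on the left, repeated on the right.
merged = pd.merge(left, right, on="sno", validate="one_to_many")

# Fails: the right side has duplicate keys, so it is not one-to-one.
error_message = None
try:
    pd.merge(left, right, on="sno", validate="one_to_one")
except pd.errors.MergeError as exc:
    error_message = str(exc)

print(len(merged))
print(error_message)
```

Checking the relationship up front catches unexpected duplicate keys before they silently inflate the row count.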
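# 2.3 Many-to-many relationship
# Note: the row counts multiply (for each key: left matches x right matches)
left = pd.DataFrame({
    'sno': [11, 11, 12, 12, 12],
    'hobby': ['basketball', 'badminton', 'table tennis', 'basketball', 'football']
})
left
right = pd.DataFrame({
    'sno': [11, 11, 11, 12, 12, 13],
    'grade': ['Chinese88', 'Math90', 'English75', 'Chinese66', 'Math55', 'English29']
})
right
pd.merge(left, right, on='sno')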
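The multiplication can be checked by hand: for each key, the number of result rows is the count on the left times the count on the right. A small sketch of that arithmetic, using only the key columns from the example above:

```python
import pandas as pd

left = pd.DataFrame({"sno": [11, 11, 12, 12, 12]})
right = pd.DataFrame({"sno": [11, 11, 11, 12, 12, 13]})

# Occurrences of each key on each side.
left_counts = left["sno"].value_counts()
right_counts = right["sno"].value_counts()

# Per-key product; a key missing on one side contributes 0 rows to an inner join.
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

merged = pd.merge(left, right, on="sno")
print(expected_rows, len(merged))
```

Here key 11 contributes 2 x 3 rows and key 12 contributes 3 x 2 rows, while key 13 appears only on the right and contributes none.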
left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})
right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})
left
right
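# 3.1 inner join (the default)
# Only keys present on both the left and the right appear in the result.
# on is omitted, so merge joins on the shared column 'key'.
pd.merge(left, right, how='inner')
# 3.2 left join
# Every left row appears in the result; unmatched right-side columns are NaN
pd.merge(left, right, how='left')
# 3.3 right join
# Every right row appears in the result; unmatched left-side columns are NaN
pd.merge(left, right, how='right')
# 3.4 outer join
# Rows from both sides appear in the result; whatever cannot be matched is NaN
pd.merge(left, right, how='outer')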
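Because the two sample frames above contain exactly the same four keys, all four join types return the same four rows. The difference only shows when the keys overlap partially. A minimal sketch, assuming the right frame has keys K4 and K5 in place of K2 and K3 (a change made here purely for illustration):

```python
import pandas as pd

left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
})
# Assumed variation: K4 and K5 exist only on the right.
right = pd.DataFrame({
    "key": ["K0", "K1", "K4", "K5"],
    "C": ["C0", "C1", "C4", "C5"],
})

# Run each join type and record the resulting frame.
results = {how: pd.merge(left, right, on="key", how=how)
           for how in ["inner", "left", "right", "outer"]}

for how, frame in results.items():
    print(how, len(frame))
```

With this data the inner join keeps only K0 and K1, the left and right joins keep all four rows of their own side, and the outer join keeps all six distinct keys, filling the gaps with NaN.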
left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})
right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
})
left
right
pd.merge(left,right,on='key')
pd.merge(left,right,on='key',suffixes=('_left','_right'))
That covers the basic usage of merge for pandas DataFrames. I hope it helps.