pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
         copy=True, indicator=False, validate=None)
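left, right : DataFrame or named Series, that is, the two tables to merge.
With pandas.merge(), both structures are passed in, as left and right. With DataFrame.merge(), only right is passed in,
and the calling DataFrame and right act as the left and right structures respectively.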
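As a small sketch of the two calling styles, the snippet below merges the same pair of frames once through the function and once through the method. The frames `students` and `ages` and their columns are made up for illustration and are not part of the data sets used later.

```python
import pandas as pd

# Hypothetical example frames (assumed for illustration only).
students = pd.DataFrame({"sno": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
ages = pd.DataFrame({"sno": [1, 2, 4], "age": [20, 21, 22]})

# Function form: both structures are passed explicitly.
via_function = pd.merge(students, ages, on="sno")

# Method form: the calling frame is the left side, the argument is the right side.
via_method = students.merge(ages, on="sno")

# Both calls should produce the same inner join on "sno".
print(via_function.equals(via_method))
```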
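how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
how sets the type of join, that is, which keys end up in the result.
inner (the default): keep only the keys present in both structures.
left: keep all keys from the left structure.
right: keep all keys from the right structure.
outer: keep the union of the keys from both structures.
Values that do not exist on one side are filled with NaN.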
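The four join types can be compared by looking at which keys survive each merge. This is a minimal sketch with made-up frames `left_df` and `right_df` (assumed names and values, not from the data sets used later).

```python
import pandas as pd

# Hypothetical frames: keys "b" and "c" are shared, "a" is left-only, "d" is right-only.
left_df = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right_df = pd.DataFrame({"key": ["b", "c", "d"], "rval": [20, 30, 40]})

# Collect the keys that appear in the result for each join type.
keys_by_how = {
    how: sorted(pd.merge(left_df, right_df, on="key", how=how)["key"])
    for how in ("inner", "left", "right", "outer")
}
for how, keys in keys_by_how.items():
    print(how, keys)

# In the outer join, the side that lacks a key contributes NaN.
outer = pd.merge(left_df, right_df, on="key", how="outer")
print(outer)
```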
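on : label or list
on names the key column(s) to merge on.
The key(s) given in on must exist in both structures.
If on is not given and the merge is not on indexes, the columns common to both structures are used as keys.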
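When the key columns carry different names on the two sides, `on` cannot be used and `left_on` / `right_on` from the signature take its place. A minimal sketch, with hypothetical frames `orders` and `people`:

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently.
orders = pd.DataFrame({"user_id": [1, 2, 2], "item": ["pen", "ink", "pad"]})
people = pd.DataFrame({"uid": [1, 2], "name": ["Ann", "Bob"]})

# on="user_id" fails because the right frame has no column with that name.
try:
    pd.merge(orders, people, on="user_id")
    on_failed = False
except KeyError:
    on_failed = True

# left_on / right_on name the key on each side separately.
merged = pd.merge(orders, people, left_on="user_id", right_on="uid")
print(on_failed)
print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.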
indicator : bool or str, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
The column records whether the key of each row came from the left structure only, the right structure only, or both.
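A short sketch of `indicator` on an outer join, again with made-up frames. Passing a string instead of True is shown as well; the column name `source` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical frames: "a" is left-only, "b" is shared, "c" is right-only.
left_df = pd.DataFrame({"key": ["a", "b"], "lval": [1, 2]})
right_df = pd.DataFrame({"key": ["b", "c"], "rval": [20, 30]})

# indicator=True adds a "_merge" column with left_only / right_only / both.
flagged = pd.merge(left_df, right_df, on="key", how="outer", indicator=True)
origin = dict(zip(flagged["key"], flagged["_merge"].astype(str)))
print(origin)

# A string value is used as the name of the indicator column instead.
named = pd.merge(left_df, right_df, on="key", how="outer", indicator="source")
print(list(named.columns))
```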
left_index : bool, default False
Use the index from the left DataFrame as the join key(s).
With left_index=True, the merge uses the index of the left structure as the key instead of one of its columns.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.
When right_index is used together with left_index, the merge is performed on the indexes of both structures.
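The index-based options can be sketched as follows. The frames are hypothetical; the second call combines a key column on the left with the index on the right, which is one way to mix the two styles.

```python
import pandas as pd

# Hypothetical frames keyed by their index.
left_df = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right_df = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# Merge on both indexes: only K0 and K2 appear on both sides.
by_index = pd.merge(left_df, right_df, left_index=True, right_index=True, how="inner")
print(by_index)

# Mixed form: a column on the left is matched against the index on the right.
lookup = pd.DataFrame({"key": ["K2", "K0"], "B": ["B2", "B0"]})
mixed = pd.merge(lookup, right_df, left_on="key", right_index=True)
matched = dict(zip(mixed["key"], mixed["C"]))
print(matched)
```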
suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).
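suffixes handles the case where the two structures being merged have non-key columns with the same name.
With the given suffixes, the same-named columns get different names after the merge, so both can be kept.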
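A minimal sketch of the suffix behaviour: both hypothetical frames below have a non-key column named `score`, so the merge has to rename them.

```python
import pandas as pd

# Hypothetical frames sharing the non-key column name "score".
left_df = pd.DataFrame({"key": ["K0", "K1"], "score": [90, 80]})
right_df = pd.DataFrame({"key": ["K0", "K1"], "score": [70, 60]})

# Default suffixes: the overlapping columns become score_x and score_y.
default = pd.merge(left_df, right_df, on="key")
print(list(default.columns))

# Custom suffixes give the columns more descriptive names.
custom = pd.merge(left_df, right_df, on="key", suffixes=("_left", "_right"))
print(list(custom.columns))
```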
The data in ratings.dat looks like this:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
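import os
import pandas as pd

os.chdir(r'C:\Users\Hans\Desktop\data_analysis\test_data\movie')
# Note: each line of ratings.dat has four fields but only three names are given,
# so pandas uses the first field (the user ID) as the index and the labels shift
# by one field: 'UserID' holds the movie ID and 'MovieID' holds the rating.
df_ratings = pd.read_csv('ratings.dat', sep='::', engine='python',
                         names='UserID::MovieID::Timestamp'.split('::'))
df_ratings.head()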
|  | UserID | MovieID | Timestamp |
|---|---|---|---|
| 1 | 1193 | 5 | 978300760 |
| 1 | 661 | 3 | 978302109 |
| 1 | 914 | 3 | 978301968 |
| 1 | 3408 | 4 | 978300275 |
| 1 | 2355 | 5 | 978824291 |
The data in users.dat looks like this:
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
df_users = pd.read_csv('users.dat',sep = '::',engine='python',names = 'UserID::Gender::Age::Occupation::Zip-code'.split('::'))
df_users.head()
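'''
sep is the field separator. A single-character separator is treated as a literal separator,
but a separator of two or more characters may be interpreted as a regular expression,
in which case engine='python' has to be specified.
names is the list of column names, built here with str.split.
'''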
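The `sep` and `names` behaviour described above can be tried without the data files. The sketch below feeds the first lines of ratings.dat quoted earlier through `io.StringIO` (an in-memory stand-in for the file, assumed here for illustration). Those lines have four `::`-separated fields, so the second call supplies four names; the label `Rating` for the third field is an assumption based on the usual MovieLens layout.

```python
import io
import pandas as pd

# First lines of ratings.dat as quoted in the text.
RAW = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)

# Three names for four fields: the first field becomes the index
# and the remaining labels shift by one field.
shifted = pd.read_csv(io.StringIO(RAW), sep="::", engine="python",
                      names=["UserID", "MovieID", "Timestamp"])
print(shifted)

# Four names for four fields: each label lines up with its own field.
# "Rating" is an assumed label for the third field.
ratings = pd.read_csv(io.StringIO(RAW), sep="::", engine="python",
                      names=["UserID", "MovieID", "Rating", "Timestamp"])
print(ratings)
```

With the four-name version, a later merge on `UserID` would compare user IDs on both sides rather than movie IDs against user IDs.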
|  | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
The data in movies.dat looks like this:
1::Toy Story (1995)::Animation|Children’s|Comedy
2::Jumanji (1995)::Adventure|Children’s|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
df_movies = pd.read_csv('movies.dat',sep = '::',engine='python',names = 'MovieID::Title::Genres'.split('::'))
df_movies.head()
|  | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
df_ratings_users = pd.merge(df_ratings,df_users,left_on = 'UserID',right_on = 'UserID',how = 'inner')
df_ratings_users.head()
|  | UserID | MovieID | Timestamp | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|---|---|
| 0 | 1193 | 5 | 978300760 | M | 25 | 12 | 90712 |
| 1 | 1193 | 5 | 978298413 | M | 25 | 12 | 90712 |
| 2 | 1193 | 4 | 978220179 | M | 25 | 12 | 90712 |
| 3 | 1193 | 4 | 978199279 | M | 25 | 12 | 90712 |
| 4 | 1193 | 5 | 978158471 | M | 25 | 12 | 90712 |
df_ratings_users.loc[df_ratings_users['UserID'] == 1193].tail()
|  | UserID | MovieID | Timestamp | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|---|---|
| 1720 | 1193 | 5 | 956713500 | M | 25 | 12 | 90712 |
| 1721 | 1193 | 5 | 956710879 | M | 25 | 12 | 90712 |
| 1722 | 1193 | 5 | 956710766 | M | 25 | 12 | 90712 |
| 1723 | 1193 | 4 | 956709215 | M | 25 | 12 | 90712 |
| 1724 | 1193 | 4 | 957716612 | M | 25 | 12 | 90712 |
df_users.loc[df_users['UserID']==1193]
|  | UserID | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 1192 | 1193 | M | 25 | 12 | 90712 |
df_ratings.loc[df_ratings['UserID']==1193].tail()
|  | UserID | MovieID | Timestamp |
|---|---|---|---|
| 6033 | 1193 | 5 | 956713500 |
| 6035 | 1193 | 5 | 956710879 |
| 6036 | 1193 | 5 | 956710766 |
| 6037 | 1193 | 4 | 956709215 |
| 6040 | 1193 | 4 | 957716612 |
import pandas as pd
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
left
|  | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right
|  | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
pd.merge(left, right, on=['key1', 'key1'], how='inner')
|  | key1 | key2_x | A | B | key2_y | C | D |
|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | K0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | K0 | C0 | D0 |
| 2 | K1 | K0 | A2 | B2 | K0 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | K0 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | K0 | C3 | D3 |
pd.merge(left, right, on='key1', how='inner')
|  | key1 | key2_x | A | B | key2_y | C | D |
|---|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | K0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | K0 | C0 | D0 |
| 2 | K1 | K0 | A2 | B2 | K0 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | K0 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | K0 | C3 | D3 |
pd.merge(left, right, on=['key1', 'key2'], how='inner')
|  | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | K0 | A2 | B2 | C1 | D1 |
| 2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(right,left, on=['key1', 'key2'], how='inner')
|  | key1 | key2 | C | D | A | B |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 | A0 | B0 |
| 1 | K1 | K0 | C1 | D1 | A2 | B2 |
| 2 | K1 | K0 | C2 | D2 | A2 | B2 |
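'''
Cardinality of the key relationship in a merge:
(1) one-to-one: the join key is unique on both sides,
e.g. (student ID, name) merge (student ID, age)
Result: 1*1 row per key
(2) one-to-many: the key is unique on the left but not on the right,
e.g. (student ID, name) merge (student ID, [Chinese score, math score, English score])
Result: 1*N rows per key
(3) many-to-many: the key is unique on neither the left nor the right,
e.g. (student ID, [Chinese score, math score, English score]) merge (student ID, [basketball, football, table tennis])
Result: M*N rows per key
'''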
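The three relationships can be checked with the `validate` argument from the signature, which raises an error when the keys do not have the declared cardinality. This is a sketch with made-up student data; the frame and column names are assumptions.

```python
import pandas as pd

# Hypothetical data: student 1 has three scores and two sports, student 2 has one of each.
students = pd.DataFrame({"sno": [1, 2], "name": ["Ann", "Bob"]})
ages = pd.DataFrame({"sno": [1, 2], "age": [20, 21]})
scores = pd.DataFrame({"sno": [1, 1, 1, 2],
                       "subject": ["Chinese", "math", "English", "math"],
                       "score": [90, 85, 88, 70]})
clubs = pd.DataFrame({"sno": [1, 1, 2],
                      "sport": ["basketball", "football", "table tennis"]})

# one-to-one: 1*1 row per key, 2 rows in total.
one_to_one = pd.merge(students, ages, on="sno", validate="one_to_one")
# one-to-many: 1*N rows per key, 3 + 1 = 4 rows in total.
one_to_many = pd.merge(students, scores, on="sno", validate="one_to_many")
# many-to-many: M*N rows per key, 3*2 + 1*1 = 7 rows in total.
many_to_many = pd.merge(scores, clubs, on="sno", validate="many_to_many")
print(len(one_to_one), len(one_to_many), len(many_to_many))

# Declaring the wrong relationship raises MergeError.
try:
    pd.merge(scores, clubs, on="sno", validate="one_to_one")
    violation_caught = False
except pd.errors.MergeError:
    violation_caught = True
print(violation_caught)
```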
import pandas as pd
import numpy as np
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
left
|  | key1 | key2 | A | B |
|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 |
| 1 | K0 | K1 | A1 | B1 |
| 2 | K1 | K0 | A2 | B2 |
| 3 | K2 | K1 | A3 | B3 |
right
|  | key1 | key2 | C | D |
|---|---|---|---|---|
| 0 | K0 | K0 | C0 | D0 |
| 1 | K1 | K0 | C1 | D1 |
| 2 | K1 | K0 | C2 | D2 |
| 3 | K2 | K0 | C3 | D3 |
pd.merge(left,right,how = 'left')
|  | key1 | key2 | A | B | C | D |
|---|---|---|---|---|---|---|
| 0 | K0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K0 | K1 | A1 | B1 | NaN | NaN |
| 2 | K1 | K0 | A2 | B2 | C1 | D1 |
| 3 | K1 | K0 | A2 | B2 | C2 | D2 |
| 4 | K2 | K1 | A3 | B3 | NaN | NaN |