Merging DataFrames in pandas with merge

# Merge: combine different tables on a key
# Syntax:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
         copy=True, indicator=False, validate=None)

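right : DataFrame or named Series. Together with left, these are the two tables to merge.

With pandas.merge(), both structures to merge are passed in, as left and right; with dataframe.merge(),
only right is passed in, and dataframe itself acts as the left structure while right is the right one.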
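
To make the two calling styles concrete, here is a minimal sketch. The frames and column names are made up for illustration and are not from the dataset used later.

```python
import pandas as pd

# Toy frames, invented for this illustration only.
left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})

# Function form: both frames are passed explicitly.
via_function = pd.merge(left, right, on="key")

# Method form: the calling frame plays the role of the left side.
via_method = left.merge(right, on="key")

# Both forms should produce the same table.
same = via_function.equals(via_method)
print(same)
```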
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’:

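how sets the type of join, i.e. which keys end up in the result:
inner (the default): keep only the keys present in both structures
left: keep all keys from the left structure
right: keep all keys from the right structure
outer: keep the keys from both structures.
Values missing on one side are filled with NaN.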
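
The four join types can be compared side by side. This is a small sketch on invented frames: K0 exists only on the left, K3 only on the right.

```python
import pandas as pd

# Toy frames, invented for this illustration only.
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

# Collect the keys that survive under each join type.
keys_by_how = {}
for how in ("inner", "left", "right", "outer"):
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = sorted(merged["key"])
    print(how, keys_by_how[how])

# In the outer join, K0 has no partner on the right, so its B value is NaN.
outer = pd.merge(left, right, on="key", how="outer")
k0_b_is_missing = bool(outer.loc[outer["key"] == "K0", "B"].isna().all())
```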

on : label or list
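The on parameter names the key column(s) to merge on.
The keys given in on must exist in both structures.
indicator : bool or str, default False. If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
indicator records, for each row, which side its key came from: left_only, right_only, or both.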
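
When the key column has a different name in each structure, on cannot be used; left_on and right_on from the signature above cover that case. A sketch with invented names (student_id, sid) follows.

```python
import pandas as pd

# Toy frames, invented for this illustration; the key is named differently on each side.
students = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
scores = pd.DataFrame({"sid": [1, 2, 2], "score": [90, 75, 82]})

# left_on / right_on name the key column on each side separately.
merged = pd.merge(students, scores, left_on="student_id", right_on="sid")
print(merged)
```

Both key columns are kept in the result, since pandas has no way to know they should share one name.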
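
A short sketch of indicator on invented frames, using an outer join so that all three possible values appear:

```python
import pandas as pd

# Toy frames, invented for this illustration only.
left = pd.DataFrame({"key": ["K0", "K1"], "A": ["A0", "A1"]})
right = pd.DataFrame({"key": ["K1", "K2"], "B": ["B1", "B2"]})

merged = pd.merge(left, right, on="key", how="outer", indicator=True)

# Map each key to the side it came from.
sources = dict(zip(merged["key"], merged["_merge"].astype(str)))
print(sources)
```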
left_index : bool, default False
Use the index from the left DataFrame as the join key(s).

With left_index set to True, the merge uses the index of the left structure as the key instead of one of its columns.
right_index : bool, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index.

When right_index and left_index are both True, the merge is done on the indexes of the two structures.
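
A minimal sketch of an index-on-index merge, assuming two invented frames whose indexes hold the keys:

```python
import pandas as pd

# Toy frames, invented for this illustration; the keys live in the index, not in a column.
left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=["K1", "K2", "K3"])

# Both index flags set: rows are matched on index labels.
by_index = pd.merge(left, right, left_index=True, right_index=True, how="inner")
print(by_index)
```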
suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
Suffix to apply to overlapping column names in the left and right side, respectively. To raise an exception on overlapping columns use (False, False).
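suffixes handles the case where the two structures have overlapping column names.
With suffixes specified, columns that share a name get distinct names after the merge, so both are kept.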
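
A sketch of suffixes on two invented frames that both carry a non-key column called value:

```python
import pandas as pd

# Toy frames, invented for this illustration; "value" exists on both sides.
left = pd.DataFrame({"key": ["K0", "K1"], "value": [1, 2]})
right = pd.DataFrame({"key": ["K0", "K1"], "value": [10, 20]})

# Custom suffixes replace the default _x / _y.
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(list(merged.columns))
```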

ratings.dat looks like this:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

import pandas as pd 
import os 
os.chdir(r'C:\Users\Hans\Desktop\data_analysis\test_data\movie')
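df_ratings = pd.read_csv('ratings.dat',sep = '::',engine='python',names = 'UserID::MovieID::Timestamp'.split('::'))
# Note how names is written: it supplies the header, and str.split turns the single string into a list of column names.
# Caution: each line of ratings.dat has four fields but only three names are given here, so pandas uses the
# first field as the index and the labels shift by one: the column labelled UserID holds the second field.
df_ratings.head()
   UserID  MovieID  Timestamp
1    1193        5  978300760
1     661        3  978302109
1     914        3  978301968
1    3408        4  978300275
1    2355        5  978824291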
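
The sample lines of ratings.dat shown earlier have four fields each. Below is a sketch of reading them with one name per field, so that no field is pushed into the index. The name Rating for the third field is my assumption, and the data is fed from a string rather than from the file so the snippet runs on its own.

```python
import io
import pandas as pd

# The five sample lines quoted from ratings.dat.
sample = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "1::3408::4::978300275\n"
    "1::2355::5::978824291\n"
)

# One name per field; "Rating" is an assumed label for the third field.
names = "UserID::MovieID::Rating::Timestamp".split("::")
df_sample = pd.read_csv(io.StringIO(sample), sep="::", engine="python", names=names)
print(df_sample)
```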

users.dat looks like this:
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

df_users = pd.read_csv('users.dat',sep = '::',engine='python',names = 'UserID::Gender::Age::Occupation::Zip-code'.split('::'))
df_users.head()
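'''
sep is the field separator. A single-character separator is used literally. A separator longer than one
character is interpreted as a regular expression, which requires engine='python'.

names is a list, built here with str.split.

'''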
UserID Gender Age Occupation Zip-code
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455

movies.dat looks like this:
1::Toy Story (1995)::Animation|Children’s|Comedy
2::Jumanji (1995)::Adventure|Children’s|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy

df_movies = pd.read_csv('movies.dat',sep = '::',engine='python',names = 'MovieID::Title::Genres'.split('::'))
df_movies.head()
MovieID Title Genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
# Merge:
df_ratings_users = pd.merge(df_ratings,df_users,left_on = 'UserID',right_on = 'UserID',how = 'inner')
df_ratings_users.head()
UserID MovieID Timestamp Gender Age Occupation Zip-code
0 1193 5 978300760 M 25 12 90712
1 1193 5 978298413 M 25 12 90712
2 1193 4 978220179 M 25 12 90712
3 1193 4 978199279 M 25 12 90712
4 1193 5 978158471 M 25 12 90712
df_ratings_users.loc[df_ratings_users['UserID'] == 1193].tail()
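# The ratings table has many rows with the value 1193 in the column labelled UserID, but the users table has only one row with UserID == 1193.
# Each matching ratings row is paired with that single users row, so the user's columns are repeated on every one of them.
# Swapping the two tables in the call does not change which rows match; only the column order differs.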
UserID MovieID Timestamp Gender Age Occupation Zip-code
1720 1193 5 956713500 M 25 12 90712
1721 1193 5 956710879 M 25 12 90712
1722 1193 5 956710766 M 25 12 90712
1723 1193 4 956709215 M 25 12 90712
1724 1193 4 957716612 M 25 12 90712
df_users.loc[df_users['UserID']==1193]
UserID Gender Age Occupation Zip-code
1192 1193 M 25 12 90712
df_ratings.loc[df_ratings['UserID']==1193].tail()
UserID MovieID Timestamp
6033 1193 5 956713500
6035 1193 5 956710879
6036 1193 5 956710766
6037 1193 4 956709215
6040 1193 4 957716612
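
The row replication seen above can be reproduced on a tiny scale. This is a sketch with made-up values, not rows from the real files.

```python
import pandas as pd

# Toy stand-ins for the two tables; all values are invented for illustration.
ratings = pd.DataFrame({"UserID": [1, 1, 2],
                        "MovieID": [1193, 661, 1357],
                        "Rating": [5, 3, 5]})
users = pd.DataFrame({"UserID": [1, 2], "Gender": ["F", "M"]})

# User 1 appears twice in ratings and once in users,
# so the single users row is copied onto both rating rows.
merged = pd.merge(ratings, users, on="UserID", how="inner")
print(merged)
```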
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                       'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
pd.merge(left, right, on=['key1', 'key1'], how='inner')
# inner keeps only the rows whose key1 appears in both tables and drops the rest; when a key is repeated on one side, each match produces its own row
key1 key2_x A B key2_y C D
0 K0 K0 A0 B0 K0 C0 D0
1 K0 K1 A1 B1 K0 C0 D0
2 K1 K0 A2 B2 K0 C1 D1
3 K1 K0 A2 B2 K0 C2 D2
4 K2 K1 A3 B3 K0 C3 D3
pd.merge(left, right, on='key1', how='inner')  # same result as the previous call
key1 key2_x A B key2_y C D
0 K0 K0 A0 B0 K0 C0 D0
1 K0 K1 A1 B1 K0 C0 D0
2 K1 K0 A2 B2 K0 C1 D1
3 K1 K0 A2 B2 K0 C2 D2
4 K2 K1 A3 B3 K0 C3 D3
pd.merge(left, right, on=['key1', 'key2'], how='inner')
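# With inner, rows are merged only where key1 and key2 of table one both equal key1 and key2 of table two. Table two has two rows with (K1, K0)
# and table one has one such row, so the merge yields one extra row: the table-one values are repeated, each paired with a different table-two row, as shown below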
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
pd.merge(right,left, on=['key1', 'key2'], how='inner')  # swapping the tables gives the same rows; only the column order changes
key1 key2 C D A B
0 K0 K0 C0 D0 A0 B0
1 K1 K0 C1 D1 A2 B2
2 K1 K0 C2 D2 A2 B2
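
A sketch checking the effect of swapping the two tables, using the same left and right frames as above. For an inner join the matched rows are the same either way. For how='left' the order does matter, because the first argument decides which table is kept in full.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
keys = ['key1', 'key2']

lr = pd.merge(left, right, on=keys, how='inner')
rl = pd.merge(right, left, on=keys, how='inner')

def normalized(df, columns):
    # Put columns and rows in a fixed order so two results can be compared.
    return df[columns].sort_values(columns).reset_index(drop=True)

cols = list(lr.columns)
same_rows = normalized(lr, cols).equals(normalized(rl, cols))

# With how='left', the first table is kept in full, so the row counts differ.
n_left_first = len(pd.merge(left, right, on=keys, how='left'))
n_right_first = len(pd.merge(right, left, on=keys, how='left'))
print(same_rows, n_left_first, n_right_first)
```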

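'''
Cardinality of a merge:
(1) one-to-one: the key is unique on both sides
e.g. (student ID, name) merge (student ID, age)
result: 1*1
(2) one-to-many: the key is unique on the left but not on the right
e.g. (student ID, name) merge (student ID, [Chinese score, math score, English score])
result: 1*N
(3) many-to-many: the key is unique on neither side
e.g. (student ID, [Chinese score, math score, English score]) merge (student ID, [basketball, football, table tennis])
result: M*N

'''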
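
The three cardinalities can be checked by counting rows, and the validate parameter from the signature can assert the expected one. A sketch with invented student data (the column name sid and all values are made up):

```python
import pandas as pd

# Toy frames, invented for this illustration only.
names = pd.DataFrame({"sid": [1, 2], "name": ["Ann", "Bob"]})
ages = pd.DataFrame({"sid": [1, 2], "age": [20, 21]})
scores = pd.DataFrame({"sid": [1, 1, 1], "subject": ["Chinese", "Math", "English"]})
sports = pd.DataFrame({"sid": [1, 1], "sport": ["basketball", "football"]})

# one-to-one: one row per key.
one_to_one = pd.merge(names, ages, on="sid", validate="one_to_one")

# one-to-many: student 1 has three scores, so 1*3 rows (student 2 has none and is dropped by inner).
one_to_many = pd.merge(names, scores, on="sid", validate="one_to_many")

# many-to-many: 3 score rows times 2 sport rows for student 1.
many_to_many = pd.merge(scores, sports, on="sid")

# validate raises MergeError when the keys do not have the declared cardinality.
try:
    pd.merge(scores, sports, on="sid", validate="one_to_one")
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

print(len(one_to_one), len(one_to_many), len(many_to_many), check_failed)
```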

# how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’: examples of these modes
# (1) how = left
import pandas as pd 
import numpy as np

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                      'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                       'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
left
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
right
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
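# (1) how = left: the left table is the reference. All of its rows are kept, the right table contributes only the rows that match the left, and where the right has no match the columns are filled with NaN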
pd.merge(left,right,how = 'left')
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
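
For comparison, here is a sketch of how='right' and how='outer' on the same two frames. As in the call above, no on is given, so the merge uses the columns the frames share (key1 and key2).

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

# how='right': every right row is kept; (K2, K0) has no match on the left, so A and B are NaN there.
right_join = pd.merge(left, right, how='right')

# how='outer': the union of the key pairs from both tables.
outer_join = pd.merge(left, right, how='outer')

print(right_join)
print(outer_join)
```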
