import numpy as np
import pandas as pd
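Relational joins
Basic concepts of joins
Joining two related tables on a key, or on a set of keys, produces a joined table. In a relational join the key, usually passed through the `on` parameter, is essential. The join type matters just as much. pandas provides two relational join functions, `merge` and `join`, and both take a `how` parameter that selects the join type: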
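- Left join (`left`): the keys of the left table are the reference. If a left key also appears in the right table, the matching right-hand columns are added; otherwise they are filled with missing values.
- Right join (`right`): the mirror image of the left join, with the keys of the right table as the reference.
- Inner join (`inner`): keeps only the keys that appear in both tables.
- Outer join (`outer`): on top of the inner join, also keeps the keys that appear only in the left table or only in the right table. Also called a full join.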
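**Note:** If a key is duplicated in both tables, the matching rows are combined as a Cartesian product. If it is duplicated on one side only, the rows are first handled according to the join type and then matched one by one.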
**Without duplicate keys**
[Figure: join results without duplicate keys (image unavailable)]
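The four join types can be sketched on two small tables without duplicate keys. The frames `left` and `right` and their column names are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical tables: keys B and C are shared, A is left-only, D is right-only
left = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Col1': [1, 2, 3]})
right = pd.DataFrame({'Key': ['B', 'C', 'D'], 'Col2': [20, 30, 40]})

# Run the same merge once per join type
results = {how: left.merge(right, on='Key', how=how)
           for how in ['left', 'right', 'inner', 'outer']}

for how, res in results.items():
    # Print which keys survive under each join type
    print(how, sorted(res['Key']))
```

The left join keeps A, B, C (with a missing `Col2` for A), the right join keeps B, C, D, the inner join keeps only B and C, and the outer join keeps all four keys.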
**With duplicate keys**
[Figure: join results with duplicate keys (image unavailable)]
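The behaviour with duplicate keys can be checked by counting rows. This is a small sketch with made-up frames (`left`, `right`, `right_unique` are hypothetical names), not a reproduction of the missing figure.

```python
import pandas as pd

# Key A is duplicated in both hypothetical tables
left = pd.DataFrame({'Key': ['A', 'A', 'B'], 'Col1': [1, 2, 3]})
right = pd.DataFrame({'Key': ['A', 'A', 'C'], 'Col2': [10, 20, 30]})

# Duplicates on both sides: 2 x 2 = 4 rows for key A (Cartesian product)
both_dup = left.merge(right, on='Key', how='inner')

# Left join: the 4 rows for A plus the unmatched B row
left_join = left.merge(right, on='Key', how='left')

# Duplicates on the left only: each left A row matches the single right row
right_unique = pd.DataFrame({'Key': ['A'], 'Col2': [10]})
one_sided = left.merge(right_unique, on='Key', how='left')

print(len(both_dup), len(left_join), len(one_sided))
```

When an unexpected row count appears after a merge, duplicated keys on both sides are a likely cause.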
Joining on values
Two tables are joined on the combination of one or more columns. This kind of join is done with the `merge` function.
df1 = pd.DataFrame({
'Name':['San Zhang','Si Li'],
'Age':[20,30]})
df2 = pd.DataFrame({
'Name':['Si Li', 'Wu Wang'],
'Gender':['F','M']})
df1.merge(df2, on = 'Name', how = 'left')
|   | Name | Age | Gender |
|---|------|-----|--------|
| 0 | San Zhang | 20 | NaN |
| 1 | Si Li | 30 | F |
df1 = pd.DataFrame({
'df1_name':['San Zhang','Si Li'],
'Age':[20,30]})
df2 = pd.DataFrame({
'df2_name':['Si Li', 'Wu Wang'],
'Gender':['F','M']})
df1.merge(df2, left_on='df1_name', right_on='df2_name',how='left')
|   | df1_name | Age | df2_name | Gender |
|---|----------|-----|----------|--------|
| 0 | San Zhang | 20 | NaN | NaN |
| 1 | Si Li | 30 | Si Li | F |
df1 = pd.DataFrame({
'Name':['San Zhang'],'Grade':[70]})
df2 = pd.DataFrame({
'Name':['San Zhang'],'Grade':[80]})
df1.merge(df2, on='Name', how = 'left', suffixes=['_Chinese','_Math'])
|   | Name | Grade_Chinese | Grade_Math |
|---|------|---------------|------------|
| 0 | San Zhang | 70 | 80 |
df1 = pd.DataFrame({
'Name':['San Zhang', 'San Zhang'],
'Age':[20, 21],
'Class':['one', 'two']})
df2 = pd.DataFrame({
'Name':['San Zhang', 'San Zhang'],
'Gender':['F', 'M'],
'Class':['two', 'one']})
df1.merge(df2, on = 'Name', how='left')
|   | Name | Age | Class_x | Gender | Class_y |
|---|------|-----|---------|--------|---------|
| 0 | San Zhang | 20 | one | F | two |
| 1 | San Zhang | 20 | one | M | one |
| 2 | San Zhang | 21 | two | F | two |
| 3 | San Zhang | 21 | two | M | one |
df1.merge(df2, on=['Name','Class'], how = 'left')
|   | Name | Age | Class | Gender |
|---|------|-----|-------|--------|
| 0 | San Zhang | 20 | one | M |
| 1 | San Zhang | 21 | two | F |
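Joining on the index
Here the index serves as the join key. pandas handles index joins with the `join` function, which takes fewer parameters than `merge`. Besides `on` and `how`, it has `lsuffix` and `rsuffix`, which set the left and right suffixes for duplicated column names. The `on` parameter names the column or index level in the calling table that is matched against the index of the other table. With a single-level index, omitting it joins on the current index.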
df1 = pd.DataFrame({
'Age':[20,30]},
index=pd.Series(
['San Zhang','Si Li'],name='Name'))
df2 = pd.DataFrame({
'Gender':['F','M']},
index=pd.Series(
['Si Li','Wu Wang'],name='Name'))
df1.join(df2, how='left')
| Name | Age | Gender |
|------|-----|--------|
| San Zhang | 20 | NaN |
| Si Li | 30 | F |
df1 = pd.DataFrame({
'Grade':[70]},
index=pd.Series(['San Zhang'],
name='Name'))
df2 = pd.DataFrame({
'Grade':[80]},
index=pd.Series(['San Zhang'],
name='Name'))
df1.join(df2, how='left',lsuffix='_Chinese',rsuffix='_Math')
| Name | Grade_Chinese | Grade_Math |
|------|---------------|------------|
| San Zhang | 70 | 80 |
df1 = pd.DataFrame({
'Age':[20,21]},
index=pd.MultiIndex.from_arrays(
[['San Zhang', 'San Zhang'],['one', 'two']],
names=('Name','Class')))
df2 = pd.DataFrame({
'Gender':['F', 'M']},
index=pd.MultiIndex.from_arrays(
[['San Zhang', 'San Zhang'],['two', 'one']],
names=('Name','Class')))
df1.join(df2)
| Name | Class | Age | Gender |
|------|-------|-----|--------|
| San Zhang | one | 20 | M |
|           | two | 21 | F |
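The index-join examples above all omit `on`. As a minimal sketch of what passing it does, the following joins a column of the calling table against the index of the other table. The data mirrors the earlier `Name`/`Age`/`Gender` example, but `Name` is kept as an ordinary column in `df1`, which is an assumption made for this illustration.

```python
import pandas as pd

# Name is a regular column in df1 ...
df1 = pd.DataFrame({'Name': ['San Zhang', 'Si Li'], 'Age': [20, 30]})
# ... and the index in df2
df2 = pd.DataFrame({'Gender': ['F', 'M']},
                   index=pd.Series(['Si Li', 'Wu Wang'], name='Name'))

# Match df1['Name'] against df2's index; df1 keeps its own integer index
res = df1.join(df2, on='Name', how='left')
print(res)
```

The result keeps the row index of `df1`, and `Gender` is missing for San Zhang because that name does not occur in the index of `df2`.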
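Directional concatenation
`concat` stacks two or more tables vertically or horizontally. Its most commonly used parameters are `axis` (the direction), `join` (the join type) and `keys` (labels in the new table that indicate which source table each part came from).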
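By default `axis` is 0, which stacks the tables vertically and is typically used to combine samples. With `axis=1` the tables are stacked horizontally, which is typically used to combine fields and features.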
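**Note:** Before combining several tables along a direction, first restore the default integer index with `reset_index`. This avoids wrong results caused by misaligned indexes and by the Cartesian product of duplicated index values. After several tables are combined, `keys` identifies the source of each part of the new table: the `keys` parameter produces a multi-level index that labels it.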
df1 = pd.DataFrame({
'Name':['San Zhang','Si Li'],
'Age':[20,30]})
df2 = pd.DataFrame({
'Name':['Wu Wang'], 'Age':[40]})
pd.concat([df1, df2])
|   | Name | Age |
|---|------|-----|
| 0 | San Zhang | 20 |
| 1 | Si Li | 30 |
| 0 | Wu Wang | 40 |
df2 = pd.DataFrame({
'Grade':[80, 90]})
df3 = pd.DataFrame({
'Gender':['M', 'F']})
pd.concat([df1, df2, df3], axis=1)  # pass axis by keyword; pandas 2.x rejects it as a positional argument
|   | Name | Age | Grade | Gender |
|---|------|-----|-------|--------|
| 0 | San Zhang | 20 | 80 | M |
| 1 | Si Li | 30 | 90 | F |
df2 = pd.DataFrame({
'Name':['Wu Wang'], 'Gender':['M']})
pd.concat([df1, df2])
|   | Name | Age | Gender |
|---|------|-----|--------|
| 0 | San Zhang | 20.0 | NaN |
| 1 | Si Li | 30.0 | NaN |
| 0 | Wu Wang | NaN | M |
df2 = pd.DataFrame({
'Grade':[80, 90]}, index=[1, 2])
pd.concat([df1, df2], axis=1)  # pass axis by keyword; pandas 2.x rejects it as a positional argument
|   | Name | Age | Grade |
|---|------|-----|-------|
| 0 | San Zhang | 20.0 | NaN |
| 1 | Si Li | 30.0 | 80.0 |
| 2 | NaN | NaN | 90.0 |
pd.concat([df1, df2], axis=1, join='inner')
|   | Name | Age | Grade |
|---|------|-----|-------|
| 1 | Si Li | 30 | 80 |
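The two results above are aligned on index labels, which is why rows go missing or pick up NaN values. If the tables are meant to be stacked side by side purely by position, the `reset_index` step from the earlier note can be sketched as follows. The variable names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['San Zhang', 'Si Li'], 'Age': [20, 30]})
df2 = pd.DataFrame({'Grade': [80, 90]}, index=[1, 2])

# Aligned on index labels: the union of {0, 1} and {1, 2} gives 3 rows with NaN
by_label = pd.concat([df1, df2], axis=1)

# Restore the default integer index first, so rows pair up by position
by_position = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)

print(by_label.shape, by_position.shape)
```

`drop=True` discards the old index instead of turning it into a column. Whether positional pairing is the right choice depends on whether the row order of the two tables really corresponds.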
df1 = pd.DataFrame({
'Name':['San Zhang','Si Li'],
'Age':[20,21]})
df2 = pd.DataFrame({
'Name':['Wu Wang'],'Age':[21]})
pd.concat([df1, df2], keys=['one', 'two'])
|     |   | Name | Age |
|-----|---|------|-----|
| one | 0 | San Zhang | 20 |
|     | 1 | Si Li | 21 |
| two | 0 | Wu Wang | 21 |
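Because `keys` produces a multi-level index, the rows from one source table can be selected by their label. The sketch below also uses the `names` parameter of `concat` to name the index levels. The level names `Source` and `Row` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['San Zhang', 'Si Li'], 'Age': [20, 21]})
df2 = pd.DataFrame({'Name': ['Wu Wang'], 'Age': [21]})

# keys labels each source table; names labels the two index levels
res = pd.concat([df1, df2], keys=['one', 'two'], names=['Source', 'Row'])

# Select only the rows that came from the second table
from_two = res.loc['two']

# Turn the source label into an ordinary column
flat = res.reset_index(level='Source')
print(flat)
```

Keeping the source label as a column makes it available to later operations such as grouping or filtering.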