See also: [python] A detailed guide to the pd.merge function in pandas
pd.concat(
objs: Union[Iterable[ForwardRef('NDFrame')], Mapping[Optional[Hashable], ForwardRef('NDFrame')]],
axis=0,
join='outer',
ignore_index: bool = False,
keys=None,
levels=None,
names=None,
verify_integrity: bool = False,
sort: bool = False,
copy: bool = True,
)
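Commonly used parameters:
objs: the DataFrames or Series to concatenate. Put them in a list and pass the list in one call; it may hold two objects or more.
axis=0: by default the objects are stacked along the row axis (one below the other), and their column names are aligned by taking the union or the intersection. With axis=1 the objects are placed side by side, and the row index labels are aligned by union or intersection instead.
join='outer': take the union by default; set it to 'inner' to take the intersection.
ignore_index: bool = False: whether to discard the original labels along the concatenation axis. By default they are kept; with True they are replaced by 0, 1, 2, ...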
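As a minimal sketch of the objs parameter, the snippet below passes three small DataFrames in a single list. The frames part1, part2, and part3 are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy frames, used only to show that objs can hold more than two objects.
part1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
part2 = pd.DataFrame({'x': [3, 4]}, index=['c', 'd'])
part3 = pd.DataFrame({'x': [5, 6]}, index=['e', 'f'])

# All objects go into one list; axis=0 (the default) stacks them vertically.
stacked = pd.concat([part1, part2, part3])
print(stacked)
```

The result has six rows, in the order the frames were listed.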
A few examples show how the function is used.
First, build two DataFrames:
import pandas as pd

data1 = {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy', 'devn', 'elov'],
         'age': [18, 29, 32, 28, 34, 19, None],
         'sex': ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
         'weight': [67, 78, 87, 59, 90, 101, 78],
         'height': [170, 189, 190, 179, None, 160, 185]}
df1 = pd.DataFrame(data1, index=['a', 'i', 'c', 'd', 'h', 'f', 'g'])
df1
Output: df1
    name   age     sex  weight  height
a  apolo  18.0    male      67   170.0
i    adm  29.0  female      78   189.0
c  bolon  32.0    male      87   190.0
d    ali  28.0    male      59   179.0
h  cathy  34.0  female      90     NaN
f   devn  19.0    male     101   160.0
g   elov   NaN  female      78   185.0
data2 = {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy'],
         'age': [18, 29, 32, 28, None],
         'sex': ['male', 'male', 'female', 'male', 'female'],
         'country': ['BR', 'MX', 'CL', 'BR', 'CO']}
df2 = pd.DataFrame(data2, index=['a', 'b', 'm', 'n', 'h'])
df2
Output: df2
    name   age     sex  country
a  apolo  18.0    male       BR
b    adm  29.0    male       MX
m  bolon  32.0  female       CL
n    ali  28.0    male       BR
h  cathy   NaN  female       CO
Concatenate the two DataFrames directly with concat(), leaving every parameter at its default.
result = pd.concat([df1, df2])
result
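Output: the columns of the new dataset are the union of the two inputs' columns. For a column that appears in only one df (such as country, which exists only in df2), the rows that come from the other df hold NaN. The weight column also changes from int to float, because an integer column cannot store NaN. The row index of the new dataset consists of all row labels of the original datasets, so labels may repeat.
    name   age     sex  weight  height  country
a  apolo  18.0    male    67.0   170.0      NaN
i    adm  29.0  female    78.0   189.0      NaN
c  bolon  32.0    male    87.0   190.0      NaN
d    ali  28.0    male    59.0   179.0      NaN
h  cathy  34.0  female    90.0     NaN      NaN
f   devn  19.0    male   101.0   160.0      NaN
g   elov   NaN  female    78.0   185.0      NaN
a  apolo  18.0    male     NaN     NaN       BR
b    adm  29.0    male     NaN     NaN       MX
m  bolon  32.0  female     NaN     NaN       CL
n    ali  28.0    male     NaN     NaN       BR
h  cathy   NaN  female     NaN     NaN       CO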
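The repeated row labels can be checked directly. The sketch below rebuilds the two DataFrames so that it runs on its own, lists the labels that occur more than once, and then uses the verify_integrity parameter from the signature to make concat reject them. The names duplicated_labels and raised are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy', 'devn', 'elov'],
     'age': [18, 29, 32, 28, 34, 19, None],
     'sex': ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
     'weight': [67, 78, 87, 59, 90, 101, 78],
     'height': [170, 189, 190, 179, None, 160, 185]},
    index=['a', 'i', 'c', 'd', 'h', 'f', 'g'])
df2 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy'],
     'age': [18, 29, 32, 28, None],
     'sex': ['male', 'male', 'female', 'male', 'female'],
     'country': ['BR', 'MX', 'CL', 'BR', 'CO']},
    index=['a', 'b', 'm', 'n', 'h'])

result = pd.concat([df1, df2])

# Labels 'a' and 'h' exist in both inputs, so each appears twice in the result.
duplicated_labels = result.index[result.index.duplicated()].tolist()
print(duplicated_labels)  # ['a', 'h']

# verify_integrity=True makes concat raise instead of keeping duplicate labels.
try:
    pd.concat([df1, df2], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print('concat refused:', err)
```

If duplicate labels would cause trouble later (for example with .loc lookups), either this check or ignore_index=True is a reasonable safeguard.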
Set axis=1
result = pd.concat([df1, df2], axis=1)
result
Output: the row index is now the union of the original row indexes. For a row label that appears in only one df (such as n, which exists only in df2), the columns that come from the other df hold NaN in that row. The column names of the new dataset are all column names of the original datasets, so names may repeat.
    name   age     sex  weight  height   name   age     sex  country
a  apolo  18.0    male    67.0   170.0  apolo  18.0    male       BR
i    adm  29.0  female    78.0   189.0    NaN   NaN     NaN      NaN
c  bolon  32.0    male    87.0   190.0    NaN   NaN     NaN      NaN
d    ali  28.0    male    59.0   179.0    NaN   NaN     NaN      NaN
h  cathy  34.0  female    90.0     NaN  cathy   NaN  female       CO
f   devn  19.0    male   101.0   160.0    NaN   NaN     NaN      NaN
g   elov   NaN  female    78.0   185.0    NaN   NaN     NaN      NaN
b    NaN   NaN     NaN     NaN     NaN    adm  29.0    male       MX
m    NaN   NaN     NaN     NaN     NaN  bolon  32.0  female       CL
n    NaN   NaN     NaN     NaN     NaN    ali  28.0    male       BR
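Repeated column names make plain column selection ambiguous. One possible way around this, sketched below, is the keys parameter from the signature, which adds an outer column level naming the source of each block. The labels 'df1' and 'df2' passed to keys are arbitrary choices made for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy', 'devn', 'elov'],
     'age': [18, 29, 32, 28, 34, 19, None],
     'sex': ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
     'weight': [67, 78, 87, 59, 90, 101, 78],
     'height': [170, 189, 190, 179, None, 160, 185]},
    index=['a', 'i', 'c', 'd', 'h', 'f', 'g'])
df2 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy'],
     'age': [18, 29, 32, 28, None],
     'sex': ['male', 'male', 'female', 'male', 'female'],
     'country': ['BR', 'MX', 'CL', 'BR', 'CO']},
    index=['a', 'b', 'm', 'n', 'h'])

# Without keys, selecting 'name' returns both columns that carry that name.
plain = pd.concat([df1, df2], axis=1)
print(plain['name'].shape)  # (10, 2)

# With keys, each block gets an outer label, so columns are addressed by a tuple.
wide = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
print(wide[('df2', 'country')])
```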
Set join='inner'
result = pd.concat([df1, df2], join='inner')
result
Output: the columns are now the intersection of the original columns, while the row index still consists of all row labels of the original datasets, so labels may repeat.
    name   age     sex
a  apolo  18.0    male
i    adm  29.0  female
c  bolon  32.0    male
d    ali  28.0    male
h  cathy  34.0  female
f   devn  19.0    male
g   elov   NaN  female
a  apolo  18.0    male
b    adm  29.0    male
m  bolon  32.0  female
n    ali  28.0    male
h  cathy   NaN  female
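To confirm that the surviving columns are exactly the shared ones, the sketch below compares the result's columns with Index.intersection applied to the two inputs. The variable names inner and shared are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy', 'devn', 'elov'],
     'age': [18, 29, 32, 28, 34, 19, None],
     'sex': ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
     'weight': [67, 78, 87, 59, 90, 101, 78],
     'height': [170, 189, 190, 179, None, 160, 185]},
    index=['a', 'i', 'c', 'd', 'h', 'f', 'g'])
df2 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy'],
     'age': [18, 29, 32, 28, None],
     'sex': ['male', 'male', 'female', 'male', 'female'],
     'country': ['BR', 'MX', 'CL', 'BR', 'CO']},
    index=['a', 'b', 'm', 'n', 'h'])

inner = pd.concat([df1, df2], join='inner')

# Columns present in both inputs; weight, height and country are dropped.
shared = df1.columns.intersection(df2.columns)
print(list(inner.columns))
print(set(shared) == set(inner.columns))

# join only affects the columns here: all 7 + 5 rows are still present.
print(len(inner))
```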
Set ignore_index=True
result = pd.concat([df1, df2], join='inner', ignore_index=True)
result
Output: the row index has become 0, 1, 2, ..., 11 instead of the original row labels. Likewise, with axis=1 the column index (that is, the column names) becomes 0, 1, 2, ...
     name   age     sex
0   apolo  18.0    male
1     adm  29.0  female
2   bolon  32.0    male
3     ali  28.0    male
4   cathy  34.0  female
5    devn  19.0    male
6    elov   NaN  female
7   apolo  18.0    male
8     adm  29.0    male
9   bolon  32.0  female
10    ali  28.0    male
11  cathy   NaN  female
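The axis=1 case mentioned above can be sketched as follows. The DataFrames are rebuilt so that the snippet runs on its own, and the names rows and side are introduced for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy', 'devn', 'elov'],
     'age': [18, 29, 32, 28, 34, 19, None],
     'sex': ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
     'weight': [67, 78, 87, 59, 90, 101, 78],
     'height': [170, 189, 190, 179, None, 160, 185]},
    index=['a', 'i', 'c', 'd', 'h', 'f', 'g'])
df2 = pd.DataFrame(
    {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy'],
     'age': [18, 29, 32, 28, None],
     'sex': ['male', 'male', 'female', 'male', 'female'],
     'country': ['BR', 'MX', 'CL', 'BR', 'CO']},
    index=['a', 'b', 'm', 'n', 'h'])

# axis=0: the row labels are replaced by 0..11, the column names are kept.
rows = pd.concat([df1, df2], join='inner', ignore_index=True)
print(rows.index.tolist())

# axis=1: the column names are replaced by 0..8, the row labels are kept.
side = pd.concat([df1, df2], axis=1, ignore_index=True)
print(side.columns.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Note that with axis=1 the original column names are lost, so this option is mostly useful when the names carry no information.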
The combine_first function fills the missing values of one DataFrame with values from another:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, np.nan, 5, np.nan],
                    'b': [np.nan, 2, np.nan, 6],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5, 4, np.nan, 3, 7],
                    'b': [np.nan, 3, 4, 6, 8]})
df1:
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14
df2:
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0
df1.combine_first(df2)
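Output: values already present in df1 are kept, its NaN cells are filled from the matching cells of df2, and row 4, which exists only in df2, is added.
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN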
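One way to reason about this result is to rebuild it by hand. The sketch below is an approximation for illustration, not how pandas implements combine_first: it extends df1 to the union of both frames' row and column labels, then fills the remaining gaps from df2. The names patched, manual, all_rows, and all_cols are introduced here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, np.nan, 5, np.nan],
                    'b': [np.nan, 2, np.nan, 6],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5, 4, np.nan, 3, 7],
                    'b': [np.nan, 3, 4, 6, 8]})

patched = df1.combine_first(df2)

# Hand-rolled approximation: reindex df1 to the union of both axes,
# then fill the gaps from df2 (fillna with a DataFrame aligns on labels).
all_rows = df1.index.union(df2.index)
all_cols = df1.columns.union(df2.columns)
manual = df1.reindex(index=all_rows, columns=all_cols).fillna(df2)

print(patched)
print(manual)
```

Cell (0, 'b') stays NaN because it is missing in both frames, and cell (4, 'c') stays NaN because df2 has no column c.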