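I recently worked on a data-mining project that involved a lot of DataFrame joining. Along the way I mainly used two methods: `pd.merge` and `pd.concat`. I hit a few pitfalls with them, which I am recording here.
First, here is how the official pandas documentation describes the two methods:
`pd.merge`: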
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
`pd.concat`:
Concatenate pandas objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
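From these descriptions we can see:
- `pd.merge` is a database-style join. It works essentially like a SQL join and has the same notions of inner join, outer join, and so on.
- `pd.concat` takes an axis, so it can stack frames either vertically (`axis=0`) or horizontally (`axis=1`).
The following example merges two tables on the `name` column with `pd.merge`.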
import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score1'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score2'],
)
result_list = {
    'inner': pd.merge(left=df1, right=df2, how='inner', on='name'),  # intersection of the name values
    'outer': pd.merge(left=df1, right=df2, how='outer', on='name'),  # union of the name values
    'left': pd.merge(left=df1, right=df2, how='left', on='name'),  # every name in the left table
    'right': pd.merge(left=df1, right=df2, how='right', on='name'),  # every name in the right table
}
for merge_type, df in result_list.items():
    print(merge_type)
    print(df)
Output:
inner
  name  score1  score2
0    a       1       1
1    b       2       2
outer
  name  score1  score2
0    a     1.0     1.0
1    b     2.0     2.0
2    c     3.0     NaN
3    d     NaN     4.0
left
  name  score1  score2
0    a       1     1.0
1    b       2     2.0
2    c       3     NaN
right
  name  score1  score2
0    a     1.0       1
1    b     2.0       2
2    d     NaN       4
Missing values are set to NaN.
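One side effect of the NaN filling is visible in the output above: a column that receives NaN is upcast from integer to float. The sketch below shows two common ways to deal with that. The nullable `Int64` dtype and the fill value `0` are my own choices for illustration, not something the original project necessarily used.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score1': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score2': [1, 2, 4]})

left = pd.merge(df1, df2, how='left', on='name')
# score2 has no value for 'c', so pandas stores NaN and upcasts the column to float
print(left.dtypes)

# Option 1: convert to the nullable integer dtype, which keeps the gap as <NA>
left_nullable = left.astype({'score2': 'Int64'})
print(left_nullable)

# Option 2: fill the gap with a sentinel (0 is an arbitrary choice) and cast back to int
left_filled = left.fillna({'score2': 0}).astype({'score2': 'int64'})
print(left_filled)
```

Which option fits depends on whether a missing score must stay distinguishable from a real value.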
The next example concatenates the same two tables with `pd.concat`, once along each axis.
import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score1'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score2'],
)
result_list = {
    'axis=0': pd.concat([df1, df2], axis=0),
    'axis=1': pd.concat([df1, df2], axis=1),
}
for merge_type, df in result_list.items():
    print(merge_type)
    print(df)
Output:
axis=0
  name  score1  score2
0    a     1.0     NaN
1    b     2.0     NaN
2    c     3.0     NaN
0    a     NaN     1.0
1    b     NaN     2.0
2    d     NaN     4.0
axis=1
  name  score1 name  score2
0    a       1    a       1
1    b       2    b       2
2    c       3    d       4
Likewise, missing values are filled with NaN.
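The documentation quoted earlier also mentions "optional set logic along the other axes" and "a layer of hierarchical indexing". A minimal sketch of both, using the `join` and `keys` parameters of `pd.concat`; the key labels `'df1'` and `'df2'` are names I picked for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score1': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score2': [1, 2, 4]})

# join='inner' keeps only the columns present in every frame (here just 'name'),
# so no NaN padding is needed
common = pd.concat([df1, df2], axis=0, join='inner')
print(common)

# keys= adds an outer index level recording which frame each row came from
labelled = pd.concat([df1, df2], axis=0, keys=['df1', 'df2'])
print(labelled.loc['df2'])  # only the rows that came from df2
```

The default `join='outer'` is what produced the NaN-padded result in the example above.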
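If the two DataFrames share other column names besides `name`:
- `pd.merge` renames the overlapping columns by default, appending the suffixes `_x` and `_y`.
- `pd.concat` simply joins the frames without renaming any index or column labels, so the result can contain duplicate index or column labels.
An example: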
import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score'],
)
result_list = {
    'inner': pd.merge(left=df1, right=df2, how='inner', on='name'),
    'axis=0': pd.concat([df1, df2], axis=0),
    'axis=1': pd.concat([df1, df2], axis=1),
}
for merge_type, df in result_list.items():
    print(merge_type)
    print(df)
Output:
inner
  name  score_x  score_y
0    a        1        1
1    b        2        2
axis=0
  name  score
0    a      1
1    b      2
2    c      3
0    a      1
1    b      2
2    d      4
axis=1
  name  score name  score
0    a      1    a      1
1    b      2    b      2
2    c      3    d      4
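Both behaviours can be controlled. The following is a sketch of the options I would reach for, not the only way to do it; the suffix strings `_left` and `_right` are arbitrary names chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score': [1, 2, 4]})

# Replace the default _x/_y suffixes with more descriptive ones
merged = pd.merge(df1, df2, how='inner', on='name', suffixes=('_left', '_right'))
print(merged.columns.tolist())

# Vertical concat: ignore_index=True discards the old labels and renumbers the rows
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.index.tolist())

# Horizontal concat: rename the second frame's columns first to avoid duplicate labels
side_by_side = pd.concat([df1, df2.add_suffix('_right')], axis=1)
print(side_by_side.columns.tolist())
```

Duplicate labels are worth avoiding because selecting a duplicated column name such as `df['score']` returns a DataFrame with both columns rather than a single Series.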
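If the two DataFrames have different indexes:
- `pd.merge` is unaffected when it joins on columns (named with the `on` parameter): the original indexes are ignored, and the result gets a new default index 0, 1, 2, ….
- `pd.concat`, when concatenating horizontally (`pd.concat(..., axis=1)`), aligns rows on their index labels: the result index is the union of the two indexes, and cells with no matching row are filled with NaN.
An example: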
import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score'],
    index=[0, 1, 'xxx'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score'],
    index=[0, 1, 'yyy'],
)
result_list = {
    'inner': pd.merge(left=df1, right=df2, how='inner', on='name'),
    'axis=0': pd.concat([df1, df2], axis=0),
    'axis=1': pd.concat([df1, df2], axis=1),
}
for merge_type, df in result_list.items():
    print(merge_type)
    print(df)
Output:
inner
  name  score_x  score_y
0    a        1        1
1    b        2        2
axis=0
    name  score
0      a      1
1      b      2
xxx    c      3
0      a      1
1      b      2
yyy    d      4
axis=1
    name  score name  score
0      a    1.0    a    1.0
1      b    2.0    b    2.0
xxx    c    3.0  NaN    NaN
yyy  NaN    NaN    d    4.0
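If the intent is to glue the frames together row by row regardless of their labels, one option is to reset both indexes before concatenating. This is a sketch of that workaround, which assumes the original index labels can be thrown away.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1, 2, 3]}, index=[0, 1, 'xxx'])
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score': [1, 2, 4]}, index=[0, 1, 'yyy'])

# Default behaviour: rows are aligned on index labels, giving 4 rows with NaN gaps
aligned = pd.concat([df1, df2], axis=1)
print(aligned.shape)

# Resetting both indexes first makes the concat purely positional: 3 rows, no NaN
positional = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1,
)
print(positional)
```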
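| | pd.merge | pd.concat |
|---|---|---|
| Operates on | Two DataFrames | Any number of DataFrames |
| How the data is combined | Database-style join on the specified columns | Simple horizontal or vertical concatenation |
| Does a differing index affect the result? | No (when joining on columns) | Yes |
| Characteristics of the result | 1. The index is renumbered from 0 by default. 2. Columns may get a suffix (when both DataFrames share a column name). | 1. Index and column labels are never renamed. 2. Duplicate index or column labels can appear (when both DataFrames share index or column labels). 3. With horizontal concatenation the row count can change (when the two indexes are not identical). |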
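The first row of the table has a practical consequence when there are more than two tables. A minimal sketch, using three made-up frames that share a `name` key: `pd.concat` takes the whole list at once, while `pd.merge` has to be applied pairwise, for example with `functools.reduce`.

```python
from functools import reduce

import pandas as pd

# Three hypothetical tables sharing the key column 'name'
frames = [
    pd.DataFrame({'name': ['a', 'b'], 'score1': [1, 2]}),
    pd.DataFrame({'name': ['a', 'b'], 'score2': [3, 4]}),
    pd.DataFrame({'name': ['a', 'b'], 'score3': [5, 6]}),
]

# pd.concat accepts the whole list in one call
stacked = pd.concat(frames, axis=0, ignore_index=True)
print(stacked)

# pd.merge joins two frames at a time, so fold it over the list
merged = reduce(
    lambda left, right: pd.merge(left, right, how='inner', on='name'),
    frames,
)
print(merged)
```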