[Pandas] The Difference Between pd.concat and pd.merge

Preface

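I recently worked on a data mining project that involved a lot of DataFrame combining. Along the way I mainly used two functions: pd.merge and pd.concat. I ran into a few pitfalls with them, which I am recording here.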

Introduction

First, here is how the official pandas documentation describes the two functions:

pd.merge

Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

pd.concat

Concatenate pandas objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

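From these descriptions we can see that:

  • pd.merge is a database-style join. It works much like a SQL JOIN and has the same concepts of inner joins, outer joins, and so on.
  • pd.concat takes an axis argument, so it can concatenate either vertically (axis=0) or horizontally (axis=1).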
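
The other two points in the pd.concat description, set logic on the other axis and hierarchical indexing, are not covered by the examples later in this post. A minimal sketch follows. The frames a and b and the labels 'first' and 'second' are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({'name': ['a', 'b'], 'score1': [1, 2]})
b = pd.DataFrame({'name': ['c', 'd'], 'score2': [3, 4]})

# join='inner' keeps only the columns present in every input (intersection on
# the non-concatenation axis); the default join='outer' keeps the union.
inner_cols = pd.concat([a, b], axis=0, join='inner')
print(list(inner_cols.columns))  # ['name']

# keys= adds an outer index level recording which input each row came from.
labeled = pd.concat([a, b], axis=0, keys=['first', 'second'])
print(labeled.index.nlevels)  # 2
print(labeled.xs('second')['name'].tolist())  # ['c', 'd']
```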

Basic usage

pd.merge

Join two tables on the name column with pd.merge.

import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score1'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score2'],
)

result_list = {
    'inner': pd.merge(left=df1, right=df2, how='inner', on='name'),  # intersection of the name values
    'outer': pd.merge(left=df1, right=df2, how='outer', on='name'),  # union of the name values
    'left': pd.merge(left=df1, right=df2, how='left', on='name'),  # name values from the left table
    'right': pd.merge(left=df1, right=df2, how='right', on='name'),  # name values from the right table
}

for merge_type, df in result_list.items():
    print(merge_type)
    print(df)

Output:

inner
  name  score1  score2
0    a       1       1
1    b       2       2
outer
  name  score1  score2
0    a     1.0     1.0
1    b     2.0     2.0
2    c     3.0     NaN
3    d     NaN     4.0
left
  name  score1  score2
0    a       1     1.0
1    b       2     2.0
2    c       3     NaN
right
  name  score1  score2
0    a     1.0       1
1    b     2.0       2
2    d     NaN       4

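Missing values are set to NaN, which is why an integer column that contains one is displayed as floats (1.0 instead of 1).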
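
If the float upcast is a problem, one option is to cast to the nullable integer dtype after merging. This is a sketch that assumes integer semantics matter downstream. The variable names outer and outer_int are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score1': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score2': [1, 2, 4]})

outer = pd.merge(df1, df2, how='outer', on='name')
print(outer['score1'].dtype)  # float64, because NaN cannot be stored in int64

# 'Int64' (capital I) is the nullable integer dtype; missing values become <NA>.
outer_int = outer.astype({'score1': 'Int64', 'score2': 'Int64'})
print(outer_int['score1'].dtype)  # Int64
print(int(outer_int['score1'].isna().sum()))  # 1
```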

pd.concat

import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score1'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score2'],
)

result_list = {
    'axis=0': pd.concat([df1, df2], axis=0),
    'axis=1': pd.concat([df1, df2], axis=1),
}

for merge_type, df in result_list.items():
    print(merge_type)
    print(df)

Output:

axis=0
  name  score1  score2
0    a     1.0     NaN
1    b     2.0     NaN
2    c     3.0     NaN
0    a     NaN     1.0
1    b     NaN     2.0
2    d     NaN     4.0
axis=1
  name  score1 name  score2
0    a       1    a       1
1    b       2    b       2
2    c       3    d       4

Likewise, missing values are filled with NaN.
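
Note that in the axis=1 output above, row 2 places 'c' next to 'd': pd.concat aligns rows by index label, not by the name column. If the intent is to line rows up by name, one possible approach (a sketch, not the only way) is to move name into the index first:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score1': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score2': [1, 2, 4]})

# With name as the index, axis=1 concatenation aligns rows on name,
# producing the same pairing as an outer merge on name.
aligned = pd.concat([df1.set_index('name'), df2.set_index('name')], axis=1)
print(aligned)
```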

Pitfalls

Index or column labels being changed

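If the two DataFrames being combined share column names other than name, then:

  • pd.merge renames the overlapping columns by default, appending the suffixes _x and _y
  • pd.concat simply concatenates and never renames index or column labels, so the result can contain duplicate index labels or duplicate column names

For example: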

import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score'],
)

result_list = {
    'inner': pd.merge(left=df1, right=df2, how='inner', on='name'),
    'axis=0': pd.concat([df1, df2], axis=0),
    'axis=1': pd.concat([df1, df2], axis=1),
}

for merge_type, df in result_list.items():
    print(merge_type)
    print(df)

Output:

inner
  name  score_x  score_y
0    a        1        1
1    b        2        2
axis=0
  name  score
0    a      1
1    b      2
2    c      3
0    a      1
1    b      2
2    d      4
axis=1
  name  score name  score
0    a      1    a      1
1    b      2    b      2
2    c      3    d      4
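
Both functions have parameters that help here. The sketch below shows a few of them. The suffix strings '_left' and '_right' are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1, 2, 3]})
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score': [1, 2, 4]})

# suffixes= replaces the default ('_x', '_y') on overlapping column names.
merged = pd.merge(df1, df2, how='inner', on='name', suffixes=('_left', '_right'))
print(list(merged.columns))  # ['name', 'score_left', 'score_right']

# ignore_index=True discards the original row labels and renumbers from 0.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3, 4, 5]

# Duplicate column names can be detected after a horizontal concat.
wide = pd.concat([df1, df2], axis=1)
print(wide.columns.duplicated().any())  # True

# verify_integrity=True raises ValueError instead of silently producing
# duplicate labels on the concatenation axis.
try:
    pd.concat([df1, df2], axis=0, verify_integrity=True)
except ValueError as exc:
    print('duplicate index detected:', exc)
```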

How differing indexes affect the result

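If the two DataFrames being combined have different indexes, then:

  • pd.merge is unaffected, because it joins on columns, with the on parameter specifying which column to join on. The index of the result is by default a fresh integer sequence starting at 0 and increasing by 1.
  • With pd.concat, horizontal concatenation (axis=1) produces an index that is the union of the two DataFrames' indexes, and the resulting gaps are filled with NaN.

For example: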

import pandas as pd

df1 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['c', 3],
    ],
    columns=['name', 'score'],
    index=[0, 1, 'xxx'],
)
df2 = pd.DataFrame(
    [
        ['a', 1],
        ['b', 2],
        ['d', 4],
    ],
    columns=['name', 'score'],
    index=[0, 1, 'yyy'],
)

result_list = {
    'inner': pd.merge(left=df1, right=df2, how='inner', on='name'),
    'axis=0': pd.concat([df1, df2], axis=0),
    'axis=1': pd.concat([df1, df2], axis=1),
}

for merge_type, df in result_list.items():
    print(merge_type)
    print(df)

Output:

inner
  name  score_x  score_y
0    a        1        1
1    b        2        2
axis=0
    name  score
0      a      1
1      b      2
xxx    c      3
0      a      1
1      b      2
yyy    d      4
axis=1
    name  score name  score
0      a    1.0    a    1.0
1      b    2.0    b    2.0
xxx    c    3.0  NaN    NaN
yyy  NaN    NaN    d    4.0
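
If the extra rows are not wanted, there are at least two ways to avoid them. The sketch below shows both. Which one is appropriate depends on whether rows should pair up by position or by label, which is an assumption the caller has to make.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1, 2, 3]}, index=[0, 1, 'xxx'])
df2 = pd.DataFrame({'name': ['a', 'b', 'd'], 'score': [1, 2, 4]}, index=[0, 1, 'yyy'])

# Default join='outer': the row index is the union of both indexes.
print(pd.concat([df1, df2], axis=1).shape)  # (4, 4)

# Option 1 (assumes rows should pair up by position): drop the labels first.
by_position = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1
)
print(by_position.shape)  # (3, 4)

# Option 2: join='inner' keeps only the labels present in both indexes.
shared = pd.concat([df1, df2], axis=1, join='inner')
print(shared.shape)  # (2, 4)

# pd.merge on a column returns a fresh 0..n-1 index regardless of the inputs.
merged = pd.merge(df1, df2, how='inner', on='name')
print(merged.index.tolist())  # [0, 1]
```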

Summary

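| | pd.merge | pd.concat |
|---|---|---|
| Operates on | Two DataFrames | Multiple DataFrames |
| How it combines | Database-style join on the specified column names | Simple horizontal or vertical concatenation |
| Affected by whether the indexes match | No | Yes (for horizontal concatenation) |
| Characteristics of the result | 1. The index is renumbered from 0 by default. 2. Columns may get a suffix (when both DataFrames share a column name). | 1. Index and column labels are never renamed. 2. Duplicate index labels or column names may appear (when both DataFrames share index labels or column names). 3. The row count may change in horizontal concatenation (when the two indexes are not identical). |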
