A Summary of pandas merge

Contents

    • pandas merge
      • pd.merge
      • DataFrame.join
    • Summary

pandas merge

A merge is a SQL-style table join between DataFrames. pandas provides the pd.merge function for this, and DataFrame also offers a join method (Series has no such method). This article compares the two.

pd.merge

The docstring reads:

Merge DataFrame or named Series objects with a database-style join.

    left : DataFrame
    right : DataFrame or named Series
        Object to merge with.
    how : {'left', 'right', 'outer', 'inner'}, default 'inner'
        Type of merge to be performed.
    
        * left: use only keys from left frame, similar to a SQL left outer join;
          preserve key order.
        * right: use only keys from right frame, similar to a SQL right outer join;
          preserve key order.
        * outer: use union of keys from both frames, similar to a SQL full outer
          join; sort keys lexicographically.
        * inner: use intersection of keys from both frames, similar to a SQL inner
          join; preserve the order of the left keys.
    on : label or list
        Column or index level names to join on. These must be found in both
        DataFrames. If `on` is None and not merging on indexes then this defaults
        to the intersection of the columns in both DataFrames.

    left_on : label or list, or array-like
        Column or index level names to join on in the left DataFrame. Can also
        be an array or list of arrays of the length of the left DataFrame.
        These arrays are treated as if they are columns.
    right_on : label or list, or array-like
        Column or index level names to join on in the right DataFrame. Can also
        be an array or list of arrays of the length of the right DataFrame.
        These arrays are treated as if they are columns.
    left_index : bool, default False
        Use the index from the left DataFrame as the join key(s). If it is a
        MultiIndex, the number of keys in the other DataFrame (either the index
        or a number of columns) must match the number of levels.
    right_index : bool, default False
        Use the index from the right DataFrame as the join key. Same caveats as
        left_index.

    suffixes : tuple of (str, str), default ('_x', '_y')
        Suffix to apply to overlapping column names in the left and right
        side, respectively. To raise an exception on overlapping columns use
        (False, False).

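Key parameters:
how: the join type, as in SQL: inner, left outer, right outer, or full outer. The default is an inner join.

on: the column names or index level names to join on; they must exist in both objects. If on is not given and the merge is not on indexes, all columns whose names appear on both sides are used as join keys.
DataFrame joins are stricter than SQL joins in one respect. A SQL join result may contain columns with the same name, since they come from different tables, whereas a DataFrame has to resolve columns that share a name on both sides: they either serve as join keys (the default when on is omitted) or are renamed with suffixes.

left_on: the columns or index levels of the left object to join on.

right_on: the columns or index levels of the right object to join on.

left_index: use the left object's index as the join key.

right_index: use the right object's index as the join key.

suffixes: the suffixes appended to same-named columns that are not used as join keys.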
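To make the how parameter concrete, here is a small sketch that runs the same merge with each join type and records which keys survive. The frames left and right and their values are made up for illustration and are not the data used later in this article.

```python
import pandas as pd

# Illustrative frames: 'c' exists only on the left, 'd' only on the right.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 40]})

# For each join type, collect the keys present in the merged result.
keys_by_how = {
    how: sorted(pd.merge(left, right, on="key", how=how)["key"])
    for how in ("inner", "left", "right", "outer")
}

for how, keys in keys_by_how.items():
    print(how, keys)

# Rows without a match on the other side are filled with NaN.
left_joined = pd.merge(left, right, on="key", how="left")
print(left_joined)
```

An inner join keeps only the keys found on both sides, left and right keep every key of the respective frame, and outer keeps the union.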

Here are the common join patterns:

  1. Joining on a same-named column

Define two DataFrames:

import numpy as np
import pandas as pd

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2' : np.random.randint(0,10,3)})
  key  data1
0   b      6
1   b      5
2   a      9
3   c      5
4   a      4
5   a      2
6   b      4
  key  data2
0   a      8
1   b      0
2   d      4
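print(pd.merge(df_obj1, df_obj2))  # by default, joins on the shared column 'key'
print(pd.merge(df_obj1, df_obj2, on='key'))  # explicitly join on 'key'
print(pd.merge(df_obj1, df_obj2, left_on='key', right_on='key'))  # specify the left and right join columns separately

All three give the same result: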

  key  data1  data2
0   b      6      0
1   b      5      0
2   b      4      0
3   a      9      8
4   a      4      8
5   a      2      8
print(pd.merge(df_obj1, df_obj2, on='data'))

Forcing a join on a column name that is not present in both frames raises an error (a KeyError).
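The failure can be checked without crashing the script by catching the exception. This is a minimal sketch; the helper merge_raises_key_error is a name introduced here for illustration, and the frames use fixed values instead of the random data above.

```python
import pandas as pd

df_obj1 = pd.DataFrame({"key": ["b", "a", "c"], "data1": [1, 2, 3]})
df_obj2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": [4, 5, 6]})


def merge_raises_key_error(on):
    """Return True if merging the two frames on `on` raises KeyError."""
    try:
        pd.merge(df_obj1, df_obj2, on=on)
    except KeyError:
        return True
    return False


# 'data' is in neither frame, 'data1' is only in the left one, 'key' is in both.
for column in ("data", "data1", "key"):
    print(column, merge_raises_key_error(column))
```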

  2. Explicitly specifying the left and right join columns
print(pd.merge(df_obj1, df_obj2, left_on='data1', right_on='data2'))

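Here the data1 column of the first DataFrame is explicitly joined to the data2 column of the second. The same-named column that is not used as a join key (key in this case) receives the _x and _y suffixes.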

  key_x  data1 key_y  data2
0     a      4     d      4
1     b      4     d      4
  3. Joining on the index
df_obj1 = pd.DataFrame({'key':[1,2,3,4]}, index=['a','a','b','c'])
df_obj2 = pd.DataFrame({'key':[4,5,6]}, index=['a','d','b'])
   key
a    1
a    2
b    3
c    4
   key
a    4
d    5
b    6
print(pd.merge(df_obj1, df_obj2, left_index=True, right_index=True))
   key_x  key_y
a      1      4
a      2      4
b      3      6
  4. Joining a column against an index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])
  key  data1
0   b      4
1   b      6
2   a      0
3   c      2
4   a      7
5   a      6
6   b      7
   data2
a      3
b      3
d      0

Join the key column on the left against the index on the right:

pd.merge(df_obj1, df_obj2, left_on='key', right_index=True)
  key  data1  data2
0   b      4      3
1   b      6      3
6   b      7      3
2   a      0      3
4   a      7      3
5   a      6      3

These four patterns cover most everyday use. If you prefer to always join on columns, call reset_index on the DataFrame whose index is involved to turn the index into a column.
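The reset_index route can be sketched as follows. The data is fixed for reproducibility, and the normalize helper and the use of rename_axis to name the new column are choices made for this example, not something the article prescribes.

```python
import pandas as pd

left = pd.DataFrame({"key": ["b", "b", "a", "c"], "data1": [4, 6, 0, 2]})
right = pd.DataFrame({"data2": [3, 5, 0]}, index=["a", "b", "d"])

# Option 1: left column against right index.
via_index = pd.merge(left, right, left_on="key", right_index=True)

# Option 2: turn the right index into a column named 'key',
# then do an ordinary column-to-column merge.
right_as_column = right.rename_axis("key").reset_index()
via_column = pd.merge(left, right_as_column, on="key")


def normalize(frame):
    """Sort rows and drop the row labels so the two results can be compared."""
    return frame.sort_values(["key", "data1"]).reset_index(drop=True)


print(normalize(via_index))
print(normalize(via_column))
```

The two results contain the same rows. They differ only in row labels: the index-based merge keeps the left frame's labels, while the column-based merge produces a fresh integer index.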

DataFrame.join

This method exists only on DataFrame, not on Series.
The docstring reads:

 Join columns of another DataFrame.
    
    Join columns with `other` DataFrame either on index or on a key
    column. Efficiently join multiple DataFrame objects by index at once by
    passing a list.

other : DataFrame, Series, or list of DataFrame
        Index should be similar to one of the columns in this one. If a
        Series is passed, its name attribute must be set, and that will be
        used as the column name in the resulting joined DataFrame.
    on : str, list of str, or array-like, optional
        Column or index level name(s) in the caller to join on the index
        in `other`, otherwise joins index-on-index. If multiple
        values given, the `other` DataFrame must have a MultiIndex. Can
        pass an array as the join key if it is not already contained in
        the calling DataFrame. Like an Excel VLOOKUP operation.
    how : {'left', 'right', 'outer', 'inner'}, default 'left'
        How to handle the operation of the two objects.
    
        * left: use calling frame's index (or column if on is specified)
        * right: use `other`'s index.
        * outer: form union of calling frame's index (or column if on is
          specified) with `other`'s index, and sort it.
          lexicographically.
        * inner: form intersection of calling frame's index (or column if
          on is specified) with `other`'s index, preserving the order
          of the calling's one.
    lsuffix : str, default ''
        Suffix to use from left frame's overlapping columns.
    rsuffix : str, default ''
        Suffix to use from right frame's overlapping columns.

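Purpose: join the columns of one or more Series or DataFrames onto the calling DataFrame.

other: one or more DataFrames or Series.

on: a column name or index level name in the caller, used to join against other. As the docstring makes clear, whatever valid type is passed as other, join always uses the index of other; on only decides whether the left object joins by a column or by its index. By default both sides join on their indexes.

how: the join type: inner, left outer, right outer, or full outer. The default is a left outer join, which keeps left rows that have no match on the right.

lsuffix, rsuffix: suffixes for same-named columns. If the two frames share a column name and neither suffix is given, an error is raised.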

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],'B': ['B0', 'B1', 'B2']})
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
  key   B
0  K0  B0
1  K1  B1
2  K2  B2
print(df.join(other, lsuffix='_caller', rsuffix='_other'))

By default the join is on the index of both sides.

  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

Without a suffix for the same-named column, an error is raised:

print(df.join(other))
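A small sketch of that failure and of the minimal fix, using shortened versions of the frames above. It assumes that supplying just one of the two suffixes is enough to resolve the name clash, which matches the behaviour of join as far as I know.

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
other = pd.DataFrame({"key": ["K0", "K1"], "B": ["B0", "B1"]})

# Both frames have a 'key' column and no suffix is given, so join refuses.
try:
    df.join(other)
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
else:
    error_message = None

# Giving a suffix for one side is sufficient to disambiguate the columns.
joined = df.join(other, rsuffix="_other")
print(list(joined.columns))
```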

Use the key column as the index on both sides and join:

print(df.set_index('key').join(other.set_index('key')))
      A    B
key         
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Join a left column against the right index:

print(df.join(other.set_index('key'), on='key'))
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

In short, with DataFrame.join the right side always joins on its index, while the left side can join on either a column or its index.
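Put differently, a join with on behaves like a left merge of a left column against the right index. The sketch below compares the two calls on shortened versions of the frames above; the variable names via_join and via_merge are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"key": ["K0", "K1", "K2", "K3"],
                   "A": ["A0", "A1", "A2", "A3"]})
other = pd.DataFrame({"key": ["K0", "K1", "K2"],
                      "B": ["B0", "B1", "B2"]}).set_index("key")

# join: left column 'key' against the index of `other` (default how='left').
via_join = df.join(other, on="key")

# The same operation spelled out with pd.merge.
via_merge = pd.merge(df, other, left_on="key", right_index=True, how="left")

print(via_join)
print(via_join.equals(via_merge))
```

Note that how='left' has to be passed explicitly to pd.merge, because its default is an inner join while the default of join is a left join.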

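Summary

pd.merge and DataFrame.join both provide SQL-style table joins. pd.merge can join on columns as well as on indexes, which makes it very flexible, but it joins only two objects at a time. DataFrame.join attaches the columns of the Series or DataFrames that follow onto the calling DataFrame, so it can take several objects at once; the limitation, compared with pd.merge, is that those objects can only be joined on their index.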
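As an illustration of the multi-object case, the sketch below joins two extra frames in one join call and repeats the same work with two chained merges. The frames base, extra_b and extra_c are invented for this example. When a list is passed, join works on the index only, so the key is placed in the index of every frame.

```python
import pandas as pd

base = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
extra_b = pd.DataFrame({"B": ["B0", "B1"]}, index=["K0", "K1"])
extra_c = pd.DataFrame({"C": ["C0", "C2"]}, index=["K0", "K2"])

# One call: join a list of frames onto `base` by index (left join by default).
joined = base.join([extra_b, extra_c])

# The same result with pd.merge takes one call per additional frame.
step_one = pd.merge(base, extra_b, left_index=True, right_index=True, how="left")
merged = pd.merge(step_one, extra_c, left_index=True, right_index=True, how="left")

print(joined)
print(merged)
```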
