pandas 合并数据集 pd.concat(), pd.merge(), ,data1.append(data2)

重点内容几种常用的合并数据集的方法:
1) pd.concat ([data1,data2,…], axis=1(or 0), keys=[‘key1’, ‘key2’,…], names=[‘upper’,’lower’,…], ignore_index=True/False,…)

2) pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None)

3) data1.append(data2): 不推荐用,用pd.concat替代


  • 作用区别:
    pd.concat() 主要是沿着一条轴(axis=0 或1)将多个对象堆叠在一起;
    pd.merege() 根据一个或多个键将不同DataFrame中的行连接起来,类似SQL连接操作

举例:
数据集data1和data2:

data1:
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
data1:
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches

方法和参数选择:
- pd.merge():
1) pd.merge(data1,data2, how=’inner’,on=’subject_id’)

pd.merge(data1,data2, how='inner',on='subject_id')
Out[31]: 
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black

2) pd.merge(data1,data2, how=’right’,on=’subject_id’)

pd.merge(data1,data2, how='right',on='subject_id')
Out[34]: 
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black
2          6          NaN         NaN         Bran     Balwner
3          7          NaN         NaN        Bryce       Brice
4          8          NaN         NaN        Betty      Btisan

- pd.concat():
1) pd.concat([data1,data2],axis=1,ignore_index=True)

all_data_col = pd.concat([data1,data2],axis=1,ignore_index=True)
all_data_col
Out[40]: 
   0       1         2  3      4        5
0  1    Alex  Anderson  4  Billy   Bonder
1  2     Amy  Ackerman  5  Brian    Black
2  3   Allen       Ali  6   Bran  Balwner
3  4   Alice      Aoni  7  Bryce    Brice
4  5  Ayoung   Atiches  8  Betty   Btisan

2) pd.concat([data1,data2],axis=0)

pd.concat([data1,data2],axis=0)
all_data_col
Out[38]: 
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan--------
  • data1.append(data2):
all_data = data1.append(data2)
all_data
Out[14]: 
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
Less efficient:
    >>> df = pd.DataFrame(columns=['A'])
    >>> for i in range(5):
    ...     df = df.append({'A'}: i}, ignore_index=True)
    >>> df
       A
    0  0
    1  1
    2  2
    3  3
    4  4

    More efficient:
    >>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],ignore_index=True)
       A
    0  0
    1  1
    2  2
    3  3
    4  4

help()代码信息:

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
    Concatenate pandas objects along a particular axis with optional set logic
    along the other axes.
Parameters
    ----------
    objs : a sequence or mapping of Series, DataFrame, or Panel objects
        If a dict is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which case a ValueError will be raised
    axis : {0/'index', 1/'columns'}, default 0
        The axis to concatenate along
    join : {'inner', 'outer'}, default 'outer'
        How to handle indexes on other axis(es)
    join_axes : list of Index objects
        Specific indexes to use for the other n - 1 axes instead of performing
        inner/outer set logic
    ignore_index : boolean, default False
        If True, do not use the index values along the concatenation axis. The
        resulting axis will be labeled 0, ..., n - 1. This is useful if you are
        concatenating objects where the concatenation axis does not have
        meaningful indexing information. Note the index values on the other
        axes are still respected in the join.
    keys : sequence, default None
        If multiple levels passed, should contain tuples. Construct
        hierarchical index using the passed keys as the outermost level
    levels : list of sequences, default None
        Specific levels (unique values) to use for constructing a
        MultiIndex. Otherwise they will be inferred from the keys
    names : list, default None
        Names for the levels in the resulting hierarchical index
    verify_integrity : boolean, default False
        Check whether the new concatenated axis contains duplicates. This can
        be very expensive relative to the actual data concatenation
    copy : boolean, default True
        If False, do not copy data unnecessarily
  Returns
    -------
    concatenated : object, type of objs
        When concatenating all ``Series`` along the index (axis=0), a
        ``Series`` is returned. When ``objs`` contains at least one
        ``DataFrame``, a ``DataFrame`` is returned. When concatenating along
        the columns (axis=1), a ``DataFrame`` is returned.
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
    Merge DataFrame objects by performing a database-style join operation by
    columns or indexes.

    If joining columns on columns, the DataFrame indexes *will be
    ignored*. Otherwise if joining indexes on indexes or indexes on a column or
    columns, the index will be passed on.

    Parameters
    ----------
    left : DataFrame
    right : DataFrame
    how : {'left', 'right', 'outer', 'inner'}, default 'inner'
        * left: use only keys from left frame, similar to a SQL left outer join;
          preserve key order
        * right: use only keys from right frame, similar to a SQL right outer join;
          preserve key order
        * outer: use union of keys from both frames, similar to a SQL full outer
          join; sort keys lexicographically
        * inner: use intersection of keys from both frames, similar to a SQL inner
          join; preserve the order of the left keys
    on : label or list
        Field names to join on. Must be found in both DataFrames. If on is
        None and not merging on indexes, then it merges on the intersection of
        the columns by default.
    left_on : label or list, or array-like
        Field names to join on in left DataFrame. Can be a vector or list of
        vectors of the length of the DataFrame to use a particular vector as
        the join key instead of columns
    right_on : label or list, or array-like
        Field names to join on in right DataFrame or vector/list of vectors per
        left_on docs
    left_index : boolean, default False
        Use the index from the left DataFrame as the join key(s). If it is a
        MultiIndex, the number of keys in the other DataFrame (either the index
        or a number of columns) must match the number of levels
    right_index : boolean, default False
        Use the index from the right DataFrame as the join key. Same caveats as
        left_index
    sort : boolean, default False
        Sort the join keys lexicographically in the result DataFrame. If False,
        the order of the join keys depends on the join type (how keyword)
    suffixes : 2-length sequence (tuple, list, ...)
        Suffix to apply to overlapping column names in the left and right
        side, respectively
    copy : boolean, default True
        If False, do not copy data unnecessarily
    indicator : boolean or string, default False
        If True, adds a column to output DataFrame called "_merge" with
        information on the source of each row.
        If string, column with information on source of each row will be added to
        output DataFrame, and column will be named value of string.
        Information column is Categorical-type and takes on a value of "left_only"
        for observations whose merge key only appears in 'left' DataFrame,
        "right_only" for observations whose merge key only appears in 'right'
        DataFrame, and "both" if the observation's merge key is found in both.

你可能感兴趣的:(pandas)