Pandas Notes (2): Combining DataFrames

Contents

concat

Join(Merge)


concat

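  • Get the axis straight
    • axis = 0: concatenate along the rows, so shape[0] grows. Columns with the same name are stacked together; where a frame lacks a column, the missing cells are NaN. The result has as many rows as all the input frames combined, and its columns are the union of the input frames' column names (so shape[1] stays the same only when every frame has the same columns).
    • axis = 1: concatenate along the columns, so shape[1] grows. Rows with the same index label are placed side by side; where a frame lacks an index label, the missing cells are NaN. shape[0] stays the same only when every frame has the same index.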

Join(Merge)

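  • When a key value is duplicated, the matching rows are combined as a Cartesian product. You can join on the index or on columns, as a left, right, inner, or outer join.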

The rest of this post consists of examples.

concat

Get the axis straight

The 3 DataFrames used for concat

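import numpy as np
import pandas as pd

# df is not defined in the post; a 10x4 frame of random normals with
# columns A-D is assumed here, based on the printed output below.
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

# Split df into three row slices and rename the columns of the first one.
pieces = [df[:3], df[3:7], df[7:]]
pieces[0].columns = list('defg')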
print(len(pieces), type(pieces))

for piece in pieces:
    print(piece)
3 <class 'list'>
          d         e         f         g
0 -0.124606  0.451328  0.573751 -0.249369
1 -1.625827  0.778354 -0.584605  1.259189
2 -1.848430  0.626941 -1.403778 -2.429420
          A         B         C         D
3 -1.206565  0.277291 -2.438375  0.443205
4 -0.435070  1.233779  0.264640  0.820606
5  0.313668 -0.915840 -1.076644 -0.043687
6  0.020860 -0.275683  0.312347  0.756234
          A         B         C         D
7  0.740609 -0.408475  0.916659  0.157059
8  0.524124  0.464926  0.276881 -1.824904
9  1.633787  0.760564  1.616592  0.662823

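axis = 0: concatenate along the rows, so shape[0] grows. The new DataFrame has as many rows as all the input frames combined, and its columns are the union of their column names. Columns with the same name are stacked together; when a frame does not have a given column, the corresponding cells are filled with NaN. shape[1] therefore stays the same only when every frame has the same columns.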

df_concat = pd.concat(pieces, axis=0)
df_concat
          d         e         f         g         A         B         C         D
0 -0.124606  0.451328  0.573751 -0.249369       NaN       NaN       NaN       NaN
1 -1.625827  0.778354 -0.584605  1.259189       NaN       NaN       NaN       NaN
2 -1.848430  0.626941 -1.403778 -2.429420       NaN       NaN       NaN       NaN
3       NaN       NaN       NaN       NaN -1.206565  0.277291 -2.438375  0.443205
4       NaN       NaN       NaN       NaN -0.435070  1.233779  0.264640  0.820606
5       NaN       NaN       NaN       NaN  0.313668 -0.915840 -1.076644 -0.043687
6       NaN       NaN       NaN       NaN  0.020860 -0.275683  0.312347  0.756234
7       NaN       NaN       NaN       NaN  0.740609 -0.408475  0.916659  0.157059
8       NaN       NaN       NaN       NaN  0.524124  0.464926  0.276881 -1.824904
9       NaN       NaN       NaN       NaN  1.633787  0.760564  1.616592  0.662823
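The row and column counts described for axis=0 can be checked directly. This is a minimal sketch, assuming a stand-in `df` built from `np.arange` (the post's own `df` holds random values and is not shown being created), so the names and numbers here are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the post's df: 10 rows, columns A-D.
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

pieces = [df[:3], df[3:7], df[7:]]
pieces[0].columns = list("defg")  # first piece no longer shares any column name

stacked = pd.concat(pieces, axis=0)

# Rows: sum of the rows of all pieces. Columns: union of all column names.
total_rows = sum(len(p) for p in pieces)
all_columns = set().union(*(p.columns for p in pieces))

print(stacked.shape)                      # shape of the concatenated frame
print(total_rows, len(all_columns))       # should match the shape above
print(int(stacked["A"].isna().sum()))     # NaN cells contributed by the renamed piece
```

The three rows that came from the renamed piece are the only ones with NaN in column A, which is the filling behaviour described above.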

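axis = 1: concatenate along the columns, so shape[1] grows. Rows with the same index label are placed side by side; where a frame lacks an index label, the missing cells are NaN.

The 3 DataFrames being concatenated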

pieces = [df[:3], df[3:7], df[7:]]
pieces[1].index = range(0, 4)
for piece in pieces:
    print(piece)
          A         B         C         D
0  0.944063  0.498505  0.487059 -0.941936
1 -1.018475  0.920059  2.106816  1.104312
2  0.184466 -1.754729 -0.656002 -2.230612
          A         B         C         D
0 -0.270285  0.031730  0.100857  0.005652
1 -0.090587 -0.111332 -0.350779  0.013985
2 -1.213148  2.003474 -2.005585 -0.205235
3  0.084784 -1.076278  0.061052  0.432984
          A         B         C         D
7  0.317298 -0.734991  0.462587  1.468966
8  0.574835  1.248497  1.325436 -0.543574
9  0.597899  1.683313 -0.566700 -0.014385

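With axis=1, frames are concatenated along the columns and rows with the same index label are placed side by side. The new DataFrame has as many columns as all the input frames combined, and its number of rows is the size of the union of all index labels. Cells with no match are filled with NaN.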

pd_concat = pd.concat(pieces, axis=1)
pd_concat
          A         B         C         D         A         B         C         D         A         B         C         D
0  0.944063  0.498505  0.487059 -0.941936 -0.270285  0.031730  0.100857  0.005652       NaN       NaN       NaN       NaN
1 -1.018475  0.920059  2.106816  1.104312 -0.090587 -0.111332 -0.350779  0.013985       NaN       NaN       NaN       NaN
2  0.184466 -1.754729 -0.656002 -2.230612 -1.213148  2.003474 -2.005585 -0.205235       NaN       NaN       NaN       NaN
3       NaN       NaN       NaN       NaN  0.084784 -1.076278  0.061052  0.432984       NaN       NaN       NaN       NaN
7       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.317298 -0.734991  0.462587  1.468966
8       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.574835  1.248497  1.325436 -0.543574
9       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.597899  1.683313 -0.566700 -0.014385
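The same kind of check works for axis=1. The sketch below again assumes a stand-in `df` built from `np.arange`, and additionally uses the `keys` argument of `pd.concat` with made-up labels (`p0`, `p1`, `p2`) as one possible way to tell the repeated A-D columns apart; that part is an addition to the post, not something it does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the post's df: 10 rows, columns A-D.
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list("ABCD"))

pieces = [df[:3], df[3:7], df[7:]]
pieces[1].index = range(0, 4)  # relabel the middle piece so it overlaps the first

wide = pd.concat(pieces, axis=1)

# Columns: sum of the columns of all pieces. Rows: union of all index labels.
total_columns = sum(p.shape[1] for p in pieces)
all_labels = set().union(*(p.index for p in pieces))

print(wide.shape)
print(total_columns, sorted(all_labels))

# Column names repeat (A-D three times). Illustrative labels passed via keys=
# turn the columns into a MultiIndex so each piece can be selected by name.
wide_keyed = pd.concat(pieces, axis=1, keys=["p0", "p1", "p2"])
print(wide_keyed["p1"].dropna())
```

Selecting `wide_keyed["p1"]` returns only the four columns that came from the middle piece, which is harder to do when the column labels are plain duplicates.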

Join(Merge)  


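When a key value is duplicated, the matching rows are combined as a Cartesian product. You can join on the index or on columns, as a left, right, inner, or outer join.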

left = pd.DataFrame({'lkey': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'rkey': ['foo', 'foo', 'op'], 'rval': [4, 5, 6]})
print(left)
print('=' * 48)
print(right)
print('=' * 48)
print('=' * 10, "left Join right on columns", '=' * 10)
print(pd.merge(left, right, left_on='lkey', right_on='rkey', how='left'))
print('=' * 48)
print('=' * 10, "right Join left on columns", '=' * 10)
print(pd.merge(right, left, left_on='rkey', right_on='lkey', how='left', indicator=True))
print('=' * 48)
print('=' * 11, "right Join left on index", '=' * 11)
print(pd.merge(right, left, left_index=True, right_index=True, how='left', indicator=True))
  lkey  lval
0  foo     1
1  foo     2
================================================
  rkey  rval
0  foo     4
1  foo     5
2   op     6
================================================
========== left Join right on columns ==========
  lkey  lval rkey  rval
0  foo     1  foo     4
1  foo     1  foo     5
2  foo     2  foo     4
3  foo     2  foo     5
================================================
========== right Join left on columns ==========
  rkey  rval lkey  lval     _merge
0  foo     4  foo   1.0       both
1  foo     4  foo   2.0       both
2  foo     5  foo   1.0       both
3  foo     5  foo   2.0       both
4   op     6  NaN   NaN  left_only
================================================
=========== right Join left on index ===========
  rkey  rval lkey  lval     _merge
0  foo     4  foo   1.0       both
1  foo     5  foo   2.0       both
2   op     6  NaN   NaN  left_only
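The Cartesian-product behaviour in the column joins above can be verified by counting: each key contributes (occurrences on the left) × (occurrences on the right) rows to an inner join. The sketch below reuses the post's `left` and `right` frames; the `validate="one_to_one"` check at the end is an optional extra, shown as one way to catch unintended duplicate keys.

```python
import pandas as pd

left = pd.DataFrame({"lkey": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"rkey": ["foo", "foo", "op"], "rval": [4, 5, 6]})

merged = pd.merge(left, right, left_on="lkey", right_on="rkey", how="inner")

# Per-key counts on each side; keys missing on one side become NaN and are dropped.
left_counts = left["lkey"].value_counts()
right_counts = right["rkey"].value_counts()
expected_rows = int((left_counts * right_counts).dropna().sum())

print(len(merged), expected_rows)

# validate= makes merge raise if the keys are not unique on both sides.
try:
    pd.merge(left, right, left_on="lkey", right_on="rkey", validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print(keys_unique)
```

Here 'foo' appears twice on each side, so it alone produces 2 × 2 rows, while 'op' has no partner on the left and drops out of the inner join.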
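
The same example, with identical column names in both frames. Overlapping non-key columns get the suffixes _x and _y in the result.

left = pd.DataFrame({'key': ['foo', 'foo'], 'val': [1, 2]})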
right = pd.DataFrame({'key': ['foo', 'foo', 'op'], 'val': [4, 5, 6]})
print(left)
print('=' * 48)
print(right)
print('=' * 48)
print('=' * 10, "left Join right on columns", '=' * 10)
print(pd.merge(left, right, on='key', how='left'))
print('=' * 48)
print('=' * 11, 'left Join right on index', '=' * 11)
print(pd.merge(left, right, left_index=True, right_index=True, how='left', indicator=True))
   key  val
0  foo    1
1  foo    2
================================================
   key  val
0  foo    4
1  foo    5
2   op    6
================================================
========== left Join right on columns ==========
   key  val_x  val_y
0  foo      1      4
1  foo      1      5
2  foo      2      4
3  foo      2      5
================================================
=========== left Join right on index ===========
  key_x  val_x key_y  val_y _merge
0   foo      1   foo      4   both
1   foo      2   foo      5   both
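The examples above only use how='left'. As a rough sketch of the other join types mentioned earlier, the block below runs the same column merge with each `how` value and compares row counts; the `suffixes` values at the end are arbitrary names chosen for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "val": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo", "op"], "val": [4, 5, 6]})

# Row count of the merge result for each join type.
rows = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ("left", "right", "inner", "outer")
}
print(rows)

# The default _x/_y suffixes can be replaced with more descriptive ones.
renamed = pd.merge(left, right, on="key", how="outer", suffixes=("_left", "_right"))
print(list(renamed.columns))
```

The 'op' row exists only in `right`, so it shows up in the right and outer joins and is absent from the left and inner joins.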

 
