Contents
concat
Join(Merge)
Everything below is worked examples.
Getting axis straight
The three DataFrames used for concat:
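import numpy as np
import pandas as pd
# Assumed setup (df is not defined in the original): 10 random rows, columns A-D
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
pieces = [df[:3], df[3:7], df[7:]]
pieces[0].columns = list('defg')  # rename the first piece's columns
print(len(pieces), type(pieces))
for piece in pieces:
    print(piece)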
3 <class 'list'>
d e f g
0 -0.124606 0.451328 0.573751 -0.249369
1 -1.625827 0.778354 -0.584605 1.259189
2 -1.848430 0.626941 -1.403778 -2.429420
A B C D
3 -1.206565 0.277291 -2.438375 0.443205
4 -0.435070 1.233779 0.264640 0.820606
5 0.313668 -0.915840 -1.076644 -0.043687
6 0.020860 -0.275683 0.312347 0.756234
A B C D
7 0.740609 -0.408475 0.916659 0.157059
8 0.524124 0.464926 0.276881 -1.824904
9 1.633787 0.760564 1.616592 0.662823
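axis = 0: concatenate along rows, so shape[0] grows. Columns with the same name are lined up under each other. The result has as many rows as all input DataFrames combined, and its columns are the union of all input column names, so shape[1] stays the same only when every DataFrame has the same columns. Where a DataFrame does not have a given column, the corresponding positions are filled with NaN.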
df_concat = pd.concat(pieces, axis=0)
df_concat
| | d | e | f | g | A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.124606 | 0.451328 | 0.573751 | -0.249369 | NaN | NaN | NaN | NaN |
| 1 | -1.625827 | 0.778354 | -0.584605 | 1.259189 | NaN | NaN | NaN | NaN |
| 2 | -1.848430 | 0.626941 | -1.403778 | -2.429420 | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | -1.206565 | 0.277291 | -2.438375 | 0.443205 |
| 4 | NaN | NaN | NaN | NaN | -0.435070 | 1.233779 | 0.264640 | 0.820606 |
| 5 | NaN | NaN | NaN | NaN | 0.313668 | -0.915840 | -1.076644 | -0.043687 |
| 6 | NaN | NaN | NaN | NaN | 0.020860 | -0.275683 | 0.312347 | 0.756234 |
| 7 | NaN | NaN | NaN | NaN | 0.740609 | -0.408475 | 0.916659 | 0.157059 |
| 8 | NaN | NaN | NaN | NaN | 0.524124 | 0.464926 | 0.276881 | -1.824904 |
| 9 | NaN | NaN | NaN | NaN | 1.633787 | 0.760564 | 1.616592 | 0.662823 |
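The row and column counts of an axis=0 concat can be checked directly. The following is a minimal sketch, assuming `df` is a 10x4 DataFrame of random values (the original does not show how `df` is built); the names `stacked`, `total_rows`, and `all_columns` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed data: 10 rows of random numbers with columns A-D.
df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))

# Copy the slices so that renaming columns never touches the original df.
pieces = [df[:3].copy(), df[3:7].copy(), df[7:].copy()]
pieces[0].columns = list("defg")

stacked = pd.concat(pieces, axis=0)

# Rows add up; columns are the union of all column names.
total_rows = sum(len(p) for p in pieces)
all_columns = set().union(*(p.columns for p in pieces))
print(stacked.shape)  # (10, 8)
print(stacked.shape == (total_rows, len(all_columns)))

# Rows 0-2 came from the renamed piece, so columns A-D are all NaN there.
print(stacked.loc[0:2, list("ABCD")].isna().all().all())
```

Copying each slice is a defensive choice rather than something the original does; it keeps the column rename from interacting with `df`.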
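axis = 1: concatenate along columns. Rows with the same index label are placed side by side, so shape[1] grows; shape[0] stays the same only when every DataFrame has the same index. Index labels missing from a DataFrame are filled with NaN.
The three DataFrames to concatenate: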
pieces = [df[:3], df[3:7], df[7:]]
pieces[1].index = range(0, 4)  # relabel rows 3-6 as 0-3 so they overlap with the first piece
for piece in pieces:
    print(piece)
A B C D
0 0.944063 0.498505 0.487059 -0.941936
1 -1.018475 0.920059 2.106816 1.104312
2 0.184466 -1.754729 -0.656002 -2.230612
A B C D
0 -0.270285 0.031730 0.100857 0.005652
1 -0.090587 -0.111332 -0.350779 0.013985
2 -1.213148 2.003474 -2.005585 -0.205235
3 0.084784 -1.076278 0.061052 0.432984
A B C D
7 0.317298 -0.734991 0.462587 1.468966
8 0.574835 1.248497 1.325436 -0.543574
9 0.597899 1.683313 -0.566700 -0.014385
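With axis=1, rows with the same index label are joined side by side. The number of columns in the new DataFrame is the sum of the column counts of all inputs, and the number of rows is the size of the union of all index labels. Positions with no matching label are filled with NaN.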
pd_concat = pd.concat(pieces, axis=1)
pd_concat
| | A | B | C | D | A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.944063 | 0.498505 | 0.487059 | -0.941936 | -0.270285 | 0.031730 | 0.100857 | 0.005652 | NaN | NaN | NaN | NaN |
| 1 | -1.018475 | 0.920059 | 2.106816 | 1.104312 | -0.090587 | -0.111332 | -0.350779 | 0.013985 | NaN | NaN | NaN | NaN |
| 2 | 0.184466 | -1.754729 | -0.656002 | -2.230612 | -1.213148 | 2.003474 | -2.005585 | -0.205235 | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | 0.084784 | -1.076278 | 0.061052 | 0.432984 | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.317298 | -0.734991 | 0.462587 | 1.468966 |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.574835 | 1.248497 | 1.325436 | -0.543574 |
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.597899 | 1.683313 | -0.566700 | -0.014385 |
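The same kind of check works for axis=1. This sketch again assumes a 10x4 random `df`, which the original does not define. It also shows the `keys` argument of `pd.concat` as one possible way to tell the repeated A-D columns apart; the labels `first`, `second`, and `third` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumed data: 10 rows of random numbers with columns A-D.
df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
pieces = [df[:3].copy(), df[3:7].copy(), df[7:].copy()]
pieces[1].index = range(0, 4)  # relabel rows 3-6 as 0-3

side_by_side = pd.concat(pieces, axis=1)

# Columns add up; rows are the union of all index labels.
all_labels = set().union(*(p.index for p in pieces))
total_cols = sum(p.shape[1] for p in pieces)
print(len(all_labels), total_cols)  # 7 12
print(side_by_side.shape)           # (7, 12)

# The result has three columns named "A". Passing keys adds an outer
# column level, so each piece can be selected by its label.
labelled = pd.concat(pieces, axis=1, keys=["first", "second", "third"])
print(labelled["second"].shape)     # (7, 4)
```

Without `keys`, selecting `side_by_side["A"]` returns all three columns named A, which is easy to overlook.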
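merge behaves like a SQL join: when a key value appears more than once on both sides, the result contains the Cartesian product of the matching rows. You can join on the index or on columns, and the join can be a left, right, inner, or outer join.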
left = pd.DataFrame({'lkey': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'rkey': ['foo', 'foo', 'op'], 'rval': [4, 5, 6]})
print(left)
print('=' * 48)
print(right)
print('=' * 48)
print('=' * 10, "left Join right on columns", '=' * 10)
print(pd.merge(left, right, left_on='lkey', right_on='rkey', how='left'))
print('=' * 48)
print('=' * 10, "right Join left on columns", '=' * 10)
print(pd.merge(right, left, left_on='rkey', right_on='lkey', how='left', indicator=True))
print('=' * 48)
print('=' * 11, "right Join left on index", '=' * 11)
print(pd.merge(right, left, left_index=True, right_index=True, how='left', indicator=True))
lkey lval
0 foo 1
1 foo 2
================================================
rkey rval
0 foo 4
1 foo 5
2 op 6
================================================
========== left Join right on columns ==========
lkey lval rkey rval
0 foo 1 foo 4
1 foo 1 foo 5
2 foo 2 foo 4
3 foo 2 foo 5
================================================
========== right Join left on columns ==========
rkey rval lkey lval _merge
0 foo 4 foo 1.0 both
1 foo 4 foo 2.0 both
2 foo 5 foo 1.0 both
3 foo 5 foo 2.0 both
4 op 6 NaN NaN left_only
================================================
=========== right Join left on index ===========
rkey rval lkey lval _merge
0 foo 4 foo 1.0 both
1 foo 5 foo 2.0 both
2 op 6 NaN NaN left_only
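The four rows in the column-based join above come from the Cartesian product of duplicated keys: `foo` appears twice on each side, giving 2 x 2 matches. A small sketch of that count follows. The `validate` argument shown afterwards is an optional safeguard in `pd.merge` that is not used in the original; `expected_rows` and `raised` are names introduced for illustration.

```python
import pandas as pd

left = pd.DataFrame({"lkey": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"rkey": ["foo", "foo", "op"], "rval": [4, 5, 6]})

merged = pd.merge(left, right, left_on="lkey", right_on="rkey", how="left")

# Count how often the key occurs on each side.
left_counts = left["lkey"].value_counts()
right_counts = right["rkey"].value_counts()

# 'foo' occurs 2 times on the left and 2 times on the right -> 2 * 2 rows.
expected_rows = int(left_counts["foo"] * right_counts["foo"])
print(expected_rows, len(merged))  # 4 4

# Optional safeguard: ask merge to fail when the keys are not unique.
raised = False
try:
    pd.merge(left, right, left_on="lkey", right_on="rkey",
             how="left", validate="one_to_one")
except pd.errors.MergeError:
    raised = True
print(raised)  # True
```

The index-based join does not multiply rows here because each index label (0, 1, 2) occurs at most once on either side.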
left = pd.DataFrame({'key': ['foo', 'foo'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo', 'op'], 'val': [4, 5, 6]})
print(left)
print('=' * 48)
print(right)
print('=' * 48)
print('=' * 10, "left Join right on columns", '=' * 10)
print(pd.merge(left, right, on='key', how='left'))
print('=' * 48)
print('=' * 11, 'left Join right on index', '=' * 11)
print(pd.merge(left, right, left_index=True, right_index=True, how='left', indicator=True))
key val
0 foo 1
1 foo 2
================================================
key val
0 foo 4
1 foo 5
2 op 6
================================================
========== left Join right on columns ==========
key val_x val_y
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
================================================
=========== left Join right on index ===========
key_x val_x key_y val_y _merge
0 foo 1 foo 4 both
1 foo 2 foo 5 both
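The examples above only use `how='left'`. As a rough comparison of the four join types, the sketch below uses a smaller, hypothetical pair of DataFrames (keys `foo`/`bar` on the left and `foo`/`op` on the right, not the data from the original) so that each join type returns a different set of rows. The `suffixes` argument replaces the default `_x`/`_y` endings on the overlapping `val` column.

```python
import pandas as pd

# Hypothetical data: only 'foo' is present on both sides.
left = pd.DataFrame({"key": ["foo", "bar"], "val": [1, 2]})
right = pd.DataFrame({"key": ["foo", "op"], "val": [4, 6]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    result = pd.merge(left, right, on="key", how=how,
                      suffixes=("_left", "_right"))
    row_counts[how] = len(result)
    print(how, sorted(result["key"]))

# inner: keys present on both sides         -> foo
# left:  every key from left                -> foo, bar
# right: every key from right               -> foo, op
# outer: every key from either side         -> foo, bar, op
print(row_counts)  # {'inner': 1, 'left': 2, 'right': 2, 'outer': 3}
```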