Merging data, similar to a join in SQL. For details, see the official documentation.
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Parameters | Type | Detail |
---|---|---|
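right | DataFrame | The DataFrame to merge with |
how | {'left', 'right', 'outer', 'inner'}, default 'inner' | 1. left: like LEFT JOIN in SQL 2. right: RIGHT JOIN 3. outer: FULL OUTER JOIN 4. inner: INNER JOIN |
on | label or list | Column name(s) to join on; they must exist in both DataFrames. Defaults to the columns the two DataFrames have in common |
left_on | label or list | Column name(s) in the left table to use as the join key |
right_on | label or list | Column name(s) in the right table to use as the join key |
left_index | boolean, default False | Use the left table's index as the join key; defaults to False and is rarely needed |
right_index | boolean, default False | Use the right table's index as the join key; defaults to False and is rarely needed |
sort | boolean, default False | Whether to sort the result by the join key |
suffixes | 2-length sequence (tuple, list, …) | Suffixes appended to overlapping column names from the left and right tables |
copy | boolean, default True | If False, avoid copying data unnecessarily |
indicator | boolean or string, default False | If True, a column named '_merge' is added to the result (if a string is passed instead of True, that string is used as the column name). Its values are {left_only, right_only, both}, i.e. which table the row came from |
validate | string, default None | If specified, checks the type of the merge: 1. "one_to_one" or "1:1": check that the merge keys are unique in both tables 2. "one_to_many" or "1:m": check that the keys are unique in the left table 3. "many_to_one" or "m:1": check that the keys are unique in the right table 4. "many_to_many" or "m:m": accepted, but no check is performed. New in version 0.21.0 |
Returns | DataFrame | The result is also a DataFrame |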
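The `validate` row in the table is easiest to understand by trying each mode. The following is a minimal sketch, assuming pandas 0.21.0 or later; the helper `try_merge` is a hypothetical name introduced only for this illustration, and the frames repeat the key columns used elsewhere in this post.

```python
import pandas as pd
from pandas.errors import MergeError

# Both key columns contain a duplicate: "foo" on the left, "bar" on the right.
left = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4]})
right = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8]})


def try_merge(validate):
    """Hypothetical helper: True if the merge passes the validate check."""
    try:
        left.merge(right, left_on="leftkey", right_on="rightkey", validate=validate)
        return True
    except MergeError:
        return False


modes = ["one_to_one", "one_to_many", "many_to_one", "many_to_many"]
results = {mode: try_merge(mode) for mode in modes}
print(results)
```

Because neither key column is unique, only "many_to_many" should pass here; the other three modes raise `MergeError`.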
We use two data sets to test the actual effect of the different parameters.
import pandas as pd
A = pd.DataFrame({'leftkey': ['foo','bar','baz','foo'],"value":[1,2,3,4],"x":[4,3,2,1]})
B = pd.DataFrame({'rightkey': ['foo','bar','qux','bar'],"value":[5,6,7,8],"x":[4,3,2,1]})
print(A)
print(B)
leftkey value x
0 foo 1 4
1 bar 2 3
2 baz 3 2
3 foo 4 1
rightkey value x
0 foo 5 4
1 bar 6 3
2 qux 7 2
3 bar 8 1
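## The on parameter requires a column with the same name in both tables
## After the merge, overlapping non-key columns are marked with the _x / _y suffixes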
A.merge(right = B,how = "left",on = "x")
| | leftkey | value_x | x | rightkey | value_y |
---|---|---|---|---|---|
0 | foo | 1 | 4 | foo | 5 |
1 | bar | 2 | 3 | bar | 6 |
2 | baz | 3 | 2 | qux | 7 |
3 | foo | 4 | 1 | bar | 8 |
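The default `_x` / `_y` markers can be replaced through the `suffixes` parameter. A small sketch using the same two frames; the suffix strings `_left` and `_right` are arbitrary choices for illustration.

```python
import pandas as pd

A = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4], "x": [4, 3, 2, 1]})
B = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8], "x": [4, 3, 2, 1]})

# "value" exists in both frames and is not the join key, so it receives the suffixes.
merged = A.merge(B, how="left", on="x", suffixes=("_left", "_right"))
print(merged.columns.tolist())
# ['leftkey', 'value_left', 'x', 'rightkey', 'value_right']
```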
## Left join; the right join works the same way and is not repeated here
A.merge(right = B,how = "left",left_on = "leftkey",right_on = "rightkey")
| | leftkey | value_x | x_x | rightkey | value_y | x_y |
---|---|---|---|---|---|---|
0 | foo | 1 | 4 | foo | 5.0 | 4.0 |
1 | bar | 2 | 3 | bar | 6.0 | 3.0 |
2 | bar | 2 | 3 | bar | 8.0 | 1.0 |
3 | baz | 3 | 2 | NaN | NaN | NaN |
4 | foo | 4 | 1 | foo | 5.0 | 4.0 |
## Inner join: keep only the keys present in both tables (the intersection)
A.merge(right = B,how = "inner",left_on = "leftkey",right_on = "rightkey")
| | leftkey | value_x | x_x | rightkey | value_y | x_y |
---|---|---|---|---|---|---|
0 | foo | 1 | 4 | foo | 5 | 4 |
1 | foo | 4 | 1 | foo | 5 | 4 |
2 | bar | 2 | 3 | bar | 6 | 3 |
3 | bar | 2 | 3 | bar | 8 | 1 |
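The `left_index` and `right_index` parameters are not exercised in the examples here. As a rough sketch, the same inner join can be expressed on the index after moving the key columns into it; the names `A_idx` and `B_idx` are introduced only for this illustration.

```python
import pandas as pd

A = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4], "x": [4, 3, 2, 1]})
B = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8], "x": [4, 3, 2, 1]})

# Move the key columns into the index, then join index to index.
A_idx = A.set_index("leftkey")
B_idx = B.set_index("rightkey")
by_index = A_idx.merge(B_idx, how="inner", left_index=True, right_index=True)

print(by_index)
```

The key now lives in the index of the result instead of in `leftkey` / `rightkey` columns, and the overlapping columns `value` and `x` still receive the `_x` / `_y` suffixes.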
## Outer join: keep the keys from either table (the union)
A.merge(right = B,how = "outer",left_on = "leftkey",right_on = "rightkey")
| | leftkey | value_x | x_x | rightkey | value_y | x_y |
---|---|---|---|---|---|---|
0 | foo | 1.0 | 4.0 | foo | 5.0 | 4.0 |
1 | foo | 4.0 | 1.0 | foo | 5.0 | 4.0 |
2 | bar | 2.0 | 3.0 | bar | 6.0 | 3.0 |
3 | bar | 2.0 | 3.0 | bar | 8.0 | 1.0 |
4 | baz | 3.0 | 2.0 | NaN | NaN | NaN |
5 | NaN | NaN | NaN | qux | 7.0 | 2.0 |
## sort = True sorts the result by the join key in lexicographic (alphabetical) order
A.merge(right = B,how = "left",left_on = "leftkey",right_on = "rightkey",sort = True)
| | leftkey | value_x | x_x | rightkey | value_y | x_y |
---|---|---|---|---|---|---|
0 | bar | 2 | 3 | bar | 6.0 | 3.0 |
1 | bar | 2 | 3 | bar | 8.0 | 1.0 |
2 | baz | 3 | 2 | NaN | NaN | NaN |
3 | foo | 1 | 4 | foo | 5.0 | 4.0 |
4 | foo | 4 | 1 | foo | 5.0 | 4.0 |
## indicator = True adds a column named _merge that marks whether each row exists only in the left table, only in the right table, or in both
A.merge(right = B,how = "outer",left_on = "leftkey",right_on = "rightkey",sort = True,indicator=True)
| | leftkey | value_x | x_x | rightkey | value_y | x_y | _merge |
---|---|---|---|---|---|---|---|
0 | bar | 2.0 | 3.0 | bar | 6.0 | 3.0 | both |
1 | bar | 2.0 | 3.0 | bar | 8.0 | 1.0 | both |
2 | baz | 3.0 | 2.0 | NaN | NaN | NaN | left_only |
3 | foo | 1.0 | 4.0 | foo | 5.0 | 4.0 | both |
4 | foo | 4.0 | 1.0 | foo | 5.0 | 4.0 | both |
5 | NaN | NaN | NaN | qux | 7.0 | 2.0 | right_only |
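One practical use of the `_merge` column is to pick out rows that found no match on the other side (an "anti-join"). A minimal sketch using the same two frames:

```python
import pandas as pd

A = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4], "x": [4, 3, 2, 1]})
B = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8], "x": [4, 3, 2, 1]})

merged = A.merge(B, how="outer", left_on="leftkey", right_on="rightkey", indicator=True)

# Rows whose key appears in only one of the two tables.
left_only = merged[merged["_merge"] == "left_only"]
right_only = merged[merged["_merge"] == "right_only"]

print(left_only["leftkey"].tolist())    # ['baz']
print(right_only["rightkey"].tolist())  # ['qux']
```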
pd.merge_ordered(A1,B1,left_by='leftkey',fill_method='ffill',how="outer")
| | leftkey | value | x1 | rightkey | y1 |
---|---|---|---|---|---|
0 | foo | 1 | 1 | NaN | NaN |
1 | foo | 4 | 4 | NaN | NaN |
2 | foo | 5 | 4 | foo | 4.0 |
3 | foo | 6 | 4 | bar | 3.0 |
4 | foo | 7 | 4 | qux | 2.0 |
5 | foo | 8 | 4 | bar | 1.0 |
6 | bar | 2 | 3 | NaN | NaN |
7 | bar | 5 | 3 | foo | 4.0 |
8 | bar | 6 | 3 | bar | 3.0 |
9 | bar | 7 | 3 | qux | 2.0 |
10 | bar | 8 | 3 | bar | 1.0 |
11 | baz | 3 | 2 | NaN | NaN |
12 | baz | 5 | 2 | foo | 4.0 |
13 | baz | 6 | 2 | bar | 3.0 |
14 | baz | 7 | 2 | qux | 2.0 |
15 | baz | 8 | 2 | bar | 1.0 |
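`A1` and `B1` are not defined in this post, so the call above cannot be rerun as written. As a separate, self-contained sketch of what `left_by` and `fill_method='ffill'` do, the example below uses two hypothetical frames, `df1` and `df2`, patterned on the example in the pandas documentation: the ordered merge is performed once per value of the `left_by` column, and gaps are forward-filled within each group.

```python
import pandas as pd

# Hypothetical data: df1 has two groups ("a" and "b"), each with keys a, c, e.
df1 = pd.DataFrame({
    "key": ["a", "c", "e", "a", "c", "e"],
    "lvalue": [1, 2, 3, 1, 2, 3],
    "group": ["a", "a", "a", "b", "b", "b"],
})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "rvalue": [1, 2, 3]})

# Outer merge ordered by "key", repeated for each group of df1, with forward fill.
result = pd.merge_ordered(df1, df2, fill_method="ffill", left_by="group")
print(result)
```

Each group ends up with the union of the keys (a, b, c, d, e), so the result has ten rows. Only the first row of each group keeps a NaN in `rvalue`, because there is nothing earlier in the group to fill from.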
Simple row-wise and column-wise concatenation. For details, see the official documentation.
pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Parameters | Type | Detail |
---|---|---|
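objs | DataFrame, Series, Panel objects | The objects to concatenate |
axis | {0/'index', 1/'columns'}, default 0 | Direction of concatenation: 0 stacks rows (rbind in R), 1 stacks columns (cbind in R) |
join | {'inner', 'outer'}, default 'outer' | How to handle the labels on the other axis; demonstrated in the experiments that follow |
join_axes | list of Index objects | Specific indexes to use for the other axes instead of the inner/outer set logic (deprecated in pandas 0.25 and removed in 1.0) |
ignore_index | boolean, default False | Whether to discard the index along the concatenation axis. The default False keeps the original labels; True relabels the result 0 to n-1 |
keys | sequence, default None | Labels for the outermost level of a hierarchical index; in effect, tags each part of the result with the object it came from |
levels | list of sequences, default None | Specific levels (unique values) to use when building the MultiIndex; otherwise they are inferred from keys |
names | list, default None | Names for the levels of the resulting hierarchical index |
verify_integrity | boolean, default False | Check whether the new concatenated axis contains duplicates; this can be expensive, so it is off by default |
sort | boolean, default None | Whether to sort the non-concatenation axis when it is not already aligned and join is 'outer' |
copy | boolean, default True | If False, do not copy data unnecessarily; in the tests here, setting it to False made no visible difference |
Returns | Depends on the objects being concatenated | Concatenating DataFrames returns a DataFrame |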
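To make the `verify_integrity` row concrete: both frames used in this post carry the default index 0 to 3, so stacking them produces duplicate row labels, and `verify_integrity=True` should reject that with a `ValueError`. A minimal sketch; the flag `raised` is only there to record the outcome.

```python
import pandas as pd

A = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4], "x1": [4, 3, 2, 1]})
B = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8], "x2": [4, 3, 2, 1]})

raised = False
try:
    # Row labels 0..3 occur in both frames, so the concatenated index has duplicates.
    pd.concat([A, B], axis=0, verify_integrity=True)
except ValueError as err:
    raised = True
    print("rejected:", err)

# Giving each frame its own outer label removes the overlap.
ok = pd.concat([A, B], axis=0, keys=["A", "B"], verify_integrity=True)
print(ok.index.is_unique)
```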
Testing how the parameters above behave in practice.
import pandas as pd
A = pd.DataFrame({'leftkey': ['foo','bar','baz','foo'],"value":[1,2,3,4],"x1":[4,3,2,1]})
B = pd.DataFrame({'rightkey': ['foo','bar','qux','bar'],"value":[5,6,7,8],"x2":[4,3,2,1]})
print(A)
print(B)
leftkey value x1
0 foo 1 4
1 bar 2 3
2 baz 3 2
3 foo 4 1
rightkey value x2
0 foo 5 4
1 bar 6 3
2 qux 7 2
3 bar 8 1
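Row-wise concatenation requires axis = 0. join defaults to "outer": where column names do not match, the missing values are filled with NaN. With join = "inner", only the columns common to both frames are kept.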
pd.concat([A,B],axis=0,join = "outer")
| | leftkey | rightkey | value | x1 | x2 |
---|---|---|---|---|---|
0 | foo | NaN | 1 | 4.0 | NaN |
1 | bar | NaN | 2 | 3.0 | NaN |
2 | baz | NaN | 3 | 2.0 | NaN |
3 | foo | NaN | 4 | 1.0 | NaN |
0 | NaN | foo | 5 | NaN | 4.0 |
1 | NaN | bar | 6 | NaN | 3.0 |
2 | NaN | qux | 7 | NaN | 2.0 |
3 | NaN | bar | 8 | NaN | 1.0 |
pd.concat([A,B],axis=0,join = "inner")
| | value |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
0 | 5 |
1 | 6 |
2 | 7 |
3 | 8 |
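In both results above the row labels 0 to 3 appear twice, because each frame keeps its own index. If a fresh 0 to n-1 index is wanted, `ignore_index=True` can be passed; a short sketch:

```python
import pandas as pd

A = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4], "x1": [4, 3, 2, 1]})
B = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8], "x2": [4, 3, 2, 1]})

# The default keeps the original labels; ignore_index=True renumbers the rows.
kept = pd.concat([A, B], axis=0, join="inner")
renumbered = pd.concat([A, B], axis=0, join="inner", ignore_index=True)

print(kept.index.tolist())        # [0, 1, 2, 3, 0, 1, 2, 3]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```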
Column-wise concatenation needs little explanation: just set the axis parameter.
pd.concat([A,B],axis = 1)
| | leftkey | value | x1 | rightkey | value | x2 |
---|---|---|---|---|---|---|
0 | foo | 1 | 4 | foo | 5 | 4 |
1 | bar | 2 | 3 | bar | 6 | 3 |
2 | baz | 3 | 2 | qux | 7 | 2 |
3 | foo | 4 | 1 | bar | 8 | 1 |
Adding a higher-level index with keys, somewhat like an array or list in R.
pd.concat([A, B], keys=['A', 'B'],axis = 1)
| | A | A | A | B | B | B |
---|---|---|---|---|---|---|
| | leftkey | value | x1 | rightkey | value | x2 |
0 | foo | 1 | 4 | foo | 5 | 4 |
1 | bar | 2 | 3 | bar | 6 | 3 |
2 | baz | 3 | 2 | qux | 7 | 2 |
3 | foo | 4 | 1 | bar | 8 | 1 |
pd.concat([A, B], keys=['A', 'B'],axis = 0)
| | | leftkey | rightkey | value | x1 | x2 |
---|---|---|---|---|---|---|
A | 0 | foo | NaN | 1 | 4.0 | NaN |
| | 1 | bar | NaN | 2 | 3.0 | NaN |
| | 2 | baz | NaN | 3 | 2.0 | NaN |
| | 3 | foo | NaN | 4 | 1.0 | NaN |
B | 0 | NaN | foo | 5 | NaN | 4.0 |
| | 1 | NaN | bar | 6 | NaN | 3.0 |
| | 2 | NaN | qux | 7 | NaN | 2.0 |
| | 3 | NaN | bar | 8 | NaN | 1.0 |
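The outer level created by `keys` can then be used to pull the original pieces back out. A brief sketch, assuming the same frames as above; the variable names `combined` and `part_b` are only illustrative.

```python
import pandas as pd

A = pd.DataFrame({"leftkey": ["foo", "bar", "baz", "foo"], "value": [1, 2, 3, 4], "x1": [4, 3, 2, 1]})
B = pd.DataFrame({"rightkey": ["foo", "bar", "qux", "bar"], "value": [5, 6, 7, 8], "x2": [4, 3, 2, 1]})

combined = pd.concat([A, B], keys=["A", "B"], axis=0)

# Select every row that came from B; the outer level is dropped from the result.
part_b = combined.loc["B"]
print(part_b["rightkey"].tolist())  # ['foo', 'bar', 'qux', 'bar']

# Address a single cell by (key, original row label) and column name.
print(combined.loc[("A", 2), "leftkey"])  # baz
```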
2018-06-11, Xincheng Science Park, Jianye District, Nanjing