Merging Datasets in pandas

Contents

    • 1. The merge function
    • 2. The concat function
    • 3. The combine_first function

1. The merge function

For details, see the article "[python] A Detailed Explanation of the pd.merge Function in pandas".
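As a quick reminder of what pd.merge does, here is a minimal sketch. The frames left and right and their columns are hypothetical and do not come from the referenced article; only the on and how parameters of pd.merge are used.

```python
import pandas as pd

# Hypothetical example frames, made up for illustration.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Inner join (the default): keep only keys present in both frames.
inner = pd.merge(left, right, on='key')

# Outer join: keep every key; cells without a match become NaN.
outer = pd.merge(left, right, on='key', how='outer')

print(inner)
print(outer)
```

Unlike concat, which aligns on index or column labels, merge matches rows on the values of one or more key columns.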


2. The concat function

pd.concat(
    objs: Union[Iterable[ForwardRef('NDFrame')], Mapping[Optional[Hashable], ForwardRef('NDFrame')]],
    axis=0,
    join='outer',
    ignore_index: bool = False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity: bool = False,
    sort: bool = False,
    copy: bool = True,
)

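Commonly used parameters:

    • objs: the DataFrames or Series to concatenate. Put them in a list and pass them in together; the list may hold two objects or more.
    • axis=0 (default): stack the objects vertically, one below the other. The columns are aligned by name, taking the union or intersection of the column names. With axis=1 the objects are placed side by side, and the rows are aligned by taking the union or intersection of the row index labels.
    • join='outer' (default): take the union of the labels on the other axis. Set join='inner' to take the intersection instead.
    • ignore_index=False (default): keep the original labels along the concatenation axis. When set to True, they are replaced by 0, 1, 2, ...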

A few examples show how the function is used.

First, create two DataFrames:

import pandas as pd

# None in a numeric column is stored as NaN, so age and height become float
data1 = {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy', 'devn', 'elov'],
         'age': [18, 29, 32, 28, 34, 19, None],
         'sex': ['male', 'female', 'male', 'male', 'female', 'male', 'female'],
         'weight': [67, 78, 87, 59, 90, 101, 78],
         'height': [170, 189, 190, 179, None, 160, 185]}
df1 = pd.DataFrame(data1, index=['a', 'i', 'c', 'd', 'h', 'f', 'g'])
df1

Output: df1

    name   age     sex  weight  height
a  apolo  18.0    male      67   170.0
i    adm  29.0  female      78   189.0
c  bolon  32.0    male      87   190.0
d    ali  28.0    male      59   179.0
h  cathy  34.0  female      90     NaN
f   devn  19.0    male     101   160.0
g   elov   NaN  female      78   185.0

data2 = {'name': ['apolo', 'adm', 'bolon', 'ali', 'cathy'],
         'age': [18, 29, 32, 28, None],
         'sex': ['male', 'male', 'female', 'male', 'female'],
         'country': ['BR', 'MX', 'CL', 'BR', 'CO']}
df2 = pd.DataFrame(data2, index=['a', 'b', 'm', 'n', 'h'])
df2

Output: df2

    name   age     sex country
a  apolo  18.0    male      BR
b    adm  29.0    male      MX
m  bolon  32.0  female      CL
n    ali  28.0    male      BR
h  cathy   NaN  female      CO

Concatenate the two DataFrames directly with concat(), using the default settings.

result = pd.concat([df1, df2])
result

Output: the columns of the new dataset are the union of the two column sets. For a column that appears in only one DataFrame (such as country, which exists only in df2), the rows coming from the other DataFrame are NaN. The row index of the new dataset contains every row label from both originals, so labels can repeat (here a and h each appear twice).

    name   age     sex  weight  height country
a  apolo  18.0    male    67.0   170.0     NaN
i    adm  29.0  female    78.0   189.0     NaN
c  bolon  32.0    male    87.0   190.0     NaN
d    ali  28.0    male    59.0   179.0     NaN
h  cathy  34.0  female    90.0     NaN     NaN
f   devn  19.0    male   101.0   160.0     NaN
g   elov   NaN  female    78.0   185.0     NaN
a  apolo  18.0    male     NaN     NaN      BR
b    adm  29.0    male     NaN     NaN      MX
m  bolon  32.0  female     NaN     NaN      CL
n    ali  28.0    male     NaN     NaN      BR
h  cathy   NaN  female     NaN     NaN      CO
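Repeated row labels are easy to overlook. The sketch below uses two small stand-in frames (left and right are made up for illustration and are much smaller than df1 and df2) to show how the duplicates can be detected, and how the verify_integrity parameter from the signature above makes concat raise instead of silently keeping them.

```python
import pandas as pd

# Small stand-ins that share the row label 'a' (illustrative data only).
left = pd.DataFrame({'name': ['apolo', 'adm']}, index=['a', 'i'])
right = pd.DataFrame({'name': ['apolo', 'adm']}, index=['a', 'b'])

result = pd.concat([left, right])
print(result.index.is_unique)   # False, because 'a' appears twice
print(result.loc['a'])          # selecting 'a' returns both rows

# With verify_integrity=True, overlapping labels raise a ValueError.
try:
    pd.concat([left, right], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```

If the original labels carry no meaning, ignore_index=True (shown later in this section) avoids the problem altogether.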

Setting axis=1

result = pd.concat([df1, df2], axis=1)
result

Output: the row index is now the union of the two original row indexes. For a row label that appears in only one DataFrame (such as n, which exists only in df2), the columns coming from the other DataFrame are NaN. The columns of the new dataset are all the columns of both originals, so column names can repeat (name, age and sex each appear twice).

    name   age     sex  weight  height   name   age     sex country
a  apolo  18.0    male    67.0   170.0  apolo  18.0    male      BR
i    adm  29.0  female    78.0   189.0    NaN   NaN     NaN     NaN
c  bolon  32.0    male    87.0   190.0    NaN   NaN     NaN     NaN
d    ali  28.0    male    59.0   179.0    NaN   NaN     NaN     NaN
h  cathy  34.0  female    90.0     NaN  cathy   NaN  female      CO
f   devn  19.0    male   101.0   160.0    NaN   NaN     NaN     NaN
g   elov   NaN  female    78.0   185.0    NaN   NaN     NaN     NaN
b    NaN   NaN     NaN     NaN     NaN    adm  29.0    male      MX
m    NaN   NaN     NaN     NaN     NaN  bolon  32.0  female      CL
n    NaN   NaN     NaN     NaN     NaN    ali  28.0    male      BR
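Repeated column names make a selection such as result['name'] ambiguous. One way to keep the sources apart is the keys parameter from the signature above. The sketch below is a minimal illustration; the frames and the labels 'df1' and 'df2' are chosen here for the example.

```python
import pandas as pd

# Illustrative frames that share the column 'name'.
left = pd.DataFrame({'name': ['apolo', 'cathy'], 'age': [18, 34]},
                    index=['a', 'h'])
right = pd.DataFrame({'name': ['apolo', 'cathy'], 'country': ['BR', 'CO']},
                     index=['a', 'h'])

plain = pd.concat([left, right], axis=1)
print(plain.columns.is_unique)          # False: 'name' appears twice

# keys adds an outer column level that records where each column came from.
labeled = pd.concat([left, right], axis=1, keys=['df1', 'df2'])
print(labeled.columns.tolist())
print(labeled[('df2', 'name')])         # the 'name' column that came from right
print(labeled['df1'])                   # every column that came from left
```

The result has MultiIndex columns, so a single column is addressed with a (key, column) tuple.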

Setting join='inner'

result = pd.concat([df1, df2], join='inner')
result

Output: the columns are now the intersection of the two original column sets, while the row index still contains every row label from both originals, so labels can repeat.

    name   age     sex
a  apolo  18.0    male
i    adm  29.0  female
c  bolon  32.0    male
d    ali  28.0    male
h  cathy  34.0  female
f   devn  19.0    male
g   elov   NaN  female
a  apolo  18.0    male
b    adm  29.0    male
m  bolon  32.0  female
n    ali  28.0    male
h  cathy   NaN  female
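join='inner' always applies to the axis that is not being concatenated. The sketch below, with small illustrative frames, combines it with axis=1, so that only the row labels present in both frames survive.

```python
import pandas as pd

# Illustrative frames; only the labels 'a' and 'h' occur in both.
left = pd.DataFrame({'weight': [67, 78, 90]}, index=['a', 'i', 'h'])
right = pd.DataFrame({'country': ['BR', 'MX', 'CO']}, index=['a', 'b', 'h'])

# Side-by-side concatenation, keeping the intersection of the row labels.
both = pd.concat([left, right], axis=1, join='inner')
print(both)
```

Because every remaining row exists in both inputs, no NaN values are introduced.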

Setting ignore_index=True

result = pd.concat([df1, df2], join='inner', ignore_index=True)
result

Output: the row index is now 0, 1, 2, ..., 11 instead of the original row labels. Likewise, when axis=1 is set, the column labels are replaced by 0, 1, 2, ...

     name   age     sex
0   apolo  18.0    male
1     adm  29.0  female
2   bolon  32.0    male
3     ali  28.0    male
4   cathy  34.0  female
5    devn  19.0    male
6    elov   NaN  female
7   apolo  18.0    male
8     adm  29.0    male
9   bolon  32.0  female
10    ali  28.0    male
11  cathy   NaN  female
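The axis=1 case mentioned above can be checked with a minimal sketch. The one-row frames are illustrative only.

```python
import pandas as pd

# Illustrative one-row frames.
left = pd.DataFrame({'name': ['apolo'], 'age': [18]}, index=['a'])
right = pd.DataFrame({'name': ['apolo'], 'country': ['BR']}, index=['a'])

wide = pd.concat([left, right], axis=1, ignore_index=True)
print(wide.columns.tolist())   # column labels are replaced by positions
print(wide.index.tolist())     # the row index is left untouched
```

ignore_index only resets the labels along the concatenation axis, so with axis=1 the column names are lost while the row labels are kept.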

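3. The combine_first function

combine_first fills the missing values (NaN) of the calling object with the values found at the same row and column labels in the other object. The result covers the union of both objects' rows and columns: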

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, np.nan, 5, np.nan],
                    'b': [np.nan, 2, np.nan, 6],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5, 4, np.nan, 3, 7],
                    'b': [np.nan, 3, 4, 6, 8]})

df1:

     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

df2:

     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

df1.combine_first(df2)

Output:

     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
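The order of the two objects matters: the caller's values take priority, and the other object only fills the gaps. The sketch below illustrates this with two small Series, s1 and s2, made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative Series; s2 has one extra label (3).
s1 = pd.Series([1.0, np.nan, 5.0], index=[0, 1, 2])
s2 = pd.Series([9.0, 4.0, np.nan, 3.0], index=[0, 1, 2, 3])

a = s1.combine_first(s2)   # s1 wins wherever it has a value
b = s2.combine_first(s1)   # s2 wins wherever it has a value

print(a.tolist())          # [1.0, 4.0, 5.0, 3.0]
print(b.tolist())          # [9.0, 4.0, 5.0, 3.0]
```

At label 0 both Series hold a value, so the result depends on which one calls combine_first. At labels 1 and 2 only one side has a value, so both calls agree.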
