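While analyzing data, I needed to join several DataFrames on the key user_id. In SQL this is just a chain of LEFT JOIN ... ON clauses, so how is it done in pandas?
pandas offers three main functions for combining tables: join, merge, and concat. Each of them is described below.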
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
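Joins the columns of another DataFrame ("other") either on the index or on a key column. Passing a list as "other" joins several DataFrames by index in a single call.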
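As a minimal sketch of the two-table case: `on` names a column of the calling DataFrame, and that column is matched against the index of `other`. The sample data and the column names `clicks_1d` and `clicks_3d` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only user_id comes from the scenario in the text.
left = pd.DataFrame({"user_id": [1, 2, 3], "clicks_1d": [5, 0, 2]})
right = pd.DataFrame(
    {"clicks_3d": [9, 4]},
    index=pd.Index([1, 3], name="user_id"),
)

# `on` refers to a column of `left`; it is compared with the index of `right`.
joined = left.join(right, on="user_id", how="left")
print(joined)
```

Users that have no row in `right` (here user_id 2) keep their row and get NaN in the new column, just as with a SQL LEFT JOIN.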
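Parameters:
Note: joining several DataFrames at once works only on the index. Passing on= together with a list of DataFrames raises an error: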
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-29a230cbe380> in <module>
----> 1 oned_beh_UserCou.join([threed_beh_UserCou.reset_index(),sixd_beh_UserCou.reset_index],how='left',on='user_id')
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
5291 # For SparseDataFrame's benefit
5292 return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 5293 rsuffix=rsuffix, sort=sort)
5294
5295 def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
5309 else:
5310 if on is not None:
-> 5311 raise ValueError('Joining multiple DataFrames only supported'
5312 ' for joining on index')
5313
ValueError: Joining multiple DataFrames only supported for joining on index
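However, after operations such as groupby or reset_index, the current index may no longer be the column we want to join on. The following functions can be used to reset or set the index: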
pandas.DataFrame.reset_index()
pandas.DataFrame.set_index("",drop=True)
pandas.Series.reset_index()
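One way to get around the ValueError above is to make user_id the index of every frame and then pass the remaining frames as a list without `on`. The sketch below assumes user_id is unique within each frame; the frame names and count columns are hypothetical stand-ins for the ones in the traceback.

```python
import pandas as pd

# Hypothetical per-user counts over 1, 3 and 6 days.
oned = pd.DataFrame({"user_id": [1, 2, 3], "cnt_1d": [5, 0, 2]})
threed = pd.DataFrame({"user_id": [1, 3], "cnt_3d": [9, 4]})
sixd = pd.DataFrame({"user_id": [2, 3], "cnt_6d": [7, 8]})

# Move the join key into the index of every frame.
frames = [df.set_index("user_id") for df in (oned, threed, sixd)]

# No `on` argument: a list of frames is always joined index to index.
multi = frames[0].join(frames[1:], how="left")

# Turn user_id back into an ordinary column if needed.
result = multi.reset_index()
print(result)
```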
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
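merge performs a database-style join. When the two tables are joined on columns, the DataFrame indexes are ignored and the result gets a fresh default index; when the join is index on index, or index on column, the index is carried over to the result.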
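The difference in index handling can be seen in a small sketch. The string index labels and the columns `a` and `b` are invented for illustration; only user_id comes from the scenario in this post.

```python
import pandas as pd

# Hypothetical frames with non-default indexes so the effect is visible.
left = pd.DataFrame({"user_id": [10, 20, 30], "a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"user_id": [20, 30, 40], "b": [4, 5, 6]}, index=["p", "q", "r"])

# Column-on-column: both original indexes are discarded.
by_column = left.merge(right, on="user_id", how="left")
print(by_column.index.tolist())

# Column-on-index: the key of `right` lives in its index,
# and the index of `left` is kept in the result.
right_indexed = right.set_index("user_id")
by_index = left.merge(right_indexed, left_on="user_id", right_index=True, how="left")
print(by_index.index.tolist())
```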
Parameters:
pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)
Concatenates pandas objects along a particular axis.
Parameters:
import pandas as pd
df1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']], columns=['letter', 'number', 'animal'])
pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
pd.concat([df1, df3], join="outer",ignore_index=True)
  animal letter  number
0    NaN      a       1
1    NaN      b       2
2    cat      c       3
3    dog      d       4
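The examples above stack rows (axis=0). As a hedged sketch of the other direction, axis=1 places the frames side by side and aligns them on their index; with join="inner" only index labels present in every frame survive. The frames `numbers` and `animals` are hypothetical.

```python
import pandas as pd

# Hypothetical frames indexed by letter.
numbers = pd.DataFrame({"number": [1, 2]}, index=["a", "b"])
animals = pd.DataFrame({"animal": ["cat", "dog"]}, index=["b", "c"])

# Column-wise concatenation; rows are aligned on the index labels.
wide = pd.concat([numbers, animals], axis=1, join="inner")
print(wide)
```

Only the label "b" appears in both frames, so the result has a single row.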
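Comparing merge with join: join can connect several DataFrames in one call, whereas merge handles only two at a time, but merge is more flexible in how the join keys are specified. concat simply stitches DataFrames together along rows or columns.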
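Although merge takes two frames at a time, it can still be chained over a list, which is one possible answer to the original question of left-joining many tables on user_id. The sketch below uses functools.reduce from the standard library; the frames and count columns are hypothetical.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-user tables that all share a user_id column.
frames = [
    pd.DataFrame({"user_id": [1, 2, 3], "cnt_1d": [5, 0, 2]}),
    pd.DataFrame({"user_id": [1, 3], "cnt_3d": [9, 4]}),
    pd.DataFrame({"user_id": [2, 3], "cnt_6d": [7, 8]}),
]

# Left-join the frames pairwise, from left to right, on the user_id column.
merged = reduce(
    lambda acc, df: pd.merge(acc, df, on="user_id", how="left"),
    frames,
)
print(merged)
```

Unlike the list form of join, this keeps user_id as an ordinary column throughout, so no set_index or reset_index step is needed.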