一、多条件过滤
使用query方法
df_filtered = df.query('a == 4 & b != 2')
注意:等于过滤,是两个==;
使用==
data[(data['A']==0)&(data['B']==1)]
使用loc函数
>>> data.loc[(data['A']==0)&(data['B']==1)] # 提取data数据(多个筛选条件)
A B C D
a 0 1 2 3
二、范围过滤
// query函数
// query <
rpt.query('60000 < STK_ID < 70000')
// query in
rpt.query('STK_ID in (600809,600141,600329)')
// isin函数
// 筛选出dataframe中有某一个或某几个字符串的列:
list=['key1','key2']
df = df[df['one'].isin(list)]
// data[(data['A'].isin([0]))&(data['B'].isin([1]))] # isin函数
// 筛选出dataframe中不含某一个或某几个字符串的列,相当于反选
df = df[~df['one'].isin(list)]
三、有级联关系的过滤,比如20201101有两个advertiser_id(adv1044525491840、adv1049003362112),20201102有一个(adv1049003362112),直接通过not in &实现不了,如下
// 预期结果如下
advertiser_id day id
0 adv1044525491840 20201101 1
1 adv1049003362112 20201101 2
>>> import pandas as pd
>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}
>>> patchDF = pd.DataFrame(data1)
>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112']}
>>> advertiserDF = pd.DataFrame(data2)
>>> adDF = patchDF.query("day not in (%s) & advertiser_id not in (%s)"%(advertiserDF['day'].tolist(),advertiserDF['advertiser_id'].tolist()))
>>> adDF
advertiser_id day id
0 adv1044525491840 20201101 1
// 实际返回如上
() not in ((),())写法,这种写法不支持,如下
>>> import pandas as pd
>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}
>>> patchDF = pd.DataFrame(data1)
>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112']}
>>> advertiserDF = pd.DataFrame(data2)
>>> adDF = patchDF.query("(day,advertiser_id) not in ((20201102,'adv1049003362112'))")
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2850, in query
new_data = self.loc[res]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1912, in _getitem_axis
return self._get_label(key, axis=axis)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 140, in _get_label
return self.obj._xs(label, axis=axis)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2987, in xs
loc = self.index.get_loc(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 159, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 120, in pandas._libs.index.Int64Engine._check_type
KeyError: True
left join方式
>>> import pandas as pd
>>> import numpy as np
>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}
>>> patchDF = pd.DataFrame(data1)
>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112'],'id':[1]}
>>> advertiserDF = pd.DataFrame(data2)
>>> mergeDF = pd.merge(patchDF, advertiserDF, how='left', on=['day', 'advertiser_id'],suffixes=('_patch', '_advertiser'))
>>> adDF = mergeDF[np.isnan(mergeDF['id_advertiser'])]
>>> adDF
advertiser_id day id_patch id_advertiser
0 adv1044525491840 20201101 1 NaN
1 adv1049003362112 20201101 2 NaN
// 求非nan,大数据量可能报错https://itdiandi.net/view/2874
// >>> adDF = mergeDF[~np.isnan(mergeDF['id_advertiser'])]
// >>> adDF
// advertiser_id day id_patch id_advertiser
// 2 adv1049003362112 20201102 3 1.0
// 使用pd.notna判断
adDF = mergeDF[pd.notna(mergeDF['id_advertiser'])]