Filtering a pandas DataFrame on multiple conditions in Python

1. Filtering on multiple conditions

Using the query method

df_filtered = df.query('a == 4 & b != 2')

Note: an equality test in a query string is written with a double equals sign (==).

Using boolean indexing with ==

data[(data['A']==0)&(data['B']==1)]

Using loc

>>> data.loc[(data['A']==0)&(data['B']==1)] # select rows from data (multiple conditions)

A B C D

a 0 1 2 3
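
The three forms above select the same rows. A minimal sketch to check this, assuming a small made-up frame (the values in rows b and c are hypothetical; only row a corresponds to the output shown above):

```python
import pandas as pd

# Hypothetical sample data: only row "a" satisfies A == 0 and B == 1.
data = pd.DataFrame(
    {"A": [0, 4, 0], "B": [1, 5, 9], "C": [2, 6, 10], "D": [3, 7, 11]},
    index=["a", "b", "c"],
)

# 1) query: the condition is a string and column names are written bare.
by_query = data.query("A == 0 & B == 1")

# 2) Boolean indexing: each comparison needs its own parentheses,
#    because Python's & binds more tightly than ==.
by_mask = data[(data["A"] == 0) & (data["B"] == 1)]

# 3) loc with the same boolean mask.
by_loc = data.loc[(data["A"] == 0) & (data["B"] == 1)]

print(by_query)
```

Which form to use is mostly a matter of taste; query tends to read better for long conditions, while the mask form works with column names that are not valid Python identifiers.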

2. Range filtering

# The query method

# query with a chained comparison (<)

rpt.query('60000 < STK_ID < 70000')

# query with in

rpt.query('STK_ID in (600809,600141,600329)')

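# The isin method

# Keep the rows whose value in column 'one' is one of the given strings:

keys = ['key1','key2'] # named keys rather than list, to avoid shadowing the built-in

df = df[df['one'].isin(keys)]

# data[(data['A'].isin([0]))&(data['B'].isin([1]))] # the multi-condition filter written with isin

# Drop the rows whose value in column 'one' is one of the given strings (the inverse selection):

df = df[~df['one'].isin(keys)]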
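
The range and membership filters above can be tried together on one frame. A small sketch, assuming made-up data (the name column and the STK_ID values 65000 and 70000 are hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration only.
rpt = pd.DataFrame(
    {
        "STK_ID": [600809, 600141, 65000, 70000],
        "name": ["key1", "key2", "key3", "key4"],
    }
)

# Chained comparison: both bounds are exclusive, so 70000 is not selected.
in_range = rpt.query("60000 < STK_ID < 70000")

# Membership test inside query; values absent from the frame are simply ignored.
in_list = rpt.query("STK_ID in [600809, 600141, 600329]")

# isin builds a boolean mask; ~ inverts it.
keys = ["key1", "key2"]
kept = rpt[rpt["name"].isin(keys)]
dropped = rpt[~rpt["name"].isin(keys)]
```

Because kept and dropped come from a mask and its inverse, every row of rpt lands in exactly one of them.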

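3. Filtering on a combination of columns. For example, 20201101 has two advertiser_id values (adv1044525491840 and adv1049003362112) and 20201102 has one (adv1049003362112). The goal is to exclude the pair (20201102, adv1049003362112). Joining two not in conditions with & does not achieve this, as shown below.

# The expected result is: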

advertiser_id day id

0 adv1044525491840 20201101 1

1 adv1049003362112 20201101 2

>>> import pandas as pd

>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}

>>> patchDF = pd.DataFrame(data1)

>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112']}

>>> advertiserDF = pd.DataFrame(data2)

>>> adDF = patchDF.query("day not in (%s) & advertiser_id not in (%s)"%(advertiserDF['day'].tolist(),advertiserDF['advertiser_id'].tolist()))

>>> adDF

advertiser_id day id

0 adv1044525491840 20201101 1

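# The actual result is shown above: the row with id 2 is missing, because & also drops any row that matches on advertiser_id alone.

A tuple test of the form (a, b) not in ((x, y), ...) is not supported by query either, as shown below: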
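
The difference between the expected and the actual result comes down to De Morgan's law: "not (day matches and advertiser matches)" is "day does not match or advertiser does not match", not the & of the two negations. A minimal sketch with boolean masks, using the same data (the variable names day_hit, adv_hit, too_strict and pair_excluded are introduced here for illustration):

```python
import pandas as pd

patchDF = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "day": [20201101, 20201101, 20201102],
        "advertiser_id": ["adv1044525491840", "adv1049003362112", "adv1049003362112"],
    }
)
days = [20201102]
advertisers = ["adv1049003362112"]

day_hit = patchDF["day"].isin(days)
adv_hit = patchDF["advertiser_id"].isin(advertisers)

# Equivalent to the query above: drops every row that matches on EITHER column.
too_strict = patchDF[~day_hit & ~adv_hit]

# Drops only the rows that match on BOTH columns.
pair_excluded = patchDF[~(day_hit & adv_hit)]
```

This column-by-column test is only safe when a single pair is excluded. With several pairs, it would also drop combinations that never occur together in the exclusion list, which is why a join on both columns is the more general approach.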

>>> import pandas as pd

>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}

>>> patchDF = pd.DataFrame(data1)

>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112']}

>>> advertiserDF = pd.DataFrame(data2)

>>> adDF = patchDF.query("(day,advertiser_id) not in ((20201102,'adv1049003362112'))")

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2850, in query

new_data = self.loc[res]

File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1478, in __getitem__

return self._getitem_axis(maybe_callable, axis=axis)

File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1912, in _getitem_axis

return self._get_label(key, axis=axis)

File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 140, in _get_label

return self.obj._xs(label, axis=axis)

File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2987, in xs

loc = self.index.get_loc(key)

File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexes/base.py", line 3080, in get_loc

return self._engine.get_loc(self._maybe_cast_indexer(key))

File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc

File "pandas/_libs/index.pyx", line 159, in pandas._libs.index.IndexEngine.get_loc

File "pandas/_libs/index_class_helper.pxi", line 120, in pandas._libs.index.Int64Engine._check_type

KeyError: True
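
Outside of query, a row-wise tuple test can be expressed by turning the key columns into a MultiIndex and calling isin on it. This is a sketch under the assumption of a reasonably recent pandas (MultiIndex.from_frame is not available in the old Python 2.7 installation shown in the traceback):

```python
import pandas as pd

patchDF = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "day": [20201101, 20201101, 20201102],
        "advertiser_id": ["adv1044525491840", "adv1049003362112", "adv1049003362112"],
    }
)
advertiserDF = pd.DataFrame(
    {"day": [20201102], "advertiser_id": ["adv1049003362112"]}
)

keys = ["day", "advertiser_id"]

# One (day, advertiser_id) tuple per row of patchDF.
row_pairs = pd.MultiIndex.from_frame(patchDF[keys])

# The pairs to exclude, as a plain list of tuples.
excluded_pairs = list(advertiserDF[keys].itertuples(index=False, name=None))

# isin compares whole tuples, so a row is dropped only if both values match.
adDF = patchDF[~row_pairs.isin(excluded_pairs)]
print(adDF)
```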

Using a left join

>>> import pandas as pd

>>> import numpy as np

>>> data1 = {'id':[1,2,3],'day':[20201101,20201101,20201102],'advertiser_id':['adv1044525491840','adv1049003362112','adv1049003362112']}

>>> patchDF = pd.DataFrame(data1)

>>> data2 = {'day':[20201102],'advertiser_id':['adv1049003362112'],'id':[1]}

>>> advertiserDF = pd.DataFrame(data2)

>>> mergeDF = pd.merge(patchDF, advertiserDF, how='left', on=['day', 'advertiser_id'],suffixes=('_patch', '_advertiser'))

>>> adDF = mergeDF[np.isnan(mergeDF['id_advertiser'])]

>>> adDF

advertiser_id day id_patch id_advertiser

0 adv1044525491840 20201101 1 NaN

1 adv1049003362112 20201101 2 NaN

# Selecting the non-NaN rows; np.isnan may raise an error on large data sets, see https://itdiandi.net/view/2874

# >>> adDF = mergeDF[~np.isnan(mergeDF['id_advertiser'])]

# >>> adDF

# advertiser_id day id_patch id_advertiser

# 2 adv1049003362112 20201102 3 1.0

# Use pd.notna for the test instead

adDF = mergeDF[pd.notna(mergeDF['id_advertiser'])]
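
As an alternative to testing id_advertiser for NaN, merge can label each row with its origin through the indicator parameter, which adds a _merge column. A minimal sketch on the same data, assuming a pandas version that supports indicator; here advertiserDF carries only the two key columns, so no suffixes are needed:

```python
import pandas as pd

patchDF = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "day": [20201101, 20201101, 20201102],
        "advertiser_id": ["adv1044525491840", "adv1049003362112", "adv1049003362112"],
    }
)
advertiserDF = pd.DataFrame(
    {"day": [20201102], "advertiser_id": ["adv1049003362112"]}
)

keys = ["day", "advertiser_id"]

# indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
mergeDF = patchDF.merge(advertiserDF[keys], how="left", on=keys, indicator=True)

# Rows with no matching (day, advertiser_id) pair in advertiserDF.
adDF = mergeDF[mergeDF["_merge"] == "left_only"].drop(columns="_merge")

# Rows that do have a matching pair.
matchedDF = mergeDF[mergeDF["_merge"] == "both"].drop(columns="_merge")
```

This does not depend on any column of the right-hand frame being non-null, so it also works when advertiserDF has no column other than the join keys.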
