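About the author: top-rated creator in the Python field, Huawei Cloud Expert, Alibaba Cloud expert blogger, and Top 6 in the 2021 CSDN Rising Star Bloggers ranking.
- This article is part of the Python full-stack column 《100天精通Python从入门到就业》 (100 Days to Master Python, from Beginner to Job-Ready).
- The column is a complete course written for readers with no Python background. It progresses step by step from 0 to 100, and each topic builds on the previous one.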
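When cleaning data for analysis, you often need to filter out or delete some rows of a DataFrame. This article introduces the commonly used filtering methods.
Boolean indexing can be used both to test a condition and to filter rows by it.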
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df = pd.DataFrame(np.random.randn(3, 3), columns=['A', 'B', 'C'])
>>> print(df)
A B C
0 -0.595510 -1.349175 -0.313918
1 1.130604 -2.094348 -0.449182
2 1.745407 -0.136642 -0.943479
>>>
>>> # Boolean test: is each value in column A greater than 1?
>>> print(df['A'] > 1)
0 False
1 True
2 True
Name: A, dtype: bool
>>>
>>> # Boolean filter: the rows where column A is greater than 1
>>> print(df[df['A'] > 1])
A B C
1 1.130604 -2.094348 -0.449182
2 1.745407 -0.136642 -0.943479
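Because the example above uses random data, its output changes on every run. The sketch below shows the same test-then-filter steps with a seeded generator so the result is repeatable; the seed value 42 and the 5x3 shape are arbitrary choices for illustration, and `np.random.default_rng` assumes a reasonably recent NumPy.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs produce the same numbers (seed is arbitrary)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((5, 3)), columns=["A", "B", "C"])

# Step 1: the test. A boolean Series aligned with df's index
mask = df["A"] > 1

# Step 2: the filter. Only rows where the mask is True are kept
filtered = df[mask]

print(int(mask.sum()), "of", len(df), "rows have A > 1")
print(filtered)
```

Storing the mask in a variable makes it easy to count the matches with `mask.sum()` before filtering.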
between(left, right) selects the rows whose values fall within a given interval. Both endpoints are included by default.
>>> import pandas as pd
>>>
>>> data = {'name': ['小红', '小明', '小白', '小黑'], 'age': [10, 20, 30, 25]}
>>> df = pd.DataFrame(data)
>>> print(df)
name age
0 小红 10
1 小明 20
2 小白 30
3 小黑 25
>>>
>>> # Test whether age is between 20 and 30 (both endpoints included)
>>> print(df['age'].between(20, 30))
0 False
1 True
2 True
3 True
Name: age, dtype: bool
>>> # Filter the rows whose age is between 20 and 30
>>> print(df[df['age'].between(20, 30)])
name age
1 小明 20
2 小白 30
3 小黑 25
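The output above includes both 20 and 30 because the interval is closed by default. As a minimal sketch, the `inclusive` parameter controls the endpoints; the string values used here assume pandas 1.3 or later (older versions took a boolean instead).

```python
import pandas as pd

ages = pd.Series([10, 20, 30, 25], name="age")

# Default: 20 <= age <= 30
both = ages.between(20, 30)

# Open interval: 20 < age < 30
neither = ages.between(20, 30, inclusive="neither")

# Half-open interval: 20 <= age < 30
left = ages.between(20, 30, inclusive="left")

print(both.tolist())     # [False, True, True, True]
print(neither.tolist())  # [False, False, False, True]
print(left.tolist())     # [False, True, False, True]
```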
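isin() takes a list and compares each value against several candidates at once: it returns True where the value equals any item in the list, and False otherwise.
Create the DataFrame: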
>>> import pandas as pd
>>> import numpy as np
>>>
>>> data = [['foo', 'one', 'small', 1], ['foo', 'one', 'large', 5],
... ['bar', 'one', 'small', 10], ['bar', 'two', 'samll', 10],
... ['bar', 'two', 'large', 50]]
>>> df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
>>> print(df)
A B C D
0 foo one small 1
1 foo one large 5
2 bar one small 10
3 bar two samll 10
4 bar two large 50
Pattern for filtering on one column: df[df[column_name].isin([outlier_values])]
>>> # 1. A single value: test whether each value in column A is foo
>>> df['A'].isin(['foo'])
0 True
1 True
2 False
3 False
4 False
Name: A, dtype: bool
>>>
>>> # 2. Multiple values: test whether each value in column A is foo or bar
>>> df['A'].isin(['foo','bar'])
0 True
1 True
2 True
3 True
4 True
Name: A, dtype: bool
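To make the "equals any value in the list" behavior concrete, here is a small sketch comparing isin() with the equivalent chain of `==` tests joined by `|`. The extra value `baz` is made up for illustration so that one row fails the test.

```python
import pandas as pd

s = pd.Series(["foo", "foo", "bar", "baz"], name="A")

# One call that checks membership in the list
by_isin = s.isin(["foo", "bar"])

# The same test written as separate comparisons
by_eq = (s == "foo") | (s == "bar")

print(by_isin.tolist())        # [True, True, True, False]
print(by_isin.equals(by_eq))   # True
```

With more than two or three candidate values, isin() is usually the more readable of the two forms.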
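Join conditions with & when all of them must hold, and with | when any one of them is enough.
Rows where every listed column has a matching value: df[df[column_1].isin([outlier_values]) & df[column_2].isin([outlier_values])]
>>> # Filter the rows where column A equals bar and column B equals one
>>> df[df['A'].isin(['bar']) & df['B'].isin(['one'])]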
A B C D
2 bar one small 10
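Rows where at least one listed column has a matching value: df[df[column_1].isin([outlier_values]) | df[column_2].isin([outlier_values])]
>>> # Filter the rows where column A equals bar or column B equals one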
>>> df[df['A'].isin(['bar']) | df['B'].isin(['one'])]
A B C D
0 foo one small 1
1 foo one large 5
2 bar one small 10
3 bar two samll 10
4 bar two large 50
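The same & and | operators also combine ordinary comparisons, not only isin() calls. A short sketch follows, using a reduced copy of the article's data (column C is left out for brevity). Each comparison is wrapped in parentheses because & and | bind more tightly than ==, >= and < in Python.

```python
import pandas as pd

data = [["foo", "one", 1], ["foo", "one", 5], ["bar", "one", 10],
        ["bar", "two", 10], ["bar", "two", 50]]
df = pd.DataFrame(data, columns=["A", "B", "D"])

# AND: column A is bar and column D is at least 10
both = df[(df["A"] == "bar") & (df["D"] >= 10)]

# OR: column B is two, or column D is below 5
either = df[(df["B"] == "two") | (df["D"] < 5)]

print(both.index.tolist())    # [2, 3, 4]
print(either.index.tolist())  # [0, 3, 4]
```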
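Passing a dict to isin(): {'column': [values], 'column': [values]}
# With this approach, every position that does not match is shown as NaN
# (row 3 shows NaN in column C because its value is spelled 'samll', which does not match 'small')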
>>> df[df.isin({'A':['bar'],'C':['small']})]
A B C D
0 NaN NaN small NaN
1 NaN NaN NaN NaN
2 bar NaN small NaN
3 bar NaN NaN NaN
4 bar NaN NaN NaN
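The dict form above masks individual cells rather than selecting rows. If whole rows are wanted instead, one possible approach (a sketch, not the only way) is to reduce the boolean result row by row with all() or any(). The variable names `cond` and `hits` are chosen here for illustration.

```python
import pandas as pd

data = [["foo", "one", "small", 1], ["foo", "one", "large", 5],
        ["bar", "one", "small", 10], ["bar", "two", "samll", 10],
        ["bar", "two", "large", 50]]
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])

cond = {"A": ["bar"], "C": ["small"]}

# Boolean DataFrame, restricted to the columns named in the dict
hits = df.isin(cond)[list(cond)]

# Rows where every listed column matches
all_match = df[hits.all(axis=1)]

# Rows where at least one listed column matches
any_match = df[hits.any(axis=1)]

print(all_match.index.tolist())  # [2]
print(any_match.index.tolist())  # [0, 2, 3, 4]
```

Row 3 is missing from `all_match` only because its C value is the misspelled 'samll'.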
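Because isin() on a column returns a boolean Series (True where the value is in the list, False where it is not), deleting the matching rows only requires inverting that mask, for example by XOR-ing it with True.
# Delete the rows where column A is foo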
>>> df[True^df['A'].isin(['foo'])]
A B C D
2 bar one small 10
3 bar two samll 10
4 bar two large 50
Another option is to put ~ in front of the condition, which negates it.
# Delete the rows where column A is foo
>>> df[~(df['A']=='foo')]
A B C D
2 bar one small 10
3 bar two samll 10
4 bar two large 50
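The ~ operator works on any boolean Series, including the result of isin(). The sketch below, which uses a reduced copy of the article's data, checks that it gives the same rows as a plain != comparison and then renumbers the index, since deleting rows leaves gaps in the original labels.

```python
import pandas as pd

data = [["foo", "one", 1], ["foo", "one", 5], ["bar", "one", 10],
        ["bar", "two", 10], ["bar", "two", 50]]
df = pd.DataFrame(data, columns=["A", "B", "D"])

# Negate an isin() mask to drop the matching rows
kept = df[~df["A"].isin(["foo"])]

# Equivalent for a single value
same = df[df["A"] != "foo"]

# Optional: renumber the remaining rows from 0
tidy = kept.reset_index(drop=True)

print(kept.index.tolist())  # [2, 3, 4]
print(kept.equals(same))    # True
print(tidy.index.tolist())  # [0, 1, 2]
```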
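The difference between loc and iloc (both are indexers used with square brackets, not functions called with parentheses):
- loc selects data by index label and column name
- iloc selects data by the integer position of rows and columns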
>>> import pandas as pd
>>>
>>> data = [['foo', 'one', 'small', 1], ['foo', 'one', 'large', 5],
... ['bar', 'one', 'small', 10], ['bar', 'two', 'samll', 10],
... ['bar', 'two', 'large', 50]]
>>> df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'], index=['a', 'b', 'c', 'd', 'e'])
>>> print(df)
A B C D
a foo one small 1
b foo one large 5
c bar one small 10
d bar two samll 10
e bar two large 50
>>> # loc: take the row labeled a (the first row)
>>> df.loc['a']
A foo
B one
C small
D 1
Name: a, dtype: object
>>>
>>> # iloc: take the row at position 0 (the first row, labeled a)
>>> df.iloc[0]
A foo
B one
C small
D 1
Name: a, dtype: object
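The label versus position distinction is easiest to see when the index labels are themselves integers. In this sketch the labels 10, 20, 30 and the value `baz` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "baz"]}, index=[10, 20, 30])

by_label = df.loc[10]      # the row whose label is 10
by_position = df.iloc[0]   # the row at position 0 (the same row here)

# No row is labeled 0, so a label lookup for 0 fails
try:
    df.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label["A"], by_position["A"])  # foo foo
print(label_zero_found)                 # False
```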
>>> # loc: take all rows of column A
>>> df.loc[:, ['A']]
A
a foo
b foo
c bar
d bar
e bar
>>>
>>> # iloc: take all rows of column A (position 0)
>>> df.iloc[:,[0]]
A
a foo
b foo
c bar
d bar
e bar
(1) Consecutive columns:
>>> # loc: take all rows of columns A, B and C
>>> df.loc[:, ['A', 'B', 'C']]
A B C
a foo one small
b foo one large
c bar one small
d bar two samll
e bar two large
>>>
>>> # iloc: take all rows of columns A, B and C (positions 0 to 2)
>>> df.iloc[:, 0:3]
A B C
a foo one small
b foo one large
c bar one small
d bar two samll
e bar two large
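Consecutive columns can also be taken with a label slice in loc. One detail worth knowing, shown in the sketch below: a loc slice includes its end label, while an iloc slice excludes its end position, as ordinary Python slices do.

```python
import pandas as pd

data = [["foo", "one", "small", 1], ["foo", "one", "large", 5],
        ["bar", "one", "small", 10], ["bar", "two", "samll", 10],
        ["bar", "two", "large", 50]]
df = pd.DataFrame(data, columns=["A", "B", "C", "D"],
                  index=["a", "b", "c", "d", "e"])

# Columns: 'A':'C' includes C, while 0:3 stops before position 3
cols_by_label = df.loc[:, "A":"C"]
cols_by_pos = df.iloc[:, 0:3]

# Rows: 'a':'c' gives three rows, 0:2 gives two
rows_by_label = df.loc["a":"c"]
rows_by_pos = df.iloc[0:2]

print(cols_by_label.equals(cols_by_pos))  # True
print(rows_by_label.index.tolist())       # ['a', 'b', 'c']
print(rows_by_pos.index.tolist())         # ['a', 'b']
```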
(2) Non-consecutive columns:
>>> # loc: take all rows of columns A and D
>>> df.loc[:, ['A', 'D']]
A D
a foo 1
b foo 5
c bar 10
d bar 10
e bar 50
>>>
>>> # iloc: take all rows of columns A and D (positions 0 and 3)
>>> df.iloc[:, [0,3]]
A D
a foo 1
b foo 5
c bar 10
d bar 10
e bar 50
>>> # loc: take the rows labeled a and d, restricted to columns A and D
>>> df.loc[['a', 'd'], ['A', 'D']]
A D
a foo 1
d bar 10
>>>
>>> # iloc: take the rows at positions 0 and 3 (a, d) and the columns at positions 0 and 3 (A, D)
>>> df.iloc[[0, 3], [0, 3]]
A D
a foo 1
d bar 10
>>> # loc: take everything
>>> df.loc[:,:]
A B C D
a foo one small 1
b foo one large 5
c bar one small 10
d bar two samll 10
e bar two large 50
>>>
>>> # iloc: take everything
>>> df.iloc[:,:]
A B C D
a foo one small 1
b foo one large 5
c bar one small 10
d bar two samll 10
e bar two large 50
loc can also filter rows by value:
>>> # loc: take the rows where column A equals foo
>>> df.loc[df['A'] == 'foo']
A B C D
a foo one small 1
b foo one large 5
>>>
>>> # loc: take the rows where column D is greater than or equal to 10
>>> df.loc[df['D'] >= 10]
A B C D
c bar one small 10
d bar two samll 10
e bar two large 50
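A value filter in loc can be combined with a column selection in the same call. The sketch below also shows one way to use the same mask with iloc, by converting it to a plain NumPy array first; this is a minimal illustration rather than a recommendation of one indexer over the other.

```python
import pandas as pd

data = [["foo", "one", "small", 1], ["foo", "one", "large", 5],
        ["bar", "one", "small", 10], ["bar", "two", "samll", 10],
        ["bar", "two", "large", 50]]
df = pd.DataFrame(data, columns=["A", "B", "C", "D"],
                  index=["a", "b", "c", "d", "e"])

mask = df["D"] >= 10

# Rows chosen by the mask, columns chosen by name
subset = df.loc[mask, ["A", "D"]]

# The same rows through iloc, using the mask as a boolean array
by_position = df.iloc[mask.to_numpy()]

print(subset.index.tolist())           # ['c', 'd', 'e']
print(list(subset.columns))            # ['A', 'D']
print(by_position.equals(df.loc[mask]))  # True
```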