Pandas Data Analysis: Data Selection (Indexing/Selection/Filtering), since 2022-05-16

(2022.05.16 Mon)
Elements of a pandas Series are selected through its index.

>> obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
>> obj
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

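A Series can be indexed in two ways: by index label and by integer position. Note that recent pandas versions deprecate integer-position access through plain [] on a label-indexed Series; obj.iloc[...] is the explicit positional form.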

>> obj[3]
3.0
>> obj['a':'c']
a    0.0
b    1.0
c    2.0
dtype: float64
>> obj[['b', 'a', 'd']] # note that a list is passed here
b    1.0
a    0.0
d    3.0
dtype: float64
>> obj[[3, 1, 2]]
d    3.0
b    1.0
c    2.0
dtype: float64
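
As a minimal sketch of the same lookups written with the explicit accessors, the block below uses .loc for labels and .iloc for positions on the obj Series from above. The variable names (by_label, label_slice, and so on) are illustrative and not from the original notes. It also shows a difference that is easy to miss: label slices include the end label, while position slices exclude the end position.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b']           # label-based lookup
by_position = obj.iloc[3]         # position-based lookup
label_slice = obj.loc['a':'c']    # label slice: end label 'c' is included
position_slice = obj.iloc[0:2]    # position slice: end position 2 is excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Using .loc and .iloc keeps the intent unambiguous when the index itself holds integers.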

A pandas DataFrame can be indexed by column name and by integer row position.

>> data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                       index=['Ohio', 'Colorado', 'Utah', 'New York'],
                       columns=['one', 'two', 'three', 'four'])
>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

To select one or several columns of a DataFrame, use the column name(s).

>> data['four']
Ohio         3
Colorado     7
Utah        11
New York    15
Name: four, dtype: int64
>> data[['four', 'one']]
          four  one
Ohio         3    0
Colorado     7    4
Utah        11    8
New York    15   12

To select rows, use an integer position slice or a boolean mask.

>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
>> data['three']>4
Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool
>> data[data['three']>4]
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
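
The plain [] operator on a DataFrame is overloaded: a string or a list of strings selects columns, while a slice or a boolean Series selects rows. The following sketch, which reuses the data frame from above, checks that the two row-selecting forms match their explicit .iloc and .loc equivalents. The names first_two, mask_rows, same_slice and same_mask are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

first_two = data[:2]                  # slice -> rows by position
mask_rows = data[data['three'] > 4]   # boolean Series -> rows where the mask is True

# Compare against the explicit accessors
same_slice = first_two.equals(data.iloc[:2])
same_mask = mask_rows.equals(data.loc[data['three'] > 4])
print(same_slice, same_mask)
```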

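Rows and columns can also be selected with loc and iloc, where loc selects by label and iloc selects by integer position. Pay attention to the last example, which combines iloc with a boolean condition.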

>> data.loc['Colorado', ['two', 'three']]
two      5
three    6
Name: Colorado, dtype: int64
>> data.iloc[[1,2], [3, 0, 1]]
          four  one  two
Colorado     7    4    5
Utah        11    8    9
>> data.iloc[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
>> data.loc[:'Utah', 'two']
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
>> data.iloc[:, :3][data.three > 5] # first three columns, then rows where three > 5
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
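
The conditional selection can also be written as a single .loc call that takes a boolean mask for the rows and a list of labels for the columns. This is a sketch of an alternative spelling, not the form used in the original notes, and subset is an illustrative name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Row mask and column labels in one .loc call
subset = data.loc[data['three'] > 5, ['one', 'two', 'three']]
print(subset)
```

Doing both selections in one step avoids chaining two indexing operations, which matters mostly when the result is assigned to.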

(2023.04.26 Wed @KLN, HK)

Filtering

Data filtering covers the following cases.
Specific value
Select the rows in which a given column equals a specific value.

df10 = df[df['some_field'] == 10]
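
A self-contained sketch of the equality filter follows. The frame contents and the column other are made up for illustration; only the name some_field comes from the line above. The != comparison gives the complementary rows.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'some_field': [10, 20, 10, 30],
                   'other': ['w', 'x', 'y', 'z']})

df10 = df[df['some_field'] == 10]     # rows where some_field equals 10
not10 = df[df['some_field'] != 10]    # all remaining rows
print(df10)
print(not10)
```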

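Range
Select the rows in which a column falls within a range. Besides comparison operators such as >, >=, < and <=, the Series method .between can be used. Both boundaries are included in the result by default, so df['some_field'].between(a, b) is equivalent to (df['some_field'] >= a) & (df['some_field'] <= b).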

>> a = range(1,11)
>> b = range(110, 10, -10)
>> df = pd.DataFrame({'a': a, 'b': b}, index=range(1,11))
>> df
     a    b
1    1  110
2    2  100
3    3   90
4    4   80
5    5   70
6    6   60
7    7   50
8    8   40
9    9   30
10  10   20
>> dfg = df[df['a'] > 3]
>> dfl = df[df['b'] < 80]
>> dfbt = df[df['a'].between(3, 6)]
>> dfbt
   a   b
3  3  90
4  4  80
5  5  70
6  6  60
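
The boundary handling of .between can be adjusted through its inclusive parameter. The sketch below assumes pandas 1.3 or later, where inclusive accepts the strings 'both', 'neither', 'left' and 'right'. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(110, 10, -10)},
                  index=range(1, 11))

both = df[df['a'].between(3, 6)]                          # default: 3 <= a <= 6
neither = df[df['a'].between(3, 6, inclusive='neither')]  # 3 < a < 6
left = df[df['a'].between(3, 6, inclusive='left')]        # 3 <= a < 6

print(list(both['a']), list(neither['a']), list(left['a']))
```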

Membership (in)
Given several specific values, find the rows of df whose column value equals any of them, using the .isin([]) method.

>> df[df['b'].isin([80, 30])]
   a   b
4  4  80
9  9  30
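
The opposite selection, "not in", can be sketched by negating the mask with the ~ operator. The names mask, selected and excluded are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(110, 10, -10)},
                  index=range(1, 11))

mask = df['b'].isin([80, 30])   # True where b is 80 or 30
selected = df[mask]             # rows whose b is in the list
excluded = df[~mask]            # rows whose b is not in the list

print(selected)
print(excluded)
```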

String contains
For string columns, use the .str.contains('xxxx') method, which is provided by the Series string accessor.

>> df['c'] = ['asdf', 'zxcv','asdf1','zxcv1','qwer','qwer2','xcvb','xcvb2','poiu','poiu3']
>> df
     a    b      c
1    1  110   asdf
2    2  100   zxcv
3    3   90  asdf1
4    4   80  zxcv1
5    5   70   qwer
6    6   60  qwer2
7    7   50   xcvb
8    8   40  xcvb2
9    9   30   poiu
10  10   20  poiu3
>> df[df['c'].str.contains('asdf')]
   a    b      c
1  1  110   asdf
3  3   90  asdf1
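
By default .str.contains treats its argument as a regular expression and returns a missing value for missing entries, which can cause surprises. The sketch below uses a small made-up column to show the regex, case and na parameters. The data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical column with a missing value and a literal dot
df = pd.DataFrame({'c': ['asdf', 'ZXCV', None, 'a.df', 'qwer']})

# regex=False matches the text literally; na=False counts missing values as no match
literal = df[df['c'].str.contains('a.df', regex=False, na=False)]
# Default regex=True: '.' matches any single character
pattern = df[df['c'].str.contains('a.df', na=False)]
# case=False ignores upper/lower case
nocase = df[df['c'].str.contains('zxcv', case=False, na=False)]

print(list(literal.index), list(pattern.index), list(nocase.index))
```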

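Combining filtering conditions

The methods above each apply a single condition. To combine several conditions, use & for "and" and | for "or". Note that each condition must be wrapped in parentheses ( ) so that it is evaluated as a unit, because & and | bind more tightly than the comparison operators.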

>> df[(df['a']>3) & (df['c'].str.contains('zxcv'))]
   a   b      c
4  4  80  zxcv1
>> df[(df['b']<60) | (df['a']<2)] 
     a    b      c
1    1  110   asdf
7    7   50   xcvb
8    8   40  xcvb2
9    9   30   poiu
10  10   20  poiu3
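
One way to keep combined filters readable is to name each condition first and then combine the names, negating with ~ where needed. This is a sketch on the same df as above, and the names cond_a, cond_c, both and a_not_c are illustrative. Without the parentheses, an expression such as df['a'] > 3 & df['b'] < 60 is parsed as a chained comparison around 3 & df['b'], which typically raises a ValueError about the ambiguous truth value of a Series.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(110, 10, -10)},
                  index=range(1, 11))
df['c'] = ['asdf', 'zxcv', 'asdf1', 'zxcv1', 'qwer',
           'qwer2', 'xcvb', 'xcvb2', 'poiu', 'poiu3']

cond_a = df['a'] > 3                    # already a boolean Series
cond_c = df['c'].str.contains('zxcv')

both = df[cond_a & cond_c]              # a > 3 AND c contains 'zxcv'
a_not_c = df[cond_a & ~cond_c]          # a > 3 AND c does NOT contain 'zxcv'

print(both)
print(a_not_c)
```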

(2023.04.29 Sat @KLN HK)

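Querying with query

The built-in DataFrame.query method makes querying more flexible. It takes an expression string, similar in spirit to a SQL WHERE clause, and returns the rows that satisfy the expression.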

>> df.query("a >= 3 and b > 30")
   a   b      c
3  3  90  asdf1
4  4  80  zxcv1
5  5  70   qwer
6  6  60  qwer2
7  7  50   xcvb
8  8  40  xcvb2
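
Inside a query string, a Python variable from the calling scope can be referenced with the @ prefix. The sketch below compares the query form with the equivalent boolean-mask form. The names low, by_query and by_mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': range(1, 11), 'b': range(110, 10, -10)},
                  index=range(1, 11))

low = 30  # threshold kept in an ordinary Python variable

by_query = df.query("a >= 3 and b > @low")       # @low refers to the variable above
by_mask = df[(df['a'] >= 3) & (df['b'] > low)]   # same filter as a boolean mask

print(by_query)
print(by_query.equals(by_mask))
```

The query form avoids repeating df and the parentheses, while the mask form works with any Python expression.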

Reference

1 Python for Data Analysis, Wes McKinney
2 Zhihu, LifeIsEasy
