The pandas DataFrame: 02 - Indexing and Slicing

A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that share the same row index.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100,size = 12).reshape(3,4),
                   index = ['one','two','three'],
                   columns = ['a','b','c','d'])
print(df)
============================
        a   b   c   d
one    35  35  17  50
two    53   4  51  23
three  82  12  51  97

# Select columns by column name: a single name returns a Series, a list of names returns a DataFrame
data1 = df['a']
data2 = df[['a','c']]
print(data1,type(data1))
print(data2,type(data2))
============================
one      35
two      53
three    82
Name: a, dtype: int32 <class 'pandas.core.series.Series'>
        a   c
one    35  17
two    53  51
three  82  51 <class 'pandas.core.frame.DataFrame'>

# Select rows by index label: a single label returns a Series, a list of labels returns a DataFrame
data3 = df.loc['one']
data4 = df.loc[['one','two']]
print(data3,type(data3))
print(data4,type(data4))
============================
a    35
b    35
c    17
d    50
Name: one, dtype: int32 <class 'pandas.core.series.Series'>
      a   b   c   d
one  35  35  17  50
two  53   4  51  23 <class 'pandas.core.frame.DataFrame'>
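
The "dictionary of Series" view can be made concrete with a short sketch. It uses fixed values from np.arange instead of random ones (an assumption made here only so the results are reproducible), and shows that iterating, membership tests, and items() all work on the column labels, much like a dict:

```python
import numpy as np
import pandas as pd

# Fixed values (instead of random ones) so the results are reproducible.
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# Iterating over a DataFrame yields its column labels, like dict keys.
print(list(df))

# Membership tests look at the column labels, not the row index.
print('a' in df, 'one' in df)

# items() yields (column label, Series) pairs, like dict.items().
for name, col in df.items():
    print(name, type(col).__name__, col.tolist())
```

This is also why df['a'] returns a column: the lookup goes to the column labels first.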

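Using df[ ]

df[ ] selects columns by default: put the column label(s) inside the brackets. (For this reason, columns are usually given explicit names rather than left as the default integer labels, to avoid confusion with the row index.) Selecting a single column returns a Series; selecting several columns with a list returns a DataFrame. Every requested label must exist among the columns, otherwise a KeyError is raised.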

data1 = df['a']
data2 = df[['b','c']]  
print(data1)
print(data2)
============================
one      35
two      53
three    82
Name: a, dtype: int32
        b   c
one    35  17
two     4  51
three  12  51
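
A minimal sketch of the two points above: the return type depends on whether a single label or a list is passed, and a label that is not a column raises a KeyError. The frame uses fixed values, and 'x' is a hypothetical column name chosen because it does not exist:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

single = df['a']      # one label -> Series
multi = df[['a']]     # a list, even with a single label -> DataFrame
print(type(single).__name__, type(multi).__name__)

# 'x' is a hypothetical label that is not among the columns.
try:
    df[['a', 'x']]
except KeyError as err:
    print('KeyError:', err)
```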

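When df[ ] is given a slice of positions instead, it selects rows. Only slices work this way: a single integer such as df[0] is looked up as a column label and raises a KeyError unless a column with that name exists. The result is always a DataFrame, even when only one row is selected. Likewise, df[ ] cannot select a row by a single index label (df['one']).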

data3 = df[:1]
print(data3,type(data3))
============================
      a   b   c   d
one  35  35  17  50 <class 'pandas.core.frame.DataFrame'>
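
The row-slicing behavior of df[ ] can be sketched as follows, again with fixed values for reproducibility. The label slice in the third print is an addition not covered in the text above: as far as I know, df[ ] accepts label slices on the row index too, and, like loc, includes the end label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

print(df[:1])            # positional slice: first row, still a DataFrame
print(df[1:3])           # rows at positions 1 and 2
print(df['one':'two'])   # label slice on the row index, 'two' included

# A bare integer is treated as a column label, so this fails here.
try:
    df[0]
except KeyError as err:
    print('KeyError:', err)
```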

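Using df.loc[ ]

  • Selection:
    df.loc[ ] locates data by label. The basic form is df.loc[row labels, column labels], with a list wherever an axis takes several labels. To select columns only, put : in the row position.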
df1 = pd.DataFrame(np.random.randint(100,size = 16).reshape(4,4),
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df1)
============================
        a   b   c   d
one    81  16  59  87
two    16   7  66  70
three  18  28  68  59
four   50  87  98  73


df1.loc[['one','three'],['a','b']]
============================
        a   b
one    81  16
three  18  28

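If a listed label does not exist, the behavior depends on the pandas version. Older versions filled the missing entries with NaN; since pandas 1.0, df.loc[ ] raises a KeyError instead, and reindex is the way to get the NaN-filled result:

df1.reindex(index=['one','two','five'],columns=['a','b','x'])
============================
         a     b   x
one   81.0  16.0 NaN
two   16.0   7.0 NaN
five   NaN   NaN NaN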
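
A sketch of how missing labels behave, assuming pandas 1.0 or later (where .loc with a list containing a missing label raises a KeyError). It reuses the df1 values printed above; 'five' and 'x' are labels that do not exist. The last step, intersecting with the existing labels, is one possible way to keep only what is present, not the only one.

```python
import pandas as pd

df1 = pd.DataFrame([[81, 16, 59, 87],
                    [16, 7, 66, 70],
                    [18, 28, 68, 59],
                    [50, 87, 98, 73]],
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

rows = ['one', 'two', 'five']   # 'five' is not in the index
cols = ['a', 'b', 'x']          # 'x' is not in the columns

# Assumption: pandas >= 1.0, where this lookup raises KeyError.
try:
    df1.loc[rows, cols]
except KeyError as err:
    print('KeyError:', err)

# reindex keeps the requested labels and fills the missing ones with NaN.
padded = df1.reindex(index=rows, columns=cols)
print(padded)

# Keep only the labels that actually exist.
existing = df1.loc[df1.index.intersection(rows),
                   df1.columns.intersection(cols)]
print(existing)
```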
  • Slicing:
    When loc is used with slices, both endpoints are included, and the labels are written without list brackets. For example:
df1.loc['one':'three','a':'c']
============================
        a   b   c
one    81  16  59
two    16   7  66
three  18  28  68
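
To make the inclusive endpoints concrete, this sketch compares a label slice with the position-based slice that selects the same block through iloc (which excludes its stop position). The values are the df1 values printed above.

```python
import pandas as pd

df1 = pd.DataFrame([[81, 16, 59, 87],
                    [16, 7, 66, 70],
                    [18, 28, 68, 59],
                    [50, 87, 98, 73]],
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

by_label = df1.loc['one':'three', 'a':'c']   # 'three' and 'c' are included
by_position = df1.iloc[0:3, 0:3]             # position 3 is excluded

# Both select the same 3 x 3 block.
print(by_label.equals(by_position))

# Open-ended label slices work as well.
tail = df1.loc['two':, :'b']
print(tail)
```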

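Using df.iloc[ ]

Unlike df.loc[ ], df.iloc[ ] locates data by row and column position, counting from 0, and its slices are half-open (the start is included, the stop is excluded). Otherwise it is used in much the same way as loc.

  • Selection: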
df1 = pd.DataFrame(np.random.randint(100,size = 16).reshape(4,4),
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df1)
============================
        a   b   c   d
one    81  16  59  87
two    16   7  66  70
three  18  28  68  59
four   50  87  98  73

df1.iloc[[1,2],[1,2]]
============================
        b   c
two     7  66
three  28  68

Positions work like list indices, so negative positions are allowed too.

df1.iloc[:,-1]
============================
one      87
two      70
three    59
four     73
Name: d, dtype: int32
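  • Slicing:
    Slices are half-open: the start position is included and the stop position is excluded. A single integer or a list of integers must stay within the row and column range, otherwise an IndexError is raised; a slice that runs past the end is simply truncated, as with Python lists.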
df1.iloc[1:3,1:3]
============================
        b   c
two     7  66
three  28  68

print(df1.iloc[::2])
============================
        a   b   c   d
one    81  16  59  87
three  18  28  68  59
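
A sketch of how out-of-range positions behave with iloc, using the df1 values printed above. The position 10 is an arbitrary choice that lies outside the four rows.

```python
import pandas as pd

df1 = pd.DataFrame([[81, 16, 59, 87],
                    [16, 7, 66, 70],
                    [18, 28, 68, 59],
                    [50, 87, 98, 73]],
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

# A slice that runs past the end is truncated, as with a Python list.
print(df1.iloc[2:10])

# A slice that starts past the end gives an empty DataFrame, not an error.
print(df1.iloc[10:])

# A single out-of-range position raises IndexError.
try:
    df1.iloc[10]
except IndexError as err:
    print('IndexError:', err)

# So does a list that contains an out-of-range position.
try:
    df1.iloc[[1, 10]]
except IndexError as err:
    print('IndexError:', err)
```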

Boolean indexing

Boolean indexing works much as it does for a Series. For example:

df = pd.DataFrame(np.random.randint(100,size = 16).reshape(4,4),
                   index = ['one','two','three','four'],
                   columns = ['a','b','c','d'])
print(df)
============================
        a   b   c   d
one     3  94  79  46
two    43  46  79  60
three  54  56  77  24
four   85  24  59  73

df > 50
==============================
           a      b     c      d
one    False   True  True  False
two    False  False  True   True
three   True   True  True  False
four    True  False  True   True

df[df > 50]
==============================
          a     b   c     d
one     NaN  94.0  79   NaN
two     NaN   NaN  79  60.0
three  54.0  56.0  77   NaN
four   85.0   NaN  59  73.0

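As the output shows, applying a boolean mask to the whole DataFrame keeps the original value wherever the mask is True and puts NaN wherever it is False. Columns that receive a NaN are converted to float.
Boolean indexing can also be applied to specific rows or columns, for example: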

df[df['a'] > 50]
==============================
        a   b   c   d
three  54  56  77  24
four   85  24  59  73

df.loc[['one','three']] > 50
==============================
           a     b     c      d
one    False  True  True  False
three   True  True  True  False

df[df.loc[['one','three']] > 50]
==============================
          a     b     c   d
one     NaN  94.0  79.0 NaN
two     NaN   NaN   NaN NaN
three  54.0  56.0  77.0 NaN
four    NaN   NaN   NaN NaN
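
Two common extensions of the examples above, sketched with the same df values: combining several row conditions, and passing a boolean Series to loc together with column labels. The final check illustrates that masking with a whole boolean DataFrame gives the same result as DataFrame.where.

```python
import pandas as pd

df = pd.DataFrame([[3, 94, 79, 46],
                   [43, 46, 79, 60],
                   [54, 56, 77, 24],
                   [85, 24, 59, 73]],
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# Combine row conditions with & (and) or | (or).
# Each condition needs its own parentheses.
both = df[(df['a'] > 50) & (df['d'] > 50)]
print(both)

# loc accepts a boolean Series for the rows plus column labels.
subset = df.loc[df['a'] > 50, ['b', 'c']]
print(subset)

# Masking with a boolean DataFrame matches DataFrame.where.
mask = df > 50
print(df[mask].equals(df.where(mask)))
```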

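Worked example:

Create a DataFrame (4 * 4, random integers from 0 to 99) and extract parts of it by indexing
① All values in columns b and c
② The data in the third and fourth rows
③ The rows two and one, in that order
④ The values greater than 50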

data = np.random.randint(100, size = 16).reshape((4,4))
inx = ['one','two','three','four']
col = list('abcd')
df = pd.DataFrame(data,index=inx,columns=col)
print(df)
print('-'*10)
print(df[['b','c']])
print('-'*10)
print(df.iloc[2:4])
print('-'*10)
print(df.loc[['two','one']])
print('-'*10)
b = df > 50
print(df[b])

==============================
        a   b   c   d
one    20  23  74  94
two    39  32   7  39
three  84   6  32  75
four   53  47  46  25
----------
        b   c
one    23  74
two    32   7
three   6  32
four   47  46
----------
        a   b   c   d
three  84   6  32  75
four   53  47  46  25
----------
      a   b   c   d
two  39  32   7  39
one  20  23  74  94
----------
          a   b     c     d
one     NaN NaN  74.0  94.0
two     NaN NaN   NaN   NaN
three  84.0 NaN   NaN  75.0
four   53.0 NaN   NaN   NaN
