Day 3: Python Indexing (Datawhale)

I. Indexers

import numpy as np
import pandas as pd

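1.1 Indexing a table

Column indexing with [column name] returns a Series.

[list of column names] returns a DataFrame.

.column_name selects a single column whose name contains no spaces and is equivalent to [column name].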

df = pd.read_csv('data/learn_pandas.csv',
                usecols=['School','Grade','Name','Gender','Weight','Transfer'])
df['Name'].head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object
df[['Grade','Name']].head()
Grade Name
0 Freshman Gaopeng Yang
1 Freshman Changqiang You
2 Senior Mei Sun
3 Sophomore Xiaojuan Sun
4 Sophomore Gaojuan You
df.Name.head()
0      Gaopeng Yang
1    Changqiang You
2           Mei Sun
3      Xiaojuan Sun
4       Gaojuan You
Name: Name, dtype: object
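
As a minimal sketch of the three access forms, the example below uses a small hypothetical table in place of learn_pandas.csv (the names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the learn_pandas table
df_toy = pd.DataFrame({'Name': ['Ann', 'Bob'],
                       'Grade': ['Freshman', 'Senior'],
                       'Body Weight': [50.0, 70.0]})

single = df_toy['Name']               # one column name -> Series
multiple = df_toy[['Grade', 'Name']]  # list of names -> DataFrame
by_attr = df_toy.Name                 # attribute access, same as df_toy['Name']

# A column name containing a space can only be reached with brackets
spaced = df_toy['Body Weight']
```

Attribute access is convenient for short names, but the bracket form works for every column name.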

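1.2 Row indexing on a Series

1.2.1 Series with a string index

To take the element(s) for a single label, use [item].

If the label matches exactly one value in the Series, a scalar is returned;

if it matches several values, a Series is returned.

To take the elements between two labels, slicing can be used provided each of the two labels occurs only once in the whole index. The slice includes both end points.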

s = pd.Series([1, 2, 3, 4, 5, 6],
             index=['a','b','a','a','a','c'])
print(s)
s['a']
a    1
b    2
a    3
a    4
a    5
c    6
dtype: int64

a    1
a    3
a    4
a    5
dtype: int64
s['b']
2
s[['a','c']]
a    1
a    3
a    4
a    5
c    6
dtype: int64
s['c':'b':-2]
c    6
a    4
b    2
dtype: int64
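
The uniqueness requirement for slice end points can be checked with a short sketch. It reuses the Series from above; the expectation that a duplicated label in an unsorted index raises KeyError reflects the pandas versions I am familiar with and is worth confirming on your own installation:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=['a', 'b', 'a', 'a', 'a', 'c'])

# 'b' and 'c' each occur once, so the label slice works and keeps both ends
unique_slice = s['b':'c']

# 'a' is duplicated in an unsorted index, so it cannot serve as a slice bound
try:
    s['a':'b']
    slice_failed = False
except KeyError:
    slice_failed = True
```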

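1.2.2 Series with an integer index

If no index is given, an integer index starting from 0 is generated.

Values are taken by label in the same way as with a string index.

Integer slices do not include the right end point.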

s = pd.Series(['a','b','c','d','e','f'],
             index=[1,3,1,2,5,4])
s[1]
1    a
1    c
dtype: object
s[1:-1:2]
3    b
2    d
dtype: object
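
With an integer index, plain [] can be hard to read because labels and positions look alike. The sketch below is not part of the original notes. It shows the same lookups written with .loc (labels) and .iloc (positions), which makes the intent explicit:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'],
              index=[1, 3, 1, 2, 5, 4])

by_label = s.loc[1]           # label 1 occurs twice -> Series holding 'a' and 'c'
by_position = s.iloc[1]       # element at position 1 -> 'b'
pos_slice = s.iloc[1:-1:2]    # positions 1 and 3, right end excluded
```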

Note: do not use pure floats or any mixed types as an index.

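1.3 The loc indexer

loc is the label-based indexer. Its general form is loc[*, *], where the first position selects rows and the second selects columns.

loc[*] selects rows only.

Five kinds of object are valid for *: a single label, a list of labels, a label slice, a boolean list, and a function.

A Series can also be indexed with loc.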

df_demo = df.set_index('Name')
df_demo.head()
School Grade Gender Weight Transfer
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N
Changqiang You Peking University Freshman Male 70.0 N
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Xiaojuan Sun Fudan University Sophomore Female 41.0 N
Gaojuan You Fudan University Sophomore Male 74.0 N

1.3.1 * is a single label

df_demo.loc['Qiang Sun']
School Grade Gender Weight Transfer
Name
Qiang Sun Tsinghua University Junior Female 53.0 N
Qiang Sun Tsinghua University Sophomore Female 40.0 N
Qiang Sun Shanghai Jiao Tong University Junior Female NaN N

Selecting rows and columns together:

df_demo.loc['Qiang Sun','School']
Name
Qiang Sun              Tsinghua University
Qiang Sun              Tsinghua University
Qiang Sun    Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao','School']
'Shanghai Jiao Tong University'

1.3.2 * is a list of labels

df_demo.loc[['Qiang Sun','Quan Zhao'],['School','Gender']]
School Gender
Name
Qiang Sun Tsinghua University Female
Qiang Sun Tsinghua University Female
Qiang Sun Shanghai Jiao Tong University Female
Quan Zhao Shanghai Jiao Tong University Female

1.3.3 * is a slice

If the start and end labels are each unique in the index, a slice can be used, and it includes both end points.

df_demo.loc['Gaojuan You':'Gaoqiang Qian','School':'Gender']
School Grade Gender
Name
Gaojuan You Fudan University Sophomore Male
Xiaoli Qian Tsinghua University Freshman Female
Qiang Chu Shanghai Jiao Tong University Freshman Female
Gaoqiang Qian Tsinghua University Junior Female

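Note: slices on an integer index used with loc also include both end points, and the start and end labels must not be duplicated.

df_demo_copy = df_demo.copy()
df_demo_copy.index = range(df_demo.shape[0],0,-1) # descending order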
df_demo_copy.head()
School Grade Gender Weight Transfer
200 Shanghai Jiao Tong University Freshman Female 46.0 N
199 Peking University Freshman Male 70.0 N
198 Shanghai Jiao Tong University Senior Male 89.0 N
197 Fudan University Sophomore Female 41.0 N
196 Fudan University Sophomore Male 74.0 N
df_demo_copy.loc[5:3]
School Grade Gender Weight Transfer
5 Fudan University Junior Female 46.0 N
4 Tsinghua University Senior Female 50.0 N
3 Shanghai Jiao Tong University Senior Female 45.0 N
df_demo_copy.loc[3:5]
# Returns nothing: the index above is in descending order, so the slice must run in the same direction, or use a step of -1
School Grade Gender Weight Transfer
df_demo_copy.loc[3:5:-1]
School Grade Gender Weight Transfer
3 Shanghai Jiao Tong University Senior Female 45.0 N
4 Tsinghua University Senior Female 50.0 N
5 Fudan University Junior Female 46.0 N
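
The direction rule can be reproduced on a small table. The following is a sketch with a hypothetical five-row DataFrame whose index is descending, mirroring df_demo_copy:

```python
import pandas as pd

# Hypothetical table with a descending integer index
df_desc = pd.DataFrame({'v': list('abcde')}, index=[5, 4, 3, 2, 1])

same_direction = df_desc.loc[4:2]     # follows the index order, both ends kept
opposite = df_desc.loc[2:4]           # runs against the index order -> empty
reversed_step = df_desc.loc[2:4:-1]   # step -1 walks the index backwards
```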

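1.3.4 * is a boolean list

Rows are filtered by a condition. The boolean list passed to loc must have the same length as the DataFrame; positions that are True are selected and positions that are False are dropped.

# students weighing more than 70 kg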
df_demo.loc[df_demo.Weight>70].head()
School Grade Gender Weight Transfer
Name
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Gaojuan You Fudan University Sophomore Male 74.0 N
Xiaopeng Zhou Shanghai Jiao Tong University Freshman Male 74.0 N
Xiaofeng Sun Tsinghua University Senior Male 71.0 N
Qiang Zheng Shanghai Jiao Tong University Senior Male 87.0 N

The isin method also returns a boolean list:

df_demo.loc[df_demo.Grade.isin(['Freshman','Senior'])].head()
School Grade Gender Weight Transfer
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman Female 46.0 N
Changqiang You Peking University Freshman Male 70.0 N
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Xiaoli Qian Tsinghua University Freshman Female 51.0 N
Qiang Chu Shanghai Jiao Tong University Freshman Female 52.0 N

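Compound conditions can be built by combining | (or), & (and) and ~ (not).

# Select Fudan University seniors weighing more than 70, or Peking University male non-seniors weighing more than 80
# (the conditions below do not test Gender; the rows returned here all happen to be male)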
x1 = df_demo.School == 'Fudan University'
x2 = df_demo.Grade == 'Senior'
x3 = df_demo.Weight>70
x = x1 & x2 & x3

y1 = df_demo.School == 'Peking University'
y2 = df_demo.Grade == 'Senior'
y3 = df_demo.Weight >80
y = y1 & (~y2) & y3

df_demo.loc[x | y]

School Grade Gender Weight Transfer
Name
Qiang Han Peking University Freshman Male 87.0 N
Chengpeng Zhou Fudan University Senior Male 81.0 N
Changpeng Zhao Peking University Freshman Male 83.0 N
Chengpeng Qian Fudan University Senior Male 73.0 Y
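
The same compound condition can be written as one expression. The sketch below uses a hypothetical four-row table; the point to note is that each comparison needs its own parentheses, because & and | bind more tightly than == and > in Python:

```python
import pandas as pd

df_toy = pd.DataFrame({'School': ['Fudan', 'Fudan', 'Peking', 'Peking'],
                       'Grade': ['Senior', 'Junior', 'Senior', 'Freshman'],
                       'Weight': [75.0, 90.0, 85.0, 82.0]},
                      index=['s1', 's2', 's3', 's4'])

# Fudan seniors over 70, or Peking non-seniors over 80
mask = (((df_toy.School == 'Fudan') & (df_toy.Grade == 'Senior') & (df_toy.Weight > 70))
        | ((df_toy.School == 'Peking') & ~(df_toy.Grade == 'Senior') & (df_toy.Weight > 80)))

selected = df_toy.loc[mask]
```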

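1.3.5 Exercise

select_dtypes is a handy function that selects the columns of a table by dtype. To select all numeric columns, .select_dtypes('number') is enough. Using boolean-list selection together with the dtypes attribute of the DataFrame, implement the same functionality on the learn_pandas data.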

df_demo.select_dtypes('number').head()
Weight
Name
Gaopeng Yang 46.0
Changqiang You 70.0
Mei Sun 89.0
Xiaojuan Sun 41.0
Gaojuan You 74.0
df_demo[df_demo.columns[df_demo.dtypes == 'float64']].head()
Weight
Name
Gaopeng Yang 46.0
Changqiang You 70.0
Mei Sun 89.0
Xiaojuan Sun 41.0
Gaojuan You 74.0
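
An alternative that stays closer to the wording of the exercise is to pass a boolean list over the columns to loc. This is a sketch on a hypothetical table. It relies on pd.api.types.is_numeric_dtype, which, as far as I know, also treats boolean columns as numeric, so it is not guaranteed to match select_dtypes('number') on every table:

```python
import pandas as pd

df_toy = pd.DataFrame({'Name': ['Ann', 'Bob'],
                       'Weight': [50.0, 70.0],
                       'Age': [18, 21]})

# Boolean list over the columns: True where the column dtype is numeric
is_numeric = df_toy.dtypes.apply(pd.api.types.is_numeric_dtype)
numeric_part = df_toy.loc[:, is_numeric]
```

Unlike the comparison with 'float64', this also picks up integer columns.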

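1.3.6 * is a function

The function must return one of the four valid forms above (single label, list of labels, slice, boolean list), and its input is the DataFrame itself. In the example below, the formal parameter z is simply df_demo.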

def condition(z):
    x1 = z.School == 'Fudan University'
    x2 = z.Grade == 'Senior'
    x3 = z.Weight>70
    x = x1 & x2 & x3
    y1 = z.School == 'Peking University'
    y2 = z.Grade == 'Senior'
    y3 = z.Weight >80
    y = y1 & (~y2) & y3
    result = x | y
    return result
df_demo.loc[condition]
School Grade Gender Weight Transfer
Name
Qiang Han Peking University Freshman Male 87.0 N
Chengpeng Zhou Fudan University Senior Male 81.0 N
Changpeng Zhao Peking University Freshman Male 83.0 N
Chengpeng Qian Fudan University Senior Male 73.0 Y

Lambda expressions are supported:

df_demo.loc[lambda x :'Quan Zhao',lambda x:'Gender']
'Female'

slice is the built-in that creates a slice object, slice(start, stop, step), so a function can return a slice:

df_demo.loc[lambda x: slice('Gaojuan You','Gaoqiang Qian')]
School Grade Gender Weight Transfer
Name
Gaojuan You Fudan University Sophomore Male 74.0 N
Xiaoli Qian Tsinghua University Freshman Female 51.0 N
Qiang Chu Shanghai Jiao Tong University Freshman Female 52.0 N
Gaoqiang Qian Tsinghua University Junior Female 50.0 N

loc can also be used on the left-hand side of an assignment:

df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]],columns=list('AB'))
df_chain
A B
0 0 0
1 1 0
2 -1 0
df_chain.loc[df_chain.A!=0,'B'] = 1 
# set column B to 1 in the rows of df_chain where column A is not 0
df_chain
A B
0 0 0
1 1 1
2 -1 1

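1.4 The iloc indexer

iloc indexes by position.

Five kinds of object are valid: an integer, a list of integers, an integer slice, a boolean list, and a function.

The function must return an integer, a list of integers, an integer slice, or a boolean list.

Slices do not include the end point.

1.4.1 Integer

df_demo.iloc[1,1] # second row, second column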
'Freshman'

1.4.2 List of integers

df_demo.iloc[[0,1],[0,1]] # first two rows, first two columns
School Grade
Name
Gaopeng Yang Shanghai Jiao Tong University Freshman
Changqiang You Peking University Freshman

1.4.3 Slice

df_demo.iloc[1:4,2:4] # rows at positions 1-3, columns at positions 2-3
Gender Weight
Name
Changqiang You Male 70.0
Mei Sun Male 89.0
Xiaojuan Sun Female 41.0

1.4.4 Boolean list

A boolean Series cannot be passed to iloc; pass its values instead. For boolean filtering, prefer loc.

df_demo.iloc[(df_demo.Weight>80).values].head()
School Grade Gender Weight Transfer
Name
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Qiang Zheng Shanghai Jiao Tong University Senior Male 87.0 N
Qiang Han Peking University Freshman Male 87.0 N
Chengpeng Zhou Fudan University Senior Male 81.0 N
Feng Han Shanghai Jiao Tong University Sophomore Male 82.0 N
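
The restriction on boolean Series can be seen in a small sketch with hypothetical data. The exact exception type raised by iloc may depend on the index type and pandas version, so both candidates I am aware of are caught here:

```python
import pandas as pd

df_toy = pd.DataFrame({'Weight': [89.0, 60.0, 87.0]},
                      index=['Mei', 'Li', 'Qiang'])
mask = df_toy.Weight > 80          # boolean Series carrying the row labels

try:
    df_toy.iloc[mask]              # iloc rejects a labelled boolean Series
    series_mask_ok = True
except (ValueError, NotImplementedError):
    series_mask_ok = False

by_values = df_toy.iloc[mask.values]   # a plain boolean array works
by_loc = df_toy.loc[mask]              # loc accepts the Series directly
```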

1.4.5 Function

df_demo.iloc[lambda x : slice(1,4)]  # all columns of the rows at positions 1 to 3
School Grade Gender Weight Transfer
Name
Changqiang You Peking University Freshman Male 70.0 N
Mei Sun Shanghai Jiao Tong University Senior Male 89.0 N
Xiaojuan Sun Fudan University Sophomore Female 41.0 N

A Series can use iloc to return the value or sub-series at the given positions.

df_demo.School
Name
Gaopeng Yang      Shanghai Jiao Tong University
Changqiang You                Peking University
Mei Sun           Shanghai Jiao Tong University
Xiaojuan Sun                   Fudan University
Gaojuan You                    Fudan University
                              ...              
Xiaojuan Sun                   Fudan University
Li Zhao                     Tsinghua University
Chengqiang Chu    Shanghai Jiao Tong University
Chengmei Shen     Shanghai Jiao Tong University
Chunpeng Lv                 Tsinghua University
Name: School, Length: 200, dtype: object
df_demo.School.iloc[1] # value at position 1, i.e. the second row
'Peking University'
df_demo.School.iloc[1:5:2] # end point excluded; returns the rows at positions 1 and 3
Name
Changqiang You    Peking University
Xiaojuan Sun       Fudan University
Name: School, dtype: object

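1.5 The query method

pandas supports passing a query expression as a string to the query method. The expression must evaluate to a boolean list.

df.query('((School == "Fudan University")& ' # string values inside the expression need their own quotes (double quotes here)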
         '(Grade == "Senior")& '
         '(Weight >70))|'
         '((School == "Peking University")&'
         '(Grade != "Senior")&'
         '(Weight > 80))')  
School Grade Name Gender Weight Transfer
38 Peking University Freshman Qiang Han Male 87.0 N
66 Fudan University Senior Chengpeng Zhou Male 81.0 N
99 Peking University Freshman Changpeng Zhao Male 83.0 N
131 Fudan University Senior Chengpeng Qian Male 73.0 Y

Inside query, column names can be referenced directly, and methods are called on them just as in ordinary code.

df.query('Weight>Weight.mean()').head()
School Grade Name Gender Weight Transfer
1 Peking University Freshman Changqiang You Male 70.0 N
2 Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N
4 Fudan University Sophomore Gaojuan You Male 74.0 N
10 Shanghai Jiao Tong University Freshman Xiaopeng Zhou Male 74.0 N
14 Tsinghua University Senior Xiaomei Zhou Female 57.0 N

Column names that contain spaces must be quoted with backticks (`col name`).

The English keywords or, and, in and not in can be used literally.

df.query('(Grade not in ["Freshman","Sophomore"]) and '
        '(Gender == "Male")').head()
School Grade Name Gender Weight Transfer
2 Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N
16 Tsinghua University Junior Xiaoqiang Qin Male 68.0 N
17 Tsinghua University Junior Peng Wang Male 65.0 N
18 Tsinghua University Senior Xiaofeng Sun Male 71.0 N
21 Shanghai Jiao Tong University Senior Xiaopeng Shen Male 62.0 NaN

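To use an external variable in query, prefix the variable name with @.

Example: select the students whose weight is between 70 kg and 80 kg.

low, high = 70, 80
df.query('Weight >= @low and Weight <= @high').head()
# df.query('Weight.between(@low, @high)') did not work when I tried it, so the condition is written with >= and <= instead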
School Grade Name Gender Weight Transfer
1 Peking University Freshman Changqiang You Male 70.0 N
4 Fudan University Sophomore Gaojuan You Male 74.0 N
10 Shanghai Jiao Tong University Freshman Xiaopeng Zhou Male 74.0 N
18 Tsinghua University Senior Xiaofeng Sun Male 71.0 N
35 Peking University Freshman Gaoli Zhao Male 78.0 N
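
Backtick quoting and @ variables can be combined in one expression. A minimal sketch, assuming a hypothetical column named body weight (learn_pandas has no such column):

```python
import pandas as pd

df_toy = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'],
                       'body weight': [65.0, 74.0, 88.0]})  # hypothetical column with a space

low, high = 70, 80
# Backticks quote the column name; @ pulls in the local variables low and high
in_range = df_toy.query('`body weight` >= @low and `body weight` <= @high')
```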

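1.6 Random sampling

If each row is regarded as a sample, each column as a feature, and the whole DataFrame as the population, the sample function can be used to draw a random sample.

Parameters of sample:

n : number of samples to draw

axis : direction of sampling (0 for rows, 1 for columns)

frac : fraction to draw (0.3 means 30% of the population)

replace : whether to sample with replacement; replace=True samples with replacement

weights : relative sampling probability of each sample

Example: build df_sample and draw 3 samples with replacement, using the relative size of value as the sampling probability.

df_sample = pd.DataFrame({'id' : list('abcde'),
                          'value' : [1, 2, 3, 4, 90]})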
df_sample
id value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 90
df_sample.sample(3, replace=True, weights=df_sample.value)
id value
4 e 90
4 e 90
4 e 90
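
Because sampling is random, the output above will differ between runs. The sketch below adds random_state, a sample parameter that is not covered in the list above, to make a draw repeatable. It also illustrates frac and the effect of zero weights:

```python
import pandas as pd

df_sample = pd.DataFrame({'id': list('abcde'),
                          'value': [1, 2, 3, 4, 90]})

# random_state fixes the seed so the same draw can be repeated
draw_1 = df_sample.sample(n=3, replace=True, weights=df_sample.value, random_state=0)
draw_2 = df_sample.sample(n=3, replace=True, weights=df_sample.value, random_state=0)

# frac draws a share of the rows; 0.4 of 5 rows gives 2 rows
part = df_sample.sample(frac=0.4, random_state=0)

# A weight of 0 means the row can never be drawn
only_e = df_sample.sample(n=3, replace=True, weights=[0, 0, 0, 0, 1], random_state=0)
```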

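II. Multi-level indexes

2.1 Multi-level indexes and the structure of their tables

Names of the index levels: names

Values of the index: values

Getting a single level: get_level_values (the index values cannot be modified in place)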

np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'),
                                         df.Gender.unique()],names=('School','Gender'))
multi_column = pd.MultiIndex.from_product([['Height','Weight'],
                                          df.Grade.unique()],names=('Indicator','Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5+163).tolist(),
                             (np.random.randn(8,4)*5+65).tolist()],
                             index = multi_index,
                              columns = multi_column).round(1)
df_multi
Indicator Height Weight
Grade Freshman Senior Sophomore Junior Freshman Senior Sophomore Junior
School Gender
A Female 171.8 165.0 167.9 174.2 60.6 55.1 63.3 65.8
Male 172.3 158.1 167.8 162.2 71.2 71.0 63.1 63.5
B Female 162.5 165.1 163.7 170.3 59.8 57.9 56.5 74.8
Male 166.8 163.6 165.2 164.7 62.5 62.8 58.7 68.9
C Female 170.5 162.0 164.6 158.7 56.9 63.9 60.5 66.9
Male 150.2 166.3 167.3 159.3 62.4 59.1 64.9 67.1
D Female 174.3 155.7 163.2 162.1 65.3 66.5 61.8 63.2
Male 170.7 170.3 163.8 164.9 61.6 63.2 60.9 56.4

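Note: a value of the outer index level is hidden after its first occurrence.

df_multi.index.names # names of the row index levels
df_multi.columns.names # names of the column index levels (only this last result is displayed)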
FrozenList(['Indicator', 'Grade'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)
df_multi.columns.values
array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
df_multi.index.get_level_values(0)
Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')
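
A short sketch of these attributes on a hypothetical two-level index, including the point that an Index is immutable and rejects item assignment:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([list('AB'), ['Female', 'Male']],
                                names=('School', 'Gender'))

outer = mi.get_level_values('School')   # a level can be picked by name
inner = mi.get_level_values(1)          # or by position

# Index objects are immutable, so item assignment is rejected
try:
    outer[0] = 'Z'
    mutated = True
except TypeError:
    mutated = False
```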

2.2 The loc indexer with a multi-level index

df_multi_t = df.set_index(['School','Grade'])
df_multi_t.head()
Name Gender Weight Transfer
School Grade
Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N
Peking University Freshman Changqiang You Male 70.0 N
Shanghai Jiao Tong University Senior Mei Sun Male 89.0 N
Fudan University Sophomore Xiaojuan Sun Female 41.0 N
Sophomore Gaojuan You Male 74.0 N

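In a multi-level index, a single element is a tuple, so the loc usage introduced earlier carries over by replacing each scalar with the corresponding tuple (iloc remains purely positional). It is best to sort the MultiIndex before indexing.

df_multi_t = df_multi_t.sort_index() # sort the index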
df_multi_t.head()
Name Gender Weight Transfer
School Grade
Fudan University Freshman Changqiang Yang Female 49.0 N
Freshman Gaoqiang Qin Female 63.0 N
Freshman Gaofeng Zhao Female 43.0 N
Freshman Yanquan Wang Female 55.0 N
Freshman Feng Wang Male 74.0 N

2.2.1 Single value

df_multi_t.loc[('Fudan University','Junior'),['Weight','Gender']].head()
# the columns are still selected with a list of names
Weight Gender
School Grade
Fudan University Junior 48.0 Female
Junior 72.0 Male
Junior 76.0 Male
Junior 49.0 Female
Junior 43.0 Female

2.2.2 List of values

df_multi_t.loc[[('Fudan University','Senior'),
                ('Shanghai Jiao Tong University','Freshman')]]
Name Gender Weight Transfer
School Grade
Fudan University Senior Chengpeng Zheng Female 38.0 N
Senior Feng Zhou Female 47.0 N
Senior Gaomei Lv Female 34.0 N
Senior Chunli Lv Female 56.0 N
Senior Chengpeng Zhou Male 81.0 N
Senior Gaopeng Qin Female 52.0 N
Senior Chunjuan Xu Female 47.0 N
Senior Juan Zhang Female 47.0 N
Senior Chengpeng Qian Male 73.0 Y
Senior Xiaojuan Qian Female 50.0 N
Senior Quan Xu Female 44.0 N
Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N
Freshman Qiang Chu Female 52.0 N
Freshman Xiaopeng Zhou Male 74.0 N
Freshman Yanpeng Lv Male 65.0 N
Freshman Xiaopeng Zhao Female 53.0 N
Freshman Chunli Zhao Male 83.0 N
Freshman Peng Zhang Female NaN N
Freshman Xiaoquan Sun Female 40.0 N
Freshman Chunmei Shi Female 52.0 N
Freshman Xiaomei Yang Female 49.0 N
Freshman Xiaofeng Qian Female 49.0 N
Freshman Changmei Lv Male 75.0 N
Freshman Qiang Feng Male 80.0 N

2.2.3 Boolean list

df_multi_t.loc[df_multi_t.Weight>70].head()
Name Gender Weight Transfer
School Grade
Fudan University Freshman Feng Wang Male 74.0 N
Junior Chunqiang Chu Male 72.0 N
Junior Changfeng Lv Male 76.0 N
Senior Chengpeng Zhou Male 81.0 N
Senior Chengpeng Qian Male 73.0 Y

2.2.4 Function

df_multi_t.loc[lambda x:('Fudan University','Junior')].head()
Name Gender Weight Transfer
School Grade
Fudan University Junior Yanli You Female 48.0 N
Junior Chunqiang Chu Male 72.0 N
Junior Changfeng Lv Male 76.0 N
Junior Yanjuan Lv Female 49.0 NaN
Junior Gaoqiang Zhou Female 43.0 N

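2.2.5 Exercise

As with a single-level index, slicing cannot be used when the index contains duplicate elements. Remove the duplicate index entries and then give an example of slicing by element.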

df_multi_uni = df.drop_duplicates(['School','Grade']).set_index(['School', 'Grade'])  
df_multi_uni = df_multi_uni.sort_index()
df_multi_uni.loc[('Shanghai Jiao Tong University','Freshman'):('Tsinghua University','Junior')] 
Name Gender Weight Transfer
School Grade
Shanghai Jiao Tong University Freshman Gaopeng Yang Female 46.0 N
Junior Feng Zheng Female 51.0 N
Senior Mei Sun Male 89.0 N
Sophomore Yanfeng Qian Female 48.0 N
Tsinghua University Freshman Xiaoli Qian Female 51.0 N
Junior Gaoqiang Qian Female 50.0 N

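2.2.6 Cross-combination indexing

The columns must be specified in loc (use : to select all). The elements to select on each level are stored in a list, and the form passed to loc is [(level_0_list, level_1_list), cols].

Example: select all sophomores and juniors at Peking University and Fudan University.

# using the list-of-tuples approach from above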
res = df_multi_t.loc[[('Peking University','Sophomore'),
                     ('Peking University','Junior'),
                     ('Fudan University','Sophomore'),
                     ('Fudan University','Junior')]]
print(res.head())
print(res.shape)
                                     Name  Gender  Weight Transfer
School            Grade                                           
Peking University Sophomore   Changmei Xu  Female    43.0        N
                  Sophomore  Xiaopeng Qin    Male     NaN        N
                  Sophomore        Mei Xu  Female    39.0        N
                  Sophomore   Xiaoli Zhou  Female    55.0        N
                  Sophomore      Peng Han  Female    34.0      NaN
(33, 4)
# using cross combination
res_t = df_multi_t.loc[(['Peking University','Fudan University'],
                      ['Sophomore','Junior']),:]
print(res_t.head())
print(res_t.shape)
                                     Name  Gender  Weight Transfer
School            Grade                                           
Peking University Sophomore   Changmei Xu  Female    43.0        N
                  Sophomore  Xiaopeng Qin    Male     NaN        N
                  Sophomore        Mei Xu  Female    39.0        N
                  Sophomore   Xiaoli Zhou  Female    55.0        N
                  Sophomore      Peng Han  Female    34.0      NaN
(33, 4)
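
The equivalence of the two approaches can be checked on a small hypothetical table that does not depend on learn_pandas.csv:

```python
import pandas as pd

df_toy = pd.DataFrame({'School': ['Fudan', 'Fudan', 'Fudan', 'Peking', 'Peking', 'Tsinghua'],
                       'Grade': ['Junior', 'Senior', 'Sophomore', 'Junior', 'Sophomore', 'Junior'],
                       'Weight': [48.0, 81.0, 41.0, 60.0, 43.0, 50.0]})
df_mi = df_toy.set_index(['School', 'Grade']).sort_index()

# Explicit list of tuples
by_tuples = df_mi.loc[[('Fudan', 'Junior'), ('Fudan', 'Sophomore'),
                       ('Peking', 'Junior'), ('Peking', 'Sophomore')]]

# Cross combination: one list per level, wrapped in a tuple, plus : for all columns
by_cross = df_mi.loc[(['Fudan', 'Peking'], ['Junior', 'Sophomore']), :]
```

The cross-combination form grows with the number of values per level, while the tuple list grows with their product.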

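2.3 The IndexSlice object

IndexSlice makes it possible to slice each level and to mix slices with boolean lists. It is used in two forms: loc[idx[*, *]] and loc[idx[*, *], idx[*, *]].

# build a DataFrame whose index has no duplicates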
np.random.seed(0)
L1, L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper','Lower'))
L3, L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big','Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
                     index = mul_index1,
                     columns = mul_index2)
df_ex
Big D E F
Small d e f d e f d e f
Upper Lower
A a 3 6 -9 -6 -6 -2 0 9 -5
b -3 3 -8 -3 -2 5 8 -4 4
c -1 0 7 -4 6 6 -9 9 -6
B a 8 5 -2 -9 -8 0 -9 1 -6
b 2 9 -7 -9 -9 -5 -4 -3 -1
c 8 6 -5 0 1 -8 -8 -2 0
C a -6 -3 2 5 9 -9 5 -6 3
b 1 2 -5 -3 -5 6 -6 3 -5
c -1 5 6 -6 6 4 7 8 -4

Define the IndexSlice object:

idx = pd.IndexSlice

2.3.1 loc[idx[*, *]]

In this form the levels cannot be sliced separately; [*, *] means [rows, columns], the same as in loc.

df_ex.loc[idx['C':,('D','f'):]]
Big D E F
Small f d e f d e f
Upper Lower
C a 2 5 9 -9 5 -6 3
b -5 -3 -5 6 -6 3 -5
c 6 -6 6 4 7 8 -4

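Indexing with a boolean sequence

Note: the function receives the whole of df_ex, not only the rows selected by :'A', so x.sum() is computed over all nine rows. Column (D, d) sums to 11 over the full table, which is why it is selected even though its sum within the A rows is negative.

df_ex.loc[idx[:'A',lambda x:x.sum()>0]]
# columns whose sum over all rows of df_ex is greater than 0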
Big D F
Small d e e
Upper Lower
A a 3 6 9
b -3 3 -4
c -1 0 9

2.3.2 loc[idx[*, *], idx[*, *]]

This form can slice level by level. The first idx refers to the row index and the second idx to the column index.

df_ex.loc[idx[:'A','b':],idx['E':,'e':]]
Big E F
Small e f e f
Upper Lower
A b -2 5 -4 4
c 6 6 9 -6
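
Both forms are shown side by side in the sketch below. It uses a hypothetical table filled with np.arange so that the selected block is easy to verify by eye:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([list('ABC'), list('abc')], names=('Upper', 'Lower'))
cols = pd.MultiIndex.from_product([list('DEF'), list('def')], names=('Big', 'Small'))
df_toy = pd.DataFrame(np.arange(81).reshape(9, 9), index=rows, columns=cols)

idx = pd.IndexSlice

# One idx for both axes: rows from 'C' on, columns from ('D', 'f') on
block = df_toy.loc[idx['C':, ('D', 'f'):]]

# One idx per axis: each level is sliced separately
layered = df_toy.loc[idx[:'A', 'b':], idx['E':, 'e':]]
```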

2.4 Constructing a multi-level index

Besides set_index,

the pd.MultiIndex object offers three constructors:

from_tuples

from_arrays

from_product

2.4.1 from_tuples

Builds the index from a list of tuples.

my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple , names=['first','second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['first', 'second'])

2.4.2 from_arrays

Builds the index from a list that holds one list per level.

my_array = [list('aabb'),['cat','dog']*2]
pd.MultiIndex.from_arrays(my_array,names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

2.4.3 from_product

Builds the index from the Cartesian product of the given lists.

my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1,
                           my_list2],
                          names=['First','Second'])
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])
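
As a quick check, the three constructors can be compared directly. This sketch rebuilds the same four entries in each way and tests them for equality:

```python
import pandas as pd

names = ['First', 'Second']
from_tuples = pd.MultiIndex.from_tuples(
    [('a', 'cat'), ('a', 'dog'), ('b', 'cat'), ('b', 'dog')], names=names)
from_arrays = pd.MultiIndex.from_arrays(
    [list('aabb'), ['cat', 'dog'] * 2], names=names)
from_product = pd.MultiIndex.from_product(
    [['a', 'b'], ['cat', 'dog']], names=names)

# All three constructors describe the same four index entries
same_index = from_tuples.equals(from_arrays) and from_arrays.equals(from_product)
```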

III. Common index methods

3.1 Swapping and removing index levels

np.random.seed(0)
L1,L2,L3=['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
                                       names=('Upper','Lower','Extra'))
L4,L5,L6=['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
                                       names=('Big','Small','Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
                    index=mul_index1,
                    columns=mul_index2)
df_ex
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
beta -9 -5 -4 -3 -1 8 6 -5
b alpha 0 1 -8 -8 -2 0 -6 -3
beta 2 5 9 -9 5 -6 3 1

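3.1.1 Swapping index levels

This is done with swaplevel and reorder_levels. The former swaps two levels, the latter reorders any number of levels, and both let you specify the axis, that is, the row index or the column index.

df_ex.swaplevel(0,2,axis=1).head()
# axis=1 selects the column index; 0 and 2 are its first and third levels, so the first and third levels of the column index are swapped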
Other cat dog cat dog cat dog cat dog
Small c c d d c c d d
Big C C C C D D D D
Upper Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
df_ex.reorder_levels([2,0,1],axis=0).head() 
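# [2,0,1] means the new first level is the old third level, the new second level is the old first, and the new third level is the old second; axis=0 selects the row index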
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Extra Upper Lower
alpha A a 3 6 -9 -6 -6 -2 0 9
beta A a -5 -3 3 -8 -3 -2 5 8
alpha A b -4 4 -1 0 7 -4 6 6
beta A b -9 9 -6 8 5 -2 -9 -8
alpha B a 0 -9 1 -6 2 9 -7 -9

3.1.2 Removing index levels

Levels are removed with droplevel.

df_ex.droplevel(1,axis=1) # drop the second level of the column index
Big C D
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
beta -9 -5 -4 -3 -1 8 6 -5
b alpha 0 1 -8 -8 -2 0 -6 -3
beta 2 5 9 -9 5 -6 3 1
df_ex.droplevel([0,1],axis=0) # drop the first and second levels of the row index
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Extra
alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
alpha 0 -9 1 -6 2 9 -7 -9
beta -9 -5 -4 -3 -1 8 6 -5
alpha 0 1 -8 -8 -2 0 -6 -3
beta 2 5 9 -9 5 -6 3 1
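
The effect of the three methods on the level names is easiest to see on a compact example. The sketch below uses a hypothetical one-column table with a three-level row index:

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b'], ['alpha', 'beta']],
                                  names=('Upper', 'Lower', 'Extra'))
df_toy = pd.DataFrame({'v': np.arange(8)}, index=rows)

swapped = df_toy.swaplevel(0, 2, axis=0)               # exchange the first and third level
reordered = df_toy.reorder_levels([2, 0, 1], axis=0)   # new order given by old positions
dropped = df_toy.droplevel([0, 1], axis=0)             # keep only 'Extra'
```

None of these calls sorts or changes the rows; only the arrangement of the levels differs.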

3.2 Modifying index attributes

3.2.1 Renaming index levels

Use rename_axis to change the names of index levels; the usual form is to pass a dict mapping.

df_ex.rename_axis(index={'Upper':'Changed_row'},
                  columns={'Other':'Change_Col'}).head()
Big C D
Small c d c d
Change_Col cat dog cat dog cat dog cat dog
Changed_row Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9

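3.2.2 Modifying the values of an index level

rename changes the values of an index; for a multi-level index, the level to modify must be given with level.

df_ex.rename(columns={'cat':'not_cat'},
             level=2).head()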
Big C D
Small c d c d
Other not_cat dog not_cat dog not_cat dog not_cat dog
Upper Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9

If a function is passed, its input is each index element.

df_ex.rename(index=lambda x:str.upper(x),
            level=2).head()
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a ALPHA 3 6 -9 -6 -6 -2 0 9
BETA -5 -3 3 -8 -3 -2 5 8
b ALPHA -4 4 -1 0 7 -4 6 6
BETA -9 9 -6 8 5 -2 -9 -8
B a ALPHA 0 -9 1 -6 2 9 -7 -9

3.2.3 Exercise

Use a function in rename_axis to achieve the same result as in the example.

df_ex.rename_axis(index=lambda x:'Changed_Row' if x=='Upper' else x,
                  columns=lambda x:'Changed_Col' if x=='Other' else x ).head() 
Big C D
Small c d c d
Changed_Col cat dog cat dog cat dog cat dog
Changed_Row Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9

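3.2.4 Replacing all elements of an index

1. Using an iterator

new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values),
            level=2)
# the number of elements in the iterator must match the number of elements being replaced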
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a a 3 6 -9 -6 -6 -2 0 9
b -5 -3 3 -8 -3 -2 5 8
b c -4 4 -1 0 7 -4 6 6
d -9 9 -6 8 5 -2 -9 -8
B a e 0 -9 1 -6 2 9 -7 -9
f -9 -5 -4 -3 -1 8 6 -5
b g 0 1 -8 -8 -2 0 -6 -3
h 2 5 9 -9 5 -6 3 1

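Question: when modifying the first level, how many values are needed? My first assumption was 8. Testing with only 2 values raised an error, so 8 values are required.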

new_values = iter(list('AB'*4))
df_ex.rename(index=lambda x:next(new_values),
            level=0)
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
B a beta -5 -3 3 -8 -3 -2 5 8
A b alpha -4 4 -1 0 7 -4 6 6
B b beta -9 9 -6 8 5 -2 -9 -8
A a alpha 0 -9 1 -6 2 9 -7 -9
B a beta -9 -5 -4 -3 -1 8 6 -5
A b alpha 0 1 -8 -8 -2 0 -6 -3
B b beta 2 5 9 -9 5 -6 3 1

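For a single-level index, take the values attribute of the index, build the new list, and assign it back to the index.

For a multi-level index, temporarily turn one level into a column of the table, modify it, and then set it as an index again.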
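
The two alternatives described above can be sketched as follows. The tables and the replacement values are hypothetical, and this is one possible way to carry out the steps, not necessarily the one the course material had in mind:

```python
import pandas as pd

# Single-level index: take the values, build a new list, assign it back
df_single = pd.DataFrame({'v': [1, 2, 3]}, index=['x', 'y', 'z'])
old_values = df_single.index.values
df_single.index = [label.upper() for label in old_values]

# Multi-level index: move one level into the columns, edit it, restore it
rows = pd.MultiIndex.from_product([['A', 'B'], ['alpha', 'beta']],
                                  names=('Upper', 'Extra'))
df_multi_toy = pd.DataFrame({'v': [1, 2, 3, 4]}, index=rows)

tmp = df_multi_toy.reset_index('Extra')          # 'Extra' becomes a column
tmp['Extra'] = list('abcd')                      # replace its values
restored = tmp.set_index('Extra', append=True)   # put it back as the inner level
```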

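2. Using map

map is a method defined on Index. It is used like the function form of rename, but instead of the scalar value of one level it receives the whole index tuple, so it can traverse and modify across levels.

# convert strings to upper case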
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x:(x[0],
                                     x[1],
                                     str.upper(x[2])))
df_temp.index = new_idx
df_temp.head()
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a ALPHA 3 6 -9 -6 -6 -2 0 9
BETA -5 -3 3 -8 -3 -2 5 8
b ALPHA -4 4 -1 0 7 -4 6 6
BETA -9 9 -6 8 5 -2 -9 -8
B a ALPHA 0 -9 1 -6 2 9 -7 -9

Another use of map is to compress a multi-level index into a single level:

df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x :(x[0]+'-'+
                                      x[1]+'-'+
                                      x[2]))
df_temp.index = new_idx
df_temp.head() # now a single-level index
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
A-a-alpha 3 6 -9 -6 -6 -2 0 9
A-a-beta -5 -3 3 -8 -3 -2 5 8
A-b-alpha -4 4 -1 0 7 -4 6 6
A-b-beta -9 9 -6 8 5 -2 -9 -8
B-a-alpha 0 -9 1 -6 2 9 -7 -9

Expanding it again:

new_idx = df_temp.index.map(lambda x :tuple(x.split('-')))
# split the string on '-', then convert the resulting list to a tuple
df_temp.index = new_idx
df_temp.head() # back to a three-level index
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
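
A round trip through map can be verified on a small hypothetical index. Two caveats are worth keeping in mind: the level names are lost when the index is flattened to strings, and splitting only works cleanly if no label contains the separator itself:

```python
import pandas as pd

rows = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=('Upper', 'Lower'))
df_toy = pd.DataFrame({'v': [1, 2, 3, 4]}, index=rows)

# Compress: each index entry arrives as a tuple and is joined into one string
flat = df_toy.index.map(lambda t: '-'.join(t))

# Expand: returning tuples from map yields a MultiIndex again
expanded = flat.map(lambda label: tuple(label.split('-')))
expanded.names = rows.names      # level names are not carried through the strings
```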

3.3 Setting and resetting the index

df_new = pd.DataFrame({'A':list('aacd'),
                       'B':list('PQRT'),
                       'C':[1,2,3,4]})
df_new
A B C
0 a P 1
1 a Q 2
2 c R 3
3 d T 4

3.3.1 Setting the index

Use set_index. Its append parameter controls whether the existing index is kept.

df_new.set_index('A')
B C
A
a P 1
a Q 2
c R 3
d T 4
df_new.set_index('A',append=True)
B C
A
0 a P 1
1 a Q 2
2 c R 3
3 d T 4

Several columns can be set as the index:

df_new.set_index(['A','B'])
C
A B
a P 1
Q 2
c R 3
d T 4

To add a new column as an index level, pass a Series in the argument.

df_new = pd.DataFrame({'A':list('aacd'),
                       'B':list('PQRT'),
                       'C':[1,2,3,4]})
my_index = pd.Series(list('WXYZ'),name='D')
df_new = df_new.set_index(['A',my_index])
df_new
B C
A D
a W P 1
X Q 2
c Y R 3
d Z T 4

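3.3.2 Resetting the index

reset_index is the inverse of set_index. Its main parameter is drop, which controls whether the removed index level is discarded (not added back to the columns).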

df_new.reset_index(['D'])
D B C
A
a W P 1
a X Q 2
c Y R 3
d Z T 4
df_new.reset_index(['A'],drop=True)
B C
D
W P 1
X Q 2
Y R 3
Z T 4

Resetting all index levels:

df_new.reset_index()
A D B C
0 a W P 1
1 a X Q 2
2 c Y R 3
3 d Z T 4
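
That reset_index undoes set_index can be confirmed with a round trip. A minimal sketch on the same kind of table as above:

```python
import pandas as pd

df_toy = pd.DataFrame({'A': list('aacd'),
                       'B': list('PQRT'),
                       'C': [1, 2, 3, 4]})

indexed = df_toy.set_index(['A', 'B'])          # A and B move into the index
round_trip = indexed.reset_index()              # and come back as the leading columns

kept_outer = indexed.reset_index('B')           # only B returns to the columns
dropped = indexed.reset_index('A', drop=True)   # A is discarded instead of restored
```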

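3.4 Reshaping an index

Reshaping an index means extending or trimming it: a new index is specified, and the elements of the original table are filled into the table formed by the new index wherever the labels match.

Use reindex.

Example: add one employee to the new table while removing the height column and adding a gender column.

df_reindex = pd.DataFrame({'Weight':[60,70,80],
                           'Height':[176,180,179]},
                          index = ['1001','1003','1002'])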
df_reindex
Weight Height
1001 60 176
1003 70 180
1002 80 179
df_reindex.reindex(index=['1001','1003','1002','1004'],
                   columns=['Weight','Gender'])
Weight Gender
1001 60.0 NaN
1003 70.0 NaN
1002 80.0 NaN
1004 NaN NaN

reindex_like

It reshapes the index of the calling table to follow the index of the table passed in.

df_reindex
Weight Height
1001 60 176
1003 70 180
1002 80 179
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'],
                         columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
Weight Gender
1001 60.0 NaN
1002 80.0 NaN
1003 70.0 NaN
1004 NaN NaN
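
By default the new positions are filled with NaN, which also turns the integer Weight column into floats. The sketch below uses fill_value, a reindex parameter not discussed above, to supply a different placeholder; the value 0 is chosen only for illustration:

```python
import pandas as pd

df_reindex = pd.DataFrame({'Weight': [60, 70, 80],
                           'Height': [176, 180, 179]},
                          index=['1001', '1003', '1002'])

# fill_value replaces the default NaN for labels that did not exist before
filled = df_reindex.reindex(index=['1001', '1002', '1003', '1004'],
                            columns=['Weight', 'Gender'],
                            fill_value=0)
```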

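IV. Index operations

4.1 Rules of set operations

Intersection of A and B: in A and in B (A.intersection(B) <=> A & B)

Union of A and B: in A or in B (A.union(B) <=> A | B)

A minus B: in A but not in B (A.difference(B) <=> (A ^ B) & A)

Symmetric difference of A and B: the union of A and B minus their intersection (A.symmetric_difference(B) <=> A ^ B)

The operator forms on the right are written in Python set notation.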
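
These identities can be checked with plain Python sets. The two sets below are hypothetical examples:

```python
A = {'a', 'b'}
B = {'b', 'c'}

intersection = A & B       # in A and in B
union = A | B              # in A or in B
difference = A - B         # in A but not in B
sym_difference = A ^ B     # in exactly one of A and B

# The identities used in the rules above
diff_identity = difference == (A ^ B) & A
sym_identity = sym_difference == (A | B) - (A & B)
```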

4.2 General index operations

Because the elements of a set are unique, deduplicate with unique before performing the operations.

df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]],
                       index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]],
                       index = pd.Index(['b','b','c'],name='id2'))
id1,id2 = df_set_1.index.unique(),df_set_2.index.unique()

id1.intersection(id2) 
Index(['b'], dtype='object')
id1.union(id2)
Index(['a', 'b', 'c'], dtype='object')
id1.difference(id2)
Index(['a'], dtype='object')
id1.symmetric_difference(id2)
Index(['a', 'c'], dtype='object')

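If the columns on which the two tables need set operations have not been set as indexes, there are two options. The first is to convert them to indexes with set_index, perform the operation, and then restore them. The second is to use isin.

Example: select the rows where id lies in the intersection of the two id columns.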

df_set_in_col1 = df_set_1.reset_index()
df_set_in_col1
id1 0 1
0 a 0 1
1 b 1 2
2 a 3 4
df_set_in_col2 = df_set_2.reset_index()
df_set_in_col2
id2 0 1
0 b 4 5
1 b 2 6
2 c 7 1
df_set_in_col1[df_set_in_col1.id1.isin(df_set_in_col2.id2)]
id1 0 1
1 b 1 2
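
Only the isin approach is shown above. For comparison, here is a sketch of the first approach (through an index) on hypothetical tables with named value columns. It is one possible reading of "convert, operate, restore", not the only one:

```python
import pandas as pd

df_1 = pd.DataFrame({'id1': ['a', 'b', 'a'], 'x': [0, 1, 3]})
df_2 = pd.DataFrame({'id2': ['b', 'b', 'c'], 'y': [4, 2, 7]})

# Approach 1: build unique indexes from the columns, intersect, then filter by label
common = pd.Index(df_1['id1'].unique()).intersection(pd.Index(df_2['id2'].unique()))
via_index = df_1.set_index('id1').loc[list(common)].reset_index()

# Approach 2: isin on the column itself
via_isin = df_1[df_1['id1'].isin(df_2['id2'])]
```

The isin form keeps the original row labels, while the index form renumbers the rows after reset_index.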

V. Exercises

5.1 Company employee dataset

df = pd.read_csv('data/Company.csv')
df.head()
EmployeeID birthdate_key age city_name department job_title gender
0 1318 1/3/1954 61 Vancouver Executive CEO M
1 1319 1/3/1957 58 Vancouver Executive VP Stores F
2 1320 1/2/1955 60 Vancouver Executive Legal Counsel F
3 1321 1/2/1959 56 Vancouver Executive VP Human Resources M
4 1322 1/9/1958 57 Vancouver Executive VP Finance M

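1. Use query and loc respectively to select men aged 40 or under who work in the Dairy or Bakery department.

Using query

Approach: write the conditions out one by one. Replacing == with is in raised an error; testing showed that the correct keyword is in, without is.

df.query('(age <= 40) & '
         '(department in ["Dairy","Bakery"]) & '
         '(gender == "M")'
        ).shape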
(441, 7)

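Using loc

Approach: loc needs boolean values to find the rows that satisfy the conditions; shape is then used to check that both queries give the same result.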

x1 = df.age<=40
x2 = df.department.isin(["Dairy","Bakery"])
x3 = df.gender == 'M'
x = x1 & x2 & x3
df.loc[x].shape
(441, 7)

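2. Select the 1st, 3rd and second-to-last columns of the rows whose employee ID is odd.

Approach: an ID is odd if it is not divisible by 2. This is a condition on column values, so loc is used. The 1st, 3rd and second-to-last columns are then chosen by position, so iloc is used; note that positions start from 0.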

(df.loc[df.EmployeeID%2!=0]).iloc[:,[0,2,-2]].head()
EmployeeID age job_title
1 1319 58 VP Stores
3 1321 56 VP Human Resources
5 1323 53 Exec Assistant, VP Stores
6 1325 51 Exec Assistant, Legal Counsel
8 1329 48 Store Manager

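3. Perform the following index operations step by step

3.1 Set the last three columns as the index, then swap the inner and outer levels

Approach: take the column names with columns, then use set_index to make them the index. Because only two levels are swapped, swaplevel is used to exchange levels of the row index. Note that the code below uses df.columns[-4:-1], i.e. city_name, department and job_title (the three columns before gender), rather than the literal last three columns df.columns[-3:].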

df.columns[-4:-1].values
array(['city_name', 'department', 'job_title'], dtype=object)
df_demo = df.set_index(list(df.columns[-4:-1].values))
df_demo.head()
EmployeeID birthdate_key age gender
city_name department job_title
Vancouver Executive CEO 1318 1/3/1954 61 M
VP Stores 1319 1/3/1957 58 F
Legal Counsel 1320 1/2/1955 60 F
VP Human Resources 1321 1/2/1959 56 M
VP Finance 1322 1/9/1958 57 M
df_demo.swaplevel(0,2,axis=0).head()
EmployeeID birthdate_key age gender
job_title department city_name
CEO Executive Vancouver 1318 1/3/1954 61 M
VP Stores Executive Vancouver 1319 1/3/1957 58 F
Legal Counsel Executive Vancouver 1320 1/2/1955 60 F
VP Human Resources Executive Vancouver 1321 1/2/1959 56 M
VP Finance Executive Vancouver 1322 1/9/1958 57 M

3.2 Restore the middle level

Restoring a level uses reset_index. The name of the middle level (position 1 of the index) is available as df_demo.index.names[1].

df_demo1 = df_demo.reset_index(df_demo.index.names[1])
df_demo1.head()
department EmployeeID birthdate_key age gender
city_name job_title
Vancouver CEO Executive 1318 1/3/1954 61 M
VP Stores Executive 1319 1/3/1957 58 F
Legal Counsel Executive 1320 1/2/1955 60 F
VP Human Resources Executive 1321 1/2/1959 56 M
VP Finance Executive 1322 1/9/1958 57 M

3.3 Rename the outer index level to Gender

Use rename_axis directly and modify index.

df_demo1.rename_axis(index={'city_name':'Gender'}).head()
department EmployeeID birthdate_key age gender
Gender job_title
Vancouver CEO Executive 1318 1/3/1954 61 M
VP Stores Executive 1319 1/3/1957 58 F
Legal Counsel Executive 1320 1/2/1955 60 F
VP Human Resources Executive 1321 1/2/1959 56 M
VP Finance Executive 1322 1/9/1958 57 M

3.4 Merge the two row index levels with an underscore

Merge the levels with map.

df_demo1.head()
department EmployeeID birthdate_key age gender
city_name job_title
Vancouver CEO Executive 1318 1/3/1954 61 M
VP Stores Executive 1319 1/3/1957 58 F
Legal Counsel Executive 1320 1/2/1955 60 F
VP Human Resources Executive 1321 1/2/1959 56 M
VP Finance Executive 1322 1/9/1958 57 M
df_temp = df_demo1.copy()
new_idx = df_temp.index.map(lambda x :(x[0]+'_'+
                            x[1]))
df_temp.index = new_idx
df_temp.head()
department EmployeeID birthdate_key age gender
Vancouver_CEO Executive 1318 1/3/1954 61 M
Vancouver_VP Stores Executive 1319 1/3/1957 58 F
Vancouver_Legal Counsel Executive 1320 1/2/1955 60 F
Vancouver_VP Human Resources Executive 1321 1/2/1959 56 M
Vancouver_VP Finance Executive 1322 1/9/1958 57 M

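3.5 Split the row index back into its original state

Do the reverse with map and split: splitting each label on '_' and returning a tuple turns the index back into a MultiIndex.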

new_idx = df_temp.index.map(lambda x :tuple(x.split('_')))
df_temp.index = new_idx
df_temp.head()
department EmployeeID birthdate_key age gender
Vancouver CEO Executive 1318 1/3/1954 61 M
VP Stores Executive 1319 1/3/1957 58 F
Legal Counsel Executive 1320 1/2/1955 60 F
VP Human Resources Executive 1321 1/2/1959 56 M
VP Finance Executive 1322 1/9/1958 57 M
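
The merge and split steps form a round trip. Here is a minimal sketch on made-up data; it assumes that none of the labels contains an underscore, otherwise split('_') would produce tuples of different lengths:

```python
import pandas as pd

# Toy frame with a two-level row index (values are made up).
idx = pd.MultiIndex.from_tuples(
    [('Vancouver', 'CEO'), ('Vancouver', 'VP Stores')],
    names=['city_name', 'job_title'])
toy = pd.DataFrame({'age': [61, 58]}, index=idx)

# Merge: every element of a MultiIndex is a tuple of labels.
merged = toy.copy()
merged.index = merged.index.map(lambda pair: '_'.join(pair))

# Split: returning tuples from map yields a MultiIndex again.
restored = merged.copy()
restored.index = restored.index.map(lambda label: tuple(label.split('_')))

print(merged.index.tolist())
print(restored.index.tolist())
```

The level names are not carried through the round trip, which is why they are set again in the next step.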

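3.6 Rename the index levels to the original names

So far we have been working on df_temp, a copy of df_demo1. Take the index names from df_demo1, then assign them to df_temp.index.names.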

new_name = df_demo1.index.names
new_name
FrozenList(['city_name', 'job_title'])
df_temp.index.names = new_name
df_temp.head(1)
department EmployeeID birthdate_key age gender
city_name job_title
Vancouver CEO Executive 1318 1/3/1954 61 M

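3.7 Restore the default index and keep the columns in their original relative order

First reset the index, then use loc to select the columns in the original order and assign the result back to the same variable.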

df_new = df_temp.reset_index()
df_new.head()
city_name job_title department EmployeeID birthdate_key age gender
0 Vancouver CEO Executive 1318 1/3/1954 61 M
1 Vancouver VP Stores Executive 1319 1/3/1957 58 F
2 Vancouver Legal Counsel Executive 1320 1/2/1955 60 F
3 Vancouver VP Human Resources Executive 1321 1/2/1959 56 M
4 Vancouver VP Finance Executive 1322 1/9/1958 57 M
cols = list(df.columns)
df_new = df_new.loc[:,cols]
df_new.head()
EmployeeID birthdate_key age city_name department job_title gender
0 1318 1/3/1954 61 Vancouver Executive CEO M
1 1319 1/3/1957 58 Vancouver Executive VP Stores F
2 1320 1/2/1955 60 Vancouver Executive Legal Counsel F
3 1321 1/2/1959 56 Vancouver Executive VP Human Resources M
4 1322 1/9/1958 57 Vancouver Executive VP Finance M
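
Reordering columns by a list of names can be written in several equivalent ways. A small sketch on a made-up one-row frame (original_order stands in for list(df.columns)):

```python
import pandas as pd

# Columns in the order left behind by reset_index (values are made up).
toy = pd.DataFrame({
    'city_name': ['Vancouver'],
    'job_title': ['CEO'],
    'department': ['Executive'],
    'EmployeeID': [1318],
})

# Stand-in for the column order of the original table.
original_order = ['EmployeeID', 'city_name', 'department', 'job_title']

via_loc = toy.loc[:, original_order]
via_brackets = toy[original_order]
via_reindex = toy.reindex(columns=original_order)

print(list(via_loc.columns))
```

All three return a new frame, so the result has to be assigned back if the new order should persist.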

5.2 Chocolate dataset

df = pd.read_csv('data/chocolate.csv')
df.head(3)
Company Review\nDate Cocoa\nPercent Company\nLocation Rating
0 A. Morin 2016 63% France 3.75
1 A. Morin 2015 70% France 2.75
2 A. Morin 2015 70% France 3.00
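  1. Replace \n in the column names with a space

    Approach: take the original column index, replace the character in each name, then assign the result back to columns.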
df_demo = df.copy()
df_demo.columns = df_demo.columns.map(lambda x:x.replace('\n',' '))
df_demo.head()
Company Review Date Cocoa Percent Company Location Rating
0 A. Morin 2016 63% France 3.75
1 A. Morin 2015 70% France 2.75
2 A. Morin 2015 70% France 3.00
3 A. Morin 2015 70% France 3.50
4 A. Morin 2015 70% France 3.50
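
The same replacement can also be done with the vectorized string accessor on the column index. A minimal sketch on an empty frame that only carries three of the column names:

```python
import pandas as pd

# Empty frame: only the column names matter here.
toy = pd.DataFrame(columns=['Company', 'Review\nDate', 'Cocoa\nPercent'])

# Element-wise replacement through map, as in the text.
via_map = toy.columns.map(lambda name: name.replace('\n', ' '))

# Vectorized alternative; regex=False treats '\n' as a literal string.
via_str = toy.columns.str.replace('\n', ' ', regex=False)

print(list(via_str))
```
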
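  2. Chocolate Rating ranges from 1 to 5 in steps of 0.25. Select the samples rated 2.75 or below whose Cocoa Percent is above the median.

Cocoa Percent is stored as strings, so it has to be converted to float before the median can be computed. The column name contains a space, so inside query it must be wrapped in backticks (`). Note that the query below uses >=, so rows exactly at the median are kept as well; for a strict "above the median" reading, replace it with >.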

df_demo['Cocoa Percent'] = df_demo['Cocoa Percent'].apply(lambda x: float(x[:-1])/100)
df_demo.head()
Company Review Date Cocoa Percent Company Location Rating
0 A. Morin 2016 0.63 France 3.75
1 A. Morin 2015 0.70 France 2.75
2 A. Morin 2015 0.70 France 3.00
3 A. Morin 2015 0.70 France 3.50
4 A. Morin 2015 0.70 France 3.50
df_demo.query('((Rating <=2.75) &'
              '(`Cocoa Percent` >= `Cocoa Percent`.median()))').head()
Company Review Date Cocoa Percent Company Location Rating
1 A. Morin 2015 0.70 France 2.75
5 A. Morin 2014 0.70 France 2.75
10 A. Morin 2013 0.70 France 2.75
14 A. Morin 2013 0.70 France 2.75
33 Akesson's (Pralus) 2010 0.75 Switzerland 2.75
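
The difference between > and >= against the median is easy to see on a tiny example. The following sketch uses made-up values and a vectorized conversion (str.rstrip plus astype) in place of apply:

```python
import pandas as pd

# Made-up ratings and percentages; the median percentage is 70%.
toy = pd.DataFrame({
    'Cocoa Percent': ['63%', '70%', '70%', '75%', '82%'],
    'Rating': [3.75, 2.75, 3.00, 2.75, 2.50],
})

# Strip the percent sign, convert to float, and scale to a fraction.
toy['Cocoa Percent'] = toy['Cocoa Percent'].str.rstrip('%').astype(float) / 100

# Backticks let query refer to a column name that contains a space.
strict = toy.query(
    '(Rating <= 2.75) & (`Cocoa Percent` > `Cocoa Percent`.median())')
inclusive = toy.query(
    '(Rating <= 2.75) & (`Cocoa Percent` >= `Cocoa Percent`.median())')

print(list(strict.index))
print(list(inclusive.index))
```

Row 1 sits exactly at the median, so it only appears in the inclusive result.
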
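  3. Set Review Date and Company Location as the index, then select the samples whose Review Date is after 2012 and whose Company Location is not France, Canada, Amsterdam or Belgium.

Approach: collect the years after 2012 and the locations that are not in the excluded list, then pass both lists to loc as a tuple, which selects the Cartesian product of the two. (I don't think this is a great method, but it is the best I could come up with for now.)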

df_demo1 = df.copy()
df_demo1.columns = df_demo1.columns.map(lambda x:x.replace('\n',' '))
df_demo1 = df_demo1.set_index(['Review Date','Company Location'])
date = df_demo1.index.get_level_values(0)
date1 = list(set(date))
location = df_demo1.index.get_level_values(1)
location1 = list(set(location))
x = [i for i in date1 if i>2012 ]
y = [i for i in location1 if i not in  ['France','Canada','Amsterdam','Belgium']]
df_demo1.loc[(x,y),:].head(2)
Company Cocoa Percent Rating
Review Date Company Location
2016 Austria Martin Mayer 76% 2.75
Austria Martin Mayer 82% 3.00
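
One possible alternative to the Cartesian product is a boolean mask built from the index levels. This is only a sketch on made-up data, not the reference solution; toy, mask and selected are illustrative names:

```python
import pandas as pd

# Made-up rows; the index uses the same two level names as above.
toy = pd.DataFrame({
    'Review Date': [2016, 2011, 2014, 2013],
    'Company Location': ['Austria', 'Austria', 'France', 'Belgium'],
    'Rating': [2.75, 3.00, 3.50, 3.25],
}).set_index(['Review Date', 'Company Location'])

# Pull each level out as a flat Index aligned with the rows.
years = toy.index.get_level_values('Review Date')
places = toy.index.get_level_values('Company Location')

excluded = ['France', 'Canada', 'Amsterdam', 'Belgium']

# Combine both conditions row by row and select with loc.
mask = (years > 2012) & ~places.isin(excluded)
selected = toy.loc[mask]

print(selected)
```

The mask works row by row, so there is no need to deduplicate the level values or to build the list of allowed locations first.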
