import numpy as np
import pandas as pd
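Column indexing with [column name] returns a Series.
[list of column names] returns a DataFrame.
.column_name extracts a single column, provided the name contains no spaces; it is equivalent to [column name].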
df = pd.read_csv('data/learn_pandas.csv',
usecols=['School','Grade','Name','Gender','Weight','Transfer'])
df['Name'].head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
df[['Grade','Name']].head()
Grade | Name | |
---|---|---|
0 | Freshman | Gaopeng Yang |
1 | Freshman | Changqiang You |
2 | Senior | Mei Sun |
3 | Sophomore | Xiaojuan Sun |
4 | Sophomore | Gaojuan You |
df.Name.head()
0 Gaopeng Yang
1 Changqiang You
2 Mei Sun
3 Xiaojuan Sun
4 Gaojuan You
Name: Name, dtype: object
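The three column-access forms can be sketched on a small hand-made table instead of learn_pandas.csv. The column name 'Test Number' is a hypothetical name, chosen only because it contains a space and so cannot be reached with attribute access.

```python
import pandas as pd

# Hand-made stand-in for the course data; 'Test Number' is a made-up column.
df_small = pd.DataFrame({
    'Name': ['Gaopeng Yang', 'Changqiang You'],
    'Grade': ['Freshman', 'Freshman'],
    'Test Number': [1, 2],
})

single = df_small['Name']            # one name -> Series
table = df_small[['Grade', 'Name']]  # list of names -> DataFrame
same = df_small.Name                 # attribute access, same as df_small['Name']

# A name containing a space is not a valid attribute, so brackets are required.
spaced = df_small['Test Number']

print(type(single).__name__, type(table).__name__)
```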
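To take the element for a single index label, use [item]:
if the label matches exactly one value in the Series, a scalar is returned;
if it matches several values, a Series is returned.
To take the elements between two index labels, a slice can be used as long as both labels appear only once in the whole index; the slice includes both endpoints.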
s = pd.Series([1, 2, 3, 4, 5, 6],
index=['a','b','a','a','a','c'])
print(s)
s['a']
a 1
b 2
a 3
a 4
a 5
c 6
dtype: int64
a 1
a 3
a 4
a 5
dtype: int64
s['b']
2
s[['a','c']]
a 1
a 3
a 4
a 5
c 6
dtype: int64
s['c':'b':-2]
c 6
a 4
b 2
dtype: int64
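A short sketch of why slice endpoints need to be unique, reusing the Series from above under the name s_dup. Sorting the index first is one possible workaround; it is an addition here, not something the notes themselves use.

```python
import pandas as pd

s_dup = pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'a', 'a', 'a', 'c'])

# 'a' is duplicated and the index is unsorted, so 'a' cannot act as a slice endpoint.
try:
    s_dup.loc['a':'b']
    raised = False
except KeyError:
    raised = True

# After sorting, the duplicated labels are contiguous and the slice is well defined.
s_sorted = s_dup.sort_index()
sliced = s_sorted.loc['a':'b']  # both endpoints are included

print(raised, len(sliced))
```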
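If no index is specified, an integer index starting from 0 is generated.
Integer labels retrieve the matching elements in the same way as string labels.
Integer slices are positional and exclude the right endpoint.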
s = pd.Series(['a','b','c','d','e','f'],
index=[1,3,1,2,5,4])
s[1]
1 a
1 c
dtype: object
s[1:-1:2]
3 b
2 d
dtype: object
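To keep label-based and position-based access apart on an integer index, the explicit indexers can be used. The sketch below reuses the Series from above as s_int and adds a second, hypothetical Series t with the default index.

```python
import pandas as pd

s_int = pd.Series(['a', 'b', 'c', 'd', 'e', 'f'], index=[1, 3, 1, 2, 5, 4])

by_label = s_int.loc[1]            # label 1 appears twice -> Series of two values
by_position = s_int.iloc[1:-1:2]   # positions 1 and 3, right end excluded

t = pd.Series([10, 20, 30, 40])    # default integer index 0..3
label_slice = t.loc[1:2]           # label-based: includes both endpoints
position_slice = t.iloc[1:2]       # position-based: excludes the right endpoint

print(label_slice.tolist(), position_slice.tolist())
```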
Note: do not use pure floats or any mixed types as an index.
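The label-based loc indexer takes the form loc[*, *], where the first position selects rows and the second selects columns.
loc[*] selects rows only.
There are five kinds of valid objects for *: a single label, a list of labels, a label slice, a boolean list, and a function.
A Series can also be indexed with loc.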
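Since the examples that follow all use a DataFrame, here is a minimal sketch of the five kinds of loc argument applied to a Series. The Series is a small hand-made one; the names and weights are copied from the first rows of the table for illustration only.

```python
import pandas as pd

weights = pd.Series([46.0, 70.0, 89.0, 41.0],
                    index=['Gaopeng Yang', 'Changqiang You', 'Mei Sun', 'Xiaojuan Sun'],
                    name='Weight')

one = weights.loc['Mei Sun']                         # single label -> scalar
several = weights.loc[['Gaopeng Yang', 'Mei Sun']]   # list of labels
span = weights.loc['Changqiang You':'Mei Sun']       # label slice, both ends included
heavy = weights.loc[weights > 60]                    # boolean list
heavy_fn = weights.loc[lambda x: x > 60]             # function returning a boolean list

print(one, span.tolist())
```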
df_demo = df.set_index('Name')
df_demo.head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
df_demo.loc['Qiang Sun']
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Qiang Sun | Tsinghua University | Junior | Female | 53.0 | N |
Qiang Sun | Tsinghua University | Sophomore | Female | 40.0 | N |
Qiang Sun | Shanghai Jiao Tong University | Junior | Female | NaN | N |
Selecting rows and columns together
df_demo.loc['Qiang Sun','School']
Name
Qiang Sun Tsinghua University
Qiang Sun Tsinghua University
Qiang Sun Shanghai Jiao Tong University
Name: School, dtype: object
df_demo.loc['Quan Zhao','School']
'Shanghai Jiao Tong University'
df_demo.loc[['Qiang Sun','Quan Zhao'],['School','Gender']]
School | Gender | |
---|---|---|
Name | ||
Qiang Sun | Tsinghua University | Female |
Qiang Sun | Tsinghua University | Female |
Qiang Sun | Shanghai Jiao Tong University | Female |
Quan Zhao | Shanghai Jiao Tong University | Female |
When the start and end labels are each unique in the index, a slice can be used, and it includes both endpoints.
df_demo.loc['Gaojuan You':'Gaoqiang Qian','School':'Gender']
School | Grade | Gender | |
---|---|---|---|
Name | |||
Gaojuan You | Fudan University | Sophomore | Male |
Xiaoli Qian | Tsinghua University | Freshman | Female |
Qiang Chu | Shanghai Jiao Tong University | Freshman | Female |
Gaoqiang Qian | Tsinghua University | Junior | Female |
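Note: slices on an integer index passed to loc are also label-based and include both endpoints, and the start and end labels must not be duplicated.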
df_demo_copy = df_demo.copy()
df_demo_copy.index = range(df_demo.shape[0],0,-1) # descending order
df_demo_copy.head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
200 | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
199 | Peking University | Freshman | Male | 70.0 | N |
198 | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
197 | Fudan University | Sophomore | Female | 41.0 | N |
196 | Fudan University | Sophomore | Male | 74.0 | N |
df_demo_copy.loc[5:3]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
5 | Fudan University | Junior | Female | 46.0 | N |
4 | Tsinghua University | Senior | Female | 50.0 | N |
3 | Shanghai Jiao Tong University | Senior | Female | 45.0 | N |
df_demo_copy.loc[3:5]
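# Nothing is returned: the slice direction must match the order of the index, which is descending here. Either write the slice in the same order as the index or use a step of -1.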
School | Grade | Gender | Weight | Transfer |
---|
df_demo_copy.loc[3:5:-1]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
3 | Shanghai Jiao Tong University | Senior | Female | 45.0 | N |
4 | Tsinghua University | Senior | Female | 50.0 | N |
5 | Fudan University | Junior | Female | 46.0 | N |
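The direction rule can be checked on a tiny table with a descending integer index. This is a sketch with a hypothetical df_desc; sorting the index before slicing is an extra option shown for comparison.

```python
import pandas as pd

df_desc = pd.DataFrame({'v': list('abcde')}, index=[5, 4, 3, 2, 1])

same_order = df_desc.loc[5:3]        # follows the index order: labels 5, 4, 3
against_order = df_desc.loc[3:5]     # opposite to the index order: empty result

# Sorting the index ascending makes the ascending slice work.
after_sort = df_desc.sort_index().loc[3:5]

print(same_order.index.tolist(), len(against_order), after_sort.index.tolist())
```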
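Rows can be filtered by a condition: the boolean list passed to loc must have the same length as the DataFrame; positions that are True are selected and positions that are False are dropped.
# students weighing more than 70 kg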
df_demo.loc[df_demo.Weight>70].head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
Xiaopeng Zhou | Shanghai Jiao Tong University | Freshman | Male | 74.0 | N |
Xiaofeng Sun | Tsinghua University | Senior | Male | 71.0 | N |
Qiang Zheng | Shanghai Jiao Tong University | Senior | Male | 87.0 | N |
Using the boolean list returned by the isin method
df_demo.loc[df_demo.Grade.isin(['Freshman','Senior'])].head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman | Female | 46.0 | N |
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaoli Qian | Tsinghua University | Freshman | Female | 51.0 | N |
Qiang Chu | Shanghai Jiao Tong University | Freshman | Female | 52.0 | N |
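Compound conditions can be built by combining | (or), & (and), and ~ (not).
# Select seniors at Fudan University weighing more than 70, or male non-seniors at Peking University weighing more than 80 (the code below does not filter on Gender)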
x1 = df_demo.School == 'Fudan University'
x2 = df_demo.Grade == 'Senior'
x3 = df_demo.Weight>70
x = x1 & x2 & x3
y1 = df_demo.School == 'Peking University'
y2 = df_demo.Grade == 'Senior'
y3 = df_demo.Weight >80
y = y1 & (~y2) & y3
df_demo.loc[x | y]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Qiang Han | Peking University | Freshman | Male | 87.0 | N |
Chengpeng Zhou | Fudan University | Senior | Male | 81.0 | N |
Changpeng Zhao | Peking University | Freshman | Male | 83.0 | N |
Chengpeng Qian | Fudan University | Senior | Male | 73.0 | Y |
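select_dtypes is a handy function that selects the columns of the given types from a table. To select all numeric columns, just use .select_dtypes('number'). Exercise: implement the same functionality on the learn_pandas data by combining boolean-list selection with the dtypes attribute of the DataFrame.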
df_demo.select_dtypes('number').head()
Weight | |
---|---|
Name | |
Gaopeng Yang | 46.0 |
Changqiang You | 70.0 |
Mei Sun | 89.0 |
Xiaojuan Sun | 41.0 |
Gaojuan You | 74.0 |
df_demo[df_demo.columns[df_demo.dtypes == 'float64']].head()
Weight | |
---|---|
Name | |
Gaopeng Yang | 46.0 |
Changqiang You | 70.0 |
Mei Sun | 89.0 |
Xiaojuan Sun | 41.0 |
Gaojuan You | 74.0 |
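The function here must return one of the four valid forms above (a single label, a list of labels, a slice, or a boolean list), and its input is the DataFrame itself. The formal parameter of the function (z in the example below) is therefore df_demo.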
def condition(z):
x1 = z.School == 'Fudan University'
x2 = z.Grade == 'Senior'
x3 = z.Weight>70
x = x1 & x2 & x3
y1 = z.School == 'Peking University'
y2 = z.Grade == 'Senior'
y3 = z.Weight >80
y = y1 & (~y2) & y3
result = x | y
return result
df_demo.loc[condition]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Qiang Han | Peking University | Freshman | Male | 87.0 | N |
Chengpeng Zhou | Fudan University | Senior | Male | 81.0 | N |
Changpeng Zhao | Peking University | Freshman | Male | 83.0 | N |
Chengpeng Qian | Fudan University | Senior | Male | 73.0 | Y |
Lambda expressions are supported
df_demo.loc[lambda x :'Quan Zhao',lambda x:'Gender']
'Female'
slice(start, stop, step) builds a slice object, which lets a function return a slice
df_demo.loc[lambda x: slice('Gaojuan You','Gaoqiang Qian')]
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Gaojuan You | Fudan University | Sophomore | Male | 74.0 | N |
Xiaoli Qian | Tsinghua University | Freshman | Female | 51.0 | N |
Qiang Chu | Shanghai Jiao Tong University | Freshman | Female | 52.0 | N |
Gaoqiang Qian | Tsinghua University | Junior | Female | 50.0 | N |
df_chain = pd.DataFrame([[0,0],[1,0],[-1,0]],columns=list('AB'))
df_chain
A | B | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
2 | -1 | 0 |
df_chain.loc[df_chain.A!=0,'B'] = 1
# set B to 1 in the rows of df_chain where A is not 0
df_chain
A | B | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | -1 | 1 |
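iloc indexes by position.
Five kinds of valid objects: an integer, a list of integers, an integer slice, a boolean list, and a function.
The function must return one of: an integer, a list of integers, an integer slice, or a boolean list.
Slices exclude the end point.
df_demo.iloc[1,1] # second row, second column
'Freshman'
df_demo.iloc[[0,1],[0,1]] # first two rows, first two columns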
School | Grade | |
---|---|---|
Name | ||
Gaopeng Yang | Shanghai Jiao Tong University | Freshman |
Changqiang You | Peking University | Freshman |
df_demo.iloc[1:4,2:4] # rows at positions 1-3, columns at positions 2-3
Gender | Weight | |
---|---|---|
Name | ||
Changqiang You | Male | 70.0 |
Mei Sun | Male | 89.0 |
Xiaojuan Sun | Female | 41.0 |
A boolean Series cannot be passed to iloc; pass its values instead. For boolean filtering, prefer loc.
df_demo.iloc[(df_demo.Weight>80).values].head()
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Qiang Zheng | Shanghai Jiao Tong University | Senior | Male | 87.0 | N |
Qiang Han | Peking University | Freshman | Male | 87.0 | N |
Chengpeng Zhou | Fudan University | Senior | Male | 81.0 | N |
Feng Han | Shanghai Jiao Tong University | Sophomore | Male | 82.0 | N |
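A small sketch comparing the two boolean routes on a hand-made table (df_w is hypothetical). It uses to_numpy(), which returns the same underlying array as the values attribute used above.

```python
import pandas as pd

df_w = pd.DataFrame({'Weight': [46.0, 70.0, 89.0, 41.0]}, index=list('wxyz'))
mask = df_w['Weight'] > 60

by_iloc = df_w.iloc[mask.to_numpy()]  # iloc takes the raw boolean array
by_loc = df_w.loc[mask]               # loc accepts the boolean Series directly

print(by_iloc.equals(by_loc))
```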
df_demo.iloc[lambda x : slice(1,4)] # rows at positions 1 to 3, all columns
School | Grade | Gender | Weight | Transfer | |
---|---|---|---|---|---|
Name | |||||
Changqiang You | Peking University | Freshman | Male | 70.0 | N |
Mei Sun | Shanghai Jiao Tong University | Senior | Male | 89.0 | N |
Xiaojuan Sun | Fudan University | Sophomore | Female | 41.0 | N |
A Series can also use iloc to return the value or sub-series at the given positions
df_demo.School
Name
Gaopeng Yang Shanghai Jiao Tong University
Changqiang You Peking University
Mei Sun Shanghai Jiao Tong University
Xiaojuan Sun Fudan University
Gaojuan You Fudan University
...
Xiaojuan Sun Fudan University
Li Zhao Tsinghua University
Chengqiang Chu Shanghai Jiao Tong University
Chengmei Shen Shanghai Jiao Tong University
Chunpeng Lv Tsinghua University
Name: School, Length: 200, dtype: object
df_demo.School.iloc[1] # value at position 1 (the second row)
'Peking University'
df_demo.School.iloc[1:5:2] # end excluded; returns positions 1 and 3
Name
Changqiang You Peking University
Xiaojuan Sun Fudan University
Name: School, dtype: object
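pandas supports passing a query expression as a string to the query method; evaluating the expression must yield a boolean list.
df.query('((School == "Fudan University")& ' # string values inside the expression need double quotes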
'(Grade == "Senior")& '
'(Weight >70))|'
'((School == "Peking University")&'
'(Grade != "Senior")&'
'(Weight > 80))')
School | Grade | Name | Gender | Weight | Transfer | |
---|---|---|---|---|---|---|
38 | Peking University | Freshman | Qiang Han | Male | 87.0 | N |
66 | Fudan University | Senior | Chengpeng Zhou | Male | 81.0 | N |
99 | Peking University | Freshman | Changpeng Zhao | Male | 83.0 | N |
131 | Fudan University | Senior | Chengpeng Qian | Male | 73.0 | Y |
Inside query, column names can be referenced directly, and methods can be called on them just as in ordinary code
df.query('Weight>Weight.mean()').head()
School | Grade | Name | Gender | Weight | Transfer | |
---|---|---|---|---|---|---|
1 | Peking University | Freshman | Changqiang You | Male | 70.0 | N |
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 89.0 | N |
4 | Fudan University | Sophomore | Gaojuan You | Male | 74.0 | N |
10 | Shanghai Jiao Tong University | Freshman | Xiaopeng Zhou | Male | 74.0 | N |
14 | Tsinghua University | Senior | Xiaomei Zhou | Female | 57.0 | N |
Column names containing spaces must be referenced as `col name` (wrapped in backticks).
The English keywords or, and, in, and not in can be used literally.
df.query('(Grade not in ["Freshman","Sophomore"]) and'
'(Gender == "Male")').head()
School | Grade | Name | Gender | Weight | Transfer | |
---|---|---|---|---|---|---|
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 89.0 | N |
16 | Tsinghua University | Junior | Xiaoqiang Qin | Male | 68.0 | N |
17 | Tsinghua University | Junior | Peng Wang | Male | 65.0 | N |
18 | Tsinghua University | Senior | Xiaofeng Sun | Male | 71.0 | N |
21 | Shanghai Jiao Tong University | Senior | Xiaopeng Shen | Male | 62.0 | NaN |
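The backtick rule for column names with spaces has no example in these notes, so here is a minimal sketch. The table df_q and its column 'col name' are hypothetical.

```python
import pandas as pd

df_q = pd.DataFrame({
    'col name': [1, 2, 3],                       # made-up column with a space
    'Grade': ['Freshman', 'Senior', 'Junior'],
})

# Backticks quote the spaced column name; 'and' and 'in' are used literally.
res = df_q.query('`col name` > 1 and Grade in ["Senior", "Junior"]')

print(res.index.tolist())
```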
To reference an external variable in query, prefix its name with @.
Example: select students whose weight is between 70 kg and 80 kg
low, high = 70, 80
df.query('Weight >= @low and Weight <= @high').head()
# df.query('Weight.between(@low, @high)') did not work here, so the condition is written with >= and <= instead
School | Grade | Name | Gender | Weight | Transfer | |
---|---|---|---|---|---|---|
1 | Peking University | Freshman | Changqiang You | Male | 70.0 | N |
4 | Fudan University | Sophomore | Gaojuan You | Male | 74.0 | N |
10 | Shanghai Jiao Tong University | Freshman | Xiaopeng Zhou | Male | 74.0 | N |
18 | Tsinghua University | Senior | Xiaofeng Sun | Male | 71.0 | N |
35 | Peking University | Freshman | Gaoli Zhao | Male | 78.0 | N |
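Treating each row as a sample, each column as a feature, and the whole DataFrame as the population, the sample function can be used for random sampling.
Parameters of sample:
n : number of items to draw
axis : sampling direction (0 for rows, 1 for columns)
frac : fraction to draw (0.3 means 30% of the population)
replace : whether to sample with replacement; replace=True samples with replacement
weights : relative sampling probability of each item
Example: build df_sample and draw 3 rows with replacement, using the relative size of value as the sampling probability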
df_sample = pd.DataFrame({
'id' : list('abcde'),
'value' : [1, 2, 3, 4, 90]})
df_sample
id | value | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
2 | c | 3 |
3 | d | 4 |
4 | e | 90 |
df_sample.sample(3, replace=True, weights=df_sample.value)
id | value | |
---|---|---|
4 | e | 90 |
4 | e | 90 |
4 | e | 90 |
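The frac and axis parameters listed above are not exercised by the weighted example, so the sketch below covers them on the same small table (named df_s here). The random_state argument is an addition that is not in the parameter list above; it only makes the draw repeatable.

```python
import pandas as pd

df_s = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 90]})

# Draw 40% of the rows without replacement (2 of 5 rows).
rows = df_s.sample(frac=0.4, random_state=0)

# Draw one column instead of rows by sampling along axis=1.
cols = df_s.sample(n=1, axis=1, random_state=0)

print(len(rows), cols.shape)
```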
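names: the names of the index levels
values: the values of the index
get_level_values: retrieves the values of a single level (the index values cannot be modified in place)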
np.random.seed(0)
multi_index = pd.MultiIndex.from_product([list('ABCD'),
df.Gender.unique()],names=('School','Gender'))
multi_column = pd.MultiIndex.from_product([['Height','Weight'],
df.Grade.unique()],names=('Indicator','Grade'))
df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5+163).tolist(),
(np.random.randn(8,4)*5+65).tolist()],
index = multi_index,
columns = multi_column).round(1)
df_multi
Indicator | Height | Weight | |||||||
---|---|---|---|---|---|---|---|---|---|
Grade | Freshman | Senior | Sophomore | Junior | Freshman | Senior | Sophomore | Junior | |
School | Gender | ||||||||
A | Female | 171.8 | 165.0 | 167.9 | 174.2 | 60.6 | 55.1 | 63.3 | 65.8 |
Male | 172.3 | 158.1 | 167.8 | 162.2 | 71.2 | 71.0 | 63.1 | 63.5 | |
B | Female | 162.5 | 165.1 | 163.7 | 170.3 | 59.8 | 57.9 | 56.5 | 74.8 |
Male | 166.8 | 163.6 | 165.2 | 164.7 | 62.5 | 62.8 | 58.7 | 68.9 | |
C | Female | 170.5 | 162.0 | 164.6 | 158.7 | 56.9 | 63.9 | 60.5 | 66.9 |
Male | 150.2 | 166.3 | 167.3 | 159.3 | 62.4 | 59.1 | 64.9 | 67.1 | |
D | Female | 174.3 | 155.7 | 163.2 | 162.1 | 65.3 | 66.5 | 61.8 | 63.2 |
Male | 170.7 | 170.3 | 163.8 | 164.9 | 61.6 | 63.2 | 60.9 | 56.4 |
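Note: a value in an outer level is hidden after its first appearance
df_multi.index.names # names of the row index levels
df_multi.columns.names # names of the column index levels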
FrozenList(['Indicator', 'Grade'])
df_multi.index.values
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
dtype=object)
df_multi.columns.values
array([('Height', 'Freshman'), ('Height', 'Senior'),
('Height', 'Sophomore'), ('Height', 'Junior'),
('Weight', 'Freshman'), ('Weight', 'Senior'),
('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)
df_multi.index.get_level_values(0)
Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')
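The remark that index values cannot be modified can be checked directly. The sketch below builds a small hypothetical MultiIndex and shows that item assignment on the returned level values is rejected.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([list('AB'), ['Female', 'Male']],
                                names=('School', 'Gender'))
level0 = mi.get_level_values(0)

# Index objects are immutable, so in-place assignment raises TypeError.
try:
    level0[0] = 'Z'
    raised = False
except TypeError:
    raised = True

print(raised, level0.tolist())
```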
df_multi_t = df.set_index(['School','Grade'])
df_multi_t.head()
Name | Gender | Weight | Transfer | ||
---|---|---|---|---|---|
School | Grade | ||||
Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 46.0 | N |
Peking University | Freshman | Changqiang You | Male | 70.0 | N |
Shanghai Jiao Tong University | Senior | Mei Sun | Male | 89.0 | N |
Fudan University | Sophomore | Xiaojuan Sun | Female | 41.0 | N |
Sophomore | Gaojuan You | Male | 74.0 | N |
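In a MultiIndex each element is a tuple. With loc, simply replace the scalar label with the corresponding tuple (iloc stays purely positional); it is best to sort the MultiIndex before indexing.
df_multi_t = df_multi_t.sort_index() # sort the index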
df_multi_t.head()
Name | Gender | Weight | Transfer | ||
---|---|---|---|---|---|
School | Grade | ||||
Fudan University | Freshman | Changqiang Yang | Female | 49.0 | N |
Freshman | Gaoqiang Qin | Female | 63.0 | N | |
Freshman | Gaofeng Zhao | Female | 43.0 | N | |
Freshman | Yanquan Wang | Female | 55.0 | N | |
Freshman | Feng Wang | Male | 74.0 | N |
df_multi_t.loc[('Fudan University','Junior'),['Weight','Gender']].head()
# the column names still have to be passed as a list
Weight | Gender | ||
---|---|---|---|
School | Grade | ||
Fudan University | Junior | 48.0 | Female |
Junior | 72.0 | Male | |
Junior | 76.0 | Male | |
Junior | 49.0 | Female | |
Junior | 43.0 | Female |
df_multi_t.loc[[('Fudan University','Senior'),
('Shanghai Jiao Tong University','Freshman')]]
Name | Gender | Weight | Transfer | ||
---|---|---|---|---|---|
School | Grade | ||||
Fudan University | Senior | Chengpeng Zheng | Female | 38.0 | N |
Senior | Feng Zhou | Female | 47.0 | N | |
Senior | Gaomei Lv | Female | 34.0 | N | |
Senior | Chunli Lv | Female | 56.0 | N | |
Senior | Chengpeng Zhou | Male | 81.0 | N | |
Senior | Gaopeng Qin | Female | 52.0 | N | |
Senior | Chunjuan Xu | Female | 47.0 | N | |
Senior | Juan Zhang | Female | 47.0 | N | |
Senior | Chengpeng Qian | Male | 73.0 | Y | |
Senior | Xiaojuan Qian | Female | 50.0 | N | |
Senior | Quan Xu | Female | 44.0 | N | |
Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 46.0 | N |
Freshman | Qiang Chu | Female | 52.0 | N | |
Freshman | Xiaopeng Zhou | Male | 74.0 | N | |
Freshman | Yanpeng Lv | Male | 65.0 | N | |
Freshman | Xiaopeng Zhao | Female | 53.0 | N | |
Freshman | Chunli Zhao | Male | 83.0 | N | |
Freshman | Peng Zhang | Female | NaN | N | |
Freshman | Xiaoquan Sun | Female | 40.0 | N | |
Freshman | Chunmei Shi | Female | 52.0 | N | |
Freshman | Xiaomei Yang | Female | 49.0 | N | |
Freshman | Xiaofeng Qian | Female | 49.0 | N | |
Freshman | Changmei Lv | Male | 75.0 | N | |
Freshman | Qiang Feng | Male | 80.0 | N |
df_multi_t.loc[df_multi_t.Weight>70].head()
Name | Gender | Weight | Transfer | ||
---|---|---|---|---|---|
School | Grade | ||||
Fudan University | Freshman | Feng Wang | Male | 74.0 | N |
Junior | Chunqiang Chu | Male | 72.0 | N | |
Junior | Changfeng Lv | Male | 76.0 | N | |
Senior | Chengpeng Zhou | Male | 81.0 | N | |
Senior | Chengpeng Qian | Male | 73.0 | Y |
df_multi_t.loc[lambda x:('Fudan University','Junior')].head()
Name | Gender | Weight | Transfer | ||
---|---|---|---|---|---|
School | Grade | ||||
Fudan University | Junior | Yanli You | Female | 48.0 | N |
Junior | Chunqiang Chu | Male | 72.0 | N | |
Junior | Changfeng Lv | Male | 76.0 | N | |
Junior | Yanjuan Lv | Female | 49.0 | NaN | |
Junior | Gaoqiang Zhou | Female | 43.0 | N |
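As with a single-level index, slicing cannot be used when there are duplicate elements. Exercise: remove the duplicate index entries and give an example of an element slice.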
df_multi_uni = df.drop_duplicates(['School','Grade']).set_index(['School', 'Grade'])
df_multi_uni = df_multi_uni.sort_index()
df_multi_uni.loc[('Shanghai Jiao Tong University','Freshman'):('Tsinghua University','Junior')]
Name | Gender | Weight | Transfer | ||
---|---|---|---|---|---|
School | Grade | ||||
Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 46.0 | N |
Junior | Feng Zheng | Female | 51.0 | N | |
Senior | Mei Sun | Male | 89.0 | N | |
Sophomore | Yanfeng Qian | Female | 48.0 | N | |
Tsinghua University | Freshman | Xiaoli Qian | Female | 51.0 | N |
Junior | Gaoqiang Qian | Female | 50.0 | N |
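The columns must be specified in loc (use : to select all). The elements to select at each level are put in lists, and the form passed to loc is [(level_0_list, level_1_list), cols].
Example: select all sophomores and juniors at Peking University and Fudan University
# using the earlier list-of-tuples approach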
res = df_multi_t.loc[[('Peking University','Sophomore'),
('Peking University','Junior'),
('Fudan University','Sophomore'),
('Fudan University','Junior')]]
print(res.head())
print(res.shape)
Name Gender Weight Transfer
School Grade
Peking University Sophomore Changmei Xu Female 43.0 N
Sophomore Xiaopeng Qin Male NaN N
Sophomore Mei Xu Female 39.0 N
Sophomore Xiaoli Zhou Female 55.0 N
Sophomore Peng Han Female 34.0 NaN
(33, 4)
# using the cross-product form
res_t = df_multi_t.loc[(['Peking University','Fudan University'],
['Sophomore','Junior']),:]
print(res_t.head())
print(res_t.shape)
Name Gender Weight Transfer
School Grade
Peking University Sophomore Changmei Xu Female 43.0 N
Sophomore Xiaopeng Qin Male NaN N
Sophomore Mei Xu Female 39.0 N
Sophomore Xiaoli Zhou Female 55.0 N
Sophomore Peng Han Female 34.0 NaN
(33, 4)
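Each level can be sliced, and slices can be mixed with boolean lists. There are two forms: loc[idx[*, *]] and loc[idx[*, *], idx[*, *]]
# build a DataFrame whose index has no duplicates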
np.random.seed(0)
L1, L2 = ['A','B','C'],['a','b','c']
mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper','Lower'))
L3, L4 = ['D','E','F'],['d','e','f']
mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big','Small'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
index = mul_index1,
columns = mul_index2)
df_ex
Big | D | E | F | |||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | d | e | f | d | e | f | d | e | f | |
Upper | Lower | |||||||||
A | a | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 | -5 |
b | -3 | 3 | -8 | -3 | -2 | 5 | 8 | -4 | 4 | |
c | -1 | 0 | 7 | -4 | 6 | 6 | -9 | 9 | -6 | |
B | a | 8 | 5 | -2 | -9 | -8 | 0 | -9 | 1 | -6 |
b | 2 | 9 | -7 | -9 | -9 | -5 | -4 | -3 | -1 | |
c | 8 | 6 | -5 | 0 | 1 | -8 | -8 | -2 | 0 | |
C | a | -6 | -3 | 2 | 5 | 9 | -9 | 5 | -6 | 3 |
b | 1 | 2 | -5 | -3 | -5 | 6 | -6 | 3 | -5 | |
c | -1 | 5 | 6 | -6 | 6 | 4 | 7 | 8 | -4 |
Define the IndexSlice object
idx = pd.IndexSlice
In this form the levels cannot be sliced separately; [*, *] means [rows, columns], the same as loc
df_ex.loc[idx['C':,('D','f'):]]
Big | D | E | F | |||||
---|---|---|---|---|---|---|---|---|
Small | f | d | e | f | d | e | f | |
Upper | Lower | |||||||
C | a | 2 | 5 | 9 | -9 | 5 | -6 | 3 |
b | -5 | -3 | -5 | 6 | -6 | 3 | -5 | |
c | 6 | -6 | 6 | 4 | 7 | 8 | -4 |
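Boolean sequence indexing
The function in the column position receives the whole df_ex, so each column sum is taken over all nine rows, not only over the rows selected by :'A'.
df_ex.loc[idx[:'A',lambda x:x.sum()>0]]
# keeps the columns whose sum over all rows is greater than 0; the first column (D, d) sums to 11 over the whole table, although it sums to -1 within the A rows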
Big | D | F | ||
---|---|---|---|---|
Small | d | e | e | |
Upper | Lower | |||
A | a | 3 | 6 | 9 |
b | -3 | 3 | -4 | |
c | -1 | 0 | 9 |
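The behaviour of the column function can be reproduced on a two-row table. This is a sketch with a hypothetical df_small: the callable is evaluated on the full table, so a column can be kept even when its values in the selected rows are negative.

```python
import pandas as pd

idx = pd.IndexSlice
df_small = pd.DataFrame({'x': [1, -5], 'y': [-1, 5]}, index=['A', 'B'])

# The callable receives all of df_small: the column sums are x = -4 and y = 4,
# so only 'y' is kept, even though 'y' is -1 in the selected row 'A'.
picked = df_small.loc[idx[:'A', lambda d: d.sum() > 0]]

print(picked)
```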
This form slices each level separately: the first idx refers to the row index and the second idx to the column index
df_ex.loc[idx[:'A','b':],idx['E':,'e':]]
Big | E | F | |||
---|---|---|---|---|---|
Small | e | f | e | f | |
Upper | Lower | ||||
A | b | -2 | 5 | -4 | 4 |
c | 6 | 6 | 9 | -6 |
set_index
Constructors on pd.MultiIndex:
from_tuples
from_arrays
from_product
from_tuples constructs the index from a list of tuples
my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
pd.MultiIndex.from_tuples(my_tuple , names=['first','second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['first', 'second'])
from_arrays constructs the index from a list that holds one list per level
my_array = [list('aabb'),['cat','dog']*2]
pd.MultiIndex.from_arrays(my_array,names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
from_product constructs the index from the Cartesian product of the given lists
my_list1 = ['a','b']
my_list2 = ['cat','dog']
pd.MultiIndex.from_product([my_list1,
my_list2],
names=['First','Second'])
MultiIndex([('a', 'cat'),
('a', 'dog'),
('b', 'cat'),
('b', 'dog')],
names=['First', 'Second'])
np.random.seed(0)
L1,L2,L3=['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
names=('Upper','Lower','Extra'))
L4,L5,L6=['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
names=('Big','Small','Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
index=mul_index1,
columns=mul_index2)
df_ex
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
beta | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 | ||
b | alpha | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 | |
beta | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
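Levels are rearranged with swaplevel and reorder_levels: the former swaps two levels, the latter reorders any number of levels. Both accept an axis argument to choose between the row index and the column index.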
df_ex.swaplevel(0,2,axis=1).head()
# axis=1 selects the column index; 0 and 2 are its first and third levels, so this swaps the first and third levels of the column index
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | c | d | d | c | c | d | d | ||
Big | C | C | C | C | D | D | D | D | ||
Upper | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
df_ex.reorder_levels([2,0,1],axis=0).head()
# [2,0,1] applied to the row index: the new first level is the old third, the new second level is the old first, and the new third level is the old second
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Extra | Upper | Lower | ||||||||
alpha | A | a | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | A | a | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
alpha | A | b | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
beta | A | b | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
alpha | B | a | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
droplevel
df_ex.droplevel(1,axis=1) # drop the second level of the column index
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
beta | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 | ||
b | alpha | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 | |
beta | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
df_ex.droplevel([0,1],axis=0) # drop the first and second levels of the row index
Big | C | D | ||||||
---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||
Other | cat | dog | cat | dog | cat | dog | cat | dog |
Extra | ||||||||
alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
beta | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 |
alpha | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 |
beta | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
Use rename_axis to change the names of index levels; the usual form is to pass a dict mapping old names to new names
df_ex.rename_axis(index={
'Upper':'Changed_row'},
columns={
'Other':'Change_Col'}).head()
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Change_Col | cat | dog | cat | dog | cat | dog | cat | dog | ||
Changed_row | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
rename changes the index values; for a MultiIndex, specify the level to modify with the level argument
df_ex.rename(columns={
'cat':'not_cat'},
level=2).head()
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | not_cat | dog | not_cat | dog | not_cat | dog | not_cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
When a function is passed, its input is each index element
df_ex.rename(index=lambda x:str.upper(x),
level=2).head()
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | ALPHA | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
BETA | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | ALPHA | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
BETA | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | ALPHA | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
Exercise: use a function in rename_axis to achieve the same result as the dict example above
df_ex.rename_axis(index=lambda x:'Changed_Row' if x=='Upper' else x,
columns=lambda x:'Changed_Col' if x=='Other' else x ).head()
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Changed_Col | cat | dog | cat | dog | cat | dog | cat | dog | ||
Changed_Row | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
new_values = iter(list('abcdefgh'))
df_ex.rename(index=lambda x:next(new_values),
level=2)
# the iterator must supply as many elements as there are values to replace
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | a | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
b | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | c | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
d | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | e | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
f | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 | ||
b | g | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 | |
h | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
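Question: how many values are needed when modifying the first index level? Eight: the function is called once per row, and testing with only two values raises an error.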
new_values = iter(list('AB'*4))
df_ex.rename(index=lambda x:next(new_values),
level=0)
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
B | a | beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
A | b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
B | b | beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
A | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
B | a | beta | -9 | -5 | -4 | -3 | -1 | 8 | 6 | -5 |
A | b | alpha | 0 | 1 | -8 | -8 | -2 | 0 | -6 | -3 |
B | b | beta | 2 | 5 | 9 | -9 | 5 | -6 | 3 | 1 |
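For a single-level index, the same kind of replacement can be done by taking the values attribute of the index, building a new list, and assigning it back as the index.
For a MultiIndex, temporarily turn the level into a column, modify it, and then set it as an index level again.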
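Neither route is shown in the notes, so here is a minimal sketch of both under stated assumptions: the tables df_single and df_two and the replacement labels are hypothetical.

```python
import pandas as pd

# Single-level index: build a new list from index.values and assign it back.
df_single = pd.DataFrame({'v': [1, 2, 3]}, index=['x', 'y', 'z'])
df_single.index = [label.upper() for label in df_single.index.values]

# MultiIndex: move one level into the columns, edit it, then restore it.
mi = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=['Upper', 'Lower'])
df_two = pd.DataFrame({'v': [1, 2, 3, 4]}, index=mi)

tmp = df_two.reset_index('Lower')          # 'Lower' becomes an ordinary column
tmp['Lower'] = ['w', 'x', 'y', 'z']        # one new value per row
restored = tmp.set_index('Lower', append=True)

print(df_single.index.tolist())
print(restored.index.tolist())
```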
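2. map
map is a method defined on Index and is used much like a function passed to rename. The difference is that it receives the whole index tuple rather than the scalar value of one level, so it can modify values across levels.
# convert strings to upper case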
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x:(x[0],
x[1],
str.upper(x[2])))
df_temp.index = new_idx
df_temp.head()
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
Upper | Lower | Extra | ||||||||
A | a | ALPHA | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
BETA | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | ALPHA | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
BETA | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | ALPHA | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
Another use of map: compressing a MultiIndex into a single level
df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x :(x[0]+'-'+
x[1]+'-'+
x[2]))
df_temp.index = new_idx
df_temp.head() # now a single-level index
Big | C | D | ||||||
---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||
Other | cat | dog | cat | dog | cat | dog | cat | dog |
A-a-alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
A-a-beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 |
A-b-alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 |
A-b-beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 |
B-a-alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
Expanding back into separate levels
new_idx = df_temp.index.map(lambda x :tuple(x.split('-')))
# split the string on '-' and convert the resulting list to a tuple
df_temp.index = new_idx
df_temp.head() # three-level index
Big | C | D | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Small | c | d | c | d | ||||||
Other | cat | dog | cat | dog | cat | dog | cat | dog | ||
A | a | alpha | 3 | 6 | -9 | -6 | -6 | -2 | 0 | 9 |
beta | -5 | -3 | 3 | -8 | -3 | -2 | 5 | 8 | ||
b | alpha | -4 | 4 | -1 | 0 | 7 | -4 | 6 | 6 | |
beta | -9 | 9 | -6 | 8 | 5 | -2 | -9 | -8 | ||
B | a | alpha | 0 | -9 | 1 | -6 | 2 | 9 | -7 | -9 |
df_new = pd.DataFrame({
'A':list('aacd'),
'B':list('PQRT'),
'C':[1,2,3,4]})
df_new
A | B | C | |
---|---|---|---|
0 | a | P | 1 |
1 | a | Q | 2 |
2 | c | R | 3 |
3 | d | T | 4 |
Done with set_index; the append parameter controls whether the original index is kept
df_new.set_index('A')
B | C | |
---|---|---|
A | ||
a | P | 1 |
a | Q | 2 |
c | R | 3 |
d | T | 4 |
df_new.set_index('A',append=True)
B | C | ||
---|---|---|---|
A | |||
0 | a | P | 1 |
1 | a | Q | 2 |
2 | c | R | 3 |
3 | d | T | 4 |
Setting several columns as the index
df_new.set_index(['A','B'])
C | ||
---|---|---|
A | B | |
a | P | 1 |
Q | 2 | |
c | R | 3 |
d | T | 4 |
To add a new column as an index level, pass a Series in the argument
df_new = pd.DataFrame({
'A':list('aacd'),
'B':list('PQRT'),
'C':[1,2,3,4]})
my_index = pd.Series(list('WXYZ'),name='D')
df_new = df_new.set_index(['A',my_index])
df_new
B | C | ||
---|---|---|---|
A | D | ||
a | W | P | 1 |
X | Q | 2 | |
c | Y | R | 3 |
d | Z | T | 4 |
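reset_index is the inverse of set_index. Its main parameter is drop, which controls whether the removed index level is discarded (not added back as a column)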
df_new.reset_index(['D'])
D | B | C | |
---|---|---|---|
A | |||
a | W | P | 1 |
a | X | Q | 2 |
c | Y | R | 3 |
d | Z | T | 4 |
df_new.reset_index(['A'],drop=True)
B | C | |
---|---|---|
D | ||
W | P | 1 |
X | Q | 2 |
Y | R | 3 |
Z | T | 4 |
Resetting the whole index
df_new.reset_index()
A | D | B | C | |
---|---|---|---|---|
0 | a | W | P | 1 |
1 | a | X | Q | 2 |
2 | c | Y | R | 3 |
3 | d | Z | T | 4 |
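Reindexing extends or trims an index: a new index is specified, and the elements of the original table are filled in wherever their labels appear in the new index.
Use reindex
Example: in the new table, add one employee, drop the height column, and add a gender column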
df_reindex = pd.DataFrame({
'Weight':[60,70,80],
'Height':[176,180,179]},
index = ['1001','1003','1002'])
df_reindex
Weight | Height | |
---|---|---|
1001 | 60 | 176 |
1003 | 70 | 180 |
1002 | 80 | 179 |
df_reindex.reindex(index=['1001','1003','1002','1004'],
columns=['Weight','Gender'])
Weight | Gender | |
---|---|---|
1001 | 60.0 | NaN |
1003 | 70.0 | NaN |
1002 | 80.0 | NaN |
1004 | NaN | NaN |
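The output above shows Weight as 60.0 rather than 60: introducing NaN for the missing label turns the integer column into floats. The sketch below reproduces this and shows one possible way around it, the `fill_value` parameter of reindex; using 0 as the placeholder is an assumption made only for the example.

```python
import pandas as pd

df_r = pd.DataFrame({'Weight': [60, 70, 80],
                     'Height': [176, 180, 179]},
                    index=['1001', '1003', '1002'])
new_rows = ['1001', '1003', '1002', '1004']

# Default behaviour: missing labels become NaN, so Weight is upcast to float
with_nan = df_r.reindex(index=new_rows, columns=['Weight', 'Gender'])

# Supplying a fill value avoids NaN, so the integer dtype is kept
filled = df_r.reindex(index=new_rows, columns=['Weight'], fill_value=0)

print(with_nan['Weight'].dtype, filled['Weight'].dtype)
```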
reindex_like
Reshapes the index of the calling table to match the index of the table passed in
df_reindex
Weight | Height | |
---|---|---|
1001 | 60 | 176 |
1003 | 70 | 180 |
1002 | 80 | 179 |
df_existed = pd.DataFrame(index=['1001','1002','1003','1004'],
columns=['Weight','Gender'])
df_reindex.reindex_like(df_existed)
Weight | Gender | |
---|---|---|
1001 | 60.0 | NaN |
1002 | 80.0 | NaN |
1003 | 70.0 | NaN |
1004 | NaN | NaN |
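Intersection of A and B: elements in both A and B (A.intersection(B), equivalent to A & B)
Union of A and B: elements in A or in B (A.union(B), equivalent to A | B)
Difference A minus B: elements in A but not in B (A.difference(B), equivalent to (A ^ B) & A)
Symmetric difference of A and B: the union of A and B minus their intersection (A.symmetric_difference(B), equivalent to A ^ B)
Because the elements of a set are distinct, deduplicate with unique before applying these operations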
df_set_1 = pd.DataFrame([[0,1],[1,2],[3,4]],
index = pd.Index(['a','b','a'],name='id1'))
df_set_2 = pd.DataFrame([[4,5],[2,6],[7,1]],
index = pd.Index(['b','b','c'],name='id2'))
id1,id2 = df_set_1.index.unique(),df_set_2.index.unique()
id1.intersection(id2)
Index(['b'], dtype='object')
id1.union(id2)
Index(['a', 'b', 'c'], dtype='object')
id1.difference(id2)
Index(['a'], dtype='object')
id1.symmetric_difference(id2)
Index(['a', 'c'], dtype='object')
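The four identities listed above can be verified on the same labels. In this sketch the operator forms are evaluated on plain Python sets, since the behaviour of `&`, `|` and `^` on Index objects has changed across pandas versions; the Index methods themselves are used as in the text.

```python
import pandas as pd

id1 = pd.Index(['a', 'b', 'a']).unique()
id2 = pd.Index(['b', 'b', 'c']).unique()
A, B = set(id1), set(id2)

# Compare each Index method with the corresponding set-operator expression
checks = {
    'intersection': set(id1.intersection(id2)) == (A & B),
    'union': set(id1.union(id2)) == (A | B),
    'difference': set(id1.difference(id2)) == ((A ^ B) & A),
    'symmetric_difference': set(id1.symmetric_difference(id2)) == (A ^ B),
}
print(checks)
```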
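If the columns that need the set operation have not been set as the index, there are two options. The first is to convert them to an index with set_index, run the operation, and then restore them. The second is to use isin
Example: select the rows whose id lies in the intersection of the two id columns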
df_set_in_col1 = df_set_1.reset_index()
df_set_in_col1
id1 | 0 | 1 | |
---|---|---|---|
0 | a | 0 | 1 |
1 | b | 1 | 2 |
2 | a | 3 | 4 |
df_set_in_col2 = df_set_2.reset_index()
df_set_in_col2
id2 | 0 | 1 | |
---|---|---|---|
0 | b | 4 | 5 |
1 | b | 2 | 6 |
2 | c | 7 | 1 |
df_set_in_col1[df_set_in_col1.id1.isin(df_set_in_col2.id2)]
id1 | 0 | 1 | |
---|---|---|---|
1 | b | 1 | 2 |
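Only the isin route is shown above. The first route (set the column as the index, intersect, then restore) might look like the sketch below; the frames `left` and `right` are rebuilt here so that the block runs on its own, and the explicit rename_axis call is a precaution so that the restored column is named id1.

```python
import pandas as pd

left = pd.DataFrame({'id1': ['a', 'b', 'a'], 0: [0, 1, 3], 1: [1, 2, 4]})
right = pd.DataFrame({'id2': ['b', 'b', 'c'], 0: [4, 2, 7], 1: [5, 6, 1]})

# Route 1: work on deduplicated indexes, then select by label and restore
common = pd.Index(left['id1']).unique().intersection(
    pd.Index(right['id2']).unique())
via_index = (left.set_index('id1')
                 .loc[list(common)]
                 .rename_axis('id1')
                 .reset_index())

# Route 2: boolean mask built with isin
via_isin = left[left['id1'].isin(right['id2'])]

print(via_index)
print(via_isin)
```

The two routes return the same rows here, but route 1 orders rows by the labels in `common` and discards the original integer index, while isin keeps both.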
df = pd.read_csv('data/Company.csv')
df.head()
EmployeeID | birthdate_key | age | city_name | department | job_title | gender | |
---|---|---|---|---|---|---|---|
0 | 1318 | 1/3/1954 | 61 | Vancouver | Executive | CEO | M |
1 | 1319 | 1/3/1957 | 58 | Vancouver | Executive | VP Stores | F |
2 | 1320 | 1/2/1955 | 60 | Vancouver | Executive | Legal Counsel | F |
3 | 1321 | 1/2/1959 | 56 | Vancouver | Executive | VP Human Resources | M |
4 | 1322 | 1/9/1958 | 57 | Vancouver | Executive | VP Finance | M |
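1. Using query and loc separately, select the men aged 40 or under who work in the Dairy or Bakery department
Using query
Approach: write the conditions out one by one. Replacing == with `is in` raises an error; testing showed that the correct keyword is `in`, without `is`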
df.query('(age <= 40) & '
'(department in ["Dairy", "Bakery"]) & '
'(gender == "M")'
).shape
(441, 7)
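Using loc
Approach: loc needs a boolean Series that marks the rows meeting the conditions. Comparing shape at the end confirms that both queries return the same result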
x1 = df.age<=40
x2 = df.department.isin(["Dairy","Bakery"])
x3 = df.gender == 'M'
x = x1 & x2 & x3
df.loc[x].shape
(441, 7)
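Since data/Company.csv is not available here, the equivalence of the two approaches can be checked on a small made-up table. The `staff` frame below is hypothetical and only mirrors the three columns used in the conditions.

```python
import pandas as pd

# Hypothetical stand-in for the three columns used in the filter
staff = pd.DataFrame({
    'age': [35, 45, 28, 40, 39],
    'department': ['Dairy', 'Bakery', 'Meats', 'Bakery', 'Dairy'],
    'gender': ['M', 'M', 'M', 'F', 'M'],
})

# query: conditions written as one expression string
by_query = staff.query(
    'age <= 40 and department in ["Dairy", "Bakery"] and gender == "M"')

# loc: conditions combined into one boolean mask
mask = ((staff.age <= 40)
        & staff.department.isin(['Dairy', 'Bakery'])
        & (staff.gender == 'M'))
by_loc = staff.loc[mask]

print(by_query.equals(by_loc), list(by_query.index))
```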
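2. Select columns 1, 3, and the second-to-last for the rows whose employee ID is odd
Approach: an ID is odd when it is not divisible by 2. This is a condition on column values, so use loc. Columns 1, 3, and second-to-last are positions, so use iloc, keeping in mind that positions start at 0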
(df.loc[df.EmployeeID%2!=0]).iloc[:,[0,2,-2]].head()
EmployeeID | age | job_title | |
---|---|---|---|
1 | 1319 | 58 | VP Stores |
3 | 1321 | 56 | VP Human Resources |
5 | 1323 | 53 | Exec Assistant, VP Stores |
6 | 1325 | 51 | Exec Assistant, Legal Counsel |
8 | 1329 | 48 | Store Manager |
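The chained loc and iloc call can also be written as a single loc call by translating the positions into column labels first. This is an alternative sketch, not the method used above, and the `emp` frame is a hypothetical excerpt built from a few of the rows shown earlier.

```python
import pandas as pd

# Hypothetical excerpt with a subset of the original columns
emp = pd.DataFrame({
    'EmployeeID': [1318, 1319, 1320, 1321],
    'birthdate_key': ['1/3/1954', '1/3/1957', '1/2/1955', '1/2/1959'],
    'age': [61, 58, 60, 56],
    'job_title': ['CEO', 'VP Stores', 'Legal Counsel', 'VP Human Resources'],
    'gender': ['M', 'F', 'F', 'M'],
})

odd = emp['EmployeeID'] % 2 == 1

# Two steps: filter rows by value, then pick columns by position
chained = emp.loc[odd].iloc[:, [0, 2, -2]]

# One step: convert positions to labels and hand both selectors to loc
single = emp.loc[odd, emp.columns[[0, 2, -2]]]

print(chained.equals(single))
```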
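3. Carry out the index operations step by step
3.1 Set the last three columns before gender as the index, then swap the inner and outer levels
Approach: take the three columns from columns, set them as the index with set_index, and, because only two levels change places, use swaplevel to swap those two row-index levels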
df.columns[-4:-1].values
array(['city_name', 'department', 'job_title'], dtype=object)
df_demo = df.set_index(list(df.columns[-4:-1].values))
df_demo.head()
EmployeeID | birthdate_key | age | gender | |||
---|---|---|---|---|---|---|
city_name | department | job_title | ||||
Vancouver | Executive | CEO | 1318 | 1/3/1954 | 61 | M |
VP Stores | 1319 | 1/3/1957 | 58 | F | ||
Legal Counsel | 1320 | 1/2/1955 | 60 | F | ||
VP Human Resources | 1321 | 1/2/1959 | 56 | M | ||
VP Finance | 1322 | 1/9/1958 | 57 | M |
df_demo.swaplevel(0,2,axis=0).head()
EmployeeID | birthdate_key | age | gender | |||
---|---|---|---|---|---|---|
job_title | department | city_name | ||||
CEO | Executive | Vancouver | 1318 | 1/3/1954 | 61 | M |
VP Stores | Executive | Vancouver | 1319 | 1/3/1957 | 58 | F |
Legal Counsel | Executive | Vancouver | 1320 | 1/2/1955 | 60 | F |
VP Human Resources | Executive | Vancouver | 1321 | 1/2/1959 | 56 | M |
VP Finance | Executive | Vancouver | 1322 | 1/9/1958 | 57 | M |
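swaplevel handles exactly two levels. When more levels have to move, reorder_levels accepts the complete target order. The sketch below uses a hypothetical two-row frame to show that both calls give the same index for this particular swap.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Vancouver', 'Executive', 'CEO'),
     ('Vancouver', 'Executive', 'VP Stores')],
    names=['city_name', 'department', 'job_title'])
small = pd.DataFrame({'age': [61, 58]}, index=idx)

# Swap the outermost and innermost row-index levels
swapped = small.swaplevel(0, 2, axis=0)

# Same result, stated as a full ordering of the levels
reordered = small.reorder_levels(
    ['job_title', 'department', 'city_name'], axis=0)

print(list(swapped.index.names))
```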
3.2 Restore the middle level
Restoring a level means calling reset_index. df_demo.index.names[1] gives the name of index level 1, the middle level
df_demo1 = df_demo.reset_index(df_demo.index.names[1])
df_demo1.head()
department | EmployeeID | birthdate_key | age | gender | ||
---|---|---|---|---|---|---|
city_name | job_title | |||||
Vancouver | CEO | Executive | 1318 | 1/3/1954 | 61 | M |
VP Stores | Executive | 1319 | 1/3/1957 | 58 | F | |
Legal Counsel | Executive | 1320 | 1/2/1955 | 60 | F | |
VP Human Resources | Executive | 1321 | 1/2/1959 | 56 | M | |
VP Finance | Executive | 1322 | 1/9/1958 | 57 | M |
3.3 Rename the outer index level to Gender
Use rename_axis directly and pass the change through its index argument
df_demo1.rename_axis(index={
'city_name':'Gender'}).head()
department | EmployeeID | birthdate_key | age | gender | ||
---|---|---|---|---|---|---|
Gender | job_title | |||||
Vancouver | CEO | Executive | 1318 | 1/3/1954 | 61 | M |
VP Stores | Executive | 1319 | 1/3/1957 | 58 | F | |
Legal Counsel | Executive | 1320 | 1/2/1955 | 60 | F | |
VP Human Resources | Executive | 1321 | 1/2/1959 | 56 | M | |
VP Finance | Executive | 1322 | 1/9/1958 | 57 | M |
3.4 Merge the two row-index levels with an underscore
Use map on the index to merge the levels
df_demo1.head()
department | EmployeeID | birthdate_key | age | gender | ||
---|---|---|---|---|---|---|
city_name | job_title | |||||
Vancouver | CEO | Executive | 1318 | 1/3/1954 | 61 | M |
VP Stores | Executive | 1319 | 1/3/1957 | 58 | F | |
Legal Counsel | Executive | 1320 | 1/2/1955 | 60 | F | |
VP Human Resources | Executive | 1321 | 1/2/1959 | 56 | M | |
VP Finance | Executive | 1322 | 1/9/1958 | 57 | M |
df_temp = df_demo1.copy()
new_idx = df_temp.index.map(lambda x :(x[0]+'_'+
x[1]))
df_temp.index = new_idx
df_temp.head()
department | EmployeeID | birthdate_key | age | gender | |
---|---|---|---|---|---|
Vancouver_CEO | Executive | 1318 | 1/3/1954 | 61 | M |
Vancouver_VP Stores | Executive | 1319 | 1/3/1957 | 58 | F |
Vancouver_Legal Counsel | Executive | 1320 | 1/2/1955 | 60 | F |
Vancouver_VP Human Resources | Executive | 1321 | 1/2/1959 | 56 | M |
Vancouver_VP Finance | Executive | 1322 | 1/9/1958 | 57 | M |
3.5 Split the row index back into its original state
Apply map in the opposite direction, together with split
new_idx = df_temp.index.map(lambda x :tuple(x.split('_')))
df_temp.index = new_idx
df_temp.head()
department | EmployeeID | birthdate_key | age | gender | ||
---|---|---|---|---|---|---|
Vancouver | CEO | Executive | 1318 | 1/3/1954 | 61 | M |
VP Stores | Executive | 1319 | 1/3/1957 | 58 | F | |
Legal Counsel | Executive | 1320 | 1/2/1955 | 60 | F | |
VP Human Resources | Executive | 1321 | 1/2/1959 | 56 | M | |
VP Finance | Executive | 1322 | 1/9/1958 | 57 | M |
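Splitting on '_' only restores the original levels if no label contains an underscore itself. The sketch below illustrates the problem with a hypothetical city name (New_Westminster is invented for the example) and shows one possible workaround, rsplit with a maximum of one split, which assumes that the last level never contains the separator.

```python
import pandas as pd

# Hypothetical index; the second city name contains the separator
idx = pd.MultiIndex.from_tuples(
    [('Vancouver', 'CEO'), ('New_Westminster', 'Baker')],
    names=['city_name', 'job_title'])

merged = idx.map(lambda x: x[0] + '_' + x[1])

# A plain split yields a different number of pieces per label
parts = [len(label.split('_')) for label in merged]

# Splitting once from the right recovers both levels in this case
restored = pd.MultiIndex.from_tuples(
    [tuple(label.rsplit('_', 1)) for label in merged],
    names=idx.names)

print(parts, restored.equals(idx))
```

Choosing a separator that cannot occur in the data, or keeping a copy of the original MultiIndex, avoids the ambiguity altogether.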
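3.6 Change the index names back to those of the original table
The previous steps worked on df_temp, a copy of df_demo1, so take the index names from df_demo1 and assign them to df_temp.index.names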
new_name = df_demo1.index.names
new_name
FrozenList(['city_name', 'job_title'])
df_temp.index.names = new_name
df_temp.head(1)
department | EmployeeID | birthdate_key | age | gender | ||
---|---|---|---|---|---|---|
city_name | job_title | |||||
Vancouver | CEO | Executive | 1318 | 1/3/1954 | 61 | M |
3.7 Restore the default index and keep the columns in their original relative positions
Reset the index first, then use loc to select the columns in the original order and assign the result back
df_new = df_temp.reset_index()
df_new.head()
city_name | job_title | department | EmployeeID | birthdate_key | age | gender | |
---|---|---|---|---|---|---|---|
0 | Vancouver | CEO | Executive | 1318 | 1/3/1954 | 61 | M |
1 | Vancouver | VP Stores | Executive | 1319 | 1/3/1957 | 58 | F |
2 | Vancouver | Legal Counsel | Executive | 1320 | 1/2/1955 | 60 | F |
3 | Vancouver | VP Human Resources | Executive | 1321 | 1/2/1959 | 56 | M |
4 | Vancouver | VP Finance | Executive | 1322 | 1/9/1958 | 57 | M |
cols = list(df.columns)
df_new = df_new.loc[:,cols]
df_new.head()
EmployeeID | birthdate_key | age | city_name | department | job_title | gender | |
---|---|---|---|---|---|---|---|
0 | 1318 | 1/3/1954 | 61 | Vancouver | Executive | CEO | M |
1 | 1319 | 1/3/1957 | 58 | Vancouver | Executive | VP Stores | F |
2 | 1320 | 1/2/1955 | 60 | Vancouver | Executive | Legal Counsel | F |
3 | 1321 | 1/2/1959 | 56 | Vancouver | Executive | VP Human Resources | M |
4 | 1322 | 1/9/1958 | 57 | Vancouver | Executive | VP Finance | M |
df = pd.read_csv('data/chocolate.csv')
df.head(3)
Company | Review\nDate | Cocoa\nPercent | Company\nLocation | Rating | |
---|---|---|---|---|---|
0 | A. Morin | 2016 | 63% | France | 3.75 |
1 | A. Morin | 2015 | 70% | France | 2.75 |
2 | A. Morin | 2015 | 70% | France | 3.00 |
df_demo = df.copy()
df_demo.columns = df_demo.columns.map(lambda x:x.replace('\n',' '))
df_demo.head()
Company | Review Date | Cocoa Percent | Company Location | Rating | |
---|---|---|---|---|---|
0 | A. Morin | 2016 | 63% | France | 3.75 |
1 | A. Morin | 2015 | 70% | France | 2.75 |
2 | A. Morin | 2015 | 70% | France | 3.00 |
3 | A. Morin | 2015 | 70% | France | 3.50 |
4 | A. Morin | 2015 | 70% | France | 3.50 |
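Cocoa Percent is stored as strings, so it has to be converted to float before the median can be computed. Column names that contain spaces must be wrapped in backticks (`) inside query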
df_demo['Cocoa Percent'] = df_demo['Cocoa Percent'].apply(lambda x: float(x[:-1])/100)
df_demo.head()
Company | Review Date | Cocoa Percent | Company Location | Rating | |
---|---|---|---|---|---|
0 | A. Morin | 2016 | 0.63 | France | 3.75 |
1 | A. Morin | 2015 | 0.70 | France | 2.75 |
2 | A. Morin | 2015 | 0.70 | France | 3.00 |
3 | A. Morin | 2015 | 0.70 | France | 3.50 |
4 | A. Morin | 2015 | 0.70 | France | 3.50 |
df_demo.query('((Rating <=2.75) &'
'(`Cocoa Percent` >= `Cocoa Percent`.median()))').head()
Company | Review Date | Cocoa Percent | Company Location | Rating | |
---|---|---|---|---|---|
1 | A. Morin | 2015 | 0.70 | France | 2.75 |
5 | A. Morin | 2014 | 0.70 | France | 2.75 |
10 | A. Morin | 2013 | 0.70 | France | 2.75 |
14 | A. Morin | 2013 | 0.70 | France | 2.75 |
33 | Akesson's (Pralus) | 2010 | 0.75 | Switzerland | 2.75 |
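The conversion and the backtick syntax can be checked on a few made-up rows. The `choc` frame below is hypothetical, and the vectorized `str.rstrip('%').astype(float)` is offered as an alternative to the apply call above, not as the method the notes use.

```python
import pandas as pd

# Hypothetical rows with the two columns used in the filter
choc = pd.DataFrame({
    'Cocoa Percent': ['63%', '70%', '75%', '80%'],
    'Rating': [3.75, 2.75, 2.5, 3.0],
})

# Strip the percent sign and convert to a fraction
choc['Cocoa Percent'] = choc['Cocoa Percent'].str.rstrip('%').astype(float) / 100

# Backticks let query refer to a column whose name contains a space
picked = choc.query(
    'Rating <= 2.75 and `Cocoa Percent` >= `Cocoa Percent`.median()')

print(picked)
```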
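Approach: collect the years after 2012 and the locations that pass the filter, then hand both lists to loc, which selects their Cartesian product on the two index levels. (I do not think this is a good method, but it is the best I could come up with so far.)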
df_demo1 = df.copy()
df_demo1.columns = df_demo1.columns.map(lambda x:x.replace('\n',' '))
df_demo1 = df_demo1.set_index(['Review Date','Company Location'])
date = df_demo1.index.get_level_values(0)
date1 = list(set(date))
location = df_demo1.index.get_level_values(1)
location1 = list(set(location))
x = [i for i in date1 if i>2012 ]
y = [i for i in location1 if i not in ['France','Canada','Amsterdam','Belgium']]
df_demo1.loc[(x,y),:].head(2)
Company | Cocoa Percent | Rating | ||
---|---|---|---|---|
Review Date | Company Location | |||
2016 | Austria | Martin Mayer | 76% | 2.75 |
Austria | Martin Mayer | 82% | 3.00 |
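One possible alternative to the Cartesian-product selection is a boolean mask built directly from the index levels, which avoids enumerating the label combinations. The sketch below assumes the same two-level index and uses a hypothetical four-row table in place of data/chocolate.csv.

```python
import pandas as pd

# Hypothetical rows indexed by review year and company location
choc = pd.DataFrame({
    'Company': ['A', 'B', 'C', 'D'],
    'Review Date': [2016, 2011, 2014, 2013],
    'Company Location': ['France', 'Austria', 'Austria', 'Belgium'],
    'Rating': [3.75, 2.75, 3.0, 3.5],
}).set_index(['Review Date', 'Company Location'])

years = choc.index.get_level_values('Review Date')
places = choc.index.get_level_values('Company Location')

# Keep reviews after 2012 whose location is not in the excluded list
excluded = ['France', 'Canada', 'Amsterdam', 'Belgium']
mask = (years > 2012) & ~places.isin(excluded)
result = choc.loc[mask]

print(result)
```

Because the mask is evaluated row by row, it never requests a (year, location) pair that is absent from the index.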