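Contents
1. Learning objectives
2. Preparation
3. Creating categorical variables and their properties
3.1 Creation
3.2 Properties
3.2.1 Viewing the categories and whether they are ordered
3.2.2 Modifying categories
3.2.3 Adding categories
3.2.4 Removing categories
4. Sorting categorical variables
4.1 Establishing and dropping an order
4.1.1 Establishing
4.1.2 Dropping
4.2 Sorting
5. Comparison operations on categorical variables
5.1 Comparison with a scalar or an equal-length sequence
5.2 Comparison with another categorical variable
5.2.1 Equality tests
5.2.2 Inequality tests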
1. Learn how categorical types are created and what properties they have
2. Learn to sort and compare categorical types
For this project, see https://github.com/datawhalechina/team-learning/tree/master/Pandas%E6%95%99%E7%A8%8B%EF%BC%88%E4%B8%8A%EF%BC%89
import pandas as pd
import numpy as np
df = pd.read_csv('data/table.csv')
df.head()
School Class ID Gender Address Height Weight Math Physics
0 S_1 C_1 1101 M street_1 173 63 34.0 A+
1 S_1 C_1 1102 F street_2 192 73 32.5 B+
2 S_1 C_1 1103 M street_2 186 82 87.2 B+
3 S_1 C_1 1104 F street_2 167 81 80.4 B-
4 S_1 C_1 1105 F street_4 159 64 84.8 B+
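A categorical variable can be created in several ways: from a sequence, from a designated column of a table, with the built-in Categorical type, or with the cut() method.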
pd.Series(["a", "b", "c", "a"], dtype = "category")
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
temp_df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], \
dtype = "category"), 'B': list('abcd')})
temp_df.dtypes
A category
B object
dtype: object
cat = pd.Categorical(["a", "b", "c", "a"], categories = ['a', 'b', 'c'])
pd.Series(cat)
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
# By default the intervals are used as labels, but custom string labels can be specified instead
pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60])
[(0, 10], (30, 60], (30, 60], (30, 60], (30, 60]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]
pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60], \
right = False, labels = ['0-10', '10-30', '30-60'])
[10-30, 0-10, 30-60, 0-10, 30-60]
Categories (3, object): [0-10 < 10-30 < 30-60]
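A column of a table that is already loaded can also be converted afterwards with astype('category'). The following is a minimal sketch under one assumption: the small frame `demo` is a hypothetical stand-in for a few rows of the table loaded earlier, so that the snippet runs without data/table.csv.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of data/table.csv
demo = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Physics': ['A+', 'B+', 'B+', 'B-'],
})

# Convert an existing column to the category dtype
demo['Physics'] = demo['Physics'].astype('category')

print(demo.dtypes)
print(demo['Physics'].cat.categories)
```

Only the converted column changes dtype; the categories are inferred from the distinct values present in the column.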
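A categorical variable has three parts: its element values (values), its categories (categories), and whether it is ordered (ordered). Categorical variables created with the cut function are ordered by default.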
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.describe()
count 4
unique 3
top a
freq 2
dtype: object
print(s.cat.categories)
print(s.cat.ordered)
Index(['a', 'b', 'c', 'd'], dtype='object')
False
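The three parts can be inspected directly, and the same attributes show that cut() returns an ordered result while a plain category Series does not. A small sketch, where `ages` and `plain` are illustrative names and values, not part of the dataset:

```python
import numpy as np
import pandas as pd

ages = np.array([5, 25, 45])            # illustrative values
binned = pd.cut(ages, [0, 10, 30, 60])  # returns a Categorical

print(binned.categories)   # the three intervals
print(binned.ordered)      # whether the categories carry an order
print(binned.codes)        # integer position of each value's category

plain = pd.Series(['a', 'b', 'a'], dtype='category')
print(plain.cat.ordered)   # a plain category Series is unordered
```

The `codes` array is how the element values are stored internally: each entry is the position of the value's category in `categories`.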
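# Modify with set_categories. This replaces the category list without renaming any value; values whose category is no longer listed become NaN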
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.set_categories(['new_a', 'c'])
0 NaN
1 NaN
2 c
3 NaN
4 NaN
dtype: category
Categories (2, object): [new_a, c]
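The NaN values in the output above come from categories that were dropped from the list, not from a rename. A short sketch that makes this explicit, using a hypothetical series `s_demo`:

```python
import pandas as pd

s_demo = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# 'a' and 'b' are absent from the new category list, so those rows become NaN;
# 'new_a' is added as a category but no row uses it
changed = s_demo.cat.set_categories(['new_a', 'c'])

print(changed.isna().tolist())
print(changed.cat.categories.tolist())
print(s_demo.tolist())   # the original series is left untouched
```

Because the method returns a new object, the result has to be assigned if it is needed later.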
# Modify with rename_categories. Note that this method changes the values together with the categories
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.rename_categories(['new_%s' % i for i in s.cat.categories])
0 new_a
1 new_b
2 new_c
3 new_a
4 NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]
# Rename selected categories with a dict; categories not listed in the dict are kept
s.cat.rename_categories({'a': 'new_a', 'b': 'new_b'})
0 new_a
1 new_b
2 c
3 new_a
4 NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]
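Besides a list or a dict, rename_categories also accepts a callable that is applied to each category. A brief sketch with a hypothetical series `s_demo`:

```python
import pandas as pd

s_demo = pd.Series(pd.Categorical(['a', 'b', 'c', 'a'],
                                  categories=['a', 'b', 'c', 'd']))

# The callable is applied to every category, including the unused 'd'
upper = s_demo.cat.rename_categories(lambda c: c.upper())

print(upper.tolist())
print(upper.cat.categories.tolist())
```

Unlike set_categories, no value is lost here: each value follows its category to the new name.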
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.add_categories(['e'])
0 a
1 b
2 c
3 a
4 NaN
dtype: category
Categories (5, object): [a, b, c, d, e]
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.remove_categories(['d'])
0 a
1 b
2 c
3 a
4 NaN
dtype: category
Categories (3, object): [a, b, c]
# Remove categories that do not appear among the element values
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
categories = ['a', 'b', 'c', 'd']))
s.cat.remove_unused_categories()
0 a
1 b
2 c
3 a
4 NaN
dtype: category
Categories (3, object): [a, b, c]
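The examples above remove only the unused category 'd'. The sketch below, again with a hypothetical `s_demo`, shows what happens when the removed category is in use, and when it does not exist at all:

```python
import pandas as pd

s_demo = pd.Series(pd.Categorical(['a', 'b', 'c', 'a'],
                                  categories=['a', 'b', 'c', 'd']))

# Removing a category that is in use turns the matching values into NaN
no_a = s_demo.cat.remove_categories(['a'])
print(no_a.isna().tolist())
print(no_a.cat.categories.tolist())

# Removing a category that does not exist raises ValueError
try:
    s_demo.cat.remove_categories(['z'])
    removal_failed = False
except ValueError as err:
    removal_failed = True
    print('ValueError:', err)
```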
s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a < c < d]
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.set_categories(['a', 'c', 'd'], ordered = True)
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a < c < d]
# The distinguishing feature of this method is that the new categories must be the same set as the original categories
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.reorder_categories(['a', 'c', 'd'],ordered = True)
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a < c < d]
s.cat.as_unordered()
0 a
1 d
2 c
3 a
dtype: category
Categories (3, object): [a, c, d]
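The same-set requirement of reorder_categories can be checked with a small sketch; `s_demo` and the flag `reorder_failed` are illustrative names:

```python
import pandas as pd

s_demo = pd.Series(['a', 'd', 'c', 'a']).astype('category')

# Same set of categories in a new order: accepted
reordered = s_demo.cat.reorder_categories(['d', 'c', 'a'], ordered=True)
print(reordered.cat.categories.tolist())

# A different set (here 'c' is missing): rejected with ValueError
try:
    s_demo.cat.reorder_categories(['a', 'd'], ordered=True)
    reorder_failed = False
except ValueError:
    reorder_failed = True
print(reorder_failed)
```

When the category set itself has to change, set_categories is the method to use instead.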
s = pd.Series(np.random.choice(['perfect', 'good', 'fair', 'bad', 'awful'], 50)).astype('category')
s.cat.set_categories(['perfect', 'good', 'fair', 'bad', 'awful'][::-1], ordered = True).head()
0 good
1 perfect
2 fair
3 good
4 fair
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]
s.sort_values(ascending = False).head()
37 perfect
9 perfect
19 perfect
18 perfect
17 perfect
dtype: category
Categories (5, object): [awful, bad, fair, good, perfect]
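df_sort = pd.DataFrame({'cat': s.values, 'value': np.random.randn(50)})
# set_index returns a new DataFrame, so assign the result before sorting by the index
df_sort = df_sort.set_index('cat')
df_sort.head()
df_sort.sort_index().head()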
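In the rating example above, the five labels happen to sort alphabetically in the same order as the intended ranking, so the effect of the category order is easy to miss. The sketch below uses hypothetical labels ('low', 'medium', 'high') whose alphabetical order differs from their meaning:

```python
import pandas as pd

levels = ['low', 'medium', 'high']               # intended order, smallest first
sizes = pd.Series(['medium', 'high', 'low', 'medium'])

# Plain strings sort alphabetically
print(sizes.sort_values().tolist())

# Attach the order and keep the result by assigning it
sizes_cat = sizes.astype('category').cat.set_categories(levels, ordered=True)

# Categorical values sort by category order instead
print(sizes_cat.sort_values().tolist())
```

Sorting follows the position of each category in `categories`, which is why the category list should be set explicitly whenever the labels carry a ranking.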
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'
0 True
1 False
2 False
3 True
dtype: bool
s == list('abcd')
0 True
1 False
2 True
3 False
dtype: bool
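Two further scalar cases may be worth seeing: a scalar that is not among the categories, and an order comparison on an ordered categorical. A sketch with illustrative names:

```python
import pandas as pd

s_demo = pd.Series(['a', 'd', 'c', 'a']).astype('category')

# Equality with a scalar that is not a category is False everywhere
print((s_demo == 'z').tolist())

# With an ordered categorical, > against a scalar follows the category order
ordered = s_demo.cat.as_ordered()      # a < c < d
print((ordered > 'a').tolist())
```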
For two categorical variables, an equality test requires that their categories be exactly the same.
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == s
0 True
1 True
2 True
3 True
dtype: bool
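Comparing a series with itself cannot fail this requirement. The sketch below uses two hypothetical series, `left` and `right`, whose categories differ, and then aligns them:

```python
import pandas as pd

left = pd.Series(['a', 'b'], dtype='category')    # categories: a, b
right = pd.Series(['a', 'c'], dtype='category')   # categories: a, c

# Different categories: the comparison raises TypeError
try:
    left == right
    eq_failed = False
except TypeError:
    eq_failed = True
print(eq_failed)

# Giving both sides the same categories makes the comparison valid
left_aligned = left.cat.set_categories(['a', 'b', 'c'])
right_aligned = right.cat.set_categories(['a', 'b', 'c'])
print((left_aligned == right_aligned).tolist())
```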
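For two categorical variables, an inequality test requires two conditions: the categories are exactly the same and their order is exactly the same, which means both variables must be ordered.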
s = pd.Series(["a", "d", "c", "a"]).astype('category')
# s >= s  # raises TypeError because the categories are unordered
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s = s.cat.reorder_categories(['a', 'c', 'd'], ordered = True)
s >= s
0 True
1 True
2 True
3 True
dtype: bool
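One way to guarantee that both sides share categories and order is to build them from a single CategoricalDtype. A minimal sketch, where `grade`, `x` and `y` are illustrative names:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One shared dtype: same categories, same order (a < c < d)
grade = CategoricalDtype(['a', 'c', 'd'], ordered=True)
x = pd.Series(['a', 'd', 'c', 'a']).astype(grade)
y = pd.Series(['c', 'c', 'c', 'c']).astype(grade)

print((x >= y).tolist())

# Unordered categoricals support only == and !=
try:
    x.cat.as_unordered() >= y.cat.as_unordered()
    ineq_failed = False
except TypeError:
    ineq_failed = True
print(ineq_failed)
```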