pandas Team Learning - Task 8: Categorical Types

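Contents

1. Learning Objectives

2. Preparation

3. Creating Categorical Variables and Their Properties

3.1 Creation

3.2 Properties

3.2.1 Inspecting the Categories and Orderedness

3.2.2 Modifying Categories

3.2.3 Adding Categories

3.2.4 Removing Categories

4. Ordering Categorical Variables

4.1 Establishing and Dropping an Order

4.1.1 Establishing

4.1.2 Dropping

4.2 Sorting

5. Comparison Operations on Categorical Variables

5.1 Comparison with a Scalar or an Equal-Length Sequence

5.2 Comparison with Another Categorical Variable

5.2.1 Equality Tests

5.2.2 Inequality Tests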

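1. Learning Objectives

1. Learn how categorical types are created and what properties they have

2. Learn to sort and compare categorical types

For the project this task belongs to, see https://github.com/datawhalechina/team-learning/tree/master/Pandas%E6%95%99%E7%A8%8B%EF%BC%88%E4%B8%8A%EF%BC%89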

2. Preparation

import pandas as pd
import numpy as np

df = pd.read_csv('data/table.csv')
df.head()
  School Class    ID Gender   Address  Height  Weight  Math Physics
0    S_1   C_1  1101      M  street_1     173      63  34.0      A+
1    S_1   C_1  1102      F  street_2     192      73  32.5      B+
2    S_1   C_1  1103      M  street_2     186      82  87.2      B+
3    S_1   C_1  1104      F  street_2     167      81  80.4      B-
4    S_1   C_1  1105      F  street_4     159      64  84.8      B+

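3. Creating Categorical Variables and Their Properties

3.1 Creation

Categorical variables can be created in several ways: from a Series, from a specified column of a DataFrame, with the built-in Categorical type, and with the cut() function.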

pd.Series(["a", "b", "c", "a"], dtype = "category")
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
temp_df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], \
                                       dtype = "category"), 'B': list('abcd')})
temp_df.dtypes
A    category
B      object
dtype: object
cat = pd.Categorical(["a", "b", "c", "a"], categories = ['a', 'b', 'c'])
pd.Series(cat)
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
# By default each interval is used as its own label, but custom string labels can be specified
pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60])
[(0, 10], (30, 60], (30, 60], (30, 60], (30, 60]]
Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]
pd.cut(np.random.randint(0, 60, 5), [0, 10, 30, 60], \
       right = False, labels = ['0-10', '10-30', '30-60'])
[10-30, 0-10, 30-60, 0-10, 30-60]
Categories (3, object): [0-10 < 10-30 < 30-60]
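The examples above build categoricals from scratch. As a minimal sketch of the "specified column of a DataFrame" route, an existing column can also be converted with astype. The frame `demo` and its column names are made up for illustration; `CategoricalDtype` is the standard pandas dtype object, used here to set the category list and the ordered flag in one step.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy frame; the column names are illustrative only
demo = pd.DataFrame({'grade': ['B', 'A', 'C', 'A'], 'score': [70, 90, 60, 95]})

# Convert an existing object column to category
demo['grade'] = demo['grade'].astype('category')

# An explicit dtype fixes both the category list and the ordered flag up front
grade_dtype = CategoricalDtype(categories=['C', 'B', 'A'], ordered=True)
ordered_grade = demo['grade'].astype(grade_dtype)

print(demo.dtypes)
print(ordered_grade.cat.categories.tolist(), ordered_grade.cat.ordered)
```

Passing a dtype object is handy when several columns or several files must share one category definition.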

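3.2 Properties

A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered). Categorical variables created with the cut function are ordered by default.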

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
                             categories = ['a', 'b', 'c', 'd']))
s.describe()
count     4
unique    3
top       a
freq      2
dtype: object

3.2.1 Inspecting the Categories and Orderedness

print(s.cat.categories)
print(s.cat.ordered)
Index(['a', 'b', 'c', 'd'], dtype='object')
False
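To make the three parts more concrete, here is a small sketch that also looks at the integer codes behind the values and checks the ordered flag of a cut result. The variable name `binned` and the sample numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'c', 'a', np.nan],
                             categories=['a', 'b', 'c', 'd']))

# Each value is stored as an integer position into the categories; NaN maps to -1
print(s.cat.codes.tolist())      # [0, 1, 2, 0, -1]

# pd.cut returns an ordered Categorical by default
binned = pd.cut([5, 20, 45], [0, 10, 30, 60])
print(binned.ordered)            # True
```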

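3.2.2 Modifying Categories

# Modify with set_categories. This replaces the category list without renaming any values; values missing from the new list become NaN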
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
                             categories = ['a', 'b', 'c', 'd']))
s.cat.set_categories(['new_a', 'c'])
0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): [new_a, c]
# Modify with rename_categories. Note that this method changes the values and the categories together
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
                             categories = ['a', 'b', 'c', 'd']))
s.cat.rename_categories(['new_%s' % i for i in s.cat.categories])
0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]
# Rename selected categories with a dict
s.cat.rename_categories({'a': 'new_a', 'b': 'new_b'})
0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]
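The difference between the two methods is easy to miss, so here is a side-by-side sketch on one small Series. The names `replaced` and `renamed` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# set_categories matches by label: values missing from the new list become NaN
replaced = s.cat.set_categories(['new_a', 'c'])

# rename_categories relabels existing categories: no value is lost
renamed = s.cat.rename_categories({'a': 'new_a'})

print(replaced.isna().sum())   # 3
print(renamed.tolist())        # ['new_a', 'b', 'c', 'new_a']
```

In short, reach for rename_categories to change labels and for set_categories to change which labels are allowed.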

3.2.3 Adding Categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
                             categories = ['a', 'b', 'c', 'd']))
s.cat.add_categories(['e'])
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]
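One practical reason to add a category is that pandas refuses to assign a value that is not among the categories. The sketch below assumes that behaviour; the exception type has differed between pandas versions, so both candidates are caught.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

try:
    s.iloc[0] = 'e'              # 'e' is not a category yet
    assign_failed = False
except (TypeError, ValueError):  # exception type differs across pandas versions
    assign_failed = True

# Register the new category first, then the assignment is accepted
s = s.cat.add_categories(['e'])
s.iloc[0] = 'e'

print(assign_failed, s.tolist())
```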

3.2.4 Removing Categories

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
                             categories = ['a', 'b', 'c', 'd']))
s.cat.remove_categories(['d'])
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
# Remove the categories that do not appear among the values
s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan], \
                             categories = ['a', 'b', 'c', 'd']))
s.cat.remove_unused_categories()
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]
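Both examples above remove a category that no value uses. As a short sketch of the other case, removing a category that is still in use turns the affected values into NaN. The name `trimmed` is illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Removing a category that is still in use turns its values into NaN
trimmed = s.cat.remove_categories(['a'])

print(trimmed.isna().tolist())          # [True, False, False, True]
print(trimmed.cat.categories.tolist())  # ['b', 'c']
```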

4. Ordering Categorical Variables

4.1 Establishing and Dropping an Order

4.1.1 Establishing

s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
s
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.set_categories(['a', 'c', 'd'], ordered = True)
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]
# reorder_categories differs in that the new categories must be the same set as the original ones
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s.cat.reorder_categories(['a', 'c', 'd'],ordered = True)
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a < c < d]
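The same-set requirement can be checked directly. In this sketch reorder_categories is given a shorter list and is expected to raise ValueError, while set_categories accepts the same list and turns the dropped value into NaN. The names `reorder_failed` and `shrunk` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')

# reorder_categories insists on the same set of categories
try:
    s.cat.reorder_categories(['a', 'c'], ordered=True)
    reorder_failed = False
except ValueError:
    reorder_failed = True

# set_categories accepts the smaller list, at the cost of turning 'd' into NaN
shrunk = s.cat.set_categories(['a', 'c'], ordered=True)

print(reorder_failed)            # True
print(shrunk.isna().tolist())    # [False, True, False, False]
```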

4.1.2 Dropping

s.cat.as_unordered()
0    a
1    d
2    c
3    a
dtype: category
Categories (3, object): [a, c, d]
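One visible consequence of dropping the order is that min and max stop being defined. The sketch below uses a deliberately reversed order (d < c < a) so the result cannot be mistaken for alphabetical order; the names `ordered` and `max_failed` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')
ordered = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)

# With the order d < c < a, the minimum is 'd' and the maximum is 'a'
print(ordered.min(), ordered.max())   # d a

# After dropping the order, max is no longer defined and raises TypeError
try:
    ordered.cat.as_unordered().max()
    max_failed = False
except TypeError:
    max_failed = True
print(max_failed)                     # True
```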

4.2 Sorting

s = pd.Series(np.random.choice(['perfect', 'good', 'fair', 'bad', 'awful'], 50)).astype('category')
# Assign the result back: set_categories returns a new Series, so the order is otherwise discarded
s = s.cat.set_categories(['perfect', 'good', 'fair', 'bad', 'awful'][::-1], ordered = True)
s.head()
0       good
1    perfect
2       fair
3       good
4       fair
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]
s.sort_values(ascending = False).head()
37    perfect
9     perfect
19    perfect
18    perfect
17    perfect
dtype: category
Categories (5, object): [awful < bad < fair < good < perfect]
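df_sort = pd.DataFrame({'cat': s.values, 'value': np.random.randn(50)})
df_sort.set_index('cat').head()
# set_index returns a new DataFrame; chain sort_index onto it so the sort uses the categorical index
df_sort.set_index('cat').sort_index().head()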
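Because the data above is random, the effect of the category order is hard to see from the output. Here is a small deterministic sketch, with made-up labels, showing that sort_values follows the category list rather than the spelling of the labels.

```python
import pandas as pd

sizes = pd.Series(['medium', 'low', 'high', 'low']).astype('category')

# Unordered: sort_values follows the category list, which astype built alphabetically
print(sizes.sort_values().tolist())   # ['high', 'low', 'low', 'medium']

# Ordered with an explicit ranking: sort_values follows low < medium < high
ranked = sizes.cat.set_categories(['low', 'medium', 'high'], ordered=True)
print(ranked.sort_values().tolist())  # ['low', 'low', 'medium', 'high']
```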

5. Comparison Operations on Categorical Variables

5.1 Comparison with a Scalar or an Equal-Length Sequence

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == 'a'
0     True
1    False
2    False
3     True
dtype: bool
s == list('abcd')
0     True
1    False
2     True
3    False
dtype: bool
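Two further cases may be worth a quick sketch: equality against a scalar that is not one of the categories, and an inequality against a scalar once the Series is ordered. The scalar 'z' and the name `ordered` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')
ordered = s.cat.as_ordered()          # a < c < d

# Equality with a scalar that is not a category is simply False everywhere
print((s == 'z').tolist())            # [False, False, False, False]

# Inequality against a scalar that is a category uses the category order
print((ordered > 'a').tolist())       # [False, True, True, False]
```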

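5.2 Comparison with Another Categorical Variable

5.2.1 Equality Tests

An equality test between two categorical variables requires that they have exactly the same categories.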

s = pd.Series(["a", "d", "c", "a"]).astype('category')
s == s
0    True
1    True
2    True
3    True
dtype: bool
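Comparing a Series with itself cannot show the requirement failing, so here is a sketch with a second Series that carries one extra category. As far as I know pandas raises TypeError in that case, whereas two unordered categoricals with the same categories listed in a different order still compare fine. The names `s2` and `s3` are illustrative.

```python
import pandas as pd

s1 = pd.Series(['a', 'd', 'c', 'a']).astype('category')
s2 = s1.cat.add_categories(['e'])     # same values, one extra category

try:
    s1 == s2
    eq_failed = False
except TypeError:
    eq_failed = True
print(eq_failed)                      # True

# The same categories in a different listing order are fine when both are unordered
s3 = s1.cat.reorder_categories(['d', 'c', 'a'])
print((s1 == s3).tolist())            # [True, True, True, True]
```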

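5.2.2 Inequality Tests

An inequality test between two categorical variables requires that both are ordered and that two conditions hold: the categories are exactly the same, and their order is exactly the same.

s = pd.Series(["a", "d", "c", "a"]).astype('category')
# s >= s  # raises TypeError: unordered categoricals support only == and !=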
s = pd.Series(["a", "d", "c", "a"]).astype('category')
s = s.cat.reorder_categories(['a', 'c', 'd'], ordered = True)
s >= s
0    True
1    True
2    True
3    True
dtype: bool
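The second condition can be sketched by giving two ordered Series the same categories in opposite orders; the comparison is then expected to raise TypeError. The names `asc` and `desc` are illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'd', 'c', 'a']).astype('category')
asc = s.cat.reorder_categories(['a', 'c', 'd'], ordered=True)   # a < c < d
desc = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)  # d < c < a

# Same categories, different order: the inequality is rejected
try:
    asc >= desc
    cmp_failed = False
except TypeError:
    cmp_failed = True
print(cmp_failed)                     # True

# Same categories, same order: the inequality is evaluated elementwise
print((asc >= asc).tolist())          # [True, True, True, True]
```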

 
