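Introduction
Use case: a class of 30 students takes an English exam. Once the scores are in, each student must be assigned one of four grades: fail, pass, good, or excellent. For example, Xiaohong scored 95; that score falls in the interval (90, 100], so she is rated excellent.
To bin data into grades like this, and then go on to compute frequency statistics, you can use the cut and qcut functions from the pandas library.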
Differences
cut draws the interval boundaries by absolute value.
qcut draws the interval boundaries by quantile.
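To make the difference concrete, here is a minimal sketch that bins the same data both ways. The scores are made-up illustration data, not taken from the use case above.

```python
import pandas as pd

# Hypothetical scores, deliberately skewed toward the high end
scores = pd.Series([10, 70, 75, 80, 85, 90, 95, 100])

# cut: two equal-width intervals over the value range (split near 55)
by_value = pd.cut(scores, 2)
# qcut: two intervals split at the median, so each holds half of the samples
by_quantile = pd.qcut(scores, 2)

# Count samples per interval, listed in interval order
cut_counts = by_value.value_counts().sort_index().tolist()
qcut_counts = by_quantile.value_counts().sort_index().tolist()
print(cut_counts)   # [1, 7]
print(qcut_counts)  # [4, 4]
```

With cut the intervals have equal width but very uneven counts; with qcut the counts are balanced and the widths differ.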
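Function 1
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
x: the data to discretize
bins: if an integer, the value range of x is split into that many equal-width intervals, and the range is extended by 0.1% so that the minimum and maximum of x are included; if a list, it gives the boundaries of the intervals
right: right=True makes each interval closed on the right, (a, b]; right=False makes it closed on the left, [a, b)
labels: labels for the intervals; the length must equal the number of intervals
include_lowest: boolean; whether the left endpoint of the first interval is included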
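The effect of right, include_lowest, and an integer bins is easiest to see on boundary values. The following is a small sketch with made-up scores that sit exactly on the interval edges; the variable names are only for illustration.

```python
import pandas as pd

scores = [0, 60, 80, 90, 100]
edges = [0, 60, 80, 90, 100]

# Default: right-closed intervals (a, b]; the score 0 is outside (0, 60] and becomes NaN
default = pd.cut(scores, edges)
# include_lowest=True also closes the first interval on the left, so 0 is kept
with_lowest = pd.cut(scores, edges, include_lowest=True)
# right=False: left-closed intervals [a, b); now 100 is outside [90, 100)
left_closed = pd.cut(scores, edges, right=False)

print(default.isna().tolist())      # [True, False, False, False, False]
print(with_lowest.isna().tolist())  # [False, False, False, False, False]
print(left_closed.isna().tolist())  # [False, False, False, False, True]

# Integer bins: retbins=True returns the computed edges; with right=True the
# lowest edge is moved slightly below the minimum so the minimum is included
_, auto_edges = pd.cut([30, 60, 80, 85, 100], 2, retbins=True)
print(auto_edges)
```

Values that fall outside every interval are not an error; they are returned as NaN, which is worth checking for after binning.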
Example 1
import pandas as pd
import numpy as np
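a = [30, 60, 80, 85, 100]
# Interval boundaries; named bins to avoid shadowing the built-in bin()
bins = [0, 60, 80, 90, 100]
# Assign each score to its right-closed interval
rank = pd.cut(a, bins)
print(rank)
# Count the scores in each interval (top-level pd.value_counts is deprecated in recent pandas)
print(pd.Series(rank).value_counts())
type(rank)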
Output
[(0, 60], (0, 60], (60, 80], (80, 90], (90, 100]]
Categories (4, interval[int64]): [(0, 60] < (60, 80] < (80, 90] < (90, 100]]
(0, 60] 2
(90, 100] 1
(80, 90] 1
(60, 80] 1
dtype: int64
pandas.core.arrays.categorical.Categorical
Example 2
import pandas as pd
import numpy as np
a = pd.Series([30, 60, 80, 85, 100])
bins = [0, 60, 80, 90, 100]
# The keyword is labels (not label); one label per interval
rank = pd.cut(a, bins, labels=['Fail', 'Pass', 'Good', 'Excellent'])
rank
type(rank)
# Output ============================================
0 Fail
1 Fail
2 Pass
3 Good
4 Excellent
dtype: category
Categories (4, object): [Fail < Pass < Good < Excellent]
pandas.core.series.Series
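The introduction mentions going on to frequency statistics once the grades are assigned. A minimal sketch of that step, reusing the scores and boundaries from this example with English labels chosen for illustration, might look like this:

```python
import pandas as pd

a = pd.Series([30, 60, 80, 85, 100])
bins = [0, 60, 80, 90, 100]
rank = pd.cut(a, bins, labels=['Fail', 'Pass', 'Good', 'Excellent'])

# Absolute counts per grade, listed in grade order
counts = rank.value_counts().sort_index()
# Relative frequencies: normalize=True divides each count by the total
freq = rank.value_counts(normalize=True).sort_index()

print(counts.tolist())  # [2, 1, 1, 1]
print(freq.tolist())    # [0.4, 0.2, 0.2, 0.2]
```

Because the result of cut is an ordered categorical, sort_index() lists the grades from Fail to Excellent rather than alphabetically.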
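Function 2
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
x: the data to group
q: if an integer, the number of quantile-based intervals, chosen so that roughly the same number of samples falls into each; alternatively, an array of quantiles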
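The labels and retbins parameters work for qcut much as they do for cut. The sketch below uses made-up label names ('low', 'mid', 'high') purely for illustration.

```python
import pandas as pd

a = pd.Series([30, 60, 80, 85, 100])

# Three quantile-based intervals with hypothetical labels;
# retbins=True additionally returns the computed interval edges
levels, edges = pd.qcut(a, 3, labels=['low', 'mid', 'high'], retbins=True)

print(levels.tolist())  # ['low', 'low', 'mid', 'high', 'high']
print(edges)            # edges at the 0, 1/3, 2/3 and 1 quantiles
```

Returning the edges is useful when the same boundaries should later be applied to new data with pd.cut.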
Example 1
import pandas as pd
import numpy as np
a = pd.Series([30, 60, 80, 85, 100])
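# Three quantile-based intervals with roughly equal counts
rank1 = pd.qcut(a, 3)
# Custom quantile boundaries at 0%, 20%, 30%, and 100%
rank2 = pd.qcut(a, [0, 0.2, 0.3, 1])
print(rank1)
print(rank2)
# Output ===================================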
0 (29.999, 66.667]
1 (29.999, 66.667]
2 (66.667, 83.333]
3 (83.333, 100.0]
4 (83.333, 100.0]
dtype: category
Categories (3, interval[float64]): [(29.999, 66.667] < (66.667, 83.333] < (83.333, 100.0]]
0 (29.999, 54.0]
1 (54.0, 64.0]
2 (64.0, 100.0]
3 (64.0, 100.0]
4 (64.0, 100.0]
dtype: category
Categories (3, interval[float64]): [(29.999, 54.0] < (54.0, 64.0] < (64.0, 100.0]]
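The duplicates parameter in the qcut signature is not exercised by the example above. It matters when many values are tied, because several quantiles can then coincide. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical data with many ties: the 0%, 25% and 50% quantiles are all 1
tied = pd.Series([1, 1, 1, 1, 2, 3])

# With the default duplicates='raise', coinciding edges cause a ValueError
raised = False
try:
    pd.qcut(tied, 4)
except ValueError as err:
    raised = True
    print('ValueError:', err)

# duplicates='drop' removes the repeated edges, leaving fewer intervals than requested
merged = pd.qcut(tied, 4, duplicates='drop')
print(merged.cat.categories)
```

Here four intervals were requested but only two remain, so any labels passed together with duplicates='drop' would need to match the reduced number of intervals.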