Conventions
import pandas as pd
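When we do not want to work with continuous values directly, we can discretize them. Discretization is also called binning or bucketing.
Pandas provides a convenient function for this, cut():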
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
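Parameters:
x: the one-dimensional array or Series to discretize.
bins: either an integer number of equal-width bins or a sequence of bin edges.
right: whether each interval is closed on the right (default True).
labels: names for the resulting bins; passing False returns integer bin codes instead.
retbins: whether to also return the bin edges.
precision: the number of decimal places used for the bin labels.
include_lowest: whether the first interval also includes its left edge.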
# Suppose we have the ages of a group of people
ages=[20,19,30,34,23,40,50]
se_ages=pd.Series(ages)
1 Discretizing
bin=[0,18,25,35,60]
se1=pd.cut(se_ages,bin)
print(se1)
0 (18, 25]
1 (18, 25]
2 (25, 35]
3 (25, 35]
4 (18, 25]
5 (35, 60]
6 (35, 60]
dtype: category
Categories (4, interval[int64]): [(0, 18] < (18, 25] < (25, 35] < (35, 60]]
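Because the intervals are open on the left by default, a value equal to the lowest edge falls outside every bin, and so does any value beyond the last edge. The sketch below illustrates this with a hypothetical series, ages_with_edges, that is not part of the original data, and shows how include_lowest=True brings the lowest edge back in.

```python
import pandas as pd

# Hypothetical values chosen to sit on or outside the bin edges.
ages_with_edges = pd.Series([0, 18, 61])
bins = [0, 18, 25, 35, 60]

# Default: (0, 18] excludes 0, and 61 lies beyond the last edge, so both are NaN.
default = pd.cut(ages_with_edges, bins)

# include_lowest=True makes the first interval include its left edge.
inclusive = pd.cut(ages_with_edges, bins, include_lowest=True)

print(default.isna().tolist())
print(inclusive.isna().tolist())
```

With include_lowest=True, pandas typically displays the first interval with a left edge slightly below 0, which is how it represents the inclusion of 0 itself.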
2 Counting the values in each group
se1.value_counts()
(18, 25] 3
(35, 60] 2
(25, 35] 2
(0, 18] 0
dtype: int64
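The counts above are sorted by frequency, which scrambles the natural order of the intervals. If the bin order matters (for example when plotting a histogram), one possible approach is to sort the result by its index. This is a minimal sketch that rebuilds the same data; the name counts_in_bin_order is only illustrative.

```python
import pandas as pd

se_ages = pd.Series([20, 19, 30, 34, 23, 40, 50])
groups = pd.cut(se_ages, [0, 18, 25, 35, 60])

# value_counts() sorts by count; sort_index() restores the interval order.
counts_in_bin_order = groups.value_counts().sort_index()
print(counts_in_bin_order)
```

Empty intervals such as (0, 18] still appear with a count of 0, because the result is categorical and keeps every category.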
3 Making the left edge closed
pd.cut(se_ages,bin,right=False)
0 [18, 25)
1 [18, 25)
2 [25, 35)
3 [25, 35)
4 [18, 25)
5 [35, 60)
6 [35, 60)
dtype: category
Categories (4, interval[int64]): [[0, 18) < [18, 25) < [25, 35) < [35, 60)]
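None of the ages above lies exactly on a bin edge, so the two settings put every person in the same group. The difference only shows for boundary values. The following sketch uses a hypothetical series, boundary_ages, to compare the two settings side by side.

```python
import pandas as pd

# Hypothetical ages that sit exactly on bin edges.
boundary_ages = pd.Series([18, 25, 35])
bins = [0, 18, 25, 35, 60]

right_closed = pd.cut(boundary_ages, bins)              # intervals like (18, 25]
left_closed = pd.cut(boundary_ages, bins, right=False)  # intervals like [18, 25)

comparison = pd.DataFrame({
    "age": boundary_ages,
    "right_closed": right_closed,
    "left_closed": left_closed,
})
print(comparison)
```

With right=False each boundary value moves up into the next interval, so 25 is grouped into [25, 35) rather than (18, 25].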
4 Naming the intervals
pd.cut(se_ages,bin,labels=['Youth','YoungAdult','MiddleAge','Senior'])
0 YoungAdult
1 YoungAdult
2 MiddleAge
3 MiddleAge
4 YoungAdult
5 Senior
6 Senior
dtype: category
Categories (4, object): [Youth < YoungAdult < MiddleAge < Senior]
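The labels list needs one entry per interval, that is, one fewer than the number of bin edges. When only the position of each bin is needed, labels=False returns plain integer codes instead of names. A minimal sketch using the same ages and edges:

```python
import pandas as pd

se_ages = pd.Series([20, 19, 30, 34, 23, 40, 50])
bins = [0, 18, 25, 35, 60]

# labels=False yields the zero-based index of the bin each value falls into.
codes = pd.cut(se_ages, bins, labels=False)
print(codes.tolist())  # [1, 1, 2, 2, 1, 3, 3]
```

Code 0 would correspond to (0, 18], which no age in this sample falls into.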
Thanks for reading.
I hope this write-up is useful to you.
Let's keep learning together!