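Data analysis often involves values such as age, income, or price. Before analyzing them, it is common to split these values into a series of intervals, encode each interval, and then run the analysis on the encoded data.
In pandas, the pandas.cut() function both splits data into intervals and labels those intervals.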
import pandas as pd
import numpy as np
df = pd.DataFrame(data={
"name":["A","B","C","D","E","F","G","H","I","J"],
"age":[23,26,37,46,85,12,53,80,66,32],
"score":[13,23,22,76,56,89,99,100,10,54],
})
The data looks like this:
| | name | age | score |
---|---|---|---|
0 | A | 23 | 13 |
1 | B | 26 | 23 |
2 | C | 37 | 22 |
3 | D | 46 | 76 |
4 | E | 85 | 56 |
5 | F | 12 | 89 |
6 | G | 53 | 99 |
7 | H | 80 | 100 |
8 | I | 66 | 10 |
9 | J | 32 | 54 |
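pandas.cut() divides data into intervals.
Return value: the interval each element falls into and, when retbins=True, the array of bin edges.
For example, with bins=3 and right=True, pandas splits the range of the data (here min=12, max=85) into 3 equal-width intervals, computed as follows:
(max-(max-min)/bins, max] ==> (60.667, 85]
(max-(max-min)/bins*2, max-(max-min)/bins] ==> (36.333, 60.667]
(min-0.001*(max-min), max-(max-min)/bins*2] ==> (11.927, 36.333]
The lowest edge is not min itself: pandas moves it 0.1% of the range below min (12 - 0.073 = 11.927) so that the minimum value falls inside the first interval.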
a,b = pd.cut(x=df["age"],bins=3,right=True,retbins=True)
# a: bins was passed as an int, so the intervals are generated automatically
0 (11.927, 36.333]
1 (11.927, 36.333]
2 (36.333, 60.667]
3 (36.333, 60.667]
4 (60.667, 85.0]
5 (11.927, 36.333]
6 (36.333, 60.667]
7 (60.667, 85.0]
8 (60.667, 85.0]
9 (11.927, 36.333]
Name: age, dtype: category
Categories (3, interval[float64]): [(11.927, 36.333] < (36.333, 60.667] < (60.667, 85.0]]
# b: the automatically computed bin edges
array([11.927, 36.333, 60.667, 85.0])
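The edge calculation above can be reproduced by hand. The following is a minimal sketch, assuming the equal-width rule described above and the 0.1% adjustment of the lowest edge; the variable names (`ages`, `edges`, `pandas_edges`) are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([23, 26, 37, 46, 85, 12, 53, 80, 66, 32], name="age")

n_bins = 3
lo, hi = ages.min(), ages.max()

# n_bins equal-width intervals need n_bins + 1 edges between min and max.
edges = np.linspace(lo, hi, n_bins + 1)

# With right=True the lowest edge is moved 0.1% of the range below min,
# so that the minimum value is included in the first interval.
edges[0] -= 0.001 * (hi - lo)

# Compare with the edges pandas reports through retbins=True.
_, pandas_edges = pd.cut(ages, bins=n_bins, right=True, retbins=True)

print(edges)
print(np.allclose(edges, pandas_edges))
```

If the two arrays match, the hand calculation agrees with what pd.cut() does for this data.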
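Example: define a custom list of age boundaries, age_bins = [10,20,30,50,70,80,90].
The corresponding intervals are: [(10, 20] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]]
pandas then bins the data using exactly the edges given in age_bins.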
age_bins = [10,20,30,50,70,80,90]
a,b = pd.cut(x=df["age"],bins=age_bins,right=True,retbins=True)
Values of a and b:
# a: the interval assigned to each value (a Series of dtype category)
0 (20, 30]
1 (20, 30]
2 (30, 50]
3 (30, 50]
4 (80, 90]
5 (10, 20]
6 (50, 70]
7 (70, 80]
8 (50, 70]
9 (30, 50]
Name: age, dtype: category
Categories (6, interval[int64]): [(10, 20] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]]
# b: the bin edges, returned because retbins=True
array([10, 20, 30, 50, 70, 80, 90])
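One thing to watch for with custom edges is that values outside the outermost edges are not assigned to any interval and come back as NaN. A short sketch, using a hypothetical `sample_ages` series that is not part of the original data:

```python
import pandas as pd

age_bins = [10, 20, 30, 50, 70, 80, 90]

# Hypothetical values: 5 lies below the first edge, 95 above the last one.
sample_ages = pd.Series([5, 12, 46, 95])

binned = pd.cut(x=sample_ages, bins=age_bins, right=True)

# Out-of-range values become NaN instead of raising an error.
print(binned)
print(binned.isna().tolist())  # [True, False, False, True]
```

Choosing edges that cover the full range of the data avoids silently losing rows to NaN.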
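Use case: when age is 80, should it go in (70, 80] or [80, 90)? The right parameter settles this: right=True closes each interval on the right, and right=False closes it on the left.
Example: bins = [10,20,30,50,70,80,90], right=True
The corresponding intervals are: [(10, 20] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]]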
pd.cut(x=df.age,bins=age_bins,retbins=True,right=True)
(0 (20, 30]
1 (20, 30]
2 (30, 50]
3 (30, 50]
4 (80, 90]
5 (10, 20]
6 (50, 70]
7 (70, 80]
8 (50, 70]
9 (30, 50]
Name: age, dtype: category
Categories (6, interval[int64]): [(10, 20] < (20, 30] < (30, 50] < (50, 70] < (70, 80] < (80, 90]],
array([10, 20, 30, 50, 70, 80, 90]))
Example: bins = [10,20,30,50,70,80,90], right=False
The corresponding intervals are: [[10, 20) < [20, 30) < [30, 50) < [50, 70) < [70, 80) < [80, 90)]
pd.cut(x=df.age,bins=age_bins,retbins=True,right=False)
(0 [20, 30)
1 [20, 30)
2 [30, 50)
3 [30, 50)
4 [80, 90)
5 [10, 20)
6 [50, 70)
7 [80, 90)
8 [50, 70)
9 [30, 50)
Name: age, dtype: category
Categories (6, interval[int64]): [[10, 20) < [20, 30) < [30, 50) < [50, 70) < [70, 80) < [80, 90)],
array([10, 20, 30, 50, 70, 80, 90]))
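To isolate the effect of right, the sketch below bins the single boundary value 80 both ways. The names `edges`, `closed_right`, and `closed_left` are illustrative only.

```python
import pandas as pd

edges = [10, 20, 30, 50, 70, 80, 90]
boundary = pd.Series([80])

# right=True: intervals are closed on the right, so 80 belongs to (70, 80].
closed_right = pd.cut(boundary, bins=edges, right=True)

# right=False: intervals are closed on the left, so 80 belongs to [80, 90).
closed_left = pd.cut(boundary, bins=edges, right=False)

print(closed_right.iloc[0])  # (70, 80]
print(closed_left.iloc[0])   # [80, 90)
```

Only values that sit exactly on an edge change bins; every other value lands in the same numeric range either way.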
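The labels parameter attaches a label to each interval. For the score column, for example: up to 60 is a fail (不及格), above 60 and up to 80 is good (良好), and above 80 is excellent (优秀). Because right defaults to True, a score of exactly 60 counts as a fail.
Example: bins: [0,60,80,100], labels: ["不及格","良好","优秀"]
The result contains the label for each value rather than the interval itself.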
pd.cut(x=df.score,bins=[0,60,80,100],labels=["不及格","良好","优秀"])
Output:
0 不及格
1 不及格
2 不及格
3 良好
4 不及格
5 优秀
6 优秀
7 优秀
8 不及格
9 不及格
Name: score, dtype: category
Categories (3, object): ['不及格' < '良好' < '优秀']
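Once values carry labels, counting how many fall under each label is a natural next step. The sketch below does that and also shows labels=False, which returns integer bin positions instead of labels. The variable names are illustrative; the scores are the same as in the DataFrame above.

```python
import pandas as pd

scores = pd.Series([13, 23, 22, 76, 56, 89, 99, 100, 10, 54], name="score")
score_bins = [0, 60, 80, 100]

# Labelled result: one category per interval, in the order of the bins.
labelled = pd.cut(x=scores, bins=score_bins, labels=["不及格", "良好", "优秀"])

# sort=False keeps the category order (fail, good, excellent)
# instead of sorting by frequency.
counts = labelled.value_counts(sort=False)
print(counts)

# labels=False returns the 0-based position of the bin for each value.
codes = pd.cut(x=scores, bins=score_bins, labels=False)
print(codes.tolist())  # [0, 0, 0, 1, 0, 2, 2, 2, 0, 0]
```

Integer codes are convenient when the bins feed a model or a further numeric step, while string labels read better in reports.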
df["age_range"] = pd.cut(x=df["age"],bins=[10, 20, 30, 50, 70, 80, 90])
df["score_label"] = pd.cut(x=df["score"],bins=[0,60,80,100],labels=["不及格","良好","优秀"])
Output:
| | name | age | score | age_range | score_label |
---|---|---|---|---|---|
0 | A | 23 | 13 | (20, 30] | 不及格 |
1 | B | 26 | 23 | (20, 30] | 不及格 |
2 | C | 37 | 22 | (30, 50] | 不及格 |
3 | D | 46 | 76 | (30, 50] | 良好 |
4 | E | 85 | 56 | (80, 90] | 不及格 |
5 | F | 12 | 89 | (10, 20] | 优秀 |
6 | G | 53 | 99 | (50, 70] | 优秀 |
7 | H | 80 | 100 | (70, 80] | 优秀 |
8 | I | 66 | 10 | (50, 70] | 不及格 |
9 | J | 32 | 54 | (30, 50] | 不及格 |
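With the binned columns in place, the encoded data can be analyzed by group. The following is a minimal sketch of one such analysis, the mean age per score label; the choice of statistic and the name `mean_age` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame(data={
    "name": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "age": [23, 26, 37, 46, 85, 12, 53, 80, 66, 32],
    "score": [13, 23, 22, 76, 56, 89, 99, 100, 10, 54],
})
df["score_label"] = pd.cut(
    x=df["score"], bins=[0, 60, 80, 100], labels=["不及格", "良好", "优秀"]
)

# Group by the categorical column; observed=False keeps every category
# in the result, including any that happen to contain no rows.
mean_age = df.groupby("score_label", observed=False)["age"].mean()
print(mean_age)
```

The same pattern works with age_range as the grouping column, for example to count people or average scores per age band.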