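Discretization maps a finite number of individuals from an infinite space into a finite space. It is mostly applied to continuous data: after processing, the value distribution changes from a continuous attribute to a discrete one, which generally takes two or more distinct values.
pd.cut bins by the specified interval edges. Note that by default the intervals are left-open and right-closed.
pd.qcut does equal-frequency binning: you only pass the number of groups, and it tries to put the same number of samples in each group. (qcut picks the bin edges from the quantiles of the values, so the bin widths may differ while each bin holds roughly the same number of values.)
In short, pd.cut bins continuous data at the specified cut points (or into equal-width bins when only a number of bins is given), whereas pd.qcut bins continuous data into the specified number of equal-frequency bins.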
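The difference between the two functions is easier to see on a tiny example. The following is a minimal sketch; the sample values are hypothetical and were chosen only so that one outlier makes equal-width and equal-frequency binning diverge.

```python
import pandas as pd

# Hypothetical sample values with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut with explicit edges: intervals are left-open, right-closed by default,
# so the value 1 lies outside (1, 50] and becomes NaN.
by_edges = pd.cut(values, bins=[1, 50, 100])

# include_lowest=True closes the first interval on the left, so 1 is kept.
by_edges_closed = pd.cut(values, bins=[1, 50, 100], include_lowest=True)

# pd.cut with an integer: equal-width bins over the range of the data.
equal_width = pd.cut(values, bins=2)

# pd.qcut with an integer: equal-frequency bins based on quantiles.
equal_freq = pd.qcut(values, q=2)

print(by_edges.isna().sum())                      # values that fell outside the bins
print(equal_width.value_counts(sort=False))       # counts per equal-width bin
print(equal_freq.value_counts(sort=False))        # counts per equal-frequency bin
```

With equal-width bins the outlier ends up alone in its bin, while equal-frequency bins split the samples evenly.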
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import preprocessing
# Read the data
df = pd.read_table('data7.txt', names=['id', 'amount', 'income', 'datetime', 'age'])  # read the data file
print(df.head(5))  # print the first 5 rows
      id  amount  income             datetime    age
0  15093    1390   10.40  2017-04-30 19:24:13   0-10
1  15062    4024    4.68  2017-04-27 22:44:59  70-80
2  15028    6359    3.84  2017-04-27 10:07:55  40-50
3  15012    7759    3.70  2017-04-04 07:28:18  30-40
4  15021     331    4.25  2017-04-08 11:14:00  70-80
Discretizing time data
# Discretize the time data
df['datetime'] = pd.to_datetime(df['datetime'])  # convert the column to datetime format
df.info()
df['datetime'] = df['datetime'].dt.weekday  # discretize into day of week (Monday=0, Sunday=6)
print(df.head(5))  # print the first 5 rows
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        100 non-null    int64
 1   amount    100 non-null    int64
 2   income    100 non-null    float64
 3   datetime  100 non-null    datetime64[ns]
 4   age       100 non-null    object
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 4.0+ KB
      id  amount  income  datetime    age
0  15093    1390   10.40         6   0-10
1  15062    4024    4.68         3  70-80
2  15028    6359    3.84         3  40-50
3  15012    7759    3.70         1  30-40
4  15021     331    4.25         5  70-80
'''
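Day of week is only one way to discretize a timestamp. As a hedged sketch, the same idea can be extended to the hour of day by combining the .dt accessor with pd.cut. The three timestamps are copied from the sample rows above; the part-of-day boundaries and label names are assumptions made for illustration.

```python
import pandas as pd

# Timestamps in the same format as the datetime column.
ts = pd.to_datetime(pd.Series([
    '2017-04-30 19:24:13',
    '2017-04-27 22:44:59',
    '2017-04-04 07:28:18',
]))

weekday = ts.dt.weekday  # Monday=0 ... Sunday=6
hour = ts.dt.hour        # 0 ... 23

# Hypothetical buckets for the hour of day. right=False makes the
# intervals left-closed: [0, 6), [6, 12), [12, 18), [18, 24).
part_of_day = pd.cut(
    hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['night', 'morning', 'afternoon', 'evening'],
)

print(pd.DataFrame({'weekday': weekday, 'hour': hour, 'part_of_day': part_of_day}))
```

Using right=False here matters: with the default right-closed intervals, hour 0 would fall outside the first bin and become NaN.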
df.describe()  # use the minimum and maximum to decide the range the bin edges must cover
                 id       amount      income    datetime
count    100.000000   100.000000  100.000000  100.000000
mean   15053.850000  3983.220000    4.105900    3.010000
std       29.431439  2302.857349    1.195348    2.002498
min    15002.000000   176.000000    1.210000    0.000000
25%    15028.750000  2050.000000    3.295000    1.000000
50%    15057.500000  3944.500000    4.210000    3.000000
75%    15080.000000  5980.250000    4.610000    5.000000
max    15100.000000  7952.000000   10.400000    6.000000
bins = [0, 200, 1000, 5000, 10000]  # custom bin edges
df['amount1'] = pd.cut(df['amount'], bins)  # discretize using the edges
print(df.head(5))  # print the first 5 rows
      id  amount  income  datetime    age        amount1
0  15093    1390   10.40         6   0-10   (1000, 5000]
1  15062    4024    4.68         3  70-80   (1000, 5000]
2  15028    6359    3.84         3  40-50  (5000, 10000]
3  15012    7759    3.70         1  30-40  (5000, 10000]
4  15021     331    4.25         5  70-80    (200, 1000]
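Interval objects such as (1000, 5000] are not always convenient downstream. A minimal sketch of the same custom-edge binning with readable labels follows; the label names are hypothetical, and the values 200 and 0 were added to the sample amounts to show how the edges behave.

```python
import pandas as pd

# Sample amounts from the rows above, plus 200 and 0 to probe the edges.
amount = pd.Series([1390, 4024, 6359, 7759, 331, 200, 0])

bins = [0, 200, 1000, 5000, 10000]
labels = ['tiny', 'small', 'medium', 'large']  # hypothetical names, one per interval

binned = pd.cut(amount, bins, labels=labels)

# Integer code of each category; -1 marks values that fell outside every bin.
codes = binned.cat.codes

print(pd.DataFrame({'amount': amount, 'binned': binned, 'code': codes}))
```

Because the intervals are right-closed, 200 lands in the first bin, while 0 sits on the open left edge and becomes NaN (code -1). Passing include_lowest=True to pd.cut would keep it.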
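data = df['amount']  # the data to cluster: the amount column
data_reshape = data.values.reshape((data.shape[0], 1))  # reshape into a single-column 2-D array
model_kmeans = KMeans(n_clusters=4, random_state=0)  # split into four clusters labelled 0, 1, 2, 3
# Create the KMeans model with 4 clusters; random_state is the random seed, so the same random_state gives the same result on every run
kmeans_result = model_kmeans.fit_predict(data_reshape)  # fit the model and assign a cluster to each row
df['amount2'] = kmeans_result  # merge the newly discretized data back into the DataFrame
print(df.head(5))  # print the first 5 rows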
      id  amount  income  datetime    age        amount1  amount2
0  15093    1390   10.40         6   0-10   (1000, 5000]        0
1  15062    4024    4.68         3  70-80   (1000, 5000]        2
2  15028    6359    3.84         3  40-50  (5000, 10000]        1
3  15012    7759    3.70         1  30-40  (5000, 10000]        1
4  15021     331    4.25         5  70-80    (200, 1000]        0
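In the output above the cluster numbers carry no order: label 2 covers smaller amounts than label 1. If an ordinal code is wanted, one option is to renumber the clusters by the size of their centers. This is a sketch under assumptions, not part of the original workflow: the sample amounts are made up, and n_init=10 is set explicitly only to keep the behavior the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical amounts, sorted, forming four loose groups.
amount = pd.Series([100, 120, 150, 2000, 2100, 2200,
                    5000, 5100, 5200, 7900, 8000, 8100])
X = amount.to_numpy().reshape(-1, 1)  # KMeans expects a 2-D array

model = KMeans(n_clusters=4, n_init=10, random_state=0)
raw_labels = model.fit_predict(X)  # arbitrary label numbers

# order[i] is the raw label of the cluster with the i-th smallest center.
order = np.argsort(model.cluster_centers_.ravel())

# Invert the permutation: rank_of_cluster[raw_label] is that cluster's rank.
rank_of_cluster = np.empty_like(order)
rank_of_cluster[order] = np.arange(len(order))

ordered_labels = rank_of_cluster[raw_labels]  # 0 = smallest amounts

print(pd.DataFrame({'amount': amount, 'raw': raw_labels, 'ordered': ordered_labels}))
```

After renumbering, a larger label always means a larger amount, which makes the column comparable with the interval-based bins.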
df['amount3'] = pd.qcut(df['amount'], 4, labels=['bad', 'medium', 'good', 'awesome'])
# qcut does equal-frequency binning: only the number of groups is passed, and each group gets roughly the same number of samples
df = df.drop(columns='amount')  # drop the amount column
print(df.head(5))  # print the first 5 rows
      id  income  datetime    age        amount1  amount2  amount3
0  15093   10.40         6   0-10   (1000, 5000]        0      bad
1  15062    4.68         3  70-80   (1000, 5000]        2     good
2  15028    3.84         3  40-50  (5000, 10000]        1  awesome
3  15012    3.70         1  30-40  (5000, 10000]        1  awesome
4  15021    4.25         5  70-80    (200, 1000]        0      bad
df.groupby(["amount3"])["id"].count()
'''
amount3
bad        25
medium     25
good       25
awesome    25
Name: id, dtype: int64
'''
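The even 25/25/25/25 split above depends on the quantile edges being distinct. The sketch below, on made-up data, shows how to inspect the edges qcut chose via retbins=True, and what happens when many tied values make the edges collide. duplicates='drop' is a real pd.qcut option; the sample series are assumptions.

```python
import pandas as pd

# Hypothetical evenly spread amounts: 100, 200, ..., 1200.
amount = pd.Series(range(100, 1300, 100))

# retbins=True also returns the quantile edges that were used.
quartile, edges = pd.qcut(amount, 4, retbins=True)
counts = quartile.value_counts(sort=False).tolist()
print(edges)
print(counts)

# Hypothetical skewed data: more than half of the values are tied at 0,
# so several quantile edges coincide.
skewed = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

try:
    pd.qcut(skewed, 4)
    raised = False
except ValueError:
    raised = True  # qcut refuses non-unique bin edges by default

# duplicates='drop' merges the colliding edges, which yields fewer bins.
merged = pd.qcut(skewed, 4, duplicates='drop')
print(raised, merged.cat.categories.size)
```

When edges are dropped, the number of bins no longer matches the number of labels, so a fixed labels list such as the four names used above would have to be shortened accordingly.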
df.head()
'''
      id  income  datetime    age        amount1  amount2  amount3
0  15093   10.40         6   0-10   (1000, 5000]        0      bad
1  15062    4.68         3  70-80   (1000, 5000]        2     good
2  15028    3.84         3  40-50  (5000, 10000]        1  awesome
3  15012    3.70         1  30-40  (5000, 10000]        1  awesome
4  15021    4.25         5  70-80    (200, 1000]        0      bad
'''
binarizer_scaler = preprocessing.Binarizer(threshold=df['income'].mean())  # create the Binarizer object
binarizer_scaler  # binarize using the mean of the income column as the threshold (about 4.1)
# Binarizer(threshold=4.105899999999999)
income_tmp = binarizer_scaler.fit_transform(df[['income']])  # binarize: values above the threshold become 1, the rest become 0
income_tmp = income_tmp.ravel()  # flatten the (100, 1) result to one dimension
income_tmp.shape
# (100,)
df['income'] = income_tmp  # write the binarized values back to the income column
print(df.head(5))  # print the first 5 rows
      id  income  datetime    age        amount1  amount2  amount3
0  15093     1.0         6   0-10   (1000, 5000]        0      bad
1  15062     1.0         3  70-80   (1000, 5000]        0     good
2  15028     0.0         3  40-50  (5000, 10000]        1  awesome
3  15012     0.0         1  30-40  (5000, 10000]        1  awesome
4  15021     1.0         5  70-80    (200, 1000]        0      bad
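Binarizer applies a strict comparison: values greater than the threshold become 1, and values equal to or below it become 0. The sketch below checks this against a plain pandas comparison. The income values are taken from the sample rows above, and the threshold of 4.25 is a hypothetical choice made so that one value sits exactly on the boundary.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

income = pd.DataFrame({'income': [10.40, 4.68, 3.84, 3.70, 4.25]})
threshold = 4.25  # hypothetical; equals the last value on purpose

# Binarizer works on 2-D input, hence the double brackets; ravel() flattens the result.
by_sklearn = Binarizer(threshold=threshold).fit_transform(income[['income']]).ravel()

# The same rule written as a plain comparison.
by_pandas = (income['income'] > threshold).astype(float).to_numpy()

print(by_sklearn)
print(np.array_equal(by_sklearn, by_pandas))
```

For a single column the pandas comparison is the lighter option; Binarizer is mainly useful when the step has to sit inside a scikit-learn pipeline.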