sklearn KMeans Cluster Analysis: Car Classification

Contents
01 | Project Overview
02 | The KMeans Algorithm
03 | Approach
04 | Code

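01 | Project Overview

This is a car dataset with 205 samples. It includes each car's name, emissions, body size, and other related attributes. The goal of this project is to use an unsupervised algorithm to group the cars in the dataset into several categories.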


Dataset:
Link: https://pan.baidu.com/s/15iFV5NY2OWvhpDkGc2EtbA Extraction code: qb6j

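02 | The KMeans Algorithm

The k-means clustering algorithm is an iterative clustering method. To partition the data into K groups, it first picks K objects at random as the initial cluster centers, then computes the distance between every object and each center and assigns each object to its nearest center. Each center is then recomputed as the mean of the objects assigned to it, and the assign-and-update steps repeat until the assignments stop changing or an iteration limit is reached.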
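The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (Euclidean distance, random initial centers, a hypothetical helper named kmeans_sketch and toy data invented for the demo); it is not how sklearn's KMeans is implemented, which adds k-means++ initialization and several restarts.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain assign-and-update k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every sample to every center, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Assign each sample to its nearest center
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its samples;
        # keep the old center if a cluster ended up empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data (made up for this sketch): two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centers, labels = kmeans_sketch(X, k=2)
print(labels)
print(centers)
```

The first three points end up sharing one label and the last three the other; which group receives label 0 depends on the random initial centers.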

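03 | Approach

1. Data cleaning
The dataset contains object (string) columns, but only numeric columns can be used for training. The first step is therefore to split the columns into object and non-object types. There are two ways to do this: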

# Method 1: split by dtype
df_str = df.select_dtypes(include = object)
df_notstr = df.select_dtypes(exclude = object) # every column that is not object dtype

# Method 2: list the column names by hand
df[[' ', ' ', ' ', ......]]

2. Choosing K
Set a range of candidate k values, then import silhouette_score from sklearn.metrics and use it to find the best k.
3. Clustering
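Step 2 can be tried in isolation before running it on the car data. The sketch below uses synthetic data from make_blobs as a stand-in (three blobs whose positions are chosen only for this demo), fits KMeans for several candidate k values, and keeps the k with the highest mean silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 3 well-separated blobs (not the car dataset)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Mean silhouette coefficient over all samples; higher is better
    scores[k] = silhouette_score(X, model.labels_)

# Candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the peak is usually less sharp than on these blobs, so it helps to look at the whole score curve and not only the maximum.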

04 | Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn import preprocessing as pp

pd.set_option('display.max_columns',1000)
pd.set_option('display.width',1000)
pd.set_option('display.max_colwidth',1000)
plt.rcParams['font.sans-serif'] = ['SimHei']
sns.set_style('whitegrid',{'font.sans-serif':['simhei','Arial']})

# object columns cannot be used for training directly, so they are converted to integers below
df = pd.read_csv(r'C:\Users\Administrator\Documents\Downloads\car_price.csv')
print(df.head())
print(df.info())

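# Select the string (object) columns
df_str = df.select_dtypes(include = object)
df_notstr = df.select_dtypes(exclude = object) # every column that is not object dtype

# Encode each object column as integers
df_str = df_str.apply(pp.LabelEncoder().fit_transform)
# axis must be passed by keyword; recent pandas versions reject a positional axis
train_data = pd.concat([df_str,df_notstr],axis = 1)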
print(train_data.head())
print('--------------------------------------------------------------------------------')
# Scale every feature to the [0, 1] range
train_x = pp.MinMaxScaler().fit_transform(train_data)
print(train_x)

# Test candidate k values
import numpy as np
from sklearn.metrics import silhouette_score

scale = np.arange(2,10) # candidate k values: 2 to 9
score_list = []

for i in scale:
    # n_clusters must follow the loop variable; a bare KMeans() would fit the default 8 clusters every time
    kmeans = KMeans(init = 'k-means++',n_clusters = i,max_iter = 300,random_state = 1)
    kmeans.fit(train_x)
    score = silhouette_score(train_x,kmeans.labels_,metric = 'euclidean',sample_size = len(train_x))
    score_list.append(score)

# Visualize the silhouette score for each k
sns.barplot(x = scale,y = score_list)
plt.show()
sns.lineplot(x = scale,y = score_list)
plt.show()
# k = 7 was chosen as the best value in the original run

kmeans = KMeans(n_clusters = 7).fit(train_x)
predicted = kmeans.predict(train_x) # cluster label assigned to each sample
predicted = pd.DataFrame(predicted,columns = ['type'])
car_type = pd.concat((df,predicted),axis = 1)
car_type.to_csv('car_classification.csv',index = False)
print(car_type.head())

car_type0 = car_type.loc[car_type['type'] == 0]
print(car_type0.head())

car_type1 = car_type.loc[car_type['type'] == 1]
print(car_type1.head())

car_type2 = car_type.loc[car_type['type'] == 2]
print(car_type2.head())

car_type3 = car_type.loc[car_type['type'] == 3]
print(car_type3.head())

car_type4 = car_type.loc[car_type['type'] == 4]
print(car_type4.head())
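Filtering one cluster at a time works, but a grouped summary may give a quicker overview of what each cluster looks like. The sketch below is one possible way to do that; car_type_demo and its horsepower and price columns are hypothetical stand-ins invented for the demo, not values from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for car_type: two numeric columns plus the 'type' label
car_type_demo = pd.DataFrame({
    'horsepower': [70, 68, 110, 115, 200, 210],
    'price': [6000, 6500, 12000, 13000, 35000, 36000],
    'type': [0, 0, 1, 1, 2, 2],
})

# Number of cars in each cluster
sizes = car_type_demo['type'].value_counts().sort_index()
# Per-cluster mean of every numeric column
profile = car_type_demo.groupby('type').mean(numeric_only=True)

print(sizes)
print(profile)
```

The same two calls could be applied to car_type itself to cover all seven clusters at once.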
