Clustering differs from classification: clustering is an unsupervised learning model, while classification is supervised. A clustering algorithm partitions the samples into N clusters such that similarity is high within each cluster and low between clusters. Similarity between samples is usually measured by "distance": the smaller the distance, the higher the similarity, and vice versa.
1) K-Means clustering is a widely used, prototype-based clustering algorithm. It works as follows: choose k initial cluster centers, assign each sample to its nearest center, recompute each center as the mean of its assigned samples, and repeat until the centers stop moving. Example:
import numpy as np
import sklearn.cluster as sc
import sklearn.metrics as sm
import matplotlib.pyplot as plt
x = []
with open("D:/python/data/multiple3.txt", 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)
print(x.shape)
# K-Means clusterer
model = sc.KMeans(n_clusters=4)  # n_clusters is the number of clusters, fixed in advance
# Train
model.fit(x)
cluster_labels_ = model.labels_  # cluster assignment of every point
cluster_centers_ = model.cluster_centers_  # geometric centers of the clusters
print('cluster_labels_: ', cluster_labels_)
print('cluster_centers_: ', cluster_centers_)
silhouette_score = sm.silhouette_score(x, cluster_labels_,
                                       sample_size=len(x),
                                       metric='euclidean')  # Euclidean distance metric
print('silhouette_score: ', silhouette_score)
print('-------- Visualization --------')
plt.figure('K-Means Cluster', facecolor='lightgray')
plt.title('K-Means Cluster', fontsize=16)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.tick_params(labelsize=10)
plt.scatter(x[:,0], x[:,1], s=50, c=cluster_labels_, cmap='brg')
plt.scatter(cluster_centers_[:,0], cluster_centers_[:,1], marker='+',
c='black', s=120, linewidths=3)
plt.show()
"""
cluster_labels_: [2 2 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 1 1 3 0 2 1 3 0 2 1 3 0 2
1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1
3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 3 1 3 0 2 1 3 0 2 1 3 0 2 1 3
0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0
2 2 3 0 2 1 3 0 2 1 3 1 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2 1 3 0 2
1 3 0 2 1 3 0 2 1 3 0 2 1 3 0]
cluster_centers_: [[7.07326531 5.61061224]
[3.1428 5.2616 ]
[1.831 1.9998 ]
[5.91196078 2.04980392]]
silhouette_score: 0.5773232071896658
"""
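The K-Means steps described above can also be sketched from scratch in plain NumPy. This is a minimal illustration on synthetic data, not a replacement for sklearn's optimized implementation; the `kmeans` helper, the toy blobs, and the seeded initial centers are all invented for this example:

```python
import numpy as np

def kmeans(x, k, init=None, n_iter=100, seed=0):
    """Minimal K-Means: assign each point to its nearest center, then move
    each center to the mean of its assigned points, until nothing changes."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)] if init is None else init.copy()
    for _ in range(n_iter):
        # distances of every point to every center, shape (n, k)
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # new center = mean of the points assigned to it
        new_centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated blobs; seed one initial center in each blob.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels, centers = kmeans(x, 2, init=x[[0, 50]])
```

Note that plain K-Means is sensitive to initialization; sklearn's `KMeans` mitigates this with smarter seeding and multiple restarts.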
2) DBSCAN (Density-Based Spatial Clustering of Applications with Noise). A point with at least min_samples neighbors within radius eps is a core point; clusters grow outward from core points through density-reachable neighbors, and points belonging to no cluster are labeled as noise. Example:
import numpy as np
import sklearn.cluster as sc
import matplotlib.pyplot as plt
import sklearn.metrics as sm
x = []
with open('D:/python/data/perf.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)
print(x.shape)
# Model hyperparameters
radius = 0.8  # neighborhood radius (eps)
min_samples = 5  # minimum number of samples in a cluster's neighborhood
# Create the DBSCAN clusterer
model = sc.DBSCAN(eps=radius, min_samples=min_samples)
# Train the model
model.fit(x)
# Cluster assignments
cluster_labels_ = model.labels_
# Evaluate the clustering model
score = sm.silhouette_score(x, model.labels_,
                            sample_size=len(x),
                            metric='euclidean')  # silhouette coefficient
print('silhouette_score: ', score)
# silhouette_score: 0.6366395861050828
# Distinguish the three kinds of samples
# Core samples
core_mask = np.zeros(len(x), dtype=bool)
core_mask[model.core_sample_indices_] = True
# Noise samples
offset_mask = cluster_labels_ == -1
# Border samples
border_mask = ~(core_mask | offset_mask)
print('-------- Visualization --------')
plt.figure('DBSCAN Cluster', facecolor='lightgray')
plt.title('DBSCAN Cluster', fontsize=18)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.tick_params(labelsize=8)
plt.grid(linestyle=':')
labels = set(cluster_labels_)
cs = plt.get_cmap('brg', len(labels))(range(len(labels)))
# Core points
plt.scatter(x[core_mask][:, 0],
x[core_mask][:, 1],
c=cs[cluster_labels_[core_mask]],
s=80, label='Core')
# Border points
plt.scatter(x[border_mask][:, 0],
x[border_mask][:, 1],
edgecolor=cs[cluster_labels_[border_mask]],
facecolor='none', s=80, label='Periphery')
# Noise points
plt.scatter(x[offset_mask][:, 0],
x[offset_mask][:, 1],
marker='D', c=cs[cluster_labels_[offset_mask]],
s=80, label='Offset')
plt.legend()
plt.show()
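DBSCAN's defining behavior is that sufficiently isolated points are labeled -1 (noise) rather than being forced into a cluster. A minimal sketch of this on synthetic data (the blob parameters and outlier coordinates are invented for illustration):

```python
import numpy as np
import sklearn.cluster as sc

rng = np.random.default_rng(0)
# Two dense blobs plus two isolated outliers far from any cluster.
blob1 = rng.normal(0.0, 0.2, (40, 2))
blob2 = rng.normal(4.0, 0.2, (40, 2))
outliers = np.array([[10.0, 10.0], [-8.0, 9.0]])
x = np.vstack([blob1, blob2, outliers])

model = sc.DBSCAN(eps=0.8, min_samples=5).fit(x)
labels = model.labels_
# noise points carry the label -1, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Unlike K-Means, the number of clusters is not specified in advance; it emerges from eps and min_samples, which makes those two hyperparameters worth tuning carefully.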
3) Agglomerative clustering. It works as follows:
Start by treating every sample as its own cluster. While the number of clusters exceeds the preset value, merge the two closest clusters into a new one; repeating this grows the clusters while shrinking their total number, until the preset count is reached. The core of the agglomerative algorithm is how to measure the distance between different clusters.
import numpy as np
import sklearn.cluster as sc
import sklearn.metrics as sm
import matplotlib.pyplot as plt
x = []
with open("D:/python/data/multiple3.txt", 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data)
x = np.array(x)
print(x.shape)
# Agglomerative clusterer
model = sc.AgglomerativeClustering(n_clusters=4)  # n_clusters is the number of clusters, fixed in advance
# Train
model.fit(x)
# Get the predictions
cluster_labels_ = model.labels_  # cluster assignment of every point
silhouette_score = sm.silhouette_score(x, cluster_labels_,
                                       sample_size=len(x),
                                       metric='euclidean')  # Euclidean distance metric
print('silhouette_score: ', silhouette_score)
# silhouette_score: 0.5736608796903743
print('-------- Visualization --------')
plt.figure('Agglomerative Cluster', facecolor='lightgray')
plt.title('Agglomerative Cluster', fontsize=16)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.tick_params(labelsize=10)
plt.scatter(x[:,0], x[:,1], s=50, c=cluster_labels_, cmap='brg')
plt.show()
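As noted above, the core of the agglomerative algorithm is how inter-cluster distance is measured; in sklearn this is controlled by `AgglomerativeClustering`'s `linkage` parameter ('ward', 'complete', 'average', 'single'). A small sketch on synthetic blobs, where all four linkages should agree (the toy data is invented for illustration):

```python
import numpy as np
import sklearn.cluster as sc

rng = np.random.default_rng(0)
# Two clearly separated blobs, so every linkage should find the same split.
x = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])

labels_by_linkage = {}
for linkage in ('ward', 'complete', 'average', 'single'):
    model = sc.AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels_by_linkage[linkage] = model.fit(x).labels_
```

On harder data the linkages can disagree noticeably; 'single' linkage in particular tends to chain clusters together through narrow bridges of points.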
An ideal clustering can be summarized as "dense inside, sparse outside": each cluster is tight internally, and the clusters are well separated from one another. This is measured with the silhouette coefficient.
Suppose the samples have already been clustered by some algorithm. For each sample i, define:
a(i) = the mean distance from sample i to all other points in its own cluster
b(i) = the minimum, over the other clusters, of the mean distance from i to all points in that cluster
The silhouette coefficient of sample i is then:
S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
The formula shows that S(i) lies in [-1, 1]: values close to 1 mean the sample is much closer to its own cluster than to any other, values near 0 indicate overlapping clusters, and negative values suggest the sample may be assigned to the wrong cluster. In sklearn the average score is computed as:
score = sm.silhouette_score(x,  # samples
                            cluster_labels_,  # labels
                            sample_size=len(x),  # number of samples
                            metric="euclidean")  # Euclidean distance metric
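To connect the formula to the API call, here is a sketch that computes the average silhouette coefficient directly from a(i) and b(i) and checks it against sklearn's result (the `silhouette` helper and the four-point dataset are invented for illustration):

```python
import numpy as np
import sklearn.metrics as sm

def silhouette(x, labels):
    """Average silhouette coefficient computed directly from the formula."""
    n = len(x)
    d = np.linalg.norm(x[:, None] - x[None, :], axis=2)  # pairwise distances
    s = np.empty(n)
    for i in range(n):
        own = labels == labels[i]
        # a(i): mean distance to the other points of i's own cluster
        a = d[i, own & (np.arange(n) != i)].mean()
        # b(i): smallest mean distance to the points of any other cluster
        b = min(d[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Tiny example: two clusters of two points each.
x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
manual = silhouette(x, labels)
sk = sm.silhouette_score(x, labels, metric='euclidean')
```

The two values should match exactly, since `silhouette_score` with `metric='euclidean'` implements the same per-sample formula and averages it.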
OK, that's the quick summary. I really hope to get to know more like-minded friends here; let's keep working hard together, the future truly is exciting...