Python [Minimal] Clustering Algorithms (KMeans + DBSCAN + MeanShift)
Link: https://blog.csdn.net/Yellow_python/article/details/81461056?utm_source=copy
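1. Minimal clustering code
1.1 K-Means: based on Euclidean distance
1.2 DBSCAN: density-based
1.3 Mean Shift (3D visualization)
2. Clustering evaluation: Silhouette Coefficient
2.1 Evaluating KMeans
2.2 Evaluating DBSCAN
2.3 Evaluating MeanShift
4. Appendix
4.1 Translations
4.2 Datasets
4.2.1 Dataset 1
4.2.2 Dataset 2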
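1. Minimal clustering code
1.1 K-Means: based on Euclidean distance
The time complexity of K-Means is O(nkt), which makes it suitable for mining large datasets.
n: the number of objects in the dataset
t: the number of iterations
k: the number of clusters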
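Where the O(nkt) cost comes from can be seen in a minimal NumPy sketch of the classic Lloyd iteration. This is an illustration under stated assumptions, not scikit-learn's implementation: the function name kmeans_lloyd, the random initialisation, and the stopping rule are all hypothetical choices. Each pass computes n * k distances, and the loop runs t times.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Plain Lloyd iteration; returns labels, centers and the iteration count t."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for t in range(1, max_iter + 1):
        # Assignment step: n * k Euclidean distances, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers, t

X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
labels, centers, t = kmeans_lloyd(X, k=2)
print(labels, centers, t, sep='\n')
```

The work per iteration is dominated by the (n, k) distance matrix, so the total cost grows linearly in each of n, k and t.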
Create the data
import numpy as np
X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
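Clustering
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)  # create the KMeans object and set the number of clusters
km.fit(X)  # fit on the data
labels = km.labels_  # clustering result (cluster label of each sample)
print(labels)
centers = km.cluster_centers_  # cluster centers
print(centers)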
Visualization
import matplotlib.pyplot as mp
for x, l in zip(X, labels):  # samples, colored by cluster label
    if l == 0:
        mp.scatter(x[0], x[1], c='r')
    else:
        mp.scatter(x[0], x[1], c='g')
for i in range(len(centers)):  # cluster centers
    if i == 0:
        mp.scatter(centers[i][0], centers[i][1], c='r', marker='x', s=99)
    else:
        mp.scatter(centers[i][0], centers[i][1], c='g', marker='x', s=99)
mp.show()
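Because K-Means is based on Euclidean distance, a fitted model labels a new sample with the index of its nearest center. The sketch below checks this by hand; the two new points, n_init=10 and random_state=0 are assumptions added for reproducibility and are not part of the original code.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 4], [6, 8], [1, 2], [6, 7], [3, 1], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[2.0, 2.5], [7.0, 7.0]])  # hypothetical unseen samples
predicted = km.predict(new_points)

# Manual equivalent: index of the closest center by Euclidean distance.
dists = np.linalg.norm(new_points[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
manual = dists.argmin(axis=1)
print(predicted, manual)
```

The label numbers themselves (0 or 1) are arbitrary and may swap between runs with a different seed; only the grouping is meaningful.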
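1.2 DBSCAN: density-based
Density-Based Spatial Clustering of Applications with Noise
Advantages:
1. The number of clusters does not need to be known in advance
2. It can discover clusters of arbitrary shape
3. It can identify noise points
4. It is largely insensitive to the order of the samples. Border samples that lie between clusters, however, may be assigned to whichever cluster is discovered first
Disadvantages:
1. It does not handle high-dimensional data well
2. It does not handle datasets with varying density well
3. Clustering quality is poor when the sample density is uneven or the spacing between clusters varies widely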
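Advantages 2 and 3 can be illustrated with a small sketch. It assumes synthetic make_moons data plus two hand-placed outliers, and hand-picked eps=0.2 and min_samples=5; none of these values come from the original article.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: clusters that are not convex blobs.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
# Two isolated points, far from both moons, that should end up as noise.
outliers = np.array([[3.0, 3.0], [-3.0, -3.0]])
X = np.vstack([X, outliers])

labels = DBSCAN(eps=0.2, min_samples=5).fit(X).labels_

# DBSCAN marks noise with the label -1, so it is excluded from the cluster count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print('clusters found:', n_clusters)
print('noise points:', n_noise)
```

The number of clusters is never passed in; it falls out of the density parameters, which is also why results depend heavily on the choice of eps.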
Create the data
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in scikit-learn 0.24
X, _ = make_blobs(n_samples=100, centers=[[1, 1], [9, 9], [7, 3]])
DBSCAN: density-based clustering
from sklearn.cluster import DBSCAN
labels = DBSCAN(eps=1, min_samples=3).fit(X).labels_
print(labels)
Visualization
import matplotlib.pyplot as mp
colors = ['red', 'blue', 'green', 'black']  # label -1 (noise) indexes the last entry, 'black'
for x, l in zip(X, labels):
    mp.scatter(x[0], x[1], c=colors[l])
mp.show()
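1.3 Mean Shift (3D visualization)
Mean Shift finds the maxima of the kernel density estimate and uses them as cluster centroids, then assigns each sample to its nearest centroid.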
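The two stages described above (climb to a density maximum, then assign by nearest centroid) can be sketched in plain NumPy. This is a simplified illustration with a flat kernel, not scikit-learn's implementation; the function name mean_shift_flat, the merge rule, the synthetic data and bandwidth=2.0 are all assumptions.

```python
import numpy as np

def mean_shift_flat(X, bandwidth, max_iter=300, tol=1e-3):
    """Tiny mean-shift sketch with a flat kernel; not tuned for speed."""
    X = np.asarray(X, dtype=float)
    modes = []
    for seed in X:
        point = seed
        for _ in range(max_iter):
            # Shift the point to the mean of all samples inside its window.
            window = X[np.linalg.norm(X - point, axis=1) <= bandwidth]
            if len(window) == 0:
                break  # nothing in range; keep the current position
            shifted = window.mean(axis=0)
            if np.linalg.norm(shifted - point) < tol:
                break  # converged to a local density maximum
            point = shifted
        modes.append(point)
    # Keep a mode only if it is farther than one bandwidth from every accepted one.
    centers = []
    for m in modes:
        if all(np.linalg.norm(m - c) > bandwidth for c in centers):
            centers.append(m)
    centers = np.array(centers)
    # Nearest-neighbour rule: each sample takes the label of its closest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(50, 2))])
labels, centers = mean_shift_flat(X, bandwidth=2.0)
print(centers)
```

As with DBSCAN, the number of clusters is not specified up front; it is determined by how many distinct modes survive for the chosen bandwidth.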
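Read the data from the web
import requests, re, numpy as np
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # The HTML tags that delimited this pattern are missing from this copy of the code;
    # restore them before running. Index [1] takes the second match, which should be
    # the three-column dataset 2 in section 4.2.2.
    data = re.findall(r'([\s\S]+?)', r.text)[1].strip()
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()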
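Mean shift
from sklearn.cluster import MeanShift
labels = MeanShift().fit(X).labels_
Visualization
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the 3D projection on older Matplotlib versions
fig = mp.figure()
ax = fig.add_subplot(111, projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
colors = ['red', 'blue', 'green', 'black']
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=150, alpha=0.3)
mp.show()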
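2. Clustering evaluation: Silhouette Coefficient
a(i): the mean distance from sample i to the other samples in its own cluster
b(i): the dissimilarity between sample i and the other clusters, i.e. the smallest mean distance from i to the samples of any other cluster
s(i) = (b(i) - a(i)) / max(a(i), b(i)): the silhouette coefficient of sample i, which lies in [-1, 1]
s(i) close to 1: sample i is well clustered
s(i) close to -1: sample i would fit better in another cluster
s(i) close to 0: sample i lies on the boundary between two clusters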
from sklearn import metrics
score = metrics.silhouette_score(X, labels)
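To make the definitions of a(i), b(i) and s(i) concrete, here is a minimal hand-rolled version checked against scikit-learn. The helper name silhouette_manual and the six toy points are assumptions made for illustration; silhouette_score is the mean of the per-sample values returned by silhouette_samples.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_manual(X, labels):
    """Per-sample silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a singleton cluster scores 0 by convention
        # a(i): mean distance to the other members of the same cluster
        # (the zero self-distance is in the sum, so divide by size - 1).
        a = dists[i, same].sum() / (same.sum() - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
manual = silhouette_manual(X, labels)
print(manual.mean(), silhouette_score(X, labels))
```

The pairwise distance matrix makes this O(n^2) in memory, so it is only meant for small examples.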
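2.1 Evaluating KMeans
Read the data from the web
import requests, re, numpy as np
def download():
    url = 'https://blog.csdn.net/Yellow_python/article/details/81461056'
    header = {'User-Agent': 'Opera/8.0 (Windows NT 5.1; U; en)'}
    r = requests.get(url, headers=header)
    # The HTML tags that delimited this pattern are missing from this copy of the code;
    # restore them before running. Index [0] takes the first match, which should be
    # the two-column dataset 1 in section 4.2.1.
    data = re.findall(r'([\s\S]+?)', r.text)[0].strip()
    array = np.array([i.split(',') for i in data.split()]).astype(float)
    return array
X = download()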
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as mp
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
m, n = 2, 6  # try n_clusters = m, ..., n - 1
for i in range(m, n):
    # KMeans clustering
    labels = KMeans(n_clusters=i).fit(X).labels_
    # Visualization: one subplot per value of n_clusters
    mp.subplot(1, n - m, i - m + 1)
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # Clustering evaluation: Silhouette Coefficient
    score = metrics.silhouette_score(X, labels)
    print('n_clusters = %d, silhouette score:' % i, score)
mp.tight_layout()
mp.show()
Output
n_clusters = 2, silhouette score: 0.6094103841500139
n_clusters = 3, silhouette score: 0.4249285827871494
n_clusters = 4, silhouette score: 0.3447569550742587
n_clusters = 5, silhouette score: 0.34076078057327047
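In the output above, n_clusters = 2 has the highest score, so by this criterion it is the preferred setting. Picking the best value can be automated, as in the sketch below. It uses synthetic make_blobs data as a stand-in (an assumption, not the article's dataset 1), with n_init=10 and random_state=0 added for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette wins.
best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

The silhouette is only defined for 2 or more clusters, which is why the search starts at k = 2.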
2.2 Evaluating DBSCAN
Create the data
import numpy as np
X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3], [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])
from sklearn.cluster import DBSCAN
from sklearn import metrics
import matplotlib.pyplot as mp
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
radii = [1.414, 1.415, 2]  # eps just below sqrt(2), just above sqrt(2), and 2
for i in range(3):
    # DBSCAN: density-based clustering
    labels = DBSCAN(eps=radii[i], min_samples=2).fit(X).labels_
    # Visualization: one subplot per radius
    mp.subplot(1, 3, i + 1)
    for x, l in zip(X, labels):
        mp.scatter(x[0], x[1], c=colors[l])
    # Clustering evaluation: Silhouette Coefficient
    score = metrics.silhouette_score(X, labels)
    print('eps = %.3f, silhouette score:' % radii[i], score)
mp.tight_layout()
mp.show()
Output
eps = 1.414, silhouette score: 0.36739772676132704
eps = 1.415, silhouette score: 0.6018738849706604
eps = 2.000, silhouette score: 0.6431136276704154
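The jump in score between eps = 1.414 and eps = 1.415 has a simple cause: the samples sit on an integer grid, and diagonal neighbours are exactly sqrt(2) = 1.41421... apart, which falls between the two radii. A short check, offered as a sketch that reuses the data from this section, counts how many sample pairs fall within each radius.

```python
import numpy as np

X = np.array([[1, 4], [6, 8], [1, 2], [6, 7], [5, 3], [5, 8], [2, 3],
              [8, 7], [2, 2], [4, 2], [8, 6], [7, 8], [5, 1]])

# Pairwise Euclidean distances, keeping each unordered pair once.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
pair_dists = dists[np.triu_indices(len(X), k=1)]

for eps in (1.414, 1.415, 2):
    print('eps = %.3f: %d neighbouring pairs' % (eps, np.sum(pair_dists <= eps)))

# Diagonal grid neighbours sit at sqrt(2), between the first two radii.
print('pairs at sqrt(2):', np.sum(np.isclose(pair_dists, np.sqrt(2))))
```

Every extra neighbouring pair is a potential link between core points, so a tiny change in eps can merge several small groups at once.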
2.3 Evaluating MeanShift
# Create the data ----------------------------------------------------------
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in scikit-learn 0.24
centers = [[0, 0, 0], [6, 4, 1], [9, 9, 9]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)
# Mean shift ---------------------------------------------------------------
from sklearn.cluster import MeanShift, estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=50)  # bandwidth (quantile, number of samples used)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
# Cluster labels
labels = ms.labels_
# Cluster centers
centers = ms.cluster_centers_
print(centers)
# Clustering evaluation ----------------------------------------------------
from sklearn import metrics
score = metrics.silhouette_score(X, labels)
print('Silhouette score: %.2f' % score)
# Visualization ------------------------------------------------------------
import matplotlib.pyplot as mp
from mpl_toolkits import mplot3d  # registers the 3D projection on older Matplotlib versions
fig = mp.figure()
ax = fig.add_subplot(111, projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib
colors = ['red', 'blue', 'green', 'purple', 'orange', 'cyan', 'gray', 'brown', 'yellow', 'pink', 'black']
# Clustered samples
for x, l in zip(X, labels):
    ax.scatter(x[0], x[1], x[2], c=colors[l], s=120, alpha=0.2)
# Cluster centers
for i in range(len(centers)):
    ax.scatter(centers[i][0], centers[i][1], centers[i][2], c=colors[i], s=200, marker='x')
mp.show()
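The quantile argument of estimate_bandwidth controls how wide the search window is: a larger quantile averages distances to more neighbours and so yields a larger bandwidth. The sketch below reuses the blobs from this section and compares three quantile values; the values 0.1 and 0.5 are assumptions added for comparison, and all samples are used for the estimate rather than n_samples=50.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

centers = [[0, 0, 0], [6, 4, 1], [9, 9, 9]]
X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=2, random_state=0)

results = {}
for quantile in (0.1, 0.2, 0.5):
    bandwidth = estimate_bandwidth(X, quantile=quantile)  # uses all samples
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X)
    results[quantile] = (bandwidth, len(ms.cluster_centers_))
    print('quantile = %.1f  bandwidth = %.3f  clusters = %d'
          % (quantile, bandwidth, results[quantile][1]))
```

A wider window smooths the density estimate, so it tends to merge nearby modes; a narrow one tends to split the data into more clusters.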
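4. Appendix
4.1 Translations
cluster: n. 簇 (a cluster); v. 群聚 (to gather into groups)
radius: 半径 (plural: radii)
cyan: 蓝绿色 (blue-green)
density: 密度
spatial: 空间的 (adj.)
distance: 距离
silhouette: 轮廓 (outline)
coefficient: 系数; 合作的 (adj., cooperating)
shift: n. 移动 (movement); vi. 转换 (to change); vt. 转移 (to transfer)
bandwidth: 带宽
quantile: n. [computing] 分位数; 分位点 (quantile; quantile point)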
4.2 Datasets
4.2.1 Dataset 1
1.093,1.227
-1.386,-2.334
1.040,1.181
-1.663,-0.969
-1.273,-0.990
1.920,1.882
1.106,0.759
-1.382,-0.594
1.707,0.892
-0.024,2.170
-0.287,-0.810
0.456,1.031
-0.597,-0.756
-1.059,-1.398
0.298,-0.198
1.086,1.873
-1.041,0.028
-0.632,-0.447
-0.654,-1.125
-1.417,-1.090
-1.462,-0.676
0.694,0.292
1.125,1.586
1.241,0.589
1.214,1.424
1.519,0.555
0.758,1.733
2.121,0.414
0.301,1.540
0.130,-1.809
1.738,1.721
2.362,0.127
-1.704,0.166
-0.625,-1.961
-0.537,-0.506
-1.600,-1.927
-1.776,-0.840
0.512,-0.036
-1.621,-0.591
0.430,-0.433
-0.532,1.392
-0.324,-1.648
-1.790,-1.277
-0.447,-0.809
-0.821,-0.204
1.587,2.345
1.893,2.138
1.570,0.909
1.006,2.072
-0.762,-1.656
0.791,1.094
1.684,0.259
0.768,0.819
2.058,1.240
1.188,0.488
-1.024,-1.701
1.279,0.078
-1.482,-1.414
-1.735,-0.493
-0.486,-1.391
-1.261,0.110
0.121,-0.456
-1.248,-1.448
-1.762,-0.418
0.022,1.278
0.619,0.782
0.983,1.257
-1.447,-1.496
-1.161,-0.519
0.371,0.148
0.463,1.232
0.154,-0.112
0.597,0.784
-0.686,-1.103
0.938,1.246
0.032,0.872
0.248,1.466
-1.517,0.146
0.467,-0.188
-0.774,-1.660
-1.405,-0.981
-0.900,-0.619
-0.430,-0.947
1.457,1.073
-1.212,-1.825
-1.688,-1.263
0.694,0.737
-1.299,0.158
1.266,1.200
-1.444,-0.074
1.896,0.877
0.813,1.034
0.478,0.653
-1.895,-0.736
1.027,0.888
0.358,1.633
-1.548,-0.330
1.076,1.241
-0.432,-1.093
1.437,1.077
4.2.2 Dataset 2
1.0050,1.8073,2.7386
1.9259,2.4306,0.2451
1.6545,0.4596,1.1321
-0.9050,2.0691,1.0404
2.7505,-0.4984,0.6483
-0.0468,2.2348,-0.5511
0.4628,0.6909,1.4954
-0.6743,-0.5436,-0.6257
1.9558,3.7735,2.3948
1.9286,3.3877,-0.5870
1.2800,1.3642,1.4622
0.3422,2.5313,2.7598
2.1371,0.4341,-0.0924
-0.3058,2.8154,-0.8745
-0.8878,0.9490,2.2726
0.5742,1.4715,0.0139
2.2790,1.9328,1.6814
3.1572,2.8172,-0.5836
3.0747,-0.7420,2.2908
1.9663,1.6197,2.9873
3.9519,-0.0063,-0.9588
1.7085,1.0218,-0.8038
1.0407,0.4408,1.0324
1.3979,-0.4051,0.8771
2.3624,2.4067,3.8003
1.5867,1.1362,0.9736
2.2561,-0.8329,3.8295
0.4978,2.5666,3.7187
2.3976,0.5715,1.1566
0.6931,2.8798,1.4282
0.9098,0.4515,2.6086
-0.9297,-0.6534,1.8047
2.7165,2.5808,3.1228
3.8165,2.4418,1.5569
2.9530,0.2010,2.6209
1.7720,3.0011,3.9736
3.1591,0.6074,1.1515
0.7855,2.1894,1.6889
1.5885,0.4375,2.8251
1.9275,1.2649,1.3579
1.6962,4.5295,4.3640
1.4627,3.3351,2.4612
2.2010,1.2483,4.5353
0.6686,3.0405,3.1677
2.9674,1.9180,1.7931
-0.6761,4.8257,1.4628
0.9636,2.0809,3.3889
0.0456,2.9660,1.7052
2.6762,9.3597,9.7102
2.2002,9.2157,6.9957
1.3057,8.5777,9.4119
0.9371,10.8166,9.4066
1.0637,7.9897,7.1191
0.7614,10.0867,7.1291
-0.1443,7.4889,8.3340
0.9175,8.2982,6.8904
2.9528,9.4759,7.1891
3.8934,10.4592,5.9079
3.4689,8.6936,8.2476
0.4932,10.5898,7.8731
3.7129,8.9430,5.2596
1.3458,10.4512,6.7899
0.9601,8.7501,7.9669
1.0073,8.2057,5.9963
4.6192,9.2501,10.2311
4.2770,9.6612,8.8755
3.0643,7.9322,10.9946
1.4359,8.8022,9.7678
4.9768,7.8666,7.7231
2.2322,9.6188,7.1282
2.3977,6.1008,9.5358
2.8024,7.9230,7.7847
1.7787,9.6656,8.5730
2.5588,9.2667,5.3680
2.2381,7.4282,8.5095
-0.3990,9.5905,7.2105
1.6016,6.4013,5.2564
0.2722,8.6830,5.5263
-0.0809,7.4194,8.4713
0.8066,6.9738,6.9004
9.1410,8.5563,6.4202
9.9706,8.4647,4.7648
9.2004,5.5402,5.7502
7.7044,7.2929,5.9979
10.0255,6.8514,4.1539
7.9756,8.3657,3.8871
7.5363,5.7118,5.9797
8.0823,5.8570,3.7665
7.1929,10.9547,5.5867
7.1595,10.1112,3.8669
8.6083,8.6523,5.0490
6.2183,9.3562,4.0778
8.5655,7.1762,2.7066
6.6339,10.6751,2.1566
6.3456,8.5746,4.9021
6.7410,7.2915,3.9987
7.1588,9.6255,7.4575
8.4862,9.7134,5.3465
7.8276,7.9110,6.6756
5.9208,8.8276,7.3346
8.7747,7.0975,5.5719
6.1980,8.1578,4.7254
5.8346,7.6449,6.6972
5.8241,6.9158,4.8532
8.8481,8.2460,4.6843
9.9423,8.5196,2.1585
8.1127,6.3995,4.4851
7.7731,9.5787,5.9392
8.5547,6.3210,3.8450
6.6214,9.6489,2.2730
6.8169,7.2050,5.9877
6.8193,7.5884,3.0205
9.6705,9.1018,4.9061
9.7746,8.8037,2.3269
9.8415,7.0949,5.7011
8.2221,8.3008,5.6954
10.5455,7.2361,2.2893
7.9733,8.6451,2.1530
7.8287,6.6419,4.5996
8.0243,6.2347,3.9297
9.2898,10.9877,7.0184
9.2897,10.3033,4.8531
9.2510,7.3422,7.1935
6.6630,10.3948,6.8442
9.7234,7.7998,5.5406
6.7309,10.5378,5.5303
6.9013,7.0826,6.2990
7.4853,7.9020,4.5881
10.9317,10.9432,5.1278
10.3645,10.3791,4.4309
9.4693,8.6197,5.1706
8.4627,10.2630,5.3924
10.1341,7.2184,3.0029
8.4236,10.5672,4.8410
8.2941,7.5425,5.4322
7.2079,7.7771,3.5309
10.9911,9.9904,4.8578
10.0575,10.4442,2.7242
10.9198,8.1909,4.7242
7.8456,10.1644,4.4754
9.3154,7.7046,1.6364
8.5910,10.6952,1.2513
8.1458,7.0717,4.8291
8.3870,7.6361,1.0139