In the earlier Support Vector Machine (classification) section we used the handwritten digit image data bundled with scikit-learn, which is only a subset of the original dataset. This time we use the complete version of that dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/
Python source code:
#coding=utf-8
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#-------------
from sklearn.cluster import KMeans
#-------------
from sklearn import metrics
#-------------
from sklearn.metrics import silhouette_score
#-------------load data
digits_train=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra',header=None)
digits_test=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes',header=None)
#separate the 64-dimensional pixel features from the 1-dimensional digit target
X_train=digits_train[np.arange(64)]
y_train=digits_train[64]
X_test=digits_test[np.arange(64)]
y_test=digits_test[64]
#-------------training
#initialize KMeans and set the number of clusters
kmeans=KMeans(n_clusters=10)
kmeans.fit(X_train)
y_pred=kmeans.predict(X_test)
#-------------performance measure by ARI(Adjusted Rand Index)
print(metrics.adjusted_rand_score(y_test,y_pred))
Result:
0.667773154268
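The Adjusted Rand Index is used here because K-Means assigns arbitrary cluster ids (cluster "3" has nothing to do with the digit 3), so the predictions cannot be compared with y_test as a plain accuracy. The small sketch below is a supplementary illustration (not part of the original code): adjusted_rand_score only measures whether samples are grouped together in the same way, independent of the label values.
#coding=utf-8
#supplementary sketch: ARI is invariant to a permutation of cluster ids
from sklearn.metrics import adjusted_rand_score

y_true=[0,0,1,1,2,2]
#same grouping as y_true, only the label ids are permuted -> ARI is 1.0
print(adjusted_rand_score(y_true,[2,2,0,0,1,1]))
#an unrelated grouping -> ARI near 0 (it can even be slightly negative)
print(adjusted_rand_score(y_true,[0,1,2,0,1,2]))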
#coding=utf-8
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
#-------------performance measure by Silhouette Coefficient
#divide the figure into a 3*2=6 subplot grid and draw the raw instances on subplot 1
plt.subplot(3,2,1)
#initialize data
x1=np.array([1,2,3,1,5,6,5,5,6,7,8,9,7,9])
x2=np.array([1,3,2,2,8,6,7,6,7,1,2,1,1,3])
X=np.array(list(zip(x1,x2))).reshape(len(x1),2)
#
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Instances')
plt.scatter(x1,x2)
colors=['b','g','r','c','m','y','k','b']
markers=['o','s','D','v','^','p','*','+']
clusters=[2,3,4,5,8]
subplot_counter=1
sc_scores=[]
for t in clusters:
    subplot_counter+=1
    plt.subplot(3,2,subplot_counter)
    kmeans_model=KMeans(n_clusters=t).fit(X)
    #plot each sample with the color and marker of its assigned cluster
    for i,l in enumerate(kmeans_model.labels_):
        plt.plot(x1[i],x2[i],color=colors[l],marker=markers[l],ls='None')
    plt.xlim([0,10])
    plt.ylim([0,10])
    sc_score=silhouette_score(X,kmeans_model.labels_,metric='euclidean')
    sc_scores.append(sc_score)
    plt.title('K=%s, silhouette coefficient=%0.03f'%(t,sc_score))
plt.figure()
plt.plot(clusters,sc_scores,'*-')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Coefficient Score')
plt.show()
Result:
It can be observed that the silhouette coefficient is largest when the number of cluster centers is 3, which also matches the distribution of the data, so 3 is a relatively reasonable number of clusters.
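For each sample, the silhouette value is s=(b-a)/max(a,b), where a is the mean distance to the other samples in its own cluster and b is the mean distance to the samples in the nearest other cluster; silhouette_score is simply the mean of these per-sample values. The sketch below is a supplementary check (not part of the original code) using sklearn's silhouette_samples on the same toy data.
#coding=utf-8
#supplementary sketch: silhouette_score equals the mean of per-sample silhouette values
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,silhouette_samples

x1=np.array([1,2,3,1,5,6,5,5,6,7,8,9,7,9])
x2=np.array([1,3,2,2,8,6,7,6,7,1,2,1,1,3])
X=np.array(list(zip(x1,x2))).reshape(len(x1),2)

labels=KMeans(n_clusters=3).fit(X).labels_
per_sample=silhouette_samples(X,labels,metric='euclidean')
print(per_sample.mean())                                #mean of per-sample values
print(silhouette_score(X,labels,metric='euclidean'))    #same number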
#coding=utf-8
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
#randomly generate 3 clusters, each with 10 samples
cluster1=np.random.uniform(0.5,1.5,(2,10))
cluster2=np.random.uniform(5.5,6.5,(2,10))
cluster3=np.random.uniform(3.0,4.0,(2,10))
#plot the samples
X=np.hstack((cluster1,cluster2,cluster3)).T
plt.scatter(X[:,0],X[:,1])
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
#
K=range(1,10)
meandistortions=[]
for k in K:
    kmeans=KMeans(n_clusters=k)
    kmeans.fit(X)
    #average distance from each sample to its nearest cluster center
    meandistortions.append(sum(np.min(cdist(X,kmeans.cluster_centers_,'euclidean'),axis=1))/X.shape[0])
plt.plot(K,meandistortions,'bx-')
plt.xlabel('k')
plt.ylabel('Average Dispersion')
plt.title('Selecting k with the Elbow Method')
plt.show()
Result:
The data points are randomly sampled from three clusters. When the number of clusters K is 1 or 2, the average distance from samples to their assigned cluster center drops very quickly, which means changing K greatly alters the overall clustering structure and the algorithm still has a lot of room to converge further, so such K values do not reflect the true number of clusters. When K=3, the decline of the average distance slows down markedly, meaning that increasing K further no longer helps the algorithm converge; K=3 is therefore the relatively best number of clusters.
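As a supplementary note (not part of the original code): scikit-learn's KMeans also exposes the attribute inertia_, the sum of squared distances of samples to their closest cluster center, so the elbow curve can be built without an explicit cdist computation. The numbers differ from the averaged-distance version above (inertia_ sums squared distances instead of averaging unsquared ones), but the elbow appears at the same K.
#coding=utf-8
#supplementary sketch: elbow curve built from KMeans.inertia_
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

cluster1=np.random.uniform(0.5,1.5,(2,10))
cluster2=np.random.uniform(5.5,6.5,(2,10))
cluster3=np.random.uniform(3.0,4.0,(2,10))
X=np.hstack((cluster1,cluster2,cluster3)).T

K=range(1,10)
inertias=[KMeans(n_clusters=k).fit(X).inertia_ for k in K]
plt.plot(K,inertias,'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances (inertia)')
plt.title('Selecting k with the Elbow Method (inertia_)')
plt.show()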