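Practical application
The example below (implemented in R; the complete code is in the attachment) shows how to use kmeans and ties together everything covered above.
Load the iris data set, which is used frequently in machine learning; the task is to classify the species of a flower from the sizes of several of its parts. The experiment uses the fpc library to estimate the silhouette coefficient; if it is not installed, install it with install.packages.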
# install.packages("fpc")
library(fpc)
library(datasets)
# names(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
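# Min-max normalization: rescale a vector to the range [0, 1]
min.max.norm <- function(x){
  (x-min(x))/(max(x)-min(x))
}
# Keep the four numeric features and drop the Species label
raw.data <- iris[,1:4]
norm.data <- data.frame(sl = min.max.norm(raw.data[,1]),
                        sw = min.max.norm(raw.data[,2]),
                        pl = min.max.norm(raw.data[,3]),
                        pw = min.max.norm(raw.data[,4]))
head(norm.data)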
## sl sw pl pw
## 1 0.22222222 0.6250000 0.06779661 0.04166667
## 2 0.16666667 0.4166667 0.06779661 0.04166667
## 3 0.11111111 0.5000000 0.05084746 0.04166667
## 4 0.08333333 0.4583333 0.08474576 0.04166667
## 5 0.19444444 0.6666667 0.06779661 0.04166667
## 6 0.30555556 0.7916667 0.11864407 0.12500000
All 4 features of iris are normalized; each feature is the size of one part of the flower.
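The same normalization can also be applied to every column in one pass. The following is a minimal sketch, not the author's code: it redefines `min.max.norm` so that it runs on its own, and the name `norm.all` is hypothetical.

```r
# Min-max normalization applied to all four numeric columns of iris at once.
min.max.norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# lapply returns one normalized vector per column;
# as.data.frame turns the list back into a table.
norm.all <- as.data.frame(lapply(iris[, 1:4], min.max.norm))

# Each feature should now have minimum 0 and maximum 1.
sapply(norm.all, range)
```

This version keeps the original column names (Sepal.Length and so on) instead of the short names sl, sw, pl, pw.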
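# Evaluate K from 2 to 8
K <- 2:8
round <- 30   # repeat 30 times per K to smooth out random initialization
rst <- sapply(K, function(i){
  print(paste("K=",i))
  mean(sapply(1:round,function(r){
    print(paste("Round",r))
    result <- kmeans(norm.data, i)
    stats <- cluster.stats(dist(norm.data), result$cluster)
    stats$avg.silwidth   # average silhouette width of this run
  }))
})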
## [1] "K= 2"
## [1] "Round 1"
## [1] "Round 2"
## ... (progress lines continue through "Round 30", and the same output repeats for K= 3 through K= 8)
plot(K,rst,type='l',main='Silhouette coefficient vs. K', ylab='Silhouette coefficient')
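Evaluating k: K is usually not very large, since too many clusters are hard to interpret, so K is swept from 2 to 8. Because kmeans starts from random centers and does not always converge to the global minimum, it is run 30 times for each value of k, the silhouette coefficient is computed for each run, and the average is used as the final score. The result is the plot above.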
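To make the score concrete, the average silhouette width can be computed by hand. The sketch below is an illustration in base R, not the implementation used by fpc; the function name `avg.silhouette` is hypothetical, and it assumes at least two clusters. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b).

```r
# Average silhouette width computed by hand (base R only).
# d:  a "dist" object
# cl: vector of cluster labels, one per observation (at least 2 clusters)
avg.silhouette <- function(d, cl) {
  m <- as.matrix(d)
  n <- nrow(m)
  s <- numeric(n)
  for (i in seq_len(n)) {
    own <- cl == cl[i]
    own[i] <- FALSE                 # exclude the point itself
    if (!any(own)) next             # singleton cluster: silhouette stays 0
    a <- mean(m[i, own])            # mean distance within its own cluster
    others <- setdiff(unique(cl), cl[i])
    # mean distance to each other cluster; keep the nearest one
    b <- min(sapply(others, function(g) mean(m[i, cl == g])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Usage on min-max normalized iris with k = 2 (seed chosen arbitrarily).
x <- as.data.frame(lapply(iris[, 1:4],
                          function(v) (v - min(v)) / (max(v) - min(v))))
set.seed(1)
fit <- kmeans(x, 2)
sil <- avg.silhouette(dist(x), fit$cluster)
print(sil)
```

Values close to 1 mean that points sit much closer to their own cluster than to the nearest other cluster; values near 0 mean the clusters overlap.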
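The silhouette coefficient is largest at k = 2, even though there are actually 3 species. The original data is 4-dimensional and cannot be visualized directly, so after clustering, multidimensional scaling (MDS) is used to reduce it to 2 dimensions and inspect the result.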
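# Reduce the dimensionality to inspect the clusters
old.par <- par(mfrow = c(1,2))   # two plots side by side
k <- 2   # best k according to the evaluation above
clu <- kmeans(norm.data, k)
mds <- cmdscale(dist(norm.data, method="euclidean"))   # classical MDS down to 2 dimensions
plot(mds, col=clu$cluster, main='kmeans clusters, k=2', pch = 19)
plot(mds, col=iris$Species, main='Original classes', pch = 19)
par(old.par)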
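The left-hand cluster matches between the original classes and the kmeans result very well. On the right, however, the original data points run together, so kmeans cannot separate them well and other methods are needed.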
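The visual comparison can be backed up with a contingency table of species against cluster labels. This is a small sketch rather than part of the original experiment: it rebuilds the normalized data so that it runs on its own, the seed is arbitrary, and `nstart = 10` is an added assumption to make the result less dependent on the random start.

```r
# Cross-tabulate kmeans clusters against the known species labels.
norm.data <- as.data.frame(lapply(iris[, 1:4],
                                  function(v) (v - min(v)) / (max(v) - min(v))))
set.seed(42)
clu <- kmeans(norm.data, 2, nstart = 10)

# Rows are the true species, columns are the kmeans cluster ids.
tab <- table(species = iris$Species, cluster = clu$cluster)
print(tab)
```

A species whose 50 flowers fall almost entirely into one column is well captured by the clustering; two species sharing a column correspond to the merged group on the right of the plots.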
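kmeans best practices
Pick k random points from the training data as the initial centers.
Once k is chosen, run the algorithm n times from different random starts and keep the run with the smallest cost function value as the final clustering, to avoid a poor local optimum caused by the random initialization.
Elbow method for choosing k: draw a scatter plot of the cost function against k and set k at the point where there is a clear elbow (as shown below); this can be combined with the silhouette coefficient.
Sometimes k has to be chosen according to the application scenario rather than purely from evaluation metrics.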
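The restart and elbow recommendations can be sketched together in a few lines of R. This is an illustration under stated assumptions, not the author's code: the cost function is taken to be the total within-cluster sum of squares (`tot.withinss` in the object returned by `kmeans`), the number of restarts is set to 20 through the `nstart` argument, and the seed is arbitrary.

```r
# Elbow plot: cost (total within-cluster sum of squares) against k.
norm.data <- as.data.frame(lapply(iris[, 1:4],
                                  function(v) (v - min(v)) / (max(v) - min(v))))
set.seed(123)
K <- 1:8

# nstart = 20 makes kmeans try 20 random sets of initial centers
# and return the fit with the lowest total within-cluster sum of squares.
wss <- sapply(K, function(k) {
  kmeans(norm.data, centers = k, nstart = 20)$tot.withinss
})

plot(K, wss, type = "b", pch = 19,
     xlab = "k", ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")
```

The cost always drops as k grows, so the point to look for is where the curve flattens, not where it is lowest.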