MATLAB silhouette coefficient: k-means clustering theory, choosing K (silhouette coefficient)

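Practical application

The example below (implemented in R; the full code is in the attachment) walks through how to use kmeans and ties together everything discussed above.

Load the experimental dataset iris. It is used frequently in machine learning: the species of a flower is classified from the sizes of several parts of the flower. The experiment uses the fpc package to estimate the silhouette coefficient; if it is not installed, install it with install.packages.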

# install.packages("fpc")

library(fpc)

library(datasets)

# names(iris)

head(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species

## 1 5.1 3.5 1.4 0.2 setosa

## 2 4.9 3.0 1.4 0.2 setosa

## 3 4.7 3.2 1.3 0.2 setosa

## 4 4.6 3.1 1.5 0.2 setosa

## 5 5.0 3.6 1.4 0.2 setosa

## 6 5.4 3.9 1.7 0.4 setosa

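# Min-max normalize the data to the range [0, 1]

min.max.norm <- function(x){
  (x-min(x))/(max(x)-min(x))
}

# Keep only the four numeric features (drop Species)
raw.data <- iris[,1:4]

# Normalize each feature separately; sl/sw/pl/pw = sepal/petal length/width
norm.data <- data.frame(sl = min.max.norm(raw.data[,1]),
                        sw = min.max.norm(raw.data[,2]),
                        pl = min.max.norm(raw.data[,3]),
                        pw = min.max.norm(raw.data[,4]))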

head(norm.data)

## sl sw pl pw

## 1 0.22222222 0.6250000 0.06779661 0.04166667

## 2 0.16666667 0.4166667 0.06779661 0.04166667

## 3 0.11111111 0.5000000 0.05084746 0.04166667

## 4 0.08333333 0.4583333 0.08474576 0.04166667

## 5 0.19444444 0.6666667 0.06779661 0.04166667

## 6 0.30555556 0.7916667 0.11864407 0.12500000

The four iris features are min-max normalized; each feature is the size of some part of the flower.
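As a minimal sketch of the same idea, the normalization can also be applied to every column in one pass. The names `min_max_norm` and `norm_all` are illustrative and not part of the original code, and the guard for constant columns is an added assumption (iris has no constant column).

```r
# Sketch: min-max normalize every numeric column in one pass (base R only).
# `min_max_norm` and `norm_all` are illustrative names, not from the original code.
min_max_norm <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Constant column: avoid dividing by zero and map everything to 0
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# lapply walks over the columns; the original column names are kept
norm_all <- as.data.frame(lapply(iris[, 1:4], min_max_norm))

# Each column should now have minimum 0 and maximum 1
print(sapply(norm_all, range))
```

Unlike the code above, this version keeps the original column names (Sepal.Length and so on) instead of renaming them to sl, sw, pl, pw.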

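# Evaluate K for K from 2 to 8

K <- 2:8

# Number of repeated runs per K (note: this name shadows base::round)
round <- 30

rst <- sapply(K, function(i){
  print(paste("K=",i))
  mean(sapply(1:round,function(r){
    print(paste("Round",r))
    # Cluster with i centers, then compute the clustering statistics with fpc
    result <- kmeans(norm.data, i)
    stats <- cluster.stats(dist(norm.data), result$cluster)
    # Average silhouette width of this run
    stats$avg.silwidth
  }))
})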

## [1] "K= 2"

## [1] "Round 1"

## [1] "Round 2"

## ...

## [1] "Round 30"

## (the same "Round 1" to "Round 30" lines are printed again for each K from 3 to 8)

plot(K,rst,type='l',main='Silhouette coefficient vs. K', ylab='Silhouette coefficient')

[Figure 1: average silhouette coefficient versus K]

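Evaluating k: K is usually not very large, and a large K is also hard to interpret, so K is scanned from 2 to 8. Because kmeans involves randomness and does not converge to the global minimum on every run, each value of k is run 30 times, the silhouette coefficient is computed for each run, and the average is taken as the final score. The result is the plot above.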
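To make the score itself concrete, here is a minimal sketch that computes the average silhouette width by hand in base R. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's silhouette is (b - a) / max(a, b). The helper `avg_silhouette` is a hypothetical name and not a function from fpc; scoring singleton clusters as 0 is the usual convention and is an assumption here.

```r
# Sketch: average silhouette width computed by hand (base R only).
# `avg_silhouette` is an illustrative helper, not a function from fpc.
avg_silhouette <- function(d, cluster) {
  d <- as.matrix(d)
  labels <- unique(cluster)
  s <- sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)  # convention: singleton clusters score 0
    # a: mean distance to the other members of the same cluster
    # (d[i, i] is 0, so dividing by size - 1 excludes the point itself)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(sapply(setdiff(labels, cluster[i]),
                    function(l) mean(d[i, cluster == l])))
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- as.data.frame(lapply(iris[, 1:4],
                          function(v) (v - min(v)) / (max(v) - min(v))))
d <- dist(x)
for (k in 2:4) {
  fit <- kmeans(x, centers = k, nstart = 10)
  cat("k =", k, "avg silhouette =",
      round(avg_silhouette(d, fit$cluster), 3), "\n")
}
```

Values close to 1 mean points sit much closer to their own cluster than to the nearest other cluster; values near 0 mean the clusters overlap.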

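The silhouette coefficient is largest at k = 2, even though there are actually 3 species. After clustering, because the original data is 4-dimensional and cannot be plotted directly, multidimensional scaling (MDS) is used to reduce it to 2 dimensions so the clustering can be inspected.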
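Before trusting a 2-D picture, it can help to check how much of the distance structure the projection keeps. The sketch below is an illustration rather than part of the original workflow: it calls `cmdscale` with `eig = TRUE` and prints the goodness-of-fit values that base R reports. The variable names `x`, `fit` and `coords` are assumptions.

```r
# Sketch: classical MDS to 2 dimensions with a goodness-of-fit check (base R only).
x <- as.data.frame(lapply(iris[, 1:4],
                          function(v) (v - min(v)) / (max(v) - min(v))))

# eig = TRUE returns a list (points, eig, GOF, ...) instead of just the coordinates
fit <- cmdscale(dist(x, method = "euclidean"), k = 2, eig = TRUE)

coords <- fit$points  # one row per flower, two columns
print(dim(coords))

# GOF is a length-2 vector; values near 1 mean the 2-D map keeps most of the structure
print(fit$GOF)
```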

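# Reduce the dimensionality to inspect the result

# Show two plots side by side; keep the old settings so they can be restored
old.par <- par(mfrow = c(1,2))

# k = 2 scored best in the evaluation above
k <- 2

clu <- kmeans(norm.data, k)

# Classical multidimensional scaling to 2 dimensions
mds <- cmdscale(dist(norm.data, method="euclidean"))

plot(mds, col=clu$cluster, main='kmeans clustering, k=2', pch = 19)

plot(mds, col=iris$Species, main='Original classes', pch = 19)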

[Figure 2: MDS projection colored by kmeans cluster (k=2) and by original species]

par(old.par)

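The cluster on the left matches the original classification well. On the right, the original data points run together, so kmeans cannot separate them well and another method is needed.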
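The visual comparison can be backed by a cross-tabulation of cluster labels against the known species. This is a minimal sketch: the seed and `nstart = 10` are arbitrary choices made for repeatability, not settings taken from the original code.

```r
# Sketch: compare kmeans clusters (k = 2) with the known species labels.
set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- as.data.frame(lapply(iris[, 1:4],
                          function(v) (v - min(v)) / (max(v) - min(v))))

clu <- kmeans(x, centers = 2, nstart = 10)

# Rows are clusters, columns are species; each cell counts flowers
tab <- table(cluster = clu$cluster, species = iris$Species)
print(tab)
```

A species that falls almost entirely into one row is well captured by that cluster; species that share a row are the ones kmeans merges at k = 2.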

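kmeans best practices

Randomly pick k points from the training data as the initial centers.

Once k is fixed, run the algorithm n times with random initialization and keep the run with the smallest cost function value as the final clustering. This avoids a local optimum caused by the random start.

Elbow method for choosing k: draw a scatter plot of the cost function against k and set k at the point where the curve shows a clear bend (as below). This can be combined with the silhouette coefficient.

Sometimes k has to be chosen according to the application scenario rather than purely from evaluation metrics.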
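The first three points can be sketched with base R alone. In `kmeans`, passing a number as `centers` picks that many random rows as initial centers, and `nstart` repeats the run and keeps the result with the smallest total within-cluster sum of squares (`tot.withinss`), which plays the role of the cost function here. The range 1 to 8, `nstart = 20` and the seed are illustrative assumptions.

```r
# Sketch: repeated random starts plus an elbow plot (base R only).
set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- as.data.frame(lapply(iris[, 1:4],
                          function(v) (v - min(v)) / (max(v) - min(v))))

ks <- 1:8

# For each k, kmeans tries 20 random starts and returns the best one;
# tot.withinss is the cost of that best run
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)

# Look for the k where the curve stops dropping sharply
plot(ks, wss, type = "b", xlab = "k",
     ylab = "Total within-cluster sum of squares", main = "Elbow plot")
```

The cost always falls as k grows, so the elbow is a judgment about where the improvement levels off, which is why it is worth checking against the silhouette coefficient and the application's needs.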
