2 Clustering - Hierarchical Clustering

Hierarchical Agglomerative Clustering (HAC)

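Complete-linkage clustering is one variant of HAC. Unlike single-linkage clustering, it measures the distance between two clusters by their farthest pair of points rather than their closest pair.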
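The difference between the two linkage distances can be sketched on toy data. The two clusters below are made up for illustration, and `cdist` from SciPy is used only to get all pairwise Euclidean distances; this is a minimal sketch, not how a clustering library computes linkage internally.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small, made-up clusters of 2-D points (illustrative data only).
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B.
pairwise = cdist(cluster_a, cluster_b)

single_link = pairwise.min()    # closest pair: (1, 0) and (3, 0)
complete_link = pairwise.max()  # farthest pair: (0, 0) and (5, 0)

print(single_link, complete_link)  # prints: 2.0 5.0
```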

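Hierarchical clustering

Advantages

  • The resulting hierarchy helps with data visualization
  • Well suited to certain specialized datasets and domains

Disadvantages

  • Sensitive to outliers and noise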
  • Computationally expensive: at least O(n^2) in the number of samples n
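One way to see where the quadratic cost comes from is to count the pairwise distances that a distance-based method has to consider. The sketch below uses random data of an arbitrary size (150 samples, 4 features, both chosen here purely for illustration) and SciPy's `pdist`, which returns one distance per unordered pair of samples.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Random data standing in for a real dataset (sizes are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
n = X.shape[0]

# pdist returns a condensed vector with one entry per unordered pair.
condensed = pdist(X)
n_pairs = condensed.shape[0]

print(n, n_pairs)  # prints: 150 11175, i.e. n * (n - 1) / 2 pairs
```

Doubling n roughly quadruples the number of pairs, which is why the method becomes costly on large datasets.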

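Using sklearn

  • Agglomerative clustering example
from sklearn import datasets, cluster
# Iris dataset (first 10 samples only)
X = datasets.load_iris().data[:10]

# Ward linkage merges the pair of clusters that least increases within-cluster variance
clust = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')

# Fit the model and return one cluster label per sample
labels = clust.fit_predict(X)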
  • Drawing a dendrogram
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn import datasets
import matplotlib.pyplot as plt

# Iris dataset (first 10 samples only)
X = datasets.load_iris().data[:10]

# Build the linkage matrix with Ward's method, then plot the tree
linkage_matrix = ward(X)
dendrogram(linkage_matrix)

plt.show()
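The same linkage matrix can also be cut into flat cluster labels without plotting. The sketch below assumes SciPy's `fcluster` with `criterion='maxclust'`; the choice of at most 3 clusters is an assumption made here to mirror the `n_clusters=3` used in the sklearn example.

```python
from scipy.cluster.hierarchy import fcluster, ward
from sklearn import datasets

# Iris dataset (first 10 samples only), as in the dendrogram example
X = datasets.load_iris().data[:10]

# Linkage matrix built with Ward's method
linkage_matrix = ward(X)

# Cut the tree so that at most 3 flat clusters remain.
# fcluster numbers its clusters starting from 1, not 0.
flat_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')

print(flat_labels)
```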

  • Three linkage options for hierarchical clustering
from sklearn import datasets
iris = datasets.load_iris()

from sklearn.cluster import AgglomerativeClustering
ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(iris.data)

complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(iris.data)

avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(iris.data)

from sklearn.metrics import adjusted_rand_score
ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
avg_ar_score = adjusted_rand_score(iris.target, avg_pred)

Scores: 
Ward: 0.731198556771 
Complete:  0.642251251836 
Average:  0.759198707107

# Normalize the data (preprocessing.normalize rescales each sample to unit L2 norm by default)
from sklearn import preprocessing
normalized_X = preprocessing.normalize(iris.data)

ward = AgglomerativeClustering(n_clusters=3)
ward_pred = ward.fit_predict(normalized_X)

complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_pred = complete.fit_predict(normalized_X)

avg = AgglomerativeClustering(n_clusters=3, linkage="average")
avg_pred = avg.fit_predict(normalized_X)


ward_ar_score = adjusted_rand_score(iris.target, ward_pred)
complete_ar_score = adjusted_rand_score(iris.target, complete_pred)
avg_ar_score = adjusted_rand_score(iris.target, avg_pred)

Scores: 
Ward: 0.885697031028 
Complete:  0.644447235392 
Average:  0.558371443754
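The comparison above can be written more compactly as a loop that also prints the scores. This is a sketch of the same experiment, not the original script: the dictionary names `inputs` and `scores` are introduced here for illustration, and the exact numbers printed may differ across sklearn versions.

```python
from sklearn import datasets, preprocessing
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()

# Raw features versus features rescaled to unit L2 norm per sample
inputs = {
    "raw": iris.data,
    "normalized": preprocessing.normalize(iris.data),
}

scores = {}
for data_name, X in inputs.items():
    for linkage in ("ward", "complete", "average"):
        model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
        pred = model.fit_predict(X)
        # Adjusted Rand index against the true species labels
        scores[(data_name, linkage)] = adjusted_rand_score(iris.target, pred)
        print(f"{data_name:10s} {linkage:8s} {scores[(data_name, linkage)]:.4f}")
```

Keeping the scores in one dictionary makes it easy to compare how normalization changes the ranking of the linkage methods.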
