A summary of the characteristics of the clustering models in sklearn

The clustering models in sklearn include:
K-means
Affinity propagation
Mean-shift
Spectral clustering
Ward hierarchical clustering
Agglomerative clustering
DBSCAN
Gaussian mixtures
Birch

Since clustering is an unsupervised learning problem, it is hard to evaluate the results, so choosing a model that fits the data from the start is very important. Below is a summary of each model, so that later, when the need arises, the right model can be picked based on the specifics of the data.

K-means requires the number of clusters to be set in advance. It scales well, but because its computation is simple (it only computes distances between points, in flat geometry), it is suited only to fairly general-purpose clustering: it works best when the clusters are of similar size and there are not too many of them.
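A minimal sketch of the point above, using illustrative toy data (two well-separated blobs, not from this article): the cluster count must be fixed up front via `n_clusters`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative toy data: two well-separated blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# The number of clusters must be given in advance.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)  # one label per sample, two distinct labels in total
```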

Affinity propagation chooses the number of clusters automatically from the data and works with graph distance (not flat geometric distance). It can handle many clusters of uneven sizes, but its time and space complexity are both high, so it is best suited to small datasets.
(From the scikit-learn documentation: AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.
Affinity Propagation can be interesting as it chooses the number of clusters based on the data provided. For this purpose, the two important parameters are the preference, which controls how many exemplars are used, and the damping factor which damps the responsibility and availability messages to avoid numerical oscillations when updating these messages.
The main drawback of Affinity Propagation is its complexity. The algorithm has a time complexity of the order O(N^2 T), where N is the number of samples and T is the number of iterations until convergence. Further, the memory complexity is of the order O(N^2) if a dense similarity matrix is used, but reducible if a sparse similarity matrix is used. This makes Affinity Propagation most appropriate for small to medium sized datasets.)
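A minimal sketch of the two key parameters named above, on illustrative toy data; the `preference` value here is an assumption chosen so the message passing settles on two exemplars.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Illustrative toy data: two tight blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# No n_clusters argument: preference (how willing a point is to be
# an exemplar) and damping (to avoid oscillations) steer how many
# clusters emerge from the data.
ap = AffinityPropagation(preference=-10, damping=0.9, random_state=0)
labels = ap.fit_predict(X)
print(len(ap.cluster_centers_indices_))  # number of exemplars found
```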

Mean-shift works with flat geometric distance and can find many clusters of uneven sizes.
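A minimal sketch on the same kind of toy data: Mean-shift takes a kernel bandwidth rather than a cluster count, and the number of clusters follows from the density modes it finds (the bandwidth value here is an assumption tuned to the toy data).

```python
import numpy as np
from sklearn.cluster import MeanShift

# Illustrative toy data: two blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# The key parameter is the kernel bandwidth, not a cluster count;
# each density mode found becomes one cluster.
ms = MeanShift(bandwidth=2.0)
labels = ms.fit_predict(X)
```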

Spectral clustering requires the number of clusters to be given. It suits datasets with few clusters, the cluster sizes may be uneven, and it works with graph distance.
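A minimal sketch: with the default RBF affinity, a similarity graph is built over the samples and then partitioned spectrally, which is where the "graph distance" character comes from.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Illustrative toy data: two blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# n_clusters must be given; the default RBF affinity builds a
# similarity graph whose spectral partition yields the labels.
sc = SpectralClustering(n_clusters=2, random_state=0)
labels = sc.fit_predict(X)
```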

Ward hierarchical clustering requires the number of clusters to be given. It can be used on large datasets (many clusters, many samples) and works with geometric distance.
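In sklearn, Ward is exposed through `AgglomerativeClustering` with `linkage="ward"`; a minimal sketch on toy data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative toy data: two blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# Ward linkage minimizes within-cluster variance, which is why it
# only works with Euclidean (geometric) distance.
ward = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = ward.fit_predict(X)
```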

Agglomerative clustering requires the number of clusters to be given. It can be used on large datasets (many clusters, many samples) and can work with non-Euclidean distances.

DBSCAN requires a neighborhood distance between samples to be given. It is suitable for very large datasets, and the cluster sizes may be unbalanced.
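A minimal sketch: `eps` is the neighborhood radius the text refers to, and the number of clusters is not given at all; points unreachable from any dense region come back with the noise label -1 (the isolated point in the toy data below is added to show this).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative toy data: two dense blobs plus one isolated point.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1],
              [3.5, 3.5]])

# eps is the neighborhood radius between samples; the isolated
# point falls in no dense region and is labeled -1 (noise).
db = DBSCAN(eps=0.5, min_samples=2)
labels = db.fit_predict(X)
```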

Gaussian mixtures have many parameters that need to be defined, and they are well suited to density estimation.
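A minimal sketch: besides `n_components`, the `covariance_type` must be chosen (spherical here is an assumption to keep the tiny toy fit stable), and the fitted model doubles as a density estimate via `score_samples`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative toy data: two blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# n_components and covariance_type both need to be set; the fitted
# mixture is a full probabilistic model of the data.
gm = GaussianMixture(n_components=2, covariance_type="spherical",
                     random_state=0)
labels = gm.fit_predict(X)
log_density = gm.score_samples(X)  # per-sample log-likelihood
```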

Birch can be used on large datasets, supports outlier removal, and works with the Euclidean distance between points.
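A minimal sketch: Birch builds a CF-tree incrementally, which is what makes it scale to large datasets; `threshold` controls the subcluster radius (the value here is an assumption tuned to the toy data) and `n_clusters` sets the final grouping.

```python
import numpy as np
from sklearn.cluster import Birch

# Illustrative toy data: two blobs.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])

# threshold bounds the radius of each CF-tree subcluster;
# n_clusters sets the final number of clusters.
brc = Birch(threshold=0.5, n_clusters=2)
labels = brc.fit_predict(X)
```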
