Excerpts on Clustering Algorithms

I have been applying clustering algorithms recently and gathered some material online, excerpted here. The reference links are listed at the end; apologies for any omissions.

A brief classification of clustering algorithms:
1. Connectivity-based clustering (hierarchical clustering)
2. Centroid-based clustering
3. Distribution-based clustering
4. Density-based clustering


1. Connectivity-based clustering (hierarchical clustering)
Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. 
At different distances, different clusters will form, which can be represented using a dendrogram.
Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion. Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances) or UPGMA ("Unweighted Pair Group Method with Arithmetic Mean", also known as average linkage clustering). 
Hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions). In the general case, the complexity is O(n^3) for agglomerative clustering and O(2^{n-1}) for divisive clustering.
Failure case: consider long, thin clusters (shapes like the Chinese character 小); depending on the linkage criterion, such elongated clusters may be chained together or broken apart.
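
As a concrete illustration, here is a minimal sketch of agglomerative clustering with SciPy; it assumes NumPy and SciPy are available, and the toy data is invented:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data (hypothetical): two loose groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Agglomerative clustering; 'method' is the linkage criterion:
# 'single' (minimum distance), 'complete' (maximum), 'average' (UPGMA).
Z = linkage(X, method='average', metric='euclidean')

# Cut the dendrogram into a flat assignment of 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the tree described above, if a plot is wanted.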

2. Centroid-based clustering. Example: k-means clustering
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set.
The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions.
Most k-means-type algorithms require the number of clusters - k - to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. 
K-means has a number of interesting theoretical properties. First, it partitions the data space into a structure known as a Voronoi diagram. Second, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model based classification, and Lloyd's algorithm as a variation of the Expectation-maximization algorithm for this model discussed below.
Failure case: consider two dumbbell-shaped clusters; k-means favors compact, roughly spherical clusters (Voronoi cells) and tends to split such shapes incorrectly.
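
A minimal k-means sketch with scikit-learn (linked in the references); the toy data and parameter values are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Toy data (hypothetical): three Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 3, 6)])

# k must be specified in advance -- the main drawback noted above.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # central vectors, not necessarily data points
print(km.labels_[:10])      # cluster index of the first 10 samples

# Drop-in mini-batch variant (variant B below): trades a little accuracy
# for much shorter training time on large data sets.
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0).fit(X)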

Variants of k-means:
A. K-medoids: the difference lies in how the center is chosen. In k-means the center is the mean of all data points in the current cluster; in k-medoids the center is an actual point chosen from the current cluster, the one minimizing the sum of distances to all other points in the cluster (a medoid, cf. the Fermat point), which is why k-medoids has higher complexity.
B. Mini Batch K-Means is a variant of k-means that reduces computation time by training on small random subsets (mini-batches) of the data while still optimizing the same objective function. Each iteration draws a fresh random subset and updates the model on it, which greatly shortens convergence time compared with standard k-means; the resulting clustering is generally only slightly worse than that of the standard algorithm.
C. FCM (fuzzy k-means):
Make initial guesses for the means m1, m2, ..., mk
Until there are no changes in any mean:
    Use the estimated means to find the degree of membership u(j,i) of xj in cluster i;
    for example, if a(j,i) = exp(-||xj - mi||^2), one might use u(j,i) = a(j,i) / sum_i a(j,i)
    For i from 1 to k:
        Replace mi with the fuzzy mean of all of the examples for cluster i (the membership-weighted average)
    end_for
end_until
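
The pseudocode translates into a short NumPy implementation. This is a sketch only: it uses the Gaussian membership a(j,i) = exp(-||xj - mi||^2) from the example above and normalizes memberships per sample; production fuzzy c-means implementations usually also add a fuzzifier exponent m.

import numpy as np

def fuzzy_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initial guesses
    for _ in range(n_iter):
        # Squared distances ||xj - mi||^2, shape (n_samples, k).
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        a = np.exp(-d2)
        # Degree of membership of each sample in each cluster.
        u = a / (a.sum(axis=1, keepdims=True) + 1e-12)
        # Fuzzy mean: membership-weighted average of all samples.
        new_means = (u.T @ X) / (u.sum(axis=0)[:, None] + 1e-12)
        if np.abs(new_means - means).max() < tol:  # no changes in any mean
            break
        means = new_means
    return means, u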
An application: Vector Quantization
The basic idea of this data compression technique: group several scalar values into a vector and quantize the vector as a whole in vector space, compressing the data without losing much information. Formally, it is a mapping from the N-dimensional real space R^N onto L discrete vectors in R^N, also called block quantization; scalar quantization is the special case of vector quantization in one dimension.
Treat each pixel as a data point, run k-means to obtain k centroids, and then replace the pixel values of all points in each cluster with the pixel value of its centroid. The same method works for color images: in an RGB image, each pixel is treated as a point in a 3-dimensional vector space.
After the many original color values are replaced by the centroids, the total number of distinct colors drops and repeated colors become common; this redundancy is exactly what compression algorithms thrive on. Consider the simplest scheme: store the color information of the (say, 100) centroids separately, and for each pixel store the index of its centroid instead of the color value itself. If an RGB color takes 24 bits to store, an index (for up to 128 centroids) takes only 7 bits, which yields the compression.
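
A sketch of this color-quantization scheme using scikit-learn's KMeans. The function names and the assumption of an H x W x 3 uint8 RGB array img are mine, for illustration only:

import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(img, k=128):
    # img: (H, W, 3) uint8 RGB array (hypothetical input).
    h, w, _ = img.shape
    pixels = img.reshape(-1, 3).astype(np.float64)  # each pixel = 3-D vector
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)
    codebook = km.cluster_centers_.astype(np.uint8)      # k centroid colors
    indices = km.labels_.astype(np.uint8).reshape(h, w)  # 7-bit index per pixel (k <= 128)
    return codebook, indices

def reconstruct(codebook, indices):
    # Replace each index with its centroid color: the lossy decompressed image.
    return codebook[indices]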

3. Distribution-based clustering
The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution.
While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity.
One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm). 
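
A minimal Gaussian-mixture sketch with scikit-learn's GaussianMixture, which is fit with the expectation-maximization algorithm; the toy data is invented:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data (hypothetical): samples from two Gaussian distributions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, (100, 2)),
               rng.normal(4, 0.5, (100, 2))])

# Fit a 2-component mixture via EM. Constraining the covariance
# ('diag', 'spherical', 'tied') limits model complexity, one way to
# counter the overfitting mentioned above.
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignment to the likeliest distribution
probs = gmm.predict_proba(X)  # soft, per-distribution membership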

4. Density-based clustering. Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. 
The most popular[10] density-based clustering method is DBSCAN.
Rough procedure: initially, every object in the data set D is marked "unvisited". DBSCAN randomly selects an unvisited object p, marks p as "visited", and checks whether the ε-neighborhood of p contains at least MinPts objects. If not, p is marked as a noise point. Otherwise a new cluster C is created for p, and all objects in the ε-neighborhood of p are placed in a candidate set N. DBSCAN then iteratively adds to C every object in N that does not yet belong to another cluster. In the process, for each object p' in N that is marked "unvisited", DBSCAN marks p' as "visited" and checks its ε-neighborhood; if that ε-neighborhood contains at least MinPts objects, all of its objects are added to N. DBSCAN keeps adding objects to C until C can no longer be expanded, i.e., until N is empty. Cluster C is then complete and is output.
To find the next cluster, DBSCAN randomly selects another unvisited object from the remaining ones. The clustering process continues until every object has been visited.
If the clustering result obtained with rule-of-thumb parameter values is unsatisfactory, the radius Eps and the minimum point count MinPts can be adjusted, comparing the results over several runs to pick the most suitable values. Note that with MinPts fixed, too large an Eps merges most points into a single cluster, while too small an Eps splits what should be one cluster apart; with Eps fixed, too large a MinPts causes points inside a cluster to be marked as noise, while too small a MinPts produces a large number of core points.
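
A minimal DBSCAN sketch with scikit-learn, where eps and min_samples correspond to the Eps and MinPts parameters above; the toy data is invented:

import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two dense blobs plus scattered noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(3, 0.2, (50, 2)),
               rng.uniform(-2, 5, (10, 2))])

# eps is the neighborhood radius (Eps); min_samples is MinPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)  # cluster index per point; -1 marks noise points

# Tuning, as described above: rerun with different eps / min_samples
# values and compare -- larger eps merges clusters, smaller eps splits them.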

References:
https://en.wikipedia.org/wiki/Cluster_analysis
http://www.cnblogs.com/aijianiula/p/4339960.html
http://scikit-learn.org/stable/modules/clustering.html
http://www.cs.princeton.edu/courses/archive/fall08/cos436/Duda/C/fk_means.htm
http://blog.pluskid.org/?p=407
