Spectral Clustering

In multivariate statistics and the clustering of data, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions.

The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points; these points can be connected by edges. The edge weight between two points that are farther apart is lower, and the edge weight between two points that are closer together is higher.

By cutting the graph composed of all the data points so that the sum of edge weights between subgraphs is as low as possible while the sum of edge weights within each subgraph is as high as possible, we achieve the goal of clustering.
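
As a toy illustration of this objective (the numbers here are mine, not from the article), consider four points forming two tight pairs joined by weak edges; a good partition cuts little weight and keeps much:

    import numpy as np

    # Toy weighted adjacency matrix for 4 points: two tight pairs {0, 1} and
    # {2, 3} joined by weak edges (illustrative values only).
    W = np.array([
        [0.0, 0.9, 0.1, 0.0],
        [0.9, 0.0, 0.0, 0.1],
        [0.1, 0.0, 0.0, 0.8],
        [0.0, 0.1, 0.8, 0.0],
    ])

    A, B = [0, 1], [2, 3]                      # candidate subgraphs
    cut = W[np.ix_(A, B)].sum()                # weight crossing the cut
    within = (W[np.ix_(A, A)].sum() + W[np.ix_(B, B)].sum()) / 2
    print(cut, within)                         # 0.2 vs. 1.7: low cut, high within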

Undirected Weighted Graph

An undirected graph G(V, E) consists of a set of vertices V connected by a set of edges E, where all the edges are bidirectional.

For any vertex v_i in the graph, its degree d_i is defined as the sum of the weights of all edges connected to it, namely

d_i = Σ_{j=1}^{n} w_{ij}

Using the degree of each point, we can construct an n×n degree matrix D. It is a diagonal matrix whose only nonzero entries lie on the main diagonal, with the i-th diagonal entry equal to the degree of the i-th point:

D = diag(d_1, d_2, ..., d_n)
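
Continuing the toy example above, a minimal numpy sketch of the degree vector and degree matrix:

    import numpy as np

    # Degrees: d_i is the sum of the edge weights incident to point i,
    # i.e. the row sums of the adjacency matrix W defined earlier.
    d = W.sum(axis=1)
    D = np.diag(d)   # degree matrix: d_i on the main diagonal, zeros elsewhere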

Using the weight values between all pairs of points, we can get the adjacency matrix W of the graph. It is also an n×n matrix, and the j-th value in the i-th row corresponds to the weight w_{ij}.

Usually we are only given the data points themselves; the adjacency matrix is not provided directly.

To construct the adjacency matrix W, there are three types of methods:

  • ϵ-neighborhood method
  • k-nearest-neighbor method
  • Fully connected method

For the ϵ-neighborhood method, we set a distance threshold ϵ and then use the squared Euclidean distance between any two points as s_{ij}:

s_{ij} = ||x_i − x_j||_2^2

Then, according to s_{ij} and ϵ, we define the adjacency matrix W as follows:

w_{ij} = ϵ if s_{ij} ≤ ϵ, and w_{ij} = 0 if s_{ij} > ϵ
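
A minimal sketch of this construction (the helper name eps_adjacency is mine, not from the article):

    import numpy as np

    def eps_adjacency(X, eps):
        """ϵ-neighborhood graph: connect points whose squared Euclidean
        distance is at most ϵ; every kept edge gets the same weight ϵ."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # s_ij for all pairs
        W = np.where(sq <= eps, eps, 0.0)
        np.fill_diagonal(W, 0.0)  # no self-loops
        return W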

Because every retained edge carries the same weight ϵ, this measure of distance is very coarse, so in practical applications we rarely use the ϵ-neighborhood method.

The k-nearest-neighbor method uses the KNN algorithm to traverse all the sample points, taking the k nearest points of each sample as its neighbors; only the k points closest to a sample receive a positive weight.

However, this method makes the resulting adjacency matrix W asymmetric, while our subsequent algorithm requires a symmetric adjacency matrix. To solve this problem, one of the following two conventions is generally adopted (see the sketch after this list):

  • The first method keeps s_{ij} as long as either point is among the k nearest neighbors of the other: w_{ij} = w_{ji} = s_{ij} if x_i ∈ KNN(x_j) or x_j ∈ KNN(x_i), and 0 otherwise.
  • The second method requires the two points to be neighbors of each other in order to keep s_{ij}: w_{ij} = w_{ji} = s_{ij} if x_i ∈ KNN(x_j) and x_j ∈ KNN(x_i), and 0 otherwise.
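
A minimal sketch of both conventions, assuming a precomputed similarity matrix S (the helper name knn_adjacency is mine):

    import numpy as np

    def knn_adjacency(S, k, mode="or"):
        """k-nearest-neighbor graph from a similarity matrix S.
        mode="or":  keep s_ij if either point is among the other's k neighbors;
        mode="and": keep s_ij only if each point is among the other's k neighbors."""
        n = S.shape[0]
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            # the k most similar points to i, excluding i itself
            nbrs = [j for j in np.argsort(-S[i]) if j != i][:k]
            mask[i, nbrs] = True
        sym = (mask | mask.T) if mode == "or" else (mask & mask.T)
        return np.where(sym, S, 0.0)   # symmetric by construction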

In the fully connected method, we can choose different kernel functions to define the edge weights. Commonly used kernels include the polynomial kernel, the Gaussian kernel, and the sigmoid kernel. The most commonly used is the Gaussian kernel (RBF), in which case the similarity matrix and the adjacency matrix coincide:

w_{ij} = s_{ij} = exp(−||x_i − x_j||_2^2 / (2σ^2))
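
A minimal sketch of the fully connected construction with the Gaussian kernel (the helper name rbf_adjacency is mine):

    import numpy as np

    def rbf_adjacency(X, sigma=1.0):
        """Fully connected graph: w_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)  # conventionally no self-loops
        return W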

Laplacian Matrix

The Laplacian matrix is defined as

L = D − W

where D is the degree matrix (a diagonal matrix whose diagonal elements are the degrees of the corresponding nodes) and W is the adjacency matrix.

The Laplacian matrix has some useful properties:

1. The Laplacian matrix is a symmetric matrix, and all its eigenvalues are real numbers.

2. For any vector f, we have

f^T L f = (1/2) Σ_{i,j=1}^{n} w_{ij} (f_i − f_j)^2

3. The Laplacian matrix is positive semi-definite, and its n real eigenvalues are all greater than or equal to 0, that is,

0 ≤ λ_1 ≤ λ_2 ≤ ... ≤ λ_n

The smallest eigenvalue is 0, which follows easily from property 2 by taking f to be the all-ones vector.
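
These properties are easy to check numerically; a quick sketch reusing the toy W from earlier:

    import numpy as np

    L = np.diag(W.sum(axis=1)) - W             # L = D - W

    f = np.random.randn(W.shape[0])
    # Property 2: f^T L f = 1/2 * sum_ij w_ij (f_i - f_j)^2
    assert np.isclose(f @ L @ f,
                      0.5 * (W * (f[:, None] - f[None, :]) ** 2).sum())

    eigvals = np.linalg.eigvalsh(L)            # real, ascending (L is symmetric)
    assert eigvals[0] >= -1e-10                # positive semi-definite
    assert np.isclose(eigvals[0], 0.0)         # smallest eigenvalue is 0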

A very well-known related spectral clustering technique, the normalized cuts algorithm, is widely used in image segmentation. It partitions points based on the eigenvector corresponding to the second-smallest eigenvalue of the symmetric normalized Laplacian, which is defined as:

L_sym = D^{−1/2} L D^{−1/2} = I − D^{−1/2} W D^{−1/2}
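
A minimal sketch of computing L_sym from W (assuming every degree is positive):

    import numpy as np

    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    # equivalently: np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt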

To perform spectral clustering, we need three main steps:

  1. Create a similarity graph between our N objects to cluster.
  2. Compute the first k eigenvectors of its Laplacian matrix to define a feature vector for each object.
  3. Run k-means on these features to separate the objects into k classes, using scikit-learn (a full sketch follows below):

    from sklearn.cluster import KMeans
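
Putting the three steps together, a minimal end-to-end sketch with the unnormalized Laplacian (the function name spectral_clustering and the toy usage are mine; the original article's full code is not reproduced here):

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(W, k):
        """Cluster the points of a similarity graph W into k classes."""
        L = np.diag(W.sum(axis=1)) - W          # step 2: Laplacian L = D - W
        eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
        features = eigvecs[:, :k]               # first k eigenvectors as features
        return KMeans(n_clusters=k, n_init=10).fit_predict(features)  # step 3

    labels = spectral_clustering(W, k=2)  # on the toy W above: separates {0,1} from {2,3}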

References

  1. J. Demmel, CS267: Notes for Lecture 23, April 9, 1999, Graph Partitioning, Part 2.

  2. Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on PAMI, Vol. 22, No. 8, Aug 2000.

  3. Marina Meilă & Jianbo Shi, "Learning Segmentation by Random Walks", Neural Information Processing Systems 13 (NIPS 2000), 2001, pp. 873–879.

  4. Zare, Habil; P. Shooshtari; A. Gupta; R. Brinkman (2010). "Data reduction for spectral clustering to analyze high throughput flow cytometry data". BMC Bioinformatics. 11: 403. doi:10.1186/1471-2105-11-403. PMC 2923634. PMID 20667133.

  5. Arias-Castro, E., Chen, G. and Lerman, G. (2011), "Spectral clustering based on local linear approximations", Electronic Journal of Statistics, 5: 1537–1587, arXiv:1001.1323, doi:10.1214/11-ejs651.

  6. http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering

  7. Knyazev, Andrew V. (2006). Multiscale Spectral Graph Partitioning and Image Segmentation. Workshop on Algorithms for Modern Massive Data Sets, Stanford University and Yahoo! Research.

  8. http://spark.apache.org/docs/latest/mllib-clustering.html#power-iteration-clustering-pic

Translated from: https://medium.com/ai-in-plain-english/spectral-clustering-60f61f79002d
