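Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show its inherent structure, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.
The simplest way to accomplish this dimensionality reduction is to take a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.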
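As a rough illustration of this baseline, the sketch below projects a 3-D S-curve onto two random directions. The dataset (make_s_curve), the sample size, and the use of GaussianRandomProjection are choices made here for illustration and are not prescribed by the text.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.random_projection import GaussianRandomProjection

# Illustrative data: 300 points on a 2-D sheet curled up in 3-D.
X, _ = make_s_curve(n_samples=300, random_state=0)

# Project onto 2 random directions; no structure in X is taken into account.
projector = GaussianRandomProjection(n_components=2, random_state=0)
X_random_2d = projector.fit_transform(X)

print(X.shape, "->", X_random_2d.shape)
```

Changing random_state gives a different projection each time, which is exactly why the interesting structure is not reliably preserved.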
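To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific rubrics to choose an "interesting" linear projection of the data. These methods can be powerful, but they often miss important non-linear structure in the data.
Manifold learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.
One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.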
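The Isomap algorithm comprises three stages. The first is an efficient nearest-neighbor search: for $k$ nearest neighbors of $N$ points in $D$ dimensions, the cost is approximately $O[D \log(k) N \log(N)]$.
The overall complexity of Isomap is $O[D \log(k) N \log(N)] + O[N^2(k + \log(N))] + O[d N^2]$, where $d$ is the output dimension.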
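A minimal usage sketch of the Isomap object follows. The dataset and the values of n_neighbors and n_components are illustrative assumptions, not recommendations from the text.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Illustrative data: N = 300 points in D = 3 dimensions.
X, _ = make_s_curve(n_samples=300, random_state=0)

# n_neighbors plays the role of k, n_components the role of d.
isomap = Isomap(n_neighbors=10, n_components=2)
X_isomap = isomap.fit_transform(X)

print(X.shape, "->", X_isomap.shape)
```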
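Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.
Locally linear embedding can be performed with the function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding.
The standard LLE algorithm comprises three stages:
The overall complexity of standard LLE is $O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]$, where $N$ is the number of training points, $D$ the input dimension, $k$ the number of nearest neighbors, and $d$ the output dimension.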
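The two entry points named above can be compared side by side. This is a sketch on an illustrative swiss-roll dataset; the neighbor count and sample size are arbitrary choices made here.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding, locally_linear_embedding

# Illustrative data: 200 points on a rolled-up 2-D sheet in 3-D.
X, _ = make_swiss_roll(n_samples=200, random_state=0)

# Functional interface: returns the embedding and its reconstruction error.
Y_func, squared_error = locally_linear_embedding(
    X, n_neighbors=12, n_components=2, method="standard", random_state=0
)

# Object-oriented interface: the same computation behind fit/transform.
lle = LocallyLinearEmbedding(
    n_neighbors=12, n_components=2, method="standard", random_state=0
)
Y_obj = lle.fit_transform(X)

print(Y_func.shape, Y_obj.shape, lle.reconstruction_error_)
```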
One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard LLE applies an arbitrary regularization parameter $r$, which is chosen relative to the trace of the local weight matrix. Though it can be shown formally that as $r \to 0$ the solution converges to the desired embedding, there is no guarantee that the optimal solution will be found for $r > 0$. This problem manifests itself in embeddings which distort the underlying geometry of the manifold.
One method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with the function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'modified'. It requires n_neighbors > n_components.
The MLLE algorithm comprises three stages:
The overall complexity of MLLE is $O[D \log(k) N \log(N)] + O[D N k^3] + O[N (k - D) k^2] + O[d N^2]$ (for $N$ points, $D$ input dimensions, $k$ neighbors, and $d$ output dimensions).
Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization problem of LLE. It revolves around a Hessian-based quadratic form at each neighborhood which is used to recover the locally linear structure. Though other implementations note its poor scaling with data size, sklearn implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output dimension. HLLE can be performed with the function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'hessian'. It requires n_neighbors > n_components * (n_components + 3) / 2.
The HLLE algorithm comprises three stages:
The overall complexity of standard HLLE is $O[D \log(k) N \log(N)] + O[D N k^3] + O[N d^6] + O[d N^2]$ (for $N$ points, $D$ input dimensions, $k$ neighbors, and $d$ output dimensions).
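The neighbor-count requirement for HLLE can be checked before fitting. In the sketch below, min_hessian_neighbors is a hypothetical helper introduced here for illustration; the dataset, the chosen n_neighbors, and the use of eigen_solver="dense" are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding


def min_hessian_neighbors(n_components):
    """Hypothetical helper: smallest n_neighbors with k > d * (d + 3) / 2."""
    # d * (d + 3) is always even, so integer division is exact.
    return n_components * (n_components + 3) // 2 + 1


X, _ = make_s_curve(n_samples=200, random_state=0)
k_min = min_hessian_neighbors(2)  # 2 * 5 / 2 = 5, so at least 6 neighbors

# Too few neighbors: the estimator is expected to reject the configuration.
too_few_rejected = False
try:
    LocallyLinearEmbedding(
        n_neighbors=k_min - 1, n_components=2, method="hessian"
    ).fit(X)
except ValueError:
    too_few_rejected = True

# A comfortable margin above the minimum.
hlle = LocallyLinearEmbedding(
    n_neighbors=10, n_components=2, method="hessian", eigen_solver="dense"
)
X_hlle = hlle.fit_transform(X)
print(k_min, too_few_rejected, X_hlle.shape)
```

The same pattern applies to method='modified', where the condition is simply n_neighbors > n_components.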
Spectral Embedding is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low-dimensional representation of the data using a spectral decomposition of the graph Laplacian. The graph generated can be considered as a discrete approximation of the low-dimensional manifold in the high-dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low-dimensional space, preserving local distances. Spectral embedding can be performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.
The Spectral Embedding (Laplacian Eigenmaps) algorithm comprises three stages:
The overall complexity of spectral embedding is $O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2]$ (for $N$ points, $D$ input dimensions, $k$ neighbors, and $d$ output dimensions).
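A short sketch of the object-oriented interface, assuming a nearest-neighbors affinity graph. The dataset and parameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import SpectralEmbedding

X, _ = make_s_curve(n_samples=200, random_state=0)

# Build a k-nearest-neighbors graph and embed with its graph Laplacian.
spectral = SpectralEmbedding(
    n_components=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
)
X_spectral = spectral.fit_transform(X)

print(X_spectral.shape)
```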
Though not technically a variant of LLE, local tangent space alignment (LTSA) is algorithmically similar enough to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with the function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'ltsa'.
The LTSA algorithm comprises three stages:
The overall complexity of standard LTSA is $O[D \log(k) N \log(N)] + O[D N k^3] + O[k^2 d] + O[d N^2]$ (for $N$ points, $D$ input dimensions, $k$ neighbors, and $d$ output dimensions).
Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space.
In general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction frequencies of molecules, or trade indices between countries.
There exist two types of MDS algorithm: metric and non-metric. In scikit-learn, the class MDS implements both. In metric MDS, the input similarity matrix arises from a metric (and thus respects the triangle inequality); the distances between two output points are then set to be as close as possible to the similarity or dissimilarity data. In the non-metric version, the algorithm tries to preserve the order of the distances, and hence seeks a monotonic relationship between the distances in the embedded space and the similarities/dissimilarities.
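Both variants can be tried through the MDS class. The sketch assumes the boolean metric keyword that the text alludes to; parameter names and defaults may differ between scikit-learn releases, and the dataset and n_init value are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import MDS

# A small illustrative dataset keeps the iterative optimization cheap.
X, _ = make_s_curve(n_samples=60, random_state=0)

# Metric MDS: embedded distances should match the input dissimilarities.
metric_mds = MDS(n_components=2, metric=True, n_init=1, random_state=0)
X_metric = metric_mds.fit_transform(X)

# Non-metric MDS: only the ordering of the dissimilarities is preserved.
nonmetric_mds = MDS(n_components=2, metric=False, n_init=1, random_state=0)
X_nonmetric = nonmetric_mds.fit_transform(X)

print(X_metric.shape, metric_mds.stress_)
print(X_nonmetric.shape, nonmetric_mds.stress_)
```

The two stress_ values are not directly comparable, since the non-metric variant optimizes against rescaled disparities rather than the raw dissimilarities.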
Let $S$ be the similarity matrix, and $X$ the coordinates of the $n$ input points. Disparities $\hat{d}_{ij}$ are transformations of the similarities chosen in some optimal way. The objective, called the stress, is then defined by
$$\mathrm{Stress}(X) = \sum_{i < j} \left( d_{ij}(X) - \hat{d}_{ij} \right)^2$$
In the simplest metric MDS model, called absolute MDS, disparities are defined by $\hat{d}_{ij} = S_{ij}$. With absolute MDS, the value $S_{ij}$ should then correspond exactly to the distance between points $i$ and $j$ in the embedding space.
Most commonly, disparities are set to $\hat{d}_{ij} = b S_{ij}$.
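To make the stress objective concrete, the sketch below evaluates it for absolute MDS, assuming the usual sum-of-squared-differences form over pairs i < j. The function name raw_stress and the three-point example are illustrative and not part of any library.

```python
import numpy as np


def raw_stress(X, disparities):
    """Sum over i < j of (d_ij(X) - dhat_ij)^2 (assumed form of the stress)."""
    diff = X[:, None, :] - X[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(X), k=1)  # each pair counted once
    return float(((distances[upper] - disparities[upper]) ** 2).sum())


# Three points forming a 3-4-5 right triangle.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])

# Absolute MDS: disparities equal the dissimilarities themselves.
S_exact = np.array([[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]])
S_off = np.array([[0.0, 3.0, 4.0], [3.0, 0.0, 6.0], [4.0, 6.0, 0.0]])

stress_exact = raw_stress(X, S_exact)  # embedding reproduces S exactly
stress_off = raw_stress(X, S_off)      # one pair is off by 1
print(stress_exact, stress_off)
```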
Non-metric MDS focuses on the ordination of the data. If the dissimilarities satisfy $S_{ij} < S_{jk}$, then the embedding should enforce $d_{ij} < d_{jk}$. A simple algorithm to enforce that is to use a monotonic regression of $d_{ij}$ on $S_{ij}$, yielding disparities $\hat{d}_{ij}$ in the same order as $S_{ij}$.
A trivial solution to this problem is to set all the points on the origin. In order to avoid that, the disparities $\hat{d}_{ij}$ are normalized.
t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student's t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:
While Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once as is the case in the digits dataset.
The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will be minimized by gradient descent. Note that the KL divergence is not convex, i.e. multiple restarts with different initializations will end up in local minima of the KL divergence. Hence, it is sometimes useful to try different seeds and select the embedding with the lowest KL divergence.
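The seed-selection strategy described above can be sketched as a short loop. The data subset, the perplexity value, and the three seeds are arbitrary choices made here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# A small subset of the digits data keeps the repeated fits quick.
X = load_digits().data[:100]

runs = []
for seed in (0, 1, 2):
    tsne = TSNE(n_components=2, perplexity=20, init="random", random_state=seed)
    embedding = tsne.fit_transform(X)
    # kl_divergence_ holds the final value of the objective for this run.
    runs.append((tsne.kl_divergence_, seed, embedding))

# Keep the run that reached the lowest KL divergence.
best_kl, best_seed, best_embedding = min(runs, key=lambda run: run[0])
print(best_seed, best_kl)
```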
The disadvantages to using t-SNE are roughly:
The main purpose of t-SNE is visualization of high-dimensional data. Hence, it works best when the data will be embedded on two or three dimensions.
Optimizing the KL divergence can be a little bit tricky sometimes. There are five parameters that control the optimization of t-SNE and therefore possibly the quality of the resulting embedding:
The perplexity is defined as $k = 2^{S}$, where $S$ is the Shannon entropy of the conditional probability distribution. The perplexity of a $k$-sided die is $k$, so that $k$ is effectively the number of nearest neighbors t-SNE considers when generating the conditional probabilities. Larger perplexities lead to more nearest neighbors and make the embedding less sensitive to small structure. Conversely, a lower perplexity considers a smaller number of neighbors, and thus ignores more global information in favour of the local neighborhood. As dataset sizes get larger, more points are required to get a reasonable sample of the local neighborhood, and hence larger perplexities may be required. Similarly, noisier datasets require larger perplexity values to encompass enough local neighbors to see beyond the background noise.
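The die analogy can be verified numerically. The helper below is an illustrative function written for this note, not a scikit-learn API; it computes 2 raised to the Shannon entropy (in bits) of a discrete distribution.

```python
import numpy as np


def perplexity(probabilities):
    """Return 2 ** H(p), with H the Shannon entropy in bits."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # terms with p = 0 contribute nothing to the entropy
    entropy_bits = -(p * np.log2(p)).sum()
    return float(2.0 ** entropy_bits)


fair_die = [1 / 6] * 6
peaked = [0.97, 0.01, 0.01, 0.01]
uniform_four = [0.25] * 4

print(perplexity(fair_die))      # close to 6, the number of sides
print(perplexity(peaked))        # well below 4: few effective neighbors
print(perplexity(uniform_four))  # close to 4
```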
The maximum number of iterations is usually high enough and does not need any tuning. The optimization consists of two phases: the early exaggeration phase and the final optimization. During early exaggeration, the joint probabilities in the original space are artificially increased by multiplication with a given factor. Larger factors result in larger gaps between natural clusters in the data. If the factor is too high, the KL divergence could increase during this phase. Usually it does not have to be tuned. A critical parameter is the learning rate. If it is too low, gradient descent will get stuck in a bad local minimum. If it is too high, the KL divergence will increase during optimization. More tips can be found in Laurens van der Maaten's FAQ (see references). The last parameter, angle, is a tradeoff between performance and accuracy. Larger angles imply that larger regions can be approximated by a single point, leading to better speed but less accurate results.
“How to Use t-SNE Effectively” provides a good discussion of the effects of the various parameters, as well as interactive plots to explore the effects of different parameters.
The Barnes-Hut t-SNE that has been implemented here is usually much slower than other manifold learning algorithms. The optimization is quite difficult and the computation of the gradient is $O[d N \log(N)]$, where $d$ is the number of output dimensions and $N$ is the number of samples. The Barnes-Hut method improves on the exact method, where t-SNE complexity is $O[d N^2]$, but has several other notable differences:
For visualization purposes (the main use case of t-SNE), using the Barnes-Hut method is strongly recommended. The exact t-SNE method is useful for checking the theoretical properties of the embedding, possibly in a higher-dimensional space, but it is limited to small datasets due to computational constraints.
Also note that the digits labels roughly match the natural grouping found by t-SNE, while the linear 2D projection of the PCA model yields a representation where label regions largely overlap. This is a strong clue that this data can be well separated by non-linear methods that focus on the local structure (e.g. an SVM with a Gaussian RBF kernel). However, failing to visualize well-separated, homogeneously labeled groups with t-SNE in 2D does not necessarily imply that the data cannot be correctly classified by a supervised model. It might be the case that 2 dimensions are not enough to accurately represent the internal structure of the data.
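The reconstruction error decreases as n_components is increased until n_components == d, where d is the intrinsic dimension of the manifold.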
When the weight matrix is singular, the ARPACK eigensolver will fail to find the null space. The easiest way to address this is to use eigen_solver='dense', which will work on a singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can attempt to understand the source of the singularity: if it is due to disjoint sets, increasing n_neighbors may help. If it is due to identical points in the dataset, removing these points may help.
See also
Totally Random Trees Embedding can also be useful to derive non-linear representations of feature space, although it does not perform dimensionality reduction.
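A brief sketch of that last point, using RandomTreesEmbedding from sklearn.ensemble on an illustrative two-circles dataset. The number of trees and the depth are arbitrary values chosen here. Note that the output has more columns than the input, not fewer.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

# Illustrative 2-D data that is not linearly separable.
X, _ = make_circles(n_samples=100, factor=0.5, noise=0.05, random_state=0)

# Each sample is encoded by the leaf it falls into in every tree.
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_leaves = embedder.fit_transform(X)  # sparse one-hot leaf indicators

print(X.shape, "->", X_leaves.shape)
```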
