An article on visualizing multidimensional data

From SNE to t-SNE to LargeVis
Original article: http://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis

In essence, the idea is to project multidimensional data into 2D for display. The algorithm used to construct the KNN graph is the interesting part, and the article explains it clearly.

Abstract
1. Compute the similarity structure of the data points.
2. Project them into a low-dimensional space.
Both steps are computationally costly, which prevents state-of-the-art methods such as t-SNE from scaling to large-scale and high-dimensional data.
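
For reference, this two-step pipeline is what off-the-shelf t-SNE implementations expose directly. A minimal sketch with scikit-learn's TSNE (the random data is purely for illustration):

    import numpy as np
    from sklearn.manifold import TSNE

    # Toy data: 1000 points in 50 dimensions (random, for illustration only).
    X = np.random.RandomState(0).randn(1000, 50)

    # Step 1 (similarity structure) and step 2 (low-dimensional projection)
    # both happen inside fit_transform; perplexity controls the effective
    # neighborhood size used when computing similarities.
    X_2d = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
    print(X_2d.shape)  # (1000, 2)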

LargeVis (algorithm):
1. Constructs an accurately approximated K-nearest neighbor graph
2. Lays out the graph in the low-dimensional space.

Advantages:
1. Reduces the cost of the graph construction step.
2. Employs a principled probabilistic model for the visualization step, whose objective can be effectively optimized through asynchronous stochastic gradient descent with linear time complexity (see the sketch after this list).
3. Easily scales to millions of high-dimensional data points.
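
To make advantage 2 concrete, here is a rough sketch of one edge-sampling SGD step on the 2D layout (run asynchronously across threads in the real implementation). The probability function f(x) = 1/(1 + x^2), the degree**0.75 noise distribution, and the defaults (5 negative samples, gamma = 7) follow my reading of the LargeVis paper; treat the exact constants and the sgd_step/neg_table names as illustrative assumptions, not the reference implementation. Edges are assumed to be sampled with probability proportional to their weight w_ij, so the weight does not appear in the update itself.

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_step(Y, i, j, neg_table, lr=1.0, n_neg=5, gamma=7.0):
        """One SGD update on a sampled positive edge (i, j).

        Y         : (N, 2) array of current 2D embeddings.
        neg_table : 1D array of node ids, pre-filled proportionally to
                    degree**0.75, so uniform draws from it follow the
                    noise distribution.
        Uses f(x) = 1 / (1 + x^2) as the edge probability.
        """
        # Attractive update: pull the endpoints of the observed edge together.
        d = Y[i] - Y[j]
        g = -2.0 * d / (1.0 + d @ d)   # gradient of log f(||d||) w.r.t. y_i
        Y[i] += lr * g
        Y[j] -= lr * g

        # Repulsive updates: push y_i away from a few sampled negative nodes.
        for k in rng.choice(neg_table, size=n_neg):
            if k == i or k == j:
                continue
            d = Y[i] - Y[k]
            dist2 = d @ d + 1e-8       # avoid division by zero
            # Gradient of gamma * log(1 - f(||d||)) w.r.t. y_i.
            g = 2.0 * gamma * d / ((1.0 + dist2) * dist2)
            Y[i] += lr * g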

Introduction
Essential idea: project high-dimensional data into a space with fewer dimensions, preserving the intrinsic structure of the high-dimensional data (i.e., keeping similar data points close and dissimilar data points far apart).

There are many dimensionality reduction techniques:
1. Linear mappings (Principal Component Analysis, multidimensional scaling).
As most high-dimensional data usually lie on or near a low-dimensional non-linear manifold, linear mapping methods are usually not satisfactory.
2. Non-linear mappings (e.g., Locally Linear Embedding, Laplacian Eigenmaps); see the swiss-roll sketch after this list.
These are effective on small, laboratory data sets, but do not perform well on high-dimensional, real-world data, as they are typically unable to preserve both the local and the global structure of the high-dimensional data.
3. t-SNE:
3.1. Constructing a K-nearest neighbor (KNN) graph of the data points.
3.2. Projecting the graph into low-dimensional spaces with tree-based algorithms.
Reasons it is unsatisfactory when applied to data with millions of points and hundreds of dimensions:
1) Constructing the K-nearest neighbor graph (t-SNE uses vantage-point trees) is a computational bottleneck for large-scale and high-dimensional data.
2) The efficiency of the graph visualization step deteriorates significantly as the data size grows.
3) The parameters of t-SNE are very sensitive to different data sets.
4. LargeVis
4.1. Propose a very efficient algorithm to construct an approximate K-nearest neighbor graph from large-scale, high-dimensional data.
4.2. Propose a principled probabilistic model for graph visualization. The model preserves the structures of the graph in the low-dimensional space, keeping similar data points close and dissimilar data points far away from each other. The objective function of the model can be effectively optimized through asynchronous stochastic gradient descent with a time complexity of O(N).
4.3. The parameters of LargeVis are not sensitive to different data sets, and the method is effective on real-world data.
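
The swiss-roll sketch referenced in item 2 above: a classic 2D manifold rolled up in 3D, where a linear projection cannot unfold the data but a local, non-linear method can. Parameter choices here are illustrative:

    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import PCA
    from sklearn.manifold import LocallyLinearEmbedding

    # A classic manifold: 3D points lying on a rolled-up 2D sheet.
    X, color = make_swiss_roll(n_samples=1500, random_state=0)

    # Linear mapping: PCA can only rotate and project, so the roll stays folded.
    X_pca = PCA(n_components=2).fit_transform(X)

    # Non-linear mapping: LLE works from local neighborhoods and can unroll it.
    X_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)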

Related Work
2.1 K-nearest neighbor graph construction
Exact computation of the KNN graph has a complexity of O(N^2 d) (with N being the number of data points and d being the number of dimensions), which is too costly for large data. Approximate approaches include (a brute-force baseline sketch follows this list):
1) Space-partitioning trees
2) Locality sensitive hashing
3) Neighbor exploring
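
The brute-force baseline referenced above: computing all pairwise distances makes the O(N^2 d) cost tangible (knn_graph_bruteforce is an illustrative name, not a library function):

    import numpy as np

    def knn_graph_bruteforce(X, k):
        """Exact KNN graph via all N*N pairwise distances, hence O(N^2 d).

        Fine for a few thousand points, hopeless for millions -- which is
        why the tree-, hashing-, and neighbor-exploring methods above exist.
        """
        # Squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2a.b
        sq = (X ** 2).sum(axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        np.fill_diagonal(d2, np.inf)          # exclude self-neighbors
        return np.argsort(d2, axis=1)[:, :k]  # ids of the k nearest points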

LargeVis
A neighbor of my neighbor is also likely to be my neighbor.
1. Build a few random projection trees to construct an approximate K-nearest neighbor graph.
2. Then, for each node of the graph, search among the neighbors of its neighbors. This may be repeated for multiple iterations to improve the accuracy of the graph (a sketch follows this list).
3. A principled probabilistic model is used to project the graph into a 2D/3D space.
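
A compact sketch of the neighbor-exploring refinement in step 2. In LargeVis the initial graph comes from the random projection trees of step 1; here the initial knn array is assumed given, and refine_knn and its parameters are illustrative rather than the paper's implementation:

    import numpy as np

    def refine_knn(X, knn, k, n_iters=2):
        """Refine an approximate KNN graph by exploring neighbors of neighbors.

        knn : (N, k) int array of candidate neighbor ids per point, e.g. from
              random projection trees (each row assumed to hold k distinct
              ids, none equal to the row index).
        """
        N = X.shape[0]
        for _ in range(n_iters):
            new_knn = np.empty_like(knn)
            for i in range(N):
                # Candidates: current neighbors plus their neighbors --
                # "a neighbor of my neighbor is also likely to be my neighbor".
                cand = np.unique(np.concatenate([knn[i], knn[knn[i]].ravel()]))
                cand = cand[cand != i]
                d2 = ((X[cand] - X[i]) ** 2).sum(axis=1)
                new_knn[i] = cand[np.argsort(d2)[:k]]  # keep the k closest
            knn = new_knn
        return knn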
