t-SNE reduces high-dimensional data to a low-dimensional space and visualizes it at the same time. Because its cost function is non-convex, different initializations produce different embeddings, so t-SNE is not generally recommended as a dimensionality-reduction method; dedicated methods such as PCA are more appropriate and more principled for that purpose. Reducing the dimensionality beforehand also helps suppress noise and speeds up the computation of pairwise distances between samples.
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples. For more tips see Laurens van der Maaten’s FAQ.
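To make that recommendation concrete, here is a minimal sketch of the PCA-then-t-SNE pipeline; the digits dataset, the choice of 50 components, and the variable names are illustrative assumptions of mine, not from the official docs.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()  # 1797 samples, 64 features

# Step 1: PCA down to a moderate dimensionality (50 here) to suppress
# noise and speed up the pairwise-distance computations inside t-SNE.
X_reduced = PCA(n_components=50).fit_transform(digits.data)

# Step 2: t-SNE from 50 dimensions down to 2 for plotting.
X_2d = TSNE(n_components=2).fit_transform(X_reduced)
print(X_2d.shape)  # (1797, 2)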
Let's look at the code first: t-SNE can be implemented with the sklearn library, and it is very simple.
sklearn.manifold.TSNE — official documentation
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the iris dataset (150 samples, 4 features).
iris = load_iris()

# Embed the 4-dimensional data into 2 dimensions with t-SNE.
X_embedded = TSNE(n_components=2, learning_rate=100).fit_transform(iris.data)

# PCA projection for comparison; only the first two components are plotted.
X_pca = PCA().fit_transform(iris.data)

# Plot the two embeddings side by side, colored by class label.
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=iris.target)
plt.subplot(122)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)
plt.show()
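Because the cost function is non-convex, fixing random_state is the usual way to make the output reproducible. The following is a small sketch of my own (not from the docs) showing that two runs differing only in the seed generally produce different embeddings; init='random' is passed explicitly because newer sklearn versions default to a deterministic PCA initialization.

import numpy as np
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

X = load_iris().data

# Two runs that differ only in the random seed used for initialization.
emb_a = TSNE(n_components=2, init='random', random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, init='random', random_state=1).fit_transform(X)

# The resulting coordinates typically differ between runs.
print(np.allclose(emb_a, emb_b))  # expected: False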
class sklearn.manifold.TSNE(n_components=2, *, perplexity=30.0,
early_exaggeration=12.0, learning_rate=200.0, n_iter=1000,
n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean',
init='random', verbose=0, random_state=None, method='barnes_hut',
angle=0.5, n_jobs=None)
n_components: int, default=2. The dimensionality of the embedded space used for visualization.
perplexity: float, default=30.0. This parameter has a large influence on the result; values between 5 and 50 are recommended, and larger datasets usually need larger values.
The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. Different values can result in significantly different results.
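Since different perplexities can give noticeably different pictures, a quick sanity check is to sweep a few values and plot each embedding. The grid of values in this sketch is an arbitrary illustrative choice.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

iris = load_iris()

# Compare several perplexity values side by side on the same data.
plt.figure(figsize=(15, 4))
for i, perp in enumerate([5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp).fit_transform(iris.data)
    plt.subplot(1, 3, i + 1)
    plt.scatter(emb[:, 0], emb[:, 1], c=iris.target)
    plt.title(f"perplexity={perp}")
plt.show()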
early_exaggeration: float, default=12.0. Not a critical parameter; if the cost function increases during the initial optimization, lower this value.
Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. Again, the choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high.
learning_rate: float, default=200.0. Usually in the range 10 to 1000; if the cost function gets stuck in a poor local minimum, increase this value.
The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum increasing the learning rate may help.
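One practical way to compare learning rates is to look at the final KL divergence (the kl_divergence_ attribute described below): lower values usually indicate that the optimization reached a better minimum. The candidate rates in this sketch are arbitrary.

from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

X = load_iris().data

# Lower final KL divergence suggests a better optimum was found.
for lr in [10, 100, 200, 1000]:
    tsne = TSNE(n_components=2, learning_rate=lr, random_state=0)
    tsne.fit(X)
    print(f"learning_rate={lr}: kl_divergence_={tsne.kl_divergence_:.3f}")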
And so on for the remaining parameters; see the official documentation linked above.
Attributes

embedding_ : array-like, shape (n_samples, n_components)
    Stores the embedding vectors.

kl_divergence_ : float
    Kullback-Leibler divergence after optimization.

n_iter_ : int
    Number of iterations run.
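These attributes are available once the estimator has been fitted; a short sketch of reading them:

from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

tsne = TSNE(n_components=2)
tsne.fit(load_iris().data)

print(tsne.embedding_.shape)  # (150, 2): the stored embedding vectors
print(tsne.kl_divergence_)    # KL divergence after optimization
print(tsne.n_iter_)           # number of iterations run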