The palest ink is better than the best memory.
dataNumpy is data of type numpy.ndarray. See reference [1] for details.
from sklearn.manifold import TSNE
import numpy as np

# These are the default parameter values listed in [1].
# Note: newer scikit-learn releases rename n_iter to max_iter.
# data_tsne is an array of shape (n_samples, n_components):
# the embedding of the training data in the low-dimensional space.
data_tsne = TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0,
                 learning_rate=200.0, n_iter=1000, n_iter_without_progress=300,
                 min_grad_norm=1e-07, metric='euclidean', init='random',
                 verbose=0, random_state=None, method='barnes_hut',
                 angle=0.5).fit_transform(dataNumpy)
The meaning of each parameter, quoted from the official documentation [1]:
Important parameters:
n_components : int, optional (default: 2)
Dimension of the embedded space.
perplexity : float, optional (default: 30)
The perplexity is related to the number of nearest neighbors that is used in
other manifold learning algorithms. Larger datasets usually require a larger
perplexity. Consider selecting a value between 5 and 50. The choice is not
extremely critical since t-SNE is quite insensitive to this parameter.
n_iter : int, optional (default: 1000)
Maximum number of iterations for the optimization. Should be at least 250.
learning_rate : float, optional (default: 200.0)
The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If
the learning rate is too high, the data may look like a ‘ball’ with any
point approximately equidistant from its nearest neighbours. If the
learning rate is too low, most points may look compressed in a dense
cloud with few outliers. If the cost function gets stuck in a bad local
minimum increasing the learning rate may help.
metric : string or callable, optional
The metric to use when calculating distance between instances in a feature
array. If metric is a string, it must be one of the options allowed by
scipy.spatial.distance.pdist for its metric parameter, or a metric listed in
pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed
to be a distance matrix. Alternatively, if metric is a callable function, it is
called on each pair of instances (rows) and the resulting value recorded. The
callable should take two arrays from X as input and return a value indicating
the distance between them. The default is “euclidean” which is interpreted as
squared euclidean distance.
init : string or numpy array, optional (default: “random”)
Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy
array of shape (n_samples, n_components). PCA initialization cannot be used with
precomputed distances and is usually more globally stable than random
initialization.
random_state : int, RandomState instance or None, optional (default: None)
If int, random_state is the seed used by the random number generator; If
RandomState instance, random_state is the random number generator; If None,
the random number generator is the RandomState instance used by np.random.
Note that different initializations might result in different local minima
of the cost function.
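The metric and init descriptions above interact: with metric="precomputed" the input is a distance matrix, and PCA initialization is not available. A minimal sketch of that combination follows. The toy data X, the choice of Manhattan distance, and the perplexity value are assumptions made only for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: 60 samples with 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))

# Build the full (n_samples, n_samples) distance matrix ourselves.
D = pairwise_distances(X, metric="manhattan")

# With metric="precomputed", fit_transform receives the distance matrix,
# and init must not be "pca", so random initialization is used here.
embedding = TSNE(n_components=2, perplexity=10.0, metric="precomputed",
                 init="random", random_state=0).fit_transform(D)
print(embedding.shape)
```

Setting random_state to an int, as described above, is what makes repeated calls comparable.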
Other parameters:
early_exaggeration : float, optional (default: 12.0)
Controls how tight natural clusters in the original space are in the
embedded space and how much space will be between them. For larger
values, the space between natural clusters will be larger in the
embedded space. Again, the choice of this parameter is not very
critical. If the cost function increases during initial
optimization, the early exaggeration factor or the learning rate
might be too high.
n_iter_without_progress : int, optional (default: 300)
Maximum number of iterations without progress before we abort
the optimization, used after 250 initial iterations with early
exaggeration. Note that progress is only checked every 50
iterations so this value is rounded to the next multiple of 50.
New in version 0.17: parameter n_iter_without_progress to
control stopping criteria.
min_grad_norm : float, optional (default: 1e-7)
If the gradient norm is below this threshold, the optimization will be
stopped.
method : string (default: ‘barnes_hut’)
By default the gradient calculation algorithm uses Barnes-Hut approximation
running in O(NlogN) time. method='exact' will run on the slower, but exact,
algorithm in O(N^2) time. The exact algorithm should be used when nearest-
neighbor errors need to be better than 3%. However, the exact method cannot
scale to millions of examples.
New in version 0.17: Approximate optimization method via the Barnes-Hut.
angle : float (default: 0.5)
Only used if method='barnes_hut'. This is the trade-off between speed and accuracy
for Barnes-Hut T-SNE. 'angle' is the angular size (referred to as theta in [7])
of a distant node as measured from a point. If this size is below ‘angle’ then it
is used as a summary node of all points contained within it. This method is not
very sensitive to changes in this parameter in the range of 0.2 - 0.8. Angle less
than 0.2 has quickly increasing computation time and angle greater 0.8 has
quickly increasing error.
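Because TSNE only accepts input arrays of at most 2 dimensions, i.e. (n_samples, n_features), each image must first be flattened into a 1-D vector before it is embedded.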
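A minimal sketch of that flattening step, assuming the images are stacked in a single array of shape (n_samples, height, width). The array `images` and its 28x28 size are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for an image dataset: 100 grayscale images of 28x28 pixels.
images = np.random.default_rng(0).random((100, 28, 28))

# Flatten each image into one row:
# (n_samples, height, width) -> (n_samples, height * width)
dataNumpy = images.reshape(len(images), -1)
print(dataNumpy.shape)  # (100, 784)
```

The resulting dataNumpy can be passed to fit_transform directly.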
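Here is one image from the experiment (not the best one); the parameters can still be tuned to get a better result.
The choice of parameters is critical, and poorly chosen parameters can mislead your judgment. See [3,4] for details.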
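To summarize [3]:
i) The parameters really matter
ii) The size of a cluster in a t-SNE plot means nothing, i.e. it does not reflect the size of the cluster in the original data
iii) The distance between clusters in a t-SNE plot does not reflect the distance between the original clusters
iv) Randomly generated data may not look so random after t-SNE
v) Sometimes the shape of the original clusters can be seen in the t-SNE plot
vi) For the same data, draw several plots and compare them, so that the randomness of a single run does not distort your understanding of the data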
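Point vi) can be sketched as a small loop that embeds the same data under several perplexities and seeds. The helper name embed_many, the toy data, and the specific perplexity and seed values are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_many(data, perplexities=(5.0, 15.0), seeds=(0, 1)):
    """Run t-SNE once per (perplexity, seed) pair; hypothetical helper."""
    results = {}
    for p in perplexities:
        for s in seeds:
            results[(p, s)] = TSNE(n_components=2, perplexity=p,
                                   random_state=s).fit_transform(data)
    return results


# Hypothetical toy data: two well-separated Gaussian blobs in 5 dimensions.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 1, (25, 5)), rng.normal(6, 1, (25, 5))])

embeddings = embed_many(toy)
for key, emb in embeddings.items():
    print(key, emb.shape)
```

Plotting each entry of embeddings side by side shows which structures persist across settings and which appear in only one run.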
Building on [2], a few small tricks for displaying and saving the figure are added.
import matplotlib.pyplot as plt

def plot_embedding(data, label, title):
    # Scale each axis to [0, 1] so the text labels fall inside the default axes limits
    x_min, x_max = np.min(data, 0), np.max(data, 0)
    data = (data - x_min) / (x_max - x_min)
    fig = plt.figure()
    ax = plt.subplot(111)
    for i in range(data.shape[0]):
        plt.text(data[i, 0], data[i, 1], str(label[i]),
                 color='blue',  # or: color=plt.cm.Set1(np.random.randint(10))
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([])
    plt.yticks([])
    plt.title(title)
    return fig

plt.rcParams['figure.figsize'] = (20, 20)  # figure size for display
# data_tsne is the embedding computed above; labelNumpy holds one label per sample
fig = plot_embedding(data_tsne, labelNumpy, 't-sne test')
# plt.savefig() must be called before plt.show(); otherwise a blank image is saved
plt.savefig('t-sne-test.jpg')
plt.show()
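The commented-out color hint above picks a random color per point. A variant that colors each point by its label may be easier to read. This is only a sketch: the helper name plot_embedding_colored, the demo data, the Agg backend, and the assumption that labels are integers are all introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_embedding_colored(data, label, title):
    """Draw each point as its label text, colored by label (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    label = np.asarray(label)
    # Scale each axis to [0, 1]; guard against a zero range on a constant axis.
    x_min, x_max = data.min(axis=0), data.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    data = (data - x_min) / span

    cmap = plt.get_cmap("Set1")  # qualitative colormap with 9 colors
    fig, ax = plt.subplots(figsize=(8, 8))
    for (x, y), lab in zip(data, label):
        # Assumes integer labels; labels 9 and above wrap around the 9 colors.
        ax.text(x, y, str(lab), color=cmap(int(lab) % 9),
                fontdict={"weight": "bold", "size": 9})
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(title)
    return fig


rng = np.random.default_rng(0)
demo_points = rng.normal(size=(20, 2))      # stand-in for a t-SNE result
demo_labels = rng.integers(0, 10, size=20)  # stand-in for integer class labels
demo_fig = plot_embedding_colored(demo_points, demo_labels, "t-SNE demo")
```

Remove the matplotlib.use("Agg") line when an interactive window is wanted.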
[1] Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
[2] Displaying t-SNE results with matplotlib: https://www.deeplearn.me/2137.html
[3] Choosing t-SNE parameters: https://distill.pub/2016/misread-tsne/
[4] Choosing t-SNE parameters: https://mp.weixin.qq.com/s/cnzQ7XepftDOZXslCf1MUA
[5] Paper: van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.
[6] Paper: van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding. http://homepage.tudelft.nl/19j49/t-SNE.html
[7] Paper: van der Maaten, L.J.P. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014. http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf