Simple usage of t-SNE from sklearn.manifold (for tabular and image data) + basic matplotlib usage

0. Purpose

The palest ink is better than the best memory.

1. Usage with relational (tabular) data

1.1 Reducing tabular data to two dimensions

Here dataNumpy is data of type numpy.array. See [1] for details.

from sklearn.manifold import TSNE
import numpy as np


## These are the default parameters.
## data_tsne is an array of shape (n_samples, n_components):
##     the embedding of the training data in low-dimensional space.
data_tsne = TSNE(n_components=2, perplexity=30.0, early_exaggeration=12.0,
                 learning_rate=200.0, n_iter=1000, n_iter_without_progress=300,
                 min_grad_norm=1e-07, metric='euclidean', init='random',
                 verbose=0, random_state=None, method='barnes_hut',
                 angle=0.5).fit_transform(dataNumpy)
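As a quick sanity check, the output shape can be inspected; the input below is hypothetical random data standing in for dataNumpy:

dataNumpy = np.random.rand(150, 4)                   ## hypothetical: 150 samples, 4 features
data_tsne = TSNE(n_components=2).fit_transform(dataNumpy)
print(data_tsne.shape)                               ## (150, 2): one 2-D point per sample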

The meaning of each parameter, as given in the official documentation:

Important parameters:

n_components : int, optional (default: 2)

               Dimension of the embedded space.

perplexity : float, optional (default: 30)

            The perplexity is related to the number of nearest neighbors that is used in 
            other manifold learning algorithms. Larger datasets usually require a larger 
            perplexity. Consider selecting a value between 5 and 50. The choice is not 
            extremely critical since t-SNE is quite insensitive to this parameter.

n_iter : int, optional (default: 1000)

         Maximum number of iterations for the optimization. Should be at least 250.

learning_rate : float, optional (default: 200.0)

                The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If 
                the learning rate is too high, the data may look like a ‘ball’ with any 
                point approximately equidistant from its nearest neighbours. If the 
                learning rate is too low, most points may look compressed in a dense 
                cloud with few outliers. If the cost function gets stuck in a bad local 
                minimum increasing the learning rate may help.

metric : string or callable, optional

         The metric to use when calculating distance between instances in a feature 
         array. If metric is a string, it must be one of the options allowed by 
         scipy.spatial.distance.pdist for its metric parameter, or a metric listed in 
         pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is “precomputed”, X is assumed 
         to be a distance matrix. Alternatively, if metric is a callable function, it is 
         called on each pair of instances (rows) and the resulting value recorded. The 
         callable should take two arrays from X as input and return a value indicating 
         the distance between them. The default is “euclidean” which is interpreted as 
         squared euclidean distance.

init : string or numpy array, optional (default: “random”)

       Initialization of embedding. Possible options are ‘random’, ‘pca’, and a numpy 
       array of shape (n_samples, n_components). PCA initialization cannot be used with 
       precomputed distances and is usually more globally stable than random 
       initialization.

random_state : int, RandomState instance or None, optional (default: None)

              If int, random_state is the seed used by the random number generator; If 
              RandomState instance, random_state is the random number generator; If None, 
              the random number generator is the RandomState instance used by np.random. 
              Note that different initializations might result in different local minima 
              of the cost function.



Other parameters:

early_exaggeration : float, optional (default: 12.0)

                     Controls how tight natural clusters in the original space are in the 
                     embedded space and how much space will be between them. For larger 
                     values, the space between natural clusters will be larger in the 
                     embedded space. Again, the choice of this parameter is not very 
                     critical. If the cost function increases during initial 
                     optimization, the early exaggeration factor or the learning rate 
                     might be too high.

n_iter_without_progress : int, optional (default: 300)

                          Maximum number of iterations without progress before we abort 
                          the optimization, used after 250 initial iterations with early 
                          exaggeration. Note that progress is only checked every 50 
                          iterations so this value is rounded to the next multiple of 50.

                          New in version 0.17: parameter n_iter_without_progress to 
                          control stopping criteria.


min_grad_norm : float, optional (default: 1e-7)

                 If the gradient norm is below this threshold, the optimization will be 
                 stopped.

method : string (default: ‘barnes_hut’)

         By default the gradient calculation algorithm uses Barnes-Hut approximation 
         running in O(NlogN) time. method=’exact’ will run on the slower, but exact, 
         algorithm in O(N^2) time. The exact algorithm should be used when nearest-
         neighbor errors need to be better than 3%. However, the exact method cannot 
         scale to millions of examples.

         New in version 0.17: Approximate optimization method via the Barnes-Hut.

angle : float (default: 0.5)

        Only used if method=’barnes_hut’ This is the trade-off between speed and accuracy 
        for Barnes-Hut T-SNE. ‘angle’ is the angular size (referred to as theta in [3]) 
        of a distant node as measured from a point. If this size is below ‘angle’ then it 
        is used as a summary node of all points contained within it. This method is not 
        very sensitive to changes in this parameter in the range of 0.2 - 0.8. Angle less 
        than 0.2 has quickly increasing computation time and angle greater 0.8 has 
        quickly increasing error.
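Putting the important parameters together, here is a minimal sketch of a non-default configuration; the data and parameter values are purely illustrative, not recommendations:

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 64)            ## hypothetical data: 500 samples, 64 features

data_tsne = TSNE(n_components=2,
                 perplexity=50.0,      ## larger datasets usually need a larger perplexity
                 learning_rate=500.0,  ## within the usual [10.0, 1000.0] range
                 n_iter=2000,          ## should be at least 250
                 init='pca',           ## usually more globally stable than 'random'
                 random_state=42       ## fix the seed for reproducibility
                 ).fit_transform(X)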

1.2 Compressing image data

t-SNE only accepts input arrays of at most two dimensions, i.e. shape (n_samples, n_features), so each image must first be flattened into a one-dimensional vector before it can be compressed.

Here is a result image from one of my experiments (not optimal); it can still be improved by tuning the parameters.

[Figure 1: t-SNE result from the experiment]
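A minimal sketch of the flattening step, assuming a hypothetical batch of grayscale images; the array names and sizes are illustrative:

import numpy as np
from sklearn.manifold import TSNE

images = np.random.rand(100, 28, 28)   ## hypothetical: 100 grayscale 28x28 images

## Flatten each image into a 1-D vector so the input to fit_transform
## has shape (n_samples, n_features).
flat = images.reshape(images.shape[0], -1)

data_tsne = TSNE(n_components=2, random_state=0).fit_transform(flat)
print(data_tsne.shape)                 ## (100, 2)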

1.3 How to choose t-SNE parameters

The choice of parameters is critical; badly chosen parameters can mislead your interpretation of the data. See [3, 4] for details.

Summarizing [3]:

i) Parameters matter a lot.

ii) Cluster sizes in a t-SNE plot mean nothing: they do not reflect the sizes of the original clusters.

iii) Distances between clusters in a t-SNE plot do not reflect the distances between the original clusters.

iv) Random data may not look so random anymore after t-SNE.

v) Sometimes the shapes of the original clusters can be recognized in a t-SNE plot.

vi) For the same data, draw several plots and compare them, so that randomness does not distort your understanding of the data (see the sketch below).
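A minimal sketch of such a comparison, embedding the same data at several perplexity values; the synthetic blobs and the chosen values are purely illustrative:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

## Hypothetical data: two Gaussian blobs in 10 dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 10), rng.randn(50, 10) + 5.0])
y = np.array([0] * 50 + [1] * 50)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, s=10)
    ax.set_title('perplexity = %d' % perp)
plt.show()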

1.4 Visualizing t-SNE results with matplotlib

Building on [2], small tricks for display and saving have been added.

 
import matplotlib.pyplot as plt
import numpy as np


def plot_embedding(data, label, title):
    ## Rescale the embedding to [0, 1] in each dimension.
    x_min, x_max = np.min(data, 0), np.max(data, 0)
    data = (data - x_min) / (x_max - x_min)

    fig = plt.figure()
    ax = plt.subplot(111)
    ## Draw each sample as its label text at the embedded position.
    for i in range(data.shape[0]):
        plt.text(data[i, 0], data[i, 1], str(label[i]),
                 color='blue',  ## or: color=plt.cm.Set1(np.random.randint(10))
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([])
    plt.yticks([])
    plt.title(title)
    return fig


plt.rcParams['figure.figsize'] = (20, 20)  ## displayed figure size
fig = plot_embedding(data_tsne, labelNumpy, 't-sne test')
### plt.savefig() must come before plt.show(); otherwise a blank image is saved.
plt.savefig('t-sne-test.jpg')
plt.show()

 

[References]

[1] Official documentation: http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

[2] Displaying t-SNE results with matplotlib: https://www.deeplearn.me/2137.html

[3] Choosing t-SNE parameters: https://distill.pub/2016/misread-tsne/

[4] Choosing t-SNE parameters: https://mp.weixin.qq.com/s/cnzQ7XepftDOZXslCf1MUA

[5] Paper: van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.

[6] Paper: van der Maaten, L.J.P. t-Distributed Stochastic Neighbor Embedding. http://homepage.tudelft.nl/19j49/t-SNE.html

[7] Paper: L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014. http://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf
