Literature Review (1): A short but comprehensive comparison of tSNE, UMAP and PCA

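Another big-name group recently published in Nature Protocols, under the title "Tutorial: guidelines for annotating single-cell transcriptomic maps using automated and manual methods". The paper is mainly about how to do cell type annotation well for single-cell data, and it covers three steps: automatic annotation, manual annotation, and final verification. What caught my eye, though, was a box in the paper that lays out the authors' understanding of tSNE and UMAP. The original text follows: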

An scRNA-seq data set is typically visualized as a 2D scatter plot where cells (points) with similar transcriptomes are placed near each other. This 2D representation is projected from a higher dimensional space where each cell is described by the expression of thousands of genes, each of which is considered a separate dimension. The three most popular projection methods used for scRNA-seq data are t-SNE, UMAP and PCA.

t-SNE (Fig. 6c) is a nonlinear projection that preserves local groups of similar cells, while equalizing the density of cells within each group. The scale of a ‘local group’ is controlled by the ‘perplexity’ parameter, with higher values creating larger local groups. t-SNE effectively visualizes distinct robust clusters, making it easy to observe discrete cell types; however, global relationships between cell types are not maintained, and thus cluster-to-cluster relationships cannot be inferred and may be misleading. Cell subtypes can be combined into one large cluster or split into distinct plot regions depending on the perplexity.

UMAP (Extended Data Fig. 1) is a nonlinear projection method that differentiates discrete cell clusters [20]. UMAP is typically regarded as better for visualizing global relationships and gradients than t-SNE, although these differences are probably due to default parameters. UMAP is often less computationally intensive to run than t-SNE.

PCA (Fig. 6b) performs a linear transformation of normalized and scaled scRNA-seq data, to identify independent principal components (PCs) that capture major axes of variation in the data, which could represent biological factors, like cell types and states, or technical factors. PCs are ranked in decreasing order of variance, and typically the first two PCs are used to visualize the data, but more can be considered to detect more subtle expression patterns between cells. PCA can be useful for visualizing cell gradients and states.

Although these methods visually group similar cells and help visualize clusters, they do not define clusters and, therefore, are not clustering algorithms. Cell-clustering algorithm output is typically visualized as colors on the plot, and these colors may or may not correspond to patterns observed in the 2D plot.

The three figures referenced there (Fig. 6b, Fig. 6c and Extended Data Fig. 1) are as follows:

Fig. 6b

Fig. 6c
Extended Data Fig. 1

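In short, tSNE and UMAP are both nonlinear dimensionality reduction methods. tSNE separates different cell clusters from one another very well on the 2D plot, but it does not preserve the global relationships between cell types. The scale of a "local group" on the plot is controlled by the perplexity parameter: the higher the perplexity, the larger the local groups, so cell subtypes may merge into one large cluster or split into separate regions depending on its value.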
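To make the role of perplexity more concrete, here is a minimal NumPy sketch of how tSNE-style methods turn a target perplexity into a neighborhood size for one cell. It is not taken from the paper or from any particular tSNE implementation; the toy data and the helper names (`conditional_probs`, `perplexity`, `sigma_for_perplexity`) are assumptions made up for illustration. The idea is that each cell gets a Gaussian bandwidth chosen so that the perplexity of its neighbor distribution matches the requested value, and a higher perplexity forces a wider bandwidth, that is, a larger "local group".

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """Gaussian affinities of one cell to all other cells, normalized to sum to 1."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits); roughly an effective neighbor count."""
    p = p[p > 0]
    return 2.0 ** (-(p * np.log2(p)).sum())

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, n_iter=100):
    """Bisection (on a log scale) for the bandwidth whose perplexity matches the target."""
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid  # neighborhood too wide, shrink it
        else:
            lo = mid  # neighborhood too narrow, widen it
    return np.sqrt(lo * hi)

# Hypothetical toy data: 200 "cells" described by 10 features.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 10))
# Squared distances from cell 0 to every other cell.
sq_dists = ((points[1:] - points[0]) ** 2).sum(axis=1)

sigma_small = sigma_for_perplexity(sq_dists, target=5)
sigma_large = sigma_for_perplexity(sq_dists, target=50)
print(f"bandwidth at perplexity 5:  {sigma_small:.3f}")
print(f"bandwidth at perplexity 50: {sigma_large:.3f}")
```

The bandwidth found for perplexity 50 is larger than the one for perplexity 5, which is the sense in which a higher perplexity lets each cell "see" more neighbors.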

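By contrast, UMAP is generally regarded as better at showing the relationships between different cell types, although the quoted text notes that this difference is probably due to default parameters. UMAP is also often less computationally intensive to run than tSNE.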

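PCA is a linear dimensionality reduction method. It captures the major axes of variation in the data, and the variance explained by each PC decreases in the order PC_1, PC_2, and so on. This is why downstream tSNE and UMAP analyses can be run on just the leading PCs: in that sense, PCA has already reduced the dimensionality of the data.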
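The PCA steps described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the expression matrix is simulated, the choice of 10 PCs is arbitrary, and real scRNA-seq pipelines normally use a dedicated toolkit rather than a hand-written SVD. The sketch centers and scales each gene, computes the principal components, checks that the variance decreases from PC_1 onward, and keeps only the leading PCs as the reduced input that a later tSNE or UMAP step would receive.

```python
import numpy as np

# Hypothetical toy data: 300 cells x 50 genes, driven by two latent factors plus noise.
rng = np.random.default_rng(0)
n_cells, n_genes = 300, 50
latent = rng.normal(size=(n_cells, 2)) * np.array([5.0, 2.0])
loadings = rng.normal(size=(2, n_genes))
expr = latent @ loadings + rng.normal(size=(n_cells, n_genes))

# Center and scale each gene (column) before PCA.
scaled = (expr - expr.mean(axis=0)) / expr.std(axis=0)

# PCA via singular value decomposition; singular values come back sorted
# in decreasing order, so PCs are already ranked by variance.
u, s, vt = np.linalg.svd(scaled, full_matrices=False)
explained_var = s ** 2 / (n_cells - 1)
explained_ratio = explained_var / explained_var.sum()

# Keep only the leading PCs (10 is an arbitrary choice for this sketch).
n_pcs = 10
pc_scores = u[:, :n_pcs] * s[:n_pcs]  # cell coordinates, shape (n_cells, n_pcs)

print("variance ratio of first 5 PCs:", np.round(explained_ratio[:5], 3))
print("reduced matrix shape:", pc_scores.shape)
```

Plotting the first two columns of `pc_scores` gives the usual PCA scatter plot, while the full `pc_scores` matrix is the kind of reduced representation that would be handed to a nonlinear method.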

Finally, here is the link to the original article: https://www.nature.com/articles/s41596-021-00534-0
