10X Single-Cell (10X Spatial Transcriptomics) Data Analysis: Finding the Optimal Number of Clusters K

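When analyzing data, most of us are not sure how many clusters are reasonable and simply run everything with default parameters. So today I'll share a method that helps you choose the best K. We'll go through the code and explain it as we go. The reference is "MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data", published in Genome Biology (impact factor around 13).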

A quick look at the principle

1. First, MultiK takes a gene expression matrix as input, in which cells are the columns and genes are the rows. Each entry of the input matrix corresponds to the expression of a gene in a cell. MultiK subsamples 80% of the cells from the input preprocessed data matrix and applies the standard Seurat pipeline on the subsampled data matrix 100 times over 40 resolution parameters (from 0.05 to 2.00 with step size 0.05; thus, 4000 subsampling runs in total: 40 resolution parameters × 100 subsamples). During each run, features are reselected to cluster the cells. Then, for each K, MultiK aggregates all the clustering runs that give rise to the same K groups, regardless of the resolution parameter, and computes a consensus matrix. MultiK then evaluates the consensus of clustering using two metrics: (1) for each K, the frequency of runs where that K is observed, and (2) the relative Proportion of Ambiguous Clustering (rPAC) score for each K, which is a variation of the PAC score. (PAC quantifies the proportion of entries in the consensus matrix strictly between the lower and upper bounds that determine ambiguity.) The rPAC criterion addresses the upward bias of PAC towards higher K by better handling the proportion of zeros in the consensus matrix. Combining both measures, MultiK produces a scatter plot that shows the relationship between the frequency of K and (1 - rPAC) for each observed K.
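To make the consensus matrix and the PAC idea concrete, here is a minimal base-R sketch on a toy example. It is not MultiK's own code: the toy labels, the bounds 0.1 and 0.9, and the reading of rPAC as "PAC divided by the fraction of non-zero entries" are all assumptions made for illustration.

```r
# Toy data: 6 cells (rows) x 4 subsampling runs (columns).
# NA means the cell was not sampled in that run.
labels <- cbind(
  run1 = c(1, 1, 1, 2, 2, 2),
  run2 = c(1, 1, 2, 2, 2, NA),
  run3 = c(1, 1, 1, 2, NA, 2),
  run4 = c(NA, 1, 1, 2, 2, 2)
)

# Consensus matrix: for each pair of cells, the fraction of runs in which
# both cells were sampled AND ended up in the same cluster.
consensus_matrix <- function(labels) {
  n <- nrow(labels)
  together <- matrix(0, n, n)  # runs where i and j share a cluster
  sampled  <- matrix(0, n, n)  # runs where both i and j were sampled
  for (r in seq_len(ncol(labels))) {
    lab <- labels[, r]
    present <- !is.na(lab)
    both <- outer(present, present, "&")
    same <- outer(lab, lab, "==")
    same[is.na(same)] <- FALSE
    sampled  <- sampled + both
    together <- together + (same & both)
  }
  cons <- together / sampled
  cons[sampled == 0] <- 0  # pairs never sampled together
  cons
}

# PAC: share of pairwise entries strictly between the two bounds.
# rPAC (assumed form): PAC rescaled by the share of non-zero entries.
pac_scores <- function(cons, x1 = 0.1, x2 = 0.9) {
  v <- cons[lower.tri(cons)]
  pac <- mean(v > x1 & v < x2)
  prop_zero <- mean(v == 0)
  rpac <- if (prop_zero < 1) pac / (1 - prop_zero) else NA_real_
  c(PAC = pac, rPAC = rpac)
}

cons <- consensus_matrix(labels)
scores <- pac_scores(cons)
print(round(cons, 2))
print(scores)
```

Entries near 0 or 1 mean a pair of cells is consistently separated or consistently grouped; entries in between are the "ambiguous" ones that PAC counts.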
To determine several multi-scale optimal K candidates (mostly 2, and up to 3), MultiK applies a convex hull approach [24]. This is based on the upper right of the smallest convex polygon that encloses all the points. MultiK takes extreme points from this set and uses a frequency cutoff of 100 to select candidate Ks.
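The convex hull selection can be sketched with base R's `chull()`. The numbers in `runs` are hypothetical, and "upper right" is interpreted here as "hull vertices that no other point beats on both axes", which is my assumption rather than the package's exact rule.

```r
# Hypothetical summary: one row per observed K.
runs <- data.frame(
  k = 2:8,
  freq = c(350, 900, 420, 1100, 600, 80, 50),               # runs that gave this K
  one_minus_rpac = c(0.96, 0.90, 0.70, 0.85, 0.60, 0.97, 0.40)
)

# Vertices of the smallest convex polygon enclosing all points.
hull_idx <- chull(runs$freq, runs$one_minus_rpac)

# A point is dominated if another point is at least as good on both axes
# and strictly better on one of them.
is_dominated <- function(i, x, y) {
  any(x >= x[i] & y >= y[i] & (x > x[i] | y > y[i]))
}
dominated <- vapply(hull_idx, is_dominated, logical(1),
                    x = runs$freq, y = runs$one_minus_rpac)
upper_right <- runs[hull_idx[!dominated], ]

# Apply the frequency cutoff of 100 mentioned above.
candidates <- upper_right[upper_right$freq >= 100, ]
candidate_k <- sort(candidates$k)
print(candidate_k)
```

In this toy data K = 7 sits on the upper right of the hull but is dropped by the frequency cutoff, which is exactly the job of that cutoff: a K that is stable but rarely observed is not kept as a candidate.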
2. Once candidate Ks are determined, MultiK performs a second step: labeling each cluster as either a class or a subclass using Statistical Significance of Clustering (SigClust).
MultiK first constructs a dendrogram of the cluster centroids using hierarchical clustering. Then, MultiK runs SigClust on each pair of terminal clusters. Significant terminal pairs in the dendrogram determine classes, and non-significant pairs are subclasses. For consistency of the whole dendrogram, when any split is significant, all parent splits are also considered to be significant. In this way, MultiK assigns class and subclass labels to each terminal cluster (i.e., the leaves of the dendrogram) based on the SigClust significance. This assessment of cluster significance, after deciding on the value of optimal K, helps elucidate the structural relationships between the identified clusters as well.
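The dendrogram part of this step can be sketched in base R. The toy matrix, the average linkage, the 0.05 threshold, and the made-up p-values are all assumptions for illustration. The significance test itself comes from SigClust and is not reproduced here, and the sketch simplifies to one p-value per split.

```r
# Toy expression matrix: 9 cells (rows) x 2 features, with known cluster labels.
expr <- rbind(
  c(1.0, 1.0), c(1.2, 0.8), c(0.8, 1.2),  # cluster 1
  c(2.0, 2.0), c(2.2, 1.8), c(1.8, 2.2),  # cluster 2 (close to cluster 1)
  c(9.0, 9.0), c(9.2, 8.8), c(8.8, 9.2)   # cluster 3 (far away)
)
cluster <- rep(1:3, each = 3)

# Cluster centroids: per-cluster mean of every feature.
centroids <- rowsum(expr, cluster) / as.vector(table(cluster))

# Hierarchical clustering of the centroids gives the dendrogram.
hc <- hclust(dist(centroids), method = "average")
print(hc$merge)  # negative entries are single clusters, positive are earlier merges

# Made-up p-values, one per split (row of hc$merge).
# A real analysis would take these from SigClust.
split_p <- c(0.01, 0.20)
significant <- split_p < 0.05  # 0.05 is an assumed threshold

# Consistency rule: if a child split is significant, its parent split is too.
for (i in seq_len(nrow(hc$merge))) {
  children <- hc$merge[i, ]
  children <- children[children > 0]  # keep references to earlier merges only
  if (any(significant[children])) significant[i] <- TRUE
}
print(significant)
```

Here the second split starts out non-significant (p = 0.20) but is promoted because the split below it is significant, which is the "all parent splits are also considered to be significant" rule described above.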

First, load the R packages

library(Seurat)
library(sigclust)
# Install MultiK from GitHub once, if it is not already installed:
# devtools::install_github("siyao-liu/MultiK")
library(MultiK)
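MultiK() is the main function: it runs the subsampling and applies Seurat clustering over multiple resolution parameters.
The main function MultiK() takes a Seurat object containing a normalized expression matrix, plus other parameters that fall back to their default values when not specified. MultiK explores a range of resolution parameters in Seurat clustering (from 0.05 to 2.00 with step size 0.05), aggregates all clustering runs that produce the same K groups regardless of the resolution parameter, and computes a consensus matrix for each K.
Note: MultiK reselects the highly variable genes in each subsampling run. In addition, by default MultiK uses 30 principal components and 20 K-nearest neighbors in Seurat clustering.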
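To get a feel for the size of this search, here is a small base-R sketch of the run plan described above. The cell names and the seed are hypothetical, and the Seurat steps are only listed in comments rather than executed.

```r
# Resolution grid: 0.05 to 2.00 in steps of 0.05.
resolutions <- seq(0.05, 2, by = 0.05)
reps <- 100

# One row per clustering run: every subsample is clustered at every resolution.
run_plan <- expand.grid(rep = seq_len(reps), resolution = resolutions)
n_runs <- nrow(run_plan)

# Each subsample keeps 80% of the cells, drawn without replacement.
set.seed(1)                                # arbitrary seed for reproducibility
cells <- paste0("cell_", 1:500)            # hypothetical cell barcodes
subsample <- sample(cells, size = round(0.8 * length(cells)))

# In each run, the usual Seurat steps would be applied to the subsample:
# FindVariableFeatures, ScaleData, RunPCA, FindNeighbors (30 PCs, 20 neighbors),
# then FindClusters at the run's resolution.
cat(length(resolutions), "resolutions x", reps, "reps =", n_runs, "runs\n")
cat(length(subsample), "of", length(cells), "cells per subsample\n")
```

This is why the step takes a long time, and why a small `reps` is handy for a first test run.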

Running the code

# sc_RDS is the path to your saved Seurat object (.rds file)
seu <- readRDS(sc_RDS)
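Step 1: Run the main MultiK algorithm to determine the optimal Ks
Run subsampling and consensus clustering to generate the output for evaluation (this step can take a long time). For demonstration purposes, we run only 10 reps here. For real data, at least 100 reps are recommended.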
multik <- MultiK(seu, reps=10)
Make MultiK diagnostic plots:
DiagMultiKPlot(multik$k, multik$consensus)
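The first metric in the diagnostic plot, the frequency of each K, is easy to check by hand. The sketch below uses a simulated vector `ks` that stands in for the per-run K values. I am assuming that is what `multik$k` holds, so please check this against your own object.

```r
# Simulated stand-in for the per-run cluster numbers (400 runs = 10 reps x 40 resolutions).
ks <- rep(c(2, 3, 4, 5), times = c(60, 180, 40, 120))

# How often was each K observed?
k_freq <- table(ks)
print(k_freq)

# Keep only the Ks observed at least 100 times (the cutoff quoted from the paper,
# which was described there for 4000 runs).
frequent_k <- as.integer(names(k_freq)[k_freq >= 100])
print(frequent_k)
```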

Step 2: Assign classes and subclasses

Get the clustering labels at optimal K level:
clusters <- getClusters(seu, 3)
Run SigClust at optimal K level:
pval <- CalcSigClust(seu, clusters$clusters)
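Make the diagnostic plots (a dendrogram of the cluster centroids with the pairwise SigClust p-values mapped onto the nodes, and a heatmap of the pairwise SigClust p-values):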
PlotSigClust(seu, clusters$clusters, pval)

Yes, this is the result we want.

Life is good, and it's even better with you.
