What is a cell type?
Scalability: the number of cells profiled in scRNA-seq experiments has grown by several orders of magnitude (from 10^2 to 10^6).
Guo, Minzhe, Hui Wang, S. Steven Potter, Jeffrey A. Whitsett, and Yan Xu. 2015. “SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis.” PLoS Comput Biol 11 (11). Public Library of Science (PLoS): e1004575. doi:10.1371/journal.pcbi.1004575.
Žurauskienė, Justina, and Christopher Yau. 2016. “pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.” BMC Bioinformatics 17 (1). Springer Nature. doi:10.1186/s12859-016-0984-y.
Kiselev, Vladimir Yu, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, et al. 2017. “SC3: Consensus Clustering of Single-Cell RNA-Seq Data.” Nat Meth 14 (5). Springer Nature: 483–86. doi:10.1038/nmeth.4236.
SNN-Cliq is a graph-based method. First, the method identifies the k-nearest-neighbours of each cell according to the distance measure. This is used to calculate the number of Shared Nearest Neighbours (SNN) between each pair of cells. A graph is built by placing an edge between two cells if they have at least one SNN. Clusters are defined as groups of cells with many edges between them, using a “clique” method. SNN-Cliq requires several parameters to be defined manually.
Xu, Chen, and Zhengchang Su. 2015. “Identification of Cell Types from Single-Cell Transcriptomes Using a Novel Clustering Method.” Bioinformatics 31 (12). Oxford University Press (OUP): 1974–80. doi:10.1093/bioinformatics/btv088.
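The graph construction described for SNN-Cliq can be sketched in a few lines of plain Python. This is a simplified illustration, not the SNN-Cliq implementation: the function names, the Euclidean distance, and the toy coordinates are assumptions, and the clique-finding step that turns the graph into clusters is left out.

```python
import math
from itertools import combinations

def knn(points, k):
    """For each point, return the set of indices of its k nearest neighbours
    (Euclidean distance, the point itself excluded)."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours

def snn_graph(points, k):
    """Map each pair (i, j) to its number of shared nearest neighbours,
    keeping only pairs that share at least one neighbour."""
    nn = knn(points, k)
    edges = {}
    for i, j in combinations(range(len(points)), 2):
        shared = len(nn[i] & nn[j])
        if shared >= 1:
            edges[(i, j)] = shared
    return edges

# Toy data (hypothetical): two well-separated groups of three cells in 2-D.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = snn_graph(cells, k=2)
for (i, j), shared in sorted(graph.items()):
    print(f"cell {i} - cell {j}: {shared} shared neighbour(s)")
```

With these toy coordinates every edge connects two cells from the same group, which is the structure a clique-based step would then pick up as clusters.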
Seurat clustering is based on a community detection approach similar to SNN-Cliq and to one previously proposed for analyzing CyTOF data (Levine et al. 2015). Seurat has become more like an all-in-one tool for scRNA-seq data analysis.
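To compare two sets of clustering labels we can use the adjusted Rand index, which measures the similarity between two clusterings. A value of 1 means the two clusterings are identical and a value around 0 means the level of similarity expected by chance. In practice the values usually lie in [0,1], although the index can be slightly negative when the agreement is worse than chance.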
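As a sketch of how the adjusted Rand index is computed, the following uses the standard pair-counting formula in plain Python. The function name and the toy label vectors are illustrative, and the code assumes at least two cells.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same cells (pair counting)."""
    n = len(labels_a)
    # Pairs of cells placed together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells placed together in each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)   # agreement expected by chance
    maximum = (together_a + together_b) / 2           # largest possible agreement
    if maximum == expected:                           # degenerate case, e.g. one cluster each
        return 1.0
    return (together_both - expected) / (maximum - expected)

# Same partition with the cluster names swapped: perfect agreement.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every pair grouped in one clustering is split in the other: worse than chance.
ari_opposed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(ari_same, ari_opposed)
```

The index depends only on which cells are grouped together, not on the cluster names, so relabelled but otherwise identical clusterings score 1.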
For most genes detected in an scRNA-seq experiment, the differences in measured expression between cells are due only to technical noise. One consequence of this is that technical noise and batch effects can obscure the biological signal of interest.
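Feature selection is therefore often beneficial for downstream analysis: it increases the signal-to-noise ratio of the data and reduces the computational complexity. Feature selection usually refers to unsupervised methods, which need no prior information such as cell-type labels or biological groups. Differential expression, in contrast, can be considered a form of supervised feature selection, because it uses the known biological label of each sample to identify features (e.g. genes) that are expressed at different levels across groups.
1. Normalization for library size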
scRNA-seq data can be quality-controlled and normalized for library size using M3Drop, which removes cells with few detected genes, removes undetected genes, and converts raw counts to CPM.
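The three cleaning steps just described can be illustrated with a small plain-Python sketch. It is not M3Drop's code: the function name, the `min_detected_genes` threshold, and the toy matrix are assumptions chosen for illustration.

```python
def filter_and_cpm(counts, min_detected_genes):
    """counts: genes x cells matrix of raw counts (list of per-gene lists).
    Returns a CPM matrix after cell and gene filtering.
    Assumes min_detected_genes >= 1, so every kept cell has a non-zero library size."""
    n_cells = len(counts[0])
    # 1. Remove cells in which too few genes are detected (count > 0).
    detected = [sum(1 for gene in counts if gene[c] > 0) for c in range(n_cells)]
    keep = [c for c in range(n_cells) if detected[c] >= min_detected_genes]
    filtered = [[gene[c] for c in keep] for gene in counts]
    # 2. Remove genes that are not detected in any remaining cell.
    filtered = [gene for gene in filtered if any(v > 0 for v in gene)]
    # 3. Convert to counts per million using each cell's library size.
    lib_sizes = [sum(gene[c] for gene in filtered) for c in range(len(keep))]
    return [[1e6 * gene[c] / lib_sizes[c] for c in range(len(keep))]
            for gene in filtered]

# Toy matrix (hypothetical): 4 genes x 3 cells.
raw = [
    [10, 0, 0],
    [30, 5, 0],
    [0, 0, 0],
    [60, 15, 1],
]
cpm = filter_and_cpm(raw, min_detected_genes=2)
for gene in cpm:
    print(gene)
```

In this toy matrix the third cell (one detected gene) and the third gene (never detected) are dropped, and each remaining cell's values sum to one million.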
2. For unsupervised feature selection there are two main approaches: Highly Variable Genes and High Dropout Genes.
2.1 Highly Variable Genes (HVG)
HVG assumes that if genes have large differences in expression across cells some of those differences are due to biological difference between the cells rather than technical noise. However, because of the nature of count data, there is a positive relationship between the mean expression of a gene and the variance in the read counts across cells. This relationship must be corrected for to properly identify HVGs.
The figure below uses rowMeans and rowVars to plot the relationship between mean expression and variance for all genes in the dataset (both axes on a log scale).
A popular method to correct for the relationship between variance and mean expression is the Brennecke method ("Accounting for technical noise in single-cell RNA-seq experiments", Philip Brennecke, Simon Anders, Jong Kyoung Kim, Aleksandra A Kołodziejczyk, Xiuwei Zhang et al.).
To use the Brennecke method, we first normalize for library size, then calculate the mean and the squared coefficient of variation (variance divided by the squared mean expression). A quadratic curve is fit to the relationship between these two variables for the ERCC spike-ins, and then a chi-square test is used to find genes significantly above the curve. This method is included in the M3Drop package as the Brennecke_getVariableGenes(counts, spikes) function. However, this dataset does not contain spike-ins, so we will use the entire dataset to estimate the technical noise.
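To make the quantities concrete, here is a rough plain-Python sketch that computes the mean and squared coefficient of variation per gene and compares each gene with a fitted noise trend. It is not the M3Drop implementation: the trend form CV² = a0 + a1/mean, the ordinary least-squares fit, the function names, and the toy values are all assumptions of this sketch, and the chi-square test is omitted.

```python
from statistics import mean, variance

def mean_and_cv2(expr):
    """Per-gene (mean, squared coefficient of variation) for a genes x cells matrix
    that has already been normalized for library size."""
    stats = []
    for gene in expr:
        m = mean(gene)
        stats.append((m, variance(gene) / m ** 2))
    return stats

def fit_noise_trend(stats):
    """Least-squares fit of CV^2 = a0 + a1 / mean over the supplied genes."""
    xs = [1.0 / m for m, _ in stats]
    ys = [cv2 for _, cv2 in stats]
    x_bar, y_bar = mean(xs), mean(ys)
    a1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - a1 * x_bar, a1

# Toy normalized expression (hypothetical): four genes whose variance equals
# their mean, plus one gene (last row) with much larger spread at the same mean.
expr = [
    [1.5, 4.5, 1.5, 4.5],
    [9, 15, 9, 15],
    [42, 54, 42, 54],
    [180, 204, 180, 204],
    [18, 78, 18, 78],
]
stats = mean_and_cv2(expr)
a0, a1 = fit_noise_trend(stats)   # no spike-ins here, so fit on all genes
ratios = [cv2 / (a0 + a1 / m) for m, cv2 in stats]
most_variable = max(range(len(ratios)), key=lambda g: ratios[g])
print(most_variable, [round(r, 2) for r in ratios])
```

Genes whose observed CV² sits well above the fitted trend are the HVG candidates; a real analysis would replace the simple ratio with the chi-square test and multiple-testing correction.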
In the figure below the red curve is the fitted technical noise model and the dashed line is the 95% CI. Pink dots are the genes with significant biological variability after multiple-testing correction.
2.2 Dropout Genes
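An alternative to finding HVGs is to identify genes with unexpectedly high numbers of zeros (dropout genes). Zeros are a dominant feature of single-cell RNA-seq data, typically accounting for over half of the entries in the final expression matrix. These zeros arise in two main ways: first, mRNAs failing to be reverse transcribed; second, for UMI-tagged data, low sequencing coverage.
I. mRNAs failing to be reverse transcribed
Here the zeros result from mRNAs failing to be reverse transcribed (see "Modelling dropouts for feature selection in scRNASeq experiments", Andrews and Hemberg, 2016). Reverse transcription is an enzyme reaction, so it can be modelled using the Michaelis-Menten equation:
P_dropout = 1 - S / (K + S)
where S is the mRNA concentration in the cell (estimated as the average expression of the gene) and K is the Michaelis-Menten constant.
Because the Michaelis-Menten equation is a convex non-linear function, genes that are differentially expressed between cell populations in the dataset are shifted up/right of the Michaelis-Menten model (see figure below).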
Add log="x" to the plot call to see how this looks on the log scale, which is used in M3Drop figures. Exercise: produce the same plot with different expression levels (S1 and S2) and/or mixtures (mix).
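The convexity argument can be checked numerically. The sketch below assumes the dropout model P = 1 - S / (K + S); the values of K, S1, S2 and mix are arbitrary illustrative choices, not estimates from any dataset.

```python
def mm_dropout(s, k):
    """Michaelis-Menten dropout probability for mean expression s and constant k."""
    return 1.0 - s / (k + s)

K = 49.0                 # illustrative Michaelis-Menten constant
S1, S2 = 10.0, 1000.0    # expression of one gene in two cell populations
mix = 0.5                # fraction of cells that belong to population 1

# What we observe for a gene that is differentially expressed between the populations:
s_observed = mix * S1 + (1 - mix) * S2
p_observed = mix * mm_dropout(S1, K) + (1 - mix) * mm_dropout(S2, K)

# What the curve predicts for a gene with the same mean expression everywhere:
p_expected = mm_dropout(s_observed, K)

print(f"mean expression {s_observed:.0f}: "
      f"observed dropout {p_observed:.3f}, curve predicts {p_expected:.3f}")
```

Because the curve is convex, the averaged point has a higher dropout rate than the curve predicts at the same mean expression, so the gene appears above and to the right of the fitted model.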
We use M3Drop to identify significant outliers to the right of the MM curve. We also apply 1% FDR multiple testing correction:
II. Low sequencing coverage
An alternative method is contained in the M3Drop package that is tailored specifically for UMI-tagged data which generally contains many zeros resulting from low sequencing coverage in addition to those resulting from insufficient reverse-transcription. This model is the Depth-Adjusted Negative Binomial (DANB). This method describes each expression observation as a negative binomial model with a mean related to both the mean expression of the respective gene and the sequencing depth of the respective cell, and a variance related to the mean-expression of the gene.
Unlike the Michaelis-Menten and HVG methods, there isn’t a reliable statistical test for features selected by this model, so we will consider the top 1500 genes instead.
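A depth-adjusted negative binomial expectation can be sketched as follows. This is a loose illustration, not the M3Drop implementation, and it makes several assumptions: the mean of each observation is taken as gene total × cell total / grand total, a single dispersion parameter `size` is shared by all genes, and genes are ranked by how far their observed dropout rate exceeds the expectation. The function names and toy counts are hypothetical.

```python
def expected_dropout(counts, size):
    """Expected dropout rate per gene under a negative binomial whose mean is
    gene_total * cell_total / grand_total (assumed form), with shared dispersion `size`."""
    gene_totals = [sum(gene) for gene in counts]
    cell_totals = [sum(gene[c] for gene in counts) for c in range(len(counts[0]))]
    grand_total = sum(gene_totals)
    rates = []
    for t_gene in gene_totals:
        # P(count = 0) for a negative binomial with mean mu: (1 + mu / size) ** (-size)
        p_zero = [(1.0 + (t_gene * t_cell / grand_total) / size) ** (-size)
                  for t_cell in cell_totals]
        rates.append(sum(p_zero) / len(p_zero))
    return rates

def observed_dropout(counts):
    """Fraction of cells with a zero count, per gene."""
    return [sum(1 for v in gene if v == 0) / len(gene) for gene in counts]

# Toy UMI counts (hypothetical): 3 genes x 4 cells; gene 1 is switched off in half the cells.
umi = [
    [5, 5, 5, 5],
    [0, 0, 10, 10],
    [2, 3, 2, 3],
]
expected = expected_dropout(umi, size=1.0)
excess = [obs - exp for obs, exp in zip(observed_dropout(umi), expected)]
ranking = sorted(range(len(umi)), key=lambda g: excess[g], reverse=True)
print(ranking, [round(e, 3) for e in excess])
```

Taking the top of such a ranking mirrors the choice of keeping a fixed number of genes when no reliable significance test is available.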