原文链接Chapter 6 Quality Control

一、基础知识

1 为什么QC

因为低质量的cell通常会contribute to misleading results in downstream analyses

form their distinct cluster
distort PCA
cause "upregulated" genes

2 QC指标

细胞测序文库：越大越好
The library size is defined as the total sum of counts across all relevant features for each cell.
非0基因数量：越多越好
The number of expressed features in each cell is defined as the number of endogenous genes with non-zero counts for each cell.
ERCC含量：一般取小
The proportion of reads mapped to spike-in transcripts is calculated relative to the total count across all features (including spike-ins) for each cell.

spike-in是已知浓度的外源RNA分子。在单细胞裂解液中加入spike-in后,再进行反转录。最广泛使用的spike-in是由External RNA Control Consortium (ERCC)提供的。ERCC含量大，则说明total sum变小

Mito线粒体基因含量：一般取小，意义同ERCC
the proportion of reads mapped to genes in the mitochondrial genome

3、示例数据

#scRNAseq包的一个公共数据
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b") 
sce.416b$block <- factor(sce.416b$block)
sce.416b
dim(sce.416b) #46604个基因，192个细胞

二、QC metrics 计算

如上所说主要是基于测序文库，基因数量，ERCC含量/MIto含量三个指标进行筛选。
一般测序文库，基因数量可根据原始表达矩阵计算，ERCC/Mito含量则需要进行则需要根据基因名判断。利用scater包可自动计算每个细胞上述指标
以mito含量为例，本例为ensemble ID。可利用AnnotationHub包获取基因的染色体信息，再进行相关含量统计。

## 注释线粒体基因
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
chr.loc <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
                  keytype="GENEID", column="SEQNAME")
sort(table(chr.loc))
is.mito.alt <- which(chr.loc=="MT")
table(is.mito.alt)
#返回所有是线粒体基因的行名号
#biomaRt法
library("biomaRt")
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl") #人
attributes = listAttributes(ensembl)
test <- getBM(attributes=c('ensembl_gene_id', 'chromosome_name', 'hgnc_symbol'), 
      mart = ensembl)

2-1

如果基因名为SYMBOL ID，那么我们可以直接grepl("^MT-", rownames(sce))

# 基于线粒体注释，利用scater包直接计算上述三者指标
library(scater)
df <- perCellQCMetrics(sce.416b, subsets=list(Mito=is.mito.alt))
dim(df) ##192个细胞里表达情况（包括线粒体的相关情况）
colnames(df)
df[1:4,1:4]

2-2

关于 percent_top_number，表示的是一个比例。以percent_top_50为例，表示某细胞sum数前50个gene的sum和除以该细胞的counts sum总和。

上述得到的是独立的结果数据表，可以利用函数，将结果注释到sce对象的cilData slot 里

colnames(colData(sce.416b))
sce.416b <- addPerCellQC(sce.416b, subsets=list(Mito=is.mito))
colnames(colData(sce.416b))
#除Mito信息外，也还自动注释了ERCC信息

计算所有细胞的这三个及指标，就可以筛选低质量细胞了。注意筛选的前体是我们假设the QC metrics are independent of the biological state of each cell. Poor values (e.g., low library sizes, high mitochondrial proportions) are presumed to be driven by technical factors rather than biological processes, meaning that the subsequent removal of cells will not misrepresent the biology in downstream analyses.

三、鉴定低质量细胞

这一步的关键是如何设置异常的指标阈值

1、fixed thresholds

即设置固定的阈值，一般通过经验判断，例如（1）library sizes below 100,000 reads；（2）express fewer than 5,000 genes；（3）have spike-in proportions above 10%; or have mitochondrial proportions above 10%.

qc.lib <- df$sum < 1e5
qc.nexprs <- df$detected < 5e3
qc.spike <- df$altexps_ERCC_percent > 10
qc.mito <- df$subsets_Mito_percent > 10
discard <- qc.lib | qc.nexprs | qc.spike | qc.mito
DataFrame(LibSize=sum(qc.lib), NExprs=sum(qc.nexprs),
    SpikeProp=sum(qc.spike), MitoProp=sum(qc.mito), Total=sum(discard))

3-1

2、adaptive thresholds

即根据指标里所有细胞的分布，判断异常点isOutlier。这就假设了大部分都是高质量细胞
原理：if it is more than 3 MADs from the median. 一般对log转换后的指标进行筛选

qc.lib2 <- isOutlier(df$sum, log=TRUE, type="lower")  
#lower表示取小，结果返回的是逻辑值T/F
attr(qc.lib2, "thresholds") #查看筛选阈值
qc.nexprs2 <- isOutlier(df$detected, log=TRUE, type="lower")
attr(qc.nexprs2, "thresholds")
qc.spike2 <- isOutlier(df$altexps_ERCC_percent, type="higher")
attr(qc.spike2, "thresholds")
qc.mito2 <- isOutlier(df$subsets_Mito_percent, type="higher")
attr(qc.mito2, "thresholds")
discard2 <- qc.lib2 | qc.nexprs2 | qc.spike2 | qc.mito2
#整合
DataFrame(LibSize=sum(qc.lib2), NExprs=sum(qc.nexprs2),
    SpikeProp=sum(qc.spike2), MitoProp=sum(qc.mito2), Total=sum(discard2))

3-2

可利用scater提供的函数，一次完成上述计算

reasons <- quickPerCellQC(df, percent_subsets=c("subsets_Mito_percent",
    "altexps_ERCC_percent"))
#返回维度与df相同的数据框，内容变为逻辑值，即是否应该抛弃
colSums(as.matrix(reasons))

tips：batch factor

考虑批次效应可能带来的错误筛选，计算予以调整，在上面的quickPerCellQC可以设置batch参数

batch <- paste0(sce.416b$phenotype, "-", sce.416b$block)
batch.reasons <- quickPerCellQC(df, percent_subsets=c("subsets_Mito_percent",
                                                      "altexps_ERCC_percent"), batch=batch)
colSums(as.matrix(batch.reasons))

3-3

注意这里同样也假设所有的batch都是高质量的，如果已知某一个batch可能都有问题，那么就不应该调整这个batch，以免影响整体的判断。具体可分别判断每个batch的阈值，观察比较有无明显异常值。

四、绘图观察剔除细胞

1、指标细胞阈值可视化

sce.416b$phenotype <- ifelse(grepl("induced", sce.416b$phenotype),
                             "induced", "wild type")
sce.416b$discard <- reasons$discard

gridExtra::grid.arrange(
  plotColData(sce.416b, x="block", y="sum", colour_by="discard",
              other_fields="phenotype") + facet_wrap(~phenotype) + 
    scale_y_log10() + ggtitle("Total count"),
  plotColData(sce.416b, x="block", y="detected", colour_by="discard", 
              other_fields="phenotype") + facet_wrap(~phenotype) + 
    scale_y_log10() + ggtitle("Detected features"),
  plotColData(sce.416b, x="block", y="subsets_Mito_percent", 
              colour_by="discard", other_fields="phenotype") + 
    facet_wrap(~phenotype) + ggtitle("Mito percent"),
  plotColData(sce.416b, x="block", y="altexps_ERCC_percent", 
              colour_by="discard", other_fields="phenotype") + 
    facet_wrap(~phenotype) + ggtitle("ERCC percent"),
  ncol=1
)

4-1

2、线粒体基因相关性绘图

plotColData(sce.416b, x="sum", y="subsets_Mito_percent", colour_by="discard")

如图表明，剔除的细胞都是sum小，而造成mito含量相对提高的异常细胞。正常细胞是mito含量与sum成正比的。

4-2

五、去除低质量细胞

straightforward!

filtered <- sce.416b[,!reasons$discard]

最后的验证

whether an entire cell type is inadvertently discarded. 即被删除的细胞是否为一种特殊的细胞类型，而非低质量细胞

lost <- calculateAverage(counts(sce.416b)[,!discard])
kept <- calculateAverage(counts(sce.416b)[,discard])

library(edgeR)
logged <- cpm(cbind(lost, kept), log=TRUE, prior.count=2)
logFC <- logged[,1] - logged[,2]
abundance <- rowMeans(logged)
plot(abundance, logFC, xlab="Average count", ylab="Log-FC (lost/kept)", pch=16)
points(abundance[is.mito], logFC[is.mito], col="dodgerblue", pch=16)

If the discarded pool is enriched for a certain cell type, we should observe increased expression of the corresponding marker genes. 如图点均在水平轴附近，说明并非 filter out a cell type in the 416B dataset.

5

关于empty droplets

一般来说，我们拿到的数据都是过滤掉empty droplets的了。但如果看到sce对象有好几万个cell，就可能是未过滤empty droplets
关于判断、过滤empty droplets，就简单介绍下相关函数，具体原理见原文

e.out <- emptyDrops(counts(sce.pbmc))
# See ?emptyDrops for an explanation of why there are NA values.
summary(e.out$FDR <= 0.001)
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

以上是第六章QC部分的简单流程笔记。原文还介绍了判断是否为empty droplets (第五点)，以及保留低质量细胞的意义（第七点），详见Chapter 6 Quality Control
本系列笔记基于OSCA全流程的大致流程梳理，详细原理可参考原文。如有错误，恳请指正！
此外还有刘小泽老师整理的全文翻译笔记，详见目录

1、Quality control