SingleR
教程网址(https://www.bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html)
安装网址(https://bioconductor.org/packages/3.11/bioc/html/SingleR.html)
Package
SingleR 1.2.4
1. Introduction
注释单细胞测序数据。注释原理:首先给定参考集,即已知身份的测序数据(single-cell or bulk),然后,SingleR基于相似性注释测试集数据。
流程:
- 计算测试集中每个样品与参考集中每个样品的相似度(spearman correlation)。这里只考虑marker基因,是来自于参考集中不同label样品两两比较找到的marker基因,合并在一起的合集。
- 以参考集细胞类型为单位,计算per-label score(quantile 0.8 of correlation distribution)
- 对所有label重复这个过程。选取最好分数的标签当做细胞身份。
- 选择性微调:
参考集根据标签微调;使用marker gene重算score;循环
2. Reference - using built-in reference
SingleR自带一些reference,多为bulk数据或者microarray数据。
参考集:eg. HumanPrimaryCellAtlasData()
library(SingleR)
hpca.se <- HumanPrimaryCellAtlasData()
hpca.se
## class: SummarizedExperiment
## dim: 19363 713
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont
测试集: dataset from La Manno et al. (2016).
For the sake of speed, 只选前100
library(scRNAseq)
hESCs <- LaMannoBrainData('human-es')
hESCs <- hESCs[,1:100]
# SingleR() expects log-counts, but the function will also happily take raw
# counts for the test dataset. The reference, however, must have log-values.
library(scater)
hESCs <- logNormCounts(hESCs)
hESCs
class: SingleCellExperiment
dim: 18538 100
metadata(0):
assays(2): counts logcounts
rownames(18538): WASH7P_p1 LINC01002_loc4 ... IL9R_loc1 DDX11L16_loc1
rowData names(0):
colnames(100): 1772122_301_C02 1772122_180_E05 ... 1772122_298_F09 1772122_302_A11
colData names(3): Cell_ID Cell_type Timepoint
reducedDimNames(0):
spikeNames(0):
altExpNames(0):
接下来,利用参考集hpca.se注释测试集hESCs。
pred.hesc <- SingleR(test = hESCs, ref = hpca.se, labels = hpca.se$label.main)
pred.hesc
## DataFrame with 100 rows and 5 columns
## scores first.labels
##
## 1772122_301_C02 0.347652:0.109547:0.123901:... Neuroepithelial_cell
## 1772122_180_E05 0.361187:0.134934:0.148672:... Neuroepithelial_cell
## 1772122_300_H02 0.446411:0.190084:0.222594:... Neuroepithelial_cell
## 1772122_180_B09 0.373512:0.143537:0.164743:... Neuroepithelial_cell
## 1772122_180_G04 0.357341:0.126511:0.141987:... Neuroepithelial_cell
## tuning.scores labels pruned.labels
##
## 1772122_301_C02 0.1824402:0.0991116 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_E05 0.1375484:0.0647134 Neurons Neurons
## 1772122_300_H02 0.2757982:0.1369690 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_B09 0.0851623:0.0819878 Neuroepithelial_cell Neuroepithelial_cell
## 1772122_180_G04 0.1988415:0.1016622 Neuroepithelial_cell Neuroepithelial_cell
※※※ 3. Reference - using single-cell data
以 scRNAseq package里的两个human pancreas 数据集为例。
参考集: data from Muraro et al. (2016)
library(scRNAseq)
sceM <- MuraroPancreasData()
# One should normally do cell-based quality control at this point, but for
# brevity's sake, we will just remove the unlabelled libraries here.
sceM <- sceM[,!is.na(sceM$label)]
sceM <- logNormCounts(sceM)
测试集:data from Grun et al. (2016)
sceG <- GrunPancreasData()
sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
sceG <- logNormCounts(sceG)
sceG <- sceG[,1:100]
※※※ SingleR() 加上marker detection模式来排除单细胞中基因表达的差异,用wilcoxon ranked sum test来确定标签之间的top marker基因。这个过程比较慢,但是相比于默认的marker检测程序(low coverage and median of expression is frequently 0)会更适合单细胞数据。
pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
table(pred.grun$labels)
## acinar beta delta duct
## 53 4 2 41
4. annotation diagnostics
4.1 based on the scores within cells
plotScoreHeatmap(pred.grun)
plotScoreHeatmap(pred.grun,
annotation_col=as.data.frame(colData(sceG)[,"donor",drop=FALSE]))
4.2 based on deltas across cells
to.remove <- pruneScores(pred.grun)
summary(to.remove)
## Mode FALSE TRUE
## logical 96 4
plotScoreDistribution(pred.grun, show = "delta.med", ncol = 3, show.nmads = 3)
new.pruned <- pred.grun$labels
new.pruned[pruneScores(pred.grun, nmads=5)] <- NA
table(new.pruned, useNA="always")
## new.pruned
## acinar beta delta duct
## 53 4 2 41 0
4.3 based on marker gene expression
all.markers <- metadata(pred.grun)$de.genes
sceG$labels <- pred.grun$labels
# Beta cell-related markers
plotHeatmap(sceG, order_columns_by="labels",
features=unique(unlist(all.markers$beta)))
for (lab in unique(pred.grun$labels)) {
plotHeatmap(sceG, order_columns_by=list(I(pred.grun$labels)),
features=unique(unlist(all.markers[[lab]])))
}
5. Available reference
https://www.bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html#1_introduction
6. Reference options
※※※ 6.1 Pseudo-bulk aggregation
单细胞测序数据集提供了like-for-like模式的比较,帮助细胞更好的分类(hopefully)。但是这种参考集也存在一些问题,单细胞参考集中存在比bulk参考集多得多的样品,增加了计算工作。我们可以通过整合每个label的单细胞数据,构建“pseudo bulk”数据(e.g., 平均log-expression value)。
这个方法最大的缺点是丢失了每种类型细胞的分布信息。这可能导致一个远离中心的细胞的异质性丢失。所以,我们使用k-means聚类方法尽量保留这些信息。
这种整合的方法应用于aggregateReferences()。
set.seed(100) # for the k-means step.
aggr <- aggregateReference(sceM, labels=sceM$label)
aggr
## class: SummarizedExperiment
## dim: 19059 116
## metadata(0):
## assays(1): logcounts
## rownames(19059): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
## ZZZ3__chr1
## rowData names(0):
## colnames(116): alpha.1 alpha.2 ... mesenchymal.8 epsilon.1
## colData names(1): label
pred.aggr <- SingleR(sceG, aggr, labels=aggr$label)
table(pred.aggr$labels)
## acinar beta delta duct
## 53 4 1 42
6.2 多重参考集
6.3 融合标签
待更新
“贤者以其昭昭使人昭昭,今以其昏昏使人昭昭。” --《孟子·尽心下》