SingleR辅助注释单细胞

很多做单细胞的研究者都提出过这个问题，是否有直接的功能能对单细胞直接进行注释，而不是繁琐的参看文献，搜索marker，人为对单细胞进行注释。
单细胞真的可以实现自动化注释吗？我想答案应该是肯定可以的。但是很多方法注释结果的准确性有待探讨，不过作为单细胞注释的辅助工具是一个不错的选择。
这儿我们将详细讲解SingleR单细胞注释工具的使用以及弊端
我们可以通过得到singleR的细胞注释结果之后，同时结合Seurat的分群结果，具体组织类型来综合完成细胞注释。
官方教程Using SingleR to annotate single-cell RNA-seq data: https://www.bioconductor.org/packages/release/bioc/vignettes/SingleR/inst/doc/SingleR.html

使用内置参考进行注释（最简便的）
使用SingleR的最简单方法是使用内置参考对细胞进行注释。celldex包通过专用的检索功能提供了7个参考数据集（主要来自大量RNA-seq或微阵列数据）。
singleR自带的7个参考数据集，需要联网才能下载，其中5个是人类数据，2个是小鼠的数据：
BlueprintEncodeData Blueprint (Martens and Stunnenberg 2013) and Encode (The ENCODE Project Consortium 2012) （人）
DatabaseImmuneCellExpressionData The Database for Immune Cell Expression(/eQTLs/Epigenomics)(Schmiedel et al. 2018)（人）
HumanPrimaryCellAtlasData the Human Primary Cell Atlas (Mabbott et al. 2013)（人）
MonacoImmuneData, Monaco Immune Cell Data - GSE107011 (Monaco et al. 2019)（人）
NovershternHematopoieticData Novershtern Hematopoietic Cell Data - GSE24759（人）
ImmGenData the murine ImmGen (Heng et al. 2008) （鼠）
MouseRNAseqData a collection of mouse data sets downloaded from GEO (Benayoun et al. 2019).鼠）

相关包安装

conda install -c bioconda bioconductor-Seurat
conda install -c bioconda bioconductor-singler ##devtools::install_github('dviraran/SingleR')安装报错，直接用conda安装了
conda install -c bioconda bioconductor-celldex ##安装这个包用来调用参考数据集

导入相关包，并下载参考数据集

library(Seurat) ##
library(SingleR)
library(ggplot2)
library(reshape2)
hpca.se=HumanPrimaryCellAtlasData() ##第一次载入会下载数据集，可能会慢一些，后面在用时就不用下载了
Blue.se=BlueprintEncodeData() 
Immune.se=DatabaseImmuneCellExpressionData()
Nover.se=NovershternHematopoieticData()
MonacoIm.se=MonacoImmuneData()
ImmGen.se=ImmGenData() #(鼠)
Mouse.se=MouseRNAseqData() #(鼠)

在这里，我们还是使用我们前面经常使用的pbmc3k数据集，这样也是为了方便SingleR与Seurat分析结合起来
pbmc数据集相关下载，seurat聚类都可参照前面的：https://www.jianshu.com/p/adda4536b2cb

setwd("/home/wucheng/jianshu/function/data")
pbmc <-readRDS("pbmc.rds") ##这儿我们直接导入Seurat标准化，聚类的pbmc数据
> pbmc
An object of class Seurat 
13714 features across 2638 samples within 1 assay 
Active assay: RNA (13714 features, 2000 variable features)
 2 dimensional reductions calculated: pca, umap

[email protected] #pbmc的meta文件，包含了seurat的聚类结果
pbmc_for_SingleR <- GetAssayData(pbmc, slot="data") ##获取标准化矩阵
pbmc.hesc <- SingleR(test = pbmc_for_SingleR, ref = hpca.se, labels = hpca.se$label.main) # 使用HumanPrimaryCellAtlasData参考数据集，main大类注释，也可使用fine小类注释，不过小类注释准确性不好确定

table(pbmc.hesc[[i]]$labels,meta$seurat_clusters) ##查看新注释的标签与seurat分类的结果的交叠情况
> table(pbmc.hesc[[i]]$labels,meta$seurat_clusters) 

                    0   1   2   3   4   5   6   7   8
  B cells           0   0   3 342   0   0   0   0   1
  CD4+ T cells    488 404   0   1   5   0   0   0   0
  CD8+ T cells    159  51   0   1  96   0   3   0   0
  Dendritic cells   0   0  14   0   0   0   0  31   0
  Monocytes         0   0 463   0   0 162   0   1   2
  NK cells          0   0   0   0  15   0 148   0   0
  Progenitors       4   0   0   0   0   0   0   0  11
  T cells          46  28   0   0 155   0   4   0   0

我们可以看到有些细胞簇分类还是很明确的，接着我们借助一些可视化函数看看注释效果

pdf("plotScoreHeatmap.pdf")
print(plotScoreHeatmap(pbmc.hesc))
dev.off()
[email protected]$labels <-pbmc.hesc$labels
pdf(paste(i,"Umap.pdf",sep ="_"),height=5,width=10)
print(DimPlot(pbmc, group.by = c("seurat_clusters", "labels"),reduction = "umap"))
dev.off()

image

可以看到参考数据集中的大部分细胞类别这儿都没有

image

umap直观的可以看到通过singleR注释的细胞标签准确性应该可以（不过注意这儿时pbmc数据集，有些组织单细胞数据可能就不是这样了哦，可能会很乱，做好心理准备哦）

aa=table(pbmc.hesc[[i]]$labels,meta$seurat_clusters)
aa= apply(aa,2,function(x) x/sum(x))
df=as.data.frame(melt(aa))
df$Var2=as.factor(df$Var2)
g <- ggplot(df, aes(Var2, Var1)) + geom_point(aes(size = value), colour = "green") + theme_bw() 
pdf("singleR_match_seurat.pdf",height=5,width=10)
print(g)
dev.off()
library(pheatmap)
pdf(paste(i,"heatmap.pdf",sep ="_"),height=5,width=10)
pheatmap(log2(aa+10), color=colorRampPalette(c("white", "blue"))(101))
dev.off()

image

两个图意思差不多，可以作为判断簇具体细胞类型的一个借鉴。

另一种是不用这儿的参考数据集，利用已有参考数据集，给其它数据集注释（Seurat也能实现）
这儿从pbmc数据集中抽取500个细胞作为新的数据集pbmc1,使用前面给pbmc打上的标签，为pbmc1重新打标签

pbmc1 <-pbmc[,1:500]
test <- GetAssayData(pbmc1, slot="data")
library(scran)
pbmc1.hesc <- SingleR(test=test, ref=pbmc_for_SingleR, labels=pbmc$labels, de.method="wilcox")
[email protected]$labels1 <-pbmc1.hesc$labels
pdf("Umap1.pdf",height=5,width=10)
print(DimPlot(pbmc1, group.by = c("seurat_clusters", "labels"),reduction = "umap"))
dev.off()

image

因为pbmc1是从pbmc抽取的，所以新的标签和之前的一致。

利用多个数据参考集为单细胞数据打标签
一些时候，如果希望使用多个参考数据集对单细胞数据进行注释。可以避免单个参考数据集中不能覆盖到的标记，从而得到一组更加全面的细胞类型标记，尤其是在考虑分辨率差异的情况下。我们可以通过将多个对象简单地传递到SingleR()函数中的ref=和label=参数，即可支持使用多个参考数据集。

pbmc.hesc <- SingleR(test = pbmc_for_SingleR, ref = list(BP=Blue.se, HPCA=hpca.se), labels = list(Blue.se$label.main, hpca.se$label.main)) 
table(pbmc.hesc$labels,meta$seurat_clusters)
table(pbmc.hesc$labels,meta$seurat_clusters)

                     0   1   2   3   4   5   6   7   8
  B_cell             0   0   0   4   0   0   0   0   0
  B-cells            0   0   0 334   0   0   0   1   1
  CD4+ T-cells     310 140   0   1   1   0   0   0   0
  CD8+ T-cells     366 318   0   4 240   0   5   1   0
  HSC                4   0   0   0   0   0   0   1   0
  Monocyte           0   0 292   0   0 130   0  21   0
  Monocytes          0   0 175   0   0  30   0   8   1
  NK cells           0   0   0   0  30   0 150   0   0
  Platelets          0   0   0   0   0   0   0   0  12
  Pre-B_cell_CD34-   0   0  13   0   0   1   0   0   0
  T_cells           17  25   0   1   0   1   0   0   0

不同参考数据集命名不同，有些其实是一样的细胞类型。

总而言之，参考库是作者基于已发表的单一种类的纯细胞转录组数据构建的，所以如果纯转录组数据不全，细胞注释是存在影响的。
所以说，SingleR作为单细胞注释的辅助工具是一个不错的选择。
后面我们还会讲到其它的单细胞注释辅助工具，谢谢！

作者：清竹饮酒
链接：https://www.jianshu.com/p/1c4abf05cb3e

SingleR辅助注释单细胞

你可能感兴趣的:(SingleR辅助注释单细胞)