NBIS Hands-On Single-Cell Transcriptome Data Analysis Series (Part 6): Cell Type Annotation

Section 6: Cell Type Annotation

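In this tutorial we run cell type prediction. Predictions can be made for individual cells, so that every cell receives a predicted cell type, or at the level of clusters. All methods rest on similarity to other datasets (single-cell datasets or sorted bulk RNA-seq datasets) or on known marker genes for each cell type.

Here we select one sample from the Covid data, ctrl_13, and predict the cell types in it. Using the PBMC reference dataset from the scPred package, we predict cell types with Seurat's label transfer (the FindTransferAnchors and TransferData functions) and with scPred. We also predict cell types by running gene set enrichment analysis on the DEGs of each cluster.

Some methods assign every cell the cell type it is most similar to, even when that cell's true type is not included in the reference dataset. Other methods account for uncertainty, so cells with low similarity scores are left unclassified. Many methods for cell type prediction exist; only a few of them are covered here.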
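
The idea of leaving low-scoring cells unclassified can be illustrated with a minimal base R sketch. The score matrix, the function name assign_labels and the 0.5 cutoff are all made up for illustration; they are not the values or the internals of any particular package.

```r
# Toy similarity scores: rows are cells, columns are reference cell types.
# All values are made up for illustration.
scores <- matrix(
    c(0.92, 0.05, 0.03,
      0.40, 0.35, 0.25,
      0.10, 0.85, 0.05,
      0.30, 0.32, 0.38),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("cell", 1:4), c("B cell", "T cell", "NK cell"))
)

# Assign each cell its best-scoring type, or "unassigned" when the best score
# falls below a threshold (0.5 is an arbitrary, hypothetical cutoff).
assign_labels <- function(scores, threshold = 0.5) {
    best <- colnames(scores)[max.col(scores, ties.method = "first")]
    best_score <- apply(scores, 1, max)
    ifelse(best_score >= threshold, best, "unassigned")
}

labels <- assign_labels(scores)
print(labels)
```

With threshold = 0, every cell receives its most similar type, which mirrors the behaviour of methods that always return a label.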

Load the required R packages and the dataset

suppressPackageStartupMessages({
    library(Seurat)
    library(venn)
    library(dplyr)
    library(cowplot)
    library(ggplot2)
    library(pheatmap)
    library(rafalib)
    library(scPred)
})

# load the data
alldata <- readRDS("data/results/covid_qc_dr_int_cl.rds")

# extract the data for the 'ctrl_13' sample
ctrl = alldata[, alldata$orig.ident == "ctrl_13"]

# set active assay to RNA and remove the CCA assay
ctrl@active.assay = "RNA"
ctrl[["CCA"]] = NULL
ctrl
## An object of class Seurat 
## 18121 features across 1129 samples within 1 assay 
## Active assay: RNA (18121 features, 0 variable features)
##  6 dimensional reductions calculated: umap, tsne, harmony, umap_harmony, scanorama, umap_scanorama

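Get the reference dataset

Next we extract the PBMC reference dataset from the scPred package and run the usual normalization, variable gene selection, scaling and dimensionality reduction.

# extract the PBMC reference dataset from the scPred package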
reference <- scPred::pbmc_1

reference
## An object of class Seurat 
## 32838 features across 3500 samples within 1 assay 
## Active assay: RNA (32838 features, 0 variable features)

Preprocess the reference dataset

Here we use the pipe operator %>% from the magrittr package to run all preprocessing steps in one go.

reference <- reference %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% 
                           RunPCA(verbose = F) %>% RunUMAP(dims = 1:30)

DimPlot(reference, group.by = "cell_type", label = TRUE, repel = TRUE) + NoAxes()

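Likewise, we run the same steps on the ctrl_13 sample and use the clustering from the earlier integrated data at resolution 0.5 (the CCA_snn_res.0.5 column).

# Set the identity to the Louvain clustering at resolution 0.5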
ctrl <- SetIdent(ctrl, value = "CCA_snn_res.0.5")

ctrl <- ctrl %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA(verbose = F) %>% RunUMAP(dims = 1:30)

DimPlot(ctrl, label = TRUE, repel = TRUE) + NoAxes()

Predict cell types with label transfer in Seurat

First we run label transfer using a method similar to the one in the integration exercise. Instead of CCA, however, the default for the FindTransferAnchors function is "pcaproject", i.e. the query dataset is projected onto the PCA of the reference dataset. The labels of the reference data are then predicted for the query cells.
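
To make "projecting the query onto the PCA of the reference" concrete, here is a minimal base R sketch using prcomp on simulated matrices. It illustrates the projection idea only; Seurat's anchor search does considerably more than this, and the matrices, gene names and the choice of 5 PCs are assumptions made for the example.

```r
set.seed(1)

# Toy expression matrices (cells x genes); values are simulated, not real data.
genes <- paste0("gene", 1:20)
reference_expr <- matrix(rnorm(100 * 20), nrow = 100, dimnames = list(NULL, genes))
query_expr <- matrix(rnorm(30 * 20), nrow = 30, dimnames = list(NULL, genes))

# Fit PCA on the reference only.
ref_pca <- prcomp(reference_expr, center = TRUE, scale. = TRUE)

# Project the query into the reference PC space: predict() applies the
# reference centering, scaling and rotation to the new cells.
query_proj <- predict(ref_pca, newdata = query_expr)

# Keep the first 5 PCs of both datasets; they now share one coordinate system.
n_pcs <- 5
ref_embed <- ref_pca$x[, 1:n_pcs]
query_embed <- query_proj[, 1:n_pcs]
print(dim(query_embed))
```

Because both embeddings live in the same space, distances between query and reference cells become meaningful, which is what a label transfer step needs.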

# find anchors between the query and reference datasets with FindTransferAnchors
transfer.anchors <- FindTransferAnchors(reference = reference, query = ctrl, dims = 1:30)
# predict cell types by label transfer with TransferData
predictions <- TransferData(anchorset = transfer.anchors, refdata = reference$cell_type, 
    dims = 1:30)

# add the predicted cell types to the metadata
ctrl <- AddMetaData(object = ctrl, metadata = predictions)

DimPlot(ctrl, group.by = "predicted.id", label = T, repel = T) + NoAxes()

Now plot how many cells of each cell type are found in each cluster.

ggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = predicted.id)) + geom_bar() + theme_classic()
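
The numbers behind such a bar plot can also be inspected as a table. The following base R sketch uses a small made-up data frame in place of ctrl@meta.data (the column name cluster is a stand-in for CCA_snn_res.0.5) and derives a majority label per cluster.

```r
# Toy metadata standing in for ctrl@meta.data; all values are made up.
meta <- data.frame(
    cluster = factor(c("0", "0", "0", "1", "1", "2", "2", "2")),
    predicted.id = c("cMono", "cMono", "ncMono", "B cell", "B cell",
                     "CD8 T cell", "NK cell", "CD8 T cell")
)

# Count cells per cluster and predicted label.
counts <- table(meta$cluster, meta$predicted.id)

# Convert to within-cluster fractions and pick the most frequent label.
fractions <- prop.table(counts, margin = 1)
majority <- colnames(counts)[apply(counts, 1, which.max)]
names(majority) <- rownames(counts)

print(round(fractions, 2))
print(majority)
```

A cluster whose top fraction is well below 1 mixes several predicted types and deserves a closer look before it is given a single name.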

Predict cell types with scPred

scPred trains a classifier based on all principal components. First, getFeatureSpace creates a scPred object, stored in the @misc slot, in which it extracts the PCs that best separate the different cell types. Then trainModel does the actual training for each cell type.

reference <- getFeatureSpace(reference, "cell_type")

## ●  Extracting feature space for each cell type...
## DONE!

reference <- trainModel(reference)
## ●  Training models for each cell type...
## maximum number of iterations reached 0.000116588 -0.0001156614DONE!

We can then print how well the training worked for the different celltypes by printing the number of PCs used for each, the ROC value and Sensitivity/Specificity.

get_scpred(reference)
## 'scPred' object
## ✔  Prediction variable = cell_type 
## ✔  Discriminant features per cell type
## ✔  Training model(s)
## Summary
## 
## |Cell type   |    n| Features|Method    |   ROC|  Sens|  Spec|
## |:-----------|----:|--------:|:---------|-----:|-----:|-----:|
## |B cell      |  280|       50|svmRadial | 1.000| 0.964| 1.000|
## |CD4 T cell  | 1620|       50|svmRadial | 0.997| 0.971| 0.975|
## |CD8 T cell  |  945|       50|svmRadial | 0.985| 0.902| 0.978|
## |cDC         |   26|       50|svmRadial | 0.995| 0.547| 1.000|
## |cMono       |  212|       50|svmRadial | 0.994| 0.958| 0.970|
## |ncMono      |   79|       50|svmRadial | 0.998| 0.582| 1.000|
## |NK cell     |  312|       50|svmRadial | 0.999| 0.936| 0.996|
## |pDC         |   20|       50|svmRadial | 1.000| 0.700| 1.000|
## |Plasma cell |    6|       50|svmRadial | 1.000| 0.800| 1.000|
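
As a reminder of what the Sens and Spec columns measure, here is a minimal base R sketch for a single one-vs-rest classifier. The labels are made up, and how scPred computes its own values internally is not shown here.

```r
# Toy true and predicted labels for a one-vs-rest "B cell" classifier.
truth <- c("B cell", "B cell", "B cell", "B cell", "other", "other",
           "other", "other", "other", "other")
pred  <- c("B cell", "B cell", "B cell", "other",  "other", "other",
           "other", "other", "other", "B cell")

tp <- sum(truth == "B cell" & pred == "B cell")  # true positives
fn <- sum(truth == "B cell" & pred != "B cell")  # false negatives
tn <- sum(truth != "B cell" & pred != "B cell")  # true negatives
fp <- sum(truth != "B cell" & pred == "B cell")  # false positives

sensitivity <- tp / (tp + fn)  # fraction of real B cells that were recovered
specificity <- tn / (tn + fp)  # fraction of other cells correctly rejected
print(c(sensitivity = sensitivity, specificity = specificity))
```

Read this way, the low sensitivity of rare types in the table above (for example cDC and ncMono) means that a sizeable share of those cells was not recognized, while the high specificity means that few other cells were mislabelled as that type.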

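We could tune the parameters and test different types of models to optimize the fit for each dataset; for more information, see https://powellgenomicslab.github.io/scPred/articles/introduction.html. For now, we continue with this model for cell type prediction.

Next, we predict cell types with the trained model. scPred aligns the two datasets with Harmony and then classifies the cells.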

ctrl <- scPredict(ctrl, reference)
## ●  Matching reference with new dataset...
##   ─ 2000 features present in reference loadings
##   ─ 1774 features shared between reference and new dataset
##   ─ 88.7% of features in the reference are present in new dataset
## ●  Aligning new data to reference...
## ●  Classifying cells...
## DONE!

DimPlot(ctrl, group.by = "scpred_prediction", label = T, repel = T) + NoAxes()

Now plot how many cells of each cell type are found in each cluster.

ggplot(ctrl@meta.data, aes(x = CCA_snn_res.0.5, fill = scpred_prediction)) + geom_bar() + 
    theme_classic()

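Compare the results of the different prediction methods

We now compare the two classifications with the convenient crossTab function from the scPred package, which tabulates the overlap between two metadata columns.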

crossTab(ctrl, "predicted.id", "scpred_prediction")
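
The same kind of comparison can be sketched in base R with table(). The two label vectors below are made up and simply stand in for the predicted.id and scpred_prediction columns; this is not how crossTab is implemented.

```r
# Toy labels from two methods for the same six cells; values are made up.
label_transfer <- c("B cell", "CD4 T cell", "CD8 T cell", "NK cell", "cMono", "cMono")
scpred_labels  <- c("B cell", "CD4 T cell", "NK cell", "NK cell", "cMono", "unassigned")

# Contingency table: rows are one method, columns the other.
overlap <- table(label_transfer, scpred_labels)
print(overlap)

# Fraction of cells on which both methods agree.
agreement <- mean(label_transfer == scpred_labels)
print(agreement)
```

Off-diagonal counts show which pairs of cell types the two methods tend to confuse.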

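Predict cell types with GSEA

For cluster-level cell type prediction we can also use gene set enrichment analysis (GSEA) against marker genes of specific cell types, much as we ran functional enrichment on DEGs in the differential expression exercise. Several databases of cell type gene sets are available, such as CellMarker, PanglaoDB or MSigDB. We can also look at the overlap between the DEGs of a reference dataset and those of the dataset being analyzed.

Gene set enrichment with overlapping DEGs

First, we extract the top DEGs from the Covid-19 dataset and from the reference dataset.

# run differential expression in our dataset, using clustering at resolution 0.5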
alldata <- SetIdent(alldata, value = "CCA_snn_res.0.5")

DGE_table <- FindAllMarkers(alldata, logfc.threshold = 0, test.use = "wilcox", min.pct = 0.1, 
    min.diff.pct = 0, only.pos = TRUE, max.cells.per.ident = 20, return.thresh = 1, 
    assay = "RNA")

# split into a list
DGE_list <- split(DGE_table, DGE_table$cluster)

unlist(lapply(DGE_list, nrow))
##    0    1    2    3    4    5    6    7    8    9   10 
## 3153 2483 3394 2837 2573 3956 2150 3753 2465 2142 3342

# Compute differential gene expression in reference dataset (that has cell annotation)
reference <- SetIdent(reference, value = "cell_type")

reference_markers <- FindAllMarkers(reference, min.pct = 0.1, min.diff.pct = 0.2, 
    only.pos = T, max.cells.per.ident = 20, return.thresh = 1)

# Identify the top cell marker genes in the reference dataset: select the top 50 with the highest fold change among the top 100 significant genes.
reference_markers <- reference_markers[order(reference_markers$avg_logFC, decreasing = T), ]
top50_cell_selection <- reference_markers %>% group_by(cluster) %>% top_n(-100, p_val) %>% 
    top_n(50, avg_logFC)

# Transform the markers into a list
ref_list = split(top50_cell_selection$gene, top50_cell_selection$cluster)

unlist(lapply(ref_list, length))
##  CD8 T cell  CD4 T cell       cMono      B cell     NK cell         pDC 
##          30          14          50          50          50          50 
##      ncMono         cDC Plasma cell 
##          50          50          50

Next, we run GSEA on the DEGs from our dataset and check for enrichment of the top cell-type-specific DEGs from the reference dataset.

suppressPackageStartupMessages(library(fgsea))

# run fgsea for each of the clusters in the list
res <- lapply(DGE_list, function(x) {
    gene_rank <- setNames(x$avg_logFC, x$gene)
    fgseaRes <- fgsea(pathways = ref_list, stats = gene_rank, nperm = 10000)
    return(fgseaRes)
})
names(res) <- names(DGE_list)
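
To give some intuition for the ES column in the output, the following base R sketch computes a running-sum enrichment score in the style of the classic GSEA definition. It is a simplified illustration under stated assumptions, not fgsea's implementation: it computes no p-values and no normalized score (NES), and the function name enrichment_score and the toy statistics are hypothetical.

```r
# Running-sum enrichment score: walk down the ranked gene list, step up at
# genes in the set (weighted by |stat|^p) and step down at all other genes.
enrichment_score <- function(stats, gene_set, p = 1) {
    stats <- sort(stats, decreasing = TRUE)
    hit <- names(stats) %in% gene_set
    if (!any(hit) || all(hit)) return(NA_real_)
    hit_weight <- abs(stats)^p * hit
    step_hit <- hit_weight / sum(hit_weight)
    step_miss <- (!hit) / sum(!hit)
    running <- cumsum(step_hit - step_miss)
    # The score is the running sum at its largest deviation from zero.
    running[which.max(abs(running))]
}

# Toy ranking statistics (for example log fold changes); values are made up.
stats <- setNames(c(5, 4, 3, 2, 1, 0.5, 0.2, 0.1), paste0("g", 1:8))

es_top <- enrichment_score(stats, c("g1", "g2", "g3"))  # set at the top of the list
es_bottom <- enrichment_score(stats, c("g7", "g8"))     # set at the bottom of the list
print(c(top = unname(es_top), bottom = unname(es_bottom)))
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom a negative score, which is why the marker sets of the matching cell type rise to the top of each cluster's table.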

# You can filter and resort the table based on ES, NES or pvalue
res <- lapply(res, function(x) {
    x[x$pval < 0.1, ]
})
res <- lapply(res, function(x) {
    x[x$size > 2, ]
})
res <- lapply(res, function(x) {
    x[order(x$NES, decreasing = T), ]
})
res
## $`0`
##    pathway         pval        padj        ES      NES nMoreExtreme size
## 1:   cMono 0.0000999900 0.000299970 0.9588744 2.095372            0   48
## 2:  ncMono 0.0000999900 0.000299970 0.8410417 1.833205            0   46
## 3:     cDC 0.0000999900 0.000299970 0.8160502 1.772541            0   43
## 4:     pDC 0.0005017561 0.001128951 0.7652807 1.584164            4   21
## 5:  B cell 0.0069809794 0.012565763 0.7410824 1.493208           68   15
## 6: NK cell 0.0150437919 0.022565688 0.7579453 1.475439          145   11
##                                     leadingEdge
## 1:      S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
## 2:     CTSS,TYMP,CST3,S100A11,AIF1,SERPINA1,...
## 3:              LYZ,GRN,TYMP,CST3,AIF1,CPVL,...
## 4:         GRN,MS4A6A,CST3,MPEG1,CTSB,TGFBI,...
## 5: NCF1,LY86,MARCH1,HLA-DRB5,POU2F2,PHACTR1,...
## 6:       TYROBP,FCER1G,SRGN,CCL3,CD63,MYO1F,...
## 
## $`1`
##        pathway         pval        padj        ES      NES nMoreExtreme size
## 1:      B cell 0.0000999900 0.000408455 0.8973600 1.988512            0   46
## 2:         cDC 0.0001021138 0.000408455 0.8750203 1.778727            0   14
## 3:         pDC 0.0011068625 0.002951633 0.7719359 1.612224           10   18
## 4: Plasma cell 0.0631477722 0.101036436 0.7333284 1.389860          590    8
## 5:      ncMono 0.0913272011 0.121769601 0.8427419 1.344760          694    3
##                                             leadingEdge
## 1:      CD79A,TCL1A,LINC00926,MS4A1,CD79B,TNFRSF13C,...
## 2: CD74,HLA-DQB1,HLA-DRA,HLA-DPB1,HLA-DRB1,HLA-DQA1,...
## 3:               CD74,TCF4,BCL11A,IRF8,HERPUD1,SPIB,...
## 4:                PLPP5,ISG20,HERPUD1,MZB1,ITM2C,JCHAIN
## 5:                                  HLA-DPA1,POU2F2,LYN
## 
## $`2`
##       pathway         pval         padj        ES      NES nMoreExtreme size
## 1: CD8 T cell 0.0001001603 0.0003505609 0.9432365 2.276503            0   29
## 2:    NK cell 0.0001000801 0.0003505609 0.8661551 2.108346            0   32
## 3: CD4 T cell 0.0007128431 0.0016633005 0.9256514 1.721886            5    5
## 4:        pDC 0.0398826342 0.0697946099 0.7225340 1.474840          366    8
##                            leadingEdge
## 1:   GZMH,CD8A,CD3D,CD3G,CD8B,CCL5,...
## 2: CCL5,NKG7,GZMA,FGFBP2,CCL4,GZMM,...
## 3:                      CD3G,CD3E,IL7R
## 4:               C12orf75,GZMB,SELENOS
## 
## $`3`
##       pathway         pval         padj        ES      NES nMoreExtreme size
## 1: CD8 T cell 0.0001002205 0.0003006615 0.9553147 2.173136            0   25
## 2:    NK cell 0.0001001302 0.0003006615 0.8360990 1.915575            0   27
## 3: CD4 T cell 0.0026475455 0.0052950910 0.8663678 1.674383           23    7
##                           leadingEdge
## 1: DUSP2,CCL5,CD3D,LYAR,CD8A,CD3E,...
## 2: CCL5,KLRB1,GZMM,CMC1,CST7,GZMA,...
## 3:        CD3E,CD3G,IL7R,PIK3IP1,TCF7
## 
## $`4`
##    pathway         pval       padj        ES      NES nMoreExtreme size
## 1:   cMono 0.0005000500 0.00270081 0.7200941 1.556501            4   45
## 2:  ncMono 0.0006001801 0.00270081 0.7215360 1.545517            5   38
## 3:  B cell 0.0191673788 0.04312660 0.9013244 1.505237          156    4
## 4:     cDC 0.0049014704 0.01470441 0.6806718 1.448777           48   34
## 5:     pDC 0.0843953838 0.15191169 0.6245503 1.295140          840   22
##                                 leadingEdge
## 1:   CST3,FCER1G,COTL1,LYZ,STXBP2,AP1S2,...
## 2: OAZ1,TIMP1,CST3,FKBP1A,IFITM3,FCER1G,...
## 3:                     PDLIM1,HLA-DRB5,NCF1
## 4: GAPDH,CST3,FCER1G,COTL1,LYZ,HLA-DRB5,...
## 5:       PTCRA,CST3,TXN,CTSB,APP,MS4A6A,...
## 
## $`5`
##        pathway         pval         padj        ES      NES nMoreExtreme size
## 1:     NK cell 0.0000999900 0.0004019697 0.9377894 2.439144            0   50
## 2:  CD8 T cell 0.0001004924 0.0004019697 0.9138145 2.225902            0   23
## 3:         pDC 0.0018019928 0.0048053141 0.8029287 1.747492           16   10
## 4:      ncMono 0.0115903265 0.0231806531 0.7906541 1.605347          103    7
## 5: Plasma cell 0.0228548516 0.0365677626 0.5842785 1.450093          227   28
##                                  leadingEdge
## 1:        SPON2,GNLY,PRF1,GZMB,CD7,CLIC3,...
## 2:       GNLY,PRF1,GZMB,NKG7,FGFBP2,CTSW,...
## 3: GZMB,C12orf75,RRBP1,PLAC8,ALOX5AP,HSP90B1
## 4:                   FCGR3A,IFITM2,RHOC,HES4
## 5:  CD38,FKBP11,SLAMF7,SDF2L1,PRDM1,PPIB,...
## 
## $`6`
##       pathway         pval         padj        ES      NES nMoreExtreme size
## 1: CD4 T cell 0.0001020616 0.0006123699 0.9154707 1.793599            0   13
## 2: CD8 T cell 0.0011968230 0.0035904689 0.8968754 1.622288           10    7
##                          leadingEdge
## 1: IL7R,LTB,LDHB,RCAN3,MAL,NOSIP,...
## 2:      IL32,CD3E,CD3D,CD2,CD3G,CD8B
## 
## $`7`
##        pathway         pval         padj        ES      NES nMoreExtreme size
## 1:     NK cell 0.0000999900 0.0004013646 0.9150850 2.398852            0   46
## 2:  CD8 T cell 0.0001003412 0.0004013646 0.9281782 2.318084            0   26
## 3:      ncMono 0.0024107450 0.0064286534 0.8617933 1.703177           20    6
## 4:         pDC 0.0130282809 0.0260565618 0.7292142 1.595377          122   10
## 5: Plasma cell 0.0451813264 0.0722901222 0.5447121 1.375779          450   29
##                                   leadingEdge
## 1:        FGFBP2,GNLY,NKG7,CST7,GZMB,CTSW,...
## 2:        FGFBP2,GNLY,NKG7,CST7,GZMB,CTSW,...
## 3:                         FCGR3A,IFITM2,RHOC
## 4:  GZMB,C12orf75,HSP90B1,ALOX5AP,RRBP1,PLAC8
## 5: PRDM1,FKBP11,HSP90B1,PPIB,SPCS2,SDF2L1,...
## 
## $`8`
##        pathway         pval         padj        ES      NES nMoreExtreme size
## 1:      B cell 0.0000999900 0.0004571312 0.8983372 1.896468            0   45
## 2:         cDC 0.0001015847 0.0004571312 0.8787620 1.717309            0   14
## 3:         pDC 0.0005035247 0.0015105740 0.8235666 1.638689            4   17
## 4: Plasma cell 0.0116822430 0.0175233645 0.7578645 1.481047          114   14
##                                             leadingEdge
## 1:        CD79A,MS4A1,BANK1,CD74,TNFRSF13C,HLA-DQA1,...
## 2: CD74,HLA-DQA1,HLA-DRA,HLA-DPB1,HLA-DQB1,HLA-DPA1,...
## 3:             CD74,JCHAIN,SPIB,HERPUD1,TCF4,CCDC50,...
## 4:                JCHAIN,HERPUD1,ISG20,ITM2C,PEBP1,MZB1
## 
## $`9`
##       pathway         pval         padj        ES      NES nMoreExtreme size
## 1: CD4 T cell 0.0001023227 0.0006139364 0.9248473 1.981064            0   13
## 2: CD8 T cell 0.0668711656 0.1587804395 0.7936868 1.374165          544    4
##                             leadingEdge
## 1: IL7R,TCF7,PIK3IP1,TSHZ2,LTB,LEF1,...
## 2:                   CD3E,CD3G,CD3D,CD2
## 
## $`10`
##    pathway       pval         padj        ES      NES nMoreExtreme size
## 1:  ncMono 0.00009999 0.0002666667 0.9578305 2.052171            0   49
## 2:   cMono 0.00010000 0.0002666667 0.8907972 1.877527            0   35
## 3:     cDC 0.00009999 0.0002666667 0.8272454 1.750653            0   38
## 4: NK cell 0.00255050 0.0051009998 0.8054483 1.571513           24   13
## 5:     pDC 0.04759980 0.0761596766 0.6792488 1.351779          470   16
## 6:  B cell 0.07367357 0.0945474534 0.6570759 1.307652          728   16
##                                               leadingEdge
## 1:                CDKN1C,LST1,FCGR3A,AIF1,COTL1,MS4A7,...
## 2:               LST1,AIF1,COTL1,SERPINA1,FCER1G,PSAP,...
## 3:                   LST1,AIF1,COTL1,FCER1G,CST3,SPI1,...
## 4:             FCGR3A,FCER1G,RHOC,TYROBP,IFITM2,MYO1F,...
## 5:                   CST3,NPC2,PLD4,MPEG1,VAMP8,TGFBI,...
## 6: HLA-DPA1,POU2F2,HLA-DRB5,HLA-DRA,HLA-DPB1,HLA-DRB1,...

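We can now rename each cluster after its top enrichment result. Note: if every gene set enriched for a cluster has a poor p-value, the prediction for that cluster is unreliable. Likewise, if the gene sets do not cover all cell types, the prediction may simply be the most similar cell type available.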

new.cluster.ids <- unlist(lapply(res, function(x) {
    as.data.frame(x)[1, 1]
}))

alldata$ref_gsea <- new.cluster.ids[as.character(alldata@active.ident)]
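
The as.character() call in the lookup above matters because cluster identities are stored as a factor. A small base R sketch with made-up cluster ids and labels shows the difference between indexing a named vector by factor and by name.

```r
# Cluster identities as a factor (toy values).
idents <- factor(c("2", "0", "1", "2"), levels = c("0", "1", "2"))

# Named vector mapping cluster id -> label, deliberately not in level order.
cluster_labels <- c("2" = "NK cell", "0" = "B cell", "1" = "CD4 T cell")

# Indexing with the factor uses its integer codes (positions), not its names.
by_position <- unname(cluster_labels[idents])

# Indexing with as.character() looks the labels up by name, as intended.
by_name <- unname(cluster_labels[as.character(idents)])

print(by_position)
print(by_name)
```

Only the lookup by name returns the label that belongs to each cluster id when the order of the named vector differs from the order of the factor levels.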

cowplot::plot_grid(ncol = 2, DimPlot(alldata, label = T, group.by = "CCA_snn_res.0.5") + 
    NoAxes(), DimPlot(alldata, label = T, group.by = "ref_gsea") + NoAxes())

Compare the enrichment-based predictions with the other cell type prediction methods on the ctrl_13 sample.

ctrl$ref_gsea = alldata$ref_gsea[alldata$orig.ident == "ctrl_13"]

cowplot::plot_grid(ncol = 3, DimPlot(ctrl, label = T, group.by = "ref_gsea") + NoAxes() + 
    ggtitle("GSEA"), DimPlot(ctrl, label = T, group.by = "predicted.id") + NoAxes() + 
    ggtitle("LabelTransfer"), DimPlot(ctrl, label = T, group.by = "scpred_prediction") + 
    NoAxes() + ggtitle("scPred"))

Gene set enrichment with known annotated gene sets

First, we download cell-type-specific gene sets from the CellMarker database.

# Download gene marker list
if (!dir.exists("data/CellMarker_list/")) {
    dir.create("data/CellMarker_list")
    download.file(url = "http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Human_cell_markers.txt", 
        destfile = "./data/CellMarker_list/Human_cell_markers.txt")
    download.file(url = "http://bio-bigdata.hrbmu.edu.cn/CellMarker/download/Mouse_cell_markers.txt", 
        destfile = "./data/CellMarker_list/Mouse_cell_markers.txt")
}

Read in the gene set list and apply some initial filtering.

# Load the human marker table
markers <- read.delim("data/CellMarker_list/Human_cell_markers.txt")
markers <- markers[markers$speciesType == "Human", ]
markers <- markers[markers$cancerType == "Normal", ]

# Filter by tissue (to reduce computational time and have tissue-specific
# classification)
# sort(unique(markers$tissueType))
# grep('blood', unique(markers$tissueType), value = T)
# markers <- markers[markers$tissueType %in% c('Blood', 'Venous blood', 'Serum',
#     'Plasma', 'Spleen', 'Bone marrow', 'Lymph node'), ]

# remove strange characters etc.
celltype_list <- lapply(unique(markers$cellName), function(x) {
    x <- paste(markers$geneSymbol[markers$cellName == x], sep = ",")
    x <- gsub("[[]|[]]| |-", ",", x)
    x <- unlist(strsplit(x, split = ","))
    x <- unique(x[!x %in% c("", "NA", "family")])
    x <- casefold(x, upper = T)
})
names(celltype_list) <- unique(markers$cellName)
# celltype_list <- lapply(celltype_list , function(x) {x[1:min(length(x),50)]} )
celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) < 100]
celltype_list <- celltype_list[unlist(lapply(celltype_list, length)) > 5]
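
To see what the cleaning steps do to a single entry, the sketch below applies the same regular expression to one made-up geneSymbol string written in the style of the CellMarker table.

```r
# One toy geneSymbol entry (made up for illustration).
raw <- "CD3D, CD3E, [CD8A, CD8B], HLA-DRA, NA"

# Same cleaning steps as above: turn brackets, spaces and hyphens into commas,
# split on commas, then drop empty strings and placeholder tokens.
cleaned <- gsub("[[]|[]]| |-", ",", raw)
tokens <- unlist(strsplit(cleaned, split = ","))
tokens <- unique(tokens[!tokens %in% c("", "NA", "family")])
tokens <- casefold(tokens, upper = TRUE)
print(tokens)
```

Because hyphens are treated as separators, a hyphenated symbol such as HLA-DRA ends up as the two tokens HLA and DRA, which will not match the gene names in the DEG tables. This is worth keeping in mind when interpreting gene set sizes.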

Run GSEA with the known gene sets.

# run fgsea for each of the clusters in the list
res <- lapply(DGE_list, function(x) {
    gene_rank <- setNames(x$avg_logFC, x$gene)
    fgseaRes <- fgsea(pathways = celltype_list, stats = gene_rank, nperm = 10000)
    return(fgseaRes)
})
names(res) <- names(DGE_list)

# You can filter and resort the table based on ES, NES or pvalue
res <- lapply(res, function(x) {
    x[x$pval < 0.01, ]
})
res <- lapply(res, function(x) {
    x[x$size > 5, ]
})
res <- lapply(res, function(x) {
    x[order(x$NES, decreasing = T), ]
})

# show top 3 for each cluster.
lapply(res, head, 3)
## $`0`
##                   pathway         pval       padj        ES      NES
## 1:             Neutrophil 0.0000999900 0.00427395 0.8225470 1.803093
## 2:             Fibroblast 0.0001042427 0.00427395 0.8978804 1.725605
## 3: CD1C+_B dendritic cell 0.0000999900 0.00427395 0.7846113 1.711199
##    nMoreExtreme size                              leadingEdge
## 1:            0   55 S100A8,S100A9,S100A12,CD14,MNDA,G0S2,...
## 2:            0   10        CD14,VIM,CD36,CKAP4,LRP1,CD44,...
## 3:            0   49  S100A8,S100A9,LYZ,S100A12,VCAN,FCN1,...
## 
## $`1`
##                         pathway         pval       padj        ES      NES
## 1:            Follicular B cell 0.0001026905 0.01314438 0.8905277 1.776479
## 2: Megakaryocyte erythroid cell 0.0020831163 0.03662375 0.8307946 1.620952
##    nMoreExtreme size                         leadingEdge
## 1:            0   12 MS4A1,CD69,FCER2,CD22,CD40,PAX5,...
## 2:           19   10     CD79A,CD83,CD69,FCER2,LY9,CXCR5
## 
## $`2`
##                         pathway         pval         padj        ES      NES
## 1:        CD4+ cytotoxic T cell 0.0000999900 0.0008897005 0.8208075 2.102096
## 2: Megakaryocyte erythroid cell 0.0001005126 0.0008897005 0.8803002 2.071241
## 3:                  CD8+ T cell 0.0001055186 0.0008897005 0.9672513 2.058873
##    nMoreExtreme size                          leadingEdge
## 1:            0   65 GZMH,CCL5,NKG7,KLRG1,GZMA,FGFBP2,...
## 2:            0   22    CD8A,CD3D,CD3G,CD2,CD3E,KLRG1,...
## 3:            0   10     CD8A,CD3D,CD3G,CD8B,CD2,CD3E,...
## 
## $`3`
##                         pathway         pval        padj        ES      NES
## 1:                  CD8+ T cell 0.0001023332 0.001457502 0.9558597 2.023154
## 2: Megakaryocyte erythroid cell 0.0001000801 0.001457502 0.8371009 1.911415
## 3:                T helper cell 0.0001012863 0.001457502 0.8699055 1.893383
##    nMoreExtreme size                         leadingEdge
## 1:            0   13   GZMK,CD3D,CD8A,CD3E,CD3G,CD8B,...
## 2:            0   27 CD3D,CD8A,KLRB1,CD3E,KLRG1,CD3G,...
## 3:            0   16  GZMK,CD3D,KLRB1,CD3E,CD3G,IL7R,...
## 
## $`4`
##                   pathway         pval       padj        ES      NES
## 1:          Megakaryocyte 0.0001033058 0.01580579 0.9165646 1.785300
## 2: Circulating fetal cell 0.0059165346 0.24682396 0.8357811 1.589161
## 3:               Platelet 0.0080661424 0.24682396 0.7460329 1.521198
##    nMoreExtreme size                         leadingEdge
## 1:            0   11     PPBP,PF4,GP9,ITGA2B,CD9,RASGRP2
## 2:           55    9                   PF4,CD9,ACTB,CD68
## 3:           79   18 GP9,ITGA2B,CD9,CD151,CD63,ICAM2,...
## 
## $`5`
##                              pathway         pval        padj        ES
## 1:             CD4+ cytotoxic T cell 0.0000999900 0.002568843 0.8537503
## 2: Effector CD8+ memory T (Tem) cell 0.0000999900 0.002568843 0.8218874
## 3:      Megakaryocyte erythroid cell 0.0001004924 0.002568843 0.8644641
##         NES nMoreExtreme size                            leadingEdge
## 1: 2.269733            0   71    SPON2,GNLY,PRF1,PTGDS,GZMB,NKG7,...
## 2: 2.166913            0   62 SPON2,GNLY,GZMB,FGFBP2,KLRF1,KLRD1,...
## 3: 2.102014            0   23    GZMB,CD7,KLRB1,KLRD1,IL2RB,GZMA,...
## 
## $`6`
##             pathway         pval        padj        ES      NES nMoreExtreme
## 1:      CD4+ T cell 0.0001014816 0.005206526 0.8985023 1.769758            0
## 2: Activated T cell 0.0004203005 0.008105796 0.8828033 1.651863            3
## 3:      CD8+ T cell 0.0003074085 0.006916692 0.8470356 1.641751            2
##    size                          leadingEdge
## 1:   14   IL7R,LTB,CD3E,CD3D,CD5,TNFRSF4,...
## 2:    9 CD3E,CD3D,TNFRSF4,CD3G,CD28,CD27,...
## 3:   12      IL7R,CD3E,CD3D,CD5,CD2,CD3G,...
## 
## $`7`
##                              pathway       pval        padj        ES      NES
## 1:             CD4+ cytotoxic T cell 0.00009999 0.002383736 0.8720059 2.357424
## 2: Effector CD8+ memory T (Tem) cell 0.00009999 0.002383736 0.8328810 2.242414
## 3:               Natural killer cell 0.00010002 0.002383736 0.8444422 2.197239
##    nMoreExtreme size                           leadingEdge
## 1:            0   74   FGFBP2,GNLY,NKG7,CST7,GZMB,CTSW,...
## 2:            0   68 FGFBP2,GNLY,GZMB,GZMH,KLRF1,KLRD1,...
## 3:            0   41   GNLY,NKG7,GZMB,KLRF1,KLRD1,GZMA,...
## 
## $`8`
##              pathway        pval      padj        ES      NES nMoreExtreme size
## 1: Follicular B cell 0.009812022 0.1706815 0.7917685 1.518359           94   11
##                           leadingEdge
## 1: MS4A1,CD24,CD40,CD22,PAX5,EBF1,...
## 
## $`9`
##                  pathway         pval        padj        ES      NES
## 1:     Naive CD8+ T cell 0.0000999900 0.006053027 0.7754973 1.914284
## 2:     Naive CD4+ T cell 0.0001000500 0.006053027 0.8116887 1.887479
## 3: Central memory T cell 0.0007809885 0.019694812 0.9093896 1.701239
##    nMoreExtreme size                             leadingEdge
## 1:            0   73 CCR7,TCF7,PIK3IP1,LEF1,TRABD2A,LDHB,...
## 2:            0   29    CCR7,IL7R,TCF7,TSHZ2,TRABD2A,MAL,...
## 3:            6    6                     CCR7,IL7R,CD28,CD27
## 
## $`10`
##                         pathway         pval        padj        ES      NES
## 1: Megakaryocyte erythroid cell 0.0001011122 0.008190091 0.8301819 1.653260
## 2:                 Myeloid cell 0.0001002406 0.008190091 0.7868515 1.611696
## 3:                Lymphoid cell 0.0016443988 0.053278520 0.8249257 1.598846
##    nMoreExtreme size                            leadingEdge
## 1:            0   16  FCGR3A,PECAM1,CD68,ITGAX,SPN,CD86,...
## 2:            0   23 FCGR3A,CSF1R,PECAM1,CD68,ITGAX,SPN,...
## 3:           15   12              FCGR3A,CD68,ITGAX,SPN,CD4

Visualize the cell types annotated by gene set enrichment.

new.cluster.ids <- unlist(lapply(res, function(x) {
    as.data.frame(x)[1, 1]
}))
alldata$cellmarker_gsea <- new.cluster.ids[as.character(alldata@active.ident)]

cowplot::plot_grid(ncol = 2, DimPlot(alldata, label = T, group.by = "ref_gsea") + 
    NoAxes(), DimPlot(alldata, label = T, group.by = "cellmarker_gsea") + NoAxes())

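Based on the analyses above, do you think the predictions overlap well? Where do you see the most inconsistencies?

In this case we have no ground truth, so we cannot say which method performs best. Keep in mind that any cell type classification method is only a prediction, and we still need existing biological knowledge to judge whether the results make sense.

Save the cell type predictions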

saveRDS(ctrl, "data/results/ctrl13_qc_dr_int_cl_celltype.rds")

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS/LAPACK: /Users/paulo.czarnewski/.conda/envs/scRNAseq2021/lib/libopenblasp-r0.3.12.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] fgsea_1.16.0    caret_6.0-86    lattice_0.20-41 scPred_1.9.0   
##  [5] rafalib_1.0.0   pheatmap_1.0.12 ggplot2_3.3.3   cowplot_1.1.1  
##  [9] dplyr_1.0.3     venn_1.9        Seurat_3.2.3    RJSONIO_1.3-1.4
## [13] optparse_1.6.6 
## 
## loaded via a namespace (and not attached):
##   [1] fastmatch_1.1-0       plyr_1.8.6            igraph_1.2.6         
##   [4] lazyeval_0.2.2        splines_4.0.3         BiocParallel_1.24.0  
##   [7] listenv_0.8.0         scattermore_0.7       digest_0.6.27        
##  [10] foreach_1.5.1         htmltools_0.5.1       fansi_0.4.2          
##  [13] magrittr_2.0.1        tensor_1.5            cluster_2.1.0        
##  [16] ROCR_1.0-11           limma_3.46.0          recipes_0.1.15       
##  [19] globals_0.14.0        gower_0.2.2           matrixStats_0.57.0   
##  [22] colorspace_2.0-0      ggrepel_0.9.1         xfun_0.20            
##  [25] crayon_1.3.4          jsonlite_1.7.2        spatstat_1.64-1      
##  [28] spatstat.data_1.7-0   survival_3.2-7        zoo_1.8-8            
##  [31] iterators_1.0.13      glue_1.4.2            polyclip_1.10-0      
##  [34] gtable_0.3.0          ipred_0.9-9           leiden_0.3.6         
##  [37] kernlab_0.9-29        future.apply_1.7.0    abind_1.4-5          
##  [40] scales_1.1.1          DBI_1.1.1             miniUI_0.1.1.1       
##  [43] Rcpp_1.0.6            viridisLite_0.3.0     xtable_1.8-4         
##  [46] reticulate_1.18       rsvd_1.0.3            stats4_4.0.3         
##  [49] lava_1.6.8.1          prodlim_2019.11.13    htmlwidgets_1.5.3    
##  [52] httr_1.4.2            getopt_1.20.3         RColorBrewer_1.1-2   
##  [55] ellipsis_0.3.1        ica_1.0-2             pkgconfig_2.0.3      
##  [58] farver_2.0.3          nnet_7.3-14           uwot_0.1.10          
##  [61] deldir_0.2-9          tidyselect_1.1.0      labeling_0.4.2       
##  [64] rlang_0.4.10          reshape2_1.4.4        later_1.1.0.1        
##  [67] munsell_0.5.0         tools_4.0.3           cli_2.2.0            
##  [70] generics_0.1.0        ggridges_0.5.3        evaluate_0.14        
##  [73] stringr_1.4.0         fastmap_1.0.1         yaml_2.2.1           
##  [76] goftest_1.2-2         ModelMetrics_1.2.2.2  knitr_1.30           
##  [79] fitdistrplus_1.1-3    admisc_0.11           purrr_0.3.4          
##  [82] RANN_2.6.1            pbapply_1.4-3         future_1.21.0        
##  [85] nlme_3.1-151          mime_0.9              formatR_1.7          
##  [88] compiler_4.0.3        beeswarm_0.2.3        plotly_4.9.3         
##  [91] png_0.1-7             spatstat.utils_1.20-2 tibble_3.0.5         
##  [94] stringi_1.5.3         highr_0.8             RSpectra_0.16-0      
##  [97] Matrix_1.3-2          vctrs_0.3.6           pillar_1.4.7         
## [100] lifecycle_0.2.0       lmtest_0.9-38         RcppAnnoy_0.0.18     
## [103] data.table_1.13.6     irlba_2.3.3           httpuv_1.5.5         
## [106] patchwork_1.1.1       R6_2.5.0              promises_1.1.1       
## [109] KernSmooth_2.23-18    gridExtra_2.3         vipor_0.4.5          
## [112] parallelly_1.23.0     codetools_0.2-18      MASS_7.3-53          
## [115] assertthat_0.2.1      withr_2.4.0           sctransform_0.3.2    
## [118] harmony_1.0           mgcv_1.8-33           parallel_4.0.3       
## [121] grid_4.0.3            rpart_4.1-15          timeDate_3043.102    
## [124] tidyr_1.1.2           class_7.3-17          rmarkdown_2.6        
## [127] Rtsne_0.15            pROC_1.17.0.1         shiny_1.5.0          
## [130] lubridate_1.7.9.2     ggbeeswarm_0.6.0

Source: https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_06_celltype.html
