AUCell：计算单细胞转录组的每个细胞中特定基因集的活性程度

AUC允许在单细胞rna数据集中识别具有活性基因集（如gene signature,gene module）的细胞。AUCell使用曲线下面积来计算输入基因集的一个关键子集是否在每个细胞的表达基因中富集。AUCell和GSVA的算法是一回事，都是排序。AUCell借鉴了ssGSVA的算法，但是在排序的时候，它们的有些处理不太一样。在SCENIC 转录因子预测的分析过程中，已经封装了AUCell。
AUCell的工作流程分为三步，分别通过三个函数完成：
1. AUCell_buildRankings：Build the rank
2. AUCell_calcAUC：Calculate the Area Under the Curve (AUC)
3. AUCell_exploreThresholds：Set the assignment thresholds

1. 数据准备

1.1 R包下载

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG","doSNOW"))
# For the main example:
BiocManager::install(c("mixtools", "GEOquery", "SummarizedExperiment"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
                       "dynamicTreeCut","R2HTML","Rtsne", "zoo"))
browseVignettes("AUCell")

1.2 数据准备：AUCell所需要的输入数据分为两部分，矩阵数据和基因集数据。

#设置工作目录
dir.create("AUCell_tutorial")
setwd("AUCell_tutorial") # or in the first code chunk (kntr options), if running as Notebook

下载演示数据

library(GEOquery)
attemptsLeft <- 20
while(attemptsLeft>0)
{
  geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity) 
  if(methods::is(geoFile,"error")) 
  {
    attemptsLeft <- attemptsLeft-1
    Sys.sleep(5)
  }
  else
    attemptsLeft <- 0
}

gzFile <- grep(".txt.gz", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)
message(gzFile)
#GSE60361_C1-3005-Expression.txt.gz
txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)
message(txtFile)
#GSE60361_C1-3005-Expression.txt
gunzip(filename=gzFile, destname=txtFile, remove=TRUE)

library(data.table)
geoData <- fread(txtFile, sep="\t")
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
exprMatrix <- as.matrix(geoData[,-1, with=FALSE])
rm(geoData)
dim(exprMatrix)
# [1] 19972  3005
rownames(exprMatrix) <- geneNames
exprMatrix[1:5,1:4]
#          1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06
# Tspan12               0              0              0              3
# Tshz1                 3              1              0              2
# Fnbp1l                3              1              6              4
# Adamts15              0              0              0              0
# Cldn12                1              1              1              0

# Remove file
file.remove(txtFile)
# [1] TRUE

# Save for future use
mouseBrainExprMatrix <- exprMatrix
save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")


# 仅使用数据集中的5000个基因做演示
# load("exprMatrix_AUCellVignette_MouseBrain.RData")
set.seed(333)
exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]
dim(exprMatrix)
# [1] 5000 3005

准备输入基因集

library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file('examples', package='AUCell')), "geneSignatures.gmt", sep="/")
geneSets <- getGmt(gmtFile)
# check how many of these genes are in the expression matrix:
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix)) 
cbind(nGenes(geneSets)) #演示基因集中有6个基因集 后面的数字是基因集中对应的基因数目
##                       [,1]
## Astrocyte_Cahoy        538
## Neuron_Cahoy           384
## Oligodendrocyte_Cahoy  477
## Astrocyte_Lein           7
## Neuron_Lein             13
## Microglia_lavin        165

# 为了方便演示，将基因集中的基因数目添加到基因集名字中
geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))

# 添加一些随机抽样出的基因组成的的基因集和在多数细胞中表达的基因（如house-keeping like gene）。（仅供演示，实操不必这样做）
# Random
set.seed(321)
extraGeneSets <- c(
  GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),
  GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))

countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))
# Housekeeping-like
extraGeneSets <- c(extraGeneSets,
                   GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))

geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))
names(geneSets) 
## [1] "Astrocyte_Cahoy (538g)"       "Neuron_Cahoy (384g)"         
## [3] "Oligodendrocyte_Cahoy (477g)" "Astrocyte_Lein (7g)"         
## [5] "Neuron_Lein (13g)"            "Microglia_lavin (165g)"      
## [7] "Random (50g)"                 "Random (500g)"               
## [9] "HK-like (100g)"

2. Build gene-expression rankings for each cell (`AUCell_buildRankings`函数)

输入的是表达矩阵

cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)
# Quantiles for the number of genes detected by cell: 
# (Non-detected genes are shuffled at the end of the ranking. Keep it in mind when choosing the threshold for calculating the AUC).
#  min   1%   5%  10%  50% 100% 
#  191  270  369  445  929 2062

查看排列的结果

cells_rankings
# Ranking for 5000 genes (rows) and 3005 cells (columns).

# Top-left corner of the ranking:
#                     cells
# genes                1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 1772067065_H06
#   0610009B22Rik                3417            859            672            648           1822
#   0610009L18Rik                4196           4453           2603           3537           4642
#   0610010B08Rik_loc6           1497           1645           4543           2843           1646
#   0610011F06Rik                 355            861           1227           2095            847
#   0610012H03Rik                2331           4084           2533           3098           4702
#   0610039K10Rik                3390           2443           4490           2202           4239

# Quantiles for the number of genes detected by cell:
#  min   1%   5%  10%  50% 100% 
#  191  270  369  445  929 2062

这一步的排名是对每个基因从高到低进行排名，也就是对每一个细胞的每一个基因进行排名，得到一个排名的矩阵。

3. Calculate the Area Under the Curve (AUC)(`AUCell_calcAUC`函数)

cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
save(cells_AUC, file="cells_AUC.RData")
cells_AUC
# AUC for 9 gene sets (rows) and 3005 cells (columns).

# Top-left corner of the AUC matrix:
#                        cells
# gene sets               1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06 1772067065_H06
#   Astrocyte_Cahoy           0.14705221     0.13297992     0.15524498     0.16514056     0.14599197
#   Neuron_Cahoy              0.27110040     0.29031325     0.29326908     0.31334940     0.32883534
#   Oligodendrocyte_Cahoy     0.15183936     0.13073092     0.13898795     0.12356627     0.13783133
#   Astrocyte_Lein            0.15389082     0.13995354     0.21777003     0.24332172     0.10046458
#   Neuron_Lein               0.37828427     0.41278886     0.50743906     0.46438746     0.45868946
#   Microglia_lavin           0.09373979     0.06140446     0.06481582     0.08372346     0.07410633

根据输入基因list中的基因在每个细胞中的表达量，对每个细胞都算了一个该genelist的AUC值。

因为我们是直接输入的矩阵，如果矩阵是来自于单细胞数据，这时对于每个输入的list，每个细胞都已经有一个auc值，可以将auc值投射回umap或tsne图（见6.实操）。

4. Determine the cells with the given gene signatures or active gene sets (`AUCell_exploreThresholds`函数)

AUC估计了输入基因集中的基因在每个细胞高表达的比例。表达了输入基因集中较多基因的细胞具有较高的AUC值。由于AUC值可以代表细胞表达基因集中基因的比例，我们可以使用跨细胞的相对AUCs来查看单细胞转录组的细胞对输入基因集的响应情况。
由于AUC并不是一个绝对的数值，而是受到细胞类型、数据集、基因集等的影响，尤其是在单细胞水平，大多数基因并不在所有细胞中稳定表达。因此有时判断某个基因集的gene在一个细胞中是'on'还是'off'状态比较困难。最理想的方法是双峰分布bi-modal distribution，也就是数据集中部分细胞有较高AUC而大多数的细胞有较低的AUC。AUCell提供了AUCell_exploreThresholds()函数来区分较高和较低的AUC值，也就是自动划定双峰分布的阈值，判断基因集的'on'或'off'状态。

set.seed(123)
par(mfrow=c(3,3)) 
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)

分别划定了九个基因集的阈值

getAUC(cells_AUC)[,1:5]
#                               cells
# gene sets                      1772071015_C02 1772071017_G12 # 1772071017_A05 1772071014_B06 1772067065_H06
#   Astrocyte_Cahoy (538g)           0.14621687    0.132208835     0.15315663    0.163823293    0.147855422
#   Neuron_Cahoy (384g)              0.27293173    0.291887550     0.29529317    0.312481928    0.328899598
#   Oligodendrocyte_Cahoy (477g)     0.15752610    0.130120482     0.13937349    0.124144578    0.134650602
#   Astrocyte_Lein (7g)              0.14982578    0.145180023     0.21544715    0.242160279    0.103368177
#   Neuron_Lein (13g)                0.38524850    0.404558405     0.51187085    0.467553023    0.446660336
#   Microglia_lavin (165g)           0.09196153    0.062928688     0.06546906    0.082852477    0.074578116
#   Random (50g)                     0.01273942    0.008908686     0.01336303    0.008819599    0.003296214
#   Random (500g)                    0.10306827    0.102522088     0.11167871    0.113959839    0.067726908
#   HK-like (100g)                   0.40626566    0.427769424     0.41182957    0.419548872    0.403208020

warningMsg <- sapply(cells_assignment, function(x) x$aucThr$comment)
warningMsg[which(warningMsg!="")]

# Random (500g) 
# "The AUC might follow a normal distribution (random gene-set?).  The global distribution overlaps the partial distributions. "

AUCell_exploreThresholds函数计算的每个基因集的阈值储存在$aucThr。如查看oligodendrocyte基因集的阈值：

cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
##             threshold nCells
## Global_k1   0.2419241    739
## L_k2        0.2437017    733
## R_k3        0.1766730   1008
## minimumDens 0.2417192    740

cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
## minimumDens 
##   0.2417192

查看在这个阈值下处于on状态的细胞数目

oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
length(oligodencrocytesAssigned)
# [1] 740

head(oligodencrocytesAssigned)
## [1] "1772067069_B02" "1772066077_G03"
## [3] "1772066070_D08" "1772067076_A07"
## [5] "1772067057_G07" "1772067076_C07"

也可以手动对阈值进行调整
绘制特定基因集的条型图，设定新阈值

geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
AUCell_plotHist(cells_AUC[geneSetName,], aucThr=0.25)
abline(v=0.25)

使用新阈值分配细胞

newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName,]>0.08))
length(newSelectedCells)
# [1] 3002
head(newSelectedCells)
# [1] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06"
# [5] "1772067065_H06" "1772071017_E02"

5. 后续分析

加载3005个细胞的tsne坐标并绘图

load(paste(file.path(system.file('examples', package='AUCell')), "cellsTsne.RData", sep="/"))
cellsTsne <- cellsTsne$Y
plot(cellsTsne, pch=16, cex=.3)

tsne数据的来源：

load("exprMatrix_AUCellVignette_MouseBrain.RData")
sumByGene <- apply(mouseBrainExprMatrix, 1, sum)
exprMatSubset <- mouseBrainExprMatrix[which(sumByGene>0),]
logMatrix <- log2(exprMatSubset+1)

library(Rtsne)
set.seed(123)
cellsTsne <- Rtsne(t(logMatrix))
rownames(cellsTsne$Y) <- colnames(logMatrix)
colnames(cellsTsne$Y) <- c("tsne1", "tsne2")
save(cellsTsne, file="cellsTsne.RData")

上面的tSNE图可以通过AUC评分来上色。为了更好的区分细胞，我们将阈值内的细胞标为pink-red，其他细胞标为black-blue。

selectedThresholds <- getThresholdSelected(cells_assignment)
par(mfrow=c(2,4)) # Splits the plot into two rows and three columns
for(geneSetName in names(selectedThresholds))
{
  nBreaks <- 5 # Number of levels in the color palettes
  # Color palette for the cells that do not pass the threshold
  colorPal_Neg <- grDevices::colorRampPalette(c("black","blue", "skyblue"))(nBreaks)
  # Color palette for the cells that pass the threshold
  colorPal_Pos <- grDevices::colorRampPalette(c("pink", "magenta", "red"))(nBreaks)
  
  # Split cells according to their AUC value for the gene set
  passThreshold <- getAUC(cells_AUC)[geneSetName,] >  selectedThresholds[geneSetName]
  if(sum(passThreshold) >0 )
  {
    aucSplit <- split(getAUC(cells_AUC)[geneSetName,], passThreshold)
    
    # Assign cell color
    cellColor <- c(setNames(colorPal_Neg[cut(aucSplit[[1]], breaks=nBreaks)], names(aucSplit[[1]])), 
    setNames(colorPal_Pos[cut(aucSplit[[2]], breaks=nBreaks)], names(aucSplit[[2]])))
    
    # Plot
    plot(cellsTsne, main=geneSetName,
    sub="Pink/red cells pass the threshold",
    col=cellColor[rownames(cellsTsne)], pch=16) 
  }
}

查看阈值对细胞分配的影响

selectedThresholds[2] <-  0.25
par(mfrow=c(2,3))
AUCell_plotTSNE(tSNE=cellsTsne, exprMat=exprMatrix, 
cellsAUC=cells_AUC[1:2,], thresholds=selectedThresholds)

6. 使用pbmc3k数据集进行实操

library(AUCell)
library(clusterProfiler)
library(ggplot2)
library(Seurat)
pbmc <- readRDS("pbmc.rds")
H <- read.gmt("h.all.v7.4.symbols.gmt.txt")
cells_rankings <- AUCell_buildRankings(pbmc@assays$RNA@data)  # 关键一步
unique(H$term)
geneSets <- lapply(unique(H$term), function(x){H$gene[H$term == x]})
names(geneSets) <- unique(H$term)
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings, aucMaxRank=nrow(cells_rankings)*0.1)
length(rownames(cells_AUC@assays@data$AUC))
grep("INFLAMMATORY",rownames(cells_AUC@assays@data$AUC),value = T)
# [1] "HALLMARK_INFLAMMATORY_RESPONSE"
geneSet <- "HALLMARK_INFLAMMATORY_RESPONSE"
aucs <- as.numeric(getAUC(cells_AUC)[geneSet, ])
pbmc$AUC <- aucs
df<- data.frame([email protected], pbmc@[email protected])
head(df)
class_avg <- df %>%
  group_by(cell_type) %>%
  summarise(
    UMAP_1 = median(UMAP_1),
    UMAP_2 = median(UMAP_2)
  )
ggplot(df, aes(UMAP_1, UMAP_2))+
  geom_point(aes(colour= AUC)) + 
  viridis::scale_color_viridis(option="A") +
  ggrepel::geom_label_repel(aes(label = cell_type),
                            data = class_avg,
                            size = 6,
                            label.size = 0,
                            segment.color = NA)+
  theme(legend.position = "none") + 
  theme_bw()

参考：
http://127.0.0.1:28753/library/AUCell/doc/AUCell.html
https://www.jianshu.com/p/2fb20f44da67

AUCell：计算单细胞转录组的每个细胞中特定基因集的活性程度

1. 数据准备

2. Build gene-expression rankings for each cell (AUCell_buildRankings函数)

3. Calculate the Area Under the Curve (AUC)(AUCell_calcAUC函数)

4. Determine the cells with the given gene signatures or active gene sets (AUCell_exploreThresholds函数)

5. 后续分析

6. 使用pbmc3k数据集进行实操

你可能感兴趣的:(AUCell：计算单细胞转录组的每个细胞中特定基因集的活性程度)

2. Build gene-expression rankings for each cell (`AUCell_buildRankings`函数)

3. Calculate the Area Under the Curve (AUC)(`AUCell_calcAUC`函数)

4. Determine the cells with the given gene signatures or active gene sets (`AUCell_exploreThresholds`函数)