DoRothEA: a tool for predicting transcription factor activity from scRNA-seq data

While reading the paper "Single-cell RNA sequencing of blood antigen-presenting cells in severe COVID-19 reveals multi-process defects in antiviral immunity", I came across DoRothEA, a transcription factor activity prediction method I had not used before.

Fig.6b, Heatmap of top 50 highly variable TF activities among the three severity groups; the z-scores of TF activities are colour-coded

Official vignette: https://bioconductor.org/packages/release/data/experiment/vignettes/dorothea/inst/doc/single_cell_vignette.html

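DoRothEA is a gene set resource containing interactions between transcription factors (TFs) and their targets. A TF together with its set of targets is called a regulon. DoRothEA regulons are assembled from different types of interaction evidence: literature, ChIP-seq peaks, TF binding site motifs, and interactions inferred from gene expression. Each TF-target interaction is assigned a confidence level from A to E according to the amount of supporting evidence, with A the most reliable and E the least.
DoRothEA can be used with both bulk RNA-seq and scRNA-seq data.
DoRothEA regulons can be combined with several statistical methods to form a functional analysis tool that infers TF activity from gene expression data. Activity is computed not from the expression of the TF gene itself, but from the mRNA levels of its direct transcriptional targets.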
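
To make the idea concrete, here is a minimal base-R sketch on made-up data. The toy regulon uses the same columns as dorothea_hs (tf, confidence, target, mor, where mor is the mode of regulation: 1 for activation, -1 for repression). The gene names, the z-scores and the helper toy_tf_activity are all hypothetical. The score is a plain mor-weighted mean of target z-scores; it only illustrates the principle and is not the VIPER algorithm used later in this post.

```r
# Toy regulon with the same columns as dorothea_hs (all values are made up)
toy_regulon <- data.frame(
  tf         = c("TF1", "TF1", "TF1", "TF2", "TF2"),
  confidence = c("A",   "B",   "E",   "A",   "C"),
  target     = c("G1",  "G2",  "G3",  "G3",  "G4"),
  mor        = c(1,     -1,    1,     1,     1),
  stringsAsFactors = FALSE
)

# Keep only the better-supported interactions (confidence A, B and C)
toy_regulon_abc <- toy_regulon[toy_regulon$confidence %in% c("A", "B", "C"), ]

# Toy gene-level z-scores for a single sample (made up)
toy_expr <- c(G1 = 2, G2 = -1, G3 = 0.5, G4 = -2)

# Hypothetical helper: mor-weighted mean of target z-scores per TF.
# A simplified stand-in for illustration, NOT the VIPER algorithm.
toy_tf_activity <- function(regulon, expr) {
  sapply(split(regulon, regulon$tf), function(r) {
    mean(r$mor * expr[r$target])
  })
}

toy_activity <- toy_tf_activity(toy_regulon_abc, toy_expr)
print(toy_activity)
```

In this toy case TF1 gets a positive score because its activated target G1 is up and its repressed target G2 is down. The expression of TF1 itself never enters the calculation.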

Installing and loading the R packages

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("dorothea")
## We load the required packages
library(dorothea)
library(dplyr)
library(Seurat)
library(tibble)
library(pheatmap)
library(tidyr)
library(viper)

Loading the annotated pbmc dataset

pbmc <- readRDS("pbmc.rds")
DimPlot(pbmc, label = TRUE, repel = TRUE)

Computing TF activity per cell

## We read Dorothea Regulons for Human:
dorothea_regulon_human <- get(data("dorothea_hs", package = "dorothea"))
## For mouse data, use instead:
## dorothea_regulon_mouse <- get(data("dorothea_mm", package = "dorothea"))

## We obtain the regulons based on interactions with confidence level A, B and C
regulon <- dorothea_regulon_human %>%
    dplyr::filter(confidence %in% c("A","B","C"))

## We compute Viper Scores 
pbmc <- run_viper(pbmc, regulon,
                  options = list(method = "scale", minsize = 4, 
                                 eset.filter = FALSE, cores = 1, 
                                 verbose = FALSE))

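After this step, the assays contain a new "dorothea" assay in addition to "RNA".

Its 266 x 2638 matrix is 266 TFs x 2638 cells.

The TF x cell matrix can then be used to re-cluster the cells, in the same way as clustering on the gene x cell matrix.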

## We compute the nearest neighbours to perform clustering
DefaultAssay(object = pbmc) <- "dorothea"
pbmc <- ScaleData(pbmc)
pbmc <- RunPCA(pbmc, features = rownames(pbmc), verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, resolution = 0.5, verbose = FALSE)

pbmc <- RunUMAP(pbmc, dims = 1:10, umap.method = "uwot", metric = "cosine")

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, 
                               logfc.threshold = 0.25, verbose = FALSE)

## Assigning cell type identity to clusters
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", 
                     "FCGR3A+ Mono", "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Clustering plot based on TF activity

TF activity per cell population (analogous to bulk RNA-seq of each population)

## We transform the Viper scores, scaled by Seurat, into a data frame
## to make the results easier to handle
viper_scores_df <- GetAssayData(pbmc, slot = "scale.data", 
                                    assay = "dorothea") %>%
  data.frame(check.names = F) %>%
  t()

## We create a data frame containing the cells and their clusters
CellsClusters <- data.frame(cell = names(Idents(pbmc)), 
                            cell_type = as.character(Idents(pbmc)),
                            check.names = F) # other grouping metadata can be used instead

## We create a data frame with the Viper score per cell and its clusters
viper_scores_clusters <- viper_scores_df  %>%
  data.frame() %>% 
  rownames_to_column("cell") %>%
  gather(tf, activity, -cell) %>%
  inner_join(CellsClusters, by = "cell")

## We summarize the Viper scores by cell population
summarized_viper_scores <- viper_scores_clusters %>% 
  group_by(tf, cell_type) %>%
  summarise(avg = mean(activity),
            std = sd(activity))
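
The summary step above can be checked by hand on a tiny example. The sketch below uses base R only and made-up values; the column names mirror those used above, while toy_scores, toy_avg, toy_std and toy_summary are hypothetical names introduced for illustration.

```r
# Toy long-format scores: one row per (cell, TF) pair, values made up
toy_scores <- data.frame(
  cell      = c("c1", "c2", "c3", "c4", "c1", "c2", "c3", "c4"),
  cell_type = c("B",  "B",  "NK", "NK", "B",  "B",  "NK", "NK"),
  tf        = rep(c("TF1", "TF2"), each = 4),
  activity  = c(1, 3, -1, -3, 0, 2, 4, 6)
)

# Mean and standard deviation of activity per TF and cell type
toy_avg <- aggregate(activity ~ tf + cell_type, data = toy_scores, FUN = mean)
toy_std <- aggregate(activity ~ tf + cell_type, data = toy_scores, FUN = sd)
names(toy_avg)[3] <- "avg"
names(toy_std)[3] <- "std"

# One row per (TF, cell type), with both summary columns
toy_summary <- merge(toy_avg, toy_std, by = c("tf", "cell_type"))
print(toy_summary)
```

Each row of toy_summary plays the role of one "pseudo-bulk" value: the average activity of one TF in one cell population.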

Based on the scores computed in the previous step, select the 20 TFs that vary most across cell populations for visualization.

## We select the 20 most variable TFs. (20*9 populations = 180)
highly_variable_tfs <- summarized_viper_scores %>%
  group_by(tf) %>%
  mutate(var = var(avg))  %>%
  ungroup() %>%
  top_n(180, var) %>%
  distinct(tf)
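
The top_n(180, var) call above works because every TF contributes one row per population, so 20 TFs x 9 populations = 180 rows; the number has to be adjusted if the number of populations changes. As an alternative sketch that does not depend on the population count, one variance can be computed per TF and the k largest picked directly. The data and the names toy_avg, tf_var, k and top_tfs below are hypothetical, and the code uses base R only.

```r
# Toy per-population averages: 3 TFs x 3 populations (values made up)
toy_avg <- data.frame(
  tf        = rep(c("TF1", "TF2", "TF3"), each = 3),
  cell_type = rep(c("B", "NK", "DC"), times = 3),
  avg       = c(-2, 0, 2,    # TF1: varies strongly across populations
                 1, 1, 1,    # TF2: constant
                 0, 0.5, 1)  # TF3: varies a little
)

# One variance per TF, computed across the population averages
tf_var <- sapply(split(toy_avg$avg, toy_avg$tf), var)

# Names of the k most variable TFs
k <- 2
top_tfs <- names(sort(tf_var, decreasing = TRUE))[seq_len(k)]
print(top_tfs)
```

Here TF1 and TF3 are selected, and the constant TF2 is dropped.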

## We prepare the data for the plot
summarized_viper_scores_df <- summarized_viper_scores %>%
  semi_join(highly_variable_tfs, by = "tf") %>%
  dplyr::select(-std) %>%   
  spread(tf, avg) %>%
  data.frame(row.names = 1, check.names = FALSE) 
palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)

my_breaks <- c(seq(min(summarized_viper_scores_df), 0, 
                   length.out=ceiling(palette_length/2) + 1),
               seq(max(summarized_viper_scores_df)/palette_length, 
                   max(summarized_viper_scores_df), 
                   length.out=floor(palette_length/2)))

viper_hmap <- pheatmap(t(summarized_viper_scores_df),fontsize=14, 
                       fontsize_row = 10, 
                       color=my_color, breaks = my_breaks, 
                       main = "DoRothEA (ABC)", angle_col = 45,
                       treeheight_col = 0,  border_color = NA)
