monocle实操+细胞功能富集分析

monocle实操

1 软件安装

#所需软件
tidyverse
monocle(建议使用2.22.0,2.4为测试版)
Seurat
ggsci

1.1 conda安装——Linux

https://anaconda.org/bioconda/bioconductor-monocle

1.2 BiocManager安装——R

https://bioconductor.org/packages/release/bioc/html/monocle.html

2 环境激活及软件包加载

conda activate monocle
library(tidyverse)
library(monocle)
library(Seurat)
library(ggsci)

3 利用monocle2构建S4对象(monocle2指定的数据结构)

#测试数据
pro <- readRDS("/mnt/sdd/singleron_training_class/resources/monocle/pbmc_add_cluster.rds")

3.1 表达矩阵——行名是基因,列名是细胞编号,建议使用count矩阵

count <- GetAssayData(object = pro[["RNA"]], slot = "counts")

3.2 基因注释信息——第一列是基因编号,第一列列名必须为“geneshortname”,随着分析软件会自动为该数据框添加列信息

fdata <- data.frame(gene_short_name = row.names(count), row.names = row.names(count))

3.3 细胞的表型信息phenoData: 第一列是细胞编号(barcode),其他列是细胞的相关信息(样本名称、细胞类型等)

pdata <- [email protected]

3.4 构建monocle对象

my_cds <- newCellDataSet(count,
                         featureData = new("AnnotatedDataFrame", data =fdata),
                         phenoData = new("AnnotatedDataFrame", data = pdata),
                         lowerDetectionLimit = 0.5,
                         expressionFamily = negbinomial.size())

1、negbinomial.size()/ negbinomial()参数使用UMI表达矩阵,negbinomial()结果更加准确,但耗时;稀疏矩阵使用negbinomial.size(); 2、tobit():适用于输入表达矩阵为FPKM/ TPM,构建monocle对象时,会自动取log; 3、gaussianff():适用于输入表达矩阵为取log后的FPKM/ TPM

4 标准化、scale及筛选

#标准化细胞之间的mRNA的差异
my_cds <- estimateSizeFactors(my_cds)
#离散度值可以帮助我们进行后续的差异分析
my_cds <- estimateDispersions(my_cds)
#在fData(my_cds)中添加“num_cell_expressed”列,可用于过滤细胞
my_cds <- detectGenes(my_cds, min_expr = 0.1)

#像pData(my_cds)中添加“UMI”列
pData(my_cds)$UMI <- Matrix::colSums(exprs(my_cds))

5 筛选高变基因

为啥需要此步? monocle2轨迹构建分为无监督与半监督过程, 无监督:软件自行在数据中选择发育差异表达、cluster差异表达或离散程度较高的基因进行轨迹构建,不受人为干预; 半监督:利用marker基因定义生物学进程,可以使用先验知识辅助构建轨迹;此处我们使用高变基因辅助轨迹构建

pro <- FindVariableFeatures(pro, selection.method = "vst", nfeatures = 2000)
ordergene <- head(VariableFeatures(pro), 2000)

6 标记排序基因——将高变基因嵌入cds,后续分析对此依赖

my_cds <- setOrderingFilter(my_cds,ordergene)
p <- plot_ordering_genes(my_cds)
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ordering_genes.pdf")
print(p)
dev.off()

黑色点表示轨迹构建使用的基因,灰色为背景基因,红线为基因大小和离散程度分布趋势

7 降维

my_cds <- reduceDimension(my_cds, reduction_method = "DDRTree")

8 拟时间轴轨迹构建和在拟时间内排列细胞

my_cds <- orderCells(my_cds)

9 可视化

9.1 cluster在轨迹上的展示

p <- plot_cell_trajectory(my_cds, color_by = "cluster")

pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_cluster.pdf")
print(p)
dev.off()

9.2 state在轨迹上的展示

p <- plot_cell_trajectory(my_cds, color_by = "State")

pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_satate.pdf")
print(p)
dev.off()

9.3 时序在轨迹上的展示

p <- plot_cell_trajectory(my_cds, color_by = "Pseudotime")
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_pseudotime.pdf")
print(p)
dev.off()

9.3 分面展示

p <- plot_cell_trajectory(my_cds, color_by = "State") +
  facet_wrap(~State)

pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_facet_satate.pdf")
print(p)
dev.off()
#p <- plot_cell_trajectory(my_cds, color_by = "cluster") +
  #facet_wrap(~cluster)
#p <- plot_cell_trajectory(my_cds, color_by = "sample") +
  #facet_wrap(~sample)
#p <- plot_cell_trajectory(my_cds, color_by = "group") +
  #facet_wrap(~group)

monocle绘图函数基于ggplot2,所以可以使用ggplot语法对图片进行个性化修改:颜色、点大小、图例等

10 monocle对象

#使用'plot_cell_trajectory()'函数绘图时,'color_by'参数指定的字符必须在数据框my_cds@phenoData@data中,如果没有需要添加

11 关注基因在轨迹图谱上的表达可视化

pData(my_cds)[['NOC2L']] = log2(exprs(my_cds)['NOC2L',] + 1)

p <-plot_cell_trajectory(my_cds, color_by = 'NOC2L') +  scale_color_gsea()
pdf("/mnt/sdd/singleron_training_class/resources/monocle/ti_gene.pdf")
print(p)
dev.off()

细胞功能富集分析

1 简介

image.png

1.1 富集分析的基本定义

在进行单细胞分析时我们往往会得到许多差异基因列表,而在列表中,我们往往很难得到一些有用的信息。
富集分析:基于差异基因列表,结合相应数据库信息和一定的统计学方法,将差异基因信息整合转化为具有生物学意义的通路信息。
常用分析:GO,KEGG,GSEA,GSVA等。其中GO和KEGG是两个数据库,里面有每个基因相关的功能信息,而富集分析就是把这些功能进行进行整合计算的算法。

1.2 富集分析常用方法

  • ORA: Over-expression Analysis(过表达分析)

  • FCS: Functional Class Scoring(功能集打分)

  • PT: Pathway Topology(通路拓扑学)


    image.png

1.2.1 ORA(Over-expression Analysis):

又称为“2X2法”
具体步骤:
1.获得一组感兴趣的基因(一般是差异表达基因)。
2.给定的基因列表与某个通路中的基因集做交集,找出其中共同的基因并进行计数(统计值)。
3.利用统计检验的方式来评估观察的计数值是否显著高于随机,即待测功能集在基因列表中是否显著富集。
(最常用的统计检验包括:超几何分布、卡方检验、二项分布。)

优点:
统计学理论较为完备,结果稳健可。

缺点:
1.人为因素:设定阈值获取输入基因集。
2.灵敏性低。
3.假设基因独立,忽视通路内部生物学意义的影响。
4.假设通路间独立。

1.2.2 FCS(Functional Class Scoring):

1.对基因表达谱中基因表达水平差异值进行打分或者排序(或者输入排序过的基因表达谱)。
2.依据特定统计模型,将待测功能基因集中基因的分数或统计值转换为基因集的分数或统计值。
3.通过多次随机抽样获得不同的抽样条件下待测功能基因集分数的背景分布,并基于背景分布检验实际观测值的显著水平,以此判断待测基因集在实验对照条件下是否发生了统计学上的显著变化。

优点:
相较于ORA,将基因表达值信息加入考虑范围,并以待测基因集为对象进行检验,更灵敏。

缺点:
与ORA类似,假设通路与基因独立,忽略了基因的生物学属性与基因间复杂的相互作用关系。

1.2.3 PT(Pathway Topology):

把基因在通路中的位置(上下游关系),与其他基因的连接度和调控作用类型等信息综合在一起来评估每个基因对通路的贡献并给予相应的权重,然后再把基因的权重整合入功能富集分析。不同的PT方法在具体的权重打分时,采用了不同的方式。

优点:
对于研究较完善、拓扑结构完整的通路,基于PT的基因功能富集算法会有更强的显著性。

缺点:
对于通路拓扑结构存在依赖性,该类方法对于研究较少、信息不完善的通路稳健性较差,因此目前通路注释的不完善也是限制基于PT的基因功能富集分析方法进一步发展的重要因素。

2 实操部分:

2.1 clusterprofiler安装方法

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

#BiocManager::install("clusterProfiler")
#BiocManager::install("org.Hs.eg.db")
image.png

2.2 获取差异基因

利用官方数据pbmc3k_final.rds,计算CD14+ Mono和FCGR3A+ Mono的差异基因

pbmc <- readRDS("D:/singleron/项目/其他/2021生信培训/富集分析/pbmc3k_final.rds")
[email protected]$celltype <- [email protected]
#获取细胞间的差异基因
cells1 <- subset([email protected], celltype %in% c("CD14+ Mono"))  %>% rownames()
cells2 <- subset([email protected], celltype %in%  c("FCGR3A+ Mono"))  %>% rownames()
deg <- FindMarkers(pbmc, ident.1 = cells1, ident.2 = cells2)
deg <- data.frame(gene = rownames(deg), deg)
head(deg)
##          gene         p_val avg_logFC pct.1 pct.2    p_val_adj
## FCGR3A FCGR3A 1.193617e-101 -2.617707 0.131 0.975 1.636926e-97
## LYZ       LYZ  8.134552e-75  1.812078 1.000 0.988 1.115572e-70
## RHOC     RHOC  4.479768e-68 -1.611576 0.162 0.864 6.143554e-64
## S100A8 S100A8  7.471811e-65  2.610696 0.975 0.500 1.024684e-60
## S100A9 S100A9  1.318422e-64  2.286734 0.996 0.870 1.808084e-60
## IFITM2 IFITM2  4.821669e-64 -1.445772 0.677 1.000 6.612437e-60
deg1 <- deg
logFC_t=0.5
P.Value_t = 0.05
k1 = (deg1$p_val_adj < P.Value_t)&(deg1$avg_logFC < -logFC_t)
k2 = (deg1$p_val_adj < P.Value_t)&(deg1$avg_logFC > logFC_t)
table(k1)
## k1
## FALSE  TRUE 
##   569    94
table(k2)
## k2
## FALSE  TRUE 
##   620    43
#分组 上调/下调
change = ifelse(k1,"down",ifelse(k2,"up","stable"))
deg1$change <- change
head(deg1)
##          gene         p_val avg_logFC pct.1 pct.2    p_val_adj change
## FCGR3A FCGR3A 1.193617e-101 -2.617707 0.131 0.975 1.636926e-97   down
## LYZ       LYZ  8.134552e-75  1.812078 1.000 0.988 1.115572e-70     up
## RHOC     RHOC  4.479768e-68 -1.611576 0.162 0.864 6.143554e-64   down
## S100A8 S100A8  7.471811e-65  2.610696 0.975 0.500 1.024684e-60     up
## S100A9 S100A9  1.318422e-64  2.286734 0.996 0.870 1.808084e-60     up
## IFITM2 IFITM2  4.821669e-64 -1.445772 0.677 1.000 6.612437e-60   down
#基因名称转换
s2e <- bitr(deg1$gene, 
            fromType = "SYMBOL",
            toType = "ENTREZID",
            OrgDb = org.Hs.eg.db)#人类 转换成ENTREZID
## 'select()' returned 1:1 mapping between keys and columns
## Warning in bitr(deg1$gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db): 5.73% of input
## gene IDs are fail to map...
deg1 <- inner_join(deg1,s2e,by=c("gene"="SYMBOL"))
head(deg1)
##     gene         p_val avg_logFC pct.1 pct.2    p_val_adj change ENTREZID
## 1 FCGR3A 1.193617e-101 -2.617707 0.131 0.975 1.636926e-97   down     2214
## 2    LYZ  8.134552e-75  1.812078 1.000 0.988 1.115572e-70     up     4069
## 3   RHOC  4.479768e-68 -1.611576 0.162 0.864 6.143554e-64   down      389
## 4 S100A8  7.471811e-65  2.610696 0.975 0.500 1.024684e-60     up     6279
## 5 S100A9  1.318422e-64  2.286734 0.996 0.870 1.808084e-60     up     6280
## 6 IFITM2  4.821669e-64 -1.445772 0.677 1.000 6.612437e-60   down    10581
#GO富集分析差异基因列表[Symbol]
gene_up = deg1[deg1$change == 'up','gene'] 
gene_down = deg1[deg1$change == 'down','gene'] 
gene_diff = c(gene_up,gene_down)

#KEGG富集分析差异基因列表[ENTREZID]
gene_all = deg1[,'ENTREZID']
gene_up_KEGG = deg1[deg1$change == 'up','ENTREZID']
gene_down_KEGG = deg1[deg1$change == 'down','ENTREZID']
gene_diff_KEGG = c(gene_up_KEGG,gene_down_KEGG)

2.3 GO富集分析

富集分析的本质是什么呢?富集表示差异基因中注释到某个代谢通路的基因数目在所有差异基因中的比例显著大于背景基因中注释到某个代谢通路的基因数目在所有背景基因中的比例。
image.png

从统计上来讲富集上就是一个超几何分布,超几何分布是统计学上一种离散概率分布。它描述了从有限N个物件(其中包含M个指定种类的物件)中抽出n个物件,成功抽出该指定种类的物件的次数(不放回)。最常用的统计检验包括:超几何分布、卡方检验、二项分布。

image.png

N:所有通路总基因数目
n:差异基因列表中基因数目
M:指定通路的基因数目
i(k):差异基因列表中富集在指定通路的基因数目

GO数据库,全称是Gene Ontology(基因本体),他们把基因的功能分成了三个部分分别是:细胞组分(cellular component, CC)、分子功能(molecular function, MF)、生物过程(biological process, BP)。利用GO数据库,我们就可以得到我们的目标基因在CC, MF和BP三个层面上,主要和什么有关。
Gene Ontology:
BP:通过多种分子活动完成的生物学过程。
MF:单个的基因产物(包括蛋白质和RNA)或多个基因产物的复合物在分子水平上的活动。
CC:基因产物在执行功能时所处的细胞结构位置。

[图片上传中...(image-da7175-1651394285751-11)]

2.3.1 GO富集分析代码示例

#以上调基因的富集分析为例子

#细胞组分
ego_CC <- enrichGO(gene          = gene_up,
                  keyType       = 'SYMBOL', #基因ID的类型
                OrgDb         = org.Hs.eg.db,  #包含人注释信息的数据库
                ont           = "CC",
                pAdjustMethod = "BH", #指定多重假设检验矫正的方法
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)

head(data.frame(ego_CC))
##                    ID               Description GeneRatio   BgRatio       pvalue     p.adjust       qvalue
## GO:0034774 GO:0034774   secretory granule lumen     10/42 321/19717 1.055125e-09 5.596029e-08 4.386585e-08
## GO:0060205 GO:0060205 cytoplasmic vesicle lumen     10/42 338/19717 1.735874e-09 5.596029e-08 4.386585e-08
## GO:0031983 GO:0031983             vesicle lumen     10/42 339/19717 1.785967e-09 5.596029e-08 4.386585e-08
## GO:1904724 GO:1904724    tertiary granule lumen      3/42  55/19717 2.182913e-04 5.129845e-03 4.021155e-03
## GO:0035580 GO:0035580    specific granule lumen      3/42  62/19717 3.114424e-04 5.855118e-03 4.589678e-03
##                                                              geneID Count
## GO:0034774 LYZ/S100A8/S100A9/GSTP1/GRN/FCN1/FOLR3/S100A12/QPCT/APRT    10
## GO:0060205 LYZ/S100A8/S100A9/GSTP1/GRN/FCN1/FOLR3/S100A12/QPCT/APRT    10
## GO:0031983 LYZ/S100A8/S100A9/GSTP1/GRN/FCN1/FOLR3/S100A12/QPCT/APRT    10
## GO:1904724                                           LYZ/FOLR3/QPCT     3
## GO:0035580                                           LYZ/FOLR3/QPCT     3
#生物过程
ego_BP <- enrichGO(gene          = gene_up,
                OrgDb          = org.Hs.eg.db,
                 keyType       = 'SYMBOL',
                ont           = "BP",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)

#分子功能
ego_MF <- enrichGO(gene          = gene_up,
                OrgDb         = org.Hs.eg.db,
                 keyType       = 'SYMBOL',
                ont           = "MF",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05)
save(ego_CC,ego_BP,ego_MF,file = "GO.Rdata")

#细胞组分、分子功能、生物学过程
go <- enrichGO(gene = gene_up, OrgDb = "org.Hs.eg.db", keyType  = 'SYMBOL',ont="all")

2.3.2 GO富集结果可视化

dotplot(ego_CC, showCategory=30)
image.png
barplot(ego_CC)
image.png
p <- dotplot(go, split="ONTOLOGY") +facet_grid(ONTOLOGY~., scale="free")
p
image.png

2.4 KEGG富集分析

KEGG(Kyoto Encyclopedia of Genes and Genomes)数据库是系统地分析基因功能、链接基因组信息和功能信息的数据库,包括代谢通路(pathway)数据库、分层分类数据库、基因数据库、基因组数据库等。KEGG的pathway数据库是应用最广泛的代谢通路公共数据库。显著性计算方法为超几何检验。

#KEGG全部的物种及其简写:https://www.genome.jp/kegg/catalog/org_list.html
#pathway对应的描述信息,比如人的:http://rest.kegg.jp/list/pathway/hsa
[图片上传中...(image-8cda06-1651394285751-10)]

2.4.1 KEGG富集分析代码示例

#上调基因富集
kk.up <- enrichKEGG(gene         = gene_up_KEGG, #注意这里只能用 entrzeid
                      organism     = 'hsa',
                      universe     = gene_all, ##背景基因集,可省
                      pvalueCutoff = 0.9, ##指定 p 值阈值,不显著的值将不显示在结果中
                      qvalueCutoff = 0.9)
#下调基因富集
kk.down <- enrichKEGG(gene         =  gene_down_KEGG,
                        organism     = 'hsa',
                        universe     = gene_all,
                        pvalueCutoff = 0.9,
                        qvalueCutoff =0.9)
kk.diff <- enrichKEGG(gene         = gene_diff_KEGG,
                        organism     = 'hsa',
                        pvalueCutoff = 0.9)
save(kk.diff,kk.down,kk.up,file = "GSE4107kegg.Rdata")

#从富集结果中提取结果数据框
ekegg <- setReadable(kk.up, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
kegg_diff_dt <- data.frame(ekegg)
head(kegg_diff_dt)
##                ID                Description GeneRatio BgRatio     pvalue  p.adjust    qvalue
## hsa04640 hsa04640 Hematopoietic cell lineage      4/28  14/364 0.01647484 0.6593479 0.6593479
## hsa05140 hsa05140              Leishmaniasis      4/28  16/364 0.02685361 0.6593479 0.6593479
## hsa01100 hsa01100         Metabolic pathways      8/28  60/364 0.06925488 0.6593479 0.6593479
## hsa04380 hsa04380 Osteoclast differentiation      4/28  22/364 0.07783745 0.6593479 0.6593479
## hsa04514 hsa04514    Cell adhesion molecules      3/28  14/364 0.08361902 0.6593479 0.6593479
## hsa05152 hsa05152               Tuberculosis      4/28  23/364 0.08924392 0.6593479 0.6593479
##                                                 geneID Count
## hsa04640                     CD14/CSF3R/FCGR1A/HLA-DMA     4
## hsa05140                    NCF1/NFKBIA/FCGR1A/HLA-DMA     4
## hsa01100 GPX1/GSTP1/GAPDH/BLVRB/ALDH2/TALDO1/APRT/SAT2     8
## hsa04380                       NCF1/NFKBIA/FOSB/FCGR1A     4
## hsa04514                             CD99/VCAN/HLA-DMA     3
## hsa05152                    CD14/FCGR1A/CLEC4E/HLA-DMA     4

2.4.2 KEGG结果可视化

p1 <- barplot(ekegg, showCategory=10)
p2 <- dotplot(ekegg, showCategory=10)
plotc = p1/p2
plotc
image.png
up_kegg <- kk.up@result %>%
  filter(pvalue<0.01) %>%
  mutate(group=1)

down_kegg <- kk.down@result %>%
  filter(pvalue<0.05) %>% #筛选行
  mutate(group=-1) #新增列

#可视化
#barplot(up_kegg, showCategory = 10)
#g_kegg <- kegg_plot(up_kegg,down_kegg)
#g_kegg 
#g_kegg +scale_y_continuous(labels = c(15,10,5,0,5,10,15,20,25,30))

2.5 GSEA富集分析

GSEA(Gene Set Enrichment Analysis):基因集富集分析,由Broad Institute研究所提出的一种富集方法。
image.png

GSEA的输入是一个基因表达量矩阵,其中的样本分成了A和B两组,首先对所有基因进行排序,简单理解就是根据处理后的差异倍数值进行从大到小排序,用来表示基因在两组间的表达量变化趋势。
排序之后的基因列表其顶部可看做是上调的差异基因,其底部是下调的差异基因。


image.png

GSEA分析的是一个基因集下的所有基因是富集在这个排序列表的顶部还是底部,如果在顶部富集,可以说,从总体上看,该基因集是上调趋势,反之,如果在底部富集,则是下调趋势。
GSEA富集分析的优点:
1.传统的GO和KEGG是针对有差异的基因进行富集分析,GSEA不需要对基因进行显著差异的筛选,可以保留变化不大但功能重要的基因信息
2.分析的是单个基因集合。

2.5.1 GSEA代码

library(dplyr)
library(GSEABase)
library(clusterProfiler)
library(DOSE)
library(org.Hs.eg.db)
library(ggplot2)
library(stringr)
library(enrichplot)
options(stringsAsFactors = F)

geneList = deg1$avg_logFC 
names(geneList) = deg1$gene
geneList = sort(geneList,decreasing = T)
geneList[1:10]
##   S100A8   S100A9   LGALS2      LYZ   MS4A6A     CD14     CCL3  S100A12     GPX1    FOLR3 
## 2.610696 2.286734 2.049431 1.812078 1.645181 1.630356 1.531812 1.249115 1.238310 1.120697
#准备gmt文件
#构建gmt文件方法:https://www.jianshu.com/p/9a6a90697c25
geneset <- read.gmt("D:/singleron/项目/其他/2021生信培训/富集分析/h.all.v7.4.symbols.gmt")  
geneset$ont = str_remove(geneset$ont,"HALLMARK_")
head(geneset)
##                       ont    gene
## 1 TNFA_SIGNALING_VIA_NFKB    JUNB
## 2 TNFA_SIGNALING_VIA_NFKB   CXCL2
## 3 TNFA_SIGNALING_VIA_NFKB    ATF3
## 4 TNFA_SIGNALING_VIA_NFKB  NFKBIA
## 5 TNFA_SIGNALING_VIA_NFKB TNFAIP3
## 6 TNFA_SIGNALING_VIA_NFKB   PTGS2
egmt <- GSEA(geneList, TERM2GENE=geneset,verbose=F,pvalueCutoff = 0.5)

#结果转化
y=data.frame(egmt)
head(y)
##                                                  ID               Description setSize enrichmentScore
## INTERFERON_ALPHA_RESPONSE INTERFERON_ALPHA_RESPONSE INTERFERON_ALPHA_RESPONSE      14      -0.5434808
##                                 NES      pvalue  p.adjust   qvalues rank                   leading_edge
## INTERFERON_ALPHA_RESPONSE -1.923206 0.005597015 0.1287313 0.1287313  266 tags=93%, list=43%, signal=55%
##                                                                                     core_enrichment
## INTERFERON_ALPHA_RESPONSE HLA-C/PSMB9/CD47/ADAR/LY6E/IFI30/GBP2/OAS1/B2M/CASP1/IFITM3/IFITM1/IFITM2

2.5.2 GSEA结果可视化

#经典gseaplot
gseaplot2(egmt, geneSetID = 1, title = egmt$Description[1])
image.png

2.6 GSVA富集分析

GSVA(gene set variation analysis),通过将基因在不同样品间的表达矩阵转化成基因集在样品间的表达矩阵,从而来评估不同的代谢通路在不同样品间是否富集。


image.png

富集过程主要包含一下四步:
1.评估基因i在样品j中是高表达还是低表达:累积密度函数的核估计,得到每个样本的表达水平统计
2.对每个样本的表达水平统计进行排序和标准化
3.计算每个基因集的类KS随机游走统计量
4.将类KS随机游走统计量转化为ES:最大偏离量(双峰),最大正负偏离量之差(近似正态分布)

2.6.1 GSVA富集示例

library(GSVA)
expr <- as.data.frame(pbmc@assays$RNA@data)
expr[1:10,1:10]
##               AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC
## AL627309.1                 0              0              0              0              0              0
## AP006222.2                 0              0              0              0              0              0
## RP11-206L10.2              0              0              0              0              0              0
## RP11-206L10.9              0              0              0              0              0              0
## LINC00115                  0              0              0              0              0              0
## NOC2L                      0              0              0              0              0              0
## KLHL17                     0              0              0              0              0              0
## PLEKHN1                    0              0              0              0              0              0
## RP11-54O7.17               0              0              0              0              0              0
## HES4                       0              0              0              0              0              0
##               AAACGCTGACCAGT AAACGCTGGTTCTT AAACGCTGTAGCCA AAACGCTGTTTCTG
## AL627309.1                 0              0              0              0
## AP006222.2                 0              0              0              0
## RP11-206L10.2              0              0              0              0
## RP11-206L10.9              0              0              0              0
## LINC00115                  0              0              0              0
## NOC2L                      0              0              0              0
## KLHL17                     0              0              0              0
## PLEKHN1                    0              0              0              0
## RP11-54O7.17               0              0              0              0
## HES4                       0              0              0              0
expr=as.matrix(expr)
kegg_list = split(geneset$gene, geneset$ont)
kegg_list[1:2]
## $ADIPOGENESIS
##   [1] "FABP4"    "ADIPOQ"   "PPARG"    "LIPE"     "DGAT1"    "LPL"      "CPT2"     "CD36"     "GPAM"    
##  [10] "ADIPOR2"  "ACAA2"    "ETFB"     "ACOX1"    "ACADM"    "HADH"     "IDH1"     "SORBS1"   "ACADS"   
##  [19] "UCK1"     "SCP2"     "DECR1"    "CDKN2C"   "TALDO1"   "TST"      "MCCC1"    "PGM1"     "REEP5"   
##  [28] "BCL2L13"  "SLC25A10" "ME1"      "PHYH"     "PIM3"     "YWHAG"    "NDUFAB1"  "GPD2"     "ADIG"    
##  [37] "ECHS1"    "QDPR"     "CS"       "ECH1"     "SLC25A1"  "ACADL"    "TOB1"     "GRPEL1"   "CRAT"    
##  [46] "GBE1"     "CAVIN2"   "SCARB1"   "PEMT"     "CHCHD10"  "AK2"      "APOE"     "UQCR10"   "TANK"    
##  [55] "ANGPTL4"  "ACO2"     "FAH"      "ACLY"     "IFNGR1"   "SLC5A6"   "JAGN1"    "EPHX2"    "IDH3G"   
##  [64] "GPX3"     "ELMOD3"   "ORM1"     "RETSAT"   "ESRRA"    "HIBCH"    "SUCLG1"   "STAT5A"   "ITGA7"   
##  [73] "MRAP"     "PLIN2"    "CYC1"     "ALDH2"    "RNF11"    "ALDOA"    "SULT1A1"  "DDT"      "SDHB"    
##  [82] "CD151"    "SLC27A1"  "BCKDHA"   "C3"       "LEP"      "ADCY6"    "ELOVL6"   "LTC4S"    "SPARCL1" 
##  [91] "RMDN3"    "MTCH2"    "SOWAHC"   "SLC1A5"   "CMPK1"    "REEP6"    "NDUFA5"   "FZD4"     "DRAM2"   
## [100] "MGST3"    "ATP1B3"   "RETN"     "STOM"     "ESYT1"    "GHITM"    "DNAJC15"  "GADD45A"  "VEGFB"   
## [109] "PFKL"     "COQ3"     "NABP1"    "CYP4B1"   "PPM1B"    "ARAF"     "CAVIN1"   "COL4A1"   "IMMT"    
## [118] "DHRS7"    "COL15A1"  "NMT1"     "COQ5"     "LAMA4"    "AGPAT3"   "BAZ2A"    "IDH3A"    "LIFR"    
## [127] "PREB"     "PTGER3"   "GPHN"     "PFKFB3"   "GPX4"     "SSPN"     "SQOR"     "MTARC2"   "DLD"     
## [136] "ITIH5"    "CD302"    "ATL2"     "GPAT4"    "LPCAT3"   "TKT"      "UQCRC1"   "CAT"      "OMD"     
## [145] "DLAT"     "MRPL15"   "RIOK3"    "RTN3"     "CHUK"     "G3BP2"    "SDHC"     "SAMM50"   "ARL4A"   
## [154] "SNCG"     "PDCD4"    "COQ9"     "APLP2"    "SOD1"     "PTCD3"    "PHLDB1"   "ENPP2"    "HSPB8"   
## [163] "AIFM1"    "CCNG2"    "PPP1R15B" "MDH2"     "ABCA1"    "COX7B"    "MYLK"     "COX8A"    "DHRS7B"  
## [172] "MIGA2"    "MGLL"     "ITSN1"    "DHCR7"    "RREB1"    "CMBL"     "UBC"      "ATP5PO"   "PRDX3"   
## [181] "DBT"      "NDUFS3"   "NKIRAS1"  "RAB34"    "CIDEA"    "UQCRQ"    "PEX14"    "BCL6"     "COX6A1"  
## [190] "DNAJB9"   "MAP4K3"   "ANGPT1"   "UBQLN1"   "NDUFB7"   "SLC19A1"  "ABCB8"    "SLC66A3"  "POR"     
## [199] "UCP2"     "UQCR11"  
## 
## $ALLOGRAFT_REJECTION
##   [1] "PTPRC"    "IL12B"    "TGFB1"    "IL12A"    "CD3E"     "CD3D"     "CD28"     "LYN"      "HCLS1"   
##  [10] "IL18"     "CRTAM"    "IFNG"     "CD3G"     "CD86"     "IL10"     "UBE2N"    "BCL10"    "CD4"     
##  [19] "LCK"      "NCK1"     "C2"       "HLA-A"    "ITGB2"    "HLA-DQA1" "CD1D"     "CD80"     "HLA-DRA" 
##  [28] "THY1"     "TLR1"     "HLA-G"    "HLA-DMB"  "IL7"      "IL4"      "TNF"      "CD247"    "IL2"     
##  [37] "HLA-DMA"  "STAT1"    "IRF4"     "SRGN"     "INHBA"    "TLR3"     "ZAP70"    "CD74"     "CD40"    
##  [46] "TRAF2"    "B2M"      "BCL3"     "LTB"      "IFNGR1"   "CCR5"     "CD40LG"   "HLA-DOA"  "GLMN"    
##  [55] "IL6"      "HLA-E"    "CD2"      "CCL5"     "FAS"      "FASLG"    "TLR6"     "PF4"      "TGFB2"   
##  [64] "CD79A"    "INHBB"    "ELANE"    "SPI1"     "MAP3K7"   "IL15"     "CTSS"     "CD47"     "PRF1"    
##  [73] "IL12RB1"  "LCP2"     "SOCS1"    "CDKN2A"   "STAT4"    "CD7"      "HLA-DOB"  "CD8A"     "ICAM1"   
##  [82] "CCL4"     "GZMB"     "CSF1"     "IL11"     "STAB1"    "IL2RA"    "NLRP3"    "CCND3"    "EIF3A"   
##  [91] "SIT1"     "IFNAR2"   "HDAC9"    "CARTPT"   "TRAT1"    "CCL22"    "APBB1"    "FYB1"     "IL1B"    
## [100] "TIMP1"    "RPS19"    "JAK2"     "KRT1"     "WARS1"    "IFNGR2"   "CCR2"     "EREG"     "MMP9"    
## [109] "EGFR"     "IL16"     "CFP"      "WAS"      "ITGAL"    "KLRD1"    "RARS1"    "TLR2"     "CCND2"   
## [118] "IL2RG"    "ETS1"     "ITK"      "NCR1"     "MAP4K1"   "CCL19"    "PSMB10"   "RPL39"    "EIF3J"   
## [127] "ABCE1"    "CD8B"     "F2"       "ELF4"     "LY86"     "FCGR2B"   "GBP2"     "PRKCG"    "RPS9"    
## [136] "MTIF2"    "GZMA"     "AARS1"    "CD96"     "CSK"      "HIF1A"    "CCL2"     "ICOSLG"   "NPM1"    
## [145] "IL4R"     "CCL11"    "NME1"     "FLNA"     "GPR65"    "ACHE"     "EIF3D"    "IGSF6"    "F2R"     
## [154] "IL13"     "TAP1"     "DARS1"    "IRF7"     "ACVR2A"   "CXCR3"    "PRKCB"    "CXCL9"    "PTPN6"   
## [163] "NCF4"     "UBE2D1"   "LIF"      "CCR1"     "MBL2"     "DEGS1"    "TPD52"    "AKT1"     "RIPK2"   
## [172] "IKBKB"    "GCNT1"    "SOCS5"    "IRF8"     "TAP2"     "EIF4G3"   "ABI1"     "CCL7"     "IL2RB"   
## [181] "BRCA1"    "FGR"      "IL18RAP"  "MRPL3"    "CXCL13"   "CAPG"     "EIF5A"    "RPS3A"    "GALNT1"  
## [190] "ST8SIA4"  "CCL13"    "RPL3L"    "LY75"     "TAPBP"    "NOS2"     "RPL9"     "BCAT1"    "IL9"     
## [199] "IL27RA"   "DYRK3"
kegg2 <- gsva(expr, kegg_list, kcdf="Gaussian",method = "gsva",parallel.sz=12)
## Warning in .local(expr, gset.idx.list, ...): 1 genes with constant expression values throuhgout the samples.
## Warning in .local(expr, gset.idx.list, ...): Since argument method!="ssgsea", genes with constant expression
## values are discarded.
## Estimating GSVA scores for 50 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs with Gaussian kernels
## Using parallel with 8 cores
## 
  |                                                                                                          
  |                                                                                                    |   0%
  |                                                                                                          
  |==============                                                                                      |  14%
  |                                                                                                          
  |=============================                                                                       |  29%
  |                                                                                                          
  |===========================================                                                         |  43%
  |                                                                                                          
  |=========================================================                                           |  57%
  |                                                                                                          
  |=======================================================================                             |  71%
  |                                                                                                          
  |======================================================================================              |  86%
  |                                                                                                          
  |====================================================================================================| 100%
write.table(kegg2,"D:/singleron/项目/其他/2021生信培训/富集分析/kegg2.txt",sep="\t",quote = F)
data <- read.table('D:/singleron/项目/其他/2021生信培训/富集分析/kegg2.txt',header = TRUE)
data[1:10,1:10]
##                         AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG
## ADIPOGENESIS               -0.31805820    -0.18462905    -0.15811209   -0.154243884    -0.37283740
## ALLOGRAFT_REJECTION        -0.19507167    -0.03794476    -0.18071016   -0.165990361    -0.20320035
## ANDROGEN_RESPONSE          -0.23005329    -0.16156074    -0.04563213   -0.235821291    -0.15594740
## ANGIOGENESIS               -0.15038280    -0.19228819    -0.22088828    0.246071801     0.21906782
## APICAL_JUNCTION            -0.03969773     0.10783847     0.03563322    0.066345815     0.10212727
## APICAL_SURFACE             -0.03905613    -0.21580357    -0.01430062   -0.339748444     0.14688488
## APOPTOSIS                  -0.20717708    -0.38186444    -0.03548629   -0.121050533    -0.33615891
## BILE_ACID_METABOLISM        0.06506092     0.09057561     0.17393321    0.001260199    -0.11349065
## CHOLESTEROL_HOMEOSTASIS    -0.23527893    -0.30489279    -0.18848634   -0.078379978    -0.22117629
## COAGULATION                -0.12148690    -0.04065053    -0.08133865   -0.016734377     0.06458631
##                         AAACGCACTGGTAC AAACGCTGACCAGT AAACGCTGGTTCTT AAACGCTGTAGCCA AAACGCTGTTTCTG
## ADIPOGENESIS              -0.180867145   -0.215086765    -0.27301188    -0.29392780    -0.22786709
## ALLOGRAFT_REJECTION       -0.270354626   -0.175938344    -0.11045626    -0.34202594    -0.32641029
## ANDROGEN_RESPONSE         -0.088974281   -0.179532791    -0.11313698    -0.24368755    -0.21325196
## ANGIOGENESIS               0.058400640   -0.050501579    -0.02987406     0.12116168     0.33073463
## APICAL_JUNCTION            0.100274333   -0.002669347    -0.09365991     0.06416936     0.03224153
## APICAL_SURFACE            -0.029738394   -0.162577894    -0.15939289    -0.09816950    -0.24598109
## APOPTOSIS                 -0.256041490   -0.336932260    -0.15765883    -0.27774790    -0.22567893
## BILE_ACID_METABOLISM       0.123196649   -0.049568414     0.01197374     0.05770083    -0.03440201
## CHOLESTEROL_HOMEOSTASIS   -0.158308258   -0.327274194    -0.18463169    -0.27325823    -0.13552362
## COAGULATION               -0.008076284    0.021259688     0.08345908     0.08178405     0.18399532

2.6.2 GSVA结果可视化

meta <- as.data.frame([email protected][,c('orig.ident',"celltype")])
#细胞按照细胞类型排序
meta <- meta %>% arrange(meta$celltype)
data <- data[,rownames(meta)]
#取各细胞类型对应的通路score的均值
identical(colnames(data),rownames(meta))
## [1] TRUE
data$CD4_T <- apply(data[,1:697], 1, mean)
data$Memory_CD4_T <- apply(data[,698:1181], 1, mean)
data$CD14_Mono <- apply(data[,1182:1662], 1, mean)
data$B <- apply(data[,1663:2007], 1, mean)
data$CD8_T <- apply(data[,2007:2279], 1, mean)
table(meta$celltype3)
## < table of extent 0 >
test <- data[,c("CD4_T","Memory_CD4_T","CD14_Mono","B","CD8_T")]

pathway <- c("COAGULATION","COMPLEMENT","DNA_REPAIR","E2F_TARGETS","EPITHELIAL_MESENCHYMAL_TRANSITION","ESTROGEN_RESPONSE_EARLY","ESTROGEN_RESPONSE_LATE","FATTY_ACID_METABOLISM","G2M_CHECKPOINT","GLYCOLYSIS","HEDGEHOG_SIGNALING")
test1 <- test[pathway,]
result_plot<- t(scale(t(test1)))
library(pheatmap)
G1 <- pheatmap(result_plot,
                cluster_rows = F,
                cluster_cols = F,
                show_rownames = T,
                show_colnames = T,
                color =colorRampPalette(c("blue", "white","red"))(100),
                cellwidth = 10, cellheight = 15,
                fontsize = 10)
image.png
pdf(("D:/singleron/项目/其他/2021生信培训/富集分析/G1.pdf"),width = 7,height = 7)
参考文献
  1、Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375\. doi: 10.1371/journal.pcbi.1002375\. Epub 2012 Feb 23\. PMID: 22383865; PMCID: PMC3285573.

你可能感兴趣的:(monocle实操+细胞功能富集分析)