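Goal: analyze single-cell transcriptome data with GSVA
Gene set variation analysis (GSVA) is a non-parametric, unsupervised method that estimates how gene set enrichment varies across the samples of an expression dataset. In other words, it supports differential analysis at the pathway level.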
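I. Introduction to GSVA and how to use it
See the following online tutorials:
Using GSVA
Day 11: Fully understanding GSVA and GSEA
Differential pathway analysis with GSVA + limma
Because no ready-made mouse gmt file was available, one has to be built by hand, so this note concentrates on how to analyze mouse single-cell transcriptome data.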
II. Procedure
Download the gmt file
- https://www.gsea-msigdb.org/gsea/msigdb/index.jsp
gmt file format (h.all.v7.0.symbols.gmt). Each line is tab-separated: gene set name, URL, then the gene symbols of the set:
HALLMARK_TNFA_SIGNALING_VIA_NFKB http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_TNFA_SIGNALING_VIA_NFKB JUNB CXCL2 ATF3 NFKBIA TNFAIP3 PTGS2 CXCL1 IER3 CD83 CCL20 CXCL3 MAFF NFKB2 TNFAIP2 HBEGF KLF6 BIRC3 PLAUR ZFP36 ICAM1 JUN EGR3 IL1B BCL2A1 PPP1R15A ZC3H12A SOD2 NR4A2 IL1A RELB TRAF1 BTG2 DUSP1 MAP3K8 ETS2 F3 SDC4 EGR1 IL6 TNF KDM6B NFKB1 LIF PTX3 FOSL1 NR4A1 JAG1 CCL4 GCH1 CCL2 RCAN1 DUSP2 EHD1 IER2 REL CFLAR RIPK2 NFKBIE NR4A3 PHLDA1 IER5 TNFSF9 GEM GADD45A CXCL10 PLK2 BHLHE40 EGR2 SOCS3 SLC2A6 PTGER4 DUSP5 SERPINB2 NFIL3 SERPINE1 TRIB1 TIPARP RELA BIRC2 CXCL6 LITAF TNFAIP6 CD44 INHBA PLAU MYC TNFRSF9 SGK1 TNIP1 NAMPT FOSL2 PNRC1 ID2 CD69 IL7R EFNA1 PHLDA2 PFKFB3 CCL5 YRDC IFNGR2 SQSTM1 BTG3 GADD45B KYNU G0S2 BTG1 MCL1 VEGFA MAP2K3 CDKN1A CCN1 TANK IFIT2 IL18 TUBB2A IRF1 FOS OLR1 RHOB AREG NINJ1 ZBTB10 PLPP3 KLF4 CXCL11 SAT1 CSF1 GPR183 PMEPA1 PTPRE TLR2 ACKR3 KLF10 MARCKS LAMB3 CEBPB TRIP10 F2RL1 KLF9 LDLR TGIF1 RNF19B DRAM1 B4GALT1 DNAJB4 CSF2 PDE4B SNN PLEK STAT5A DENND5A CCND1 DDX58 SPHK1 CD80 TNFAIP8 CCNL1 FUT4 CCRL2 SPSB1 TSC22D1 B4GALT5 SIK1 CLCF1 NFE2L2 FOSB AC129492.1 NFAT5 ATP2B1 IL12B IL6ST SLC16A6 ABCA1 HES1 BCL6 IRS2 SLC2A3 CEBPD IL23A SMAD3 TAP1 MSC IFIH1 IL15RA TNIP2 BCL3 PANX1 FJX1 EDN1 EIF1 BMP2 DUSP4 PDLIM5 ICOSLG GFPT2 KLF2 TNC SERPINB8 MXD1
HALLMARK_HYPOXIA http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_HYPOXIA PGK1 PDK1 GBE1 PFKL ALDOA ENO2 PGM1 NDRG1 HK2 ALDOC GPI MXI1 SLC2A1 P4HA1 ADM P4HA2 ENO1 PFKP AK4 FAM162A PFKFB3 VEGFA BNIP3L TPI1 ERO1A KDM3A CCNG2 LDHA GYS1 GAPDH BHLHE40 ANGPTL4 JUN SERPINE1 LOX GCK PPFIA4 MAFF DDIT4 SLC2A3 IGFBP3 NFIL3 FOS RBPJ HK1 CITED2 ISG20 GALK1 WSB1 PYGM STC1 ZNF292 BTG1 PLIN2 CSRP2 VLDLR JMJD6 EXT1 F3 PDK3 ANKZF1 UGP2 ALDOB STC2 ERRFI1 ENO3 PNRC1 HMOX1 PGF GAPDHS CHST2 TMEM45A BCAN ATF3 CAV1 AMPD3 GPC3 NDST1 IRS2 SAP30 GAA SDC4 STBD1 IER3 PKLR IGFBP1 PLAUR CAVIN3 CCN5 LARGE1 NOCT S100A4 RRAGD ZFP36 EGFR EDN2 IDS CDKN1A RORA DUSP1 MIF PPP1R3C DPYSL4 KDELR3 DTNA ADORA2B HS3ST1 CAVIN1 NR3C1 KLF6 GPC4 CCN1 TNFAIP3 CA12 HEXA BGN PPP1R15A PGM2 PIM1 PRDX5 NAGK CDKN1B BRS3 TKTL1 MT1E ATP7A MT2A SDC3 TIPARP PKP1 ANXA2 PGAM2 DDIT3 PRKCA SLC37A4 CXCR4 EFNA3 CP KLF7 CCN2 CHST3 TPD52 LXN B4GALNT2 PPARGC1A BCL2 GCNT2 HAS1 KLHL24 SCARB1 SLC25A1 SDC2 CASP6 VHL FOXO3 PDGFB B3GALT6 SLC2A5 SRPX EFNA1 GLRX ACKR3 PAM TGFBI DCN SIAH2 PLAC8 FBP1 TPST2 PHKG1 MYH9 CDKN1C GRHPR PCK1 INHA HSPA5 NDST2 NEDD4L TPBG XPNPEP1 IL6 SLC6A6 MAP3K1 LDHC AKAP12 TES KIF5A LALBA COL5A1 GPC1 HDLBP ILVBL NCAN TGM2 ETS1 HOXB9 SELENBP1 FOSL2 SULT2B1 TGFB3
HALLMARK_CHOLESTEROL_HOMEOSTASIS http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_CHOLESTEROL_HOMEOSTASIS FDPS CYP51A1 IDI1 FDFT1 DHCR7 SQLE HMGCS1 NSDHL LSS MVD LDLR TM7SF2 ALDOC EBP SCD PMVK MVK LPL SC5D FADS2 HMGCR HSD17B7 ANXA13 SREBF2 PCYT2 ACSS2 ATF3 ADH4 ETHE1 ECH1 CBS GUSB FASN LGALS3 ATF5 ANXA5 TP53INP1 CHKA GSTM2 ACAT2 AVPR1A PLSCR1 CLU ERRFI1 TRIB3 CXCL16 TNFRSF12A ACTG1 JAG1 LGMN FBXO6 GPX8 PNRC1 ANTXR2 MAL2 CD9 PPARG GLDC STX5 STARD4 CTNNB1 TMEM97 FAM129A PDK3 PLAUR SEMA3B GNAI1 ABCA2 ATXN2 NFIL3 ALCAM FABP5 S100A11 CPEB2
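To make the layout of a gmt line concrete, here is a minimal Python sketch that splits one line into its three parts. The function name `parse_gmt_line`, the set name `HALLMARK_DEMO` and the `example.org` URL are made up for illustration; only the tab-separated layout comes from the file shown above.

```python
def parse_gmt_line(line):
    """Split one gmt line into (set name, description/URL, gene symbols)."""
    fields = line.rstrip("\n").split("\t")
    name, description = fields[0], fields[1]
    genes = [g for g in fields[2:] if g]  # ignore empty trailing fields
    return name, description, genes


# Hypothetical, shortened line in the same layout as h.all.v7.0.symbols.gmt
example = "HALLMARK_DEMO\thttp://example.org/HALLMARK_DEMO\tJUNB\tCXCL2\tATF3\n"
name, description, genes = parse_gmt_line(example)
print(name, len(genes), genes)  # HALLMARK_DEMO 3 ['JUNB', 'CXCL2', 'ATF3']
```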
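This raises a problem: the gmt files in this database contain human genes only, so for another species (such as mouse) you have to build an equivalent file yourself.
The following uses a mouse gmt file as the example.
Approach:
Replace the human gene names (symbols) on each line of the file with their mouse orthologs
Download the mouse-human ortholog mapping from Ensembl
The goal is a file in the same gmt format as the one shown above
Because the gene sets must contain mouse gene symbols for the later analysis, the following steps are needed: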
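1. Download the mouse-human ortholog mapping from http://asia.ensembl.org/biomart/
- Select the mouse Ensembl gene ID and its versioned ID (Gene stable ID, Gene stable ID version)
- Select the human Ensembl gene ID and gene symbol (Human gene stable ID, Human gene name)
- The result is the file mart_export.txt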
Mapping: mouse Ensembl gene ID --> human gene symbol (the ENSMUSG identifiers are Ensembl gene IDs, not Entrez IDs). Mouse genes without a human ortholog have empty human columns.
$ head mart_export.txt
Gene stable ID	Gene stable ID version	Human gene stable ID	Human gene name	Human orthology confidence [0 low, 1 high]
ENSMUSG00000064372	ENSMUSG00000064372.1
ENSMUSG00000064371	ENSMUSG00000064371.1
ENSMUSG00000064370	ENSMUSG00000064370.1	ENSG00000198727	MT-CYB	1
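As a rough sketch of how this export can be turned into a lookup table, the Python below reads the header names shown above and maps each human gene symbol to its mouse gene IDs. The function name `human_to_mouse` and the optional `min_confidence` filter are assumptions added for illustration; the original workflow does not filter on orthology confidence.

```python
import csv
import io
from collections import defaultdict

CONF_COL = "Human orthology confidence [0 low, 1 high]"


def human_to_mouse(lines, min_confidence=None):
    """Map human gene symbol -> list of mouse gene stable IDs.

    lines: iterable of tab-separated text lines, header included.
    min_confidence: hypothetical option; pass 1 to keep only rows whose
    orthology confidence column equals 1.
    """
    mapping = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        symbol = (row.get("Human gene name") or "").strip()
        if not symbol:
            continue  # mouse gene without a human ortholog
        conf = (row.get(CONF_COL) or "").strip()
        if min_confidence is not None and conf != str(min_confidence):
            continue
        mouse_id = row["Gene stable ID"]
        if mouse_id not in mapping[symbol]:
            mapping[symbol].append(mouse_id)
    return dict(mapping)


# Small in-memory sample using rows from the head output above
demo = (
    "Gene stable ID\tGene stable ID version\tHuman gene stable ID\t"
    "Human gene name\t" + CONF_COL + "\n"
    "ENSMUSG00000064372\tENSMUSG00000064372.1\n"
    "ENSMUSG00000064370\tENSMUSG00000064370.1\tENSG00000198727\tMT-CYB\t1\n"
)
print(human_to_mouse(io.StringIO(demo)))  # {'MT-CYB': ['ENSMUSG00000064370']}
```

For the real file, pass an open file handle instead of the `io.StringIO` sample.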
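2. Find a file that maps mouse Ensembl gene IDs to mouse gene symbols
There are many ways to do this. The quick and dirty one used here is to take the features.tsv file from the matrix directory that Cellranger writes, which already contains this mapping.
$ head features.tsv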
ENSMUSG00000102693 4933401J01Rik Gene Expression
ENSMUSG00000064842 Gm26206 Gene Expression
ENSMUSG00000051951 Xkr4 Gene Expression
ENSMUSG00000102851 Gm18956 Gene Expression
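The second lookup table can be read from this file in a few lines. The sketch below is one possible way to do it, with hypothetical helper names `open_text` and `ensembl_to_symbol`; it also accepts a gzip-compressed file, on the assumption that some Cellranger versions write `features.tsv.gz` rather than plain text.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def ensembl_to_symbol(lines):
    """Map mouse Ensembl gene ID (column 1) -> mouse gene symbol (column 2)."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


# In-memory sample using rows from the head output above
sample = [
    "ENSMUSG00000064842\tGm26206\tGene Expression\n",
    "ENSMUSG00000051951\tXkr4\tGene Expression\n",
]
m2m = ensembl_to_symbol(sample)
print(m2m["ENSMUSG00000051951"])  # Xkr4
```

With a real file this would be used as `with open_text("features.tsv") as fh: m2m = ensembl_to_symbol(fh)`.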
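3. Write a Python script
The script replaces each human gene symbol in the original gmt file with the symbol of its mouse ortholog (Perl or awk would probably be more concise, but Python works too).
- Script outline:
  - From mart_export.txt, build hash 1: human gene symbol --> mouse Ensembl gene ID(s)
  - From features.tsv, build hash 2: mouse Ensembl gene ID --> mouse gene symbol
  - Loop over the gmt file h.all.v7.0.symbols.gmt and look up each human gene symbol on a line in hash 1:
    - if it is not found, the lookup returns an empty value and the gene is dropped;
    - if it is found, use each mouse Ensembl gene ID returned by hash 1 as a key into hash 2 to get the mouse gene symbol, which replaces the original human gene symbol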
#!/usr/bin/env python3
# Author: Robin 20200220
# Convert a human gmt file into a mouse gmt file
h2m = dict()  # human gene symbol -> list of mouse Ensembl gene IDs
m2m = dict()  # mouse Ensembl gene ID -> mouse gene symbol

# human gene symbol to mouse Ensembl gene ID
with open('./mart_export.txt') as f1:
    next(f1, None)  # skip the header line
    for line in f1:
        lst = line.strip('\n').split('\t')
        # columns: mouse gene ID, versioned ID, human gene ID, human gene name, confidence
        if len(lst) > 3 and lst[3]:
            h2m.setdefault(lst[3], []).append(lst[0])

# mouse Ensembl gene ID to mouse gene symbol
with open('./features.tsv') as f2:
    for line in f2:
        lst = line.strip('\n').split('\t')
        m2m[lst[0]] = lst[1]

# Walk through the gmt file line by line and replace the genes
with open('./h.all.v7.0.symbols.gmt') as f3, open('./mm.all.v7.0.symbols.gmt', 'w') as f4:
    for line in f3:
        lst = line.strip('\n').split('\t')
        name = lst[0]
        url = lst[1]
        gene_list = lst[2:]
        # replace the human gene symbols with mouse gene symbols
        tmp_list = []
        for hg in gene_list:  # each gene in the row
            for mouse_id in h2m.get(hg, []):  # each mouse ortholog of this gene
                mouse_gene = m2m.get(mouse_id)  # None if absent from features.tsv
                if mouse_gene and mouse_gene not in tmp_list:  # skip duplicates
                    tmp_list.append(mouse_gene)
        row = '\t'.join([name, url] + tmp_list) + '\n'
        print(row, end='')
        f4.write(row)
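Human genes without a mouse ortholog silently disappear during this conversion, so it can be worth checking how much of each gene set survives. The following is a small sketch of such a check, not part of the original script. The function name `convert_gene_set` and the `ENSMUSG_DEMO_*` identifiers are placeholders; `h2m` and `m2m` have the same shape as the two hashes described above.

```python
def convert_gene_set(human_genes, h2m, m2m):
    """Return (mouse_symbols, unmapped_human_genes) for one gene set."""
    mouse_symbols, unmapped = [], []
    for hg in human_genes:
        symbols = [m2m[e] for e in h2m.get(hg, []) if e in m2m]
        if not symbols:
            unmapped.append(hg)
            continue
        for s in symbols:
            if s not in mouse_symbols:
                mouse_symbols.append(s)
    return mouse_symbols, unmapped


# Toy lookup tables with placeholder IDs
h2m = {"JUNB": ["ENSMUSG_DEMO_1"], "CXCL2": ["ENSMUSG_DEMO_2", "ENSMUSG_DEMO_3"]}
m2m = {"ENSMUSG_DEMO_1": "Junb", "ENSMUSG_DEMO_2": "Cxcl2"}

mouse, missing = convert_gene_set(["JUNB", "CXCL2", "ATF3"], h2m, m2m)
print(mouse)    # ['Junb', 'Cxcl2']
print(missing)  # ['ATF3']
```

A gene set that loses a large share of its members is probably better interpreted with caution in the downstream GSVA scores.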
Use GSVA to convert the expression matrix into a pathway matrix and draw a heatmap
.libPaths(c('/share/nas2/public/R/library/3.6','/home/honghh/R/x86_64-redhat-linux-gnu-library/3.6'))
library(GSVA)
library(GSEABase)
library(limma)
library("org.Hs.eg.db")
library(parallel)
library(Seurat)
library(ggplot2)  # provides ggsave(), used below to save the heatmap
options(future.globals.maxSize=9891289600)
wd='/share/nas1/Data/Users/luohb/.../20200220'
dir.create(wd)
setwd(wd)
RSA<-readRDS('/share/nas1/Data/Users/yinl/Project/personality/…/merge_seurat_rna.rds')
Idents(RSA)<-RSA@meta.data$seurat_clusters
c11_seu<-subset(RSA, idents = 11)
sam_list<-c(c11_seu)
name_list<-c('c11_seu')
i<-1
for(sam in sam_list){
  name <- name_list[i]
  print(name)
  # remove mitochondrial genes
  tmp <- as.matrix(sam@assays$RNA@counts)
  bool <- !grepl('^mt-', rownames(sam), perl = TRUE)
  tmp <- tmp[bool,]
  print(class(tmp))
  # The mouse gmt built above keeps mouse-style symbols (e.g. Xkr4), so the row
  # names must keep their case to match. Upper-casing is only appropriate for a
  # gmt whose symbols are upper-case:
  # rownames(tmp) <- toupper(rownames(tmp))
  count.matrix <- tmp
  keggSet <- getGmt("/share/nas1/Data/Users/luohb/20191223/mm.all.v7.0.symbols.gmt")
  keggEs <- gsva(expr=as.matrix(count.matrix), gset.idx.list=keggSet, kcdf="Poisson", parallel.sz=30)
  saveRDS(keggEs, file=paste0(name, '_keggEs_filter.rds'))
  keggEs_filter <- readRDS(paste0(name, '_keggEs_filter.rds'))
  # add column annotation: group label derived from the cell names
  group <- gsub('[1-9]_.*', '', colnames(keggEs_filter))
  col_annotation <- data.frame(Type=group)
  rownames(col_annotation) <- colnames(keggEs_filter)
  # plot heatmap
  p <- pheatmap::pheatmap(keggEs_filter, show_colnames=FALSE, annotation_col=col_annotation)
  filename <- paste0(name, '_keggEs_filter.heatmap.pdf')
  ggsave(p, filename=filename,
         width = 40, height = 20)
  i <- i+1
}
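GSVA can only score the genes of a set that also appear among the row names of the expression matrix, so a quick overlap check before the run can catch naming mismatches. The sketch below assumes the comparison is an exact, case-sensitive string match; the function name `overlap_fraction` and the toy gene lists are illustrative only.

```python
def overlap_fraction(gene_set, expression_genes):
    """Fraction of gene set members found among the expression row names."""
    members = set(gene_set)
    if not members:
        return 0.0
    return len(members & set(expression_genes)) / len(members)


gene_set = ["Junb", "Cxcl2", "Atf3", "Nfkbia"]  # mouse-style symbols
row_names = ["Junb", "Atf3", "Xkr4"]            # toy expression row names

print(overlap_fraction(gene_set, row_names))                       # 0.5
print(overlap_fraction(gene_set, [g.upper() for g in row_names]))  # 0.0
```

The second call shows why the case of the row names matters: once they are upper-cased, none of the mouse-style symbols in the gene set match any more.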