【单细胞转录组1】GSVA分析小鼠的单细胞转录组数据

目的:用GSVA分析单细胞转录组数据

基因集变异分析(GSVA)是一种非参数,无监督的方法,用于通过表达数据集的样本估算基因集富集的差异,即基于通路上的差异分析

一、GSVA介绍与使用方法

查看以下网络教程:
GSVA的使用
Day 11 充分理解GSVA和GSEA
GSVA + limma进行差异通路分析

因为没有现成的小鼠gmt底层文件,需要自己构造,这里只重点说明如何分析小鼠单细胞转录组的数据。

二、操作

下载gmt文件

  • https://www.gsea-msigdb.org/gsea/msigdb/index.jsp

gmt文件格式:h.all.v7.0.symbols.gmt

HALLMARK_TNFA_SIGNALING_VIA_NFKB    http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_TNFA_SIGNALING_VIA_NFKB   JUNB    CXCL2   ATF3    NFKBIA  TNFAIP3 PTGS2   CXCL1   IER3    CD83    CCL20   CXCL3   MAFF    NFKB2   TNFAIP2 HBEGF   KLF6    BIRC3   PLAUR   ZFP36   ICAM1   JUN EGR3    IL1B    BCL2A1  PPP1R15A    ZC3H12A SOD2    NR4A2   IL1A    RELB    TRAF1   BTG2    DUSP1   MAP3K8  ETS2    F3  SDC4    EGR1    IL6 TNF KDM6B   NFKB1   LIF PTX3    FOSL1   NR4A1   JAG1    CCL4    GCH1    CCL2    RCAN1   DUSP2   EHD1    IER2    REL CFLAR   RIPK2   NFKBIE  NR4A3   PHLDA1  IER5    TNFSF9  GEM GADD45A CXCL10  PLK2    BHLHE40 EGR2    SOCS3   SLC2A6  PTGER4  DUSP5   SERPINB2    NFIL3   SERPINE1    TRIB1   TIPARP  RELA    BIRC2   CXCL6   LITAF   TNFAIP6 CD44    INHBA   PLAU    MYC TNFRSF9 SGK1    TNIP1   NAMPT   FOSL2   PNRC1   ID2 CD69    IL7R    EFNA1   PHLDA2  PFKFB3  CCL5    YRDC    IFNGR2  SQSTM1  BTG3    GADD45B KYNU    G0S2    BTG1    MCL1    VEGFA   MAP2K3  CDKN1A  CCN1    TANK    IFIT2   IL18    TUBB2A  IRF1    FOS OLR1    RHOB    AREG    NINJ1   ZBTB10  PLPP3   KLF4    CXCL11  SAT1    CSF1    GPR183  PMEPA1  PTPRE   TLR2    ACKR3   KLF10   MARCKS  LAMB3   CEBPB   TRIP10  F2RL1   KLF9    LDLR    TGIF1   RNF19B  DRAM1   B4GALT1 DNAJB4  CSF2    PDE4B   SNN PLEK    STAT5A  DENND5A CCND1   DDX58   SPHK1   CD80    TNFAIP8 CCNL1   FUT4    CCRL2   SPSB1   TSC22D1 B4GALT5 SIK1    CLCF1   NFE2L2  FOSB    AC129492.1  NFAT5   ATP2B1  IL12B   IL6ST   SLC16A6 ABCA1   HES1    BCL6    IRS2    SLC2A3  CEBPD   IL23A   SMAD3   TAP1    MSC IFIH1   IL15RA  TNIP2   BCL3    PANX1   FJX1    EDN1    EIF1    BMP2    DUSP4   PDLIM5  ICOSLG  GFPT2   KLF2    TNC SERPINB8    MXD1
HALLMARK_HYPOXIA    http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_HYPOXIA   PGK1    PDK1    GBE1    PFKL    ALDOA   ENO2    PGM1    NDRG1   HK2 ALDOC   GPI MXI1    SLC2A1  P4HA1   ADM P4HA2   ENO1    PFKP    AK4 FAM162A PFKFB3  VEGFA   BNIP3L  TPI1    ERO1A   KDM3A   CCNG2   LDHA    GYS1    GAPDH   BHLHE40 ANGPTL4 JUN SERPINE1    LOX GCK PPFIA4  MAFF    DDIT4   SLC2A3  IGFBP3  NFIL3   FOS RBPJ    HK1 CITED2  ISG20   GALK1   WSB1    PYGM    STC1    ZNF292  BTG1    PLIN2   CSRP2   VLDLR   JMJD6   EXT1    F3  PDK3    ANKZF1  UGP2    ALDOB   STC2    ERRFI1  ENO3    PNRC1   HMOX1   PGF GAPDHS  CHST2   TMEM45A BCAN    ATF3    CAV1    AMPD3   GPC3    NDST1   IRS2    SAP30   GAA SDC4    STBD1   IER3    PKLR    IGFBP1  PLAUR   CAVIN3  CCN5    LARGE1  NOCT    S100A4  RRAGD   ZFP36   EGFR    EDN2    IDS CDKN1A  RORA    DUSP1   MIF PPP1R3C DPYSL4  KDELR3  DTNA    ADORA2B HS3ST1  CAVIN1  NR3C1   KLF6    GPC4    CCN1    TNFAIP3 CA12    HEXA    BGN PPP1R15A    PGM2    PIM1    PRDX5   NAGK    CDKN1B  BRS3    TKTL1   MT1E    ATP7A   MT2A    SDC3    TIPARP  PKP1    ANXA2   PGAM2   DDIT3   PRKCA   SLC37A4 CXCR4   EFNA3   CP  KLF7    CCN2    CHST3   TPD52   LXN B4GALNT2    PPARGC1A    BCL2    GCNT2   HAS1    KLHL24  SCARB1  SLC25A1 SDC2    CASP6   VHL FOXO3   PDGFB   B3GALT6 SLC2A5  SRPX    EFNA1   GLRX    ACKR3   PAM TGFBI   DCN SIAH2   PLAC8   FBP1    TPST2   PHKG1   MYH9    CDKN1C  GRHPR   PCK1    INHA    HSPA5   NDST2   NEDD4L  TPBG    XPNPEP1 IL6 SLC6A6  MAP3K1  LDHC    AKAP12  TES KIF5A   LALBA   COL5A1  GPC1    HDLBP   ILVBL   NCAN    TGM2    ETS1    HOXB9   SELENBP1    FOSL2   SULT2B1 TGFB3
HALLMARK_CHOLESTEROL_HOMEOSTASIS    http://www.gsea-msigdb.org/gsea/msigdb/cards/HALLMARK_CHOLESTEROL_HOMEOSTASIS   FDPS    CYP51A1 IDI1    FDFT1   DHCR7   SQLE    HMGCS1  NSDHL   LSS MVD LDLR    TM7SF2  ALDOC   EBP SCD PMVK    MVK LPL SC5D    FADS2   HMGCR   HSD17B7 ANXA13  SREBF2  PCYT2   ACSS2   ATF3    ADH4    ETHE1   ECH1    CBS GUSB    FASN    LGALS3  ATF5    ANXA5   TP53INP1    CHKA    GSTM2   ACAT2   AVPR1A  PLSCR1  CLU ERRFI1  TRIB3   CXCL16  TNFRSF12A   ACTG1   JAG1    LGMN    FBXO6   GPX8    PNRC1   ANTXR2  MAL2    CD9 PPARG   GLDC    STX5    STARD4  CTNNB1  TMEM97  FAM129A PDK3    PLAUR   SEMA3B  GNAI1   ABCA2   ATXN2   NFIL3   ALCAM   FABP5   S100A11 CPEB2

这个时候就有一个问题:gmt数据库只有人的,如果要改为其他物种(如小鼠)(则需要自己去构造一个这样的底层文件)

下面以构建小鼠的gmt文件为例:

思路:
将小鼠与人的同源基因用于替换掉文件每一行中的小鼠基因名(Symbol ID)

从Ensembl下载小鼠与人同源信息的对应关系

目的构建一个这样这样格式的gmt文件格式的文件

不过由于我们需要的是小鼠的基因名,作为后面基因,所以需要进行如下处理:

1. 到 http://asia.ensembl.org/biomart/中下载小鼠和人的同源基因对应关系
image
  • 选择小鼠的entrezID 与版本号的对应关系


    image
  • 选择人的entrezID与Symbol ID的对应关系


    image

    image

    image

    image
  • 得到的是:mart_export.txt
    映射关系:小鼠Entrez ID --> 人的Symbol ID
    $head mart_export.txt
    Gene stable ID  Gene stable ID version  Human gene stable ID    Human gene name Human orthology confidence [0 low, 1 high]
    ENSMUSG00000064372  ENSMUSG00000064372.1            
    ENSMUSG00000064371  ENSMUSG00000064371.1            
    ENSMUSG00000064370  ENSMUSG00000064370.1    ENSG00000198727 MT-CYB  1       
    

寻找小鼠EntrezID映射到小鼠Symbol ID的映射关系文件

方法有很多,这里用一个最简单粗暴的方法:直接使用Cellranger跑出来用于矩阵文件的目录里的features.tsv文件找到这种映射关系

head features.tsv
ENSMUSG00000102693  4933401J01Rik   Gene Expression
ENSMUSG00000064842  Gm26206 Gene Expression
ENSMUSG00000051951  Xkr4    Gene Expression
ENSMUSG00000102851  Gm18956 Gene Expression

编写Python脚本

写脚本将小鼠的gene symbol ID替换掉原本gmt文件中的人对应的同源基因 symbol ID(感觉用perl或awk写会比较简洁,当然Python也可以)

  • 脚本思路:
    • mart_export.txt中寻找映射关系:用哈希1存放映射关系(人gene symbol ID-->小鼠Entrez ID)
    • features.tsv中寻找映射关系:用哈希2存放映射关系(小鼠Entrez ID-->小鼠symbol ID)
    • 循环遍历gmt文件h.all.v7.0.symbols.gmt,将每一行里面的人类gene symbol ID作为键去查找相应的值(小鼠的Entrez ID):
      • 若查找不到返回空值;
      • 若查找到,则继续用哈希1的获取到值(小鼠的Entrez ID),作为哈希2的键获取相应的小鼠Gene symbol ID,替换原本的人类基因symbol ID
#!/usr/bin/python
#Author:Robin 20200220
#For transfer human to mouse

human_set = set()
h2m = dict()
m2m = dict()

# human gene ID to mouse Entrez ID
with open(./mart_export.txt') as f1:     
    for line in f1:
        line = line.strip('\n')
        lst = line.split('\t')    

        if lst[3]:
            if lst[3] in human_set:
                h2m[lst[3]].append(lst[0])
            else:
                h2m[lst[3]] = [ lst[0] ]
                human_set.add(lst[3])        


# mouse Entrez ID to mouse symbol ID
with open('./features.tsv') as f2:
    for line in f2:
        line = line.strip('\n')
        lst = line.split('\t')
        m2m[lst[0]] = lst[1]

# 遍历gmt文件每行,进行替换
with open('./h.all.v7.0.symbols.gmt') as f3:
    with open('./mm.all.v7.0.symbols.gmt', 'w') as f4:
        for line in f3:
            line = line.strip('\n')
            lst = line.split('\t')
            
            name = lst[0]
            url = lst[1]
            gene_list = lst[2:]
            
            # replace the human gene symbol into mouse gene symbol
            tmp_list = []
            for hg in gene_list:    #each gene in row
                mouse_entrez = h2m.get(hg, "")

                tmp = []          
                for entrez in mouse_entrez:   #each element in entrez IDs
                    mouse_gene = m2m[entrez]
                    tmp.append(mouse_gene)
                
                tmp_list.extend(tmp)
            
            row = [name, url]
            row.extend(tmp_list)
            row = '\t'.join(row)
            row = row + '\n'
            print(row)
            f4.write(row)

用GSVA将表达矩阵转换成通路矩阵,并生成热图

.libPaths(c('/share/nas2/public/R/library/3.6','/home/honghh/R/x86_64-redhat-linux-gnu-library/3.6'))
library(GSVA)
library(GSEABase)
library(limma)
library("org.Hs.eg.db")   
library(parallel)
library(Seurat)
options(future.globals.maxSize=9891289600)

wd='/share/nas1/Data/Users/luohb/.../20200220'
dir.create(wd)
setwd(wd)

RSA<-readRDS('/share/nas1/Data/Users/yinl/Project/personality/…/merge_seurat_rna.rds')
Idents(RSA)<[email protected]$seurat_clusters

c11_seu<-subset(RSA, idents = 11)

sam_list<-c(c11_seu)
name_list<-c('c11_seu')

i<-1
for(sam in sam_list){
  
  name<-name_list[i]
  print(name)
  #remove Mitochondrial gene
  tmp<-as.matrix(sam@assays$RNA@counts)
  bool<-!grepl('^mt-',rownames(sam),perl = T)
  
  tmp<-tmp[bool,]
  print(class(tmp))
  rownames(tmp)<-toupper(rownames(tmp))
  
  count.marix<-tmp
  keggSet <- getGmt("/share/nas1/Data/Users/luohb/20191223/mm.all.v7.0.symbols.gmt")
  keggEs <- gsva(expr=as.matrix(count.marix), gset.idx.list=keggSet, kcdf="Poisson", parallel.sz=30)
  
  saveRDS(keggEs, file=paste0(name, '_keggEs_filter.rds'))
  keggEs_filter<-readRDS(paste0(name, '_keggEs_filter.rds'))
  
#add annotation
group<-gsub('[1-9]_.*', '',colnames(keggEs_filter))

col_annotation=data.frame(Type=group)
rownames(col_annotation) = colnames(keggEs_filter)

#plot heatmap
p<-pheatmap::pheatmap(keggEs_filter, show_colnames=F, annotation_col=col_annotation)

filename<-paste0(name, '_keggEs_filter.heatmap.pdf')
ggsave(p, filename=filename,
       width = 40, height = 20)
  i<-i+1
}

你可能感兴趣的:(【单细胞转录组1】GSVA分析小鼠的单细胞转录组数据)