复现这篇文献里的一幅热图
Tinagl1 Suppresses Triple-Negative Breast Cancer Progression and Metastasis by Simultaneously Inhibiting Integrin/FAK and EGFR Signaling .
文中对于图A、B、C的描述
We next tested whether Tinagl1 affects downstream signaling in the EGFR and integrin pathways. Microarray gene expression profiling was performed on lung metastatic lesions produced by Tinagl1-expressing or control LM2 cells (Figure S5A).
Gene set enrichment analysis indicated that genes induced by EGF or suppressed by either EGFR or FAK inhibition were significantly enriched in control cells compared with Tinagl1-express-ing cells
(Figure5A
; Table S2). To further confirm the result, a set of genes that regulated by EGFR and integrin/FAK signaling were generated.Genes downregulated by EGFR and integrin/FAK signaling were significantly increased in the Tinagl1-expressing group, while genes upregulated by either signaling programs were enriched in the control group
(Figure5B
). Collectively, the results indicate that Tinagl1 was negatively correlated with EGFR and FAK activation and may inhibit both pathways.
Previous studies indicated that the EGFR and integrin/FAK pathways are intricately connected (Bill et al., 2004). Based on previous EGFR-related signatures and the microarray data from fibronectin (FN)- or FAK inhibitor-treated cells, we identified two sets of EGFR and integrin/FAK crosstalk genes: (1) genes induced by EGF that are resistant to EGFR inhibitor treatment and upregulated by integrin/FAK signaling (termed as genes compensated by integrin/FAK signaling); and (2)
genes induced by FN that are resistant to FAK inhibitor treatment and upregulated by EGFR signaling (termed as genes compensated by EGFR signaling
) (Table S3; Figure5C
). Interestingly, both sets of compensatory gene networks were suppressed in Tinagl1-ex-pressing cells (Figure 5C), further suggesting that Tinagl1 may inhibit EGFR and integrin/FAK signal pathways simultaneously.
<菜鸟一枚,所以现在只是复现这篇文章里的B图,C图里的通路GO/KEGG注释及A图中的GSEA下回分解>
从文章中,我们找到GSE号
All microarray data generated in this study have been deposited as a superseries at the NCBI Gene Expression Omnibus with the accession code GSE99507, and can be accessed at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99507.
点开,看到如下分组,这里面的分组信息先留个悬念
<从下面开始,跟着老大的哔哩哔哩GEO挖掘视频一步一步走>
1.现在我们需要找到这个GSE号对应的GPL平台,然后找到这个GPL平台所对应的注释R包
老大GEO视频中一闪而过的文章是:从GEO数据库下载得到表达矩阵 一文就够
现在我们需要通这个GPL平台找到对应的注释R包然后找到下面这个包装函数的板块,截图如下:图片中是老大已经包装好的函数,我们这次要下载的GSE号是99507,因此在R里代码框输入下面即可,如图:
得到当前GSE号所对应的芯片平台,如下图:即GPL17077.
当然,我们也可以在GEO网站上看到平台信息是这样的,截图如下
接下来,因此我们可以去老大的搜集贴找对应关系,参考老大的这篇文章:(16)芯片探针与基因的对应关系-生信菜鸟团博客2周年精选文章集,截图如下:
从以上列表中,用GPL17077去找对应的注释包,发现没有。(老大已经总结地很全了,但是依然有些GPL平台是没有注释包或者是有我们没有找到。)
而老大GEO视频里的GPL平台其实是有注释R包的,就是下图中的这个hgu95av2.db
,视频中老大从R语言20练习题里有关于ID转换的代码,如下图:
总之,现在的GPL17077平台现在并没有对应的注释R包,当然更不是视频中的这个hgu95av2.db
了。为了能继续跟着按照老大视频中讲的继续做,我去Google搜索,用的是前面在GEO网站里GPL平台对应的平台信息,前面也有截图,如下
Google找的说是这个hgug4110b.db的注释R包,其实在老大总结里这个包对应的是GPL887平台,但是我想还是试试,于是我尝试了下载下,代码如下啦
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(GEOquery)
gset <- getGEO('GSE99507',destdir=".",AnnotGPL = F,getGPL = F)
class(gset)
length(gset)
class(gset[[1]])
a=gset[[1]]
dat=exprs(a)
dim(dat)
#老大给包装了一个函数
downGSE <- function(studyID = "GSE1009", destdir = ".") {
library(GEOquery)
eSet <- getGEO(studyID, destdir = destdir, getGPL = F)
exprSet = exprs(eSet[[1]])
pdata = pData(eSet[[1]])
write.csv(exprSet, paste0(studyID, "_exprSet.csv"))
write.csv(pdata, paste0(studyID, "_metadata.csv"))
return(eSet)
}
downGSE('GSE99507')
BiocManager::install('hgug4110b.db')
library('hgug4110b.db')
?hgug4110b
ls("package:hgug4110b.db")
#下面这段很眼熟,就是我上面在R语言20题里的截图,我把视频中的hgu95av2.db包换成我在谷歌搜索到的hgug4110b.db包
library('hgug4110b.db')
ids=toTable(hgug4110bSYMBOL)
length(unique(ids$symbol))
tail(sort(table(ids$symbol)))
table(sort(table(ids$symbol)))
plot(table(sort(table(ids$symbol))))
exprSet<-dat
table(rownames(exprSet) %in% ids$probe_id)
dim(exprSet)
exprSet=exprSet[rownames(exprSet) %in% ids$probe_id,]
dim(exprSet)
ids=ids[match(rownames(exprSet),ids$probe_id),]
head(ids)
exprSet[1:5,1:5]
tmp = by(exprSet,ids$symbol,
function(x) rownames(x)[which.max(rowMeans(x))] )
probes = as.character(tmp)
exprSet=exprSet[rownames(exprSet) %in% probes ,]
dim(exprSet)
#写一个函数,可以之间过滤掉其他探针并且进行ID转换,
#此时的id转换是用bioconductor包来转换的,如果下载的matrix,则不用转换
jimmy <- function(exprSet,ids){
tmp = by(exprSet,
ids$symbol,
function(x) rownames(x)[which.max(rowMeans(x))] )
probes = as.character(tmp)
print(dim(exprSet))
exprSet=exprSet[rownames(exprSet) %in% probes ,]
print(dim(exprSet))
return(exprSet)
}
new.exprSet <- jimmy(exprSet,ids)
#此时会报错,因为现在的exprSet和ids的长度不一样
由于expeSet里的ID能在ids里对应的很少,大部分都是对应不上的,如下图
对应上的基因数TRUE只有8190,因此我认为我谷歌到的这个hgug4110b.db
并不对。不过==没关系==
因为老大视频里又给了找不到对应的注释R包时,通过下面的代码来实现探针名和基因名的对接,代码网址:https://github.com/jmzeng1314/my-R/blob/master/9-microarray-examples/E-MTAB-3017.R
#代码如下
gpl <- getGEO('GPL17077', destdir=".")
colnames(Table(gpl))
head(Table(gpl)[,c(1,6,7)]) ## you need to check this , which column do you need
write.csv(Table(gpl)[,c(1,6,7)],"GPL17077.csv")
ids=read.csv('GPL17077.csv')
#这样我们就同样获得了探针对应基因的表达矩阵啦
前面我所认为的第一步刚刚结束了,获得了探针对应基因的表达矩阵
2.下载数据,获得分组信息
下载数据:老大在视频里讲了三种找到数据集后的数据下载方式,我简单罗列如下(目的得到表达矩阵):
1.直接下载raw data,但不推荐大家用,原始数据
2.下载表达矩阵 series matrix file(s),下载后可读到R里面,考验网速
a=read.table('GSE42872_series_matrix.txt.gz')
> class(a)
[1] "data.frame"
> str(gset)
3.在R里面读取GSE号.
gset <- getGEO("GSE42589")
加载GEO包下载
library(GEOquery)
gset <- getGEO('GSE42872',destdir=".",AnnotGPL = F,getGPL = F)
老大推荐第三种方式,下面是代码
#在下载前,必要的设置镜像、安装包和加载包如下:
###step0-install
rm(list = ls())
options()$repos
options()$BioC_mirror
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options()$repos
options()$BioC_mirror
# https://bioconductor.org/packages/release/bioc/html/GEOquery.html
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("KEGG.db",ask = F,update = F)
BiocManager::install(c("GSEABase","GSVA","clusterProfiler" ),ask = F,update = F)
BiocManager::install(c("GEOquery","limma","impute" ),ask = F,update = F)
BiocManager::install(c("genefu","org.Hs.eg.db","hgu133plus2.db" ),ask = F,update = F)
# source("https://bioconductor.org/biocLite.R")
# library('BiocInstaller')
# options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
# BiocInstaller::biocLite("GEOquery")
# BiocInstaller::biocLite(c("limma"))
# BiocInstaller::biocLite(c("impute"))
options()$repos
install.packages('WGCNA')
install.packages(c("FactoMineR", "factoextra"))
install.packages(c("ggplot2", "pheatmap","ggpubr"))
library("FactoMineR")
library("factoextra")
library(GSEABase)
library(GSVA)
library(clusterProfiler)
library(genefu)
library(ggplot2)
library(ggpubr)
library(hgu133plus2.db)
library(limma)
library(org.Hs.eg.db)
library(pheatmap)
#就这么来就对了,没毛病
###step1-downloadR
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
# 注意查看下载文件的大小,检查数据
f='GSE99507_eSet.Rdata'
library(GEOquery)
# 这个包需要注意两个配置,一般来说自动化的配置是足够的。
#Setting options('download.file.method.GEOquery'='auto')
#Setting options('GEOquery.inmemory.gpl'=FALSE)
if(!file.exists(f)){
gset <- getGEO('GSE99507', destdir=".",
AnnotGPL = F, ## 注释文件
getGPL = F) ## 平台文件
save(gset,file=f) ## 保存到本地
}
load('GSE99507_eSet.Rdata') ## 载入数据
class(gset)
length(gset)
class(gset[[1]])
# 因为这个GEO数据集只有一个GPL平台,所以下载到的是一个含有一个元素的list
a=gset[[1]]
dat=exprs
dim(dat)
# GPL3921 [HT_HG-U133A] Affymetrix HT Human Genome U133A Array
# Bioconductor - hgu133a.db
dat[1:4,1:4]
pd=pData(a) #获得分组信息,就得到了前面一张截图,即前面悬念的解答处-GSE号对应的Samples分组信息,主要是根据文章里的那个截图的分组信息来对字符串进行拆分(str_split)和替换(str_replace)
library(stringr)
group_list = trimws(str_split(pd$title,' ',simplify = T)[,5])#拆分
group_list = str_replace(group_list,'LM2','Vector')
table(group_list)
save(dat,group_list,file = 'step1-output.Rdata')
#此时,要注意保存的这个文件step1-output.Rdata,我们要知道此时的dat表达矩阵是什么样的
3.ID转换
###step2-ids-filter.R
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load('step1-output.Rdata')
gpl <- getGEO('GPL17077', destdir=".")
colnames(Table(gpl)) ## [1] 41108 17
head(Table(gpl)[,c(1,7)]) ## you need to check this , which column do you need
write.csv(Table(gpl)[,c(1,7)],"GPL17077.csv")
ids=read.csv('GPL17077.csv')
ids=ids[,-1]
dat[1:4,1:4]
rownames(dat)
#如前所述,现在没有这个平台对应的注视R包,所以下面是根据下载的dat和ids来进行的ID转换步骤
ids[ids$GENE_SYMBOL!='',]
ids=ids[ids$GENE_SYMBOL!='',]
match(rownames(dat),ids$ID)
ids[match(rownames(dat),ids$ID),]
ids=ids[match(rownames(dat),ids$ID),]
ids=ids[complete.cases(ids),]
dat=dat[ids$ID,]
#下面是针对那些有多个基因去重的步骤,注意理解为什么要取median(中位数),巧妙
ids$median=apply(dat,1,median)
ids=ids[order(ids$GENE_SYMBOL,ids$median,decreasing = T),]
ids=ids[!duplicated(ids$GENE_SYMBOL),]
dat=dat[ids$ID,]
rownames(dat)=ids$GENE_SYMBOL
dat[1:4,1:4]
dim(dat)
save(dat,group_list,file = 'step2-output.Rdata')
#此时此时,要注意保存的这个文件step2-output.Rdata,我们要知道此时的dat表达矩阵是什么样的
4.PCA和Heatmap
###step3-pca-heatmap
##PCA主成分分析
(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = 'step1-output.Rdata')
# 每次都要检测数据
dat[1:4,1:4]
## 下面是画PCA的必须操作,需要看说明书。
dat=t(dat)
dat=as.data.frame(dat)
dat=cbind(dat,group_list)
library("FactoMineR")
library("factoextra")
# The variable group_list (index = ) is removed
# before PCA analysis
dat.pca <- PCA(dat[,-ncol(dat)], graph = FALSE)
fviz_pca_ind(dat.pca,repel =T,
#geom.ind = "point", # show points only (nbut not "text")
col.ind = dat$group_list, # color by groups
palette = c("#00AFBB", "#E7B800"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "Groups"
)
##热图:heatmap-sd
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = 'step1-output.Rdata')
# 每次都要检测数据
dat[1:4,1:4]
cg=names(tail(sort(apply(dat, 1, sd)),1000))
library(pheatmap)
pheatmap(dat[cg,],show_colnames =F,show_rownames = F)
rowMeans(dat[,1:3]) - rowMeans(dat[,4:6])
cg1=names(tail(sort(rowMeans(dat[,1:3]) - rowMeans(dat[,4:6]) ),500))
cg2=names(head(sort(rowMeans(dat[,1:3]) - rowMeans(dat[,4:6]) ),500))
cg=c(cg1,cg2)
n=t(scale(t(dat[cg,])))
n[n>2]=2
n[n< -2]= -2
n[1:4,1:4]
pheatmap(n,show_colnames =F,show_rownames = F)
ac=data.frame(g=group_list)
rownames(ac)=colnames(n)
pheatmap(n,show_colnames =F,
show_rownames = F,
cluster_cols = F,
annotation_col=ac)
5.差异分析
##step4-DEG
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
load(file = 'step2-output.Rdata')######要知道此时是step2中的dat哦
# 每次都要检测数据
dat[1:4,1:4]
table(group_list)
##boxplot图
boxplot(dat[1,]~group_list)
t.test(dat[1,]~group_list)
bp=function(g){
library(ggpubr)
df=data.frame(gene=g,stage=group_list)
p <- ggboxplot(df, x = "stage", y = "gene",
color = "stage", palette = "jco",
add = "jitter")
# Add p-value
p + stat_compare_means()
}
bp(dat[1,])
bp(dat[2,])
bp(dat[3,])
bp(dat[4,])
##DEG-limma包
library(limma)
design=model.matrix(~factor( group_list ))
fit=lmFit(dat,design)
fit=eBayes(fit)
options(digits = 4)
#topTable(fit,coef=2,adjust='BH')
topTable(fit,coef=2,adjust='BH')
bp(dat['KLK10',])
bp(dat['FXYD3',])
deg=topTable(fit,coef=2,adjust='BH',number = Inf)
head(deg)
save(deg,file = 'deg.Rdata')
## for volcano
if(T){
nrDEG=deg
head(nrDEG)
attach(nrDEG)
plot(logFC,-log10(P.Value))
library(ggpubr)
df=nrDEG
df$v= -log10(P.Value)
ggscatter(df, x = "logFC", y = "v",size=0.5)
df$g=ifelse(df$P.Value>0.01,'stable',
ifelse( df$logFC >1.5,'up',
ifelse( df$logFC < -1.5,'down','stable') )
)
table(df$g)#此时获得的上调基因为20,下调基因为145
df$symbol=rownames(df)
ggscatter(df, x = "logFC", y = "v",size=0.5,color = 'g')
ggscatter(df, x = "logFC", y = "v", color = "g",size = 0.5,
label = "symbol", repel = T,
#label.select = rownames(df)[df$g != 'stable'] ,
label.select = rownames(head(head(deg))),
palette = c("#00AFBB", "#E7B800", "#FC4E07") )
ggscatter(df, x = "AveExpr", y = "logFC",size = 0.2)
df$p_c = ifelse(df$P.Value<0.001,'p<0.001',
ifelse(df$P.Value<0.01,'0.0010.01'))
table(df$p_c )
ggscatter(df,x = "AveExpr", y = "logFC", color = "p_c",size=0.2,
palette = c("green", "red", "black") )
}
## 热图:for heatmap
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
if(T){
load(file = 'step2-output.Rdata')
# 每次都要检测数据
dat[1:4,1:4]
table(group_list)
load(file = 'deg.Rdata')
x=deg$logFC
names(x)=rownames(deg)
cg=c(names(head(sort(x),20)),#老大的代码中是100、100,由于我获得上下调基因少,就改了
names(tail(sort(x),145)))
library(pheatmap)
pheatmap(dat[cg,],show_colnames =F,show_rownames = F)
n=t(scale(t(dat[cg,])))
n[n>2]=2
n[n< -2]= -2
n[1:4,1:4]
pheatmap(n,show_colnames =F,show_rownames = F)
ac=data.frame(g=group_list)
rownames(ac)=colnames(n)
pheatmap(n,show_colnames =F,
show_rownames = F,
cluster_cols = F,
annotation_col=ac)
}
到这里就差不多结束了,基本得到和文章中差不多的热图。还有很多做图技巧需要慢慢学习。其中的代码均从老大那copy过来的,需要边做边理解,注意每一步变量的变化,以不变应万变,这点很重要哇。
十分<感谢老大>
的指导和帮助,第一次做到这里,很开心哇~