首先,我参照GEO2R的在线工具,构建了一个函数,传递进GSE编号、平台号和分组信息,它就会返回给你差异分析的结果。
与GEO2R类似,但是有很多参数可以更改,也可以知道更多细节。代码如下:
# Version info: R 3.4.3, Biobase 2.30.0, GEOquery 2.40.0, limma 3.26.8
# R scripts generated Fri May 4 00:38:36 EDT 2018
source("https://bioconductor.org/biocLite.R")
# biocLite()
#########################1.函数GEOAnaly#######################################
# Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)
GEOAnaly <- function(geo, gpl, groups){
#这三个参数分别传递geo芯片编号、平台名、样本选择和分组信息
# load series and platform data from GEO
geodir = paste0('GEOdataDownloads/', geo)
dir.create(geodir, recursive = T)
gset = getGEO(GEO = geo, GSEMatrix =TRUE, AnnotGPL=TRUE, destdir = geodir)
if (length(gset) > 1) idx = grep(gpl, attr(gset, "names")) else idx = 1
gset = gset[[idx]]
# make proper column names to match toptable
fvarLabels(gset) = make.names(fvarLabels(gset))
# group names for all samples
gsms = groups
sml = c()
for (i in 1:nchar(gsms)) { sml[i] = substr(gsms,i,i) }
# eliminate samples marked as "X"
sel = which(sml != "X")
sml = sml[sel]
gset = gset[ ,sel]
# log2 transform
ex = exprs(gset)
qx = as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC = (qx[5] > 100) ||
(qx[6]-qx[1] > 50 && qx[2] > 0) ||
(qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] = NaN
exprs(gset) = log2(ex) }
# set up the data and proceed with analysis
sml = paste("G", sml, sep="") # set group names
fl = as.factor(sml)
gset$description = fl
design = model.matrix(~ description + 0, gset)
colnames(design) = levels(fl)
fit = lmFit(gset, design)
cont.matrix = makeContrasts(G1-G0, levels=design)
fit2 = contrasts.fit(fit, cont.matrix)
fit2 = eBayes(fit2, 0.01)
tT = topTable(fit2, adjust="fdr", sort.by="B", number=10000)
genesymbol = colnames(tT)[grep('symbol', colnames(tT), ignore.case = T)]#想想为什么加这一句
tT = subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC",genesymbol))
write.table(tT, file=stdout(), row.names=F, sep="\t")
return(tT)
}
随便找了一个例子来进行分析。
这个数据集有96个样本,解剖部位有舌、口底、加补、牙槽骨等,我们只想分析舌鳞癌的肿瘤和正常样本,分别标记为“1”和“0”,其它标记为x,于是这些就形成第三个参数。
gse31056 <- GEOAnaly('GSE31056', 'GPL10526', paste0("XXXXXXXXXXXXXXXXXXXX01XX01XXXXXXXX",
"01XXXXXXX01XXX01X1X0XXX0XXXXXXXXXXX10X10X01XXX01XX01X0XXX1XXXX"))
head(gse31056)
结果是出来了。但是其实还是有问题的:
如果不是正常比病变的对照设计怎么办?
如果有lncRNA与mRNA需要区分怎么办?
拿到结果后可以进行下游分析了。
截图实在困难。就凑合着看吧!结果展示在坐下区域。