花式分析GEO数据集

首先,我参照GEO2R的在线工具,构建了一个函数,传递进GSE编号、平台号和分组信息,它就会返回给你差异分析的结果。
与GEO2R类似,但是有很多参数可以更改,也可以知道更多细节。代码如下:

# Version info: R 3.4.3, Biobase 2.30.0, GEOquery 2.40.0, limma 3.26.8
# R scripts generated  Fri May 4 00:38:36 EDT 2018
source("https://bioconductor.org/biocLite.R")
# biocLite()


#########################1.函数GEOAnaly#######################################
#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

GEOAnaly <- function(geo, gpl, groups){
  #这三个参数分别传递geo芯片编号、平台名、样本选择和分组信息
  # load series and platform data from GEO
  geodir = paste0('GEOdataDownloads/', geo)
  dir.create(geodir, recursive = T)
  gset = getGEO(GEO = geo, GSEMatrix =TRUE, AnnotGPL=TRUE, destdir = geodir)
  if (length(gset) > 1) idx = grep(gpl, attr(gset, "names")) else idx = 1
  gset = gset[[idx]]
  
  # make proper column names to match toptable 
  fvarLabels(gset) = make.names(fvarLabels(gset))
  
  # group names for all samples
  gsms = groups
  sml = c()
  for (i in 1:nchar(gsms)) { sml[i] = substr(gsms,i,i) }
  
  # eliminate samples marked as "X"
  sel = which(sml != "X")
  sml = sml[sel]
  gset = gset[ ,sel]
  
  
  # log2 transform
  ex = exprs(gset)
  qx = as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
  LogC = (qx[5] > 100) ||
    (qx[6]-qx[1] > 50 && qx[2] > 0) ||
    (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
  if (LogC) { ex[which(ex <= 0)] = NaN
  exprs(gset) = log2(ex) }
  # set up the data and proceed with analysis
  sml = paste("G", sml, sep="")    # set group names
  fl = as.factor(sml)
  gset$description = fl
  design = model.matrix(~ description + 0, gset)
  colnames(design) = levels(fl)
  fit = lmFit(gset, design)
  cont.matrix = makeContrasts(G1-G0, levels=design)
  fit2 = contrasts.fit(fit, cont.matrix)
  fit2 = eBayes(fit2, 0.01)
  tT = topTable(fit2, adjust="fdr", sort.by="B", number=10000)

  genesymbol = colnames(tT)[grep('symbol', colnames(tT), ignore.case = T)]#想想为什么加这一句
  tT = subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC",genesymbol))
  write.table(tT, file=stdout(), row.names=F, sep="\t")
  return(tT)
}

随便找了一个例子来进行分析。


花式分析GEO数据集_第1张图片
image.png

这个数据集有96个样本,解剖部位有舌、口底、加补、牙槽骨等,我们只想分析舌鳞癌的肿瘤和正常样本,分别标记为“1”和“0”,其它标记为x,于是这些就形成第三个参数。

gse31056 <- GEOAnaly('GSE31056', 'GPL10526', paste0("XXXXXXXXXXXXXXXXXXXX01XX01XXXXXXXX",
"01XXXXXXX01XXX01X1X0XXX0XXXXXXXXXXX10X10X01XXX01XX01X0XXX1XXXX"))
head(gse31056)

结果是出来了。但是其实还是有问题的:
如果不是正常比病变的对照设计怎么办?
如果有lncRNA与mRNA需要区分怎么办?

拿到结果后可以进行下游分析了。

花式分析GEO数据集_第2张图片
image.png

截图实在困难。就凑合着看吧!结果展示在坐下区域。

你可能感兴趣的:(花式分析GEO数据集)