Handling GEO batch effects with a single function

Introduction

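When several datasets are merged for analysis, batch effects have to be dealt with. A 2011 paper benchmarking batch-effect correction tools reported that the ComBat function in the sva package performed best, so it is worth learning.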

1. Prepare the R packages

if(!require(BiocManager)) install.packages("BiocManager")
if(!require(sva)) BiocManager::install("sva")
if(!require(bladderbatch)) BiocManager::install("bladderbatch")
library(sva)
library(bladderbatch)

2. Explore the data

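The example data come from the bladderbatch package and are loaded with data(). As with data downloaded from GEO, the expression matrix and the clinical information can be extracted directly with exprs() and pData().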

data(bladderdata)
edata <- exprs(bladderEset) 
pheno <- pData(bladderEset) 
dim(edata);head(pheno)
## [1] 22283    57
##              sample outcome batch cancer
## GSM71019.CEL      1  Normal     3 Normal
## GSM71020.CEL      2  Normal     2 Normal
## GSM71021.CEL      3  Normal     2 Normal
## GSM71022.CEL      4  Normal     3 Normal
## GSM71023.CEL      5  Normal     3 Normal
## GSM71024.CEL      6  Normal     3 Normal

table(pheno$cancer)
## Biopsy Cancer Normal 
##   9     40      8 
edata[1:4,1:4]
##           GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL
## 1007_s_at    10.115170     8.628044     8.779235     9.248569
## 1053_at       5.345168     5.063598     5.113116     5.179410
## 117_at        6.348024     6.663625     6.465892     6.116422
## 121_at        8.901739     9.439977     9.540738     9.254368

3. Set up the model (optional)

mod = model.matrix(~as.factor(cancer), data=pheno)
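
To see what such a design matrix contains, here is a minimal sketch on a toy data frame. The `toy` object and its values are made up for illustration and are not taken from bladderbatch. With a three-level factor, model.matrix() returns an intercept column plus one indicator column for each non-reference level, which is presumably why ComBat reports 2 covariate levels for the three cancer groups in this dataset.

```r
# Toy phenotype table (hypothetical values, for illustration only)
toy <- data.frame(cancer = c("Normal", "Cancer", "Biopsy", "Cancer", "Normal"))

# Same call pattern as in the text: a single factor of interest
toy_mod <- model.matrix(~ as.factor(cancer), data = toy)

# Columns: an intercept plus one indicator per non-reference level
print(colnames(toy_mod))
print(dim(toy_mod))
```

The reference level is the first one in alphabetical order (here "Biopsy"), so it is absorbed into the intercept.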

4. The correction itself is a single step

combat_edata <- ComBat(dat = edata, batch = pheno$batch, mod = mod)
## Found5batches
## Adjusting for2covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
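
The messages above mention an L/S (location and scale) model. As rough intuition only, the sketch below removes a per-batch mean shift gene by gene on simulated data. It is a simplified stand-in and not ComBat's algorithm: it skips the scale adjustment, the empirical Bayes shrinkage, and the protection of the covariates in mod. All object names and simulated values are assumptions made for illustration.

```r
set.seed(1)
n_genes <- 50
batch <- rep(c(1, 2), each = 6)   # two hypothetical batches, 6 samples each

# Simulated expression: noise plus a constant shift of +3 in batch 2
sim <- matrix(rnorm(n_genes * length(batch)), nrow = n_genes)
sim[, batch == 2] <- sim[, batch == 2] + 3

# Center each gene within each batch, then restore the gene's overall mean
centered <- sim
for (b in unique(batch)) {
  idx <- batch == b
  centered[, idx] <- sim[, idx] - rowMeans(sim[, idx])
}
centered <- centered + rowMeans(sim)

# Gap between the two batch means, averaged over genes, before and after
gap_before <- mean(rowMeans(sim[, batch == 2]) - rowMeans(sim[, batch == 1]))
gap_after  <- mean(rowMeans(centered[, batch == 2]) - rowMeans(centered[, batch == 1]))
print(c(before = gap_before, after = gap_after))
```

Plain centering like this would also wipe out real biological differences if the groups are unevenly spread across batches, which is one reason to pass mod to ComBat instead.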

5. Compare the clustering before and after batch correction

## before
dist_mat <- dist(t(edata))
clustering <- hclust(dist_mat, method = "complete")
## after
dist_mat_combat <- dist(t(combat_edata))
clustering_combat <- hclust(dist_mat_combat, method = "complete")

Visualization

par(mfrow = c(2,2))
plot(clustering, labels = pheno$batch)
plot(clustering, labels = pheno$cancer)
plot(clustering_combat, labels = pheno$batch)
plot(clustering_combat, labels = pheno$cancer)
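[Figure: four dendrograms, before correction (top row) and after ComBat correction (bottom row), labeled by batch and by cancer status]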

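In the figure, before correction the cancer samples are mixed in among the normal samples, so the clustering is wrong; after correction the samples cluster correctly.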
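
Reading dendrograms by eye is subjective, so a cross-tabulation of cluster membership against the known labels can back up the visual impression. The sketch below is one possible way to do that. `cluster_vs_label` is a hypothetical helper, not part of sva, and it is demonstrated on simulated data so that it runs on its own.

```r
# Hypothetical helper: cross-tabulate hierarchical clusters against known labels
cluster_vs_label <- function(mat, labels, k) {
  hc <- hclust(dist(t(mat)), method = "complete")   # samples are columns, as in the text
  table(cluster = cutree(hc, k = k), label = labels)
}

# Simulated check: 100 genes, two groups of 5 samples separated by a large shift
set.seed(42)
grp <- rep(c("A", "B"), each = 5)
sim <- matrix(rnorm(100 * 10), nrow = 100)
sim[, grp == "B"] <- sim[, grp == "B"] + 10

tab <- cluster_vs_label(sim, grp, k = 2)
print(tab)
```

With the objects from this post, the same helper could be called as cluster_vs_label(edata, pheno$cancer, k = 3) and cluster_vs_label(combat_edata, pheno$cancer, k = 3) to compare the two tables; the choice of k = 3 is an assumption based on the three cancer groups.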

The example code comes from the ComBat help documentation, which can be viewed directly in RStudio with ??sva::ComBat.
