Nov 21, 2019更新 => 修正部分function代码,实现了导出gene2module
目的
最近有个项目又准备跑WGCNA
,之前跑通过一个,然后很久没碰,都快忘了,这一年多对R又多了一丢丢的理解,干脆给他切分了几个步骤,分别写成了function。这里保存记录下,不过好像只适用于我自己的项目,如果大家有需求的话有些东西还是要改一下的,WGCNA的流程可不是复制一套别人的代码,然后点点点就出来结果的,还是需要可以读懂大部分的代码,脑子清楚自己要干啥。
参考内容
如有侵权,联系我删除!!
这里参考了多为大佬的WGCNA教程,之前只是懵懵懂懂ctrl c+v然后套数据,然后跑出个结果,这次折腾了这么久,也看懂了一部分,当然,总体还是ctrl c+v,但是多了一些理解在里面,这里做个记录,以后再有直接拿来用。
- 输入数据清洗,样品PCA,热图; 参考:「WGCNA-FAQ」「 生信技能树-这个WGCNA作业终于有学徒完成了!」
- WGCNA常规流程包括
sft
,module
,module-trait
,hubgene
: 「WGCNA分析,简单全面的最新教程」「 STEP6:WGCNA相关性分析」
代码记录
###########################################
# Assignment: WGCNA
# Date: Nov 17,2019
# Author: Shawn Wang
###########################################
##=====01.准备========
setwd("/Volumes/FileManage/06.360Drive/02.heterosis_project/08.4133BtGroup/03.WGCNA/")
options(stringsAsFactors = F)
library(edgeR)
library(WGCNA)
library(dplyr)
library(reshape2)
library(stringr)
library(tidyverse)
library(FactoMineR)
library(factoextra)
library(pheatmap)
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.01.PCAandHeatmap.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.02.WGCNA.SFT.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.03.WGCNA.module.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.04.WGCNA.moduleTrait.R")
source("/Volumes/FileManage/06.360Drive/02.heterosis_project/09.code/11.05.WGCNA.HubGene.R")
# 原始count数据
rawCount <- read.table("/Volumes/FileManage/06.360Drive/02.heterosis_project/08.4133BtGroup/03.WGCNA/02.Rawdata/mRNA_readcount.xls",
header = T,
sep = "\t")
# head(rawCount)
rownames(rawCount) <- rawCount$transcript_id
rawCount <- rawCount[,-1]
bt.raw <- data.frame(row.names = rownames(rawCount),
select(rawCount, contains("Bt_")),
select(rawCount, contains("J_")),
select(rawCount, contains("SY_")))
bt.raw <- select(bt.raw, -contains("Z12"))
# 按照组织分成3份
leaf <- select(bt.raw, contains("_L"))
ovule <- select(bt.raw, contains("_O"))
fiber <- select(bt.raw, contains("_F"))
##======02.方法======================
# 参数设置
type = "unsigned"
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)
# 组织
tissue = leaf
# 名字
Title = "leaf"
##======02.1 样品PCA和Heatmap============
WGCNA.PCAandHeatmap(tissue = tissue,
Title = Title)
##======02.2 软阈值SFT计算===============
# 表达
datExpr = y
WGCNA.SFT(Title = Title,
datExpr = datExpr)
##======02.3 Module计算===============
# 无向网络在power小于15或有向网络power小于30内,没有一个power值可以使
# 无标度网络图谱结构R^2达到0.8,平均连接度较高如在100以上,可能是由于
# 部分样品与其他样品差别太大。这可能由批次效应、样品异质性或实验条件对
# 表达影响太大等造成。可以通过绘制样品聚类查看分组信息和有无异常样品。
# 如果这确实是由有意义的生物变化引起的,也可以使用下面的经验power值。
if (is.na(power)){
power = ifelse(nSamples<20, ifelse(type == "unsigned", 9, 18),
ifelse(nSamples<30, ifelse(type == "unsigned", 8, 16),
ifelse(nSamples<40, ifelse(type == "unsigned", 7, 14),
ifelse(type == "unsigned", 6, 12))
)
)
}
WGCNA.oneStepNetWork(Title = Title)
##=======表型鉴定========
rownames(datExpr)
phenotype = data.frame(row.names = rownames(datExpr),
yield = rep(c(1,0),times = c(3,12)),
quality = rep(c(0,1,0,1),times = c(6,3,3,3)),
Yield_Quality = rep(c(0,1,0,1,0), times =c(3,3,3,3,3)))
WGCNA.ModuleTrait(Title = Title,
phenotype = phenotype)
##=======选择感兴趣的模块分析=========
## design设计
datTraits <- data.frame(row.names = rownames(phenotype),
subtype = factor(rep(c("Y","Y_Q","Y","Y_Q","Q"),each = 3),
levels = c("Y","Q","Y_Q")))
design = model.matrix(~0 + datTraits$subtype)
colnames(design) = levels(datTraits$subtype)
## 参数设置
moduleName = "blue"
phenoName = "Y_Q"
cor = 0.6
con = 0.90
##
WGCNA.HubGene(Title = Title,
cor = cor,
con = con,
moduleName = moduleName,
phenoName = phenoName)
Function
这里一个写了5个:
- PCA和heatmap:
WGCNA.PCAandHeatmap
- SFT :
WGCNA.SFT
- One step network:
WGCNA.oneStepNetWork
- 模块-表型:
WGCNA.ModuleTrait
, - hubgene:
WGCNA.HubGene
PCA和heatmap
WGCNA.PCAandHeatmap <- function(tissue,Title){
x <- tissue[apply(tissue,1,function(x) sum(x > 10) > (0.9*ncol(tissue))),]
# 01.1.将readcount转换为logcpm
y <- log10(edgeR::cpm(x)+1)
y[1:4,1:4]
# 01.2.将样品名称去掉生物学重复标记起到分组作用
z <- gsub("_.*","",colnames(y))
test <- as.data.frame(t(y))
# 01.3.将分组标记放到表达矩阵最后一列
dat <- cbind(test,z)
# 02.1.准备PCA数据
dat.pca <- PCA(dat[,-ncol(dat)], graph = F)
# 02.2.绘制PCA图并保存
fviz_pca_ind(dat.pca,
geom.ind = "point",
col.ind = dat$z,
palette = c("#9370DB", "#FF82AB", "#87CEFF", "#2E8B57", "#0000FF"),
addEllipses = T,
legend.title = "Groups")
ggsave(paste(Title,"SamplsPCAplot.pdf",sep = "_"), width = 8, height = 8)
# 将每行表达量最大的前5000个基因拿出来做热图
cg = names(tail(sort(apply(y, 1, function(x){sum(x)})),5000))
# pheatmap(pheatmap(y[cg,],show_rownames = F,show_colnames = F),scale = "row")
n=t(scale(t(y[cg,]))) # 'scale'可以对log-ratio数值进行归一化
n[n>3]=3
n[n< -3]= -3
n[1:4,1:4]
ac=data.frame(g=z)
rownames(ac)=colnames(n) #把ac的行名给到n的列名,即对每一个探针标记上分组信息(是'noTNBC'还是'TNBC')
pheatmap(n,show_colnames =T,show_rownames = F,
annotation_names_col = F,
annotation_col=ac,
filename = paste(Title,'heatmap_top5000.png',sep = "_"),
clustering_distance_rows = "euclidean")
assign("y",value = y, envir = globalenv())
}
SFT
WGCNA.SFT <- function(datExpr, Title){
datExpr <- datExpr
type = "unsigned"
corType = "pearson"
corFnc = ifelse(corType=="pearson", cor, bicor)
maxPOutliers = ifelse(corType=="pearson",1,0.05)
robustY = ifelse(corType=="pearson",T,F)
m.mad <- apply(datExpr,1,mad)
datExprVar <- datExpr[which(m.mad >
max(quantile(m.mad, probs=seq(0, 1, 0.8))[2],0.01)),]
dim(datExprVar)
datExpr <- as.data.frame(t(datExprVar))
## 检测缺失值
gsg = goodSamplesGenes(datExpr, verbose = 3)
if (!gsg$allOK){
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:",
paste(names(dataExpr)[!gsg$goodGenes], collapse = ",")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:",
paste(rownames(dataExpr)[!gsg$goodSamples], collapse = ",")));
# Remove the offending genes and samples from the data:
dataExpr = dataExpr[gsg$goodSamples, gsg$goodGenes]
}
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
assign("nGenes",value = nGenes, envir = globalenv())
assign("nSamples",value = nSamples, envir = globalenv())
dim(datExpr)
head(datExpr)[,1:8]
sampleTree = hclust(dist(datExpr), method = "average")
pdf(file = paste(Title,"Sample_clustering.pdf",sep = "."),width = 8,height = 5)
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="")
dev.off();
powers = c(c(1:10), seq(from = 12, to=30, by=2))
sft = pickSoftThreshold(datExpr, powerVector=powers,
networkType=type, verbose=5)
pdf(file = paste(Title,"SFTPlot.pdf",sep = "."),width = 10,height = 7)
par(mfrow = c(1,2))
cex1 = 0.9
# 横轴是Soft threshold (power),纵轴是无标度网络的评估参数,数值越高,
# 网络越符合无标度特征 (non-scale)
{plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",
ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red")
# 筛选标准。R-square=0.85
abline(h=0.90,col="red")
abline(h=0.85,col="green")
# Soft threshold与平均连通性
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers,
cex=cex1, col="red")
}
dev.off();
power = sft$powerEstimate
power
assign("datExpr",value = datExpr, envir = globalenv())
assign("power",value = power, envir = globalenv())
}
One step network
WGCNA.oneStepNetWork <- function(Title){
#######============一步法网络构建===========
# power: 上一步计算的软阈值
# maxBlockSize: 计算机能处理的最大模块的基因数量 (默认5000);
# 4G内存电脑可处理8000-10000个,16G内存电脑可以处理2万个,32G内存电脑可
# 以处理3万个
# 计算资源允许的情况下最好放在一个block里面。
# corType: pearson or bicor
# numericLabels: 返回数字而不是颜色作为模块的名字,后面可以再转换为颜色
# saveTOMs:最耗费时间的计算,存储起来,供后续使用
# mergeCutHeight: 合并模块的阈值,越大模块越少
net = blockwiseModules(datExpr, power = power, maxBlockSize = nGenes,
TOMType = type, minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs=TRUE, corType = corType,
maxPOutliers=maxPOutliers, loadTOMs=TRUE,
#saveTOMFileBase = paste(Title,".tom",sep = ""),
verbose = 3)
table(net$colors)
assign("net",value = net, envir = globalenv())
####============modular construction=================
## 灰色的为**未分类**到模块的基因。
# Convert labels to colors for plotting
moduleLabels = net$colors
assign("moduleLabels",value = moduleLabels, envir = globalenv())
moduleColors = labels2colors(moduleLabels)
assign("moduleColors",value = moduleColors, envir = globalenv())
# Plot the dendrogram and the module colors underneath
# 如果对结果不满意,还可以recutBlockwiseTrees,节省计算时间
pdf(file = paste(Title,"Module.pdf",sep = "."),width = 6,height = 6)
plotDendroAndColors(net$dendrograms[[1]], moduleColors[net$blockGenes[[1]]],
"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
dev.off()
# module eigengene, 可以绘制线图,作为每个模块的基因表达趋势的展示
MEs = net$MEs
### 不需要重新计算,改下列名字就好
### 官方教程是重新计算的,起始可以不用这么麻烦
MEs_col = MEs
colnames(MEs_col) = paste0("ME", labels2colors(
as.numeric(str_replace_all(colnames(MEs),"ME",""))))
MEs_col = orderMEs(MEs_col)
assign("MEs",value = MEs, envir = globalenv())
assign("MEs_col",value = MEs_col, envir = globalenv())
pdf(file = paste(Title,"Eigeng_adja_heatmap.pdf",sep = "."),width = 7,height = 10)
# 根据基因间表达量进行聚类所得到的各模块间的相关性图
# marDendro/marHeatmap 设置下、左、上、右的边距
plotEigengeneNetworks(MEs_col, "Eigengene adjacency heatmap",
marDendro = c(3,3,2,4),
marHeatmap = c(3,4,2,2), plotDendrograms = T,
xLabelsAngle = 90)
dev.off()
Gene2module <- data.frame(GID = colnames(datExpr),
Module = moduleColors)
write.table(Gene2module,file = paste(Title,"Gene2module.xls",sep = "_"),
row.names = F,
quote = F,
sep = "\t")
}
模块-表型
WGCNA.ModuleTrait <- function(Title,phenotype){
traitData <- phenotype
dim(traitData)
### 模块与表型数据关联
if (corType=="pearson") {
modTraitCor = cor(MEs_col, traitData, use = "p")
modTraitP = corPvalueStudent(modTraitCor, nSamples)
} else {
modTraitCorP = bicorAndPvalue(MEs_col, traitData, robustY=robustY)
modTraitCor = modTraitCorP$bicor
modTraitP = modTraitCorP$p
}
## Warning in bicor(x, y, use = use, ...): bicor: zero MAD in variable 'y'.
## Pearson correlation was used for individual columns with zero (or missing)
## MAD.
# signif表示保留几位小数
textMatrix = paste(signif(modTraitCor, 2), "\n(", signif(modTraitP, 1), ")", sep = "")
dim(textMatrix) = dim(modTraitCor)
pdf(file = paste(Title,"Module_trait.pdf",sep = "."),width = 8,height = 10)
labeledHeatmap(Matrix = modTraitCor, xLabels = colnames(traitData),
yLabels = colnames(MEs_col),
cex.lab = 0.7, xLabelsAngle = 0, xLabelsAdj = 0.5,
ySymbols = substr(colnames(MEs_col),3,1000), colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix, setStdMargins = FALSE,
cex.text = 0.6, zlim = c(-1,1),
main = paste("Module-trait relationships"))
dev.off()
}
Hubgene 提取
WGCNA.HubGene <- function(Title,cor,con,moduleName,phenoName){
## 联通性计算
# (1) Intramodular connectivity
connet=abs(cor(datExpr,use="p"))^6
Alldegrees1=intramodularConnectivity(connet, moduleColors)
###(3) Generalizing intramodular connectivity for all genes on the array
datKME=signedKME(datExpr, MEs_col, outputColumnName="MM.")
write.table(datKME, paste(Title,"Conectivity_of_each_modular.xls",sep = "."),
sep = "\t",
row.names = T,
quote = F)
# Display the first few rows of the data frame
##Finding genes with high gene significance and high intramodular connectivity in interesting modules
PheName <- as.data.frame(design[,grep(phenoName,colnames(design))])
names(PheName) = phenoName
GS1 = as.numeric(cor(PheName,datExpr, use = "p"))
# abs(GS1)模块和基因的关联性
# abs(datKME$MM.green) 基因的连通性
num <- grep(moduleName,colnames(datKME))
FilterGenes= abs(GS1)> cor & abs(datKME[,num])>con
hubGenes_raw = data.frame(ID = rownames(datKME),
TORF = FilterGenes)
hubGenes = filter(hubGenes_raw, TORF == "TRUE")
table(hubGenes)
write.table(hubGenes,file = paste(moduleName,phenoName,"hubGene.xls",
sep = "_"),
sep = "\t",
row.names = F,
quote = F)
}
结果
-
做完后在working dir下面应该会有以下几个文件,关于以下图片及文件的意义,这个看我参考的那几篇文章理解就好;
还会发现我基本没有做基因和模块的相关,这部分如果有需求,也可以看参考的文章实现。
一些问题总结
-
正确的输入数据格式
基本教程,FAQ中提到的我都尝试了一下,不外乎以下几种:- FPKM或者RPKM。
- log(fpkm+1)
- z-scored fpkm
- log(tpm +1)
使用不同类型的输入数据会发现分出的module数量也不一样,甚至sft都不尽相同,但总体来看,差异不大,用MAD筛选过后留下的基因中,除了直接用fpkm之外,其他几种方法差异不是很大,我最终选择了用log(tpm+1)
, 而且在转换前先用90%样本中readcount > 10
这个阈值去一下背景。个人习惯罢了。
无法找到合适的Power值:
这些在「WGCNA-FAQ」中作者已经解释,这里着重说下我遇到的问题。
首先,有一次我尝试用大量不同品种的不同组织的转录组(17个品种,3个组织,3个重复共153个sample)做WGCNA,发现一个特别有趣的结果:在sft
阶段发现R2从1开始为明显的负数值,随着power的增大R2逐渐变为正数。刚开始百思不得其解,后来找到了答案:
Data heterogeneity may affect any statistical analysis, and even more so an unsupervised one such as WGCNA. What, if any, modifications should be made to the analysis depends crucially on whether the heterogeneity (or its underlying driver) is considered "interesting" for the question the analyst is trying to answer, or not. If one is lucky, the main driver of sample differences is the treatment/condition one studies, in which case WGCNA can be applied to the data as is. Unfortunately, often the heterogeneity drivers are uninteresting and should be adjusted for. Such factors can be technical (batch effects, technical variables such as post-mortem interval etc.) or biological (e.g., sex, tissue, or species differences).
原因就是样品异质性(heterogeneity)! 由于样本中总体分成了3个组织,由于基因表达的时空特异性,组织间的差异特别大,所以会导致组织内连通性超级高,而组织间连通性特别低。具体解释见:Question about WGCNA soft thresholding value
The negative "signed R^2" is negative when your network has more genes with high connectivity than ones with low connectivity (i.e., the regression line for the fit log(n(k))~log(k) has a positive slope). It means your network shows a topology in some ways opposite (more high connectivity than low connectivity genes) to what is normally expected (a lot of low-connetivity gens and fewer high connectivity genes).Usually, a lot of high-connectivity genes means there is a strong global driver (e.g., you have samples from different tissues or a strong batch effect). Make sure your sample tree doesn't show very strong branches. Also, see WGCNA FAQ (https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html) for some comments about heterogeneous data sets and lack of SFT.
这会导致一个结果,就是无法构成一个无尺度网络;
这也不代表WGCNA往下走没有意义了! 不要轻易的怀疑自己的结果,不是为了做WGCNA而做WGCNA,这样就失去了分析的意义,如果有一定的生物学问题,那么他就是有意义的!
上面说了虽然无法构成一个无尺度的网络,我们后期可能无法取到合适的Hubgene,但是这样想,如果我想从这个结果找到与组织特异性强的模块,这个后续分析就是有意义的,我们用经验的power值继续构建module,然后用tissue对sample进行分类,最后我们可以得到一个module-tissue的相关性结果,后续通过对相应module的基因进行富集分析,证实了我的想法,确实叶片的结果大多富集到生物节律,光合作用,叶绿体,细胞循环等,胚珠中富集到花发育,细胞壁,细胞膜,激素等通路...grey模块中富集的较少,但不难看出有许多看家基因。虽然说生物学意义不是很大,但是说明了这条路应该没有错,下一步可以通过对ncRNA的WGCNA找到组织特异性表达的lncRNA,circRNA等等。这些module就会很有意义。
应该是未完待续~~