WGCNA（3）：基因模块与性状关联识别重要基因

前面的文章中完成了数据导入和网络构建，下面是将基因模块与性状进行关联，从而识别与表型相关的重要基因。
WGCNA（1）：R包安装及数据导入清洗 - (jianshu.com)
WGCNA（2a）：一步法完成网络构建和模块检测 - (jianshu.com)
WGCNA（2b）：分步法完成网络构建和模块检测 - (jianshu.com)

参考材料还是官方说明书：
https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf

1. 准备工作

导入前期数据，这里我选择了分步法构建网络的结果，大家可以根据自己的数据选择使用哪种方法构建网络。

# 设置工作目录
> setwd("D:/RNA-seq/WGCNA/mad0.3")
# 载入WGCNA包
> library('WGCNA')
# 允许R语言以最大线程运行
> options(stringsAsFactors = FALSE)
> allowWGCNAThreads()
Allowing multi-threading with up to 4 threads.
# 载入第一步中的表达量和表型值
> lnames = load(file = "WGCNA0.3-dataInput.RData")
> lnames
# 载入第二步的网络数据
> lnames = load(file = "networkConstruction-stepByStep.RData")
> lnames

2. 将基因模块与表型数据关联

2.1 量化基因模块与表型数据的相关关系

在此分析中，我们希望识别与表型显著相关的基因模块，因为前期已经有了每个模块的概要文件(特征基因)，我们只需要简单地将特征基因与表型联系起来，并寻找最显著的关联即可。

# 明确基因和样本数
> nGenes = ncol(datExpr)
> nSamples = nrow(datExpr)
# 用颜色标签重新计算MEs
> MEs0 = moduleEigengenes(datExpr, moduleColors)$eigengenes
> MEs = orderMEs(MEs0)
> moduleTraitCor = cor(MEs, datTraits, use = "p")
> moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
# 查看每个模块的颜色及包含的基因数目
> table(moduleColors)

绘制基因模块和性状的相关性热图（Fig.1）。

> pdf("Module-trait associations.pdf",width = 8, height=10)
# 通过p值显示相关性
> textMatrix = paste(signif(moduleTraitCor, 2), "\n(", signif(moduleTraitPvalue, 1), ")", sep = "")
> dim(textMatrix) = dim(moduleTraitCor)
> par(mar = c(6, 8.5, 3, 3))
# 通过热图显示相关性
> labeledHeatmap(Matrix = moduleTraitCor, xLabels = names(datTraits), yLabels = names(MEs), ySymbols = names(MEs), colorLabels = FALSE, colors = greenWhiteRed(50), textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, zlim = c(-1,1), main = paste("Module-trait relationships"))
# 想改变热图配色可以更改 colors = greenWhiteRed(50)
> dev.off()

Figure 1: Module-trait associations. Each row corresponds to a module eigengene, column to a trait. Each cell contains the corresponding correlation and p-value. The table is color-coded by correlation according to the color legend.

2.2 样本与模块间的关系

明确模块与性状之间的关系后，想继续看一下各个样本与模块之间的关系，也可以通过绘制热图来解决。首先创建一个矩阵，列名为sample名称，行名与FPKM文件中的行名一致，横纵坐标一致为1，否则为0。

> sample=read.csv("sample.csv")
> head(sample)
           X weight_g length_cm ab_fat
1 weight_g  1   0   0  
2 length_cm  0  1   0 
3 ab_fat  0   0   1 
> rownames(sample) <- sample[,1]
> sample<- sample[,-1]

进行相关性计算，并通过热图展示：

> moduleSampleCor = cor(MEs, sample, use = "p")
> moduleSamplePvalue = corPvalueStudent(moduleSampleCor, nSamples)
> pdf("Module-Sample associations.pdf",width = 8, height=10)
# 通过p值显示相关性
> textMatrix = paste(signif(moduleSampleCor, 2), "\n(", 
  signif(moduleSamplePvalue, 1), ")", sep = "")
> dim(textMatrix) = dim(moduleSampleCor)
> par(mar = c(6, 8.5, 3, 3))
# 通过热图显示相关性
> labeledHeatmap(Matrix = moduleSampleCor, xLabels = 
  names(sample), yLabels = names(MEs), ySymbols = 
  names(MEs), colorLabels = FALSE, colors = blueWhiteRed(50), 
  textMatrix = textMatrix, setStdMargins = FALSE, cex.text = 0.5, 
  zlim = c(-1,1), main = paste("Module-Sample relationships"))
> dev.off()

这里引用一张其他文章里的图说明，原文链接：
https://www.mdpi.com/1422-0067/22/17/9622/htm#

2.3 基因显著性（GS）和模块成员（MM）

在上一步计算中，得到了各个基因模块和性状之间的相关性，这一步计算在一个模块中，每个基因（MM）在这个模块中的权重，即每个基因和这个模块的相关性。这里需要是连续性状。

# 指定datTrait中感兴趣的一个性状，这里选择weight_g
> weight = as.data.frame(datTraits$weight_g)
> names(weight) = "weight"

#  各基因模块的名字（颜色）
> modNames = substring(names(MEs), 3)

# 计算MM的P值
> geneModuleMembership = as.data.frame(cor(datExpr, MEs, use = "p"))
> MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership 
 ), nSamples))
> names(geneModuleMembership) = paste("MM", modNames, sep="")
> names(MMPvalue) = paste("p.MM", modNames, sep="")

# 计算性状和基因表达量之间的相关性（GS）
> geneTraitSignificance = as.data.frame(cor(datExpr, weight, use = "p"))
> GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), 
  nSamples))
> names(geneTraitSignificance) = paste("GS.", names(weight), sep="")
> names(GSPvalue) = paste("p.GS.", names(weight), sep="")

2.4 模块内分析：鉴定高GS和MM基因

通过上一步计算，可以在感兴趣的基因模块中，筛选出高GS和MM的基因。在Fig.1中可以看到，与性状weight_g相关性最高的基因模块是MEbrown，因此选择MEbrown进行后续分析，绘制brown模块中GS和MM的散点图。

> module = "brown"
> column = match(module, modNames)
> moduleGenes = moduleColors==module
> pdf("Module membership vs gene significance.pdf",width = 7, height=7)
> par(mfrow = c(1,1))
> verboseScatterplot(abs(geneModuleMembership[moduleGenes, 
  column]),abs(geneTraitSignificance[moduleGenes, 1]), xlab = 
  paste("Module Membership in", module, "module"), ylab = "Gene 
  significance for body weight", main = paste("Module membership 
  vs gene significance"), cex.main = 1.2, cex.lab = 1.2, cex.axis = 
  1.2, col = 'black')
> dev.off()

Fig.2中可以清楚的看出，GS和MM有极强的相关性，表明与性状高夫相关的基因，也是模块中的重要基因。

Figure 2: A scatterplot of Gene Significance (GS) for weight vs. Module Membership (MM) in the brown module. There is a highly significant correlation between GS and MM in this module.

2.5 总结网络分析结果并输出（非必要）

在前面的分析计算中，我们已经找到了和性状相关性最强的模块，以及模块中的核心基因，下面与基因注释合并，并输出为Excel。

> names(datExpr)
# 返回brown模块中所有ID
> names(datExpr)[moduleColors=="brown"]
# 输入注释文件
> annot = read.csv(file = "GeneAnnotation.csv");
> dim(annot)
> names(annot)
> probes = names(datExpr)
> probes2annot = match(probes, annot$substanceBXH)
# 总结没有注释到的基因
> sum(is.na(probes2annot))
# 创建数据集，包含探测ID ，
> geneInfo0 = data.frame(substanceBXH = probes, geneSymbol = annot$gene_symbol[probes2annot], LocusLinkID = annot$LocusLinkID[probes2annot], moduleColor = moduleColors, geneTraitSignificance, GSPvalue)
# 通过显著性对模块进行排序
> modOrder = order(-abs(cor(MEs, weight, use = "p")))
# 添加模块成员
> for (mod in 1:ncol(geneModuleMembership))
{
 oldNames = names(geneInfo0)
 geneInfo0 = data.frame(geneInfo0, geneModuleMembership[, 
 modOrder[mod]],
 MMPvalue[, modOrder[mod]]);
 names(geneInfo0) = c(oldNames, paste("MM.", 
 modNames[modOrder[mod]], sep=""),
 paste("p.MM.", modNames[modOrder[mod]], sep=""))
}
# 对基因进行排序
> geneOrder = order(geneInfo0$moduleColor, abs(geneInfo0$GS.weight))
> geneInfo = geneInfo0[geneOrder, ]
#写出
> write.csv(geneInfo, file = "geneInfo.csv")