WGCNA（2a）：一步法完成网络构建和模块检测

在上一篇帖子上完成了表达量数据和表型数据的整理，这篇文章将进行网络构建和模块检测。

WGCNA（1）：R包安装及数据导入清洗 - (jianshu.com)

参考材料还是官方说明书：
https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-auto.pdf

1. 准备工作

> setwd("D:/RNA-seq/WGCNA/fpkm")
> library(WGCNA)
> options(stringsAsFactors = FALSE)
> enableWGCNAThreads()
Allowing parallel execution with up to 3 working processes.
#载入第一步中的表达量和表型值
> lnames = load(file = "WGCNA-dataInput.RData")
> lnames
[1] "datExpr"  "datTraits"

2. 构建基因网络和模块识别

这一步是使用WGCNA进行所有网络分析的基础，共有三种方法：
a. 便捷的一步法网络构建和模块检测功能，
b. 分步法；
c. 自动按区块划分的网络构建和模块检测方法，适用于大数据集
通常我们使用的是一步法，所以这里只展示一步法的构建过程，需要其他方法的话可以去官方查看说明书。

2.1 确定合适的软阈值：网络拓扑分析

通过函数pickSoftThreshold执行网络拓扑分析，并帮助用户选择合适的软阈值。

# 设置网络构建参数选择范围
> powers = c(c(1:10), seq(from = 12, to=20, by=2)) 
> powers
 [1]  1  2  3  4  5  6  7  8  9 10 12 14 16 18 20
# 计算无尺度分布拓扑矩阵
> sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)

软阈值计算结果可视化

> sizeGrWindow(9, 5)
#一页多图，一行2列
> par(mfrow = c(1,2))
#字号
> cex1 = 0.9
# Scale-free topology fit index as a function of the soft-thresholding power（左图）
> plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
       main = paste("Scale independence"))
> text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       labels=powers,cex=cex1,col="red") 
#查看最佳软阈值
> sft$powerEstimate
[1] 6 
# this line corresponds to using an R^2 cut-off of h
> abline(h=0.90,col="red")
# Mean connectivity as a function of the soft-thresholding power（右图）
> plot(sft$fitIndices[,1], sft$fitIndices[,5], xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n", main = paste("Mean connectivity"))
> text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")

Figure 1: Analysis of network topology for various soft-thresholding powers. The left panel shows the scale-free fit index (y-axis) as a function of the soft-thresholding power (x-axis). The right panel displays the mean connectivity (degree, y-axis) as a function of the soft-thresholding power (x-axis).

2.2 一步法网络构建和模块识别

所有的核心就在这一步，把输入的表达矩阵的几千个基因组归类成了几十个模块。大体思路：计算基因间的邻接性，根据邻接性计算基因间的相似性，然后推出基因间的相异性系数，并据此得到基因间的系统聚类树。然后按照混合动态剪切树的标准。

>  net = blockwiseModules(
                 datExpr,
                 power = sft$powerEstimate,
                 maxBlockSize = 6000,
                 TOMType = "unsigned", minModuleSize = 100,
                 reassignThreshold = 0, mergeCutHeight = 0.25,
                 numericLabels = TRUE, pamRespectsDendro = FALSE,
                 saveTOMs = F, 
                 verbose = 3
)
 Calculating module eigengenes block-wise from all genes
   Flagging genes and samples with too many missing values...
    ..step 1
 ....pre-clustering genes to determine blocks..
   Projective K-means:
   ..k-means clustering..
错误: 无法分配大小为638.4 Mb的矢量

报错，无法分配大小为638.4 Mb的矢量。调整了给R分配的内存，还是不行，大概Windows的内存不够用了，所以又转战Linux下的R。

$ R
> library(WGCNA)
#开始设置按最大线程数39计算，报错后改为15，就没有问题了。这个要看服务器的具体情况决定
> enableWGCNAThreads(15) 
Allowing parallel execution with up to 15 working processes.
> net = blockwiseModules(
                 datExpr,
                 power = sft$powerEstimate,
                 maxBlockSize = 6000,
                 TOMType = "unsigned", minModuleSize = 100,
                 reassignThreshold = 0, mergeCutHeight = 0.25,
                 numericLabels = TRUE, pamRespectsDendro = FALSE,
                 saveTOMs = F, 
                 verbose = 3
)
> table(net$colors)

这时查看结果有70多个模块，实在是太多了，查了一些资料module数目太多如何处理，有三个参数可以调整：
● deepSplit 参数调整划分模块的敏感度，值越大，越敏感，得到的模块就越多，默认是2；
● minModuleSize 参数设置最小模块的基因数，值越小，小的模块就会被保留下来；
● mergeCutHeight 设置合并相似性模块的距离，值越小，就越不容易被合并，保留下来的模块就越多。
参考资料： https://zhuanlan.zhihu.com/p/34697561?from_voters_page=true

调整参数如下：

> net = blockwiseModules(
                 datExpr,
                 power = sft$powerEstimate,
                 maxBlockSize = 6000,
                 TOMType = "unsigned", minModuleSize = 200, 
                 deepSplit = 1,  reassignThreshold = 0, 
                 mergeCutHeight = 0.3,
                 numericLabels = TRUE, pamRespectsDendro = FALSE,
                 saveTOMs = F, 
                 verbose = 3
)
> table(net$colors)
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
8965 7832 5337 5138 3699 3576 3424 3038 2532 2315 2143 2124 2118 1987 1716 1537
  16   17   18   19   20   21   22   23   24   25   26   27   28   29
1479 1372 1338 1231  940  936  904  768  746  548  518  428  403  283

共有29个模块，模块大小从7832个基因到283个基因，其中模块0是不属于任何一个模块的基因。

2.3 模块可视化

# open a graphics window
> sizeGrWindow(12, 9)
# Convert labels to colors for plotting
> mergedColors = labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
> plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]], "Module colors", dendroLabels = FALSE, hang = 0.03, addGuide = TRUE, guideHang = 0.05)

Figure 2: Clustering dendrogram of genes, with dissimilarity based on topological overlap, together with assigned module colors.

2.4 保存模块信息

> moduleLabels = net$colors
> moduleColors = labels2colors(net$colors)
> MEs = net$MEs;
> geneTree = net$dendrograms[[1]];
> save(MEs, moduleLabels, moduleColors, geneTree, file = "networkConstruction-auto.RData")