本篇代码参考文章:
1.生信菜鸟团:一文学会WGCNA分析
2.WGCNA(加权基因共表达网络分析)
3.WGCNA分析,简单全面的最新教程
4.WGCNA实战练习
5.STEP6:WGCNA相关性分析
6.加权基因共表达网络分析(WGCNA)
7.WGCNA的输入矩阵到底是什么格式
在网上无意看到这个分析,感觉可以用来折腾一下,综合上面的几个教程的流程走了一遍,也算是有了个初步的了解。
加权基因共表达网络分析(weighted gene co-expression network analysis,WGCNA)
(1)WGCNA能做什么?
A: 用于探索基因网络的潜在模式,为调控通路的识别以及基因间的靶向关系提供了更多的见解。
(2)什么是加权?加权原理是什么?
A: 我也不知道什么是加权,可能给我讲了我也听不明白,原理就更不用说了,但是我这里有一张图,应该很好理解:
(一)导入数据
我用的是别人的数据,如果你没有数据只是想练习一下,可以使用上面第5篇或者第6篇里,都有不同的示例数据。
# 这里我用的是FPKM矩阵,对于应该用什么矩阵,见上面第7篇参考文章
> total_FPKM = read.csv("total_FPKM.csv",header = TRUE,sep = ",")
> rownames(total_FPKM) = total_FPKM[,1]
> total_FPKM = total_FPKM[,-1]
# 导入clinical traits矩阵,如果是实验RNA-seq,那么导入你的样本分组信息
> datTraits = read.csv("RNA_seq_datTrait.csv",header = TRUE,sep = ",")
> rownames(datTraits) = datTraits[,1]
> datTraits = datTraits[,-1]
> dim(total_FPKM)
[1] 23361 30
30个样本,有23361个基因,但是上面的参考文章都没有用所有的基因来做WGCNA,只需选择一些具有代表性的基因就够了,所以要先筛选一下.
(二)过滤基因
这里注意,你不能使用差异基因来进行WGCNA分析,具体原因:
# 筛选方法:mad>1且top5000
> datExpr = total_FPKM
> WGCNA_matrix = t(datExpr[order(apply(datExpr,1,mad),decreasing = T)[1:5000],])
> datExpr_filted <- WGCNA_matrix
看一下筛选后、转置后的表达矩阵:
(三)检查离群样品
如果你的是TCGA上下载的数据,几百个临床样品,这一步是必须的,可能会有离群值,需要把它们去掉。
> library(WGCNA)
> gsg = goodSamplesGenes(datExpr_filted, verbose = 3)
> gsg$allOK
> sampleTree = hclust(dist(datExpr_filted), method = "average")
> plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="")
> save(datExpr_filted, file = "datExpr_filted_hclust.RData")
这里我不需要去掉离群值,因为聚的挺好的,但是如果你的cluster长的像这样,就需要去掉了:(图片来自:WGCNA(加权基因共表达网络分析)):
去除离群值代码:
#离群值height大于80000
> clust = cutreeStatic(sampleTree, cutHeight = 80000, minSize = 10)
> table(clust)
> keepSamples = (clust==1)
> datExpr_remove = datExpr_filted[keepSamples, ]
(四)确定软阈值
# Constructing a weighted gene network entails the choice of the soft thresholding power to which coexpression similarity is raised to calculate adjacency.
# Set up a bunch of power gradients(设定一些列power梯度)
> powers = c(c(1:10,by = 1), seq(from = 12, to=20, by=2))
> sft = pickSoftThreshold(datExpr_filted, powerVector = powers, verbose = 5) #this step will take some time
# The "sft" object contains the network characteristics that calculated for each power value(在sft这个对象中保存了每个power值计算出来的网络的特???)
> str(sft)
# make figures to show the soft thresholding power
> par(mfrow = c(1,2))
> cex1 = 0.9
# x axis is the Soft threshold (power),y axis is evaluation parameters for scale-free networks(纵轴是无标度网络的评估参???)
# The higher R^2 ,the more scale-free the network is.(数值越高,网络越符合无标度特征 (non-scale))
# "FitIndices" stores the characteristics of each network corresponding to its power. (fitIndices保存了每个power对应的网络的特征)
> plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n",
main = paste("Scale independence"))
> text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],labels=powers,cex=cex1,col="red")
# R^2 value(h) usually around 0.9. But if there's some big changes between your samples, R^2 value will be lower.
# In my case, I use 0.7 to set the criteria.
> abline(h=0.9,col="red")
> plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
main = paste("Mean connectivity"))
> text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=cex1,col="red")
# System will recommend the best power value to you.
# But you can not totally trust it, depend on your data.
> sft$powerEstimate #I use power = 14
下面的左图中纵轴展示了无标度拓扑拟合指数R^2(scale free topology fitting index R2,即统计结果中的SFT.R.sq列数值),之所以有负值是因为又乘以了slope列数值的负方向,仅关注正值即可;右图纵轴展示了网络的平均连通度。
一般来说,取R^2值大于0.8或到达平台期时最小的值用于构建网络。你也可以用sft$powerEstimate让系统推荐一个值给你。
而在我的数据里,系统推荐的值是1,这显然不太对头。我看网上有个人跟我有同样的问题,参考:WGCNA power 选择 。有的人也会得到NA这个结果,对于这样的情况,可参考:WGCNA分析,简单全面的最新教程。这里我选的14。总之,这里的值的选择是比较重要的,特殊情况需特殊对待。sft$powerEstimate给出的值,你可以参考,但是如果出现1或者NA时,就要自己判断了。
(五)网络构建
# power: 上面得到的软阈值
# maxBlockSize: 计算机能处理的最大模块的基因数量 (默认5000)(4G内存电脑可处理8000-10000个,16G内存电脑可以处理2万个,32G内存电脑可以处理3万个)
# numericLabels: 返回数字而不是颜色作为模块的名字,后面可以再转换为颜色
# saveTOMs:最耗费时间的计算,存储起来,供后续使用
# mergeCutHeight: 合并模块的阈值,越大模块越少
> net = blockwiseModules(datExpr_filted, power = 14, maxBlockSize = 5000,
TOMType = "unsigned", minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "RNA_seq_datExpr_filted_TOM",
verbose = 3)
# You can check how many modules and how many genes in each module
# "0" represent "grey" module
> table(net$colors)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
40 1014 959 858 430 265 263 248 230 195 159 147 77 69 46
# Convert labels to colors for plotting
> mergedColors = labels2colors(net$colors)
# Plot the dendrogram and the module colors underneath
> plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],"Module colors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
绘画结果里灰色模块里的基因是一类无法划分到任何一个模块的基因集,如果你的灰色模块基因数太多,那你需要调整你的基因过滤的方法。或是表示样本中的基因共表达趋势不明显,不同特征的样本之间差异性不大,或者组内基因表达一致性比较差。这里我的结果看起来灰色模块好像非常少。
同一模块内的基因具有较高的共表达相似性,即存在较高的协同,暗示它们参与了相似的调节途径,或者在相似的细胞区域中发挥功能。参考:加权基因共表达网络分析(WGCNA)
(六)可视化基因网络
上面我们构建了网络,还可以将基因共表达相似度矩阵绘制成热图,和模块的聚类一起展示:
# shown in a heat map according to 1-TOM value (distance between genes)
# This step you could use 5000 genes to make heatmap figure, or you could use randomly some genes from each module
> geneTree = net$dendrograms[[1]]
> moduleColors = labels2colors(net$colors)
> dissTOM = 1 - TOMsimilarityFromExpr(
datExpr_filted,
power = 14)
plotTOM = dissTOM ^ 7
diag(plotTOM) = NA
# The more genes you choose, the longer time this step will take:
# Actually, this step take a long time. This is why we need to filt genes before WGCNA analysis.
#可视化这一步,你可以选择用5000个基因来画,但是很费时间,我这个破电脑大概10多分钟吧
#当然你也可以选择只画一部分基因,具体如何操作,请见:https://www.jianshu.com/p/3618f5ff3eb0
> TOMplot(
plotTOM,
geneTree,
moduleColors,
main = "Network heatmap plot, all genes"
)
如果有的同学跑出来像下面这样,红色底,黄色块:
不要急,可以用下面的代码把颜色转过来:
> library(gplots)
> myheatcol = colorpanel(250,'red',"orange",'lemonchiffon')
> TOMplot(plotTOM, geneTree, moduleColors, main = "Network heatmap plot, all genes", col=myheatcol)
然后就正常了。热图中,若基因间表达相似度越高,则颜色越深。可以看到,同一模块内的基因具有较高的共表达模式,而不同模块间则相差明显。图里的红色方块对应的模块就是共表达聚类比较好的模块,比如蓝色和天青色模块:
同样的:grey模块中的相关性应该很小,如果你的数据里相关性最显著的模块是grey模块,那肯定是有问题的。
保存你的结果:
> moduleLabels = net$colors
> moduleColors = labels2colors(net$colors)
> table(moduleColors) #查看所有模块里基因的数量
moduleColors
black blue brown cyan green greenyellow grey
248 959 858 46 265 147 40
magenta pink purple red salmon tan turquoise
195 230 159 263 69 77 1014
yellow
430
> MEs = net$MEs
> geneTree = net$dendrograms[[1]]
> save(MEs, moduleLabels, moduleColors, geneTree,
file = "RNA_seq_FPKM_filted_networkConstruction.RData")
想查看每一个模块里的基因吗?代码:
> unique(moduleColors)
[1] "brown" "turquoise" "pink" "green" "red"
[6] "yellow" "black" "purple" "blue" "grey"
[11] "salmon" "magenta" "cyan" "tan" "greenyellow"
> head(colnames(datExpr_filted)[moduleColors=="black"])
[1] "Ifitm3" "Aldoa" "Ecm1" "Nupr1" "Sat1" "Aplp2"
# 如果不想看前几个,想调出来全部的,可以这样:
> a=colnames(datExpr_filted)[moduleColors=="black"]
> View(a)
(七)选一个你感兴趣的模块,看看基因的表达情况
# You could extract the FPKM of the genes in the intested module, make heatmap of gene expression
> which.module="black"
> plotMat(t(scale(datExpr_filted[,moduleColors==which.module ]) ),
nrgcols=30,
rlabels=F,
rcols=which.module,
main=which.module,
cex.main=2
)
# You can also displaying module heatmap and the eigengene
> sizeGrWindow(8,7)
> which.module="black"
> ME=MEs[,paste("ME",which.module, sep="")]
> par(mfrow=c(2,1), mar=c(0.3, 5.5, 3, 2))
> plotMat(t(scale(datExpr_filted[,moduleColors==which.module ]) ),
nrgcols=30,rlabels=F,rcols=which.module,
main=which.module, cex.main=2)
> par(mar=c(5, 4.2, 0, 0.7))
> barplot(ME, col=which.module, main="", cex.main=2,
ylab="eigengene expression",xlab="samples")
图里的热图就是你得到的黑色模块里所有的基因,在你所有的样品里的表达情况:
(八)模块之间的相关性
> MEs = net$MEs
> MEs_col = MEs
> library(stringr)
> colnames(MEs_col) = paste0("ME", labels2colors(as.numeric(str_replace_all(colnames(MEs),"ME",""))))
> MEs_col = orderMEs(MEs_col)
# The correlation map of each module was obtained by clustering according to the gene expression
> plotEigengeneNetworks(MEs_col, "Eigengene adjacency heatmap",
marDendro = c(3,3,2,4),
marHeatmap = c(3,4,2,2), plotDendrograms = T,
xLabelsAngle = 90)
(九)模块与性状/实验条件/临床特征的相关性
对于我来说,这一步是最关键的,我想知道我的实验处理条件和哪一个或哪几个模块最相关:
#先处理一下样品属性的matrix,或者你有临床属性也可以
> datTraits <- as.data.frame(datTraits)
> colnames(datTraits) = "condition"
> design = model.matrix(~0+ datTraits$condition)
> c = as.factor(datTraits$condition)
> levels(c)
> colnames(design) = levels(c)
> moduleColors <- labels2colors(net$colors)
> MEs0 = moduleEigengenes(datExpr_filted, moduleColors)$eigengenes
> MEs = orderMEs(MEs0)
> moduleTraitCor = cor(MEs, design , use = "p")
> nSamples = nrow(datExpr_filted)
> moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
> textMatrix = paste(signif(moduleTraitCor, 2), "\n(",
signif(moduleTraitPvalue, 1), ")", sep = "")
> dim(textMatrix) = dim(moduleTraitCor)
# Display the correlation values within a heatmap plot (correlation between module and conditions)
> labeledHeatmap(Matrix = moduleTraitCor,
xLabels = colnames(design),
yLabels = names(MEs),
ySymbols = names(MEs),
colorLabels = FALSE,
colors = blueWhiteRed(50),
textMatrix = textMatrix,
setStdMargins = FALSE,
cex.text = 0.9,
zlim = c(-1,1),
main = paste("Module-trait relationships"))
下图里,每一行代表不同的基因模块,每一列代表不同的实验条件/临床表型,每一个格子里的第一行数值代表了相关系数,通过红色和绿色区分正负相关,第二行括号里值为显著性p值。
有了这个图,就可以挑选自己感兴趣的模块进行下游分析了。
你可以指定一个感兴趣的表型/condition,可以得到与这个形状相关性最高的模块:
# You can get the most relevant module by specifying a condition of interest
> which.trait <- "some condition/trait"
> moduleTraitCor[, which.trait]
MEbrown MEcyan MEgreenyellow MEpink MEred MEblue
-0.3944402 -0.3860880 -0.2461779 -0.4056089 -0.5912242 -0.4908400
MEyellow MEtan MEturquoise MEsalmon MEblack MEmagenta
-0.2084418 0.3295024 0.7474432 0.2530541 0.4889765 0.4670376
MEgreen MEpurple MEgrey
0.2920325 -0.1241892 0.3292240
(十)对某一个性状/条件/临床特征相对应的某个模块进行具体分析
# 对p/brown(条件/模块)具体分析,模块内基因与表型数据关联:
# 性状跟模块虽然求出了相关性,可以挑选最相关的那些模块来分析
# 但是模块本身仍然包含非常多的基因,还需进一步的寻找最重要的基因
# 所有的模块都可以跟基因算出相关系数,所有的连续型性状也可以跟基因的表达值算出相关系数
# 如果跟性状显著相关基因也跟某个模块显著相关,那么这些基因可能就非常重要
# Firstly, calculate the correlation matrix between module and genes.(首先计算模块与基因的相关性矩)
# names (colors) of the modules
> modNames = substring(names(MEs), 3)
> geneModuleMembership = as.data.frame(cor(datExpr_filted, MEs, use = "p"))
# Calculate the Pearson correlation coefficient matrix of each module and its genes(算出每个模块跟基因的皮尔森相关系数矩)
# MEs is the value of each module in each sample.(MEs是每个模块在每个样本里面的)
> MMPvalue = as.data.frame(corPvalueStudent(as.matrix(geneModuleMembership), nSamples))
> names(geneModuleMembership) = paste("MM", modNames, sep="")
> names(MMPvalue) = paste("p.MM", modNames, sep="")
# Secondly, calculate the correlation matrix between conditions and genes. (再计算性状与基因的相关性矩)
# Only continuous properties can be computed. If the variables are discrete, the matrix is converted to 0-1 when the sample table is constructed.(只有连续型性状才能只有计算,如果是离散变量,在构建样品表时就转为0-1矩阵)
# Here, the variable whether or not it belongs to the P condition is numeralized with 0 and 1.(这里把是否属于 P 实验条件这个变量进行数值化,0和1表示)
> P = as.data.frame(design[,1]) # choose an interested condition!!
> names(P) = "proliferating"
> geneTraitSignificance = as.data.frame(cor(datExpr_filted, P, use = "p"))
> GSPvalue = as.data.frame(corPvalueStudent(as.matrix(geneTraitSignificance), nSamples))
> names(geneTraitSignificance) = paste("GS.", names(P), sep="")
> names(GSPvalue) = paste("p.GS.", names(P), sep="")
# Then, combine aboved two correlation matrixes(最后把两个相关性矩阵联合起来,指定感兴趣模块进行分析)
> module = "brown" # choose interested module
> column = match(module, modNames)
# get the genes in the interested module(获取模块内的基因)
> moduleGenes = moduleColors==module
> sizeGrWindow(7, 7)
> par(mfrow = c(1,1))
# Genes that are highly correlated with conditions are also highly associated with modules (与性状高度相关的基因,也是与性状相关的模型的关键基因)
> verboseScatterplot(abs(geneModuleMembership[moduleGenes, column]),
abs(geneTraitSignificance[moduleGenes, 1]),
xlab = paste("Module Membership in", module, "module"),
ylab = "Gene significance for proliferating",
main = paste("Module membership vs. gene significance\n"),
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2, col = module)
下图可以看到这些基因不仅仅是跟其对应的模块高度相关,而且是跟其对应的性状高度相关,进一步说明了基因值得深度探究:
在前面的步骤里,你已经可以调出任何一个模块里的基因,如果你想在上面的点图里标记出特定的基因,请参考:如何在散点图里添加特定标签
(十一)导出网络到Cytoscape
可以导出指定module对应的基因共表达网络:
# Recalculate topological overlap
> TOM = TOMsimilarityFromExpr(datExpr_filted, power = 14)
# Select modules
> module = "turquoise"
# Select module probes
> probes = colnames(datExpr_filted) #基因名
> inModule = (moduleColors == module)
> modProbes = probes[inModule]
# modGenes = annot$gene_symbol[match(modProbes, annot$substanceBXH)];
# Select the corresponding Topological Overlap
> modTOM = TOM[inModule, inModule]
> dimnames(modTOM) = list(modProbes, modProbes)
#export to VisANT
> vis = exportNetworkToVisANT(modTOM,
file = paste("VisANTInput-", module, ".txt", sep=""),
weighted = TRUE,
threshold = 0)
# Export the network into edge and node list files Cytoscape can read
> cyt = exportNetworkToCytoscape(modTOM,
edgeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-edges-", paste(module, collapse="-"), ".txt", sep=""),
nodeFile = paste("AS-green-FPKM-One-step-CytoscapeInput-nodes-", paste(module, collapse="-"), ".txt", sep=""),
weighted = TRUE,
threshold = 0.02,
nodeNames = modProbes,
#altNodeNames = modGenes,
nodeAttr = moduleColors[inModule])
这一步你会得到两个文件,包含edges和nodes的信息。其中edges信息很大:
这里对如何可视化这一部分先不做介绍。因为这一部分对我来说并不是重点。有兴趣的童鞋可以自己搜索一下。
(十二)筛选hub基因
什么是hub基因?
在一个模块中,连通性( k值)排名靠前的基因。 K值排名靠前本身已经表明它们处于中心枢纽的位置。如下图中的A基因:
那什么又是连通性呢:
#(1)Calculate Intramodular connectivity
> moduleColors <- labels2colors(net$colors)
> connet=abs(cor(datExpr_filted,use="p"))^6
> Alldegrees1=intramodularConnectivity(connet, moduleColors)
> head(Alldegrees1)
# (2) calculate relationship between gene significance and intramodular connectivity
> which.module="brown"
> P= as.data.frame(design[,1]) # change specific
> names(P) = "Proliferating"
> GS1 = as.numeric(cor(P,datExpr_filted, use="p"))
> GeneSignificance=abs(GS1)
> colorlevels=unique(moduleColors)
> sizeGrWindow(9,6)
> par(mfrow=c(2,as.integer(0.5+length(colorlevels)/2)))
> par(mar = c(4,5,3,1))
> for (i in c(1:length(colorlevels)))
{
whichmodule=colorlevels[[i]]
restrict1 = (moduleColors == whichmodule)
verboseScatterplot(Alldegrees1$kWithin[restrict1],
GeneSignificance[restrict1], col=moduleColors[restrict1],
main=whichmodule,
xlab = "Connectivity", ylab = "Gene Significance", abline = TRUE)
}
下图里是每一个模块在你选择的性状/条件/临床特征下,基因与模块的连通性(x 轴)和模块里的基因与该性状/条件/临床特征的相关性(y轴):
下面的hub基因筛选比较重要:
这里注意筛选标准:abs(GS1)>0.9 可以根据实际情况调整; abs(datKME$MM.black)>0.8 (至少大于 >0.8)
#(3)Calculate the connectivity of all genes in the module.Screen the hub genes.
# abs(GS1)> 0.9 (could be adjusted depend on your data)
# abs(datKME$MM.black)>0.8 (at least more than 0.8)
# Generalizing intramodular connectivity for all genes on the array
> datKME=signedKME(datExpr_filted, MEs, outputColumnName="MM.")
> head(datKME)
> write.csv(datKME,file = "datKME.csv")
# Finding genes with high gene significance and high intramodular connectivity in interesting modules
> FilterGenes= abs(GS1)> 0.8 & abs(datKME$MM.brown) > 0.8
> table(FilterGenes)
> x = datKME[(GS1>0.8 & datKME$MM.brown >0.8),]
> View(x)
> write.csv(x, file="module_brown_hub_gene_GS0.8.csv")