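The ConsensusClusterPlus function in the ConsensusClusterPlus package determines the number of clusters and class membership from stability evidence. The calcICL function computes cluster-consensus and item-consensus.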
ConsensusClusterPlus(d=NULL, maxK=3, reps=10, pItem=0.8, pFeature=1, clusterAlg="hc", title="untitled_consensus_cluster", innerLinkage="average", finalLinkage="average", distance="pearson", ml=NULL, tmyPal=NULL, seed=NULL, plot=NULL, writeTable=FALSE, weightsItem=NULL, weightsFeature=NULL, verbose=F, corUse="everything")
calcICL(res, title="untitled_consensus_cluster", plot=NULL, writeTable=FALSE)
d | data to be clustered: a data matrix with items/samples in columns and features in rows (for example, a gene expression matrix with genes in rows and microarrays in columns), an ExpressionSet object, or a distance object (only for cases of no feature resampling).
maxK | integer value. maximum cluster number to evaluate.
reps | integer value. number of subsamples.
pItem | numerical value. proportion of items to sample.
pFeature | numerical value. proportion of features to sample.
clusterAlg | character value. cluster algorithm. 'hc' for hierarchical (hclust), 'pam' for partitioning around medoids, 'km' for k-means upon data matrix, or a function that returns a clustering. See example and vignette for more details.
title | character value for output directory. Directory is created only if plot is not NULL or writeTable is TRUE. This title can be an absolute or relative path.
innerLinkage | hierarchical linkage method for subsampling.
finalLinkage | hierarchical linkage method for consensus matrix.
distance | character value. 'pearson' (1 - Pearson correlation), 'spearman' (1 - Spearman correlation), 'euclidean', 'binary', 'maximum', 'canberra', 'minkowski', or a custom distance function.
ml | optional. prior result; if supplied, only graphics and tables are produced.
tmyPal | optional character vector of colors for consensus matrix.
seed | optional numerical value. sets random seed for reproducible results.
plot | character value. NULL - print to screen, 'pdf', 'png', 'pngBMP' for bitmap png, helpful for large datasets.
writeTable | logical value. TRUE - write output and log to csv.
weightsItem | optional numerical vector. weights to be used for sampling items.
weightsFeature | optional numerical vector. weights to be used for sampling features.
res | result of ConsensusClusterPlus.
verbose | boolean. If TRUE, print messages to the screen to indicate progress. This is useful for large datasets.
corUse | optional character value. specifies how to handle missing data in correlation distances: 'everything', 'pairwise.complete.obs', 'complete.obs'. See cor() for description.
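The clusterAlg and distance arguments both accept a custom function. The sketch below follows the pattern shown in the package vignette, where the functions are defined by the user and passed by name as character strings. The names `manhattanDist` and `wardHook` and the toy matrix are hypothetical, and the assumption that the distance function receives items in rows should be checked against the vignette before relying on it.

```r
# Toy matrix: 20 features (rows) x 12 items (columns); values are invented.
set.seed(1)
toy <- matrix(rnorm(20 * 12), nrow = 20,
              dimnames = list(paste0("f", 1:20), paste0("s", 1:12)))

# Hypothetical custom distance: Manhattan distance between the rows of x.
manhattanDist <- function(x) dist(x, method = "manhattan")

# Hypothetical custom clustering hook: takes a dist object and k,
# and returns one integer cluster label per item.
wardHook <- function(this_dist, k) {
  cutree(hclust(this_dist, method = "ward.D2"), k)
}

# Check the two hooks on their own (items in rows, hence t()).
toy_dist   <- manhattanDist(t(toy))
toy_labels <- wardHook(toy_dist, k = 3)
table(toy_labels)

# With the package loaded, the hooks would be passed by name, for example:
# res <- ConsensusClusterPlus(toy, maxK = 3, reps = 10,
#                             clusterAlg = "wardHook", distance = "manhattanDist")
```

Testing the hooks on a small matrix first makes it easier to separate errors in the custom functions from errors in the resampling run.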
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("ConsensusClusterPlus")
### 1. Prepare the data
## Rows are features, columns are samples
library(ALL)
data(ALL)
d=exprs(ALL)
d[1:5,1:5]
# Keep the 5000 probes with the largest median absolute deviation (MAD)
mads=apply(d,1,mad)
d=d[rev(order(mads))[1:5000],]
# order(mads): indices that sort mads in ascending order
# rev(order(mads)): the same indices reversed, i.e. descending order
d = sweep(d,1, apply(d,1,median,na.rm=TRUE))
# sweep: Return an array obtained from an input array
# by sweeping out a summary statistic.
# Here each row has its own median subtracted from it,
# e.g. the first row becomes d[1,]-median(d[1,])
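The two preprocessing steps can be checked on a small matrix before running them on the full expression set. This is a minimal sketch on invented data; the probe and sample names are hypothetical, and `order(..., decreasing = TRUE)` is used in place of `rev(order(...))`, which gives the same ranking except possibly in how ties are ordered.

```r
set.seed(42)
# Toy matrix: 6 probes (rows) x 5 samples (columns); values are invented.
toy <- matrix(rnorm(6 * 5, mean = 10), nrow = 6,
              dimnames = list(paste0("probe", 1:6), paste0("s", 1:5)))
toy["probe3", ] <- c(1, 20, 5, 15, 10)  # make one probe clearly the most variable

# Rank probes by median absolute deviation and keep the top 3.
mads <- apply(toy, 1, mad)
top  <- toy[order(mads, decreasing = TRUE)[1:3], ]

# Median-center each retained row.
centered <- sweep(top, 1, apply(top, 1, median, na.rm = TRUE))

rownames(top)               # retained probes, most variable first
apply(centered, 1, median)  # row medians after centering
```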
### 2. Run consensus clustering
library(ConsensusClusterPlus)
output_dir="/Users/zhengxueming/test/test0705"
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
title=output_dir,clusterAlg="hc",distance="pearson",
seed=1213,plot="png")
# str(results)
# str(results[[2]])
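## The cluster plots for each K and the cluster-evaluation plots are written to output_dir
# Use the consensus CDF and Delta area plots to choose the best k: starting from K=2,
# compute the relative change in the area under the CDF curve between K and K-1,
# and pick the K beyond which the area no longer increases appreciably.
# tracking plot: columns are samples and rows are the K values; colors show the
# cluster each sample is assigned to at each K.
# It is used to qualitatively spot unstable clusters and unstable samples.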
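To make the CDF and Delta area reasoning concrete, the sketch below builds consensus matrices by hand in base R and computes the area under the consensus CDF and its relative change from K-1 to K. It is an illustration of the idea, not the package's internal code: the helper names `consensus_matrix` and `cdf_area` are hypothetical, the toy data has items in rows, and Euclidean distance with average linkage is assumed for simplicity.

```r
set.seed(7)
# Toy data: 3 well-separated groups of 10 items, 2 features; items in rows.
x <- rbind(matrix(rnorm(20, 0, 0.3), ncol = 2),
           matrix(rnorm(20, 3, 0.3), ncol = 2),
           matrix(rnorm(20, 6, 0.3), ncol = 2))

# Consensus matrix for one k: the fraction of subsamples in which two items,
# when both are drawn, land in the same cluster.
consensus_matrix <- function(x, k, reps = 50, p_item = 0.8) {
  n <- nrow(x)
  together <- matrix(0, n, n)  # times a pair was co-clustered
  sampled  <- matrix(0, n, n)  # times a pair was co-sampled
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(p_item * n)))
    cl  <- cutree(hclust(dist(x[idx, ]), method = "average"), k)
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cm <- together / pmax(sampled, 1)
  diag(cm) <- 1
  cm
}

# Area under the empirical CDF of the pairwise consensus values.
cdf_area <- function(cm, n_bins = 100) {
  v    <- cm[lower.tri(cm)]
  grid <- seq(0, 1, length.out = n_bins + 1)
  sum(ecdf(v)(grid[-1]) * diff(grid))
}

ks    <- 2:5
areas <- sapply(ks, function(k) cdf_area(consensus_matrix(x, k)))
# Delta area: the area itself at k = 2, then the relative change versus k - 1.
delta <- c(areas[1], diff(areas) / head(areas, -1))
data.frame(k = ks, area = round(areas, 3), delta = round(delta, 3))
```

On real data the same table can be read alongside the Delta area plot: a K after which `delta` stays close to zero is a candidate for the number of clusters.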
# the top ten rows and columns of results for k=2:
results[[2]][["consensusMatrix"]][1:10,1:10]
# View the colors assigned to each cluster
results[[6]][["clrs"]]
#consensusTree - hclust object
results[[2]][["consensusTree"]]
### 3. Compute cluster-consensus and item-consensus
# calculating cluster-consensus and item-consensus.
icl = calcICL(results,title=output_dir,plot="png")
# png files whose names start with icl are written to output_dir
# icl is a list containing "clusterConsensus" and "itemConsensus"
icl[["clusterConsensus"]]
icl[["itemConsensus"]][1:5,]
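One way to use the item-consensus table is to flag samples whose best cluster agreement is low at the chosen K. The sketch below works on a toy stand-in with the columns k, cluster, item and itemConsensus, which is how the table is printed in the package vignette; the values are invented, and the 0.8 cut-off is an arbitrary assumption rather than a recommendation from the package.

```r
# Toy stand-in with the same columns as icl[["itemConsensus"]] (values invented).
item_cons <- data.frame(
  k = 3,
  cluster = rep(1:3, each = 4),
  item = rep(c("s1", "s2", "s3", "s4"), times = 3),
  itemConsensus = c(0.95, 0.10, 0.05, 0.45,
                    0.03, 0.88, 0.02, 0.40,
                    0.02, 0.02, 0.93, 0.15)
)

chosen_k <- 3
sub <- item_cons[item_cons$k == chosen_k, ]

# For each item, keep the row of the cluster it agrees with most.
best <- do.call(rbind, lapply(split(sub, sub$item), function(df) {
  df[which.max(df$itemConsensus), ]
}))

# Hypothetical cut-off: items below it are treated as unstable.
unstable <- as.character(best$item[best$itemConsensus < 0.8])
best
unstable
```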
### 4. Choose a suitable K and build a data frame of each sample's cluster assignment
sample_cluster <- results[[5]]$consensusClass
sample_cluster_df <- data.frame(sample = names(sample_cluster),
cluster = sample_cluster)
head(sample_cluster_df)
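When the choice between two K values is not obvious, it can help to put the assignments for several K side by side. The sketch below uses a toy stand-in for `results` in which element k holds a named `consensusClass` vector, as in the code above; the sample names and labels are invented, and with real output the same `sapply` call would index `results` instead of `toy_results`.

```r
# Toy stand-in for `results`: element k holds a named consensusClass vector.
samples <- paste0("s", 1:6)
toy_results <- vector("list", 4)
toy_results[[2]] <- list(consensusClass = setNames(c(1, 1, 1, 2, 2, 2), samples))
toy_results[[3]] <- list(consensusClass = setNames(c(1, 1, 2, 3, 3, 3), samples))
toy_results[[4]] <- list(consensusClass = setNames(c(1, 1, 2, 3, 3, 4), samples))

# One column of cluster labels per K, one row per sample.
ks <- 2:4
membership <- sapply(ks, function(k) toy_results[[k]]$consensusClass)
colnames(membership) <- paste0("k", ks)
membership_df <- data.frame(sample = rownames(membership), membership,
                            row.names = NULL)

# Cross-tabulate two candidate K values to see which clusters split.
k2_vs_k3 <- table(k2 = membership_df$k2, k3 = membership_df$k3)
k2_vs_k3
```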
https://www.bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf