Cluster analysis with the ConsensusClusterPlus package

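The ConsensusClusterPlus function in the ConsensusClusterPlus package determines the number of clusters and the class membership of each item from stability evidence. The calcICL function computes cluster-consensus and item-consensus.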

Usage

ConsensusClusterPlus(
d=NULL, maxK = 3, reps=10, pItem=0.8, pFeature=1, clusterAlg="hc",title="untitled_consensus_cluster",
innerLinkage="average", finalLinkage="average", distance="pearson", ml=NULL,
tmyPal=NULL,seed=NULL,plot=NULL,writeTable=FALSE,weightsItem=NULL,weightsFeature=NULL,verbose=F,corUse="everything")

calcICL(res,title="untitled_consensus_cluster",plot=NULL,writeTable=FALSE)

Arguments

d

data to be clustered; either a data matrix where columns=items/samples and rows are features. For example, a gene expression matrix of genes in rows and microarrays in columns, or ExpressionSet object, or a distance object (only for cases of no feature resampling)

maxK

integer value. maximum cluster number to evaluate.

reps

integer value. number of subsamples.

pItem

numerical value. proportion of items to sample.

pFeature

numerical value. proportion of features to sample.

clusterAlg

character value. cluster algorithm. 'hc' for hierarchical (hclust), 'pam' for partitioning around medoids, 'km' for k-means upon data matrix, or a function that returns a clustering. See example and vignette for more details.

title

character value for output directory. Directory is created only if plot is not NULL or writeTable is TRUE. This title can be an absolute or relative path.

innerLinkage

hierarchical linkage method for subsampling.

finalLinkage

hierarchical linkage method for consensus matrix.

distance

character value. 'pearson': (1 - Pearson correlation), 'spearman': (1 - Spearman correlation), 'euclidean', 'binary', 'maximum', 'canberra', 'minkowski', or a custom distance function.

ml

optional. prior result, if supplied then only do graphics and tables.

tmyPal

optional character vector of colors for consensus matrix

seed

optional numerical value. sets random seed for reproducible results.

plot

character value. NULL - print to screen, 'pdf', 'png', 'pngBMP' for bitmap png, helpful for large datasets.

writeTable

logical value. TRUE - write output and log to csv.

weightsItem

optional numerical vector. weights to be used for sampling items.

weightsFeature

optional numerical vector. weights to be used for sampling features.

res

result of ConsensusClusterPlus.

verbose

boolean. If TRUE, print messages to the screen to indicate progress. This is useful for large datasets.

corUse

optional character value. specifies how to handle missing data in correlation distances: 'everything', 'pairwise.complete.obs', or 'complete.obs'; see cor() for description.
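
The clusterAlg argument above also accepts a function that returns a clustering. The following is a minimal sketch of such a hook using base R only. The name wardHook and the toy data are made up for illustration, and the argument convention (a distance object plus k, with the function name passed as a string) is assumed from the pattern shown in the package vignette, so check it against your installed version.

```r
# Hypothetical hook (name chosen for illustration): Ward-linkage hierarchical
# clustering on a distance object, cut into k groups.
wardHook <- function(this_dist, k) {
  tree <- hclust(this_dist, method = "ward.D2")
  cutree(tree, k)  # integer vector of cluster labels, one per item
}

# Quick check on toy data: 10 items stored as rows, because dist() works on rows
set.seed(1)
toy <- rbind(matrix(rnorm(25, mean = 0), nrow = 5),
             matrix(rnorm(25, mean = 6), nrow = 5))
assignment <- wardHook(dist(toy), k = 2)
print(assignment)

# Assumed usage (not run here), with the function name passed as a string:
# results <- ConsensusClusterPlus(d, maxK = 4, reps = 50,
#                                 clusterAlg = "wardHook", distance = "pearson")
```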

# if (!require("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# 
# BiocManager::install("ConsensusClusterPlus")

### 1. Prepare the data
## Rows are features, columns are samples
library(ALL)
data(ALL)
d=exprs(ALL)
d[1:5,1:5]

# Keep the 5000 probes with the largest median absolute deviation (MAD)
mads=apply(d,1,mad)
d=d[rev(order(mads))[1:5000],]
# order(mads): sorts in ascending order and returns the indices
# rev(order(mads)): the same indices in descending order

d = sweep(d,1, apply(d,1,median,na.rm=T))
# sweep: returns an array obtained from an input array
# by sweeping out a summary statistic.
# Here each row of the input has its own median subtracted from it.
# For example, the first row becomes d[1,]-median(d[1,])
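
The filtering and centering steps above can be checked on a small matrix before running them on real expression data. This is a sketch on made-up toy values (the matrix, its size, and the row standard deviations are all assumptions for illustration). It uses order(..., decreasing = TRUE), which selects the same rows as rev(order(...)) when there are no ties.

```r
set.seed(42)
# Toy matrix: 6 features (rows) x 10 samples (columns); the row-specific
# standard deviations are arbitrary values chosen for illustration
toy <- matrix(rnorm(60, sd = rep(c(0.1, 2, 0.5, 3, 0.2, 1), times = 10)),
              nrow = 6)

# MAD of each row, then keep the 3 most variable rows
toy_mads <- apply(toy, 1, mad)
top <- order(toy_mads, decreasing = TRUE)[1:3]
toy_top <- toy[top, ]

# Subtract each row's median from that row
toy_centered <- sweep(toy_top, 1, apply(toy_top, 1, median))

# After centering, every row median should be zero (up to rounding error)
print(round(apply(toy_centered, 1, median), 10))
```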

### 2. Run consensus clustering
library(ConsensusClusterPlus)
output_dir="/Users/zhengxueming/test/test0705"
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
                               title=output_dir,clusterAlg="hc",distance="pearson",
                               seed=1213,plot="png")
# str(results)
# str(results[[2]])

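## Cluster plots for each K and the cluster evaluation plots are written to output_dir
# Use the consensus CDF and Delta area plots to choose the best K: starting from K=2,
# compute the relative change in area under the CDF curve between K and K-1,
# and pick the K at which the increase is no longer appreciable
# tracking plot: items (samples) are columns and each K is a row; colors show the
# cluster of every sample at every K, which helps to qualitatively spot unstable
# clusters and unstable samples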
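
To make the Delta area idea concrete, here is a rough base-R sketch that builds consensus matrices by subsampling and then computes the area under each consensus CDF and its relative change. It is an approximation for illustration, not the package's own code: the helper names consensus_matrix and cdf_area, the toy data, the Euclidean distance, and the trapezoid-rule area are all assumptions.

```r
set.seed(7)
# Toy data: 20 features (rows) x 30 samples (columns), three shifted groups
x <- cbind(matrix(rnorm(200, mean = 0), nrow = 20),
           matrix(rnorm(200, mean = 4), nrow = 20),
           matrix(rnorm(200, mean = 8), nrow = 20))

# Consensus matrix for one k: fraction of subsamples in which two samples
# co-cluster, among the subsamples that contain both of them
consensus_matrix <- function(x, k, reps = 30, p_item = 0.8) {
  n <- ncol(x)
  together <- matrix(0, n, n)  # times i and j fell in the same cluster
  sampled  <- matrix(0, n, n)  # times i and j were drawn together
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(p_item * n)))
    # Euclidean distance between samples (an assumption; the script above uses Pearson)
    cl <- cutree(hclust(dist(t(x[, idx])), method = "average"), k)
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / pmax(sampled, 1)
}

# Area under the empirical CDF of the pairwise consensus values
cdf_area <- function(cm) {
  v <- cm[lower.tri(cm)]
  grid <- seq(0, 1, length.out = 101)
  cdf <- ecdf(v)(grid)
  sum(diff(grid) * (head(cdf, -1) + tail(cdf, -1)) / 2)  # trapezoid rule
}

ks <- 2:5
areas <- sapply(ks, function(k) cdf_area(consensus_matrix(x, k)))
# Delta area: the area itself for the first k, then the relative change
delta <- c(areas[1], diff(areas) / head(areas, -1))
print(data.frame(k = ks, area = round(areas, 3), delta = round(delta, 3)))
```

A K after which delta stays close to zero is a candidate for the number of clusters, which is the same reading applied to the Delta area plot produced by the package.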

# the top ten rows and columns of results for k=2:
results[[2]][["consensusMatrix"]][1:10,1:10]

# Inspect the color assigned to each cluster
results[[6]][["clrs"]]

#consensusTree - hclust object 
results[[2]][["consensusTree"]]


### 3. Compute cluster-consensus and item-consensus
# calculating cluster-consensus and item-consensus.
icl = calcICL(results,title=output_dir,plot="png")
# PNG files whose names start with icl are written to output_dir
# icl is a list containing "clusterConsensus" and "itemConsensus"
icl[["clusterConsensus"]]
icl[["itemConsensus"]][1:5,]


### 4. Choose a suitable K and build a data frame of the cluster assigned to each sample
sample_cluster <- results[[5]]$consensusClass

sample_cluster_df <- data.frame(sample = names(sample_cluster),
                                cluster = sample_cluster)
head(sample_cluster_df)

Reference

https://www.bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
 
