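When is consensus clustering used?

A quick sketch of a common paper template:

Cluster analysis shows up most often in papers on tumor subtype identification. Such papers typically cluster expression or methylation data, choose the optimal number of clusters, run differential expression analysis between the resulting groups to obtain DEGs, run GO and pathway analyses on the DEGs, and then look at associations with survival, differences in immune cell abundance, and so on.
If the groups happen to differ in immune cell composition and in survival, there is your paper on the heterogeneity of the immune response in cancer XX.
Many of these papers use Consensus Clustering to identify the subtypes.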

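Some clustering methods start from a random initialization. K-means, for example, can return different clusters under different random seeds. (Hierarchical clustering is deterministic for a given input, but its result can still change when the samples change.)
Because consensus clustering is based on resampling, its results tend to be stable, which helps with this problem.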

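Identifying the number of clusters and their members in unsupervised analysis

Cluster analysis

Limitations of traditional methods

They provide no "objective" criterion for the number of clusters or for cluster boundaries, e.g. Hierarchical Clustering.

The number of clusters has to be given in advance, and there is no unified criterion for comparing results across different numbers of clusters, e.g. K-means Clustering.

The validity and reliability of the clustering result cannot be verified.

Consensus Clustering

Consensus clustering uses resampling to validate a clustering.
Its main purpose is to assess the stability of the clusters.

Underlying assumption

Suppose a new dataset is formed by drawing samples from the different subclasses of the original dataset, with different samples drawn from within each subclass. Clustering this new dataset should then give results close to those on the original dataset, both in the number of clusters and in cluster membership. The more stable the clusters are with respect to sampling variability, the more confident we can be that they represent a real substructure. Resampling perturbs the original dataset: each resampled dataset is clustered, and the results of the many runs are then combined into a consensus assessment.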
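The resampling idea above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code ConsensusClusterPlus runs: the toy data, the helper name `consensus_matrix`, Euclidean distance, and average linkage are all choices made here for the example.

```r
# Illustrative consensus-matrix sketch in base R (not the package implementation).
set.seed(1)

# Toy data: 20 items (rows) in two well-separated groups, 5 features each.
x <- rbind(matrix(rnorm(50, mean = 0), nrow = 10),
           matrix(rnorm(50, mean = 6), nrow = 10))

# Hypothetical helper: fraction of subsamples in which each pair of items
# was clustered together, among the subsamples that contained both items.
consensus_matrix <- function(x, k, reps = 50, p_item = 0.8) {
  n <- nrow(x)
  together <- matrix(0, n, n)  # times a pair landed in the same cluster
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample
  for (r in seq_len(reps)) {
    idx  <- sort(sample(n, size = floor(p_item * n)))
    tree <- hclust(dist(x[idx, , drop = FALSE]), method = "average")
    cl   <- cutree(tree, k = k)
    same <- outer(cl, cl, "==")
    together[idx, idx] <- together[idx, idx] + same
    sampled[idx, idx]  <- sampled[idx, idx] + 1
  }
  together / sampled  # NaN for any pair that was never co-sampled
}

cm <- consensus_matrix(x, k = 2)
round(cm[1:3, c(1:3, 11:13)], 2)
```

Entries near 1 mean the pair was almost always clustered together, entries near 0 mean almost never, and values in between flag unstable assignments.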

1. About ConsensusClusterPlus

Consensus Clustering is an algorithm for identifying the members and the number of clusters in a dataset (e.g. microarray gene expression). ConsensusClusterPlus implements Consensus Clustering in R.

# Load the package
library(ConsensusClusterPlus)
ls("package:ConsensusClusterPlus")
# [1] "calcICL" "ConsensusClusterPlus"

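ConsensusClusterPlus: function for determining cluster number and class membership by stability evidence.
calcICL: function for calculating cluster-consensus and item-consensus.

2. Workflow

Using ConsensusClusterPlus involves three main steps:

Prepare the input data
Run the program
Compute cluster-consensus and item-consensus

3. Preparing the input data

First gather the data to be clustered, for example mRNA expression microarray results or immunohistochemical staining intensities. The input must be a matrix, with items (samples) in columns and features in rows. The ALL gene expression data are used as the example below.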

library(ALL)
data(ALL)
# Example microarray expression data
dataset <- exprs(ALL)

# Look at the first five rows and first five columns
dataset[1:5,1:5]
#              01005    01010    03002    04006    04007
# 1000_at   7.597323 7.479445 7.567593 7.384684 7.905312
# 1001_at   5.046194 4.932537 4.799294 4.922627 4.844565
# 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
# 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
# 1004_at   5.925260 5.912780 5.893209 6.170245 5.615210

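Keep the 5,000 rows of the matrix with the highest MAD (median absolute deviation):
In statistics, the MAD is a robust measure of the spread of a univariate numeric sample.

# Median absolute deviation of each row (probe)
mads <- apply(dataset, 1, mad)
# Sort by MAD in decreasing order and keep the top 5000 rows
dataset <- dataset[rev(order(mads))[1:5000],]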
dim(dataset)
# [1] 5000  128
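To see why MAD is used for the filter instead of the standard deviation, here is a small sketch on a made-up matrix. The values and row names are invented for illustration, and `order(..., decreasing = TRUE)` is used as an alternative to `rev(order(...))`.

```r
# Toy matrix: 4 features (rows) x 6 samples (columns); values are made up.
toy <- rbind(
  flat     = c(5.0, 5.1, 4.9, 5.0, 5.1, 4.9),
  variable = c(2.0, 8.0, 3.0, 9.0, 1.0, 7.0),
  outlier  = c(5.0, 5.1, 4.9, 5.0, 5.1, 50.0),
  moderate = c(4.0, 6.0, 5.0, 7.0, 3.0, 5.0)
)

mads <- apply(toy, 1, mad)  # robust spread per feature
sds  <- apply(toy, 1, sd)   # classical spread, for comparison

# Keep the top_n features by MAD.
top_n <- 2
top   <- toy[order(mads, decreasing = TRUE)[seq_len(top_n)], , drop = FALSE]
rownames(top)

# The feature with a single extreme value dominates sd but not MAD.
names(which.max(sds))
```

A single extreme value inflates the standard deviation of the `outlier` row, while its MAD stays small, so the MAD filter keeps the rows that vary across many samples.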

4. 运行 ConsensusClusterPlus

First set a few parameters:

pItem (item resampling, proportion of items to sample) : 80%
pFeature (gene resampling, proportion of features to sample) : 80%
maxK (maximum cluster number to evaluate) : 6, the largest number of clusters you want to try
reps (resamplings, number of subsamples) : 50
clusterAlg (agglomerative hierarchical clustering algorithm) : 'hc' (hclust)
distance : 'pearson' (1 - Pearson correlation)

title <- "YOUR PATH"  # all figures and data are written here
results <- ConsensusClusterPlus(dataset, maxK = 6,
                                reps = 50, pItem = 0.8,
                                pFeature = 0.8,  
                                clusterAlg = "hc", 
                                distance = "pearson",
                                title = title,
                                plot = "png")  
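## The original author used pFeature = 1 here, which does not match the parameters listed above, so I ran it with 0.8 as listed

A folder in the working directory now contains 9 figures.

Take a look at the results:

results[[2]][["consensusMatrix"]][1:5, 1:5]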
#         [,1]      [,2]      [,3]    [,4]      [,5]
# [1,] 1.00000 0.9375000 1.0000000 0.90625 1.0000000
# [2,] 0.93750 1.0000000 0.9677419 1.00000 0.9393939
# [3,] 1.00000 0.9677419 1.0000000 0.93750 1.0000000
# [4,] 0.90625 1.0000000 0.9375000 1.00000 0.9062500
# [5,] 1.00000 0.9393939 1.0000000 0.90625 1.0000000
results[[2]][["consensusTree"]] 
# Call:
# hclust(d = as.dist(1 - fm), method = finalLinkage)
# 
# Cluster method   : average 
# Number of objects: 128 
results[[2]][["consensusClass"]][1:5] 
# 01005 01010 03002 04006 04007 
#     1     1     1     1     1

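How should the plots be interpreted?

4.1 Consensus matrix

The panels are, in order, the legend and the consensus matrix heatmaps for k = 2, 3, 4, 5.
These are called CM plots. Their purpose is to show how the samples partition: the "cleanest" plot, where the white areas have as little blue mixed in as possible, marks the k that separates the samples best.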

CM plots

4.2 一致性累积分布函数图

cdf plot

CDF of the consensus values for each number of clusters k.
Empirical cumulative distribution function (CDF) plots display consensus distributions for each k. The purpose of the CDF plot is to find the k at which the distribution reaches an approximate maximum, which indicates a maximum stability and after which divisions are equivalent to random picks rather than true cluster structure.
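What the CDF summarizes can be sketched on two made-up consensus matrices. The matrices, the helper names `pair_values`, `cdf_area`, and `ambiguous`, and the 0.1/0.9 cut-offs are assumptions of this sketch, not part of the package.

```r
# Two toy 4 x 4 consensus matrices (made-up values).
clean <- matrix(c(1, 1, 0, 0,
                  1, 1, 0, 0,
                  0, 0, 1, 1,
                  0, 0, 1, 1), nrow = 4, byrow = TRUE)
noisy <- matrix(c(1.0, 0.6, 0.4, 0.5,
                  0.6, 1.0, 0.5, 0.4,
                  0.4, 0.5, 1.0, 0.6,
                  0.5, 0.4, 0.6, 1.0), nrow = 4, byrow = TRUE)

# Consensus values for distinct pairs only (lower triangle, no diagonal).
pair_values <- function(m) m[lower.tri(m)]

# Area under the empirical CDF on [0, 1], approximated on a regular grid.
cdf_area <- function(m, n_grid = 100) {
  f    <- ecdf(pair_values(m))
  grid <- seq(0, 1, length.out = n_grid + 1)
  mean(f(grid[-1]))  # right-endpoint Riemann sum with step 1 / n_grid
}

# Fraction of pairs with ambiguous consensus (strictly between 0.1 and 0.9).
ambiguous <- function(m) {
  v <- pair_values(m)
  mean(v > 0.1 & v < 0.9)
}

c(clean = ambiguous(clean), noisy = ambiguous(noisy))
c(clean = cdf_area(clean), noisy = cdf_area(noisy))
```

In the clean matrix every pair sits at 0 or 1, so the CDF is flat in between. In the noisy matrix all pairs sit in the middle, so the CDF rises steeply there.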

4.3 Delta Area Plot

The elbow method is typically used here: the k at the inflection point is taken as the best number of clusters.
The delta area score (y-axis) indicates the relative increase in cluster stability.
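A short sketch of how such a score can be computed, assuming the delta area is the relative change in the area under the consensus CDF from k - 1 to k, with the k = 2 entry being the area itself. The area values below are made up for illustration.

```r
# Hypothetical areas under the consensus CDF for k = 2..6 (made-up numbers).
area <- c(`2` = 0.30, `3` = 0.45, `4` = 0.54, `5` = 0.56, `6` = 0.57)

# First entry is the area itself; later entries are relative increases.
delta <- c(area[1], diff(area) / head(area, -1))
round(delta, 3)
```

With these numbers the relative gain drops sharply after k = 4, which is the kind of elbow to look for in the plot.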

4.4 Tracking Plot

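Read this plot row by row (each row is a k): it shows which cluster each sample (column) is assigned to at each number of clusters k. For example, at k = 2 most samples fall in the light-blue cluster, and only a small group in the middle falls in the dark-blue cluster.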
The item tracking plot shows the consensus cluster of items (in columns) at each k (in rows). This allows a user to track an item's cluster assignments across different k, to identify promiscuous items that are suggestive of weak class membership, and to visualize the distribution of cluster sizes across k.

5. Computing cluster-consensus and item-consensus

icl <- calcICL(results, title = title,
               plot = "png")
## Returns a list with two elements; inspect each in turn
dim(icl[["clusterConsensus"]])
# [1] 20  3
icl[["clusterConsensus"]] 
#       k cluster clusterConsensus
#  [1,] 2       1        0.9402982
#  [2,] 2       2        0.9062500
#  [3,] 3       1        0.8504193
#  [4,] 3       2        0.9062500
#  [5,] 3       3        0.9869781
#  [6,] 4       1        0.9652282
#  [7,] 4       2        0.9045058
#  [8,] 4       3        0.9062500
#  [9,] 4       4        0.9728043
# [10,] 5       1        0.9216686
# [11,] 5       2        0.9145987
# [12,] 5       3        0.9062500
# [13,] 5       4        0.9874950
# [14,] 5       5              NaN
# [15,] 6       1        0.9307379
# [16,] 6       2        0.8897721
# [17,] 6       3        0.7474747
# [18,] 6       4        0.8750000
# [19,] 6       5        0.9885269
# [20,] 6       6        0.6333333
dim(icl[["itemConsensus"]])
# [1] 2560    4
icl[["itemConsensus"]][1:5,] 
#   k cluster  item itemConsensus
# 1 2       1 28032     0.9523526
# 2 2       1 28024     0.9366226
# 3 2       1 03002     0.9686272
# 4 2       1 01005     0.9573623
# 5 2       1 04007     0.9549235

5.1 item-Consensus Plot

IC plot

Item-consensus (IC) is the average consensus value between an item and members of a consensus cluster, so that there are multiple IC values for an item at a k corresponding to the k clusters. IC plots display items as vertical bars of coloured rectangles whose height corresponds to IC values.
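The definition can be made concrete on a made-up consensus matrix. The matrix, the labels, and the helper `item_consensus` are hypothetical, and leaving the item itself out of the average is an assumption of this sketch, so the numbers need not match calcICL exactly.

```r
# Toy consensus matrix for 5 items and their cluster labels (made-up values).
cm <- matrix(c(1.0, 0.9, 0.8, 0.1, 0.2,
               0.9, 1.0, 0.7, 0.2, 0.1,
               0.8, 0.7, 1.0, 0.3, 0.4,
               0.1, 0.2, 0.3, 1.0, 0.9,
               0.2, 0.1, 0.4, 0.9, 1.0), nrow = 5, byrow = TRUE,
             dimnames = list(paste0("s", 1:5), paste0("s", 1:5)))
cls <- c(s1 = 1, s2 = 1, s3 = 1, s4 = 2, s5 = 2)

# Mean consensus between one item and the members of one cluster,
# leaving the item itself out (an assumption of this sketch).
item_consensus <- function(cm, cls, item, cluster) {
  members <- setdiff(names(cls)[cls == cluster], item)
  mean(cm[item, members])
}

# One IC value per (item, cluster) pair: rows are items, columns are clusters.
ks <- sort(unique(cls))
ic <- sapply(ks, function(k)
  sapply(names(cls), function(i) item_consensus(cm, cls, i, k)))
colnames(ic) <- paste0("cluster", ks)
round(ic, 2)
```

An item with a high value for its own cluster and low values elsewhere is a confident member. Similar values across clusters suggest weak membership.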

5.2 Cluster-Consensus Plot

cluster-consensus plot
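Cluster-consensus can be sketched as the mean pairwise consensus among the members of a cluster. The matrix, labels, and helper `cluster_consensus` below are made up for illustration. Under this definition a cluster with a single member has no pairs to average, which is one plausible reason for the NaN at k = 5, cluster 5 in the output above.

```r
# Toy consensus matrix for 4 items (made-up values).
cm <- matrix(c(1.0, 0.9, 0.8, 0.1,
               0.9, 1.0, 0.7, 0.2,
               0.8, 0.7, 1.0, 0.3,
               0.1, 0.2, 0.3, 1.0), nrow = 4, byrow = TRUE)
cls <- c(1, 1, 1, 2)  # cluster 2 is a singleton

# Mean consensus over distinct pairs of members of one cluster.
cluster_consensus <- function(cm, cls, cluster) {
  sub <- cm[cls == cluster, cls == cluster, drop = FALSE]
  mean(sub[lower.tri(sub)])  # mean of an empty vector is NaN
}

cc <- sapply(sort(unique(cls)), function(k) cluster_consensus(cm, cls, k))
cc
```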

References

ConsensusClusterPlus Tutorial https://bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
Nowicka, Malgorzata, et al. "CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets." F1000Research 6 (2017).
