Related articles:
Machine Learning | Contents
Machine Learning | Distance Computation
Unsupervised Learning | KMeans and KMeans++ Explained
Unsupervised Learning | KMeans with Sklearn: Clustering Movie Ratings
Clustering performance evaluation
Clustering performance measures are also called clustering "validity indices". As with performance measures in supervised learning, we need some measure to evaluate how good a clustering result is; moreover, once the measure to be used is fixed, it can be taken directly as the optimization objective of the clustering process, so that the result better meets our requirements.
Clustering partitions the sample set $D$ into a number of disjoint subsets, i.e. sample clusters, and we want the "intra-cluster similarity" to be high and the "inter-cluster similarity" to be low.
Clustering performance measures fall roughly into two categories. One compares the clustering result against a "reference model" (a labelled partition of the samples) and is called an "external index"; the other examines the clustering result directly, without using any reference model, and is called an "internal index".
For a dataset $D=\{x_1,x_2,\cdots,x_n\}$, suppose clustering produces the $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$, while the reference model gives the $s$ clusters $C^*=\{C_1^*,C_2^*,\cdots,C_s^*\}$. Let $\lambda$ and $\lambda^*$ denote the cluster-label vectors corresponding to $C$ and $C^*$. Considering the samples pairwise, define:
$$a=|SS|,\quad SS=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\}\tag{1}$$
$$b=|SD|,\quad SD=\{(x_i,x_j)\mid \lambda_i=\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\}\tag{2}$$
$$c=|DS|,\quad DS=\{(x_i,x_j)\mid \lambda_i\neq\lambda_j,\ \lambda_i^*=\lambda_j^*,\ i<j\}\tag{3}$$
$$d=|DD|,\quad DD=\{(x_i,x_j)\mid \lambda_i\neq\lambda_j,\ \lambda_i^*\neq\lambda_j^*,\ i<j\}\tag{4}$$
The set $SS$ contains the sample pairs $(x_i,x_j)$ that lie in the same cluster in the clustering result and also lie in the same cluster in the reference model, analogous to TP in a confusion matrix; the set $SD$ contains the pairs that lie in the same cluster in the clustering result but in different clusters in the reference model, analogous to FP; $DS$ and $DD$ are defined analogously.
Since every sample pair $(x_i,x_j)$ with $i<j$ appears in exactly one of these sets, $a+b+c+d=n(n-1)/2$.
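As a concrete illustration, here is a minimal sketch (the label vectors `labels_pred` and `labels_true` are hypothetical toy data, not from the original text) that counts the four pair sets directly:

```python
from itertools import combinations

labels_pred = [0, 0, 1, 1, 1, 2]   # cluster labels, lambda
labels_true = [0, 0, 1, 1, 2, 2]   # reference labels, lambda*

a = b = c = d = 0
for i, j in combinations(range(len(labels_pred)), 2):  # all pairs with i < j
    same_pred = labels_pred[i] == labels_pred[j]
    same_true = labels_true[i] == labels_true[j]
    if same_pred and same_true:        # SS
        a += 1
    elif same_pred and not same_true:  # SD
        b += 1
    elif not same_pred and same_true:  # DS
        c += 1
    else:                              # DD
        d += 1

n = len(labels_pred)
assert a + b + c + d == n * (n - 1) // 2  # every pair falls in exactly one set
print(a, b, c, d)
```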
Based on equations (1)–(4), the following commonly used external indices can be derived:
Rand Index
$$RI=\frac{a+d}{C_n^2}$$
Adjusted Rand Index
sklearn.metrics.adjusted_rand_score
$$ARI=\frac{RI-E(RI)}{\max(RI)-E(RI)}\tag{5}$$
Advantages
Random (uniform) label assignments have an ARI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for the raw Rand index or the V-measure, for instance).
Bounded range [-1, 1]: negative values are bad (independent labelings), similar clusterings have a positive ARI, 1.0 is the perfect match score.
No assumption is made on the cluster structure: ARI can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with the results of spectral clustering algorithms, which can find clusters with "folded" shapes.
Drawbacks
Contrary to inertia, ARI requires knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
However, ARI can also be useful in a purely unsupervised setting as a building block for a Consensus Index that can be used for clustering model selection.
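A short usage sketch (with the same hypothetical toy labels as above); note that `rand_score`, used here for the raw RI, is only available in recent scikit-learn releases:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 2]

print(rand_score(labels_true, labels_pred))           # raw RI, in [0, 1]
print(adjusted_rand_score(labels_true, labels_pred))  # ARI, in [-1, 1]
```

Both scores are symmetric, so swapping the two label vectors gives the same value.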
Jaccard Coefficient
$$JC=\frac{a}{a+b+c}\tag{6}$$
Fowlkes and Mallows Index
sklearn.metrics.fowlkes_mallows_score
$$FMI=\sqrt{\frac{a}{a+b}\cdot\frac{a}{a+c}}\tag{7}$$
Advantages
Random (uniform) label assignments have an FMI score close to 0.0 for any value of n_clusters and n_samples (which is not the case for raw Mutual Information or the V-measure, for instance).
Upper-bounded at 1: Values close to zero indicate two label assignments that are largely independent, while values close to one indicate significant agreement. Further, values of exactly 0 indicate purely independent label assignments and a FMI of exactly 1 indicates that the two label assignments are equal (with or without permutation).
No assumption is made on the cluster structure: FMI can be used to compare clustering algorithms such as k-means, which assumes isotropic blob shapes, with the results of spectral clustering algorithms, which can find clusters with "folded" shapes.
Drawbacks
Contrary to inertia, FMI-based measures require knowledge of the ground truth classes, which is almost never available in practice or requires manual assignment by human annotators (as in the supervised learning setting).
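A usage sketch (same hypothetical toy labels); fowlkes_mallows_score implements equation (7) as the geometric mean of pairwise precision and recall:

```python
from sklearn.metrics import fowlkes_mallows_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 2]

print(fowlkes_mallows_score(labels_true, labels_pred))  # FMI, upper-bounded by 1
```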
Mutual Information
sklearn.metrics.mutual_info_score
For the dataset $D=\{x_1,x_2,\cdots,x_n\}$, again let $C=\{C_1,C_2,\cdots,C_k\}$ be the $k$ clusters produced by clustering and $C^*=\{C_1^*,C_2^*,\cdots,C_s^*\}$ the $s$ clusters given by the reference model.
$$\begin{aligned} MI(C,C^*) &= \sum_{i=1}^k \sum_{j=1}^s P(C_i,C_j^*)\log \frac{P(C_i\cap C_j^*)}{P(C_i)P(C_j^*)} \\ &= \sum_{i=1}^k \sum_{j=1}^s \frac{|C_i\cap C_j^*|}{n}\log\frac{n\cdot|C_i\cap C_j^*|}{|C_i||C_j^*|} \end{aligned}\tag{8}$$
$P(C_i)$, $P(C_j^*)$ and $P(C_i\cap C_j^*)$ can be read as the probabilities that a sample belongs to cluster $C_i$, to cluster $C_j^*$, and to both simultaneously.
Define the entropy $H$:
$$\begin{aligned} H(C) &= -\sum_{i=1}^k P(C_i)\log P(C_i) \\ &= -\sum_{i=1}^k \frac{|C_i|}{n}\log\frac{|C_i|}{n} \end{aligned}\tag{9}$$
Mutual information can also be viewed as the amount of information about $C$ gained once $C^*$ is known, i.e. the reduction in the uncertainty of $C$, which can be written directly as:
$$MI(C,C^*)=H(C)-H(C|C^*)\tag{10}$$
The minimum of mutual information is 0, attained when the clusters are random with respect to the class labels, i.e. the two are independent and $C$ provides no useful information about $C^*$. The closer the relationship between $C$ and $C^*$, the larger $MI(C,C^*)$; it is maximal when $C$ exactly reproduces $C^*$.
However, MI also reaches its maximum when $k=n$, i.e. when the number of clusters equals the number of samples. MI therefore suffers from the same problem as purity: it does not penalize clusterings with a large number of clusters, so, other things being equal, it cannot formalize the preference for fewer clusters.
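The following sketch (hypothetical toy labels, natural logarithm) evaluates equation (8) directly and compares it with sklearn.metrics.mutual_info_score, which reports MI in nats:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

labels_pred = np.array([0, 0, 1, 1, 1, 2])  # clustering C
labels_true = np.array([0, 0, 1, 1, 2, 2])  # reference model C*
n = len(labels_pred)

mi = 0.0
for ci in np.unique(labels_pred):
    for cj in np.unique(labels_true):
        # |C_i ∩ C_j*|: samples assigned to cluster ci and labelled cj
        n_ij = np.sum((labels_pred == ci) & (labels_true == cj))
        if n_ij > 0:
            mi += n_ij / n * np.log(n * n_ij /
                                    ((labels_pred == ci).sum() * (labels_true == cj).sum()))

print(mi, mutual_info_score(labels_true, labels_pred))  # the two values should match
```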
Normalized Mutual Information
sklearn.metrics.normalized_mutual_info_score
NMI resolves this problem, because entropy grows with the number of clusters: when $k=n$, $H(C)$ reaches its maximum $\log(n)$, which keeps the NMI value low. The denominator $\frac{1}{2}(H(C)+H(C^*))$ is used because it is a tight upper bound on $MI(C,C^*)$, which guarantees $NMI\in[0,1]$.
$$NMI(C,C^*)=\frac{2\times MI(C,C^*)}{H(C)+H(C^*)}\tag{11}$$
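A usage sketch (hypothetical toy labels); with average_method="arithmetic" (the default in current scikit-learn), normalized_mutual_info_score divides MI by $\frac{1}{2}(H(C)+H(C^*))$, matching equation (11):

```python
from sklearn.metrics import normalized_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 2]

print(normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method="arithmetic"))  # NMI in [0, 1]
```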
Adjusted Mutual Information
sklearn.metrics.adjusted_mutual_info_score
$$AMI(C,C^*)=\frac{MI(C,C^*)-E[MI(C,C^*)]}{\mathrm{avg}(H(C),H(C^*))-E[MI(C,C^*)]}\tag{12}$$
Let $a_i=|C_i|$ and $b_j=|C_j^*|$; then $E[MI(C,C^*)]$ is:
$$E[MI(C,C^*)]=\sum_{i=1}^{|C|}\sum_{j=1}^{|C^*|}\sum_{n_{ij}=(a_i+b_j-n)^+}^{\min(a_i,b_j)} \frac{n_{ij}}{n}\log\left(\frac{n\cdot n_{ij}}{a_i b_j}\right) \frac{a_i!\,b_j!\,(n-a_i)!\,(n-b_j)!}{n!\,n_{ij}!\,(a_i-n_{ij})!\,(b_j-n_{ij})!\,(n-a_i-b_j+n_{ij})!}\tag{13}$$
When the logarithm has base 2, the unit is bits; when the base is $e$, the unit is nats. [2]
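A usage sketch (hypothetical toy labels); like NMI, adjusted_mutual_info_score uses the arithmetic mean of the two entropies by default, and a random labelling scores close to 0:

```python
from sklearn.metrics import adjusted_mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 1, 1, 2]

print(adjusted_mutual_info_score(labels_true, labels_pred))  # AMI, upper-bounded by 1
```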
The remaining measures are internal indices, which examine the clustering result directly. Consider the $k$ clusters $C=\{C_1,C_2,\cdots,C_k\}$ produced by clustering, where $dist(\cdot,\cdot)$ computes the distance between two samples (see the Distance Computation article above) and $\mu$ denotes the center of a cluster $C$, $\mu=\frac{1}{|C|}\sum_{x_i\in C} x_i$.
Define:
the average distance between samples within cluster $C$, $avg(C)$:
$$avg(C)=\frac{2}{|C|(|C|-1)}\sum_{1\leq i<j\leq|C|} dist(x_i,x_j)\tag{14}$$
the maximum distance between samples within cluster $C$, $diam(C)$:
$$diam(C)=\max_{1\leq i<j\leq|C|} dist(x_i,x_j)\tag{15}$$
the distance between the closest samples of clusters $C_i$ and $C_j$, $d_{min}(C_i,C_j)$:
$$d_{min}(C_i,C_j)=\min_{x_i\in C_i,\, x_j\in C_j} dist(x_i,x_j)\tag{16}$$
the distance between the centers of clusters $C_i$ and $C_j$, $d_{cen}(C_i,C_j)$:
$$d_{cen}(C_i,C_j)=dist(\mu_i,\mu_j)\tag{17}$$
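A minimal sketch of these four quantities (the helper names avg, diam, d_min, d_cen and the toy clusters are mine; Euclidean distance is assumed):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def avg(C):
    """Mean pairwise distance within cluster C, equation (14)."""
    return pdist(C).mean()

def diam(C):
    """Maximum pairwise distance within cluster C, equation (15)."""
    return pdist(C).max()

def d_min(Ci, Cj):
    """Smallest distance between a sample of C_i and a sample of C_j, equation (16)."""
    return cdist(Ci, Cj).min()

def d_cen(Ci, Cj):
    """Distance between the centers of C_i and C_j, equation (17)."""
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))

C1 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])  # toy cluster 1
C2 = np.array([[5.0, 5.0], [6.0, 5.0]])              # toy cluster 2
print(avg(C1), diam(C1), d_min(C1, C2), d_cen(C1, C2))
```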
Davies-Bouldin Index, the smaller the better.
sklearn.metrics.davies_bouldin_score
$$DBI=\frac{1}{k}\sum_{i=1}^k \max_{j\neq i}\left(\frac{avg(C_i)+avg(C_j)}{d_{cen}(\mu_i,\mu_j)}\right)\tag{18}$$
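A usage sketch (hypothetical toy data): davies_bouldin_score is an internal index, so it takes the feature matrix and the predicted labels rather than ground-truth labels:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1])  # cluster assignments from some clustering algorithm
print(davies_bouldin_score(X, labels))  # lower is better
```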
Dunn Index, the larger the better.
$$DI=\min_{1\leq i\leq k}\left\{\min_{j\neq i}\left(\frac{d_{min}(C_i,C_j)}{\max_{1\leq l\leq k} diam(C_l)}\right)\right\}\tag{19}$$
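scikit-learn has no built-in Dunn index, so here is a minimal sketch of equation (19) (the function dunn_index and the toy clusters are mine; every cluster is assumed to contain at least two samples):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def dunn_index(clusters):
    """Dunn index of a partition given as a list of per-cluster sample arrays."""
    max_diam = max(pdist(C).max() for C in clusters)   # largest cluster diameter
    min_sep = min(cdist(Ci, Cj).min()                  # smallest inter-cluster gap
                  for i, Ci in enumerate(clusters)
                  for j, Cj in enumerate(clusters) if j != i)
    return min_sep / max_diam

clusters = [np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
            np.array([[5.0, 5.0], [6.0, 5.0]])]
print(dunn_index(clusters))  # larger is better
```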
Silhouette Coefficient
sklearn.metrics.silhouette_score
For a single sample, let $a$ be the mean distance to the other samples in its own cluster and $b$ the mean distance to the samples in the nearest other cluster; the silhouette coefficient of the sample is
$$S=\frac{b-a}{\max(a,b)}\tag{20}$$
and the score of a clustering is the mean of $S$ over all samples.
Advantages
The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.
The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Drawbacks
The Silhouette Coefficient is generally higher for convex clusters than for other concepts of clusters, such as density-based clusters like those obtained through DBSCAN.
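A usage sketch (same hypothetical toy data as above): silhouette_score is likewise an internal index and averages the per-sample coefficient of equation (20):

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1])  # predicted cluster assignments
print(silhouette_score(X, labels))  # in [-1, 1], higher is better
```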
[1] 周志华. 机器学习[M]. 北京: 清华大学出版社, 2016: 197-199.
[2] Gan Pan.[ML] 聚类评价指标[EB/OL].https://zhuanlan.zhihu.com/p/53840697, 2019-06-28.