In unsupervised learning, two common tasks are clustering and dimensionality reduction. This post presents three metrics for evaluating clustering quality: mutual information, normalized mutual information, and adjusted mutual information (MI, NMI, AMI), giving the computation of each along with code. Note that all three metrics require the ground-truth labels of the data points.
Given $N$ data points of dimension $D$, forming the data matrix $\mathbf{X} = [x_1, x_2, \cdots, x_N] \in \mathbb{R}^{D\times N}$;
each data point $x_i \in \mathbb{R}^{D\times 1}$ has a corresponding label $u_i \in \mathbb{R}$;
the labels take $R$ distinct values, i.e., the data fall into $R$ classes, and the labels together form the vector $U = [u_1, u_2, \cdots, u_N] \in \mathbb{R}^N$.
We number the $N$ data points with the natural numbers $1$ through $N$ and let $S = \left\{1, 2, \cdots, N\right\}$; then
$$S = \bigcup_{i=1}^R U_i = \left\{U_1, U_2, \cdots, U_R\right\}$$
where $U_i$ is the set of points (by index) belonging to class $i$. For example, for the vector $U = [1, 1, 2, 2, 3, 3, 3]$,
$$N = 7, \quad R = 3, \quad U_1 = \left\{1, 2\right\}, \quad U_2 = \left\{3, 4\right\}, \quad U_3 = \left\{5, 6, 7\right\}$$
One convention worth spelling out: if the full dataset has 5 classes, we denote those classes by the natural numbers 1 through 5 by default; likewise, since $U$ contains $R$ classes, we denote them by $1$ through $R$.
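As a quick illustration (the helper index_sets is mine, not from any library), these index sets can be built by grouping the 1-based positions of a label vector by label value:

from collections import defaultdict

def index_sets(labels):
    # group the 1-based positions of each label value, matching the text's numbering
    groups = defaultdict(set)
    for pos, lab in enumerate(labels, start=1):
        groups[lab].add(pos)
    return dict(groups)

print(index_sets([1, 1, 2, 2, 3, 3, 3]))
# {1: {1, 2}, 2: {3, 4}, 3: {5, 6, 7}}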
The clustering problem is precisely that of partitioning the original dataset $\mathbf{X}$: running some clustering algorithm (e.g., DBSCAN or spectral clustering) yields a partition of $\mathbf{X}$, encoded as $V = [v_1, v_2, \cdots, v_N] \in \mathbb{R}^N$.
Analogously,
$$S = \bigcup_{i=1}^C V_i = \left\{V_1, V_2, \cdots, V_C\right\}$$
where $V_i$ is the set of points assigned to cluster $i$.
Before introducing mutual information, we first review Shannon entropy (information entropy) and the contingency table.
For the label vector $U \in \mathbb{R}^N$ above, the Shannon entropy (information entropy) is
$$\text{H}(U) = -\sum_{i=1}^R p_i \log p_i \tag{1}$$
where the base of the logarithm is usually taken to be $2$ or $e$, and $p_i$ is the fraction of points belonging to class $i$:
$$p_i = \frac{\left| U_i \right|}{N} \tag{2}$$
For example, for $U = [1, 1, 1]$: $p_1 = 1$ and $\text{H}(U) = -1 \cdot \log(1) = 0$;
for $U = [1, 2, 3]$: $p_1 = p_2 = p_3 = \frac{1}{3}$ and $\text{H}(U) = -3 \cdot \frac{1}{3} \log\frac{1}{3} = \log 3$;
for $U = [1, 2, 2]$: $p_1 = \frac{1}{3}$, $p_2 = \frac{2}{3}$ and $\text{H}(U) = -\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3} = \log 3 - \frac{2}{3}\log 2$.
Note that entropy is always non-negative, since $0 < p_i \leq 1$.
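As a quick check of Eqs. (1)–(2), here is a minimal NumPy sketch (the name shannon_entropy is mine, not a library function) that reproduces the three examples above with the natural logarithm:

import numpy as np

def shannon_entropy(labels):
    # H(U) = -sum_i p_i * log(p_i), with p_i = |U_i| / N, Eqs. (1)-(2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) + 0.0  # "+ 0.0" turns a signed -0.0 into 0.0

print(shannon_entropy([1, 1, 1]))  # 0.0
print(shannon_entropy([1, 2, 3]))  # log 3 ≈ 1.0986
print(shannon_entropy([1, 2, 2]))  # log 3 - (2/3) log 2 ≈ 0.6365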
Let the matrix $M \in \mathbb{R}^{R\times C}$ be the contingency table of the true label vector $U$ and the predicted label vector $V$, with entries
$$m_{ij} = \left|U_i \cap V_j\right| \tag{3}$$
for $i = 1, 2, \cdots, R$ and $j = 1, 2, \cdots, C$.
For example, for $U = [1, 1, 2, 2]$ and $V = [1, 1, 1, 2]$, we have
$$U_1 = \left\{1, 2\right\}, \quad U_2 = \left\{3, 4\right\}, \quad V_1 = \left\{1, 2, 3\right\}, \quad V_2 = \left\{4\right\}$$
so that
$$\begin{aligned} m_{11} &= \left|U_1 \cap V_1\right| = \left|\left\{1, 2\right\}\right| = 2 \\ m_{12} &= \left|U_1 \cap V_2\right| = \left|\varnothing\right| = 0 \\ m_{21} &= \left|U_2 \cap V_1\right| = \left|\left\{3\right\}\right| = 1 \\ m_{22} &= \left|U_2 \cap V_2\right| = \left|\left\{4\right\}\right| = 1 \end{aligned}$$
and the contingency table is
$$M = \begin{bmatrix} 2 & 0 \\ 1 & 1 \end{bmatrix}$$
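The table can also be built programmatically; below is a minimal sketch (contingency_table is my helper; sklearn.metrics.cluster.contingency_matrix computes the same thing) that reproduces the example above:

import numpy as np

def contingency_table(U, V):
    # m_ij = |U_i ∩ V_j|: count points with true class i and predicted cluster j
    _, u_idx = np.unique(U, return_inverse=True)
    _, v_idx = np.unique(V, return_inverse=True)
    M = np.zeros((u_idx.max() + 1, v_idx.max() + 1), dtype=int)
    for i, j in zip(u_idx, v_idx):
        M[i, j] += 1
    return M

print(contingency_table([1, 1, 2, 2], [1, 1, 1, 2]))
# [[2 0]
#  [1 1]]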
The mutual information is computed as
$$\text{MI}(U, V) = \sum_{i=1}^R \sum_{j=1}^C p_{i,j} \log\left(\frac{p_{i,j}}{p_i \times p_j}\right) \tag{4}$$
where
$$p_{i,j} = \frac{\left|U_i \cap V_j\right|}{N} = \frac{m_{ij}}{N} \tag{5}$$
$$p_i = \frac{\left|U_i\right|}{N}, \qquad p_j = \frac{\left|V_j\right|}{N}$$
For example, for $U = [1, 1, 2, 2]$ and $V = [1, 1, 1, 2]$,
$$\begin{aligned} p_{1,1} &= m_{11} / N = 0.5 \\ p_{1,2} &= m_{12} / N = 0 \\ p_{2,1} &= m_{21} / N = 0.25 \\ p_{2,2} &= m_{22} / N = 0.25 \end{aligned}$$
Cells with $p_{i,j} = 0$ contribute nothing to the sum in Eq. (4), since $p \log p \to 0$ as $p \to 0^+$.
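Putting Eqs. (4)–(5) together, here is a sketch of MI evaluated from the contingency table (mutual_information is my name for it; on this example it agrees with sklearn's mutual_info_score):

import numpy as np

def mutual_information(M):
    # Eq. (4) from the contingency table; zero cells are skipped
    M = np.asarray(M, dtype=float)
    p_ij = M / M.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)  # row marginals p_i
    p_j = p_ij.sum(axis=0, keepdims=True)  # column marginals p_j
    nz = p_ij > 0
    return float(np.sum(p_ij[nz] * np.log(p_ij[nz] / (p_i * p_j)[nz])))

print(mutual_information([[2, 0], [1, 1]]))  # ≈ 0.2158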
In practice, NMI and AMI are the metrics usually adopted to measure clustering quality.
The normalized mutual information is computed as
$$\text{NMI}(U, V) = \frac{\text{MI}(U, V)}{F\left(\text{H}\left(U\right), \text{H}\left(V\right)\right)} \tag{6}$$
where $F(x_1, x_2)$ may be the $\min$ or $\max$ function, the geometric mean $F(x_1, x_2) = \sqrt{x_1 x_2}$, or the arithmetic mean $F(x_1, x_2) = \frac{x_1 + x_2}{2}$.
We usually take the arithmetic mean, in which case
$$\text{NMI}(U, V) = \frac{2\,\text{MI}(U, V)}{\text{H}\left(U\right) + \text{H}\left(V\right)} \tag{7}$$
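For instance, Eq. (7) on the C/D example used later in this post, reusing shannon_entropy, contingency_table, and mutual_information from the sketches above:

C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
mi = mutual_information(contingency_table(C, D))
nmi = 2 * mi / (shannon_entropy(C) + shannon_entropy(D))
print(nmi)  # ≈ 0.2848, matching the sklearn output further below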
The adjusted mutual information is somewhat more involved:
$$\text{AMI}(U, V) = \frac{\text{MI}(U, V) - E\left\{\text{MI}(U, V)\right\}}{F\left(\text{H}\left(U\right), \text{H}\left(V\right)\right) - E\left\{\text{MI}(U, V)\right\}} \tag{8}$$
where $E\left\{\text{MI}(U, V)\right\}$ is the expected value of the mutual information $\text{MI}(U, V)$, computed as
$$E\left\{\text{MI}(U, V)\right\} = \sum_{i=1}^R \sum_{j=1}^C \sum_{k = \left(a_i + b_j - N\right)^+}^{\min\left(a_i, b_j\right)} \frac{k}{N} \log\left(\frac{N \times k}{a_i \times b_j}\right) \frac{a_i!\, b_j!\, \left(N - a_i\right)!\, \left(N - b_j\right)!}{N!\, k!\, \left(a_i - k\right)!\, \left(b_j - k\right)!\, \left(N - a_i - b_j + k\right)!} \tag{9}$$
Here $\left(a_i + b_j - N\right)^+$ denotes $\max\left(1, a_i + b_j - N\right)$, and $a_i$, $b_j$ are the $i$-th row sum and $j$-th column sum of the contingency table $M$:
$$a_i = \sum_{j=1}^C m_{ij}, \qquad b_j = \sum_{i=1}^R m_{ij} \tag{10}$$
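The factorials in Eq. (9) overflow for even moderate $N$ if formed directly, so a sketch of the expected MI (expected_mutual_information is my helper; sklearn performs the equivalent computation inside adjusted_mutual_info_score) can work in log space via scipy.special.gammaln, using gammaln(n + 1) = log(n!):

import numpy as np
from scipy.special import gammaln

def expected_mutual_information(M):
    # Eq. (9): loop over all cells (i, j) and all feasible values k of m_ij
    M = np.asarray(M, dtype=int)
    N = int(M.sum())
    a = M.sum(axis=1)  # row sums a_i
    b = M.sum(axis=0)  # column sums b_j
    emi = 0.0
    for ai in a:
        for bj in b:
            for k in range(max(1, ai + bj - N), min(ai, bj) + 1):
                # log of the hypergeometric weight in Eq. (9)
                log_w = (gammaln(ai + 1) + gammaln(bj + 1)
                         + gammaln(N - ai + 1) + gammaln(N - bj + 1)
                         - gammaln(N + 1) - gammaln(k + 1)
                         - gammaln(ai - k + 1) - gammaln(bj - k + 1)
                         - gammaln(N - ai - bj + k + 1))
                emi += k / N * np.log(N * k / (ai * bj)) * np.exp(log_w)
    return emi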
If we take $F(x_1, x_2)$ to be the $\max$ function, the adjusted mutual information becomes
$$\text{AMI}(U, V) = \frac{\text{MI}(U, V) - E\left\{\text{MI}(U, V)\right\}}{\max\left(\text{H}\left(U\right), \text{H}\left(V\right)\right) - E\left\{\text{MI}(U, V)\right\}} \tag{11}$$
with the geometric mean,
$$\text{AMI}(U, V) = \frac{\text{MI}(U, V) - E\left\{\text{MI}(U, V)\right\}}{\sqrt{\text{H}\left(U\right) \cdot \text{H}\left(V\right)} - E\left\{\text{MI}(U, V)\right\}} \tag{12}$$
and with the arithmetic mean,
$$\text{AMI}(U, V) = \frac{\text{MI}(U, V) - E\left\{\text{MI}(U, V)\right\}}{\frac{1}{2}\left(\text{H}\left(U\right) + \text{H}\left(V\right)\right) - E\left\{\text{MI}(U, V)\right\}} \tag{13}$$
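Combining the sketches above, Eq. (13) on the A/B example from the code below should reproduce the AMI value printed later (≈ 0.2602):

A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
M = contingency_table(A, B)
mi = mutual_information(M)
emi = expected_mutual_information(M)
ami = (mi - emi) / (0.5 * (shannon_entropy(A) + shannon_entropy(B)) - emi)
print(ami)  # ≈ 0.2602, cf. sklearn's adjusted_mutual_info_score(A, B)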
Python's sklearn library provides all three metrics, which can be called directly; MATLAB does not seem to ship a ready-made implementation, so we write one ourselves.
In both implementations below, NMI and AMI use the arithmetic mean, and $\log$ is the natural logarithm.
from sklearn.metrics.cluster import entropy, mutual_info_score, normalized_mutual_info_score, adjusted_mutual_info_score

# NMI and AMI with the arithmetic mean in the denominator
MI = lambda x, y: mutual_info_score(x, y)
NMI = lambda x, y: normalized_mutual_info_score(x, y, average_method='arithmetic')
AMI = lambda x, y: adjusted_mutual_info_score(x, y, average_method='arithmetic')

A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
# print(entropy(A))
# print(MI(A, B))
print(NMI(A, B))
print(AMI(A, B))

C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
print(NMI(C, D))
print(AMI(C, D))
The MATLAB implementation of NMI and AMI below likewise uses the arithmetic mean, with the natural logarithm as the $\log$.
Function file NMI_AMI.m:
function [NMI, AMI] = NMI_AMI(X, Y)
%NMI_AMI return NMI and AMI
%   MI:  mutual information
%   H:   entropy
%   NMI: normalized mutual information
%   AMI: adjusted mutual information
%   NMI(X, Y) = MI(X, Y) / F(H(X), H(Y))
%   AMI(X, Y) = (MI(X, Y) - EMI(X, Y)) / (F(H(X), H(Y)) - EMI(X, Y))
%   F(x, y) can be min, max, the geometric mean, or the arithmetic mean;
%   here we use the arithmetic mean for both.
NMI = 2 * MI(X, Y) / (H(X) + H(Y));
AMI = (MI(X, Y) - EMI(X, Y)) / (1/2 * (H(X) + H(Y)) - EMI(X, Y));
end
function [res] = MI(X, Y)
%MI mutual information, Eq. (4)
%   eps inside the log guards against log(0) for empty cells
n = length(X);
X_list = unique(X);
Y_list = unique(Y);
res = 0;
for x = X_list
    for y = Y_list
        loc_x = find(X == x);
        loc_y = find(Y == y);
        loc_xy = intersect(loc_x, loc_y);
        res = res + length(loc_xy) / n * log(length(loc_xy) / n / ((length(loc_x) / n) * (length(loc_y) / n)) + eps);
    end
end
end
function [res] = H(X)
%H information entropy, Eq. (1)
n = length(X);
X_list = unique(X);
res = 0;
for x = X_list
    loc = find(X == x);
    px = length(loc) / n;
    res = res - px * log(px);
end
end
function [res] = f(a, b)
%f compute the factorial ratio a! / b!
%   a! or b! alone can be astronomically large, so we never form them;
%   only the |a - b| factors between them are multiplied or divided.
%   a and b should both be non-negative integers.
res = 1;
if a > b
    for i = b+1:a
        res = res * i;
    end
elseif a < b
    for i = a+1:b
        res = res / i;
    end
end
end
function [res] = EMI(U, V)
%EMI expected mutual information E{MI(U, V)}, Eq. (9)
N = length(U);
U_list = unique(U);
V_list = unique(V);
R = length(U_list);
C = length(V_list);
% build the contingency table M, Eq. (3)
M = zeros(R, C);
for i = 1:R
    for j = 1:C
        U_loc = find(U == U_list(i));
        V_loc = find(V == V_list(j));
        M(i, j) = length(intersect(U_loc, V_loc));
    end
end
a = sum(M, 2); % row sums a_i
b = sum(M, 1); % column sums b_j
res = 0;
for i = 1:R
    for j = 1:C
        for nij = max(a(i) + b(j) - N, 1):min(a(i), b(j))
            % the factorial ratios of Eq. (9) are evaluated via f() to avoid overflow
            res = res + nij / N * log(N * nij / (a(i) * b(j)) + eps) * f(a(i), a(i) - nij) * f(b(j), b(j) - nij) * f(N - a(i), N) * f(N - b(j), N - a(i) - b(j) + nij) / factorial(nij);
        end
    end
end
end
Main file (script):
clc;
format long
A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3];
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3];
[NMI, AMI] = NMI_AMI(A, B)
C = [1, 1, 2, 2, 3, 3, 3];
D = [1, 1, 1, 2, 1, 1, 1];
[NMI, AMI] = NMI_AMI(C, D)
The Python code prints:
0.36456177185718985
0.26018122538925054
0.28483386264113447
0.056748831755324296
The MATLAB code prints:
NMI =
0.364561771857190
AMI =
0.260181225389251
NMI =
0.284833862641135
AMI =
0.056748831755324
This post only gives the computation of the three metrics, for direct use; the strengths, weaknesses, and design ideas behind each metric are not examined here.
https://en.wikipedia.org/wiki/Adjusted_mutual_information
Vinh, Nguyen Xuan; Epps, Julien; Bailey, James (2010). "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance". The Journal of Machine Learning Research, 11: 2837–2854.
Paper link: http://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf