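I find that Wikipedia already summarizes the definition of the Rand index well, so I will not elaborate on it myself. For the completeness of this post, and in case Wikipedia is not reachable from mainland China, I will simply copy it over here. If you can access it, I still recommend reading the original on Wikipedia.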
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements; this is the adjusted Rand index. From a mathematical standpoint, the Rand index is related to accuracy, but is applicable even when class labels are not used.
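Given a set of $n$ elements $S = \{o_1, ..., o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, ..., X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, ..., Y_s\}$, a partition of $S$ into $s$ subsets, define the following:

- $a$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in the same subset in $Y$
- $b$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in different subsets in $Y$
- $c$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in different subsets in $Y$
- $d$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in the same subset in $Y$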
The Rand index, R, is:
$$R = \frac{a+b}{a+b+c+d} = \frac{a+b}{C_n^2} = \frac{a+b}{n(n-1)/2}$$

Intuitively, $a+b$ can be considered as the number of agreements between $X$ and $Y$, and $c+d$ as the number of disagreements between $X$ and $Y$.
Since the denominator is the total number of pairs, i.e. $C_n^2 = n(n-1)/2$, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that $X$ and $Y$ will agree on a randomly chosen pair.
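To make the pair counting concrete, here is a minimal brute-force sketch in plain Python. It assumes each clustering is given as a list of cluster labels (one per element), and it uses the usual convention that $a$ counts pairs grouped together in both clusterings, $b$ pairs separated in both, $c$ pairs together only in $X$, and $d$ pairs together only in $Y$. The function names and the example labels are made up for illustration.

```python
from itertools import combinations

def pair_counts(labels_x, labels_y):
    """Count the four pair types (a, b, c, d) by visiting every unordered pair."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1          # together in X and in Y
        elif not same_x and not same_y:
            b += 1          # separated in X and in Y
        elif same_x:
            c += 1          # together in X, separated in Y
        else:
            d += 1          # separated in X, together in Y
    return a, b, c, d

def rand_index(labels_x, labels_y):
    a, b, c, d = pair_counts(labels_x, labels_y)
    return (a + b) / (a + b + c + d)

# Hypothetical example with n = 6 elements, so C(6, 2) = 15 pairs in total.
labels_x = [0, 0, 0, 1, 1, 1]
labels_y = [0, 0, 1, 1, 2, 2]
print(pair_counts(labels_x, labels_y))  # (2, 8, 4, 1)
print(rand_index(labels_x, labels_y))   # 10 / 15
```

This runs in $O(n^2)$ time, so it is only meant as a reference for checking faster implementations on small inputs.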
Similarly, one can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:
$$RI = \frac{TP + TN}{TP + FP + FN + TN}$$

where $TP$ is the number of true positives, $TN$ is the number of true negatives, $FP$ is the number of false positives, and $FN$ is the number of false negatives.
The Rand index has a value between 0 and 1, with 0 indicating that the two data clusterings do not agree on any pair of points and 1 indicating that the data clusterings are exactly the same.
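In mathematical terms, $a$, $b$, $c$, $d$ count the unordered pairs $(o_i, o_j)$, $i \neq j$, that satisfy the following conditions:

- $a$: $o_i, o_j \in X_k$ and $o_i, o_j \in Y_l$ for some $k$, $l$
- $b$: $o_i \in X_{k_1}$, $o_j \in X_{k_2}$, $o_i \in Y_{l_1}$, $o_j \in Y_{l_2}$ for some $k_1 \neq k_2$, $l_1 \neq l_2$
- $c$: $o_i, o_j \in X_k$, $o_i \in Y_{l_1}$, $o_j \in Y_{l_2}$ for some $k$ and $l_1 \neq l_2$
- $d$: $o_i \in X_{k_1}$, $o_j \in X_{k_2}$, $o_i, o_j \in Y_l$ for some $k_1 \neq k_2$ and $l$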
The Rand index can also be viewed through the prism of binary classification accuracy over the pairs of elements in $S$. The two class labels are "$o_i$ and $o_j$ are in the same subset in $X$ and $Y$" and "$o_i$ and $o_j$ are in different subsets in $X$ and $Y$".
In that setting, $a$ is the number of pairs correctly labeled as belonging to the same subset (true positives), and $b$ is the number of pairs correctly labeled as belonging to different subsets (true negatives).
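The binary-classification reading can be sketched directly: turn each clustering into one boolean per pair ("same subset" or not), then measure plain accuracy between the two boolean lists. This is an illustrative sketch only; the function names are hypothetical, and treating the first argument as the prediction and the second as the reference is an assumption made here to name the false positives and false negatives.

```python
from itertools import combinations

def pairwise_same(labels):
    """One boolean per unordered pair: True if both elements share a cluster."""
    return [labels[i] == labels[j]
            for i, j in combinations(range(len(labels)), 2)]

def rand_index_as_accuracy(pred_labels, true_labels):
    pred = pairwise_same(pred_labels)
    true = pairwise_same(true_labels)
    tp = sum(p and t for p, t in zip(pred, true))              # together in both
    tn = sum((not p) and (not t) for p, t in zip(pred, true))  # separated in both
    fp = sum(p and (not t) for p, t in zip(pred, true))        # together only in pred
    fn = sum((not p) and t for p, t in zip(pred, true))        # together only in true
    return (tp + tn) / (tp + fp + fn + tn), (tp, tn, fp, fn)

ri, counts = rand_index_as_accuracy([0, 0, 1, 1], [0, 0, 0, 1])
print(counts)  # (1, 2, 1, 2)
print(ri)      # 0.5
```

Because the measure is symmetric in the two clusterings, swapping the arguments exchanges `fp` and `fn` but leaves the index unchanged.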
Given a set $S$ of $n$ elements, and two groupings or partitions (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, ..., X_r\}$ and $Y = \{Y_1, Y_2, ..., Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j$: $n_{ij} = |X_i \cap Y_j|$.
| $X \backslash Y$ | $Y_1$ | $Y_2$ | … | $Y_s$ | Sums |
|---|---|---|---|---|---|
| $X_1$ | $n_{11}$ | $n_{12}$ | … | $n_{1s}$ | $a_1$ |
| $X_2$ | $n_{21}$ | $n_{22}$ | … | $n_{2s}$ | $a_2$ |
| … | … | … | … | … | … |
| $X_r$ | $n_{r1}$ | $n_{r2}$ | … | $n_{rs}$ | $a_r$ |
| Sums | $b_1$ | $b_2$ | … | $b_s$ | $N$ |
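As a rough illustration of how such a table could be built, the sketch below assumes the two clusterings are given as label vectors of equal length and uses NumPy. The helper name `contingency_table` is hypothetical; `np.unique(..., return_inverse=True)` maps arbitrary label values to row and column indices, and `np.add.at` accumulates one count per element.

```python
import numpy as np

def contingency_table(labels_x, labels_y):
    """Return T with T[i, j] = |X_i ∩ Y_j| for two label vectors."""
    labels_x = np.asarray(labels_x).ravel()
    labels_y = np.asarray(labels_y).ravel()
    x_ids, x_inv = np.unique(labels_x, return_inverse=True)
    y_ids, y_inv = np.unique(labels_y, return_inverse=True)
    T = np.zeros((len(x_ids), len(y_ids)), dtype=np.int64)
    # Add 1 to cell (row of element's X cluster, column of its Y cluster).
    np.add.at(T, (x_inv.ravel(), y_inv.ravel()), 1)
    return T

T = contingency_table([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
print(T)          # [[2 1 0]
                  #  [0 1 2]]
print(T.sum(1))   # row sums a_i: [3 3]
print(T.sum(0))   # column sums b_j: [2 2 2]
print(T.sum())    # N: 6
```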
The adjusted Rand index is the corrected-for-chance version of the Rand index. Such a correction for chance establishes a baseline by using the expected similarity of all pair-wise comparisons between clusterings specified by a random model. Traditionally, the Rand Index was corrected using the Permutation Model for clusterings (the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters). However, the premises of the permutation model are frequently violated; in many clustering scenarios, either the number of clusters or the size distribution of those clusters vary drastically. For example, consider that in K-means the number of clusters is fixed by the practitioner, but the sizes of those clusters are inferred from the data. Variations of the adjusted Rand Index account for different models of random clusterings.
Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.
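The text above does not spell out a formula for the adjusted index. As a hedged sketch, the commonly used permutation-model form is $ARI = (\text{index} - \text{expected}) / (\text{max} - \text{expected})$ with $\text{index} = \sum_{ij} C_{n_{ij}}^2$, $\text{expected} = \sum_i C_{a_i}^2 \sum_j C_{b_j}^2 / C_n^2$ and $\text{max} = \frac{1}{2}\left[\sum_i C_{a_i}^2 + \sum_j C_{b_j}^2\right]$. The function name below is made up, and the example is chosen only to show that a negative value is possible.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index under the permutation model (illustrative sketch).

    Note: the denominator is zero in degenerate cases (for example when both
    clusterings put everything in one cluster); this sketch does not handle that.
    """
    n = len(labels_x)
    n_ij = Counter(zip(labels_x, labels_y))   # contingency table cells
    a_i = Counter(labels_x)                   # row sums
    b_j = Counter(labels_y)                   # column sums
    index = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a_i.values())
    sum_b = sum(comb(v, 2) for v in b_j.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

# Two clusterings that cut the same four elements "crosswise".
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
# Identical clusterings (up to label names) give 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

For the first example the unadjusted Rand index is $2/6$, which is below what random relabelling would give on average, hence the negative adjusted value.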
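Everything above is taken from Wikipedia; I only copied it over. Again, if you can access it, go and read the original there.
Next comes my own plain-language explanation of the Rand index. I hope it helps a little. If you spot any mistakes, please point them out. Thanks.
As shown in the figure above, I will use the simplest possible example to explain how the Rand index is computed in code.
Suppose $X$ is the prediction: $X = \{X_1, X_2\}$, $X_1 \cup X_2 = U$, where $U$ is the whole set.
$Y$ is the ground truth (GT): $Y = \{Y_1, Y_2\}$, $Y_1 \cup Y_2 = U$.
$X_1$ is the circle on the left and $Y_1$ is the circle on the right. We call $X_1$ and $Y_1$ the foreground, and $X_2$ and $Y_2$ the background.
Now overlay the two. Suppose we end up in the situation above, where the two foregrounds only partly overlap. The whole set $U$ is split into four disjoint parts $A, B, C, D$, with $A \cup B \cup C \cup D = U$.
The contingency table ($T$) is therefore: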
| $X \backslash Y$ | $Y_1$ | $Y_2$ | Sums |
|---|---|---|---|
| $X_1$ | $A = n_{11}$ | $B = n_{12}$ | $a_1$ |
| $X_2$ | $C = n_{21}$ | $D = n_{22}$ | $a_2$ |
| Sums | $b_1$ | $b_2$ | $N$ |
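The arrows denote pairs:
For the four subsets $A, B, C, D$ there are 10 ways to connect a pair: $4 + C_4^2 = 10$ (4 kinds where both endpoints lie in the same subset, plus $C_4^2 = 6$ kinds where they lie in two different subsets).
We roughly split these 10 kinds of pairs into two classes, $a+b$ and $c+d$, drawn in blue and red respectively.
Take the red pair (1) as an example. In $X$ the two endpoints belong to the same class (both are in $X_1$), but in $Y$ they do not: the left endpoint is in $Y_2$ and the right endpoint is in $Y_1$. The red pair (1) therefore falls under $c+d$. The other cases follow the same reasoning, so I will not list them one by one.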
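The classification of the 10 pair types can be enumerated mechanically. The sketch below assumes the cell layout of the 2×2 contingency table above ($A = X_1 \cap Y_1$, $B = X_1 \cap Y_2$, $C = X_2 \cap Y_1$, $D = X_2 \cap Y_2$) and labels a pair type "agree" (the $a+b$ class) when $X$ and $Y$ make the same same/different decision, and "disagree" (the $c+d$ class) otherwise. The variable and function names are made up for illustration.

```python
from itertools import combinations_with_replacement

# Which X subset and which Y subset each cell belongs to.
cells = {
    "A": ("X1", "Y1"),
    "B": ("X1", "Y2"),
    "C": ("X2", "Y1"),
    "D": ("X2", "Y2"),
}

def classify(cell_p, cell_q):
    """Classify a pair with one endpoint in cell_p and the other in cell_q."""
    same_x = cells[cell_p][0] == cells[cell_q][0]
    same_y = cells[cell_p][1] == cells[cell_q][1]
    return "agree" if same_x == same_y else "disagree"

pair_types = {p + q: classify(p, q)
              for p, q in combinations_with_replacement("ABCD", 2)}

print(len(pair_types))                                             # 10
print(sorted(k for k, v in pair_types.items() if v == "disagree"))  # ['AB', 'AC', 'BD', 'CD']
print(sorted(k for k, v in pair_types.items() if v == "agree"))     # ['AA', 'AD', 'BB', 'BC', 'CC', 'DD']
```

Only four pair types disagree, which is why counting $c+d$ takes fewer terms than counting $a+b$.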
The figure also shows that $c+d$ is easier to compute than $a+b$, so the Rand index is usually rewritten as:

$$R = \frac{a+b}{n(n-1)/2} = 1 - \frac{c+d}{n(n-1)/2}$$
where:

$$
\begin{aligned}
c + d & = (1) + (2) + (3) + (4) \\
& = n_{11} \cdot n_{12} + n_{11} \cdot n_{21} + n_{12} \cdot n_{22} + n_{21} \cdot n_{22} \\
& = [C_{n_{11}+n_{12}}^2 - C_{n_{11}}^2 - C_{n_{12}}^2] + [C_{n_{11}+n_{21}}^2 - C_{n_{11}}^2 - C_{n_{21}}^2] + [C_{n_{12}+n_{22}}^2 - C_{n_{12}}^2 - C_{n_{22}}^2] + [C_{n_{21}+n_{22}}^2 - C_{n_{21}}^2 - C_{n_{22}}^2] \\
& = [C_{a_1}^2 - C_{n_{11}}^2 - C_{n_{12}}^2] + [C_{b_1}^2 - C_{n_{11}}^2 - C_{n_{21}}^2] + [C_{b_2}^2 - C_{n_{12}}^2 - C_{n_{22}}^2] + [C_{a_2}^2 - C_{n_{21}}^2 - C_{n_{22}}^2] \\
& = [C_{a_1}^2 + C_{a_2}^2 + C_{b_1}^2 + C_{b_2}^2] - 2 \cdot [C_{n_{11}}^2 + C_{n_{12}}^2 + C_{n_{21}}^2 + C_{n_{22}}^2] \\
& = [(a_1)^2/2 + (a_2)^2/2 + (b_1)^2/2 + (b_2)^2/2] - 2 \cdot [(n_{11})^2/2 + (n_{12})^2/2 + (n_{21})^2/2 + (n_{22})^2/2] \\
& = [(a_1)^2 + (a_2)^2 + (b_1)^2 + (b_2)^2] / 2 - [(n_{11})^2 + (n_{12})^2 + (n_{21})^2 + (n_{22})^2] \\
& = [\texttt{T.sum(1).pow(2).sum()} + \texttt{T.sum(0).pow(2).sum()}] / 2 - \texttt{T.pow(2).sum()}
\end{aligned}
$$

The third line uses $n_p \cdot n_q = C_{n_p+n_q}^2 - C_{n_p}^2 - C_{n_q}^2$. The sixth line uses $C_m^2 = (m^2 - m)/2$; the linear terms drop out exactly because $a_1 + a_2 = b_1 + b_2 = n_{11} + n_{12} + n_{21} + n_{22} = N$, so $-(a_1 + a_2 + b_1 + b_2)/2 + (n_{11} + n_{12} + n_{21} + n_{22}) = -N + N = 0$.
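Here $T$ is the contingency table. The purpose of the simplification is to arrive at the matrix expression in the last line. We could compute $c+d$ directly from the expression after the second equals sign, but once there are more than two classes that form becomes inefficient, because it cannot avoid explicit for loops. In the final form no loops are needed: everything is matrix arithmetic (a single line of code), which is both concise and efficient.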
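As a sanity check on the simplification, the sketch below compares the loop form (products of cell counts) with the matrix form on a general $r \times s$ table, using NumPy. Both function names are hypothetical, and the random table is only there to exercise a shape other than 2×2.

```python
import numpy as np

def disagreements_loop(T):
    """c + d from explicit products of cell counts (generalizes the second line)."""
    r, s = T.shape
    total = 0
    for i in range(r):                      # same X_i, different Y_j: contributes to c
        for j in range(s):
            for k in range(j + 1, s):
                total += int(T[i, j]) * int(T[i, k])
    for j in range(s):                      # same Y_j, different X_i: contributes to d
        for i in range(r):
            for k in range(i + 1, r):
                total += int(T[i, j]) * int(T[k, j])
    return total

def disagreements_matrix(T):
    """c + d from row sums, column sums and cell counts (the last line)."""
    T = np.asarray(T, dtype=np.int64)
    # The sum of squared row sums plus squared column sums is always even,
    # so integer division by 2 is exact.
    marginals = np.square(T.sum(1)).sum() + np.square(T.sum(0)).sum()
    return int(marginals // 2 - np.square(T).sum())

T = np.array([[2, 1, 0],
              [0, 1, 2]])
print(disagreements_loop(T), disagreements_matrix(T))  # 5 5

N = int(T.sum())
print(1 - disagreements_matrix(T) / (N * (N - 1) / 2))  # Rand index, 1 - 5/15

rng = np.random.default_rng(0)
T_random = rng.integers(0, 10, size=(4, 5))
print(disagreements_loop(T_random) == disagreements_matrix(T_random))  # True
```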
import numpy as np

def Rand_index_numpy(predMasks, gtMasks):
    '''
    predMasks: NumPy array, prediction result; shape: [r, H, W], (r>=1)
    gtMasks: NumPy array, ground truth; shape: [s, H, W], (s>=1)
    '''
    # Append one extra class to the GT: everything outside the s foreground classes counts as background
    gtMasks = np.concatenate([gtMasks, np.clip(1 - np.sum(gtMasks, axis=0, keepdims=True), a_min=0, a_max=1)], axis=0)
    # Append one extra class to the prediction: everything outside the r foreground classes counts as background
    predMasks = np.concatenate([predMasks, np.clip(1 - np.sum(predMasks, axis=0, keepdims=True), a_min=0, a_max=1)], axis=0)
    # The contingency table, shape [s+1, r+1]
    T = (np.expand_dims(gtMasks, axis=1) * predMasks).sum(-1).sum(-1).astype(np.float32)
    # Total number of pixels
    N = T.sum()
    # RI = 1 - (c + d) / (N * (N - 1) / 2)
    RI = 1 - ((np.power(T.sum(0), 2).sum() + np.power(T.sum(1), 2).sum()) / 2 - np.power(T, 2).sum()) / (N * (N - 1) / 2)
    return RI

import torch

def Rand_index_torch(predMasks, gtMasks):
    '''
    predMasks: Tensor, prediction result; shape: [r, H, W], (r>=1)
    gtMasks: Tensor, ground truth; shape: [s, H, W], (s>=1)
    '''
    # Append one extra class to the GT: everything outside the s foreground classes counts as background
    gtMasks = torch.cat([gtMasks, torch.clamp(1 - gtMasks.sum(0, keepdim=True), min=0)], dim=0)
    # Append one extra class to the prediction: everything outside the r foreground classes counts as background
    predMasks = torch.cat([predMasks, torch.clamp(1 - predMasks.sum(0, keepdim=True), min=0)], dim=0)
    # The contingency table, shape [s+1, r+1]
    T = (gtMasks.unsqueeze(1) * predMasks).sum(-1).sum(-1).float()
    # Total number of pixels
    N = T.sum()
    # RI = 1 - (c + d) / (N * (N - 1) / 2)
    RI = 1 - ((T.sum(0).pow(2).sum() + T.sum(1).pow(2).sum()) / 2 - T.pow(2).sum()) / (N * (N - 1) / 2)
    return RI
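Both functions expect stacks of binary foreground masks with shape `[r, H, W]`. If the segmentation is stored as a single integer label map instead, a small conversion step is needed first. The helper below is a hypothetical sketch (its name and the choice of `0` as the background label are assumptions, not part of the code above); it produces one mask per foreground label and leaves the background to be appended by the Rand index functions themselves.

```python
import numpy as np

def label_map_to_masks(label_map, background=0):
    """Turn an [H, W] integer label map into an [r, H, W] stack of binary masks.

    Raises ValueError (from np.stack) if there is no foreground label at all,
    which matches the r >= 1 requirement of the functions above.
    """
    label_map = np.asarray(label_map)
    foreground_ids = [v for v in np.unique(label_map) if v != background]
    return np.stack([(label_map == v).astype(np.float32) for v in foreground_ids], axis=0)

label_map = [[0, 1, 1],
             [0, 2, 2]]
masks = label_map_to_masks(label_map)
print(masks.shape)   # (2, 2, 3)
print(masks.sum(0))  # 1 wherever some foreground label is present, 0 on background
```

The resulting array can be passed as `predMasks` or `gtMasks` to the NumPy version, or wrapped with `torch.from_numpy` for the PyTorch version.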