Def.1 给定数据集 D = { x 1 , x 2 , ⋯ , x n } , ( x i ∈ R d ) \mathbf{D}=\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\},(\mathbf{x}_i\in \mathbb{R}^d) D={x1,x2,⋯,xn},(xi∈Rd), D \mathbf{D} D 的一个聚类是指 D \mathbf{D} D 的划分 C = { C 1 , C 2 , ⋯ , C k } \mathcal{C}=\{C_1,C_2,\cdots,C_k \} C={C1,C2,⋯,Ck} s.t. C i ⊆ D , C i ∩ C j = ∅ , ∪ i = 1 k C i = D C_i\subseteq \mathbf{D},C_i \cap C_j=\emptyset, \cup_{i=1}^k C_i=\mathbf{D} Ci⊆D,Ci∩Cj=∅,∪i=1kCi=D;
称聚类 A = { A 1 , ⋯ , A r } \mathcal{A}=\{A_1,\cdots,A_r\} A={A1,⋯,Ar} 是聚类 B = { B 1 , ⋯ , B s } \mathcal{B}=\{B_1,\cdots,B_s\} B={B1,⋯,Bs} 的嵌套,如果 r > s r>s r>s,且对于 ∀ A i ∈ A \forall A_i \in \mathcal{A} ∀Ai∈A ,存在 B j ∈ B B_j \in \mathcal{B} Bj∈B 使得 A i ⊆ B j A_i \subseteq B_j Ai⊆Bj
D \mathbf{D} D 的分层聚类是指一个嵌套聚类序列 C 1 , ⋯ , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,⋯,Cn,其中 C 1 = { { x 1 } , { x 2 } , ⋯ , { x n } } , ⋯ , C n = { { x 1 , x 2 , ⋯ , x n } } \mathcal{C}_1=\{ \{\mathbf{x}_1\},\{\mathbf{x}_2\},\cdots,\{\mathbf{x}_n\}\},\cdots,\mathcal{C}_n=\{\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\} \} C1={{x1},{x2},⋯,{xn}},⋯,Cn={{x1,x2,⋯,xn}},且 C t \mathcal{C}_t Ct 是 C t + 1 \mathcal{C}_{t+1} Ct+1 的嵌套。
Def.2 分层聚类示图的顶点集是指所有在 C 1 , ⋯ , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,⋯,Cn 中出现的元,如果 C i ∈ C t C_i \in \mathcal{C}_t Ci∈Ct 且 C j ∈ C t + 1 C_j \in \mathcal{C}_{t+1} Cj∈Ct+1 满足,则 C i C_i Ci 与 C j C_j Cj 之间有一条边。
事实:
算法14.1 :
输入: D , k \mathbf{D}, k D,k
输出: C \mathcal{C} C
问题:如何定义/计算 C i , C j C_i,C_j Ci,Cj 的距离,即 δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) ?
δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) 有以下五种不同方式:
简单连接: δ ( C i , C j ) : = min { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \min \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=min{δ(x,y)∣x∈Ci,y∈Cj}
完全连接: δ ( C i , C j ) : = max { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \max \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=max{δ(x,y)∣x∈Ci,y∈Cj}
组群平均: δ ( C i , C j ) : = ∑ x ∈ C i ∑ y ∈ C j δ ( x , y ) n i ⋅ n j , n i = ∣ C i ∣ , n j = ∣ C j ∣ \delta(C_i,C_j):= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i \cdot n_j}, n_i=|C_i|,n_j=|C_j| δ(Ci,Cj):=ni⋅njx∈Ci∑y∈Cj∑δ(x,y),ni=∣Ci∣,nj=∣Cj∣
均值距离: δ ( C i , C j ) : = ∣ ∣ μ i − μ j ∣ ∣ 2 , μ i = 1 n ∑ x ∈ C i x , μ j = 1 n ∑ y ∈ C j y \delta(C_i,C_j):= ||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2,\boldsymbol{\mu}_i=\frac{1}{n}\sum\limits_{\mathbf{x} \in C_i}\mathbf{x},\boldsymbol{\mu}_j=\frac{1}{n}\sum\limits_{\mathbf{y} \in C_j}\mathbf{y} δ(Ci,Cj):=∣∣μi−μj∣∣2,μi=n1x∈Ci∑x,μj=n1y∈Cj∑y
极小方差:对任意 C i C_i Ci,定义平方误差和 S S E i = ∑ x ∈ C i ∣ ∣ x − μ i ∣ ∣ 2 SSE_i= \sum\limits_{\mathbf{x} \in C_i} ||\mathbf{x}-\boldsymbol{\mu}_i|| ^2 SSEi=x∈Ci∑∣∣x−μi∣∣2
对 C i , C j , S S E i j : = ∑ x ∈ C i ∪ C j ∣ ∣ x − μ i j ∣ ∣ 2 C_i,C_j,SSE_{ij}:=\sum\limits_{\mathbf{x} \in C_i\cup C_j} ||\mathbf{x}-\boldsymbol{\mu}_{ij}|| ^2 Ci,Cj,SSEij:=x∈Ci∪Cj∑∣∣x−μij∣∣2,其中 μ i j : = 1 n i + n j ∑ x ∈ C i ∪ C j x \boldsymbol{\mu}_{ij}:=\frac{1}{n_i+n_j}\sum\limits_{\mathbf{x} \in C_i\cup C_j}\mathbf{x} μij:=ni+nj1x∈Ci∪Cj∑x
δ ( C i , C j ) : = S S E i j − S S E i − S S E j \delta(C_i,C_j):=SSE_{ij}-SSE_i-SSE_j δ(Ci,Cj):=SSEij−SSEi−SSEj
证明: δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μi−μj∣∣2
简记: C i j : = C i ∪ C j , n i j : = n i + n j C_{ij}:=C_i\cup C_j,n_{ij}:=n_i+n_j Cij:=Ci∪Cj,nij:=ni+nj
注意: C i ∩ C j = ∅ C_i \cap C_j=\emptyset Ci∩Cj=∅,故 ∣ C i j ∣ = n i + n j |C_{ij}|=n_i+n_j ∣Cij∣=ni+nj
δ ( C i , C j ) = ∑ z ∈ C i j ∥ z − μ i j ∥ 2 − ∑ x ∈ C i ∥ x − μ i ∥ 2 − ∑ y ∈ C j ∥ y − μ j ∥ 2 = ∑ z ∈ C i j z T z − n i j μ i j T μ i j − ∑ x ∈ C i x T x + n i μ i T μ i − ∑ y ∈ C j y T y + n j μ j T μ j = n i μ i T μ i + n j μ j T μ j − ( n i + n j ) μ i j T μ i j \begin{aligned} \delta\left(C_{i}, C_{j}\right) &=\sum_{\mathbf{z} \in C_{i j}}\left\|\mathbf{z}-\boldsymbol{\mu}_{i j}\right\|^{2}-\sum_{\mathbf{x} \in C_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}-\sum_{\mathbf{y} \in C_{j}}\left\|\mathbf{y}-\boldsymbol{\mu}_{j}\right\|^{2} \\ &=\sum_{\mathbf{z} \in C_{i j}} \mathbf{z}^{T} \mathbf{z}-n_{i j} \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}-\sum_{\mathbf{x} \in C_{i}} \mathbf{x}^{T} \mathbf{x}+n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-\sum_{\mathbf{y} \in C_{j}} \mathbf{y}^{T} \mathbf{y}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j} \\ &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j} \end{aligned} δ(Ci,Cj)=z∈Cij∑∥ ∥z−μij∥ ∥2−x∈Ci∑∥x−μi∥2−y∈Cj∑∥ ∥y−μj∥ ∥2=z∈Cij∑zTz−nijμijTμij−x∈Ci∑xTx+niμiTμi−y∈Cj∑yTy+njμjTμj=niμiTμi+njμjTμj−(ni+nj)μijTμij
注意到: μ i j = 1 n i j ∑ z ∈ C i j z = 1 n i + n j ( ∑ x ∈ C i x + ∑ y ∈ C j y ) = 1 n i + n j ( n i μ i + n j μ j ) \boldsymbol{\mu}_{i j}=\frac{1}{n_{ij}}\sum\limits_{\mathbf{z} \in C_{ij}} \mathbf{z}=\frac{1}{n_i+n_j}(\sum\limits_{\mathbf{x} \in C_{i}} \mathbf{x}+\sum\limits_{\mathbf{y} \in C_{j}} \mathbf{y})=\frac{1}{n_i+n_j}(n_i\boldsymbol{\mu}_{i}+n_j\boldsymbol{\mu}_{j}) μij=nij1z∈Cij∑z=ni+nj1(x∈Ci∑x+y∈Cj∑y)=ni+nj1(niμi+njμj)
故: μ i j T μ i j = 1 ( n i + n j ) 2 ( n i 2 μ i T μ i + 2 n i n j μ i T μ j + n j 2 μ j T μ j ) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}=\frac{1}{\left(n_{i}+n_{j}\right)^{2}}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) μijTμij=(ni+nj)21(ni2μiTμi+2ninjμiTμj+nj2μjTμj)
δ ( C i , C j ) = n i μ i T μ i + n j μ j T μ j − 1 ( n i + n j ) ( n i 2 μ i T μ i + 2 n i n j μ i T μ j + n j 2 μ j T μ j ) = n i ( n i + n j ) μ i T μ i + n j ( n i + n j ) μ j T μ j − n i 2 μ i T μ i − 2 n i n j μ i T μ j − n j 2 μ j T μ j n i + n j = n i n j ( μ i T μ i − 2 μ i T μ j + μ j T μ j ) n i + n j = ( n i n j n i + n j ) ∥ μ i − μ j ∥ 2 \begin{aligned} \delta\left(C_{i}, C_{j}\right) &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \mu_{j}^{T} \boldsymbol{\mu}_{j}-\frac{1}{\left(n_{i}+n_{j}\right)}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) \\ &=\frac{n_{i}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}-n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}}{n_{i}+n_{j}} \\ &=\frac{n_{i} n_{j}\left(\boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+\boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)}{n_{i}+n_{j}} \\ &=\left(\frac{n_{i} n_{j}}{n_{i}+n_{j}}\right)\left\|\boldsymbol{\mu}_{i}-\boldsymbol{\mu}_{j}\right\|^{2} \end{aligned} δ(Ci,Cj)=niμiTμi+njμjTμj−(ni+nj)1(ni2μiTμi+2ninjμiTμj+nj2μjTμj)=ni+njni(ni+nj)μiTμi+nj(ni+nj)μjTμj−ni2μiTμi−2ninjμiTμj−nj2μjTμj=ni+njninj(μiTμi−2μiTμj+μjTμj)=(ni+njninj)∥ ∥μi−μj∥ ∥2
问题:如何快速计算算法14.1 第7步:更新矩阵?
☆ Lance–Williams formula
δ ( C i j , C r ) = α i ⋅ δ ( C i , C r ) + α j ⋅ δ ( C j , C r ) + β ⋅ δ ( C i , C j ) + γ ⋅ ∣ δ ( C i , C r ) − δ ( C j , C r ) ∣ \begin{array}{r} \delta\left(C_{i j}, C_{r}\right)=\alpha_{i} \cdot \delta\left(C_{i}, C_{r}\right)+\alpha_{j} \cdot \delta\left(C_{j}, C_{r}\right)+ \\ \beta \cdot \delta\left(C_{i}, C_{j}\right)+\gamma \cdot\left|\delta\left(C_{i}, C_{r}\right)-\delta\left(C_{j}, C_{r}\right)\right| \end{array} δ(Cij,Cr)=αi⋅δ(Ci,Cr)+αj⋅δ(Cj,Cr)+β⋅δ(Ci,Cj)+γ⋅∣δ(Ci,Cr)−δ(Cj,Cr)∣
Measure | α i \alpha_i αi | α j \alpha_j αj | β \beta β | γ \gamma γ |
---|---|---|---|---|
简单连接 | 1 2 1\over2 21 | 1 2 1\over2 21 | 0 0 0 | − 1 2 -{1\over2} −21 |
完全连接 | 1 2 1\over2 21 | 1 2 1\over2 21 | 0 0 0 | 1 2 1\over2 21 |
组群平均 | n i n i + n j \frac{n_i}{n_i+n_j} ni+njni | n j n i + n j \frac{n_j}{n_i+n_j} ni+njnj | 0 0 0 | 0 0 0 |
均值距离 | n i n i + n j \frac{n_i}{n_i+n_j} ni+njni | n j n i + n j \frac{n_j}{n_i+n_j} ni+njnj | − n i n j ( n i + n j ) 2 \frac{-n_in_j}{(n_i+n_j)^2} (ni+nj)2−ninj | 0 0 0 |
极小方差 | n i + n r n i + n j + n r \frac{n_i+n_r}{n_i+n_j+n_r} ni+nj+nrni+nr | n j + n r n i + n j + n r \frac{n_j+n_r}{n_i+n_j+n_r} ni+nj+nrnj+nr | − n r n i + n j + n r \frac{-n_r}{n_i+n_j+n_r} ni+nj+nr−nr | 0 0 0 |
Proof:
简单连接
δ ( C i j , C r ) = min { δ ( x , y ) ∣ x ∈ C i j , y ∈ C r } = min { δ ( C i , C r ) , δ ( C j , C r ) } a = a + b − ∣ a − b ∣ 2 , b = a + b + ∣ a − b ∣ 2 \begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \min \{\delta({\mathbf{x}, \mathbf{y}} )|\mathbf{x}\in C_{ij}, \mathbf{y} \in C_r\} \\ &= \min \{\delta({C_{i}, C_{r}), \delta(C_{j}, C_{r}} )\} \end{aligned}\\ a=\frac{a+b-|a-b|}{2},b=\frac{a+b+|a-b|}{2} δ(Cij,Cr)=min{δ(x,y)∣x∈Cij,y∈Cr}=min{δ(Ci,Cr),δ(Cj,Cr)}a=2a+b−∣a−b∣,b=2a+b+∣a−b∣
完全连接
见上图
组群平均
δ ( C i j , C r ) = ∑ x ∈ C i ∪ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = ∑ x ∈ C i ∑ y ∈ C r δ ( x , y ) + ∑ x ∈ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = n i n r δ ( C i , C r ) + n j n r δ ( C j , C r ) ( n i + n j ) ⋅ n r = n i δ ( C i , C r ) + n j δ ( C j , C r ) ( n i + n j ) \begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \frac{\sum\limits_{\mathbf{x} \in C_i\cup C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})+\sum\limits_{\mathbf{x} \in C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &=\frac{n_in_r\delta(C_i,C_r)+n_jn_r\delta(C_j,C_r)}{(n_i+n_j )\cdot n_r}\\ &=\frac{n_i\delta(C_i,C_r)+n_j\delta(C_j,C_r)}{(n_i+n_j )} \end{aligned} δ(Cij,Cr)=(ni+nj)⋅nrx∈Ci∪Cj∑y∈Cr∑δ(x,y)=(ni+nj)⋅nrx∈Ci∑y∈Cr∑δ(x,y)+x∈Cj∑y∈Cr∑δ(x,y)=(ni+nj)⋅nrninrδ(Ci,Cr)+njnrδ(Cj,Cr)=(ni+nj)niδ(Ci,Cr)+njδ(Cj,Cr)
均值距离:作业
极小方差
基于均值距离的结论再代入 δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μi−μj∣∣2