数据挖掘与分析课程笔记(Chapter 14)

数据挖掘与分析课程笔记

  • 参考教材:Data Mining and Analysis : MOHAMMED J.ZAKI, WAGNER MEIRA JR.

文章目录

  1. 数据挖掘与分析课程笔记(目录)
  2. 数据挖掘与分析课程笔记(Chapter 1)
  3. 数据挖掘与分析课程笔记(Chapter 2)
  4. 数据挖掘与分析课程笔记(Chapter 5)
  5. 数据挖掘与分析课程笔记(Chapter 7)
  6. 数据挖掘与分析课程笔记(Chapter 14)
  7. 数据挖掘与分析课程笔记(Chapter 15)
  8. 数据挖掘与分析课程笔记(Chapter 20)
  9. 数据挖掘与分析课程笔记(Chapter 21)

笔记目录

  • 数据挖掘与分析课程笔记
  • 文章目录
  • Chapter 14:Hierarchical Clustering 分层聚类
    • 14.1 预备
    • 14.2 团聚分层聚类
    • 事实:算法14.1 的复杂度为 O ( n 2 log ⁡ n ) O(n^2\log n) O(n2logn)


Chapter 14:Hierarchical Clustering 分层聚类

14.1 预备

Def.1 给定数据集 D = { x 1 , x 2 , ⋯   , x n } , ( x i ∈ R d ) \mathbf{D}=\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\},(\mathbf{x}_i\in \mathbb{R}^d) D={x1,x2,,xn},(xiRd) D \mathbf{D} D 的一个聚类是指 D \mathbf{D} D 的划分 C = { C 1 , C 2 , ⋯   , C k } \mathcal{C}=\{C_1,C_2,\cdots,C_k \} C={C1,C2,,Ck} s.t. C i ⊆ D , C i ∩ C j = ∅ , ∪ i = 1 k C i = D C_i\subseteq \mathbf{D},C_i \cap C_j=\emptyset, \cup_{i=1}^k C_i=\mathbf{D} CiD,CiCj=,i=1kCi=D

称聚类 A = { A 1 , ⋯   , A r } \mathcal{A}=\{A_1,\cdots,A_r\} A={A1,,Ar} 是聚类 B = { B 1 , ⋯   , B s } \mathcal{B}=\{B_1,\cdots,B_s\} B={B1,,Bs} 的嵌套,如果 r > s r>s r>s,且对于 ∀ A i ∈ A \forall A_i \in \mathcal{A} AiA ,存在 B j ∈ B B_j \in \mathcal{B} BjB 使得 A i ⊆ B j A_i \subseteq B_j AiBj

D \mathbf{D} D 的分层聚类是指一个嵌套聚类序列 C 1 , ⋯   , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,,Cn,其中 C 1 = { { x 1 } , { x 2 } , ⋯   , { x n } } , ⋯   , C n = { { x 1 , x 2 , ⋯   , x n } } \mathcal{C}_1=\{ \{\mathbf{x}_1\},\{\mathbf{x}_2\},\cdots,\{\mathbf{x}_n\}\},\cdots,\mathcal{C}_n=\{\{ \mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_n\} \} C1={{x1},{x2},,{xn}},,Cn={{x1,x2,,xn}},且 C t \mathcal{C}_t Ct C t + 1 \mathcal{C}_{t+1} Ct+1 的嵌套。

Def.2 分层聚类示图的顶点集是指所有在 C 1 , ⋯   , C n \mathcal{C}_1,\cdots,\mathcal{C}_n C1,,Cn 中出现的元,如果 C i ∈ C t C_i \in \mathcal{C}_t CiCt C j ∈ C t + 1 C_j \in \mathcal{C}_{t+1} CjCt+1 满足,则 C i C_i Ci C j C_j Cj 之间有一条边。

数据挖掘与分析课程笔记(Chapter 14)_第1张图片

事实:

  1. 分层聚类示图是一棵二叉树(不一定,作为假设,假设每层只聚两类),分层聚类与其示图一一对应。
  2. 设(即数据点数为 n n n),则所有可能的分层聚类示图数目为 ( 2 n − 3 ) ! ! (2n-3)!! (2n3)!! (跃乘 1 × 3 × 5 × ⋯ 1\times 3 \times 5 \times \cdots 1×3×5×

14.2 团聚分层聚类

算法14.1 :

输入: D , k \mathbf{D}, k D,k

输出: C \mathcal{C} C

  1. C ← { C i = { x i } ∣ x i ∈ D } \mathcal{C} \leftarrow \{C_i=\{\mathbf{x}_i\}|\mathbf{x}_i \in \mathbf{D} \} C{Ci={xi}xiD}
  2. Δ ← { δ ( x i , x j ) : x i , x j ∈ D } \Delta \leftarrow \{\delta(\mathbf{x}_i,\mathbf{x}_j):\mathbf{x}_i,\mathbf{x}_j \in \mathbf{D} \} Δ{δ(xi,xj):xi,xjD}
  3. repeat
  4. ​ 寻找最近的对 C i , C j ∈ C C_i,C_j \in \mathcal{C} Ci,CjC
  5. C i j ← C i ∪ C j C_{ij}\leftarrow C_i \cup C_j CijCiCj
  6. C ← ( C ∣ { C i , C j } ) ∪ C i j \mathcal{C}\leftarrow (\mathcal{C} | \{C_i,C_j \}) \cup {C_{ij}} C(C{Ci,Cj})Cij
  7. ​ 根据 C \mathcal{C} C 更新距离矩阵 Δ \Delta Δ
  8. Until ∣ C ∣ = k |\mathcal{C}|=k C=k

问题:如何定义/计算 C i , C j C_i,C_j Ci,Cj 的距离,即 δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) ?

δ ( C i , C j ) \delta(C_i,C_j) δ(Ci,Cj) 有以下五种不同方式:

  1. 简单连接: δ ( C i , C j ) : = min ⁡ { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \min \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=min{δ(x,y)xCi,yCj}

  2. 完全连接: δ ( C i , C j ) : = max ⁡ { δ ( x , y ) ∣ x ∈ C i , y ∈ C j } \delta(C_i,C_j):= \max \{\delta(\mathbf{x},\mathbf{y}) | \mathbf{x} \in C_i, \mathbf{y} \in C_j\} δ(Ci,Cj):=max{δ(x,y)xCi,yCj}

  3. 组群平均: δ ( C i , C j ) : = ∑ x ∈ C i ∑ y ∈ C j δ ( x , y ) n i ⋅ n j , n i = ∣ C i ∣ , n j = ∣ C j ∣ \delta(C_i,C_j):= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_j}\delta(\mathbf{x},\mathbf{y})}{n_i \cdot n_j}, n_i=|C_i|,n_j=|C_j| δ(Ci,Cj):=ninjxCiyCjδ(x,y),ni=Ci,nj=Cj

  4. 均值距离: δ ( C i , C j ) : = ∣ ∣ μ i − μ j ∣ ∣ 2 , μ i = 1 n ∑ x ∈ C i x , μ j = 1 n ∑ y ∈ C j y \delta(C_i,C_j):= ||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2,\boldsymbol{\mu}_i=\frac{1}{n}\sum\limits_{\mathbf{x} \in C_i}\mathbf{x},\boldsymbol{\mu}_j=\frac{1}{n}\sum\limits_{\mathbf{y} \in C_j}\mathbf{y} δ(Ci,Cj):=∣∣μiμj2,μi=n1xCix,μj=n1yCjy

  5. 极小方差:对任意 C i C_i Ci,定义平方误差和 S S E i = ∑ x ∈ C i ∣ ∣ x − μ i ∣ ∣ 2 SSE_i= \sum\limits_{\mathbf{x} \in C_i} ||\mathbf{x}-\boldsymbol{\mu}_i|| ^2 SSEi=xCi∣∣xμi2

    C i , C j , S S E i j : = ∑ x ∈ C i ∪ C j ∣ ∣ x − μ i j ∣ ∣ 2 C_i,C_j,SSE_{ij}:=\sum\limits_{\mathbf{x} \in C_i\cup C_j} ||\mathbf{x}-\boldsymbol{\mu}_{ij}|| ^2 Ci,Cj,SSEij:=xCiCj∣∣xμij2,其中 μ i j : = 1 n i + n j ∑ x ∈ C i ∪ C j x \boldsymbol{\mu}_{ij}:=\frac{1}{n_i+n_j}\sum\limits_{\mathbf{x} \in C_i\cup C_j}\mathbf{x} μij:=ni+nj1xCiCjx

    δ ( C i , C j ) : = S S E i j − S S E i − S S E j \delta(C_i,C_j):=SSE_{ij}-SSE_i-SSE_j δ(Ci,Cj):=SSEijSSEiSSEj

证明: δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μiμj2

简记: C i j : = C i ∪ C j , n i j : = n i + n j C_{ij}:=C_i\cup C_j,n_{ij}:=n_i+n_j Cij:=CiCj,nij:=ni+nj

注意: C i ∩ C j = ∅ C_i \cap C_j=\emptyset CiCj=,故 ∣ C i j ∣ = n i + n j |C_{ij}|=n_i+n_j Cij=ni+nj
δ ( C i , C j ) = ∑ z ∈ C i j ∥ z − μ i j ∥ 2 − ∑ x ∈ C i ∥ x − μ i ∥ 2 − ∑ y ∈ C j ∥ y − μ j ∥ 2 = ∑ z ∈ C i j z T z − n i j μ i j T μ i j − ∑ x ∈ C i x T x + n i μ i T μ i − ∑ y ∈ C j y T y + n j μ j T μ j = n i μ i T μ i + n j μ j T μ j − ( n i + n j ) μ i j T μ i j \begin{aligned} \delta\left(C_{i}, C_{j}\right) &=\sum_{\mathbf{z} \in C_{i j}}\left\|\mathbf{z}-\boldsymbol{\mu}_{i j}\right\|^{2}-\sum_{\mathbf{x} \in C_{i}}\left\|\mathbf{x}-\boldsymbol{\mu}_{i}\right\|^{2}-\sum_{\mathbf{y} \in C_{j}}\left\|\mathbf{y}-\boldsymbol{\mu}_{j}\right\|^{2} \\ &=\sum_{\mathbf{z} \in C_{i j}} \mathbf{z}^{T} \mathbf{z}-n_{i j} \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}-\sum_{\mathbf{x} \in C_{i}} \mathbf{x}^{T} \mathbf{x}+n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-\sum_{\mathbf{y} \in C_{j}} \mathbf{y}^{T} \mathbf{y}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j} \\ &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j} \end{aligned} δ(Ci,Cj)=zCij zμij 2xCixμi2yCj yμj 2=zCijzTznijμijTμijxCixTx+niμiTμiyCjyTy+njμjTμj=niμiTμi+njμjTμj(ni+nj)μijTμij
注意到: μ i j = 1 n i j ∑ z ∈ C i j z = 1 n i + n j ( ∑ x ∈ C i x + ∑ y ∈ C j y ) = 1 n i + n j ( n i μ i + n j μ j ) \boldsymbol{\mu}_{i j}=\frac{1}{n_{ij}}\sum\limits_{\mathbf{z} \in C_{ij}} \mathbf{z}=\frac{1}{n_i+n_j}(\sum\limits_{\mathbf{x} \in C_{i}} \mathbf{x}+\sum\limits_{\mathbf{y} \in C_{j}} \mathbf{y})=\frac{1}{n_i+n_j}(n_i\boldsymbol{\mu}_{i}+n_j\boldsymbol{\mu}_{j}) μij=nij1zCijz=ni+nj1(xCix+yCjy)=ni+nj1(niμi+njμj)

故: μ i j T μ i j = 1 ( n i + n j ) 2 ( n i 2 μ i T μ i + 2 n i n j μ i T μ j + n j 2 μ j T μ j ) \boldsymbol{\mu}_{i j}^{T} \boldsymbol{\mu}_{i j}=\frac{1}{\left(n_{i}+n_{j}\right)^{2}}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) μijTμij=(ni+nj)21(ni2μiTμi+2ninjμiTμj+nj2μjTμj)
δ ( C i , C j ) = n i μ i T μ i + n j μ j T μ j − 1 ( n i + n j ) ( n i 2 μ i T μ i + 2 n i n j μ i T μ j + n j 2 μ j T μ j ) = n i ( n i + n j ) μ i T μ i + n j ( n i + n j ) μ j T μ j − n i 2 μ i T μ i − 2 n i n j μ i T μ j − n j 2 μ j T μ j n i + n j = n i n j ( μ i T μ i − 2 μ i T μ j + μ j T μ j ) n i + n j = ( n i n j n i + n j ) ∥ μ i − μ j ∥ 2 \begin{aligned} \delta\left(C_{i}, C_{j}\right) &=n_{i} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j} \mu_{j}^{T} \boldsymbol{\mu}_{j}-\frac{1}{\left(n_{i}+n_{j}\right)}\left(n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right) \\ &=\frac{n_{i}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}+n_{j}\left(n_{i}+n_{j}\right) \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}-n_{i}^{2} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 n_{i} n_{j} \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}-n_{j}^{2} \boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}}{n_{i}+n_{j}} \\ &=\frac{n_{i} n_{j}\left(\boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{i}-2 \boldsymbol{\mu}_{i}^{T} \boldsymbol{\mu}_{j}+\boldsymbol{\mu}_{j}^{T} \boldsymbol{\mu}_{j}\right)}{n_{i}+n_{j}} \\ &=\left(\frac{n_{i} n_{j}}{n_{i}+n_{j}}\right)\left\|\boldsymbol{\mu}_{i}-\boldsymbol{\mu}_{j}\right\|^{2} \end{aligned} δ(Ci,Cj)=niμiTμi+njμjTμj(ni+nj)1(ni2μiTμi+2ninjμiTμj+nj2μjTμj)=ni+njni(ni+nj)μiTμi+nj(ni+nj)μjTμjni2μiTμi2ninjμiTμjnj2μjTμj=ni+njninj(μiTμi2μiTμj+μjTμj)=(ni+njninj) μiμj 2
问题:如何快速计算算法14.1 第7步:更新矩阵?

Lance–Williams formula
δ ( C i j , C r ) = α i ⋅ δ ( C i , C r ) + α j ⋅ δ ( C j , C r ) + β ⋅ δ ( C i , C j ) + γ ⋅ ∣ δ ( C i , C r ) − δ ( C j , C r ) ∣ \begin{array}{r} \delta\left(C_{i j}, C_{r}\right)=\alpha_{i} \cdot \delta\left(C_{i}, C_{r}\right)+\alpha_{j} \cdot \delta\left(C_{j}, C_{r}\right)+ \\ \beta \cdot \delta\left(C_{i}, C_{j}\right)+\gamma \cdot\left|\delta\left(C_{i}, C_{r}\right)-\delta\left(C_{j}, C_{r}\right)\right| \end{array} δ(Cij,Cr)=αiδ(Ci,Cr)+αjδ(Cj,Cr)+βδ(Ci,Cj)+γδ(Ci,Cr)δ(Cj,Cr)

Measure α i \alpha_i αi α j \alpha_j αj β \beta β γ \gamma γ
简单连接 1 2 1\over2 21 1 2 1\over2 21 0 0 0 − 1 2 -{1\over2} 21
完全连接 1 2 1\over2 21 1 2 1\over2 21 0 0 0 1 2 1\over2 21
组群平均 n i n i + n j \frac{n_i}{n_i+n_j} ni+njni n j n i + n j \frac{n_j}{n_i+n_j} ni+njnj 0 0 0 0 0 0
均值距离 n i n i + n j \frac{n_i}{n_i+n_j} ni+njni n j n i + n j \frac{n_j}{n_i+n_j} ni+njnj − n i n j ( n i + n j ) 2 \frac{-n_in_j}{(n_i+n_j)^2} (ni+nj)2ninj 0 0 0
极小方差 n i + n r n i + n j + n r \frac{n_i+n_r}{n_i+n_j+n_r} ni+nj+nrni+nr n j + n r n i + n j + n r \frac{n_j+n_r}{n_i+n_j+n_r} ni+nj+nrnj+nr − n r n i + n j + n r \frac{-n_r}{n_i+n_j+n_r} ni+nj+nrnr 0 0 0

Proof:

  1. 简单连接
    δ ( C i j , C r ) = min ⁡ { δ ( x , y ) ∣ x ∈ C i j , y ∈ C r } = min ⁡ { δ ( C i , C r ) , δ ( C j , C r ) } a = a + b − ∣ a − b ∣ 2 , b = a + b + ∣ a − b ∣ 2 \begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \min \{\delta({\mathbf{x}, \mathbf{y}} )|\mathbf{x}\in C_{ij}, \mathbf{y} \in C_r\} \\ &= \min \{\delta({C_{i}, C_{r}), \delta(C_{j}, C_{r}} )\} \end{aligned}\\ a=\frac{a+b-|a-b|}{2},b=\frac{a+b+|a-b|}{2} δ(Cij,Cr)=min{δ(x,y)xCij,yCr}=min{δ(Ci,Cr),δ(Cj,Cr)}a=2a+bab,b=2a+b+ab
    在这里插入图片描述

  2. 完全连接

    见上图

  3. 组群平均
    δ ( C i j , C r ) = ∑ x ∈ C i ∪ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = ∑ x ∈ C i ∑ y ∈ C r δ ( x , y ) + ∑ x ∈ C j ∑ y ∈ C r δ ( x , y ) ( n i + n j ) ⋅ n r = n i n r δ ( C i , C r ) + n j n r δ ( C j , C r ) ( n i + n j ) ⋅ n r = n i δ ( C i , C r ) + n j δ ( C j , C r ) ( n i + n j ) \begin{aligned} \delta\left(C_{i j}, C_{r}\right) &= \frac{\sum\limits_{\mathbf{x} \in C_i\cup C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &= \frac{\sum\limits_{\mathbf{x} \in C_i}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})+\sum\limits_{\mathbf{x} \in C_j}\sum\limits_{\mathbf{y} \in C_r}\delta(\mathbf{x},\mathbf{y})}{(n_i+n_j )\cdot n_r} \\ &=\frac{n_in_r\delta(C_i,C_r)+n_jn_r\delta(C_j,C_r)}{(n_i+n_j )\cdot n_r}\\ &=\frac{n_i\delta(C_i,C_r)+n_j\delta(C_j,C_r)}{(n_i+n_j )} \end{aligned} δ(Cij,Cr)=(ni+nj)nrxCiCjyCrδ(x,y)=(ni+nj)nrxCiyCrδ(x,y)+xCjyCrδ(x,y)=(ni+nj)nrninrδ(Ci,Cr)+njnrδ(Cj,Cr)=(ni+nj)niδ(Ci,Cr)+njδ(Cj,Cr)

  4. 均值距离:作业

  5. 极小方差

    基于均值距离的结论再代入 δ ( C i , C j ) = n i n j n i + n j ∣ ∣ μ i − μ j ∣ ∣ 2 \delta(C_i,C_j)=\frac{n_in_j}{n_i+n_j}||\boldsymbol{\mu}_i-\boldsymbol{\mu}_j|| ^2 δ(Ci,Cj)=ni+njninj∣∣μiμj2

事实:算法14.1 的复杂度为 O ( n 2 log ⁡ n ) O(n^2\log n) O(n2logn)

你可能感兴趣的:(数学,数据挖掘,聚类)