Data Mining and Analysis Course Notes (Chapter 7)

Data Mining and Analysis Course Notes

  • Reference textbook: Data Mining and Analysis by Mohammed J. Zaki and Wagner Meira Jr.

Article Index

  1. Data Mining and Analysis Course Notes (Contents)
  2. Data Mining and Analysis Course Notes (Chapter 1)
  3. Data Mining and Analysis Course Notes (Chapter 2)
  4. Data Mining and Analysis Course Notes (Chapter 5)
  5. Data Mining and Analysis Course Notes (Chapter 7)
  6. Data Mining and Analysis Course Notes (Chapter 14)
  7. Data Mining and Analysis Course Notes (Chapter 15)
  8. Data Mining and Analysis Course Notes (Chapter 20)
  9. Data Mining and Analysis Course Notes (Chapter 21)

Notes Contents

  • Data Mining and Analysis Course Notes
  • Article Index
  • Chapter 7: Dimensionality Reduction
    • 7.1 Background
    • 7.2 Principal Component Analysis
      • 7.2.1 Best Line Approximation
      • 7.2.2 Best 2-Dimensional Approximation
      • 7.2.3 Generalization
    • 7.3 Kernel PCA: Kernel Principal Component Analysis


Chapter 7: Dimensionality Reduction

PCA: Principal Component Analysis

7.1 Background

$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n1} & x_{n2} & \cdots & x_{nd} \end{array}\right)$$

The data points are $\mathbf{x}_1,\cdots,\mathbf{x}_n \in \mathbb{R}^d$ (row $i$ of $\mathbf{D}$ is $\mathbf{x}_i^T$). For any $\mathbf{x}\in\mathbb{R}^d$,

$$\mathbf{x}=(x_1,\cdots,x_d)^T=\sum\limits_{i=1}^{d}x_i\mathbf{e}_i$$

where $\mathbf{e}_i=(0,\cdots,1,\cdots,0)^T\in\mathbb{R}^d$ is the $i$-th standard basis vector, with the 1 in coordinate $i$.

Now let $\{\mathbf{u}_i\}_{i=1}^d$ be another orthonormal basis, so that $\mathbf{x}=\sum\limits_{i=1}^{d}a_i\mathbf{u}_i$ with $a_i\in\mathbb{R}$, where

$$\mathbf{u}_i^T\mathbf{u}_j=\begin{cases}1, & i=j\\ 0, & i\ne j\end{cases}$$

For any $r$ with $1\le r\le d$:

$$\mathbf{x}=\underbrace{a_1\mathbf{u}_1+\cdots+a_r\mathbf{u}_r}_{\text{projection}}+\underbrace{a_{r+1}\mathbf{u}_{r+1}+\cdots+a_d\mathbf{u}_d}_{\text{error}}$$

The first $r$ terms form the projection; the remaining terms are the projection error.

Goal: given $\mathbf{D}$, find the optimal basis $\{\mathbf{u}_i\}_{i=1}^d$ such that the projection of $\mathbf{D}$ onto the subspace spanned by its first $r$ vectors is the "best approximation" of $\mathbf{D}$, i.e., the projection error is minimized. A short numpy illustration of this projection/error split follows.
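To make the decomposition concrete, here is a minimal numpy sketch (the random data and the QR-generated basis are our own illustrative choices, not from the text): any $\mathbf{x}$ splits exactly into its projection onto the first $r$ basis vectors plus the residual error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2

# An arbitrary orthonormal basis {u_1, ..., u_d}: QR of a random matrix.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)
a = U.T @ x                     # coordinates a_i = u_i^T x in the new basis

proj = U[:, :r] @ a[:r]         # a_1 u_1 + ... + a_r u_r        (projection)
err  = U[:, r:] @ a[r:]         # a_{r+1} u_{r+1} + ... + a_d u_d (error)

assert np.allclose(proj + err, x)   # x is recovered exactly
```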

7.2 Principal Component Analysis

7.2.1 Best Line Approximation

(first principal component; $r=1$)

Goal: find $\mathbf{u}_1$; for brevity, write it as $\mathbf{u}=(u_1,\cdots,u_d)^T$.

Assumptions: $\|\mathbf{u}\|=1$ (i.e., $\mathbf{u}^T\mathbf{u}=1$), and the data are centered: $\hat{\boldsymbol{\mu}}=\frac{1}{n}\sum\limits_{i=1}^n\mathbf{x}_i=\mathbf{0}\in\mathbb{R}^{d}$.

For each $\mathbf{x}_i$ ($i=1,\cdots,n$), the projection of $\mathbf{x}_i$ onto the direction $\mathbf{u}$ is:

$$\mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{u}^{T}\mathbf{x}_{i}}{\mathbf{u}^{T}\mathbf{u}}\right)\mathbf{u}=\left(\mathbf{u}^{T}\mathbf{x}_{i}\right)\mathbf{u}=a_{i}\mathbf{u},\quad a_{i}=\mathbf{u}^{T}\mathbf{x}_{i}$$

Since $\hat{\boldsymbol{\mu}}=\mathbf{0}$, the projection of $\hat{\boldsymbol{\mu}}$ onto $\mathbf{u}$ is $0$, and the mean of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ is $\mathbf{0}$:

$$\mathrm{Proj}(\mathrm{mean}(\mathbf{D}))=\mathrm{mean}(\mathrm{Proj}(\mathbf{D}))$$

Consider the sample variance of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ along the direction $\mathbf{u}$:

$$\begin{aligned} \sigma_{\mathbf{u}}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(a_{i}-\mu_{\mathbf{u}}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{u}^{T} \mathbf{x}_{i}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{u}^{T}\left(\mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T}\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \end{aligned}$$

where $\mathbf{\Sigma}$ is the sample covariance matrix and $\mu_{\mathbf{u}}=0$ is the mean of the $a_i$.

Objective:

$$\begin{array}{ll} \max\limits_{\mathbf{u}} & \mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \\ \text{s.t.} & \mathbf{u}^T\mathbf{u}-1=0 \end{array}$$
Applying the method of Lagrange multipliers:

$$\max_{\mathbf{u}} J(\mathbf{u})=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)$$

Setting the partial derivative to zero:

$$\begin{aligned} \frac{\partial}{\partial \mathbf{u}} J(\mathbf{u}) &=\mathbf{0} \\ \frac{\partial}{\partial \mathbf{u}}\left(\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)\right) &=\mathbf{0} \\ 2 \mathbf{\Sigma} \mathbf{u}-2 \lambda \mathbf{u} &=\mathbf{0} \\ \mathbf{\Sigma} \mathbf{u} &=\lambda \mathbf{u} \end{aligned}$$
Note that $\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}=\mathbf{u}^{T} \lambda \mathbf{u}=\lambda$.

Hence the optimization problem is solved by taking $\lambda$ to be the largest eigenvalue of $\mathbf{\Sigma}$ and $\mathbf{u}$ the corresponding unit eigenvector.

Question: does the $\mathbf{u}$ that maximizes $\sigma_{\mathbf{u}}^{2}$ above also minimize the projection error?

Define the mean squared error (MSE):
$$\begin{aligned} MSE(\mathbf{u}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right)^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} \mathbf{x}_{i}^{\prime}+\left(\mathbf{x}_{i}^{\prime}\right)^{T} \mathbf{x}_{i}^{\prime}\right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}+\left[(\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right]^{T} \left[ (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right] \right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{x}_{i}^{T} \mathbf{u}+(\mathbf{u}^{T} \mathbf{x}_{i})(\mathbf{x}_{i}^{T} \mathbf{u})\mathbf{u}^{T}\mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{x}_{i}\mathbf{x}_{i}^{T} \mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}\\ &= \mathrm{var}(\mathbf{D})-\sigma_{\mathbf{u}}^{2} \end{aligned}$$
This shows that $\mathrm{var}(\mathbf{D})=\sigma_{\mathbf{u}}^{2}+MSE(\mathbf{u})$: since $\mathrm{var}(\mathbf{D})$ is fixed, maximizing the projected variance is equivalent to minimizing the MSE.

Geometric meaning of $\mathbf{u}$: the direction in $\mathbb{R}^d$ along which the projected data has maximum variance and, equivalently, minimum MSE.

$\mathbf{u}$ is called the first principal component.
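As a numerical sanity check of 7.2.1 (a sketch on synthetic data of our own choosing), the top unit eigenvector of $\mathbf{\Sigma}$ attains $\sigma_{\mathbf{u}}^2=\lambda_{\max}$, and the decomposition $\mathrm{var}(\mathbf{D})=\sigma_{\mathbf{u}}^2+MSE(\mathbf{u})$ holds:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)                     # center the data

Sigma = (X.T @ X) / len(X)                 # sample covariance (1/n convention)
evals, evecs = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
u = evecs[:, -1]                           # eigenvector of the largest eigenvalue

a = X @ u                                  # projection coordinates a_i = u^T x_i
var_u = np.mean(a**2)                      # sigma_u^2 (data are centered)
mse = np.mean(np.sum((X - np.outer(a, u))**2, axis=1))
var_D = np.mean(np.sum(X**2, axis=1))      # total variance var(D)

assert np.isclose(var_u, evals[-1])        # sigma_u^2 equals lambda_max
assert np.isclose(var_D, var_u + mse)      # var(D) = sigma_u^2 + MSE
```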

7.2.2 Best 2-Dimensional Approximation

(second principal component; $r=2$)

Suppose $\mathbf{u}_1$ has already been found, i.e., the eigenvector corresponding to the largest eigenvalue of $\mathbf{\Sigma}$.

Goal: find $\mathbf{u}_2$, written simply as $\mathbf{v}$, such that $\mathbf{v}^{T} \mathbf{u}_{1}=0$ and $\mathbf{v}^{T} \mathbf{v}=1$.

Consider the variance of the projections of $\mathbf{x}_{i}$ along the direction $\mathbf{v}$:

$$\begin{array}{ll} \max\limits_{\mathbf{v}} & \sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} \\ \text{s.t.} & \mathbf{v}^T\mathbf{v}-1=0\\ & \mathbf{v}^{T} \mathbf{u}_{1}=0 \end{array}$$

Define the Lagrangian: $J(\mathbf{v})=\mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v}-\alpha\left(\mathbf{v}^{T} \mathbf{v}-1\right)-\beta\left(\mathbf{v}^{T} \mathbf{u}_{1}-0\right)$

v \mathbf{v} v 求偏导得:
2 Σ v − 2 α v − β u 1 = 0 2 \Sigma \mathbf{v}-2 \alpha \mathbf{v}-\beta \mathbf{u}_{1}=\mathbf{0} v2αvβu1=0
Multiplying both sides on the left by $\mathbf{u}_{1}^{T}$ (using $\mathbf{u}_1^T\mathbf{v}=0$, $\mathbf{u}_1^T\mathbf{u}_1=1$, and the symmetry of $\mathbf{\Sigma}$):

$$\begin{aligned} 2 \mathbf{u}_{1}^{T}\mathbf{\Sigma} \mathbf{v}-2 \alpha \mathbf{u}_{1}^{T}\mathbf{v}-\beta \mathbf{u}_{1}^{T}\mathbf{u}_{1} &=0 \\ 2 \mathbf{u}_{1}^{T}\mathbf{\Sigma} \mathbf{v}-\beta &= 0\\ 2 \mathbf{v}^{T}\mathbf{\Sigma} \mathbf{u}_{1}-\beta &= 0\\ 2 \mathbf{v}^{T}\lambda_1 \mathbf{u}_{1}-\beta &= 0\\ \beta &= 0 \end{aligned}$$
Substituting back into the original equation:

$$2 \mathbf{\Sigma} \mathbf{v}-2 \alpha \mathbf{v}=\mathbf{0} \quad\Longrightarrow\quad \mathbf{\Sigma} \mathbf{v}=\alpha \mathbf{v}$$

So $\mathbf{v}$ is also an eigenvector of $\mathbf{\Sigma}$.

Since $\sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} =\alpha$, we take $\alpha$ to be the second-largest eigenvalue of $\mathbf{\Sigma}$, with $\mathbf{v}$ the corresponding unit eigenvector.

Question 1: does this $\mathbf{v}$ (i.e., $\mathbf{u}_2$), taken together with $\mathbf{u}_1$, maximize the total variance of the projection of $\mathbf{D}$ onto $\mathrm{span}\{\mathbf{u}_1, \mathbf{u}_2\}$?

$$\mathbf{x}_i=\underbrace{a_{i1} \mathbf{u}_1+a_{i2}\mathbf{u}_2}_{\text{projection}}+\cdots$$

The coordinates of the projection of $\mathbf{x}_i$ onto $\mathrm{span}\{\mathbf{u}_1, \mathbf{u}_2\}$ are $\mathbf{a}_{i}=(a_{i1},a_{i2})^T=(\mathbf{u}_1^{T}\mathbf{x}_i,\mathbf{u}_2^{T}\mathbf{x}_i)^{T}$.

U 2 = ( ∣ ∣ u 1 u 2 ∣ ∣ ) \mathbf{U}_{2}=\left(\begin{array}{cc} \mid & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} \\ \mid & \mid \end{array}\right) U2= u1u2 ,则 a i = U 2 T x i \mathbf{a}_{i}=\mathbf{U}_{2}^{T} \mathbf{x}_{i} ai=U2Txi

The total projected variance is:

$$\begin{aligned} \operatorname{var}(\mathbf{A}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{a}_{i}-\mathbf{0}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right)^{T}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left( \mathbf{u}_{1}\mathbf{u}_{1}^T + \mathbf{u}_{2}\mathbf{u}_{2}^T \right) \mathbf{x}_{i}\\ &=\mathbf{u}_{1}^T\mathbf{\Sigma} \mathbf{u}_{1} + \mathbf{u}_{2}^T\mathbf{\Sigma} \mathbf{u}_{2}\\ &= \lambda_1 +\lambda_2 \end{aligned}$$
Question 2: is the mean squared error minimized?

Here $\mathbf{x}_{i}^{\prime}=\mathbf{U}_{2}\mathbf{U}_{2}^{T} \mathbf{x}_{i}$, so:

$$\begin{aligned} MSE &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2} - \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &= \mathrm{var}(\mathbf{D}) - \lambda_1 - \lambda_2 \end{aligned}$$
Conclusions (verified numerically in the sketch below):

  1. The sum of the top $r$ eigenvalues of $\mathbf{\Sigma}$, $\lambda_1+\cdots+\lambda_r$ ($\lambda_1\ge\cdots\ge\lambda_r$), gives the maximum total projected variance;
  2. $\mathrm{var}(\mathbf{D})-\sum\limits_{i=1}^r \lambda_i$ gives the minimum MSE;
  3. The eigenvectors $\mathbf{u}_{1},\cdots,\mathbf{u}_{r}$ corresponding to $\lambda_1,\cdots,\lambda_r$ span the best $r$-dimensional principal subspace.
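A numpy sketch checking conclusions 1 and 2 for $r=2$ (synthetic data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 4)) @ np.diag([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

Sigma = (X.T @ X) / len(X)
evals, evecs = np.linalg.eigh(Sigma)
U2 = evecs[:, ::-1][:, :2]                   # u_1, u_2: top-2 eigenvectors

A = X @ U2                                   # projected coordinates a_i = U_2^T x_i
var_A = np.mean(np.sum(A**2, axis=1))        # total projected variance
mse = np.mean(np.sum((X - A @ U2.T)**2, axis=1))
var_D = np.mean(np.sum(X**2, axis=1))

lam1, lam2 = evals[-1], evals[-2]
assert np.isclose(var_A, lam1 + lam2)        # conclusion 1
assert np.isclose(mse, var_D - lam1 - lam2)  # conclusion 2
```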

7.2.3 Generalization

Let $\mathbf{\Sigma}_{d\times d}$ have eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, with the data centered. Then:

$\sum\limits_{i=1}^r\lambda_i$: maximum total projected variance;

$\mathrm{var}(\mathbf{D})-\sum\limits_{i=1}^r\lambda_i$: minimum MSE.

In practice: to choose an appropriate $r$, compare the captured-variance ratio $\frac{\sum\limits_{i=1}^r\lambda_i}{\mathrm{var}(\mathbf{D})}$ against a given threshold $\alpha$ (note that $\mathrm{var}(\mathbf{D})=\sum\limits_{i=1}^d\lambda_i$), as in the short sketch below.
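For instance, the smallest such $r$ can be read off the cumulative eigenvalue fractions (a sketch; the eigenvalues and threshold below are made-up values):

```python
import numpy as np

evals = np.array([4.1, 2.3, 0.9, 0.4, 0.3])  # lambda_1 >= ... >= lambda_d (hypothetical)
alpha = 0.9
frac = np.cumsum(evals) / evals.sum()        # var(D) = sum of all eigenvalues
r = int(np.argmax(frac >= alpha)) + 1        # smallest r with fraction >= alpha
print(r, frac)                               # r = 3 here
```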

Algorithm 7.1 PCA:

Input: $\mathbf{D}$, $\alpha$

Output: $A$ (the reduced-dimension data)

  1. $\boldsymbol{\mu} = \frac{1}{n}\sum\limits_{i=1}^n\mathbf{x}_i$;
  2. $\mathbf{Z}=\mathbf{D}-\mathbf{1}\cdot \boldsymbol{\mu}^T$ (center the data);
  3. $\mathbf{\Sigma}=\frac{1}{n}(\mathbf{Z}^T\mathbf{Z})$;
  4. $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \longleftarrow$ the eigenvalues of $\mathbf{\Sigma}$, in decreasing order;
  5. $\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_d \longleftarrow$ the corresponding orthonormal eigenvectors of $\mathbf{\Sigma}$;
  6. Compute $\frac{\sum\limits_{i=1}^r\lambda_i}{\mathrm{var}(\mathbf{D})}$ and choose the smallest $r$ for which this ratio is at least $\alpha$;
  7. $\mathbf{U}_r=(\mathbf{u}_1,\mathbf{u}_2,\cdots,\mathbf{u}_r)$;
  8. $A=\{\mathbf{a}_i \mid \mathbf{a}_i=\mathbf{U}_r^T\mathbf{x}_i,\ i=1,\cdots,n\}$
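A direct Python transcription of Algorithm 7.1 (a sketch with numpy; the function name and the demo data are our own). Note that step 8 projects the centered points, consistent with the assumption $\hat{\boldsymbol{\mu}}=\mathbf{0}$ used throughout:

```python
import numpy as np

def pca(D: np.ndarray, alpha: float):
    """Algorithm 7.1: rows of D are the points x_1, ..., x_n."""
    n = len(D)
    mu = D.mean(axis=0)                          # step 1: mean
    Z = D - mu                                   # step 2: center the data
    Sigma = (Z.T @ Z) / n                        # step 3: covariance matrix
    evals, evecs = np.linalg.eigh(Sigma)         # steps 4-5: eigen-decomposition
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort in decreasing order
    frac = np.cumsum(evals) / evals.sum()        # step 6: fraction of total variance
    r = int(np.argmax(frac >= alpha)) + 1        # smallest r reaching alpha
    U_r = evecs[:, :r]                           # step 7: reduced basis
    A = Z @ U_r                                  # step 8: a_i = U_r^T x_i (centered)
    return A, U_r, evals

A, U_r, evals = pca(np.random.default_rng(0).standard_normal((200, 5)), alpha=0.9)
```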

7.3 Kernel PCA: Kernel Principal Component Analysis

Feature map: $\phi:\mathcal{I}\to \mathcal{F}\subseteq \mathbb{R}^d$

Kernel function: $K:\mathcal{I}\times\mathcal{I}\to \mathbb{R}$, with

$$K(\mathbf{x}_i,\mathbf{x}_j)=\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$$

Given the kernel matrix $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{x}_j)]_{n\times n}$; the covariance matrix in feature space is $\mathbf{\Sigma}_{\phi}=\frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T$.

The objects are $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)\in \mathbb{R}^d$. Assume they are centered, $\frac{1}{n}\sum\limits_{i=1}^{n}\phi(\mathbf{x}_i)=\mathbf{0}$, i.e., $\mathbf{K}$ has been replaced by the centered kernel matrix $\hat{\mathbf{K}}$.

Goal: find $\mathbf{u}$, $\lambda$ such that $\mathbf{\Sigma}_{\phi}\mathbf{u}=\lambda\mathbf{u}$. Expanding:

$$\begin{aligned} \frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)[\phi(\mathbf{x}_i)^T\mathbf{u}] &=\lambda\mathbf{u}\\ \sum\limits_{i=1}^n\left[\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}\right] \phi(\mathbf{x}_i)&=\mathbf{u} \end{aligned}$$

So $\mathbf{u}$ is a linear combination of all the mapped data points.

Let $c_i=\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}$, so that $\mathbf{u}=\sum\limits_{i=1}^nc_i \phi(\mathbf{x}_i)$. Substituting into the original equation:

$$\begin{aligned} \left(\frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T}\right)\left(\sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{j}\right)\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(\phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \end{aligned}$$

Note that here $\mathbf{K}=\hat{\mathbf{K}}$ is already centered.

For each $k$ ($1\le k\le n$), multiply both sides on the left by $\phi(\mathbf{x}_{k})^T$:

$$\begin{aligned} \sum_{i=1}^{n}\left(\phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi^T(\mathbf{x}_{k}) \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(K(\mathbf{x}_k, \mathbf{x}_i) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} K(\mathbf{x}_k, \mathbf{x}_i) \end{aligned}$$

Let $\mathbf{K}_{i}=\left(K\left(\mathbf{x}_{i}, \mathbf{x}_{1}\right), K\left(\mathbf{x}_{i}, \mathbf{x}_{2}\right), \cdots, K\left(\mathbf{x}_{i}, \mathbf{x}_{n}\right)\right)^{T}$ (the $i$-th row of the kernel matrix, so $\mathbf{K}=\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}$) and $\mathbf{c}=(c_1,c_2,\cdots,c_n)^T$. Then:

$$\begin{aligned} \sum_{i=1}^{n}K(\mathbf{x}_k, \mathbf{x}_i) \mathbf{K}^T_i\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c},\quad k=1,2,\cdots,n \\ \mathbf{K}^T_k\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c}\\ \mathbf{K}^T_k\mathbf{K}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c} \end{aligned}$$

Stacking these $n$ equations gives:

$$\mathbf{K}^2\mathbf{c}=n\lambda \mathbf{K}\mathbf{c}$$

Assuming $\mathbf{K}^{-1}$ exists:

$$\begin{aligned} \mathbf{K}^2\mathbf{c}&=n\lambda \mathbf{K}\mathbf{c}\\ \mathbf{K}\mathbf{c}&=n\lambda \mathbf{c}\\ \mathbf{K}\mathbf{c}&= \eta\mathbf{c},\quad\eta=n\lambda \end{aligned}$$

Conclusion: the values $\frac{\eta_1}{n}\ge\frac{\eta_2}{n}\ge\cdots\ge\frac{\eta_n}{n}$, where $\eta_1\ge\eta_2\ge\cdots\ge\eta_n$ are the eigenvalues of $\mathbf{K}$, give the projected variances of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ in feature space; the total variance captured by the top $r$ directions is $\sum\limits_{i=1}^{r}\frac{\eta_i}{n}$.

Question: can we compute the projections of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ onto the principal directions (i.e., the reduced-dimension data)?

Let $\mathbf{u}_1,\cdots,\mathbf{u}_d$ be the eigenvectors of $\mathbf{\Sigma}_{\phi}$, so that $\phi(\mathbf{x}_j)=a_1\mathbf{u}_1+\cdots+a_d\mathbf{u}_d$, where for $k=1,2,\cdots,d$:

$$\begin{aligned} a_k &= \phi(\mathbf{x}_j)^T\mathbf{u}_k\\ &= \phi(\mathbf{x}_j)^T\sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} K(\mathbf{x}_j,\mathbf{x}_i) \end{aligned}$$

Algorithm 7.2: Kernel PCA ($\mathcal{F}\subseteq \mathbb{R}^d$)

Input: $\mathbf{K}$, $\alpha$

Output: $A$ (the projected coordinates of the reduced-dimension data)

  1. $\hat{\mathbf{K}} :=\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \mathbf{K}\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right)$ (center the kernel matrix);

  2. $\eta_1,\eta_2,\cdots,\eta_d \longleftarrow$ the eigenvalues of $\hat{\mathbf{K}}$, keeping only the largest $d$;

  3. $\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_d \longleftarrow$ the corresponding orthonormal eigenvectors of $\hat{\mathbf{K}}$;

  4. $\mathbf{c}_i \leftarrow \frac{1}{\sqrt{\eta_i}}\cdot \mathbf{c}_i,\ i=1,\cdots,d$ (so that each $\mathbf{u}_i$ has unit norm);

  5. Choose the smallest $r$ such that $\frac{\sum\limits_{i=1}^r\frac{\eta_i}{n}}{\sum\limits_{i=1}^d\frac{\eta_i}{n}}\ge \alpha$;

  6. $\mathbf{C}_r=(\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_r)$;

  7. $A=\{\mathbf{a}_i \mid \mathbf{a}_i=\mathbf{C}_r^T\hat{\mathbf{K}}_i,\ i=1,\cdots,n\}$, where $\hat{\mathbf{K}}_i$ is the $i$-th row of $\hat{\mathbf{K}}$.
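A Python sketch of Algorithm 7.2 (our own transcription with numpy; the Gaussian kernel and its bandwidth in the demo are arbitrary illustrative choices):

```python
import numpy as np

def kernel_pca(K: np.ndarray, alpha: float):
    """Algorithm 7.2: K is the n x n kernel matrix [K(x_i, x_j)]."""
    n = len(K)
    ones = np.ones((n, n)) / n                          # the matrix (1/n) 1_{n x n}
    K_hat = K - ones @ K - K @ ones + ones @ K @ ones   # step 1: center K
    etas, C = np.linalg.eigh(K_hat)                     # steps 2-3: eigen-decomposition
    etas, C = etas[::-1], C[:, ::-1]                    # decreasing order
    pos = etas > 1e-12                                  # guard against numerical negatives
    etas, C = etas[pos], C[:, pos]
    C = C / np.sqrt(etas)                               # step 4: scale so ||u_i|| = 1
    frac = np.cumsum(etas) / etas.sum()                 # step 5 (the 1/n factors cancel)
    r = int(np.argmax(frac >= alpha)) + 1
    C_r = C[:, :r]                                      # step 6
    A = K_hat @ C_r                                     # step 7: row i is C_r^T K_hat_i
    return A

# Demo with a Gaussian kernel (bandwidth is an arbitrary choice):
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)   # pairwise squared distances
K = np.exp(-sq / 2.0)
A = kernel_pca(K, alpha=0.9)
```

With the linear kernel $K(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i^T\mathbf{x}_j$, the coordinates $A$ should agree (up to sign) with those produced by ordinary PCA on the centered data.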

