PCA: Principal Component Analysis
$$\mathbf{D}=\left(\begin{array}{c|cccc} & X_{1} & X_{2} & \cdots & X_{d} \\ \hline \mathbf{x}_{1} & x_{11} & x_{12} & \cdots & x_{1d} \\ \mathbf{x}_{2} & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_{n} & x_{n1} & x_{n2} & \cdots & x_{nd} \end{array}\right)$$
Data points: $\mathbf{x}_{1},\cdots,\mathbf{x}_n \in \mathbb{R}^d$ (the rows of $\mathbf{D}$ are $\mathbf{x}_i^T$). For any $\mathbf{x} \in \mathbb{R}^d$, write $\mathbf{x}=(x_1,\cdots,x_d)^T= \sum\limits_{i=1}^{d}x_i \mathbf{e}_i$,
where $\mathbf{e}_i=(0,\cdots,1,\cdots,0)^T\in\mathbb{R}^d$ is the $i$-th standard basis vector (1 in coordinate $i$).
Suppose $\{\mathbf{u}_i\}_{i=1}^{d}$ is another orthonormal basis, so that $\mathbf{x}=\sum\limits_{i=1}^{d}a_i \mathbf{u}_i$ with $a_i \in \mathbb{R}$ and
$$\mathbf{u}_i^T \mathbf{u}_j =\begin{cases} 1, & i=j\\ 0, & i\ne j \end{cases}$$
For any $r$ with $1\le r\le d$:
$$\mathbf{x}=\underbrace{a_1 \mathbf{u}_1+\cdots+a_r \mathbf{u}_r}_{\text{projection}}+ \underbrace{a_{r+1} \mathbf{u}_{r+1}+\cdots+a_d \mathbf{u}_d}_{\text{error}}$$
The first $r$ terms form the projection; the remaining terms are the projection error.
Goal: given $D$, find the orthonormal basis $\{\mathbf{u}_i\}_{i=1}^{d}$ whose leading $r$-dimensional subspace gives the "best approximation" of $D$, i.e. the projection with the smallest error.
(First principal component: $r=1$)
Goal: find $\mathbf{u}_1$, written simply as $\mathbf{u}=(u_1,\cdots,u_d)^T$.
Assumptions: $\|\mathbf{u}\|^2=\mathbf{u}^T\mathbf{u}=1$, and the data are centered: $\hat{\boldsymbol{\mu}}=\frac{1}{n} \sum\limits_{i=1}^n\mathbf{x}_i=\mathbf{0}\in \mathbb{R}^{d}$.
For each $\mathbf{x}_i$ $(i=1,\cdots,n)$, the projection of $\mathbf{x}_i$ onto the direction $\mathbf{u}$ is:
$$\mathbf{x}_{i}^{\prime}=\left(\frac{\mathbf{u}^{T} \mathbf{x}_{i}}{\mathbf{u}^{T} \mathbf{u}}\right) \mathbf{u}=\left(\mathbf{u}^{T} \mathbf{x}_{i}\right) \mathbf{u}=a_{i} \mathbf{u},\qquad a_{i}=\mathbf{u}^{T} \mathbf{x}_{i}$$
Since $\hat{\boldsymbol{\mu}}=\mathbf{0}$, the projection of $\hat{\boldsymbol{\mu}}$ onto $\mathbf{u}$ is $0$, and the mean of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ is $\mathbf{0}$: projection commutes with the mean,
$$\mathrm{Proj}(\mathrm{mean}(D))=\mathrm{mean}(\mathrm{Proj}(D))$$
Consider the sample variance of $\mathbf{x}_{1}^{\prime},\cdots,\mathbf{x}_{n}^{\prime}$ along $\mathbf{u}$:
$$\begin{aligned} \sigma_{\mathbf{u}}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(a_{i}-\mu_{\mathbf{u}}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{u}^{T} \mathbf{x}_{i}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{u}^{T}\left(\mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T}\left(\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right) \mathbf{u} \\ &=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \end{aligned}$$
where $\mathbf{\Sigma}$ is the sample covariance matrix.
Objective:
$$\begin{array}{ll} \max\limits_{\mathbf{u}} & \mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u} \\ \text{s.t.} & \mathbf{u}^T\mathbf{u}-1=0 \end{array}$$
Applying the method of Lagrange multipliers:
$$\max \limits_{\mathbf{u}} J(\mathbf{u})=\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)$$
Setting the partial derivative to zero:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{u}} J(\mathbf{u}) &=\mathbf{0} \\ \frac{\partial}{\partial \mathbf{u}}\left(\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}-\lambda\left(\mathbf{u}^{T} \mathbf{u}-1\right)\right) &=\mathbf{0} \\ 2 \mathbf{\Sigma} \mathbf{u}-2 \lambda \mathbf{u} &=\mathbf{0} \\ \mathbf{\Sigma} \mathbf{u} &=\lambda \mathbf{u} \end{aligned}$$
Note that $\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}=\mathbf{u}^{T} \lambda \mathbf{u}=\lambda$.
Hence the optimum takes $\lambda$ to be the largest eigenvalue of $\mathbf{\Sigma}$, and $\mathbf{u}$ a corresponding unit eigenvector.
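The conclusion above can be checked numerically with NumPy: the unit eigenvector of the largest eigenvalue of $\mathbf{\Sigma}$ attains projected variance $\lambda$. A minimal sketch on synthetic data (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])  # n x d data
X = X - X.mean(axis=0)                                    # center: mu_hat = 0

Sigma = X.T @ X / len(X)         # sample covariance (1/n convention, as in the notes)
lam, U = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
u = U[:, -1]                     # unit eigenvector of the largest eigenvalue

a = X @ u                        # scores a_i = u^T x_i
# projected variance u^T Sigma u equals the largest eigenvalue lambda
assert np.isclose(a.var(), lam[-1])
```

Any other unit direction gives `(X @ w).var()` no larger than `lam[-1]`, which is what the constrained maximization asserts.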
Question: does the $\mathbf{u}$ that maximizes $\sigma_{\mathbf{u}}^{2}$ above also minimize the projection error?
Define the mean squared error (MSE):
$$\begin{aligned} MSE(\mathbf{u}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right)^{T}\left(\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} \mathbf{x}_{i}^{\prime}+\left(\mathbf{x}_{i}^{\prime}\right)^{T} \mathbf{x}_{i}^{\prime}\right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 \mathbf{x}_{i}^{T} (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}+\left[(\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right]^{T} \left[ (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{u}\right] \right)\\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-2 (\mathbf{u}^{T} \mathbf{x}_{i})\mathbf{x}_{i}^{T} \mathbf{u}+(\mathbf{u}^{T} \mathbf{x}_{i})(\mathbf{x}_{i}^{T} \mathbf{u})\mathbf{u}^{T}\mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{x}_{i}\mathbf{x}_{i}^{T} \mathbf{u} \right) \\ &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2}-\mathbf{u}^{T} \mathbf{\Sigma} \mathbf{u}\\ &= \mathrm{var}(D)-\sigma_{\mathbf{u}}^{2} \end{aligned}$$
This shows $\mathrm{var}(D)=\sigma_{\mathbf{u}}^{2}+MSE(\mathbf{u})$: since $\mathrm{var}(D)$ is fixed, maximizing $\sigma_{\mathbf{u}}^{2}$ is equivalent to minimizing the MSE.
Geometric meaning of $\mathbf{u}$: the direction in $\mathbb{R}^d$ along which the projected data have maximum variance and, at the same time, minimum MSE.
$\mathbf{u}$ is called the first principal component.
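The decomposition $\mathrm{var}(D)=\sigma_{\mathbf{u}}^{2}+MSE(\mathbf{u})$ holds for *any* unit direction, not only the optimal one, so it can be verified directly. A short sketch with made-up data and an arbitrary unit vector (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)               # centered data, mu_hat = 0
n = len(X)

u = rng.normal(size=4)
u = u / np.linalg.norm(u)            # arbitrary unit direction

var_D = (X ** 2).sum() / n           # total variance (1/n) sum ||x_i||^2
sigma_u2 = ((X @ u) ** 2).sum() / n  # projected variance u^T Sigma u
Xp = np.outer(X @ u, u)              # projections x_i' = (u^T x_i) u
mse = ((X - Xp) ** 2).sum() / n      # mean squared projection error

assert np.isclose(var_D, sigma_u2 + mse)  # var(D) = sigma_u^2 + MSE
```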
(Second principal component: $r=2$)
Assume $\mathbf{u}_1$ has been found, i.e. the eigenvector corresponding to the largest eigenvalue of $\mathbf{\Sigma}$.
Goal: find $\mathbf{u}_2$, written simply as $\mathbf{v}$, such that $\mathbf{v}^{T} \mathbf{u}_{1}=0$ and $\mathbf{v}^{T} \mathbf{v} =1$.
Consider the variance of the projections of the $\mathbf{x}_{i}$ along $\mathbf{v}$:
$$\begin{array}{ll} \max\limits_{\mathbf{v}} & \sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} \\ \text{s.t.} & \mathbf{v}^T\mathbf{v}-1=0\\ & \mathbf{v}^{T} \mathbf{u}_{1}=0 \end{array}$$
Define the Lagrangian:
$$J(\mathbf{v})=\mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v}-\alpha\left(\mathbf{v}^{T} \mathbf{v}-1\right)-\beta\left(\mathbf{v}^{T} \mathbf{u}_{1}-0\right)$$
Taking the partial derivative with respect to $\mathbf{v}$:
$$2 \mathbf{\Sigma} \mathbf{v}-2 \alpha \mathbf{v}-\beta \mathbf{u}_{1}=\mathbf{0}$$
Left-multiplying both sides by $\mathbf{u}_{1}^{T}$:
$$\begin{aligned} 2 \mathbf{u}_{1}^{T}\mathbf{\Sigma} \mathbf{v}-2 \alpha \mathbf{u}_{1}^{T}\mathbf{v}-\beta \mathbf{u}_{1}^{T}\mathbf{u}_{1} &=0 \\ 2 \mathbf{u}_{1}^{T}\mathbf{\Sigma} \mathbf{v}-\beta &= 0\\ 2 \mathbf{v}^{T}\mathbf{\Sigma} \mathbf{u}_{1}-\beta &= 0\\ 2 \mathbf{v}^{T}\lambda_1 \mathbf{u}_{1}-\beta &= 0\\ \beta &= 0 \end{aligned}$$
(using $\mathbf{u}_{1}^{T}\mathbf{v}=0$, $\mathbf{u}_{1}^{T}\mathbf{u}_{1}=1$, the symmetry of $\mathbf{\Sigma}$, and $\mathbf{\Sigma}\mathbf{u}_1=\lambda_1\mathbf{u}_1$). Substituting $\beta=0$ back into the original equation:
$$2 \mathbf{\Sigma} \mathbf{v}-2 \alpha \mathbf{v}=\mathbf{0} \quad\Rightarrow\quad \mathbf{\Sigma} \mathbf{v}=\alpha \mathbf{v}$$
Hence $\mathbf{v}$ is also an eigenvector of $\mathbf{\Sigma}$.
Since $\sigma_{\mathbf{v}}^{2} = \mathbf{v}^{T} \mathbf{\Sigma} \mathbf{v} =\alpha$, $\alpha$ should be the second-largest eigenvalue $\lambda_2$ of $\mathbf{\Sigma}$, and $\mathbf{v}$ a corresponding unit eigenvector.
Question 1: do $\mathbf{v}$ (i.e. $\mathbf{u}_2$) found above, taken together with $\mathbf{u}_1$, maximize the total variance of the projection of $D$ onto $\mathrm{span}\{\mathbf{u}_1, \mathbf{u}_2\}$?
Write $\mathbf{x}_i=\underbrace{a_{i1} \mathbf{u}_1+a_{i2}\mathbf{u}_2}_{\text{projection}}+\cdots$
Then the coordinates of the projection of $\mathbf{x}_i$ onto $\mathrm{span}\{\mathbf{u}_1, \mathbf{u}_2\}$ are $\mathbf{a}_{i}=(a_{i1},a_{i2})^T=(\mathbf{u}_1^{T}\mathbf{x}_i,\mathbf{u}_2^{T}\mathbf{x}_i)^{T}$.
Let $\mathbf{U}_{2}=\left(\begin{array}{cc} \mid & \mid \\ \mathbf{u}_{1} & \mathbf{u}_{2} \\ \mid & \mid \end{array}\right)$; then $\mathbf{a}_{i}=\mathbf{U}_{2}^{T} \mathbf{x}_{i}$.
The total projected variance is:
$$\begin{aligned} \operatorname{var}(\mathbf{A}) &=\frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{a}_{i}-\mathbf{0}\right\|^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right)^{T}\left(\mathbf{U}_{2}^{T} \mathbf{x}_{i}\right) \\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &=\frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left( \mathbf{u}_{1}\mathbf{u}_{1}^T + \mathbf{u}_{2}\mathbf{u}_{2}^T \right) \mathbf{x}_{i}\\ &=\mathbf{u}_{1}^T\mathbf{\Sigma} \mathbf{u}_{1} + \mathbf{u}_{2}^T\mathbf{\Sigma} \mathbf{u}_{2}\\ &= \lambda_1 +\lambda_2 \end{aligned}$$
which is the maximum achievable over any pair of orthonormal directions.
Question 2: is the mean squared error minimized?
Here $\mathbf{x}_{i}^{\prime}=\mathbf{U}_{2}\mathbf{U}_{2}^{T} \mathbf{x}_{i}$, and
$$\begin{aligned} MSE &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}-\mathbf{x}_{i}^{\prime}\right\|^{2} \\ &= \frac{1}{n} \sum_{i=1}^{n}\left\|\mathbf{x}_{i}\right\|^{2} - \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{i}^{T}\left(\mathbf{U}_{2} \mathbf{U}_{2}^{T}\right) \mathbf{x}_{i}\\ &= \mathrm{var}(D) - \lambda_1 - \lambda_2 \end{aligned}$$
Since $\mathrm{var}(D)$ is fixed and $\lambda_1+\lambda_2$ is the maximum projected variance, this MSE is the minimum.
Conclusion:
Let $\mathbf{\Sigma}_{d\times d}$ have eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$, with the data centered. Then projecting onto the top-$r$ eigenvector subspace gives:
$\sum\limits_{i=1}^r\lambda_i$: the maximum total projected variance;
$\mathrm{var}(D)-\sum\limits_{i=1}^r\lambda_i$: the minimum MSE.
In practice, to choose an appropriate $r$, compare the ratio $\dfrac{\sum_{i=1}^r\lambda_i}{\mathrm{var}(D)}$ (where $\mathrm{var}(D)=\sum_{i=1}^d\lambda_i$) against a given threshold $\alpha$.
Algorithm 7.1 (PCA):
Input: $D$, $\alpha$
Output: $A$ (the reduced-dimension data)
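A minimal NumPy rendering of Algorithm 7.1 as described above: center $D$, eigendecompose $\mathbf{\Sigma}$, take the smallest $r$ whose explained-variance ratio reaches $\alpha$, and project. The function and variable names are illustrative, not from the text:

```python
import numpy as np

def pca(D, alpha):
    """Algorithm 7.1: projected coordinates A keeping a fraction alpha of variance."""
    X = D - D.mean(axis=0)                 # center the data
    Sigma = X.T @ X / len(X)               # sample covariance
    lam, U = np.linalg.eigh(Sigma)         # eigenvalues ascending
    lam, U = lam[::-1], U[:, ::-1]         # sort descending: lam_1 >= ... >= lam_d
    ratio = np.cumsum(lam) / lam.sum()     # var(D) = sum of all eigenvalues
    r = int(np.searchsorted(ratio, alpha)) + 1  # smallest r with ratio >= alpha
    return X @ U[:, :r]                    # A: n x r projected coordinates

rng = np.random.default_rng(2)
D = rng.normal(size=(100, 5)) @ np.diag([4.0, 2.0, 0.5, 0.2, 0.1])
A = pca(D, alpha=0.9)
```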
(Kernel PCA)
Feature map: $\phi:\mathcal{I}\to \mathcal{F}\subseteq \mathbb{R}^d$
Kernel function: $K:\mathcal{I}\times\mathcal{I}\to \mathbb{R}$, with
$$K(\mathbf{x}_i,\mathbf{x}_j)=\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$$
Given: the kernel matrix $\mathbf{K}=[K(\mathbf{x}_i,\mathbf{x}_j)]_{n\times n}$ and the feature-space covariance $\mathbf{\Sigma}_{\phi}=\frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)\phi(\mathbf{x}_i)^T$.
Data points: $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)\in \mathbb{R}^d$, assumed centered: $\frac{1}{n}\sum\limits_{i=1}^{n}\phi(\mathbf{x}_i)=\mathbf{0}$; correspondingly $\mathbf{K}$ is replaced by the centered $\hat{\mathbf{K}}$.
Goal: find $\mathbf{u},\lambda$ such that $\mathbf{\Sigma}_{\phi}\mathbf{u}=\lambda\mathbf{u}$.
$$\begin{aligned} \frac{1}{n}\sum\limits_{i=1}^n\phi(\mathbf{x}_i)[\phi(\mathbf{x}_i)^T\mathbf{u}] &=\lambda\mathbf{u}\\ \sum\limits_{i=1}^n\left[\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}\right] \phi(\mathbf{x}_i)&=\mathbf{u} \end{aligned}$$
That is, $\mathbf{u}$ is a linear combination of all the data points.
Let $c_i=\frac{\phi(\mathbf{x}_i)^T\mathbf{u}}{n\lambda}$, so $\mathbf{u}=\sum\limits_{i=1}^nc_i \phi(\mathbf{x}_i)$. Substituting into the eigen equation:
$$\begin{aligned} \left(\frac{1}{n} \sum_{i=1}^{n} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T}\right)\left(\sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{j}\right)\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{j} \phi\left(\mathbf{x}_{i}\right) \phi\left(\mathbf{x}_{i}\right)^{T} \phi\left(\mathbf{x}_{j}\right) &=\lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(\phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi\left(\mathbf{x}_{i}\right) \end{aligned}$$
Note that here $\mathbf{K}=\hat{\mathbf{K}}$ is already centered.
For each $k$ $(1\le k\le n)$, left-multiplying both sides by $\phi(\mathbf{x}_{k})^T$:
$$\begin{aligned} \sum_{i=1}^{n}\left(\phi(\mathbf{x}_{k})^T \phi\left(\mathbf{x}_{i}\right) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} \phi(\mathbf{x}_{k})^T \phi\left(\mathbf{x}_{i}\right) \\ \sum_{i=1}^{n}\left(K(\mathbf{x}_k, \mathbf{x}_i) \sum_{j=1}^{n} c_{j} K(\mathbf{x}_i, \mathbf{x}_j) \right) &=n \lambda \sum_{i=1}^{n} c_{i} K(\mathbf{x}_k, \mathbf{x}_i) \end{aligned}$$
Let $\mathbf{K}_{i}=\left(K\left(\mathbf{x}_{i}, \mathbf{x}_{1}\right), K\left(\mathbf{x}_{i}, \mathbf{x}_{2}\right), \cdots, K\left(\mathbf{x}_{i}, \mathbf{x}_{n}\right)\right)^{T}$ (the $i$-th row of the kernel matrix, so $\mathbf{K}=\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}$) and $\mathbf{c}=(c_1,c_2,\cdots,c_n)^T$. Then:
$$\begin{aligned} \sum_{i=1}^{n}K(\mathbf{x}_k, \mathbf{x}_i) \mathbf{K}^T_i\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c},\quad k=1,2,\cdots,n \\ \mathbf{K}^T_k\begin{bmatrix} \mathbf{K}_1^T \\ \vdots \\ \mathbf{K}_n^T \end{bmatrix}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c}\\ \mathbf{K}^T_k\mathbf{K}\mathbf{c} &=n \lambda \mathbf{K}^T_k\mathbf{c} \end{aligned}$$
Stacking over $k$: $\mathbf{K}^2\mathbf{c}=n\lambda \mathbf{K}\mathbf{c}$.
Assuming $\mathbf{K}^{-1}$ exists:
$$\begin{aligned} \mathbf{K}^2\mathbf{c}&=n\lambda \mathbf{K}\mathbf{c}\\ \mathbf{K}\mathbf{c}&=n\lambda \mathbf{c}\\ \mathbf{K}\mathbf{c}&= \eta\mathbf{c},\quad \eta=n\lambda \end{aligned}$$
Conclusion: if $\eta_1\ge\eta_2\ge\cdots\ge\eta_n$ are the eigenvalues of $\mathbf{K}$, then $\frac{\eta_1}{n}\ge\frac{\eta_2}{n}\ge\cdots\ge\frac{\eta_n}{n}$ are the eigenvalues of $\mathbf{\Sigma}_\phi$, and the total variance of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ projected onto the top $r$ principal directions in feature space is $\sum\limits_{i=1}^{r}\frac{\eta_i}{n}$.
Question: can we compute the projections of $\phi(\mathbf{x}_1),\phi(\mathbf{x}_2),\cdots,\phi(\mathbf{x}_n)$ onto the principal directions (i.e. the reduced-dimension data)?
Let $\mathbf{u}_1,\cdots,\mathbf{u}_d$ be the eigenvectors of $\mathbf{\Sigma}_{\phi}$, so $\phi(\mathbf{x}_j)=a_1\mathbf{u}_1+\cdots+a_d\mathbf{u}_d$, where, writing $\mathbf{u}_k=\sum_{i=1}^n c_{ki}\phi(\mathbf{x}_i)$:
$$\begin{aligned} a_k &= \phi(\mathbf{x}_j)^T\mathbf{u}_k, \quad k=1,2,\cdots,d\\ &= \phi(\mathbf{x}_j)^T\sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} \phi(\mathbf{x}_j)^T\phi(\mathbf{x}_i)\\ &= \sum\limits_{i=1}^nc_{ki} K(\mathbf{x}_j,\mathbf{x}_i) \end{aligned}$$
so the coordinates require only kernel evaluations, never $\phi$ itself.
Algorithm 7.2 (Kernel PCA, $\mathcal{F}\subseteq \mathbb{R}^d$):
Input: $\mathbf{K}$, $\alpha$
Output: $A$ (the projected coordinates of the reduced-dimension data)
$\hat{\mathbf{K}} :=\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right) \mathbf{K}\left(\mathbf{I}-\frac{1}{n} \mathbf{1}_{n \times n}\right)$ (center the kernel matrix)
$\eta_1,\eta_2,\cdots,\eta_d \longleftarrow$ the top $d$ eigenvalues of $\hat{\mathbf{K}}$
$\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_d \longleftarrow$ the corresponding eigenvectors of $\hat{\mathbf{K}}$ (orthonormal)
$\mathbf{c}_i \leftarrow \frac{1}{\sqrt{\eta_i}}\, \mathbf{c}_i,\ i=1,\cdots,d$ (so that each $\mathbf{u}_i$ has unit norm)
Choose the smallest $r$ such that $\dfrac{\sum_{i=1}^r \eta_i/n}{\sum_{i=1}^d \eta_i/n}\ge \alpha$
$\mathbf{C}_r=(\mathbf{c}_1,\mathbf{c}_2,\cdots,\mathbf{c}_r)$
$A=\{\mathbf{a}_i \mid \mathbf{a}_i=\mathbf{C}_r^T\mathbf{K}_i,\ i=1,\cdots,n\}$
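A sketch of Algorithm 7.2 in NumPy. Assumptions beyond the text: a linear kernel is used for the demonstration, and eigenvalues below a small tolerance are dropped (for a kernel of rank below $d$ this is equivalent to keeping the top $d$); names are illustrative:

```python
import numpy as np

def kernel_pca(K, alpha):
    """Algorithm 7.2: kernel PCA from an n x n kernel matrix K."""
    n = len(K)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                     # double-center: K_hat
    eta, C = np.linalg.eigh(Kc)        # eigenvalues ascending
    eta, C = eta[::-1], C[:, ::-1]     # descending: eta_1 >= eta_2 >= ...
    pos = eta > 1e-10                  # keep numerically positive eigenvalues
    eta, C = eta[pos], C[:, pos]
    C = C / np.sqrt(eta)               # rescale c_i so each u_i has unit norm
    ratio = np.cumsum(eta) / eta.sum()
    r = int(np.searchsorted(ratio, alpha)) + 1  # smallest r reaching alpha
    return Kc @ C[:, :r]               # row i is a_i = C_r^T K_i

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
K = X @ X.T                            # linear kernel K(x_i, x_j) = x_i^T x_j
A = kernel_pca(K, alpha=0.95)
```

With a linear kernel the rank of the centered kernel matrix is at most 3 here, so at most 3 components survive, and the coordinates agree (up to sign) with ordinary PCA on the centered $X$.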