(Really looking forward to chapters 8 and 9, to see how an author with such a Bayesian outlook presents probabilistic graphical models.)
There are three approaches to classification: discriminant functions, probabilistic generative models, and probabilistic discriminative models.
The inverse of the activation function is called the link function.
Generalized linear models: $y(\textbf x)=f(\textbf w^T \textbf x+w_0)$. Note that a decision boundary $y(\textbf x)=\text{constant}$ corresponds to $\textbf w^T\textbf x+w_0=\text{constant}$.
In augmented notation,
$$y(\textbf x) = \textbf w^T \textbf x + w_0=\tilde {\textbf w}^T\tilde{\textbf x}$$
where $\tilde{\textbf w} = (w_0, \textbf w)$ and $\tilde{\textbf x} = (1, \textbf x)$.
For multiple classes,
$$\begin{aligned} \textbf y(\textbf x) & = \tilde{\textbf W}^T \tilde{\textbf x} \\ y_k(\textbf x) & = \textbf w_k^T \textbf x + w_{k0} \end{aligned}$$
If $y_k(\textbf x)>y_j(\textbf x) \text{ for all } j \neq k$, then $\textbf x$ is assigned to class $\mathcal C_k$.
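To make the notation concrete, here is a minimal NumPy sketch of the argmax decision rule in augmented form (the helper names `W_tilde` and `predict_class` are mine, not from the book):

```python
import numpy as np

# W_tilde has shape (D+1, K); column k holds (w_k0, w_k).
# Assign each x to the class with the largest score y_k(x).
def predict_class(W_tilde, X):
    """X: (N, D) inputs; returns (N,) predicted class indices."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1
    Y = X_tilde @ W_tilde                                # (N, K) scores y_k(x)
    return np.argmax(Y, axis=1)
```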
With the model defined, we next consider how to learn the parameters. There are three methods: least squares, Fisher's linear discriminant, and the perceptron algorithm.
One approach is to one-hot encode the labels and then treat the problem directly as regression. This works very poorly: in regression the model assumes the targets follow a Gaussian distribution, which is a very different situation from classification labels.
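For concreteness, a least-squares-classification sketch (closed-form, but it behaves poorly for the reasons above; helper names are mine):

```python
import numpy as np

# One-hot encode the labels into T, then solve the normal equations
# W_tilde = (X_tilde^T X_tilde)^(-1) X_tilde^T T via least squares.
def fit_least_squares(X, labels, K):
    """X: (N, D); labels: (N,) integer class labels in 0..K-1."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])    # (N, D+1)
    T = np.eye(K)[labels]                                  # (N, K) one-hot targets
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)  # (D+1, K)
    return W_tilde
```

Prediction then uses the argmax rule from the sketch above.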
Fisher's linear discriminant. The idea: push the class means as far apart as possible while keeping the within-class variance small.
For the two-class problem, if the targets for $\mathcal C_1$ are set to $N/N_1$ (where $N_1$ is the number of samples in $\mathcal C_1$ and $N$ the total number of samples) and the targets for $\mathcal C_2$ to $-N/N_2$, then the Fisher discriminant coincides with the least-squares solution.
Moreover, least squares also yields the classification threshold $-w_0$, rather than only projecting the data onto a one-dimensional space.
For the multi-class case, the data are projected with
$$\textbf y = \textbf W^T \textbf x$$
The within-class covariance matrix is
$$\textbf S_W = \sum_{k=1}^K \textbf S_k$$
where
$$\begin{aligned} \textbf S_k &= \sum_{n\in \mathcal C_k}(\textbf x_n-\textbf m_k)(\textbf x_n - \textbf m_k)^T \\ \textbf m_k &=\frac{1}{N_k}\sum_{n\in\mathcal C_k} \textbf x_n \end{aligned}$$
The total covariance matrix is
$$\textbf S_T = \sum_{n=1}^N (\textbf x_n-\textbf m)(\textbf x_n - \textbf m)^T$$
It can be shown that
$$\textbf S_T = \textbf S_W + \textbf S_B$$
where
$$\textbf S_B = \sum_{k=1}^K N_k(\textbf m_k - \textbf m)(\textbf m_k - \textbf m)^T$$
The matrices above live in the space of $\textbf x$; analogous matrices can be defined in the space of $\textbf y$. For example, the within-class covariance in $\textbf y$-space is $\textbf W \textbf S_W \textbf W^T$.
The criterion is therefore
$$J(\textbf W) = \mathrm{Tr}\left\{ (\textbf W \textbf S_W \textbf W^T)^{-1}(\textbf W \textbf S_B \textbf W^T) \right\}$$
The optimization algorithm is omitted in the book.
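Since the book skips the optimization, here is one standard recipe (my assumption, following the usual multi-class LDA treatment, not the book's derivation): take the projection directions as the leading eigenvectors of $\textbf S_W^{-1}\textbf S_B$.

```python
import numpy as np

# Build S_W and S_B as defined above, then take the eigenvectors of
# S_W^{-1} S_B with the largest eigenvalues as the columns of W.
def fisher_multiclass(X, labels, n_components):
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                # within-class scatter
        S_B += len(Xk) * np.outer(mk - m, mk - m)     # between-class scatter
    # assumes S_W is non-singular; add a small ridge if it is not
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:n_components]].real         # (D, n_components)
    return W                                          # project with y = W^T x
```

Note that $\textbf S_B$ has rank at most $K-1$, so at most $K-1$ of these directions carry useful information.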
The perceptron. The loss function is the perceptron criterion
$$E_P(\textbf w) = -\sum_{n\in \mathcal M} \textbf w^T \phi_n t_n$$
where $t_n \in \{+1,-1\}$ and $\mathcal M$ is the set of misclassified patterns.
Using SGD,
$$\textbf w^{(\tau +1)} = \textbf w^{(\tau)} - \eta \nabla E_P(\textbf w)=\textbf w^{(\tau)} + \eta \phi_n t_n$$
Note that the scale of $\textbf w$ does not affect the decision, so we can simply set $\eta=1$.
The perceptron convergence theorem states that if the training data are linearly separable, the algorithm is guaranteed to converge in a finite number of steps.
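A minimal sketch of the resulting algorithm with $\eta=1$ (the helper names and the `max_epochs` cap are mine):

```python
import numpy as np

# phi: (N, M) feature vectors, t: (N,) labels in {+1, -1}.
def train_perceptron(phi, t, max_epochs=100):
    w = np.zeros(phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_n, t_n in zip(phi, t):
            if t_n * (w @ phi_n) <= 0:   # pattern n is misclassified (in M)
                w += phi_n * t_n         # w <- w + eta * phi_n * t_n with eta = 1
                mistakes += 1
        if mistakes == 0:                # converged (guaranteed if separable)
            break
    return w
```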
Probabilistic generative models. For binary classification,
$$p(\mathcal C_1|\textbf x) = \frac{p(\textbf x|\mathcal C_1)p(\mathcal C_1)}{p(\textbf x|\mathcal C_1)p(\mathcal C_1)+p(\textbf x|\mathcal C_2)p(\mathcal C_2)} = \frac{1}{1+\exp(-a)}=\sigma(a)$$
where
$$a = \ln \frac{p(\textbf x|\mathcal C_1)p(\mathcal C_1)}{p(\textbf x|\mathcal C_2)p(\mathcal C_2)}$$
For multiple classes,
$$p(\mathcal C_k|\textbf x)=\frac{p(\textbf x|\mathcal C_k)p(\mathcal C_k)}{\sum_j p(\textbf x|\mathcal C_j)p(\mathcal C_j)}=\frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where
$$a_k=\ln p(\textbf x|\mathcal C_k)p(\mathcal C_k)$$
For continuous inputs, if all classes share the same covariance and each class-conditional density is modelled as a Gaussian,
$$p(\textbf x|\mathcal C_k)=\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\left \{ -\frac{1}{2} (\textbf x- \mu _k)^T\Sigma^{-1}(\textbf x - \mu_k) \right\}$$
then for two classes we can write
$$p(\mathcal C_1|\textbf x)=\sigma (\textbf w^T \textbf x+w_0)$$
where
$$\begin{aligned} \textbf w & =\Sigma^{-1}(\mu_1 - \mu_2) \\ w_0 &= -\frac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(\mathcal C_1)}{p(\mathcal C_2)} \end{aligned}$$
For the multi-class case,
$$a_k(\textbf x) = \textbf w_k^T \textbf x + w_{k0}$$
where
$$\begin{aligned} \textbf w_k &= \Sigma^{-1} \mu_k \\ w_{k0} &= -\frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln p(\mathcal C_k) \end{aligned}$$
Note that if the class covariances are not shared, the result is a quadratic discriminant, and the decision boundaries are no longer linear.
This generative model can be fitted by maximum likelihood (MLE). For the two-class problem, the likelihood is
$$p(\textbf t|\pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^N [\pi \mathcal N(\textbf x_n|\mu_1, \Sigma)]^{t_n}[(1-\pi)\mathcal N(\textbf x_n|\mu_2, \Sigma)]^{1-t_n}$$
and the solution is
$$\begin{aligned} \pi & = \frac{N_1}{N_1+N_2} \\ \mu_1 &= \frac{1}{N_1}\sum_{n=1}^N t_n \textbf x_n \\ \mu_2 &=\frac{1}{N_2}\sum_{n=1}^N (1-t_n)\textbf x_n \\ \Sigma &= \frac{N_1}{N} \left[ \frac{1}{N_1} \sum_{n\in \mathcal C_1} (\textbf x_n - \mu_1)(\textbf x_n - \mu_1)^T \right] + \frac{N_2}{N} \left[ \frac{1}{N_2} \sum_{n\in \mathcal C_2} (\textbf x_n - \mu_2)(\textbf x_n - \mu_2)^T \right] \end{aligned}$$
where $N_1, N_2$ are the numbers of samples in the two classes. The multi-class results are similar.
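A sketch of these closed-form estimates, followed by the posterior $p(\mathcal C_1|\textbf x)=\sigma(\textbf w^T\textbf x + w_0)$ from the shared-covariance formulas above (helper names are mine):

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """X: (N, D); t: (N,) array with 1 for class C1 and 0 for class C2."""
    N = len(t)
    N1, N2 = t.sum(), (1 - t).sum()
    pi = N1 / (N1 + N2)
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2          # weighted average of class covariances
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi / (1 - pi)))
    return w, w0

def posterior_c1(X, w, w0):
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))     # sigma(w^T x + w0)
```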
Discrete features: consider the multi-class case directly. Suppose the input has $D$ dimensions, each taking the values 0 or 1, and make the naive Bayes assumption that the input dimensions are conditionally independent given the class, so that
$$p(\textbf x|\mathcal C_k)=\prod_{i=1}^D \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 -x_i}$$
which gives
$$a_k(\textbf x) = \sum_{i=1}^D \left \{ x_i \ln \mu_{ki} + (1-x_i)\ln(1 -\mu_{ki}) \right \} + \ln p(\mathcal C_k)$$
Note that this is still a linear function of $\textbf x$.
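A small sketch of computing the scores $a_k(\textbf x)$ for binary features under the naive Bayes assumption; `mu[k, i]` $=p(x_i=1|\mathcal C_k)$ and `prior[k]` $=p(\mathcal C_k)$ would themselves be fit by simple counting (the names and the `eps` smoothing are mine):

```python
import numpy as np

def naive_bayes_scores(X, mu, prior, eps=1e-12):
    """X: (N, D) in {0,1}; mu: (K, D); prior: (K,). Returns (N, K) scores a_k(x)."""
    log_mu = np.log(mu + eps)
    log_one_minus = np.log(1 - mu + eps)
    return X @ log_mu.T + (1 - X) @ log_one_minus.T + np.log(prior + eps)
```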
Exponential family class-conditionals:
$$p(\textbf x|\lambda _k,s)=\frac{1}{s}h \left(\frac{1}{s}\textbf x\right) g(\lambda_k)\exp \left\{ \frac{1}{s} \lambda_k^T \textbf x \right \}$$
(See the PRML reading notes for Chapter 2: the noninformative prior for a scale parameter.)
In this case, for the multi-class problem, $a_k(\textbf x) = \lambda_k^T \textbf x + \ln g(\lambda_k) + \ln p(\mathcal C_k)$.
Seen this way, the Gaussian is really just a special case: the quadratic terms in $\textbf x$ cancel between the numerator and denominator of the posterior, and what remains of the Gaussian can be regarded as a new distribution once it is normalized.
Probabilistic discriminative models: model $p(\mathcal C_k|\textbf x)$ directly.
For binary classification (logistic regression),
$$p(\mathcal C_1|\phi)=y(\phi)=\sigma(\textbf w^T \phi)$$
Using MLE, the loss function is the cross-entropy
$$-\ln p(\textbf t|\textbf w)=-\sum_{n=1}^N \{t_n \ln y_n + (1-t_n)\ln (1 - y_n)\}$$
whose gradient is
$$\nabla E(\textbf w) = \sum_{n=1}^N (y_n-t_n)\phi _n$$
This is well suited to Newton's method (iterative reweighted least squares, IRLS). First compute the Hessian,
$$\textbf H = \nabla^2 E(\textbf w) = \sum_{n=1}^N y_n(1-y_n)\phi_n\phi_n^T = \Phi^T \textbf R \Phi$$
where $\textbf R$ is a diagonal matrix with $\textbf R_{nn} = y_n (1-y_n)$. Note that $\textbf H$ is positive definite.
The Newton update is
$$\textbf w^{(new)}=\textbf w^{(old)} - \textbf H^{-1}\nabla E(\textbf w) = (\Phi^T \textbf R \Phi)^{-1}\Phi^T\textbf R \textbf z$$
where
$$\textbf z = \Phi \textbf w^{(old)}-\textbf R^{-1}(\textbf y - \textbf t)$$
Note that $\textbf R$ must be recomputed at every iteration; in fact $\textbf R$ can be interpreted as the variance of the targets under the model:
$$\begin{aligned} \mathbb E[t] &=\sigma(\textbf w^T \phi) = y \\ \operatorname{var}[t] &= \mathbb E[t^2] - \mathbb E[t]^2 = \sigma(\textbf w^T \phi) - \sigma(\textbf w^T \phi)^2 = y(1-y) \end{aligned}$$
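A sketch of the IRLS update above (helper names are mine; a small jitter is added to the diagonal of $\textbf R$ for numerical stability):

```python
import numpy as np

def irls_logistic(Phi, t, n_iters=20):
    """Phi: (N, M) design matrix; t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))   # y_n = sigma(w^T phi_n)
        r = y * (1 - y) + 1e-10              # diagonal of R, recomputed each iteration
        z = Phi @ w - (y - t) / r            # z = Phi w - R^{-1}(y - t)
        A = Phi.T @ (r[:, None] * Phi)       # Phi^T R Phi
        b = Phi.T @ (r * z)                  # Phi^T R z
        w = np.linalg.solve(A, b)            # w_new = (Phi^T R Phi)^{-1} Phi^T R z
    return w
```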
Multi-class logistic regression:
$$p(\mathcal C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where
$$a_k = \textbf w_k^T \phi$$
Differentiating,
$$\frac{\partial y_k}{\partial a_j} = y_k(I_{kj}-y_j)$$
Using MLE, the loss function is the multi-class cross-entropy
$$E(\textbf w_1, ...,\textbf w_K) = -\ln p(\textbf T| \textbf w_1, ..., \textbf w_K)=-\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}$$
whose gradient is
$$\nabla _{\textbf w_j} E(\textbf w_1, ...,\textbf w_K) = \sum_{n=1}^N (y_{nj}-t_{nj})\phi_n$$
Note that this has the same form as the gradient-descent updates for linear regression and binary logistic regression.
For the corresponding IRLS algorithm, the blocks of the Hessian are
$$\nabla_{\textbf w_k}\nabla_{\textbf w_j} E(\textbf w_1, ..., \textbf w_K) =\sum_{n=1}^N y_{nk}(I_{kj}-y_{nj})\phi_n \phi_n^T$$
Note that the Hessian is very large, $MK\times MK$, where $M$ is the number of parameters per class, and it is again positive semi-definite.
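A sketch of plain batch gradient descent on the multi-class cross-entropy using the gradient above (IRLS would use the Hessian blocks instead; helper names are mine):

```python
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)    # subtract max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def fit_softmax_gd(Phi, T, lr=0.1, n_iters=1000):
    """Phi: (N, M) design matrix; T: (N, K) one-hot targets."""
    W = np.zeros((Phi.shape[1], T.shape[1]))
    for _ in range(n_iters):
        Y = softmax(Phi @ W)                # y_nk
        grad = Phi.T @ (Y - T)              # (M, K); column j is grad_{w_j} E
        W -= lr * grad
    return W
```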
Instead of the logistic sigmoid, the activation can be the probit function,
$$\Phi(\textbf w^T\phi)=\int_{-\infty} ^ {\textbf w^T\phi} \mathcal N(\theta|0, 1)d\theta$$
The resulting model is called probit regression.
If the labels in a binary dataset are flipped with probability $\epsilon$, the predictive probability can be modelled as
$$p(t|\textbf x) = (1-\epsilon)\sigma(\textbf x)+\epsilon(1-\sigma(\textbf x))=\epsilon + (1-2\epsilon)\sigma(\textbf x)$$
where $\epsilon$ can be treated either as a parameter or as a hyperparameter.
This is much like a basic version of the Label-Noise Robust Generative Adversarial Networks paper.
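A tiny, purely illustrative sketch of this mislabel-robust likelihood written as a negative log-likelihood that one could minimize over $\textbf w$ (and optionally $\epsilon$); the function name is mine:

```python
import numpy as np

def robust_bernoulli_nll(a, t, eps):
    """a: (N,) activations w^T x; t: (N,) labels in {0,1}; eps: mislabel probability."""
    p1 = eps + (1 - 2 * eps) / (1 + np.exp(-a))   # p(t=1|x) under the noise model
    p = np.where(t == 1, p1, 1 - p1)
    return -np.sum(np.log(p + 1e-12))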
Canonical link functions. The target distributions of linear regression and logistic regression are both special cases of the exponential family; note that here the exponential family is over $t$, unlike in Section 4.2 where it was over $\textbf x$:
$$p(t|\eta, s)=\frac{1}{s}h\left(\frac{t}{s}\right)g(\eta)\exp\left\{\frac{\eta t}{s}\right\}$$
From Chapter 2 we have the result
$$y\equiv \mathbb E[t|\eta]=-s\frac{d}{d\eta}\ln g(\eta)$$
so $y$ and $\eta$ are related; write this relation as $\eta(y)$, with $y=f(\textbf w^T \phi)$.
ln p ( t ∣ η , s ) = ∑ n = 1 N { ln g ( η n ) + η n t n s } + const \ln p(\textbf t|\eta, s)=\sum_{n=1}^N \left \{ \ln g(\eta_n)+\frac{\eta_n t_n }{s} \right \} + \text{const} lnp(t∣η,s)=n=1∑N{lng(ηn)+sηntn}+const
从而
∇ w ln p ( t ∣ η , s ) = ∑ n = 1 N { d d η n ln g ( η n ) + t n s } d η n d y n d y n d ( w T ϕ n ) ∇ ( w T ϕ n ) = ∑ n = 1 N 1 s { t n − y n } η ′ ( y n ) f ′ ( w T ϕ n ) ϕ n \nabla_{\textbf w} \ln p(\textbf t|\eta, s)=\sum_{n=1}^{N}\left \{ \frac{d}{d\eta_n}\ln g(\eta_n) + \frac{t_n}{s} \right \}\frac{d\eta_n}{dy_n}\frac{dy_n}{d (\textbf w^T \phi_n)}\nabla (\textbf w^T \phi_n)=\sum_{n=1}^N \frac{1}{s}\{t_n-y_n\}\eta'(y_n)f'(\textbf w^T \phi_n)\phi_n ∇wlnp(t∣η,s)=n=1∑N{dηndlng(ηn)+stn}dyndηnd(wTϕn)dyn∇(wTϕn)=n=1∑Ns1{tn−yn}η′(yn)f′(wTϕn)ϕn
如果 f − 1 = η f^{-1}=\eta f−1=η,那么 η ′ ( y n ) f ′ ( w T ϕ n ) = 1 \eta'(y_n) f'(\textbf w^T \phi_n)=1 η′(yn)f′(wTϕn)=1,注意此时, η = w T ϕ \eta=\textbf w ^T \phi η=wTϕ. 得到
∇ ln E ( w ) = 1 s ∑ n = 1 N ( y n − t n ) ϕ n \nabla \ln E(\textbf w) = \frac{1}{s}\sum_{n=1}^N (y_n-t_n)\phi_n ∇lnE(w)=s1n=1∑N(yn−tn)ϕn
对于高斯函数 s = β − 1 s=\beta^{-1} s=β−1,对于逻辑回归, s = 1 s=1 s=1
The exponential family really is remarkable!
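As a small illustration of why this is convenient: with the canonical link, a single stochastic gradient step has exactly the same form whether $f$ is the identity (linear regression) or the sigmoid (logistic regression). A sketch under those assumptions (names are mine):

```python
import numpy as np

def sgd_step_glm(w, phi_n, t_n, f, lr=0.1, s=1.0):
    """One SGD step using grad E = (1/s) * (y_n - t_n) * phi_n."""
    y_n = f(w @ phi_n)                       # y = f(w^T phi)
    return w + lr * (t_n - y_n) * phi_n / s

# e.g. f = lambda a: a                       -> linear regression (s = 1/beta)
#      f = lambda a: 1 / (1 + np.exp(-a))    -> logistic regression (s = 1)
```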
The Laplace approximation can be used to approximate a distribution. For a distribution
$$p(\textbf z)=\frac{1}{Z}f(\textbf z)$$
the approximation can be built without knowing $Z$. The method is simple: find a mode,
$$\nabla f(\textbf z_0)=0$$
and Taylor-expand the log of $f$ around it:
$$\ln f(\textbf z)\simeq \ln f(\textbf z_0)-\frac{1}{2} (\textbf z - \textbf z_0)^T\textbf A(\textbf z-\textbf z_0)$$
where the Hessian is
$$\textbf A=-\nabla^2 \ln f(\textbf z)|_{\textbf z=\textbf z_0}$$
so that
$$f(\textbf z)\simeq f(\textbf z_0)\exp\left\{ -\frac{1}{2}(\textbf z-\textbf z_0)^T\textbf A(\textbf z-\textbf z_0) \right\}$$
The approximating distribution is then
$$q(\textbf z)=\mathcal N(\textbf z|\textbf z_0, \textbf A^{-1})$$
Note that $\textbf A$ must be positive definite, i.e. $\textbf z_0$ must be a (local) maximum.
This also gives a way to estimate $Z$:
$$\begin{aligned} Z &=\int f(\textbf z)d\textbf z \\ &\simeq f(\textbf z_0)\int \exp \left \{ -\frac{1}{2}(\textbf z-\textbf z_0)^T\textbf A (\textbf z- \textbf z_0) \right \}d \textbf z \\ & = f(\textbf z_0)\frac{(2\pi)^{M/2}}{|\textbf A|^{1/2}} \end{aligned}$$
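A one-dimensional numeric sketch of the whole recipe; the example unnormalized density (a Gaussian times a sigmoid) and the crude grid search for the mode are my own choices:

```python
import numpy as np

def laplace_1d(log_f, grid, h=1e-4):
    """Return the mode z0, A = -(ln f)''(z0), and the estimate of Z (M = 1 case)."""
    z0 = grid[np.argmax([log_f(z) for z in grid])]               # crude mode search
    A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2  # finite-difference 2nd derivative
    Z = np.exp(log_f(z0)) * np.sqrt(2 * np.pi / A)               # f(z0) * (2*pi)^{1/2} / A^{1/2}
    return z0, A, Z

log_f = lambda z: -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-4 * z)))
print(laplace_1d(log_f, np.linspace(-5, 5, 2001)))
```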
For the model evidence
$$p(\mathcal D) = \int p(\mathcal D|\theta)p(\theta)d\theta$$
take $f(\theta)=p(\mathcal D|\theta)p(\theta)$ and $Z=p(\mathcal D)$; the distribution being approximated is then $p(\theta|\mathcal D)$, and
$$\ln p(\mathcal D) \simeq \ln p(\mathcal D|\theta_{MAP})+\ln p(\theta_{MAP}) + \frac{M}{2}\ln(2\pi) -\frac{1}{2}\ln |\textbf A| \tag{1}$$
The last three terms on the right-hand side are called the Occam factor; they penalize model complexity.
If the prior over the parameters is assumed to be very broad and $\textbf A$ has full rank, this can be approximated, very roughly, as
$$\ln p(\mathcal D) \simeq \ln p(\mathcal D|\theta _{MAP}) - \frac{1}{2}M\ln N$$
where $N$ is the number of samples and $M$ the number of parameters. This is the well-known Bayesian Information Criterion (BIC).
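A trivial sketch of using this rough approximation to score models for comparison (the function name is mine; the inputs would come from your own fitted models):

```python
import numpy as np

def bic_log_evidence(log_lik_at_map, n_params, n_samples):
    """ln p(D) ~= ln p(D | theta_MAP) - (M/2) * ln N."""
    return log_lik_at_map - 0.5 * n_params * np.log(n_samples)
```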
For Bayesian logistic regression, exact Bayesian inference is intractable, so we use the Laplace approximation. Assume a Gaussian prior
$$p(\textbf w) = \mathcal N(\textbf w| \textbf m_0, \textbf S_0)$$
Then
$$p(\textbf w|\textbf t) \propto p(\textbf w)p(\textbf t|\textbf w)$$
where $\textbf t = (t_1,...,t_N)^T$. Taking logs,
$$\ln p(\textbf w|\textbf t) = -\frac{1}{2}(\textbf w- \textbf m_0)^T \textbf S_0^{-1} (\textbf w- \textbf m_0) + \sum_{n=1}^N [t_n \ln y_n + (1-t_n)\ln (1-y_n)] + \text{const}$$
The second derivatives give the precision of the Laplace approximation,
$$\textbf S_N^{-1} = -\nabla^2 \ln p(\textbf w|\textbf t)=\textbf S_0^{-1} + \sum_{n=1}^N y_n(1-y_n)\phi_n \phi _n^T$$
so the approximate posterior is
$$q(\textbf w) = \mathcal N(\textbf w | \textbf w_{MAP}, \textbf S_N)$$
For prediction,
$$\begin{aligned} p(\mathcal C_1|\phi, \textbf t) &= \int p(\mathcal C_1|\phi, \textbf w)p(\textbf w| \textbf t) d\textbf w \simeq \int \sigma(\textbf w^T \phi)q(\textbf w)d \textbf w \\ &=\int \left[\int \delta(a-\textbf w^T\phi)\sigma(a)da \right]q(\textbf w)d \textbf w =\int \sigma(a)p(a)da \end{aligned}$$
where $p(a)=\int \delta(a-\textbf w^T\phi)q(\textbf w)d\textbf w$.
Since $a$ is a linear function of $\textbf w$ and $q$ is Gaussian, $p(a)$ is also Gaussian.
Moreover,
$$\begin{aligned} \mu_a &= \mathbb E[a]=\int p(a)a\, d a=\int q(\textbf w) \textbf w^T\phi\, d\textbf w = \textbf w^T_{MAP} \phi \\ \sigma^2_a & = \operatorname{var}[a]=\int p(a)\{ a^2 - \mathbb E [a]^2 \}\, da=\int q(\textbf w)\{(\textbf w^T\phi)^2- (\textbf w^T_{MAP} \phi)^2 \}\,d\textbf w = \phi^T \textbf S_N \phi \end{aligned}$$
so
$$p(\mathcal C_1|\phi, \textbf t) = \int \sigma(a) \mathcal N(a|\mu_a, \sigma^2_a)da$$
This integral has no closed form, but $\sigma(a)$ can be approximated by the probit function $\Phi(\lambda a)$ with $\lambda^2 = \pi/8$, for which the integral is analytic:
$$\int \Phi(\lambda a)\mathcal N(a|\mu_a, \sigma^2_a)da = \Phi \left (\frac{\mu_a}{(\lambda^{-2}+\sigma^2_a)^{1/2}}\right)$$
Applying the approximation once more to map back to a sigmoid gives $\Phi \left (\frac{\mu_a}{(\lambda^{-2}+\sigma^2_a)^{1/2}}\right) \approx \sigma(\kappa(\sigma_a^2)\mu_a)$, where $\kappa(\sigma_a^2)=(1+\pi \sigma^2_a /8)^{-1/2}$. That is,
$$p(\mathcal C_1|\phi, \textbf t) \approx \sigma(\kappa(\sigma^2_a)\mu_a)$$
This agrees with the result on page 180 of the CVMLI book!
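Putting the prediction step together, a minimal sketch (assuming `w_map` and `S_N` come from the Laplace fit above; the function name is mine):

```python
import numpy as np

def predictive_prob(phi, w_map, S_N):
    """Approximate p(C1 | phi, t) = sigma(kappa(sigma_a^2) * mu_a)."""
    mu_a = w_map @ phi                                   # mu_a = w_MAP^T phi
    sigma_a2 = phi @ S_N @ phi                           # sigma_a^2 = phi^T S_N phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)  # kappa(sigma_a^2)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))
```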
References:
[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. 2006