《统计学习方法》 (Statistical Learning Methods), Chapter 6: Logistic Regression and Maximum Entropy Models

Logistic Regression Models

The Logistic Distribution

  • Definition: Let $X$ be a continuous random variable. $X$ follows a logistic distribution if $X$ has the following distribution function and density function:
           $F(x)=P(X\le x)=\dfrac{1}{1+e^{-(x-\mu)/\gamma}}$
           $f(x)=F^\prime(x)=\dfrac{e^{-(x-\mu)/\gamma}}{\gamma\left(1+e^{-(x-\mu)/\gamma}\right)^2}$
    Here $\mu$ is the location parameter and $\gamma>0$ is the shape parameter. The distribution function is symmetric about the point $(\mu,\frac{1}{2})$, i.e. it satisfies $F(-x+\mu)-\frac{1}{2}=-F(x+\mu)+\frac{1}{2}$. The curve grows fastest near the center and slowly in the tails; the smaller $\gamma$ is, the faster the growth near the center.

The Binomial Logistic Regression Model

  • Definition: the binomial logistic regression model is the following conditional probability distribution:
           $P(y=1|x)=\dfrac{\exp(w\cdot x+b)}{1+\exp(w\cdot x+b)}$

           $P(y=0|x)=\dfrac{1}{1+\exp(w\cdot x+b)}$
    Here $x \in R^n$ is the input, $y \in \{0,1\}$ is the output, and $w \in R^n$, $b \in R$ are parameters: $w$ is called the weight vector, $b$ the bias, and $w \cdot x$ is the inner product of $w$ and $x$. If we augment $w=(w^1,w^2,...,w^n,b)^T$ and $x=(x^1,x^2,...,x^n,1)^T$, the model simplifies to $P(y=1|x)=\dfrac{\exp(w\cdot x)}{1+\exp(w\cdot x)}$.
  • Odds: the ratio of the probability that an event occurs to the probability that it does not, $\frac{p}{1-p}$; the log-odds are $\mathrm{logit}(p)=\log\frac{p}{1-p}$.
      For the logistic regression model, $\log \dfrac{P(y=1|x)}{1-P(y=1|x)}=w \cdot x$
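A small sketch of the model with the bias folded into an augmented weight vector; the weight values here are made up for illustration:

```python
import math

# Binomial logistic regression with augmented w and x
# (bias is the last weight, paired with a constant 1 feature).
def p_y1(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return math.exp(z) / (1.0 + math.exp(z))

w = [0.8, -0.4, 0.1]   # (w^1, w^2, b), hypothetical values
x = [1.5, 2.0, 1.0]    # (x^1, x^2, 1), last entry is the constant 1
p = p_y1(w, x)

# The log-odds recover the linear form w . x
logit = math.log(p / (1.0 - p))
assert abs(logit - sum(wi * xi for wi, xi in zip(w, x))) < 1e-9
```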

Parameter Estimation

Let
       $P(y=1|x)=\pi(x),\quad P(y=0|x)=1-\pi(x)$
The likelihood function is
       $\prod\limits_{i=1}^N[\pi(x_i)]^{y_i}[1-\pi(x_i)]^{1-y_i}$
The log-likelihood function is
$L(w)=\sum\limits_{i=1}^N[y_i\log \pi(x_i)+(1-y_i)\log (1-\pi(x_i))]$
$=\sum\limits_{i=1}^N[y_i\log \frac{\pi (x_i)}{1-\pi(x_i)}+\log (1-\pi(x_i))]$
$=\sum\limits_{i=1}^N[y_i(w \cdot x_i)-\log (1+\exp(w \cdot x_i))]$
Maximizing $L(w)$ yields the estimate of $w$.

Multinomial Logistic Regression

       $P(y=k|x)=\dfrac{\exp(w_k \cdot x)}{1+\sum\limits_{k=1}^{K-1}\exp(w_k \cdot x)}$, where $k=1,2,...,K-1$

       $P(y=K|x)=\dfrac{1}{1+\sum\limits_{k=1}^{K-1}\exp(w_k \cdot x)}$
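A sketch of these $K$ probabilities, with class $K$ as the reference class; the weight vectors are hypothetical:

```python
import math

# Multinomial logistic regression probabilities for K classes,
# parameterized by K-1 weight vectors w_1 .. w_{K-1}.
def multinomial_probs(W, x):
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    denom = 1.0 + sum(scores)          # class K contributes exp(0) = 1
    return [s / denom for s in scores] + [1.0 / denom]

W = [[0.5, -0.2], [-0.3, 0.4]]   # made-up w_1, w_2 for K = 3 classes
x = [1.0, 2.0]
probs = multinomial_probs(W, x)
assert len(probs) == 3
assert abs(sum(probs) - 1.0) < 1e-12
```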

Maximum Entropy Models

The maximum entropy model is derived from the maximum entropy principle.

The Maximum Entropy Principle

The maximum entropy principle states that, when learning a probabilistic model, among all possible models the one with the largest entropy is the best.
$H(P)=-\sum\limits_x P(x)\log P(x)$, satisfying $0 \le H(P) \le \log |X|$, where $|X|$ is the number of values $X$ can take. Under the given constraints, the best model is the one that treats the undetermined events as equally likely.
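The entropy bound and the "equally likely is best" intuition can be checked numerically on small hand-picked distributions:

```python
import math

# Entropy H(P) = -sum_x P(x) log P(x), with 0 <= H(P) <= log|X|;
# the upper bound is attained by the uniform distribution.
def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.7, 0.1, 0.1, 0.1]
assert abs(entropy(uniform) - math.log(4)) < 1e-12   # H = log|X|
assert entropy(skewed) < entropy(uniform)            # less uniform, less entropy
assert entropy([1.0, 0.0, 0.0, 0.0]) == 0.0          # fully determined, H = 0
```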

Definition of the Maximum Entropy Model

Assume the classification model is a conditional probability distribution $P(Y|X)$. Given a training data set
$T=\{ (x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$, the learning goal is to select the best model by the maximum entropy principle.
First determine the empirical distributions of $P(X,Y)$ and $P(X)$:
       $\hat{P}(X=x,Y=y)=\dfrac{v(X=x,Y=y)}{N}$
       $\hat{P}(X=x)=\dfrac{v(X=x)}{N}$
where $v(X=x)$ is the number of samples with $X=x$, and $v(X=x,Y=y)$ is defined analogously.
A feature function $f(x,y)$ describes some fact between the input $x$ and the output $y$:
$f(x,y)=\begin{cases} 1 & x,y \text{ satisfy the fact}\\ 0 & \text{otherwise} \end{cases}$
The expectation of the feature function with respect to the empirical distribution $\hat{P}(X,Y)$, denoted $E_{\hat{P}}(f)$, is
       $E_{\hat{P}}(f)=\sum\limits_{x,y}\hat{P}(x,y)f(x,y)$
The expectation of the feature function with respect to the model $P(Y|X)$ and the empirical distribution $\hat{P}(X)$, denoted $E_{P}(f)$, is
       $E_{P}(f)=\sum\limits_{x,y}\hat{P}(x)P(y|x)f(x,y)$
If the model is able to capture the information in the training data, then
       $E_{\hat{P}}(f)=E_{P}(f)$
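The two expectations can be computed directly from counts; the sample, the feature, and the stand-in uniform model $P(y|x)=0.5$ below are all made up for illustration:

```python
from collections import Counter

# Empirical distributions and feature expectations on a toy sample.
sample = [('a', 1), ('a', 1), ('a', 0), ('b', 1), ('b', 0), ('b', 0)]
N = len(sample)
p_xy = {k: v / N for k, v in Counter(sample).items()}                # P^(X=x, Y=y)
p_x  = {k: v / N for k, v in Counter(x for x, _ in sample).items()}  # P^(X=x)

def f(x, y):
    # Hypothetical feature: fires iff x == 'a' and y == 1
    return 1 if (x == 'a' and y == 1) else 0

# E_P^(f) = sum_{x,y} P^(x,y) f(x,y): ('a',1) occurs twice in six samples
e_emp = sum(p * f(x, y) for (x, y), p in p_xy.items())
assert abs(e_emp - 2 / 6) < 1e-12

# E_P(f) = sum_{x,y} P^(x) P(y|x) f(x,y), with a uniform stand-in model
e_model = sum(p_x[x] * 0.5 * f(x, y) for x in p_x for y in (0, 1))
assert abs(e_model - 0.25) < 1e-12
```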

  • Definition
    Let the set of models satisfying all the constraints be
    $C=\{P \in \mathcal{P} \mid E_{\hat{P}}(f_i)=E_{P}(f_i),\ i=1,2,...,n\}$
    The conditional entropy defined on the conditional distribution $P(Y|X)$ is
    $H(P)=-\sum\limits_{x,y}\hat{P}(x)P(y|x)\log P(y|x)$
    The model in $C$ with the largest conditional entropy is called the maximum entropy model.

Learning the Maximum Entropy Model

$\min\limits_{P \in C}\quad -H(P)=\sum\limits_{x,y}\hat{P}(x)P(y|x)\log P(y|x)$
$\text{s.t.}$
       $E_{\hat{P}}(f_i)=E_{P}(f_i)$
       $\sum\limits_y P(y|x)=1$
Introduce Lagrange multipliers $w_0,w_1,...,w_n$ and define the Lagrangian $L(P,w)$:
$L(P,w)=-H(P)+w_0\big(1-\sum\limits_y P(y|x)\big)+\sum\limits_{i=1}^n w_i\big(E_{\hat{P}}(f_i)-E_{P}(f_i)\big)$
$=\sum\limits_{x,y}\hat{P}(x)P(y|x)\log P(y|x)+w_0\big(1-\sum\limits_y P(y|x)\big)+\sum\limits_{i=1}^n w_i\Big(\sum\limits_{x,y}\hat{P}(x,y)f_i(x,y)-\sum\limits_{x,y}\hat{P}(x)P(y|x)f_i(x,y)\Big)$
The primal problem is $\min\limits_{P \in C}\ \max\limits_w\ L(P,w)$
and the dual problem is $\max\limits_{w}\ \min\limits_{P \in C}\ L(P,w)$
$\dfrac{\partial L(P,w)}{\partial P(y|x)}=\sum\limits_{x,y}\hat{P}(x)\big(\log P(y|x)+1\big)-\sum\limits_y w_0-\sum\limits_{x,y}\Big(\hat{P}(x)\sum\limits_{i=1}^n w_i f_i(x,y)\Big)$
$=\sum\limits_{x,y}\hat{P}(x)\Big(\log P(y|x)+1-w_0-\sum\limits_{i=1}^n w_i f_i(x,y)\Big)=0$
Hence
$P(y|x)=\exp\Big(\sum\limits_{i=1}^n w_i f_i(x,y)+w_0-1\Big)=\dfrac{\exp\big(\sum\limits_{i=1}^n w_i f_i(x,y)\big)}{\exp(1-w_0)}$
Combined with
$\sum\limits_y P(y|x)=1$
this gives
       $P_w(y|x)=\dfrac{1}{Z_w(x)}\exp\Big(\sum\limits_{i=1}^n w_i f_i(x,y)\Big)$
where
$Z_w(x)=\sum\limits_y \exp\Big(\sum\limits_{i=1}^n w_i f_i(x,y)\Big)$
Substituting this back into $L(P,w)$, it only remains to maximize over $w$.
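The form $P_w(y|x)$ can be sketched on a toy problem; the two binary features and the weights below are made-up illustrations:

```python
import math

# Maximum entropy model P_w(y|x) = exp(sum_i w_i f_i(x,y)) / Z_w(x)
YS = [0, 1]
features = [lambda x, y: 1 if (x == 'a' and y == 1) else 0,
            lambda x, y: 1 if (x == 'b' and y == 0) else 0]
w = [1.2, 0.7]   # hypothetical weights

def p_w(y, x):
    score = lambda yy: math.exp(sum(wi * fi(x, yy) for wi, fi in zip(w, features)))
    return score(y) / sum(score(yy) for yy in YS)   # denominator is Z_w(x)

# Z_w(x) normalizes over y, so probabilities sum to 1 for each x
for x in ('a', 'b'):
    assert abs(sum(p_w(y, x) for y in YS) - 1.0) < 1e-12
# The positive weight on the first feature pushes mass toward y = 1 when x = 'a'
assert p_w(1, 'a') > p_w(0, 'a')
```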

Maximum Likelihood Estimation

The log-likelihood of the conditional model with respect to the empirical distribution is
$L_{\hat{P}}(P_w)=\log \prod\limits_{x,y}P(y|x)^{\hat{P}(x,y)}=\sum\limits_{x,y}\hat{P}(x,y)\log P(y|x)$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n w_i f_i(x,y)-\sum\limits_{x,y}\hat{P}(x,y)\log Z_w(x)$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n w_i f_i(x,y)-\sum\limits_{x}\hat{P}(x)\log Z_w(x)$
On the other hand, the dual function is
$\psi(w)=\sum\limits_{x,y}\hat{P}(x)P_w(y|x)\log P_w(y|x)+\sum\limits_{i=1}^n w_i\Big(\sum\limits_{x,y}\hat{P}(x,y)f_i(x,y)-\sum\limits_{x,y}\hat{P}(x)P_w(y|x)f_i(x,y)\Big)$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n w_i f_i(x,y)+\sum\limits_{x,y}\hat{P}(x)P_w(y|x)\Big(\log P_w(y|x)-\sum\limits_{i=1}^n w_i f_i(x,y)\Big)$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n w_i f_i(x,y)-\sum\limits_{x,y}\hat{P}(x)P_w(y|x)\log Z_w(x)$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n w_i f_i(x,y)-\sum\limits_{x}\hat{P}(x)\log Z_w(x)$

Therefore, for the maximum entropy model, maximizing the dual function is equivalent to maximum likelihood estimation.

Optimization Algorithms for Model Learning

  1. Improved iterative scaling (IIS)
  2. Gradient descent
  3. Newton's method
  4. Quasi-Newton methods

Improved Iterative Scaling (IIS)

We want to find $\varrho=(\varrho_1,\varrho_2,...,\varrho_n)$ such that $w+\varrho$ is better than $w$.
$L(w+\varrho)-L(w)=\sum\limits_{x,y}\hat{P}(x,y)\log P_{w+\varrho}(y|x)-\sum\limits_{x,y}\hat{P}(x,y)\log P_w(y|x)$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n\varrho_i f_i(x,y)-\sum\limits_{x}\hat{P}(x)\log \dfrac{Z_{w+\varrho}(x)}{Z_w(x)}$

Using $-\log a \ge 1-a$ for $a>0$:

$L(w+\varrho)-L(w)\ge \sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n\varrho_i f_i(x,y)+1-\sum\limits_{x}\hat{P}(x)\dfrac{Z_{w+\varrho}(x)}{Z_w(x)}$
$=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n\varrho_i f_i(x,y)+1-\sum\limits_{x}\hat{P}(x)\sum\limits_{y}P_w(y|x)\exp \sum\limits_{i=1}^n\varrho_i f_i(x,y)$
Denote the right-hand side by
$A(\varrho|w)=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n\varrho_i f_i(x,y)+1-\sum\limits_{x}\hat{P}(x)\sum\limits_{y}P_w(y|x)\exp \sum\limits_{i=1}^n\varrho_i f_i(x,y)$
Then
$L(w+\varrho)-L(w)\ge A(\varrho|w)$
Since $\varrho$ is a vector, it is hard to optimize all of its components at once, so we relax the bound further.
Define
$f^\#(x,y)=\sum\limits_{i}f_i(x,y)$
$A(\varrho|w)=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n\varrho_i f_i(x,y)+1-\sum\limits_x\hat{P}(x)\sum\limits_{y}P_w(y|x)\exp\Big(f^\#(x,y)\sum\limits_{i=1}^n\dfrac{\varrho_i f_i(x,y)}{f^\#(x,y)}\Big)$
Since $\dfrac{f_i(x,y)}{f^\#(x,y)}\ge0$ and $\sum\limits_{i=1}^n\dfrac{f_i(x,y)}{f^\#(x,y)}=1$,
by Jensen's inequality
$\exp\Big(\sum\limits_{i=1}^n\dfrac{f_i(x,y)}{f^\#(x,y)}\varrho_i f^\#(x,y)\Big)\le\sum\limits_{i=1}^n\dfrac{f_i(x,y)}{f^\#(x,y)}\exp(\varrho_i f^\#(x,y))$
Hence $A(\varrho|w)\ge B(\varrho|w)=\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n\varrho_i f_i(x,y)+1-\sum\limits_x\hat{P}(x)\sum\limits_{y}P_w(y|x)\sum\limits_{i=1}^n\dfrac{f_i(x,y)}{f^\#(x,y)}\exp(\varrho_i f^\#(x,y))$
and we obtain
$L(w+\varrho)-L(w)\ge B(\varrho|w)$

Setting

$\dfrac{\partial B(\varrho|w)}{\partial \varrho_i}=\sum\limits_{x,y}\hat{P}(x,y)f_i(x,y)-\sum\limits_x\hat{P}(x)\sum\limits_y P_w(y|x)f_i(x,y)\exp(\varrho_i f^\#(x,y))=0$

gives

$\sum\limits_x\hat{P}(x)\sum\limits_y P_w(y|x)f_i(x,y)\exp(\varrho_i f^\#(x,y))=E_{\hat{P}}(f_i)$

  • Algorithm (IIS)
    Input: feature functions $f_1,f_2,...,f_n$; empirical distribution $\hat{P}(X,Y)$; model $P_w(y|x)$
    Output: optimal parameters $w^*$
    $(1)$ For all $i \in \{1,2,...,n\}$, set $w_i=0$
    $(2)$ For each $i$, solve the equation
    $\sum\limits_x\hat{P}(x)\sum\limits_y P_w(y|x)f_i(x,y)\exp(\varrho_i f^\#(x,y))=E_{\hat{P}}(f_i)$
    for $\varrho_i$ and update $w_i \longleftarrow w_i+\varrho_i$
    $(3)$ If not all $w_i$ have converged, repeat $(2)$
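The algorithm above can be sketched on a toy problem. In the special case where $f^\#(x,y)$ equals a constant $M$ for every $(x,y)$, the equation in step (2) has the closed-form solution $\varrho_i=\frac{1}{M}\log\frac{E_{\hat{P}}(f_i)}{E_{P}(f_i)}$; the data and features below are made up so that this holds with $M=1$:

```python
import math

# IIS sketch: exactly one of the two features fires for every (x, y),
# so f#(x, y) = M = 1 and each step has a closed form.
XS, YS = ['a', 'b'], [0, 1]
sample = [('a', 1), ('a', 1), ('b', 0), ('b', 1)]
features = [lambda x, y: 1 if y == 1 else 0,
            lambda x, y: 1 if y == 0 else 0]
M = 1

N = len(sample)
p_x = {x: sum(1 for xx, _ in sample if xx == x) / N for x in XS}
e_emp = [sum(f(x, y) for x, y in sample) / N for f in features]  # E_emp(f_i)

def p_w(w, y, x):
    s = lambda yy: math.exp(sum(wi * fi(x, yy) for wi, fi in zip(w, features)))
    return s(y) / sum(s(yy) for yy in YS)

def model_expectations(w):
    return [sum(p_x[x] * p_w(w, y, x) * f(x, y) for x in XS for y in YS)
            for f in features]

w = [0.0, 0.0]
for _ in range(100):
    e_model = model_expectations(w)
    # step (2): rho_i = (1/M) * log(E_emp(f_i) / E_model(f_i)), then w_i += rho_i
    w = [wi + math.log(ee / em) / M for wi, ee, em in zip(w, e_emp, e_model)]

# At convergence the constraints E_emp(f_i) = E_model(f_i) hold.
e_model = model_expectations(w)
assert all(abs(ee - em) < 1e-6 for ee, em in zip(e_emp, e_model))
```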

Quasi-Newton Method

For the maximum entropy model, the objective function is
$\min\limits_{w \in R^n}\quad f(w)=\sum\limits_x\hat{P}(x)\log \sum\limits_y\exp\Big(\sum\limits_{i=1}^n w_i f_i(x,y)\Big)-\sum\limits_{x,y}\hat{P}(x,y)\sum\limits_{i=1}^n w_i f_i(x,y)$
and its gradient is
$g(w)=\Big(\dfrac{\partial f(w)}{\partial w_1},\dfrac{\partial f(w)}{\partial w_2},...,\dfrac{\partial f(w)}{\partial w_n}\Big)^T$

$\dfrac{\partial f(w)}{\partial w_i}=\sum\limits_{x,y}\hat{P}(x)P_w(y|x)f_i(x,y)-E_{\hat{P}}(f_i)$

  • Algorithm (BFGS)
    Input: feature functions $f_1,f_2,...,f_n$; empirical distribution $\hat{P}(x,y)$; objective function $f(w)$; gradient $g(w)=\nabla f(w)$; precision $\varepsilon$
    Output: optimal parameters $w^*$
    $(1)$ Choose an initial point $w^{(0)}$, let $B_0$ be a positive-definite symmetric matrix, and set $k=0$
    $(2)$ Compute $g_k=g(w^{(k)})$. If $\|g_k\|<\varepsilon$, stop and take $w^*=w^{(k)}$; otherwise go to $(3)$
    $(3)$ Solve $B_k p_k=-g_k$ for $p_k$
    $(4)$ One-dimensional search for $\lambda_k$:
    $f(w^{(k)}+\lambda_k p_k)=\min\limits_{\lambda\ge0}f(w^{(k)}+\lambda p_k)$
    $(5)$ Set $w^{(k+1)}=w^{(k)}+\lambda_k p_k$
    $(6)$ Compute $g_{k+1}=g(w^{(k+1)})$. If $\|g_{k+1}\|<\varepsilon$, stop and take $w^*=w^{(k+1)}$; otherwise compute
    $B_{k+1}=B_k+\dfrac{y_k y_k^T}{y_k^T\varrho_k}-\dfrac{B_k\varrho_k\varrho_k^T B_k}{\varrho_k^T B_k\varrho_k}$
    where $y_k=g_{k+1}-g_k$ and $\varrho_k=w^{(k+1)}-w^{(k)}$
    $(7)$ Set $k=k+1$ and go to $(3)$
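The iteration can be sketched on a simple convex quadratic standing in for the max-ent objective $f(w)$; the test function is made up, and the crude backtracking loop is a simplification of the exact one-dimensional minimization in step (4):

```python
import numpy as np

# BFGS sketch on f(w) = 0.5*(3*w0^2 + w1^2) + w0*w1 (minimum at w = 0).
def f(w):
    return 0.5 * (3 * w[0] ** 2 + w[1] ** 2) + w[0] * w[1]

def g(w):
    return np.array([3 * w[0] + w[1], w[0] + w[1]])

w = np.array([2.0, -1.5])   # w^(0)
B = np.eye(2)               # B_0: positive-definite symmetric
eps = 1e-8
for k in range(100):
    gk = g(w)
    if np.linalg.norm(gk) < eps:           # step (2): stopping test
        break
    p = np.linalg.solve(B, -gk)            # step (3): B_k p_k = -g_k
    lam = 1.0                              # step (4): crude backtracking search
    while f(w + lam * p) > f(w):
        lam *= 0.5
    w_new = w + lam * p                    # step (5)
    y = g(w_new) - gk                      # step (6): BFGS update of B
    s = w_new - w
    B = B + np.outer(y, y) / (y @ s) - (B @ np.outer(s, s) @ B) / (s @ B @ s)
    w = w_new

assert np.linalg.norm(g(w)) < 1e-6         # converged near the minimizer w = 0
```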
