[Part 1] CS229 (Andrew Ng, Machine Learning) Problem Set 03 (PS03) Solutions (all problems; feedback and corrections are welcome)

1. A Simple Neural Network

(a)

First, write out the forward pass:
$$\begin{aligned} z^{[1]} &=W^{[1]} x+W_{0}^{[1]} \\ h &=\sigma\left(z^{[1]}\right) \\ z^{[2]} &=W^{[2]} h+W_{0}^{[2]} \\ o &=\sigma\left(z^{[2]}\right) \end{aligned}$$
The loss function is:
$$\ell=\frac{1}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right)^{2}=\frac{1}{m} \sum_{i=1}^{m} J^{(i)}$$
For a single sample, apply the chain rule: first differentiate with respect to $W^{[2]}$, then reuse those intermediate results to differentiate with respect to $W^{[1]}$:
$$\frac{\partial J}{\partial W^{[1]}}=\frac{\partial J}{\partial o} \frac{\partial o}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial h} \frac{\partial h}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}}=2(o-y)\, o(1-o)\, x^{T}\, h \cdot(1-h) \cdot W^{[2]}$$
Here "$\cdot$" denotes element-wise multiplication.

Then the derivative with respect to $w^{[1]}_{1,2}$ is the $(1,2)$ entry of the $2 \times 3$ matrix on the right-hand side; adding the superscript $(i)$ to denote the $i$-th sample:
$$\frac{\partial J}{\partial w^{[1]}_{1,2}}=2\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}$$
where $h_{2}=\sigma\left(w_{1,2}^{[1]} x_{1}+w_{2,2}^{[1]} x_{2}+w_{0,2}^{[1]}\right)$.

The derivative of $\ell$ is therefore:
$$\frac{\partial \ell}{\partial w^{[1]}_{1,2}}=\frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}$$
The update rule for $w_{1,2}^{[1]}$ is:
$$w_{1,2}^{[1]}:=w_{1,2}^{[1]}-\alpha \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}$$
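As a quick sanity check of the formula above, here is a small sketch of my own (not part of the assignment): it compares the analytic expression for $\partial \ell / \partial w^{[1]}_{1,2}$ against a finite-difference estimate on random, made-up data and parameters.

```python
import numpy as np

# Compare the analytic gradient formula above with a finite-difference estimate.
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m = 5
X = rng.random((m, 2))           # m samples with features (x1, x2)
y = rng.integers(0, 2, size=m)   # binary labels
W1 = rng.normal(size=(2, 3))     # W1[j-1, k-1] corresponds to w^{[1]}_{j,k}
b1 = rng.normal(size=3)          # W^{[1]}_0
W2 = rng.normal(size=3)          # w^{[2]}_k
b2 = rng.normal()                # W^{[2]}_0

def loss(W1):
    h = sigma(X @ W1 + b1)       # hidden activations, shape (m, 3)
    o = sigma(h @ W2 + b2)       # outputs, shape (m,)
    return np.mean((o - y) ** 2)

# Analytic gradient from the formula above, for (j, k) = (1, 2), i.e. W1[0, 1].
h = sigma(X @ W1 + b1)
o = sigma(h @ W2 + b2)
grad_analytic = np.mean(
    2 * (o - y) * o * (1 - o) * W2[1] * h[:, 1] * (1 - h[:, 1]) * X[:, 0])

# Finite-difference estimate of the same partial derivative.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 1] += eps
W1m[0, 1] -= eps
grad_numeric = (loss(W1p) - loss(W1m)) / (2 * eps)

print(grad_analytic, grad_numeric)   # the two values should agree closely
```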

(b)

Yes, it is possible. The three hidden neurons can be viewed as three independent linear classifiers, each defining a hyperplane along one of the three triangle edges that separate the class-0 and class-1 points in the scatter plot. When a sample passes through one neuron's linear weights, the sign of the result indicates which side of that hyperplane the point lies on, and the step activation turns this directly into a 0/1 output. The linear part of the output layer then checks whether any of the three hidden units flags the point as lying outside its boundary: if none does (the point is inside the triangle), the pre-activation is negative and the final step function outputs class 0; otherwise the output is class 1. 100% accuracy can be guaranteed because each of the three boundaries is a linear boundary that a single hidden unit can represent exactly. One set of such weights is given below:

```python
def optimal_step_weights():
    """Return the optimal weights for the neural network with a step activation function.

    This function will not be graded if there are no optimal weights.
    See the PDF for instructions on what each weight represents.

    The hidden layer weights are notated by [1] on the problem set and
    the output layer weights are notated by [2].

    This function should return a dict with elements for each weight, see example_weights above.

    """
    w = example_weights()

    # *** START CODE HERE ***
    # hyperplane x1 = 0.5
    w['hidden_layer_0_1'] = 0.5
    w['hidden_layer_1_1'] = -1
    w['hidden_layer_2_1'] = 0
    # hyperplane x2 = 0.5
    w['hidden_layer_0_2'] = 0.5
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = -1
    # hyperplane x1 + x2 = 4
    w['hidden_layer_0_3'] = -4
    w['hidden_layer_1_3'] = 1
    w['hidden_layer_2_3'] = 1
    # samples for which all three hidden units output 0 are assigned class 0
    w['output_layer_0'] = -0.5
    w['output_layer_1'] = 1
    w['output_layer_2'] = 1
    w['output_layer_3'] = 1
    # *** END CODE HERE ***

    return w


def example_weights():
    """This is an example function that returns weights.
    Use this function as a template for optimal_step_weights and optimal_sigmoid_weights.
    You do not need to modify this class for this assignment.
    """
    w = {}

    w['hidden_layer_0_1'] = 0
    w['hidden_layer_1_1'] = 0
    w['hidden_layer_2_1'] = 0
    w['hidden_layer_0_2'] = 0
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = 0
    w['hidden_layer_0_3'] = 0
    w['hidden_layer_1_3'] = 0
    w['hidden_layer_2_3'] = 0

    w['output_layer_0'] = 0
    w['output_layer_1'] = 0
    w['output_layer_2'] = 0
    w['output_layer_3'] = 0

    return w
```
```python
example_w = optimal_step_weights()
example_w
```

```
{'hidden_layer_0_1': 0.5,
 'hidden_layer_1_1': -1,
 'hidden_layer_2_1': 0,
 'hidden_layer_0_2': 0.5,
 'hidden_layer_1_2': 0,
 'hidden_layer_2_2': -1,
 'hidden_layer_0_3': -4,
 'hidden_layer_1_3': 1,
 'hidden_layer_2_3': 1,
 'output_layer_0': -0.5,
 'output_layer_1': 1,
 'output_layer_2': 1,
 'output_layer_3': 1}
```
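As a quick check (a sketch of my own, not part of the starter code), the step-activation forward pass can be run with these weights on a few hand-picked points. The step function is assumed to output 1 when its argument is nonnegative, and the expected labels follow the triangle layout described above: points strictly inside the triangle map to 0, points outside to 1.

```python
import numpy as np

def step(z):
    # assumed convention: 1 if z >= 0, else 0
    return (np.asarray(z) >= 0).astype(float)

def forward_step(w, x1, x2):
    # hidden layer: one unit per hyperplane
    h1 = step(w['hidden_layer_0_1'] + w['hidden_layer_1_1'] * x1 + w['hidden_layer_2_1'] * x2)
    h2 = step(w['hidden_layer_0_2'] + w['hidden_layer_1_2'] * x1 + w['hidden_layer_2_2'] * x2)
    h3 = step(w['hidden_layer_0_3'] + w['hidden_layer_1_3'] * x1 + w['hidden_layer_2_3'] * x2)
    # output layer: fires iff at least one hidden unit fires
    return step(w['output_layer_0']
                + w['output_layer_1'] * h1
                + w['output_layer_2'] * h2
                + w['output_layer_3'] * h3)

# hypothetical test points: first two inside the triangle (expect 0), rest outside (expect 1)
for x1, x2 in [(1, 1), (2, 1.5), (0, 0), (4, 4), (0.2, 2)]:
    print((x1, x2), forward_step(example_w, x1, x2))
```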

(c)

No, it is not possible. This is covered in the lecture notes: without a nonlinear activation function at each layer, all layers collapse into what is effectively a single linear operation. In this problem, if the hidden-layer activation is linear (the identity function), then:
$$\begin{aligned} o &=\sigma\left(z^{[2]}\right) \\ &=\sigma\left(W^{[2]} h+W_{0}^{[2]}\right) \\ &=\sigma\left(W^{[2]}\left(W^{[1]} x+W_{0}^{[1]}\right)+W_{0}^{[2]}\right) \\ &=\sigma\left(W^{[2]} W^{[1]} x+W^{[2]} W_{0}^{[1]}+W_{0}^{[2]}\right) \\ &=\sigma\left(\tilde{W} x+\tilde{W}_{0}\right) \end{aligned}$$
where $\tilde{W}=W^{[2]} W^{[1]}$ and $\tilde{W}_{0}=W^{[2]} W_{0}^{[1]}+W_{0}^{[2]}$. This is equivalent to performing a single linear classification, and the scatter plot shows that the dataset is not linearly separable, so the goal of 100% accuracy cannot be achieved.
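A minimal sketch of this collapse argument (my own illustration, with arbitrary random matrices): with an identity hidden activation, the two-layer network computes exactly the same function as a single affine map followed by $\sigma$.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))   # hidden layer
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))   # output layer
x = rng.normal(size=(2, 1))

o_two_layer = sigma(W2 @ (W1 @ x + b1) + b2)    # linear hidden activation

W_tilde = W2 @ W1                               # collapsed weight matrix
b_tilde = W2 @ b1 + b2                          # collapsed bias
o_collapsed = sigma(W_tilde @ x + b_tilde)      # equivalent single linear layer

print(np.allclose(o_two_layer, o_collapsed))    # True
```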

2. KL divergence and Maximum Likelihood

(a)

The key is Jensen's inequality:
$$\begin{aligned} D_{\mathrm{KL}}(P \| Q) &=\sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \\ &=-\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \\ &=E\left[-\log \frac{Q(x)}{P(x)}\right] \\ & \geq-\log E\left[\frac{Q(x)}{P(x)}\right] \\ &=-\log \left(\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)}\right) \\ &=-\log \sum_{x \in \mathcal{X}} Q(x) \\ &=-\log 1 \\ &=0 \end{aligned}$$
For the equality case: when $P=Q$, we have $D_{\mathrm{KL}}(P \| Q)=\sum_{x \in \mathcal{X}} P(x) \log 1=0$.

Conversely, if $D_{\mathrm{KL}}(P \| Q)=0$, then Jensen's inequality above holds with equality. Since $-\log$ is strictly convex, equality requires the ratio $\frac{Q(x)}{P(x)}$ to be constant, so $\frac{Q(x)}{P(x)}=E\left[\frac{Q(x)}{P(x)}\right]=\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)}=\sum_{x \in \mathcal{X}} Q(x)=1$, i.e. $P=Q$.

Therefore, $D_{\mathrm{KL}}(P \| Q)=0$ if and only if $P=Q$.
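A small numeric illustration of this result (my own sketch): for random discrete distributions with full support, the KL divergence is nonnegative, and it is zero when the two distributions coincide.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    # discrete KL divergence, assuming p and q have full support
    return float(np.sum(p * np.log(p / q)))

for _ in range(3):
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    print(kl(p, q) >= 0, np.isclose(kl(p, p), 0.0))   # True True on every draw
```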

(b)

Proof:
$$\begin{aligned} D_{\mathrm{KL}}(P(X, Y) \| Q(X, Y)) &=\sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{Q(x, y)} \\ &=\sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x) P(y \mid x)}{Q(x) Q(y \mid x)} \\ &=\sum_{x} \sum_{y} P(x) P(y \mid x)\left(\log \frac{P(x)}{Q(x)}+\log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\ &=\sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x)}{Q(x)}+\sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\ &=\sum_{x} P(x) \log \frac{P(x)}{Q(x)} \sum_{y} P(y \mid x)+\sum_{x} P(x) \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\ &=\sum_{x} P(x) \log \frac{P(x)}{Q(x)}+\sum_{x} P(x)\left(\sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\ &=D_{\mathrm{KL}}(P(X) \| Q(X))+D_{\mathrm{KL}}(P(Y \mid X) \| Q(Y \mid X)) \end{aligned}$$
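The chain rule can also be checked numerically. A sketch of my own, using random 3×4 joint distributions $P(X,Y)$ and $Q(X,Y)$:

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((3, 4)); P /= P.sum()
Q = rng.random((3, 4)); Q /= Q.sum()

kl_joint = np.sum(P * np.log(P / Q))

Px, Qx = P.sum(axis=1), Q.sum(axis=1)       # marginals over X
kl_x = np.sum(Px * np.log(Px / Qx))

P_y_x = P / Px[:, None]                     # conditionals P(Y | X = x)
Q_y_x = Q / Qx[:, None]
# conditional KL: expectation over x ~ P(X) of the per-x divergence
kl_y_given_x = np.sum(Px * np.sum(P_y_x * np.log(P_y_x / Q_y_x), axis=1))

print(np.isclose(kl_joint, kl_x + kl_y_given_x))   # True
```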

(c)

Proof:
$$\begin{aligned} \arg \min _{\theta} D_{\mathrm{KL}}\left(\hat{P} \| P_{\theta}\right) &=\arg \min _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \frac{\hat{P}(x)}{P_{\theta}(x)} \\ &=\arg \min _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \hat{P}(x)-\sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\ &=\arg \max _{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\ &=\arg \max _{\theta} \sum_{x \in \mathcal{X}}\left(\frac{1}{m} \sum_{i=1}^{m} 1\left\{x^{(i)}=x\right\}\right) \log P_{\theta}(x) \\ &=\arg \max _{\theta} \sum_{i=1}^{m} \log P_{\theta}\left(x^{(i)}\right) \end{aligned}$$

(In the last step the constant factor $\frac{1}{m}$ is dropped, since it does not change the argmax.)
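A small sketch of this equivalence on a Bernoulli model (data and grid values made up for illustration): scanning $\theta$ over a grid, the minimizer of $D_{\mathrm{KL}}(\hat{P} \| P_{\theta})$ and the maximizer of the log-likelihood coincide, both landing on the sample mean up to grid resolution.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.integers(0, 2, size=50)                                  # synthetic 0/1 samples
p_hat = np.array([np.mean(x == 0), np.mean(x == 1)])             # empirical distribution

thetas = np.linspace(0.01, 0.99, 99)
kl = [np.sum(p_hat * np.log(p_hat / np.array([1 - t, t]))) for t in thetas]
ll = [np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas]

print(thetas[np.argmin(kl)], thetas[np.argmax(ll)], x.mean())    # all roughly equal
```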

3. KL divergence, Fisher Information, and the Natural Gradient

(a)

Proof:
$$\nabla_{\theta} \log p(y ; \theta)=\frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)}$$

$$\begin{aligned} \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] &=\mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)}\right] \\ &=\int_{-\infty}^{\infty} p(y ; \theta) \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)} \, d y \\ &=\int_{-\infty}^{\infty} \nabla_{\theta} p(y ; \theta) \, d y \\ &=\nabla_{\theta} \int_{-\infty}^{\infty} p(y ; \theta) \, d y \\ &=\nabla_{\theta} 1 \\ &=0 \end{aligned}$$
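A Monte-Carlo sketch of this identity (my own example): for a unit-variance Gaussian with mean $\theta$, the score is $y-\theta$, and its sample mean under $y \sim p(y;\theta)$ is close to zero.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 1.7
y = rng.normal(loc=theta, scale=1.0, size=200_000)
score = y - theta                  # d/dtheta log p(y; theta) for N(theta, 1)
print(score.mean())                # ~0, as the identity above requires
```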

(b)

Proof:
$$\begin{aligned} \operatorname{Cov}[X] &=E\left[(X-E[X])(X-E[X])^{T}\right]=E\left[X X^{T}\right] \quad \text{when } E[X]=0 \\ \mathcal{I}(\theta) &=\operatorname{Cov}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\ &=\mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right] \end{aligned}$$
Here we used the conclusion from (a) that the score function has mean zero.
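Continuing the Gaussian-mean sketch from part (a): for $N(\theta, 1)$ the Fisher information is $1/\sigma^2 = 1$, and the second moment of the zero-mean score matches it.

```python
import numpy as np

rng = np.random.default_rng(6)
theta = 1.7
y = rng.normal(loc=theta, scale=1.0, size=200_000)
score = y - theta
print(np.mean(score ** 2))         # ~1 = I(theta) for the unit-variance Gaussian
```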

(c)

Proof: First write the entries of $\mathcal{I}(\theta)$ in terms of derivatives of $p(y;\theta)$, using the score from (a):
$$\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}}=\frac{1}{p(y ; \theta)} \frac{\partial p(y ; \theta)}{\partial \theta_{i}}$$

$$\begin{aligned} \mathcal{I}(\theta)_{i j} &=\mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]_{i j} \\ &=\mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} \frac{\partial \log p(y ; \theta)}{\partial \theta_{j}}\right] \\ &=\mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}}\right] \end{aligned}$$

Differentiating the score once more gives the Hessian of the log-density:

$$\frac{\partial^{2} \log p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}=-\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}}+\frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}$$

Taking the negative expectation and using $\int_{-\infty}^{\infty} p(y;\theta)\, dy = 1$:

$$\begin{aligned} \mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right]_{i j} &=\mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}}\right]-\mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}\right] \\ &=\mathcal{I}(\theta)_{i j}-\int_{-\infty}^{\infty} p(y ; \theta) \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}} \, d y \\ &=\mathcal{I}(\theta)_{i j}-\frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} \int_{-\infty}^{\infty} p(y ; \theta) \, d y \\ &=\mathcal{I}(\theta)_{i j} \end{aligned}$$

Hence:

$$\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right]=\mathcal{I}(\theta)$$
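A Bernoulli sketch tying parts (b) and (c) together (my own example, using the exact discrete expectation): the Fisher information computed as $E[\text{score}^2]$, as $-E[\text{Hessian of } \log p]$, and from the known closed form $1/(\theta(1-\theta))$ all agree.

```python
import numpy as np

theta = 0.3
y = np.array([0.0, 1.0])
p = np.array([1 - theta, theta])                   # p(y; theta)

score = y / theta - (1 - y) / (1 - theta)          # d/dtheta log p(y; theta)
hess = -y / theta**2 - (1 - y) / (1 - theta)**2    # d^2/dtheta^2 log p(y; theta)

print(np.sum(p * score**2),                        # E[score^2]
      -np.sum(p * hess),                           # -E[Hessian]
      1.0 / (theta * (1 - theta)))                 # closed form for the Bernoulli
```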

(d)

Proof: Start from the second-order Taylor expansion of $\log p(y;\tilde{\theta})$ around $\theta$, writing $d = \tilde{\theta}-\theta$:
$$\begin{aligned} \log p(y ; \tilde{\theta}) & \approx \log p(y ; \theta)+\left.(\tilde{\theta}-\theta)^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}+\frac{1}{2}(\tilde{\theta}-\theta)^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right)(\tilde{\theta}-\theta) \\ &=\log p(y ; \theta)+\left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}+\frac{1}{2} d^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right) d \end{aligned}$$
Taking the expectation under $y \sim p(y;\theta)$, the first-order term vanishes because the score has zero mean (part (a)), and the quadratic term involves the expected Hessian from part (c):

$$\begin{aligned} \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] &=\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]+\frac{1}{2} d^{T} \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] d \\ &=\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]-\frac{1}{2} d^{T} \mathcal{I}(\theta) d \end{aligned}$$

$$\begin{aligned} D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) &=D_{\mathrm{KL}}\left(p_{\theta} \| p_{\tilde{\theta}}\right) \\ &=\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)]-\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] \\ & \approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d \end{aligned}$$
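A quick sketch of the quadratic approximation on the Bernoulli family (my own example): for a small perturbation $d$, the exact KL divergence is close to $\frac{1}{2} d^2 \mathcal{I}(\theta)$.

```python
import numpy as np

theta, d = 0.3, 0.01
p = np.array([1 - theta, theta])
q = np.array([1 - (theta + d), theta + d])

kl_exact = np.sum(p * np.log(p / q))
kl_quadratic = 0.5 * d**2 / (theta * (1 - theta))   # (1/2) d^T I(theta) d in 1-D

print(kl_exact, kl_quadratic)    # approximately equal for small d
```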

(e)

Step 1: approximate the objective and the constraint with Taylor expansions:
$$\begin{aligned} \ell(\theta+d) & \approx \ell(\theta)+\left.d^{T} \nabla_{\theta^{\prime}} \ell\left(\theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\ &=\log p(y ; \theta)+\left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\ &=\log p(y ; \theta)+d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \end{aligned}$$

$$D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) \approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d$$
Step 2: write down the Lagrangian:
$$\begin{aligned} \mathcal{L}(d, \lambda) &=\ell(\theta+d)-\lambda\left[D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right)-c\right] \\ & \approx \log p(y ; \theta)+d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}-\lambda\left[\frac{1}{2} d^{T} \mathcal{I}(\theta) d-c\right] \end{aligned}$$
Step 3: set the derivatives of the Lagrangian to zero. The derivative with respect to $d$ gives:
$$\begin{aligned} \nabla_{d} \mathcal{L}(d, \lambda) & \approx \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}-\lambda \mathcal{I}(\theta) d=0 \\ \tilde{d} &=\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \end{aligned}$$
At this point, although the exact value of $\lambda$ is unknown, it is a positive real number, so the direction of the natural gradient is already determined. Next, use the equation obtained by differentiating the Lagrangian with respect to $\lambda$ to solve for $\lambda$:
$$\begin{aligned} \nabla_{\lambda} \mathcal{L}(d, \lambda) & \approx c-\frac{1}{2} d^{T} \mathcal{I}(\theta) d \\ &=c-\frac{1}{2} \cdot \frac{1}{\lambda} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \mathcal{I}(\theta)^{-1} \cdot \mathcal{I}(\theta) \cdot \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\ &=c-\frac{1}{2 \lambda^{2}(p(y ; \theta))^{2}}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta} \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\ &=0 \end{aligned}$$

$$\lambda=\sqrt{\frac{1}{2 c\,(p(y ; \theta))^{2}}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta} \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}$$
Hence the natural gradient $d^{*}$ is:
$$\begin{aligned} d^{*} &=\sqrt{\frac{2 c\,(p(y ; \theta))^{2}}{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta} \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}}\; \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\ &=\sqrt{\frac{2 c}{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta} \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}}\; \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \end{aligned}$$
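A sketch checking this closed form on the 1-D Bernoulli example (single observation $y=1$, with a small made-up constraint value $c$): $d^{*}$ points along $\mathcal{I}(\theta)^{-1} \nabla_\theta \log p$ and saturates the constraint $\frac{1}{2} d^{*T} \mathcal{I}(\theta) d^{*} = c$.

```python
import numpy as np

theta, c, y = 0.3, 1e-4, 1.0

grad_logp = y / theta - (1 - y) / (1 - theta)     # = grad_theta p / p, scalar here
I = 1.0 / (theta * (1 - theta))                   # Fisher information

d_star = np.sqrt(2 * c / (grad_logp * (1 / I) * grad_logp)) * (1 / I) * grad_logp

print(d_star, 0.5 * d_star * I * d_star)          # second number reproduces c
```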

(f)

From the previous part:
$$\tilde{d}=\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}=\frac{1}{\lambda} \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}=\frac{1}{\lambda} \mathcal{I}(\theta)^{-1}\left.\nabla_{\theta^{\prime}} \ell\left(\theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}$$
Also note that:
$$\begin{aligned} \mathcal{I}(\theta) &=\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\ &=\mathbb{E}_{y \sim p(y ; \theta)}\left[-\nabla_{\theta}^{2} \ell(\theta)\right] \\ &=-\mathbb{E}_{y \sim p(y ; \theta)}[H] \end{aligned}$$
Hence:
$$\begin{aligned} \theta &:=\theta+\tilde{d} \\ &=\theta+\frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \nabla_{\theta} \ell(\theta) \\ &=\theta-\frac{1}{\lambda} \mathbb{E}_{y \sim p(y ; \theta)}[H]^{-1} \nabla_{\theta} \ell(\theta) \end{aligned}$$
For Newton's method, the update rule is:
$$\theta:=\theta-H^{-1} \nabla_{\theta} \ell(\theta)$$
Observe that both update directions are the product of an inverse Hessian with the gradient of the objective with respect to the parameters: Newton's method uses the Hessian of the log-likelihood itself, while the natural gradient uses (the negative of) its expectation, scaled by $\frac{1}{\lambda}$.
