First, write out the forward pass:
$$
\begin{aligned}
z^{[1]} &= W^{[1]} x + W_0^{[1]} \\
h &= \sigma\left(z^{[1]}\right) \\
z^{[2]} &= W^{[2]} h + W_0^{[2]} \\
o &= \sigma\left(z^{[2]}\right)
\end{aligned}
$$
The loss function is:
$$
\ell = \frac{1}{m} \sum_{i=1}^{m}\left(o^{(i)} - y^{(i)}\right)^{2} = \frac{1}{m} \sum_{i=1}^{m} J^{(i)}
$$
For a single sample, apply the chain rule: first compute the derivative with respect to $W^{[2]}$, then reuse those intermediate results to compute the derivative with respect to $W^{[1]}$:
$$
\frac{\partial J}{\partial W^{[1]}}
= \frac{\partial J}{\partial o} \frac{\partial o}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial h} \frac{\partial h}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}}
= 2(o-y)\, o(1-o)\, x^{T} h \cdot (1-h) \cdot W^{[2]}
$$
where "$\cdot$" denotes element-wise multiplication.
Then $\partial J / \partial w^{[1]}_{1,2}$ is the $(1,2)$ entry of the $2\times 3$ matrix on the RHS. Adding the superscript $(i)$ to denote the $i$-th sample:
$$
\frac{\partial J}{\partial w^{[1]}_{1,2}} = 2\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
where $h_{2}=\sigma\left(w_{1,2}^{[1]} x_{1}+w_{2,2}^{[1]} x_{2}+w_{0,2}^{[1]}\right)$.
The derivative of $\ell$ is therefore:
$$
\frac{\partial \ell}{\partial w^{[1]}_{1,2}} = \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
The update rule for $w^{[1]}_{1,2}$ is then:
$$
w_{1,2}^{[1]} := w_{1,2}^{[1]} - \alpha \frac{2}{m} \sum_{i=1}^{m}\left(o^{(i)}-y^{(i)}\right) \cdot o^{(i)}\left(1-o^{(i)}\right) \cdot w_{2}^{[2]} \cdot h_{2}^{(i)}\left(1-h_{2}^{(i)}\right) \cdot x_{1}^{(i)}
$$
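As a sanity check, here is a minimal numeric sketch (assuming a hypothetical 2-input, 3-hidden-unit, 1-output sigmoid network with random weights; the names `W1`, `b1`, `W2`, `b2` are ours) comparing the closed-form gradient for $w^{[1]}_{1,2}$ against a finite-difference estimate:

```python
import numpy as np

# Minimal numeric check of the closed-form gradient above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # W1[i, j] = w^{[1]}_{i+1, j+1} (input i+1 -> hidden j+1)
b1 = rng.normal(size=3)        # W_0^{[1]}
W2 = rng.normal(size=3)        # W2[j] = w^{[2]}_{j+1}
b2 = rng.normal()              # W_0^{[2]}
x, y = np.array([0.3, -1.2]), 1.0

def forward(W1):
    h = sigmoid(x @ W1 + b1)
    o = sigmoid(h @ W2 + b2)
    return h, o

h, o = forward(W1)
# Closed-form dJ/dw^{[1]}_{1,2} from the derivation above
grad_analytic = 2 * (o - y) * o * (1 - o) * W2[1] * h[1] * (1 - h[1]) * x[0]

# Finite-difference estimate of the same entry
eps = 1e-6
W1_eps = W1.copy()
W1_eps[0, 1] += eps
_, o_eps = forward(W1_eps)
grad_numeric = ((o_eps - y) ** 2 - (o - y) ** 2) / eps

print(grad_analytic, grad_numeric)   # the two values agree to ~1e-5
```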
Yes, it is possible. The three hidden neurons can be viewed as three independent linear classifiers, each defining a hyperplane that lies along one of the three triangle edges separating the 0 and 1 classes in the scatter plot. When a data point passes through a neuron's linear weights, the sign of the result indicates which side of that hyperplane the point falls on, and the step function maps this to one of two values. The linear part of the output layer then checks whether all three neurons place the sample on the "1" (respectively "0") side of their hyperplanes: if so, the pre-activation is positive (otherwise non-positive), and a final step function assigns the class. 100% accuracy is guaranteed because each of the three edges can be separated exactly by a linear boundary. One example is given below:
def optimal_step_weights():
    """Return the optimal weights for the neural network with a step activation function.
    This function will not be graded if there are no optimal weights.
    See the PDF for instructions on what each weight represents.
    The hidden layer weights are notated by [1] on the problem set and
    the output layer weights are notated by [2].
    This function should return a dict with elements for each weight, see example_weights above.
    """
    w = example_weights()

    # *** START CODE HERE ***
    # Hyperplane x1 = 0.5
    w['hidden_layer_0_1'] = 0.5
    w['hidden_layer_1_1'] = -1
    w['hidden_layer_2_1'] = 0
    # Hyperplane x2 = 0.5
    w['hidden_layer_0_2'] = 0.5
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = -1
    # Hyperplane x1 + x2 = 4
    w['hidden_layer_0_3'] = -4
    w['hidden_layer_1_3'] = 1
    w['hidden_layer_2_3'] = 1
    # A sample is class 0 exactly when all three hidden units above output 0
    w['output_layer_0'] = -0.5
    w['output_layer_1'] = 1
    w['output_layer_2'] = 1
    w['output_layer_3'] = 1
    # *** END CODE HERE ***

    return w
def example_weights():
    """This is an example function that returns weights.
    Use this function as a template for optimal_step_weights and optimal_sigmoid_weights.
    You do not need to modify this class for this assignment.
    """
    w = {}
    w['hidden_layer_0_1'] = 0
    w['hidden_layer_1_1'] = 0
    w['hidden_layer_2_1'] = 0
    w['hidden_layer_0_2'] = 0
    w['hidden_layer_1_2'] = 0
    w['hidden_layer_2_2'] = 0
    w['hidden_layer_0_3'] = 0
    w['hidden_layer_1_3'] = 0
    w['hidden_layer_2_3'] = 0
    w['output_layer_0'] = 0
    w['output_layer_1'] = 0
    w['output_layer_2'] = 0
    w['output_layer_3'] = 0
    return w
example_w = optimal_step_weights()
example_w
{'hidden_layer_0_1': 0.5,
'hidden_layer_1_1': -1,
'hidden_layer_2_1': 0,
'hidden_layer_0_2': 0.5,
'hidden_layer_1_2': 0,
'hidden_layer_2_2': -1,
'hidden_layer_0_3': -4,
'hidden_layer_1_3': 1,
'hidden_layer_2_3': 1,
'output_layer_0': -0.5,
'output_layer_1': 1,
'output_layer_2': 1,
'output_layer_3': 1}
No, it is not possible; this is covered in the lecture notes. Without an activation function (a nonlinearity) at each layer, all of the layers collapse into a single linear operation. In this problem, if the hidden-layer activation is linear, i.e. the identity function, then:
$$
\begin{aligned}
o &= \sigma\left(z^{[2]}\right) \\
&= \sigma\left(W^{[2]} h + W_0^{[2]}\right) \\
&= \sigma\left(W^{[2]}\left(W^{[1]} x + W_0^{[1]}\right) + W_0^{[2]}\right) \\
&= \sigma\left(W^{[2]} W^{[1]} x + W^{[2]} W_0^{[1]} + W_0^{[2]}\right) \\
&= \sigma\left(\tilde{W} x + \tilde{W}_0\right)
\end{aligned}
$$
where $\tilde{W} = W^{[2]} W^{[1]}$ and $\tilde{W}_0 = W^{[2]} W_0^{[1]} + W_0^{[2]}$. The network is therefore equivalent to a single linear classifier, and the figure shows that this dataset is not linearly separable, so 100% accuracy cannot be achieved.
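A quick numeric sketch of this collapse (hypothetical shapes and random weights): the two-layer network with an identity hidden activation and the single affine map $\tilde{W} x + \tilde{W}_0$ produce the same output.

```python
import numpy as np

# Numeric sketch: with an identity hidden activation, the two-layer network
# equals one affine map followed by the output sigmoid.
sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))   # hidden layer
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))   # output layer
x = rng.normal(size=(2, 1))

o_two_layer = sigmoid(W2 @ (W1 @ x + b1) + b2)   # linear hidden layer, sigmoid output

W_tilde = W2 @ W1                                # \tilde{W}
W0_tilde = W2 @ b1 + b2                          # \tilde{W}_0
o_collapsed = sigmoid(W_tilde @ x + W0_tilde)

print(np.allclose(o_two_layer, o_collapsed))     # True
```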
The key is Jensen's inequality:
$$
\begin{aligned}
D_{\mathrm{KL}}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \\
&= -\sum_{x \in \mathcal{X}} P(x) \log \frac{Q(x)}{P(x)} \\
&= E\left[-\log \frac{Q(x)}{P(x)}\right] \\
&\geq -\log E\left[\frac{Q(x)}{P(x)}\right] \\
&= -\log \left(\sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)}\right) \\
&= -\log \sum_{x \in \mathcal{X}} Q(x) \\
&= -\log 1 \\
&= 0
\end{aligned}
$$
For the equality condition: if $P = Q$, then $D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log 1 = 0$. Conversely, if $D_{\mathrm{KL}}(P \| Q) = 0$, the equality condition of Jensen's inequality requires $\frac{Q(x)}{P(x)}$ to be constant, so $\frac{Q(x)}{P(x)} = E\left[\frac{Q(x)}{P(x)}\right] = \sum_{x \in \mathcal{X}} P(x) \frac{Q(x)}{P(x)} = \sum_{x \in \mathcal{X}} Q(x) = 1$, i.e. $P = Q$. Therefore $D_{\mathrm{KL}}(P \| Q) = 0$ if and only if $P = Q$.
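A small numeric illustration of both directions of the statement, using hypothetical discrete distributions:

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors (all entries > 0)."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical distributions over four outcomes
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(kl(p, q))   # strictly positive, since p != q
print(kl(p, p))   # exactly 0 when P = Q
```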
Proof:
$$
\begin{aligned}
D_{\mathrm{KL}}(P(X, Y) \| Q(X, Y)) &= \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{Q(x, y)} \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x) P(y \mid x)}{Q(x) Q(y \mid x)} \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x)\left(\log \frac{P(x)}{Q(x)} + \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\
&= \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(x)}{Q(x)} + \sum_{x} \sum_{y} P(x) P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \sum_{y} P(y \mid x) + \sum_{x} P(x) \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)} \\
&= \sum_{x} P(x) \log \frac{P(x)}{Q(x)} + \sum_{x} P(x)\left(\sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q(y \mid x)}\right) \\
&= D_{\mathrm{KL}}(P(X) \| Q(X)) + D_{\mathrm{KL}}(P(Y \mid X) \| Q(Y \mid X))
\end{aligned}
$$
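A numeric check of this chain rule on a small hypothetical joint distribution (the helper `kl` and the table values are ours):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Hypothetical joint distributions over (X, Y); rows index x, columns index y.
P = np.array([[0.10, 0.05, 0.15],
              [0.20, 0.30, 0.20]])
Q = np.array([[0.12, 0.08, 0.10],
              [0.25, 0.25, 0.20]])

Px, Qx = P.sum(axis=1), Q.sum(axis=1)              # marginals P(X), Q(X)
P_y_x, Q_y_x = P / Px[:, None], Q / Qx[:, None]    # conditionals P(Y|X), Q(Y|X)

lhs = kl(P.ravel(), Q.ravel())                     # D_KL(P(X,Y) || Q(X,Y))
kl_cond = sum(Px[i] * kl(P_y_x[i], Q_y_x[i]) for i in range(len(Px)))
rhs = kl(Px, Qx) + kl_cond                         # D_KL(P(X)||Q(X)) + D_KL(P(Y|X)||Q(Y|X))

print(np.isclose(lhs, rhs))   # True
```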
Proof:
$$
\begin{aligned}
\arg \min_{\theta} D_{\mathrm{KL}}\left(\hat{P} \| P_{\theta}\right) &= \arg \min_{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \frac{\hat{P}(x)}{P_{\theta}(x)} \\
&= \arg \min_{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log \hat{P}(x) - \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg \max_{\theta} \sum_{x \in \mathcal{X}} \hat{P}(x) \log P_{\theta}(x) \\
&= \arg \max_{\theta} \sum_{x \in \mathcal{X}}\left(\frac{1}{m} \sum_{i=1}^{m} 1\left\{x^{(i)}=x\right\}\right) \log P_{\theta}(x) \\
&= \arg \max_{\theta} \sum_{i=1}^{m} \log P_{\theta}\left(x^{(i)}\right)
\end{aligned}
$$
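A minimal sketch of this equivalence (hypothetical Bernoulli data, grid search over $\theta$): the $\theta$ minimizing $D_{\mathrm{KL}}(\hat{P} \| P_\theta)$ is the same $\theta$ that maximizes the log-likelihood.

```python
import numpy as np

# Hypothetical binary samples x^{(i)}; model family P_theta = Bernoulli(theta).
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
p_hat = np.array([np.mean(x == 0), np.mean(x == 1)])   # empirical distribution \hat{P}

thetas = np.linspace(0.01, 0.99, 99)

def kl_hat_to_model(theta):
    p_theta = np.array([1 - theta, theta])
    return np.sum(p_hat * np.log(p_hat / p_theta))

def log_likelihood(theta):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

theta_kl = thetas[np.argmin([kl_hat_to_model(t) for t in thetas])]
theta_mle = thetas[np.argmax([log_likelihood(t) for t in thetas])]

print(theta_kl, theta_mle)   # both equal the sample mean (0.7) on this grid
```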
Proof:
$$
\begin{aligned}
\nabla_{\theta} \log p(y ; \theta) &= \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)} \\
\mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] &= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)}\right] \\
&= \int_{-\infty}^{\infty} p(y ; \theta) \frac{\nabla_{\theta} p(y ; \theta)}{p(y ; \theta)}\, d y \\
&= \int_{-\infty}^{\infty} \nabla_{\theta} p(y ; \theta)\, d y \\
&= \nabla_{\theta} \int_{-\infty}^{\infty} p(y ; \theta)\, d y \\
&= \nabla_{\theta} 1 \\
&= 0
\end{aligned}
$$
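A Monte Carlo sanity check of the zero-mean property, assuming a hypothetical Gaussian model $p(y;\theta) = \mathcal{N}(\theta, 1)$ whose score is $y - \theta$:

```python
import numpy as np

# Monte Carlo check: for p(y; theta) = N(theta, 1), the score is y - theta.
rng = np.random.default_rng(2)
theta = 1.5
y = rng.normal(loc=theta, scale=1.0, size=1_000_000)   # y ~ p(y; theta)

score = y - theta
print(score.mean())   # close to 0, consistent with E[score] = 0
```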
Proof:
$$
\begin{aligned}
\operatorname{Cov}[X] &= E\left[(X-E[X])(X-E[X])^{T}\right] \\
&= E\left[X X^{T}\right] \quad \text{when } E[X]=0 \\
\mathcal{I}(\theta) &= \operatorname{Cov}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]
\end{aligned}
$$
which uses the result from (a) that the score function has mean zero.
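A Monte Carlo check (hypothetical Bernoulli model) that the second moment of the score matches the known Fisher information $1/(\theta(1-\theta))$:

```python
import numpy as np

# Monte Carlo check for p(y; theta) = theta^y (1-theta)^(1-y).
# Score: y/theta - (1-y)/(1-theta); known Fisher information: 1/(theta(1-theta)).
rng = np.random.default_rng(3)
theta = 0.3
y = rng.binomial(1, theta, size=1_000_000).astype(float)

score = y / theta - (1 - y) / (1 - theta)
print(np.mean(score ** 2))        # E[score^2] = Cov[score], since E[score] = 0
print(1 / (theta * (1 - theta)))  # analytic Fisher information, ~4.76
```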
Proof:
$$
\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} = \frac{1}{p(y ; \theta)} \frac{\partial p(y ; \theta)}{\partial \theta_{i}}
$$

$$
\begin{aligned}
\mathcal{I}(\theta)_{i j} &= \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right) \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)^{T}\right|_{\theta^{\prime}=\theta}\right]_{i j} \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{\partial \log p(y ; \theta)}{\partial \theta_{i}} \frac{\partial \log p(y ; \theta)}{\partial \theta_{j}}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}}\right]
\end{aligned}
$$

Differentiating the first identity once more,

$$
\frac{\partial^{2} \log p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}} = -\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}} + \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}
$$

so

$$
\begin{aligned}
\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right]_{i j}
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[\frac{1}{(p(y ; \theta))^{2}} \frac{\partial p(y ; \theta)}{\partial \theta_{i}} \frac{\partial p(y ; \theta)}{\partial \theta_{j}} - \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}\right] \\
&= \mathcal{I}(\theta)_{i j} - \int_{-\infty}^{\infty} p(y ; \theta) \frac{1}{p(y ; \theta)} \frac{\partial^{2} p(y ; \theta)}{\partial \theta_{i} \partial \theta_{j}}\, d y \\
&= \mathcal{I}(\theta)_{i j} - \frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} \int_{-\infty}^{\infty} p(y ; \theta)\, d y \\
&= \mathcal{I}(\theta)_{i j}
\end{aligned}
$$

Hence $\mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] = \mathcal{I}(\theta)$.
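A matching Monte Carlo check, for the same hypothetical Bernoulli model, that the negative expected Hessian also equals $1/(\theta(1-\theta))$:

```python
import numpy as np

# Monte Carlo check: d^2/dtheta^2 log p(y; theta) = -y/theta^2 - (1-y)/(1-theta)^2.
rng = np.random.default_rng(4)
theta = 0.3
y = rng.binomial(1, theta, size=1_000_000).astype(float)

neg_hessian = y / theta ** 2 + (1 - y) / (1 - theta) ** 2
print(np.mean(neg_hessian))       # E[-Hessian], ~ 1/(theta(1-theta)) ~ 4.76
print(1 / (theta * (1 - theta)))  # the Fisher information from part (b)
```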
Proof:
$$
\begin{aligned}
\log p(y ; \tilde{\theta}) & \approx \log p(y ; \theta) + \left.(\tilde{\theta}-\theta)^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} + \frac{1}{2}(\tilde{\theta}-\theta)^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right)(\tilde{\theta}-\theta) \\
&= \log p(y ; \theta) + \left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} + \frac{1}{2} d^{T}\left(\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right) d
\end{aligned}
$$
Taking the expectation under $y \sim p(y ; \theta)$, the first-order term vanishes because the score has zero mean, and the Hessian term becomes $-\mathcal{I}(\theta)$:

$$
\begin{aligned}
\mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] & \approx \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)] + \frac{1}{2} d^{T} \mathbb{E}_{y \sim p(y ; \theta)}\left[\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] d \\
&= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)] - \frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$

$$
\begin{aligned}
D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) &= D_{\mathrm{KL}}\left(p_{\theta} \| p_{\tilde{\theta}}\right) \\
&= \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \theta)] - \mathbb{E}_{y \sim p(y ; \theta)}[\log p(y ; \tilde{\theta})] \\
& \approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$
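A numeric illustration of this approximation (hypothetical Bernoulli model, small perturbation $d$), comparing the exact KL divergence with $\frac{1}{2} d^{T} \mathcal{I}(\theta) d$:

```python
import numpy as np

# Compare exact D_KL(Ber(theta) || Ber(theta + d)) with the quadratic approximation.
theta, d = 0.3, 0.01
p = np.array([1 - theta, theta])
q = np.array([1 - theta - d, theta + d])

kl_exact = np.sum(p * np.log(p / q))
kl_approx = 0.5 * d ** 2 / (theta * (1 - theta))   # I(theta) = 1/(theta(1-theta))

print(kl_exact, kl_approx)   # agree closely for small d
```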
Step 1: approximate the objective and the constraint with Taylor expansions:
$$
\begin{aligned}
\ell(\theta+d) & \approx \ell(\theta) + \left.d^{T} \nabla_{\theta^{\prime}} \ell\left(\theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&= \log p(y ; \theta) + \left.d^{T} \nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&= \log p(y ; \theta) + d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) & \approx \frac{1}{2} d^{T} \mathcal{I}(\theta) d
\end{aligned}
$$
Step 2: write the Lagrangian:
$$
\begin{aligned}
\mathcal{L}(d, \lambda) &= \ell(\theta+d) - \lambda\left[D_{\mathrm{KL}}\left(p_{\theta} \| p_{\theta+d}\right) - c\right] \\
& \approx \log p(y ; \theta) + d^{T} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} - \lambda\left[\frac{1}{2} d^{T} \mathcal{I}(\theta) d - c\right]
\end{aligned}
$$
Step 3: set the derivatives of the Lagrangian to zero. The derivative with respect to $d$ gives:
$$
\begin{aligned}
\nabla_{d} \mathcal{L}(d, \lambda) & \approx \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} - \lambda \mathcal{I}(\theta) d = 0 \\
\tilde{d} &= \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)}
\end{aligned}
$$
Although the exact value of $\lambda$ is not yet known, it is a positive real number, so the direction of the natural gradient is already determined. To pin down $\lambda$, use the stationarity condition with respect to $\lambda$:
$$
\begin{aligned}
\nabla_{\lambda} \mathcal{L}(d, \lambda) & \approx c - \frac{1}{2} d^{T} \mathcal{I}(\theta) d \\
&= c - \frac{1}{2} \cdot \frac{1}{\lambda} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T}}{p(y ; \theta)} \mathcal{I}(\theta)^{-1} \cdot \mathcal{I}(\theta) \cdot \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
&= c - \frac{1}{2 \lambda^{2}(p(y ; \theta))^{2}} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} \\
&= 0 \\
\lambda &= \sqrt{\frac{1}{2 c(p(y ; \theta))^{2}} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}
\end{aligned}
$$
Hence the natural gradient $d^{*}$ is:
$$
\begin{aligned}
d^{*} &= \sqrt{\frac{2 c\,(p(y ; \theta))^{2}}{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}}\; \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} \\
&= \sqrt{\frac{2 c}{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}^{T} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}}\; \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}
\end{aligned}
$$
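A minimal sketch of computing $d^{*}$ numerically, assuming hypothetical values for the Fisher information, the score, and the KL budget $c$; it also verifies that the resulting step sits on the KL constraint boundary:

```python
import numpy as np

# Natural gradient step with hypothetical I(theta), score, and KL budget c.
I_theta = np.array([[2.0, 0.3],
                    [0.3, 1.0]])              # positive definite Fisher information
score = np.array([0.5, -1.2])                 # nabla_theta log p(y; theta)
c = 0.01                                      # KL constraint level

direction = np.linalg.solve(I_theta, score)   # I(theta)^{-1} score (the d~ direction)
scale = np.sqrt(2 * c / (score @ direction))  # the 1/lambda factor solved for above
d_star = scale * direction

print(d_star)
print(0.5 * d_star @ I_theta @ d_star)        # ~ c: the step lies on the KL boundary
```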
From the previous part:
$$
\tilde{d} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \frac{\left.\nabla_{\theta^{\prime}} p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}}{p(y ; \theta)} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta} = \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \left.\nabla_{\theta^{\prime}} \ell\left(\theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}
$$
Also note that:
$$
\begin{aligned}
\mathcal{I}(\theta) &= \mathbb{E}_{y \sim p(y ; \theta)}\left[-\left.\nabla_{\theta^{\prime}}^{2} \log p\left(y ; \theta^{\prime}\right)\right|_{\theta^{\prime}=\theta}\right] \\
&= \mathbb{E}_{y \sim p(y ; \theta)}\left[-\nabla_{\theta}^{2} \ell(\theta)\right] \\
&= -\mathbb{E}_{y \sim p(y ; \theta)}[H]
\end{aligned}
$$
Therefore:
$$
\begin{aligned}
\theta &:= \theta + \tilde{d} \\
&= \theta + \frac{1}{\lambda} \mathcal{I}(\theta)^{-1} \nabla_{\theta} \ell(\theta) \\
&= \theta - \frac{1}{\lambda} \mathbb{E}_{y \sim p(y ; \theta)}[H]^{-1} \nabla_{\theta} \ell(\theta)
\end{aligned}
$$
For Newton's method, the update rule is:
$$
\theta := \theta - H^{-1} \nabla_{\theta} \ell(\theta)
$$
In both cases the update direction is the product of an inverse Hessian with the gradient of the objective with respect to the parameters: Newton's method uses the Hessian $H$ itself, while the natural gradient update uses its expectation $\mathbb{E}_{y \sim p(y;\theta)}[H]$ (i.e. $-\mathcal{I}(\theta)$), scaled by $1/\lambda$.
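As an illustration of this connection, here is a sketch for a hypothetical Gaussian-mean model with known variance, where the Hessian of the log-likelihood does not depend on $y$, so the natural-gradient direction and the Newton direction coincide exactly:

```python
import numpy as np

# Sketch: l(theta) = sum_i log N(y_i; theta, sigma^2). The Hessian of l is
# constant in y, so -E[H] (the Fisher information for m samples) equals -H,
# and the natural-gradient direction matches the Newton direction.
rng = np.random.default_rng(5)
sigma2 = 2.0
y = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)
m, theta = len(y), 0.0                    # current parameter value

grad = np.sum(y - theta) / sigma2         # nabla_theta l(theta)
hessian = -m / sigma2                     # H, constant in y
fisher = m / sigma2                       # I(theta) for m i.i.d. samples

newton_step = -grad / hessian             # -H^{-1} grad
natural_dir = grad / fisher               # I(theta)^{-1} grad (up to the 1/lambda factor)

print(newton_step, natural_dir)           # identical
print(theta + newton_step, y.mean())      # a single Newton step reaches the MLE
```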