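It has been a while since my last update. I recently read the original Policy Gradient paper but could not follow the proof in it, so I looked up a set of Stanford University slides that prove the policy gradient theorem, and I found them much clearer.
Links:
Original paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation.
Slides.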
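The significance of the policy gradient theorem is that it directly links the gradient of the objective function $J(\theta)$ to the gradient of the policy $\pi(a|s;\theta)$. (Here $\theta$ denotes the parameters of the policy function; if a neural network (nn) is used as the policy, $\theta$ is the set of parameters of that network.) The theorem states: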
$$
\nabla_{\theta}J(\theta)=\int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\nabla_{\theta}\pi(a|s;\theta)\cdot Q^{\pi}(s,a)\cdot da \cdot ds
$$
where $\nabla_{\theta}J(\theta)=\frac{\partial}{\partial\theta}J(\theta)$ and $\rho^{\pi}(s) = \sum_{t=0}^\infty{\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0}$.
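My background is fairly weak, so while working through the derivation I did not know what many of the formulas meant. I looked them up and collected them here for reference.
Loosely speaking, the integrals here can be replaced by sums; in either form they range over the entire action space or state space.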
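The derivation below leans on two standard relations between the state-value and action-value functions. They are restated here only as a reference sketch, in the notation of the later equations: $\mathcal{R}_s^a$ is read as the expected immediate reward and $P_{s,s'}^{a}$ as the transition density, and writing the next state as $s'$ is a convention chosen for this summary.

$$
V^{\pi}(s)=\int_{a\in\mathcal{A}}\pi(a|s;\theta)\,Q^{\pi}(s,a)\cdot da,
\qquad
Q^{\pi}(s,a)=\mathcal{R}_s^a+\gamma\int_{s'\in\mathcal{S}}P_{s,s'}^{a}\,V^{\pi}(s')\cdot ds'
$$

In the discrete case both integrals become finite sums over actions and next states.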
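The proof as a whole follows the linked slides. The only difference is that I carry the induction one term further, so that the result looks more obvious.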
$$
\begin{align}
\begin{aligned}
J(\theta) & = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}{\gamma^t r_t}\right] \\
& = \sum_{t=0}^{\infty}{\gamma^t\,\mathbb{E}_{\pi}\left[r_t\right]} \\
& = \sum_{t=0}^{\infty}{\gamma^t\int_{s\in\mathcal{S}}\left[\int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0\right]\int_{a\in\mathcal{A}}\pi(a|s;\theta)\mathcal{R}_s^a\cdot da\cdot ds} \\
& = \int_{s\in\mathcal{S}}\left[\sum_{t=0}^\infty{\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0}\right]\int_{a\in\mathcal{A}}\pi(a|s;\theta)\mathcal{R}_s^a \cdot da \cdot ds \\
& = \int_{s\in\mathcal{S}}\rho^{\pi}(s)\int_{a\in\mathcal{A}}\pi(a|s;\theta)\mathcal{R}_s^a \cdot da \cdot ds
\end{aligned}
\end{align}
$$
where $\rho^{\pi}(s) = \sum_{t=0}^\infty{\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0}$, $P_0(s_0)$ is the initial state distribution, and $P(s_0\rightarrow s,t,\pi)$ is the probability that the system, following policy $\pi$, moves from state $s_0$ to state $s$ in $t$ steps. In particular, $P(s_0\rightarrow s_1,1,\pi)$ is the probability of moving from $s_0$ to $s_1$ in one step (remember this notation; the final iteration formula uses it).
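To make $\rho^{\pi}(s)$ concrete, here is a minimal sketch for the discrete case, where the integral over $s_0$ becomes a sum. The two-state kernel `P_PI`, the initial distribution `P0`, the discount `GAMMA`, and the truncation `horizon` are all made-up values for illustration; they do not come from the paper or the slides.

```python
# Hypothetical 2-state example; all numbers are made up for illustration.
GAMMA = 0.9
P0 = [0.6, 0.4]                       # initial distribution P_0(s_0)
P_PI = [[0.45, 0.55],                 # P_PI[s][s2] = P(s -> s2, 1, pi)
        [0.22, 0.78]]


def discounted_occupancy(p0, p_pi, gamma, horizon=2000):
    """Approximate rho(s) = sum_t gamma^t * sum_{s0} P0(s0) * P(s0 -> s, t, pi).

    The infinite sum over t is truncated after `horizon` terms.
    """
    n = len(p0)
    dist = list(p0)                   # distribution of s_t, starting at t = 0
    rho = [0.0] * n
    discount = 1.0                    # holds gamma^t
    for _ in range(horizon):
        for s in range(n):
            rho[s] += discount * dist[s]
        # Advance the state distribution by one step under pi.
        dist = [sum(dist[s] * p_pi[s][s2] for s in range(n)) for s2 in range(n)]
        discount *= gamma
    return rho


rho = discounted_occupancy(P0, P_PI, GAMMA)
print(rho, sum(rho))
```

Note that $\rho^{\pi}$ defined this way is not a probability distribution: its total mass is $1/(1-\gamma)$, which is 10 in this sketch up to truncation error.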
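At this point $J(\theta)$ already has roughly the form that appears in the theorem.
Equation (2) gives the objective function defined through the value function of the initial state: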
$$
\begin{align}
\begin{aligned}
J(\theta) & = \int_{s_0}P_0(s_0)V^{\pi}(s_0) \cdot ds_0 \\
& = \int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0
\end{aligned}
\end{align}
$$
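Note that, to save space, the integration domains are abbreviated: $\int_{s_0}$ and $\int_{a_0}$ stand for $\int_{s_0\in\mathcal{S}}$ and $\int_{a_0\in\mathcal{A}}$. From (2) we obtain the derivative of the objective function with respect to $\theta$: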
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) = \nabla_{\theta}\left[\int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0\right]
\end{aligned}
\end{align}
$$
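This is where the seemingly endless derivation begins. Applying the product rule to $\pi\cdot Q^{\pi}$, and then the conversion formula between the state-value function and the action-value function from the preliminaries, gives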
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = \int_{s_0}P_0(s_0)\int_{a_0}\nabla_{\theta}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0 \\
& + \int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)\nabla_{\theta}Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0 \\
& = M_0+\int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)\nabla_{\theta}\left[\mathcal{R}_{s_0}^{a_0}+\int_{s_1} \gamma P_{s_0,s_1}^{a_0}V^{\pi}(s_1) \cdot ds_1\right]\cdot da_0 \cdot ds_0
\end{aligned}
\end{align}
$$
where $M_0=\int_{s_0}P_0(s_0)\int_{a_0}\nabla_{\theta}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0$.
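Neither the reward nor the transition density $P_{s_0,s_1}^{a_0}$ depends on $\theta$, so $\nabla_{\theta}\mathcal{R}_{s_0}^{a_0}=0$ and the gradient acts only on $V^{\pi}(s_1)$. Swapping the order of integration then gives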
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + \int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)\nabla_{\theta}\int_{s_1} \gamma P_{s_0,s_1}^{a_0}V^{\pi}(s_1) \cdot ds_1\cdot da_0 \cdot ds_0 \\
& = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)\textcolor{red}{\int_{a_0}\pi(a_0|s_0;\theta)P_{s_0,s_1}^{a_0}\cdot da_0}\cdot ds_0\right]\nabla_{\theta}V^{\pi}(s_1)\cdot ds_1
\end{aligned}
\end{align}
$$
where the red part is exactly
$$
\int_{a_0}\pi(a_0|s_0;\theta)P_{s_0,s_1}^{a_0}\cdot da_0=P(s_0\rightarrow s_1,1,\pi)
$$
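In the discrete case this identity says that averaging the per-action transition probabilities with the policy weights yields the one-step kernel under $\pi$. A minimal sketch, assuming a made-up tabular model with two states and two actions (the arrays `P` and `PI` and the helper `policy_kernel` are hypothetical names, not part of the original derivation):

```python
# Hypothetical 2-state, 2-action example; all numbers are made up.
# P[s][a][s2] is the probability of landing in s2 after taking a in s.
P = [[[0.7, 0.3], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
# PI[s][a] plays the role of pi(a|s; theta) for one fixed theta.
PI = [[0.5, 0.5],
      [0.3, 0.7]]


def policy_kernel(p, pi):
    """Return K with K[s][s2] = sum_a pi(a|s) * P[s][a][s2]."""
    n_states = len(p)
    n_actions = len(pi[0])
    return [[sum(pi[s][a] * p[s][a][s2] for a in range(n_actions))
             for s2 in range(n_states)]
            for s in range(n_states)]


P_PI = policy_kernel(P, PI)
print(P_PI)
```

Each row of `P_PI` sums to one, as a transition kernel should, because every `PI[s]` and every `P[s][a]` is itself a probability vector.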
Therefore,
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right] \cdot \textcolor{blue}{\nabla_{\theta}V^{\pi}(s_1)}\cdot ds_1
\end{aligned}
\end{align}
$$
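This completes the first round of the iteration. The blue part of the resulting equation (6) has a form similar to the first line of equation (2), with $\nabla_{\theta}V^{\pi}(s_1)$ in place of $V^{\pi}(s_0)$. The overall idea of the derivation is therefore that each round splits off one term $M_t$ containing $\nabla_{\theta}\pi$ and leaves a remainder containing $\nabla_{\theta}V^{\pi}$ at the next state, which is then expanded in the same way.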
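One compact way to express this recursion, written here as a sketch that uses only the value-function identities appearing in this post (the generic state $s$ and next state $s'$ are notation chosen for this summary), is

$$
\nabla_{\theta}V^{\pi}(s)=\int_{a}\nabla_{\theta}\pi(a|s;\theta)\,Q^{\pi}(s,a)\cdot da
+\gamma\int_{s'}P(s\rightarrow s',1,\pi)\,\nabla_{\theta}V^{\pi}(s')\cdot ds'
$$

The first term is what accumulates into the $M$ terms, and the second term has the same shape as the left-hand side, one step later and with an extra factor $\gamma$.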
Following this idea, we continue simplifying (6). Substituting $V^{\pi}(s_1)=\int_{a_1}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1)da_1$ gives
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\textcolor{blue}{\nabla_{\theta}V^{\pi}(s_1)}\cdot ds_1 \\
& = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\nabla_{\theta}\int_{a_1}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1)da_1\cdot ds_1 \\
& = M_0+ \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\left[\int_{a_1}\nabla_{\theta}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1) da_1+\int_{a_1}\pi(a_1|s_1;\theta)\nabla_{\theta}Q^{\pi}(s_1,a_1)da_1\right]ds_1 \\
& = M_0 + M_1 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\left[\int_{a_1}\pi(a_1|s_1;\theta)\nabla_{\theta}Q^{\pi}(s_1,a_1)da_1\right]ds_1
\end{aligned}
\end{align}
$$
where $M_1= \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\int_{a_1}\nabla_{\theta}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1) da_1ds_1$.
Substituting $Q^{\pi}(s_1,a_1)=\mathcal{R}_{s_1}^{a_1}+\gamma\int_{s_2}P_{s_1,s_2}^{a_1}V^{\pi}(s_2)ds_2$ and using $\nabla_{\theta}\mathcal{R}_{s_1}^{a_1}=0$ again gives
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\left[\int_{a_1}\pi(a_1|s_1;\theta)\nabla_{\theta}\gamma\int_{s_2}P_{s_1,s_2}^{a_1}V^{\pi}(s_2)\cdot ds_2\cdot da_1\right]ds_1 \\
& = M_0 + M_1 \\
& + \int_{s_2}\int_{s_1}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\textcolor{red}{\int_{a_1} \pi(a_1|s_1;\theta) P_{s_1,s_2}^{a_1}\cdot da_1}\cdot ds_1\cdot\nabla_{\theta}V^{\pi}(s_2)\cdot ds_2
\end{aligned}
\end{align}
$$
Likewise, the red part is exactly $\int_{a_1} \pi(a_1|s_1;\theta) P_{s_1,s_2}^{a_1}\cdot da_1=P(s_1\rightarrow s_2,1,\pi)$, so
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_2}\int_{s_1}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]P(s_1\rightarrow s_2,1,\pi)\cdot ds_1\cdot\nabla_{\theta}V^{\pi}(s_2)\cdot ds_2
\end{aligned}
\end{align}
$$
Swapping the order of integration once more gives
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_2}\int_{s_1}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]P(s_1\rightarrow s_2,1,\pi)\cdot ds_1\cdot\nabla_{\theta}V^{\pi}(s_2)\cdot ds_2 \\
& = M_0 + M_1 \\
&+ \int_{s_2}\gamma^2 \nabla_{\theta}V^{\pi}(s_2)\int_{s_0}P_0(s_0)\textcolor{red}{\int_{s_1} P(s_0\rightarrow s_1,1,\pi)P(s_1\rightarrow s_2,1,\pi) \cdot ds_1}\cdot ds_0 \cdot ds_2
\end{aligned}
\end{align}
$$
As can be seen, the red part is again exactly $\int_{s_1} P(s_0\rightarrow s_1,1,\pi)P(s_1\rightarrow s_2,1,\pi) \cdot ds_1=P(s_0\rightarrow s_2,2,\pi)$, so
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
&+ \int_{s_2}\gamma^2 \nabla_{\theta}V^{\pi}(s_2)\int_{s_0}P_0(s_0)P(s_0\rightarrow s_2,2,\pi)\cdot ds_0 \cdot ds_2 \\
& = M_0 + M_1 + \int_{s_2}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_2,2,\pi)\cdot ds_0\right] \textcolor{blue}{\nabla_{\theta}V^{\pi}(s_2)}\cdot ds_2
\end{aligned}
\end{align}
$$
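The step that merges two one-step probabilities into $P(s_0\rightarrow s_2,2,\pi)$ is, in the discrete case, just a matrix product of the one-step kernel with itself. The sketch below illustrates this with a made-up two-state kernel; `P_PI`, `compose`, and `t_step_kernel` are hypothetical names introduced for this example.

```python
# Hypothetical one-step kernel under pi: P_PI[s][s2] = P(s -> s2, 1, pi).
P_PI = [[0.45, 0.55],
        [0.22, 0.78]]


def compose(k1, k2):
    """Sum over the intermediate state: result[s][s2] = sum_m k1[s][m] * k2[m][s2]."""
    n = len(k1)
    return [[sum(k1[s][m] * k2[m][s2] for m in range(n)) for s2 in range(n)]
            for s in range(n)]


def t_step_kernel(p_pi, t):
    """Return P(s -> s2, t, pi) by composing the one-step kernel t times."""
    n = len(p_pi)
    # Zero steps: the state does not move, so start from the identity.
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        kernel = compose(kernel, p_pi)
    return kernel


two_step = t_step_kernel(P_PI, 2)
print(two_step)
```

Repeating the composition yields $P(s_0\rightarrow s_t,t,\pi)$ for any $t$, which is the quantity that appears once the pattern is generalized.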
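At this point, the $M$ terms and the trailing integral terms in (11) and (6) clearly have very similar forms. By spotting the pattern (induction), and noting that the leftover term carries a factor $\gamma^t$ and therefore vanishes as $t\rightarrow\infty$ when $\gamma<1$ and $\nabla_{\theta}V^{\pi}$ is bounded, we obtain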
$$
\begin{align}
\begin{aligned}
\nabla_{\theta}J(\theta) & = \sum_{t=0}^{\infty}{\int_{s_t} \left[\int_{s_0}\gamma^t P_0(s_0)P(s_0\rightarrow s_t,t,\pi)\cdot ds_0\right]\int_{a_t} \nabla_{\theta}\pi(a_t|s_t;\theta)Q^{\pi}(s_t,a_t) \cdot da_t \cdot ds_t} \\
& = \int_{s}\left[\int_{s_0}\sum_{t=0}^{\infty}\gamma^t P_0(s_0)P(s_0\rightarrow s, t, \pi)\cdot ds_0\right]\int_{a} \nabla_{\theta}\pi(a|s;\theta)Q^{\pi}(s,a) \cdot da \cdot ds
\end{aligned}
\end{align}
$$
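For reference, the pattern behind the induction can be written out explicitly. The following is only a sketch of the statement after $t$ rounds, under two assumptions that are added here rather than taken from the slides: $\gamma<1$, and $\nabla_{\theta}V^{\pi}$ is bounded in norm by some constant $C$. The label $M_k$ generalizes $M_0$ and $M_1$, and $P(s_0\rightarrow s_0,0,\pi)$ is read as a point mass at $s_0$.

$$
\nabla_{\theta}J(\theta)=\sum_{k=0}^{t-1}M_k+\int_{s_t}\left[\int_{s_0}\gamma^{t}P_0(s_0)P(s_0\rightarrow s_t,t,\pi)\cdot ds_0\right]\nabla_{\theta}V^{\pi}(s_t)\cdot ds_t
$$

$$
M_k=\int_{s_k}\left[\int_{s_0}\gamma^{k}P_0(s_0)P(s_0\rightarrow s_k,k,\pi)\cdot ds_0\right]\int_{a_k}\nabla_{\theta}\pi(a_k|s_k;\theta)Q^{\pi}(s_k,a_k)\cdot da_k\cdot ds_k
$$

Setting $t=1$ and $t=2$ recovers (6) and (11). Because the bracketed weight integrates to $\gamma^{t}$ over $s_t$, the remainder is at most $\gamma^{t}C$ in norm and tends to zero, leaving only the sum of the $M_k$ terms.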
The end of equation (1) gave $\rho^{\pi}(s) = \sum_{t=0}^\infty{\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0}$, and hence
$$
\nabla_{\theta}J(\theta)=\int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\nabla_{\theta}\pi(a|s;\theta)\cdot Q^{\pi}(s,a)\cdot da \cdot ds
$$
This completes the proof.
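As a sanity check, the theorem can be compared against a finite-difference gradient on a tiny discrete problem, where the integrals become sums. The sketch below assumes a made-up two-state, two-action model and a tabular softmax policy with one logit `theta[s][a]` per state-action pair; the arrays, the iteration count `STEPS`, and all function names are hypothetical choices for this illustration, not something taken from the paper or the slides.

```python
import math

# Hypothetical tabular problem; all numbers are made up for illustration.
GAMMA = 0.9
P0 = [0.6, 0.4]                                  # initial distribution
P = [[[0.7, 0.3], [0.2, 0.8]],                   # P[s][a][s2]
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.5, 2.0]]                     # R[s][a]
N_S, N_A = 2, 2
STEPS = 2000                                     # truncation of infinite sums


def softmax_policy(theta):
    """pi[s][a] = exp(theta[s][a]) / sum_b exp(theta[s][b])."""
    pi = []
    for row in theta:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        pi.append([x / z for x in e])
    return pi


def q_values(pi):
    """Iterate Q = R + gamma * P V and V = sum_a pi * Q until converged."""
    v = [0.0] * N_S
    for _ in range(STEPS):
        q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        v = [sum(pi[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return q


def objective(theta):
    """J(theta) = sum_s0 P0(s0) * V(s0), as in equation (2)."""
    pi = softmax_policy(theta)
    q = q_values(pi)
    return sum(P0[s] * sum(pi[s][a] * q[s][a] for a in range(N_A))
               for s in range(N_S))


def occupancy(pi):
    """rho(s) = sum_t gamma^t * Pr(s_t = s), truncated after STEPS terms."""
    dist, rho, disc = list(P0), [0.0] * N_S, 1.0
    for _ in range(STEPS):
        rho = [rho[s] + disc * dist[s] for s in range(N_S)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in range(N_S) for a in range(N_A))
                for s2 in range(N_S)]
        disc *= GAMMA
    return rho


def theorem_gradient(theta):
    """sum_s rho(s) * sum_a grad pi(a|s) * Q(s, a), per logit theta[s][b]."""
    pi = softmax_policy(theta)
    q, rho = q_values(pi), occupancy(pi)
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            # Softmax: d pi(a|s) / d theta[s][b] = pi(a|s) * (1[a == b] - pi(b|s)).
            grad[s][b] = rho[s] * sum(
                pi[s][a] * ((1.0 if a == b else 0.0) - pi[s][b]) * q[s][a]
                for a in range(N_A))
    return grad


def finite_difference_gradient(theta, eps=1e-5):
    """Central differences of J(theta), one logit at a time."""
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][b] += eps
            dn[s][b] -= eps
            grad[s][b] = (objective(up) - objective(dn)) / (2 * eps)
    return grad


theta = [[0.2, -0.1], [0.0, 0.5]]
g_thm = theorem_gradient(theta)
g_fd = finite_difference_gradient(theta)
max_gap = max(abs(g_thm[s][b] - g_fd[s][b])
              for s in range(N_S) for b in range(N_A))
print(g_thm, g_fd, max_gap)
```

If the two gradients agree up to finite-difference error, that supports the formula on this toy problem. It is a numerical check on one example, not a substitute for the proof.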