Proof of the Policy Gradient Theorem in Reinforcement Learning

  • Preface
  • The policy gradient theorem
  • Preliminary formulas
  • Proof
    • Deriving the theorem-like form of $J(\theta)$
    • Proof of the theorem

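Preface

I have not posted in a long while. I recently read the original Policy Gradient paper but could not follow the proof in it, so I looked up the Stanford University slides (PPT) on the proof of the policy gradient theorem, which I found clearer.

Links:
Original paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation.
Slides: PPT.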

The policy gradient theorem

The significance of the policy gradient theorem is that it directly links the gradient of the objective function $J(\theta)$ to the gradient of the policy $\pi(a|s;\theta)$. (Here $\theta$ denotes the parameters of the policy function; if a neural network (NN) is used as the policy, $\theta$ is the set of parameters of that network.) The theorem reads:
$$
\nabla_{\theta}J(\theta)=\int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\nabla_{\theta}\pi(a|s;\theta)\cdot Q^{\pi}(s,a)\cdot da \cdot ds
$$
where $\nabla_{\theta}J(\theta)=\frac{\partial}{\partial\theta}J(\theta)$ and
$$
\rho^{\pi}(s) = \sum_{t=0}^{\infty}\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0
$$

Preliminary formulas

My background is fairly weak, so while working through the derivation I did not know what many of the formulas meant. I looked them up and collect them here for reference.

  • Expectation formula from probability theory: $\mathbb{E}(X)=\int_{x \in X}p(x)\,x\,dx$
  • Relations between the state-value function $V^{\pi}(s)$ and the action-value function $Q^{\pi}(s,a)$:
    • $V^{\pi}(s) \rightarrow Q^{\pi}(s,a)$: $V^{\pi}(s) = \int_{a\in\mathcal{A}}\pi(a|s;\theta)Q^{\pi}(s,a)\,da$
    • $Q^{\pi}(s,a) \rightarrow V^{\pi}(s')$: $Q^{\pi}(s,a) = \mathcal{R}_s^a + \gamma\int_{s'\in\mathcal{S}}P_{s,s'}^{a}V^{\pi}(s')\,ds'$

Loosely speaking, the integrals here can be replaced by sums; either way they range over the entire action space or state space.
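As a quick sanity check of the two conversion formulas, here is a minimal NumPy sketch in the discrete (sum) setting, with the discount factor $\gamma$ written explicitly as in the derivation later on. The random MDP, the array names `P`, `R`, `pi`, and the sizes are all illustrative assumptions and do not come from the paper or the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Hypothetical random MDP (illustration only):
# P[s, a, t] = probability of moving from s to t under action a,
# R[s, a]    = expected immediate reward.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

# Arbitrary stochastic policy pi[s, a] = pi(a | s).
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)

# Policy-averaged transition matrix and reward.
P_pi = np.einsum("sa,sat->st", pi, P)
R_pi = (pi * R).sum(axis=1)

# V solves the linear Bellman equation V = R_pi + gamma * P_pi @ V.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Q -> V(s'): Q(s, a) = R(s, a) + gamma * sum_t P[s, a, t] * V(t)
Q = R + gamma * np.einsum("sat,t->sa", P, V)

# V -> Q: V(s) = sum_a pi(a | s) * Q(s, a)
V_from_Q = (pi * Q).sum(axis=1)

print("max |V - sum_a pi*Q| =", np.abs(V - V_from_Q).max())
```

The printed difference should be at the level of floating-point round-off, which is what the two formulas predict when they are applied one after the other.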

Proof

The proof as a whole follows the slides linked above. The only difference is that I carry the induction one term further, so that the result looks more obvious.

Deriving the theorem-like form of $J(\theta)$

$$
\begin{aligned}
J(\theta) & = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] \\
& = \sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{\pi}\left[r_t\right] \\
& = \sum_{t=0}^{\infty}\gamma^t\int_{s\in\mathcal{S}}\left[\int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0\right]\int_{a\in\mathcal{A}}\pi(a|s;\theta)\mathcal{R}_s^a\cdot da\cdot ds \\
& = \int_{s\in\mathcal{S}}\left[\sum_{t=0}^{\infty}\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0\right]\int_{a\in\mathcal{A}}\pi(a|s;\theta)\mathcal{R}_s^a \cdot da \cdot ds \\
& = \int_{s\in\mathcal{S}}\rho^{\pi}(s)\int_{a\in\mathcal{A}}\pi(a|s;\theta)\mathcal{R}_s^a \cdot da \cdot ds
\end{aligned}
\tag{1}
$$
Here $\rho^{\pi}(s) = \sum_{t=0}^{\infty}\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0$, where $P_0(s_0)$ is the initial state distribution and $P(s_0\rightarrow s,t,\pi)$ is the probability that the system, following policy $\pi$, moves from state $s_0$ to state $s$ in $t$ steps. In particular, $P(s_0\rightarrow s_1,1,\pi)$ is the probability of moving from $s_0$ to $s_1$ in one step (remember this notation; the final recursion uses it).

At this point $J(\theta)$ already has roughly the form of the theorem.
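To make $\rho^{\pi}(s)$ concrete, the sketch below computes it in a small discrete setting as the discounted sum of the $t$-step state distributions, and checks that $\sum_s \rho^{\pi}(s)\sum_a \pi(a|s)\mathcal{R}_s^a$ reproduces $J(\theta)=\sum_{s_0}P_0(s_0)V^{\pi}(s_0)$. The random MDP and all variable names are assumptions made for illustration; the closed form $(I-\gamma P_\pi)^{-\top}P_0$ is the standard geometric-series identity for finite state spaces, not something taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9

# Hypothetical random MDP, policy and initial distribution (illustration only).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
P0 = rng.random(n_states)
P0 /= P0.sum()

P_pi = np.einsum("sa,sat->st", pi, P)   # one-step kernel P(s -> s', 1, pi)
R_pi = (pi * R).sum(axis=1)             # sum_a pi(a|s) R(s, a)

# Truncated series: rho(s) ~= sum_t gamma^t * Pr(s_t = s).
rho_series = np.zeros(n_states)
state_dist = P0.copy()                  # distribution of s_t, starting at t = 0
for t in range(2000):
    rho_series += gamma**t * state_dist
    state_dist = state_dist @ P_pi      # advance one step under pi

# Closed form of the same series: rho = (I - gamma * P_pi)^{-T} P0.
rho_closed = np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, P0)

# J computed two ways: via the initial-state value and via rho.
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
J_from_V = P0 @ V
J_from_rho = rho_series @ R_pi

print("J from V:", J_from_V, " J from rho:", J_from_rho)
```

Note that $\rho^{\pi}$ defined this way is not a normalized probability distribution: in the discrete case its entries sum to $1/(1-\gamma)$.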

Proof of the theorem

Equation (2) gives the objective function defined through the value function of the initial state:
$$
\begin{aligned}
J(\theta) & = \int_{s_0}P_0(s_0)V^{\pi}(s_0) \cdot ds_0 \\
& = \int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0
\end{aligned}
\tag{2}
$$
Note that, to save space, the integration domains are abbreviated: $\int_{s_0}$ and $\int_{a_0}$ stand for $\int_{s_0\in\mathcal{S}}$ and $\int_{a_0\in\mathcal{A}}$. From (2) we obtain the derivative of the objective function with respect to $\theta$:
$$
\nabla_{\theta}J(\theta) = \nabla_{\theta}\left[\int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0\right]
\tag{3}
$$
From here the seemingly endless derivation begins. Applying the product rule, and then the preliminary formula that converts the action-value function into the state-value function, gives
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = \int_{s_0}P_0(s_0)\int_{a_0}\nabla_{\theta}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0 \\
& + \int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)\nabla_{\theta}Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0 \\
& = M_0+\int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)\nabla_{\theta}\left[\mathcal{R}_{s_0}^{a_0}+\int_{s_1} \gamma P_{s_0,s_1}^{a_0}V^{\pi}(s_1) \cdot ds_1\right]\cdot da_0 \cdot ds_0
\end{aligned}
\tag{4}
$$
where $M_0=\int_{s_0}P_0(s_0)\int_{a_0}\nabla_{\theta}\pi(a_0|s_0;\theta)Q^{\pi}(s_0,a_0)\cdot da_0 \cdot ds_0$.

Since $\nabla_{\theta}\mathcal{R}_{s_0}^{a_0}=0$ (the reward does not depend on $\theta$, and neither does the transition probability $P_{s_0,s_1}^{a_0}$), we get
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + \int_{s_0}P_0(s_0)\int_{a_0}\pi(a_0|s_0;\theta)\nabla_{\theta}\int_{s_1} \gamma P_{s_0,s_1}^{a_0}V^{\pi}(s_1) \cdot ds_1\cdot da_0 \cdot ds_0 \\
& = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)\textcolor{red}{\int_{a_0}\pi(a_0|s_0;\theta)P_{s_0,s_1}^{a_0}\cdot da_0}\cdot ds_0\right]\nabla_{\theta}V^{\pi}(s_1)\cdot ds_1
\end{aligned}
\tag{5}
$$
where the red part is exactly
$$
\int_{a_0}\pi(a_0|s_0;\theta)P_{s_0,s_1}^{a_0}\cdot da_0=P(s_0\rightarrow s_1,1,\pi)
$$
Therefore,
$$
\nabla_{\theta}J(\theta) = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right] \cdot \textcolor{blue}{\nabla_{\theta}V^{\pi}(s_1)}\cdot ds_1
\tag{6}
$$
This completes the first round of the recursion. The blue part of the resulting equation (6) has a form similar to the first line of equation (2), with $s_1$ in place of $s_0$. The overall strategy of the derivation is therefore:

  1. $V(s_t) \rightarrow Q(s_t, a_t)$
  2. Split off a term $M_t$
  3. $Q(s_t,a_t) \rightarrow V(s_{t+1})$
  4. Swap the order of integration
  5. Obtain $P(s_t \rightarrow s_{t+1},1,\pi)$
  6. What remains is "a bunch of stuff" and a $V(s_{t+1})$
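The six steps above amount to a single recursion for the gradient of the state-value function. As a compact sketch (assuming, as in the rest of the derivation, that neither the reward $\mathcal{R}_{s}^{a}$ nor the transition probability $P_{s,s'}^{a}$ depends on $\theta$), one round can be written as:

$$
\begin{aligned}
\nabla_{\theta}V^{\pi}(s_t) & = \nabla_{\theta}\int_{a_t}\pi(a_t|s_t;\theta)Q^{\pi}(s_t,a_t)\cdot da_t \\
& = \int_{a_t}\nabla_{\theta}\pi(a_t|s_t;\theta)Q^{\pi}(s_t,a_t)\cdot da_t + \int_{a_t}\pi(a_t|s_t;\theta)\nabla_{\theta}Q^{\pi}(s_t,a_t)\cdot da_t \\
& = \int_{a_t}\nabla_{\theta}\pi(a_t|s_t;\theta)Q^{\pi}(s_t,a_t)\cdot da_t + \gamma\int_{s_{t+1}}P(s_t\rightarrow s_{t+1},1,\pi)\nabla_{\theta}V^{\pi}(s_{t+1})\cdot ds_{t+1}
\end{aligned}
$$

Weighting the first term by the discounted probability of being in $s_t$ and integrating over $s_t$ yields $M_t$; the second term is what the next round expands.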

Following this strategy, we continue to simplify (6). Substituting $V^{\pi}(s_1)=\int_{a_1}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1)\,da_1$ gives
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\textcolor{blue}{\nabla_{\theta}V^{\pi}(s_1)}\cdot ds_1 \\
& = M_0 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\nabla_{\theta}\int_{a_1}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1)\,da_1\cdot ds_1 \\
& = M_0+ \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\left[\int_{a_1}\nabla_{\theta}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1)\, da_1+\int_{a_1}\pi(a_1|s_1;\theta)\nabla_{\theta}Q^{\pi}(s_1,a_1)\,da_1\right]ds_1 \\
& = M_0 + M_1 + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\left[\int_{a_1}\pi(a_1|s_1;\theta)\nabla_{\theta}Q^{\pi}(s_1,a_1)\,da_1\right]ds_1
\end{aligned}
\tag{7}
$$
where $M_1= \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\int_{a_1}\nabla_{\theta}\pi(a_1|s_1;\theta)Q^{\pi}(s_1,a_1)\, da_1\,ds_1$.

Substituting $Q^{\pi}(s_1,a_1)=\mathcal{R}_{s_1}^{a_1}+\gamma\int_{s_2}P_{s_1,s_2}^{a_1}V^{\pi}(s_2)\,ds_2$ gives
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_1}\left[\int_{s_0}\gamma P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\left[\int_{a_1}\pi(a_1|s_1;\theta)\nabla_{\theta}\gamma\int_{s_2}P_{s_1,s_2}^{a_1}V^{\pi}(s_2)\cdot ds_2\cdot da_1\right]ds_1 \\
& = M_0 + M_1 \\
& + \int_{s_2}\int_{s_1}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]\textcolor{red}{\int_{a_1} \pi(a_1|s_1;\theta) P_{s_1,s_2}^{a_1}\cdot da_1}\cdot ds_1\cdot\nabla_{\theta}V^{\pi}(s_2)\cdot ds_2
\end{aligned}
\tag{8}
$$
Likewise, the red part is exactly $\int_{a_1} \pi(a_1|s_1;\theta) P_{s_1,s_2}^{a_1}\cdot da_1=P(s_1\rightarrow s_2,1,\pi)$, hence
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_2}\int_{s_1}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]P(s_1\rightarrow s_2,1,\pi)\cdot ds_1\cdot\nabla_{\theta}V^{\pi}(s_2)\cdot ds_2
\end{aligned}
\tag{9}
$$
Swapping the order of integration once more gives
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_2}\int_{s_1}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_1,1,\pi)\cdot ds_0\right]P(s_1\rightarrow s_2,1,\pi)\cdot ds_1\cdot\nabla_{\theta}V^{\pi}(s_2)\cdot ds_2 \\
& = M_0 + M_1 \\
& + \int_{s_2}\gamma^2 \nabla_{\theta}V^{\pi}(s_2)\int_{s_0}P_0(s_0)\textcolor{red}{\int_{s_1} P(s_0\rightarrow s_1,1,\pi)P(s_1\rightarrow s_2,1,\pi) \cdot ds_1}\cdot ds_0 \cdot ds_2
\end{aligned}
\tag{10}
$$
The red part is again exactly $\int_{s_1} P(s_0\rightarrow s_1,1,\pi)P(s_1\rightarrow s_2,1,\pi) \cdot ds_1=P(s_0\rightarrow s_2,2,\pi)$ (the Chapman-Kolmogorov relation for composing two one-step transitions), hence
$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = M_0 + M_1 \\
& + \int_{s_2}\gamma^2 \nabla_{\theta}V^{\pi}(s_2)\int_{s_0}P_0(s_0)P(s_0\rightarrow s_2,2,\pi)\cdot ds_0 \cdot ds_2 \\
& = M_0 + M_1 + \int_{s_2}\left[\int_{s_0}\gamma^2 P_0(s_0)P(s_0\rightarrow s_2,2,\pi)\cdot ds_0\right] \textcolor{blue}{\nabla_{\theta}V^{\pi}(s_2)}\cdot ds_2
\end{aligned}
\tag{11}
$$
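The two identities used in each round (integrating out the action to get the one-step kernel, and composing two one-step kernels into a two-step kernel) can be checked numerically in the discrete case. The sketch below uses a hypothetical random MDP; the names `P`, `pi`, `P0` and the sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 4, 3

# Hypothetical random MDP, policy and initial distribution (illustration only).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
P0 = rng.random(n_states)
P0 /= P0.sum()

# One-step kernel: P(s -> s', 1, pi) = sum_a pi(a|s) * P[s, a, s'].
P1 = np.einsum("sa,sat->st", pi, P)

# Two-step kernel by explicit enumeration over (a0, s1, a1):
# indices i = s0, a = a0, j = s1, b = a1, k = s2.
P2_enumerated = np.einsum("ia,iaj,jb,jbk->ik", pi, P, pi, P)

# Two-step kernel by composing one-step kernels:
# P(s0 -> s2, 2, pi) = sum_{s1} P(s0 -> s1, 1, pi) * P(s1 -> s2, 1, pi).
P2_composed = P1 @ P1

# Distribution of s_2 when s_0 ~ P0, i.e. the bracket in (11) without gamma^2.
dist_s2 = P0 @ P2_composed

print("max |enumerated - composed| =", np.abs(P2_enumerated - P2_composed).max())
```

Both constructions of the two-step kernel should agree up to round-off, and each row of the kernels sums to 1.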
At this point, the $M$ terms and the trailing integral terms in equations (11) and (6) have very similar forms. Spotting the pattern (induction), and noting that the remaining term carries a factor $\gamma^t$ and so vanishes as $t\rightarrow\infty$ when $\gamma<1$ and $\nabla_{\theta}V^{\pi}$ is bounded, we obtain

$$
\begin{aligned}
\nabla_{\theta}J(\theta) & = \sum_{t=0}^{\infty}\int_{s_t} \left[\int_{s_0}\gamma^t P_0(s_0)P(s_0\rightarrow s_t,t,\pi)\cdot ds_0\right]\int_{a_t} \nabla_{\theta}\pi(a_t|s_t;\theta)Q^{\pi}(s_t,a_t) \cdot da_t \cdot ds_t \\
& = \int_{s}\left[\int_{s_0}\sum_{t=0}^{\infty}\gamma^t P_0(s_0)P(s_0\rightarrow s, t, \pi)\cdot ds_0\right]\int_{a} \nabla_{\theta}\pi(a|s;\theta)Q^{\pi}(s,a) \cdot da \cdot ds
\end{aligned}
\tag{12}
$$
At the end of equation (1) we obtained $\rho^{\pi}(s) = \sum_{t=0}^{\infty}\gamma^t \int_{s_0\in\mathcal{S}}P_0(s_0)P(s_0\rightarrow s,t,\pi)\cdot ds_0$, and therefore
$$
\nabla_{\theta}J(\theta)=\int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\nabla_{\theta}\pi(a|s;\theta)\cdot Q^{\pi}(s,a)\cdot da \cdot ds
$$
This completes the proof.
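As a numerical cross-check of the final formula, the sketch below compares $\sum_s\rho^{\pi}(s)\sum_a\nabla_{\theta}\pi(a|s;\theta)Q^{\pi}(s,a)$ with a central finite-difference gradient of $J(\theta)$ on a small discrete MDP. Everything specific here is an assumption chosen for illustration: the random MDP, the tabular softmax parameterization `theta[s, a]`, and the helper names `softmax_policy`, `evaluate`, `theorem_gradient` and `finite_difference_gradient`. None of it comes from the paper or the slides.

```python
import numpy as np

def softmax_policy(theta):
    """Tabular softmax policy: pi(a|s) proportional to exp(theta[s, a])."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(theta, P, R, P0, gamma):
    """Return J(theta), Q, rho and pi, all computed exactly by linear solves."""
    n_states = P.shape[0]
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    R_pi = (pi * R).sum(axis=1)
    A = np.eye(n_states) - gamma * P_pi
    V = np.linalg.solve(A, R_pi)                  # V = R_pi + gamma * P_pi V
    Q = R + gamma * np.einsum("sat,t->sa", P, V)  # Q from V
    rho = np.linalg.solve(A.T, P0)                # discounted state weighting
    return P0 @ V, Q, rho, pi

def theorem_gradient(theta, P, R, P0, gamma):
    """Gradient from the theorem: sum_s rho(s) sum_a grad pi(a|s) Q(s, a)."""
    _, Q, rho, pi = evaluate(theta, P, R, P0, gamma)
    # For softmax, d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s)),
    # so sum_a (d pi(a|s) / d theta[s, b]) * Q(s, a) = pi(b|s) * (Q(s, b) - V(s)).
    V = (pi * Q).sum(axis=1, keepdims=True)
    return rho[:, None] * pi * (Q - V)

def finite_difference_gradient(theta, P, R, P0, gamma, eps=1e-6):
    """Central finite differences of J with respect to every theta[s, a]."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        j_plus = evaluate(theta + step, P, R, P0, gamma)[0]
        j_minus = evaluate(theta - step, P, R, P0, gamma)[0]
        grad[idx] = (j_plus - j_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(3)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
P0 = rng.random(n_states)
P0 /= P0.sum()
theta = rng.normal(size=(n_states, n_actions))

grad_theorem = theorem_gradient(theta, P, R, P0, gamma)
grad_fd = finite_difference_gradient(theta, P, R, P0, gamma)
print("max abs difference:", np.abs(grad_theorem - grad_fd).max())
```

If the theorem holds, the two gradients should agree up to finite-difference error. The exact linear solves are only feasible for tiny tabular problems; they are used here so that the comparison is free of sampling noise.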
