最近在学习TRPO,但是本人发现TRPO用了好多好多好多前人的论文结论,包括策略梯度定理 (Policy gradient theorem, PGT),自然策略梯度 (Natural policy gradient, NPG)和保守策略迭代 (Conservative policy iteration, CPI)。
因此,本人最近更新的博客数量剧增。。。PGT 和 NPG 可以参考我的另外两个博客
PGT.
NPG.
CPI 的主要贡献在于它提出了一种混合策略更新的方法,并且给出了使用这种方法更新策略可以使值函数单调增加的证明。可以设想一下,如果值函数单调不减,那么理论上是可以说明得到的策略是越来越好的。
实际上,CPI 算法并不会被简单粗暴地应用在程序中,因为结论中提到的变量都是未知的,都是很难确定的,所以CPI 算法更多地是提供一种解决一类RL问题的思路。然后,十几年之后,TRPO的提出就是这个思路被成功应用的很典型的一个案例。
贪婪策略是十分常见的一种策略更新方式,每一回合计算出优势函数 A π o l d ( s , a ) A^{\pi_{old}}(s,a) Aπold(s,a),然后反向找到使这个 A π o l d ( s , a ) A^{\pi_{old}}(s,a) Aπold(s,a)最大的策略 π \pi π作为 π n e w \pi_{new} πnew,简单粗暴。(有的时候也会用 Q π o l d ( s , a ) Q^{\pi_{old}}(s,a) Qπold(s,a),本质上一样),即:
π n e w = arg max π ∈ Π E s ∼ d μ π [ A π ( s , a ) ] \begin{align} \begin{aligned} \pi_{new}=\arg\max_{\pi\in\Pi}{\mathbb{E}_{s\sim d_{\mu}^{\pi}}{\left[A^{\pi}(s,a)\right]}} \end{aligned} \end{align} πnew=argπ∈ΠmaxEs∼dμπ[Aπ(s,a)]
但是这就存在一个问题,如果两次的改进比较大,那么很有可能会导致由 π o l d \pi_{old} πold和 π n e w \pi_{new} πnew所诱导出的轨迹存在较大差别。考虑到此时值函数并没有收敛 (甚至刚开始学习),那么就会导致策略在一定范围内“反复横跳” (哎,就是玩,你说啥都没用,我就不收敛,你能咋的…>_<)。这也是混合策略提出的初衷。
CPI将策略的更新变成旧策略和临时策略的加权平均,即
{ π ′ = arg max π ∈ Π E s ∼ d μ π o l d [ A π ( s , a ) ] π n e w = ( 1 − α ) π + α π ′ \begin{align} \left\{ \begin{aligned} & \pi'=\arg\max_{\pi\in\Pi}{\mathbb{E}_{s\sim d_{\mu}^{\pi_{old}}}{\left[A^{\pi}(s,a)\right]}} \\ & \pi_{new}=(1-\alpha)\pi+\alpha\pi' \end{aligned} \right. \end{align} ⎩ ⎨ ⎧π′=argπ∈ΠmaxEs∼dμπold[Aπ(s,a)]πnew=(1−α)π+απ′
这个式子给出来挺容易,其实到这里就已经结束了,问题是,如何说明它确实“有所提高”。这里给出三个引理
对于两个策略 π \pi π和 π ′ \pi' π′,以及他们所诱导出的状态的分布 d μ π d_{\mu}^{\pi} dμπ和 d μ π ′ d_{\mu}^{\pi'} dμπ′,有
{ V π ′ ( s 0 ) − V π ( s 0 ) = 1 1 − γ E s ∼ d μ π ′ [ A π ( s , π ′ ( s ) ) ] V π ( s 0 ) − V π ′ ( s 0 ) = 1 1 − γ E s ∼ d μ π [ A π ′ ( s , π ( s ) ) ] \begin{align} \left\{ \begin{aligned} & V^{\pi'}(s_0)-V^{\pi}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi'}}\left[A^{\pi}(s,\pi'(s))\right] \\ & V^{\pi}(s_0)-V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] \end{aligned} \right. \end{align} ⎩ ⎨ ⎧Vπ′(s0)−Vπ(s0)=1−γ1Es∼dμπ′[Aπ(s,π′(s))]Vπ(s0)−Vπ′(s0)=1−γ1Es∼dμπ[Aπ′(s,π(s))]
成立,其中 μ \mu μ为状态的初始分布, A A A是优势函数。
Proof V π ( s 0 ) − V π ′ ( s 0 ) = 1 1 − γ E s ∼ d μ π [ A π ′ ( s , π ( s ) ) ] V^{\pi}(s_0)-V^{\pi'}(s_0)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] Vπ(s0)−Vπ′(s0)=1−γ1Es∼dμπ[Aπ′(s,π(s))]
V π ( s 0 ) − V π ′ ( s 0 ) = V π ( s 0 ) − E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 V π ′ ( s 1 ) ] + E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 V π ′ ( s 1 ) ] − V π ′ ( s 0 ) = E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) ] ] − E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ′ ( s 1 ) ] ] + E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ′ ( s 1 ) ] ] − V π ′ ( s 0 ) = γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] + E a 0 ∼ π [ r s 0 a 0 + γ E s 1 ∼ P s 0 a 0 [ V π ′ ( s 1 ) ] ] − V π ′ ( s 0 ) = γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] + E a 0 ∼ π [ Q π ′ ( s 0 , a 0 ) − V π ′ ( s 0 ) ] = γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] + E a 0 ∼ π A π ′ ( s 0 , a 0 ) \begin{align} \begin{aligned} V^{\pi}(s_0)-V^{\pi'}(s_0) & = V^{\pi}(s_0) -\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}V^{\pi'}(s_1)\right] + \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}V^{\pi'}(s_1)\right] - V^{\pi'}(s_0)\\ &=\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)\right]\right] -\mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] \\ &+ \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] - V^{\pi'}(s_0) \\ &= \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right]+ \mathbb{E}_{a_0\sim\pi}\left[r_{s_0}^{a_0}+\gamma\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi'}(s_1)\right]\right] - V^{\pi'}(s_0) \\ & = \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] + \mathbb{E}_{a_0\sim\pi}\left[Q^{\pi'}(s_0,a_0) - V^{\pi'}(s_0)\right]\\ &= \textcolor{red}{\gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right]} + \mathbb{E}_{a_0\sim\pi}A^{\pi'}(s_0,a_0) \end{aligned} \end{align} Vπ(s0)−Vπ′(s0)=Vπ(s0)−Ea0∼π[rs0a0+γEs1∼Ps0a0Vπ′(s1)]+Ea0∼π[rs0a0+γEs1∼Ps0a0Vπ′(s1)]−Vπ′(s0)=Ea0∼π[rs0a0+γEs1∼Ps0a0[Vπ(s1)]]−Ea0∼π[rs0a0+γEs1∼Ps0a0[Vπ′(s1)]]+Ea0∼π[rs0a0+γEs1∼Ps0a0[Vπ′(s1)]]−Vπ′(s0)=γEa0∼πEs1∼Ps0a0[Vπ(s1)−Vπ′(s1)]+Ea0∼π[rs0a0+γEs1∼Ps0a0[Vπ′(s1)]]−Vπ′(s0)=γEa0∼πEs1∼Ps0a0[Vπ(s1)−Vπ′(s1)]+Ea0∼π[Qπ′(s0,a0)−Vπ′(s0)]=γEa0∼πEs1∼Ps0a0[Vπ(s1)−Vπ′(s1)]+Ea0∼πAπ′(s0,a0)
红色部分中括号里面的部分的形式与最初的等式形式相同。因此,继续推导红色部分
γ E a 0 ∼ π E s 1 ∼ P s 0 a 0 [ V π ( s 1 ) − V π ′ ( s 1 ) ] = γ E s 1 ∼ P 1 π ( ⋅ ∣ s 0 ) [ V π ( s 1 ) − V π ′ ( s 1 ) ] = γ E s 1 ∼ P 1 π ( ⋅ ∣ s 0 ) [ γ E a 1 ∼ π E s 2 ∼ P s 2 a 2 [ V π ( s 2 ) − V π ′ ( s 2 ) ] ] = γ 2 E s 2 ∼ P 2 π ( ⋅ ∣ s 0 ) [ V π ( s 2 ) − V π ′ ( s 2 ) ] + γ E a 1 ∼ π A π ′ ( s 1 , a 1 ) \begin{align} \begin{aligned} \gamma\mathbb{E}_{a_0\sim\pi}\mathbb{E}_{s_1\sim P_{s_0}^{a_0}}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] & = \gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\left[V^{\pi}(s_1)-V^{\pi'}(s_1)\right] \\ & = \gamma\mathbb{E}_{s_1\sim P_1^{\pi}(\cdot|s_0)}\left[\gamma\mathbb{E}_{a_1\sim\pi}\mathbb{E}_{s_2\sim P_{s_2}^{a_2}}\left[V^{\pi}(s_2)-V^{\pi'}(s_2)\right]\right] \\ &= \gamma^2\mathbb{E}_{s_2\sim P_2^{\pi}(\cdot|s_0)}\left[V^{\pi}(s_2)-V^{\pi'}(s_2)\right]+\gamma\mathbb{E}_{a_1\sim \pi}A^{\pi'}(s_1,a_1) \end{aligned} \end{align} γEa0∼πEs1∼Ps0a0[Vπ(s1)−Vπ′(s1)]=γEs1∼P1π(⋅∣s0)[Vπ(s1)−Vπ′(s1)]=γEs1∼P1π(⋅∣s0)[γEa1∼πEs2∼Ps2a2[Vπ(s2)−Vπ′(s2)]]=γ2Es2∼P2π(⋅∣s0)[Vπ(s2)−Vπ′(s2)]+γEa1∼πAπ′(s1,a1)
由此类推,推导至无穷项,可以得出
V π ( s 0 ) − V π ′ ( s 0 ) = ∑ t = 0 ∞ γ t E s , a ∼ π A π ′ ( s , a ) = 1 1 − γ E s ∼ d μ π [ A π ′ ( s , π ( s ) ) ] \begin{align} \begin{aligned} V^{\pi}(s_0)-V^{\pi'}(s_0) &= \sum_{t=0}^{\infty}\gamma^t \mathbb{E}_{s,a\sim\pi}A^{\pi'}(s,a)\\ &=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mu}^{\pi}}\left[A^{\pi'}(s,\pi(s))\right] \end{aligned} \end{align} Vπ(s0)−Vπ′(s0)=t=0∑∞γtEs,a∼πAπ′(s,a)=1−γ1Es∼dμπ[Aπ′(s,π(s))]
第二个的证明与第一个完全一样。整体的表达式是对称的,只需要把 π ′ \pi' π′和 π \pi π互换就好。
Lemma 2 还有另外一种表达形式 (这个就是恶心的地方,论文里都说显然,然后不同的资料用的符号和定义都不一样,或者实际上一样,但是看起来不一样>_<)
{ V π ′ ( s 0 ) − V π ( s 0 ) = ∑ s d π ′ ( s ) ∑ a π ′ ( s , a ) A π ( s , a ) V π ( s 0 ) − V π ′ ( s 0 ) = ∑ s d π ( s ) ∑ a π ( s , a ) A π ′ ( s , a ) \begin{align} \left\{ \begin{aligned} & V^{\pi'}(s_0)-V^{\pi}(s_0)=\sum_sd^{\pi'}(s)\sum_{a}{\pi'(s,a)A^{\pi}(s,a)} \\ & V^{\pi}(s_0)-V^{\pi'}(s_0)=\sum_sd^{\pi}(s)\sum_{a}{\pi(s,a)A^{\pi'}(s,a)} \end{aligned} \right. \end{align} ⎩ ⎨ ⎧Vπ′(s0)−Vπ(s0)=s∑dπ′(s)a∑π′(s,a)Aπ(s,a)Vπ(s0)−Vπ′(s0)=s∑dπ(s)a∑π(s,a)Aπ′(s,a)
对于混合策略更新,有
V π n e w − V π ≥ α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ ( 1 − α ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi}\geq\alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma(1-\alpha)} \end{aligned} \end{align} Vπnew−Vπ≥αAπ(π′)−1−γ(1−α)2α2γϵ
成立。其中
A π ( π ′ ) ≡ ∑ s d π ( s ) ∑ a π ′ ( s , a ) A π ( s , a ) ϵ ≡ 1 1 − γ [ max s ∑ a π ′ ( s , a ) A π ( s , a ) ] A_{\pi}(\pi')\equiv\sum_sd^{\pi}(s)\sum_a\pi'(s,a)A^{\pi}(s,a)\\ \epsilon\equiv\frac{1}{1-\gamma}\left[ \max_s\sum_a\pi'(s,a)A^{\pi}(s,a) \right] Aπ(π′)≡s∑dπ(s)a∑π′(s,a)Aπ(s,a)ϵ≡1−γ1[smaxa∑π′(s,a)Aπ(s,a)]
Proof
在这里,可以将 π n e w \pi_{new} πnew看做 α \alpha α的函数。当 α = 0 \alpha=0 α=0时, π n e w = π \pi_{new}=\pi πnew=π;当 α = 1 \alpha=1 α=1时, π n e w = π ′ \pi_{new}=\pi' πnew=π′。我们计算 π n e w \pi_{new} πnew对 α \alpha α的导数在 α = 0 \alpha=0 α=0时的数值
∇ α V π n e w ∣ α = 0 = ∑ s d π n e w ( s ) ∑ a ∇ α π n e w A π n e w ( s , a ) ∣ α = 0 = ∑ s d π n e w ( s ) ∑ a ( π ′ − π ) A π n e w ( s , a ) ∣ α = 0 = ∑ s d π ( s ) ∑ a ( π ′ − π ) A π ( s , a ) = ∑ s d π ( s ) ∑ a π ′ A π ( s , a ) − ∑ s d π ( s ) ∑ a π A π ( s , a ) = ∑ s d π ( s ) ∑ a π ′ A π ( s , a ) = A π ( π ′ ) \begin{align} \begin{aligned} \nabla_{\alpha}V^{\pi_{new}}|_{\alpha=0} &= \left.\sum_sd^{\pi_{new}}(s)\sum_a\nabla_{\alpha}\pi_{new}A^{\pi_{new}}(s,a)\right|_{\alpha=0} \\ &= \left.\sum_sd^{\pi_{new}}(s)\sum_a\left(\pi'-\pi\right)A^{\pi_{new}}(s,a)\right|_{\alpha=0}\\ &= \sum_sd^{\pi}(s)\sum_a\left(\pi'-\pi\right)A^{\pi}(s,a)\\ &= \sum_sd^{\pi}(s)\sum_a\pi'A^{\pi}(s,a)-\sum_sd^{\pi}(s)\sum_a \pi A^{\pi}(s,a)\\ &= \sum_sd^{\pi}(s)\sum_a\pi'A^{\pi}(s,a)\\ &= A_{\pi}(\pi') \end{aligned} \end{align} ∇αVπnew∣α=0=s∑dπnew(s)a∑∇απnewAπnew(s,a)∣ ∣α=0=s∑dπnew(s)a∑(π′−π)Aπnew(s,a)∣ ∣α=0=s∑dπ(s)a∑(π′−π)Aπ(s,a)=s∑dπ(s)a∑π′Aπ(s,a)−s∑dπ(s)a∑πAπ(s,a)=s∑dπ(s)a∑π′Aπ(s,a)=Aπ(π′)
同理, A π ( π n e w ) = α A π ( π ′ ) A_{\pi}(\pi_{new})=\alpha A_{\pi}(\pi') Aπ(πnew)=αAπ(π′)。将 V π n e w V^{\pi_{new}} Vπnew在 π n e w = π \pi_{new}=\pi πnew=π处Taylor展开,得到
V π n e w − V π = V π + α ∇ α V π n e w ∣ α = 0 − V π ≥ α A π ( π n e w ) − O ( α 2 ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} & =V^{\pi} + \alpha\nabla_{\alpha}V^{\pi_{new}}|_{\alpha=0}-V^{\pi} \\ &\geq\alpha A_{\pi}(\pi_{new}) -O(\alpha^2) \end{aligned} \end{align} Vπnew−Vπ=Vπ+α∇αVπnew∣α=0−Vπ≥αAπ(πnew)−O(α2)
这里只说明了值函数的增益大于一个下确界,但是并没有得到目标的形式。继续推导,根据Lemma 1可知
V π n e w − V π = ∑ s d π n e w ( s ) ∑ a π n e w ( s , a ) A π ( s , a ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} & = \sum_sd^{\pi_{new}}(s)\sum_{a}{\pi_{new}(s,a)A^{\pi}(s,a)} \end{aligned} \end{align} Vπnew−Vπ=s∑dπnew(s)a∑πnew(s,a)Aπ(s,a)
这里定义一个变量 η t \eta_t ηt,含义为截止到第 t t t个时刻时,策略 π n e w \pi_{new} πnew中没有采用 π \pi π的步数,则有:
P r τ ∼ π n e w ( S t = s ∣ η t = 0 ) = P r τ ∼ π ( S t = s ) = ( 1 − α ) t \begin{align} \begin{aligned} Pr_{\tau\sim\pi_{new}}(S_t=s|\eta_t=0)=Pr_{\tau\sim\pi}(S_t=s)=(1-\alpha)^{t} \end{aligned} \end{align} Prτ∼πnew(St=s∣ηt=0)=Prτ∼π(St=s)=(1−α)t
有的推导中写 t − 1 t-1 t−1次方,这都无所谓的,最后是无穷项等比数列相加,不差这一个数。进而有
p t ≡ P r ( η t > 0 ) = 1 − ( 1 − α ) t \begin{align} \begin{aligned} p_t\equiv Pr(\eta_t>0)=1-(1-\alpha)^{t} \end{aligned} \end{align} pt≡Pr(ηt>0)=1−(1−α)t
进而有 (开始了)
V π n e w − V π = ∑ s d π n e w ( s ) ∑ a π n e w ( s , a ) A π ( s , a ) = E τ ∼ π n e w [ ∑ t γ t ∑ a α π ′ ( s t , a ) A π ( s t , a ) ] = α ∑ t ( 1 − p t ) γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η = 0 ] + α ∑ t p t γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η > 0 ] = α ∑ t ( 1 − p t ) γ t E τ ∼ π [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ] + α ∑ t p t γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η > 0 ] = α A π ( π ′ ) − α ∑ t p t γ t E τ ∼ π [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ] + α ∑ t p t γ t E τ ∼ π n e w [ ∑ a α π ′ ( s t , a ) A π ( s t , a ) ∣ η > 0 ] ≥ α A π ( π ′ ) − 2 α ∑ t γ t [ 1 − ( 1 − α ) t ] [ max s ∑ a π ′ ( s , a ) A π ( s , a ) ] = α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ ( 1 − α ) \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} &= \sum_sd^{\pi_{new}}(s)\sum_{a}{\pi_{new}(s,a)A^{\pi}(s,a)}\\ &= \mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_t\gamma^{t}\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &= \alpha\sum_t(1-p_t)\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta=0\right]\\ &+ \alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta>0\right]\\ &= \alpha\sum_t(1-p_t)\gamma^t\mathbb{E}_{\tau\sim\pi}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &+ \alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta>0\right]\\ &=\alpha A_{\pi}(\pi')-\alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)\right]\\ &+ \alpha\sum_tp_t\gamma^t\mathbb{E}_{\tau\sim\pi_{new}}\left[\sum_a\alpha\pi'(s_t,a)A^{\pi}(s_t,a)|\eta>0\right]\\ &\geq \alpha A_{\pi}(\pi')-2\alpha\sum_t\gamma^t\left[1-(1-\alpha)^{t}\right]\left[\max_s\sum_a\pi'(s,a)A^{\pi}(s,a)\right]\\ &= \alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma(1-\alpha)} \end{aligned} \end{align} Vπnew−Vπ=s∑dπnew(s)a∑πnew(s,a)Aπ(s,a)=Eτ∼πnew[t∑γta∑απ′(st,a)Aπ(st,a)]=αt∑(1−pt)γtEτ∼πnew[a∑απ′(st,a)Aπ(st,a)∣η=0]+αt∑ptγtEτ∼πnew[a∑απ′(st,a)Aπ(st,a)∣η>0]=αt∑(1−pt)γtEτ∼π[a∑απ′(st,a)Aπ(st,a)]+αt∑ptγtEτ∼πnew[a∑απ′(st,a)Aπ(st,a)∣η>0]=αAπ(π′)−αt∑ptγtEτ∼π[a∑απ′(st,a)Aπ(st,a)]+αt∑ptγtEτ∼πnew[a∑απ′(st,a)Aπ(st,a)∣η>0]≥αAπ(π′)−2αt∑γt[1−(1−α)t][smaxa∑π′(s,a)Aπ(s,a)]=αAπ(π′)−1−γ(1−α)2α2γϵ
Proof completes.
在得到Lemma 2之后,很容易得出
V π n e w − V π ≥ α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi}\geq\alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma} \end{aligned} \end{align} Vπnew−Vπ≥αAπ(π′)−1−γ2α2γϵ
令 R R R是奖励的上界 (大于零),则有
{ A π ( π ′ ) = ∑ s d π ( s ) ∑ a π ′ ( s , a ) A π ( s , a ) ≤ R 1 − γ ϵ = 1 1 − γ [ max s ∑ a π ′ ( s , a ) A π ( s , a ) ] ≤ R 1 − γ \begin{align} \left\{ \begin{aligned} & A_{\pi}(\pi') = \sum_sd^{\pi}(s)\sum_a\pi'(s,a)A^{\pi}(s,a) \leq \frac{R}{1-\gamma} \\ & \epsilon = \frac{1}{1-\gamma}\left[ \max_s\sum_a\pi'(s,a)A^{\pi}(s,a) \right]\leq \frac{R}{1-\gamma} \end{aligned} \right. \end{align} ⎩ ⎨ ⎧Aπ(π′)=s∑dπ(s)a∑π′(s,a)Aπ(s,a)≤1−γRϵ=1−γ1[smaxa∑π′(s,a)Aπ(s,a)]≤1−γR
这时,若选取 α = A π ( π ′ ) ( 1 − γ ) 2 4 R \alpha=\frac{A_{\pi}(\pi')(1-\gamma)^2}{4R} α=4RAπ(π′)(1−γ)2,则有
V π n e w − V π ≥ α A π ( π ′ ) − 2 α 2 γ ϵ 1 − γ = A π 2 ( π ′ ) ( 1 − γ ) 2 4 R − 2 α 2 γ ϵ 1 − γ ≥ A π 2 ( π ′ ) ( 1 − γ ) 2 4 R − 2 α 2 γ R ( 1 − γ ) 2 = A π ( π ′ ) ( 1 − γ ) 2 8 R > 0 \begin{align} \begin{aligned} V^{\pi_{new}}- V^{\pi} & \geq \alpha A_{\pi}(\pi')-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}\\ &= \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R}-\frac{2\alpha^2\gamma\epsilon }{1-\gamma}\\ & \geq \frac{A_{\pi}^2(\pi')(1-\gamma)^2}{4R} - \frac{2\alpha^2\gamma R}{(1-\gamma)^2}\\ & = \frac{A_{\pi}(\pi')(1-\gamma)^2}{8R}>0 \end{aligned} \end{align} Vπnew−Vπ≥αAπ(π′)−1−γ2α2γϵ=4RAπ2(π′)(1−γ)2−1−γ2α2γϵ≥4RAπ2(π′)(1−γ)2−(1−γ)22α2γR=8RAπ(π′)(1−γ)2>0
如此,就保证了策略改进的单调提升,论文原文的后边还有很多推导,关于算法细节的,没太看懂,不往这放了。实际应用中,很多不会直接使用这个CPI技术,而是借用这个思路去设计。
OK,CPI理论推导到此结束了。