Understanding the TRPO Reinforcement Learning Algorithm

Trust Region Policy Optimization

  • Perspective 1: off-policy
    • Importance Sampling
    • Gradient optimization
  • Perspective 2: numerical optimization
    • Trust region optimization
    • Monte Carlo approximation

TRPO stands for Trust Region Policy Optimization.

Perspective 1: off-policy

In standard policy-gradient training, the agent completes one full interaction with the environment to obtain one Monte Carlo trajectory, and the policy network's parameters are updated once per trajectory. This is the on-policy setting: its estimates have relatively small error, but sample efficiency is poor because each sample is used only once. To improve sample efficiency, we can instead let an old network with parameters $\theta'$ collect samples and use them to update the current network's parameters $\theta$, which is the off-policy setting. But can samples collected by the old network be used directly to update the current network's parameters?

Importance Sampling

$$\begin{split} \mathbb{E}_{x\sim p(x)}\left[f(x)\right] &= \int f(x)p(x)\,dx \\ &= \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx \\ &= \mathbb{E}_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right] \end{split}$$
This shows that the expectation of $f(x)$ under $p(x)$ can be estimated by sampling from $q(x)$. Although the expectations are identical, the variances differ slightly. The variance under $p(x)$ is
$$\mathrm{Var}_{x\sim p(x)}\left[f(x)\right] = \mathbb{E}_{x\sim p(x)}\left[f^2(x)\right]-\left[\mathbb{E}_{x\sim p(x)}\left[f(x)\right]\right]^2$$
while the variance of the importance-weighted estimator under $q(x)$ is
$$\begin{split} \mathrm{Var}_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right] &= \mathbb{E}_{x\sim q(x)}\left[\left(f(x)\frac{p(x)}{q(x)}\right)^2\right]-\left[\mathbb{E}_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right]\right]^2 \\ &=\int f^2(x)\frac{p^2(x)}{q^2(x)}\,q(x)\,dx-\left[\mathbb{E}_{x\sim p(x)}\left[f(x)\right]\right]^2 \\ &= \mathbb{E}_{x\sim p(x)}\left[f^2(x)\frac{p(x)}{q(x)}\right]-\left[\mathbb{E}_{x\sim p(x)}\left[f(x)\right]\right]^2 \end{split}$$
Comparing the two variances, they differ only in the first term, by a factor of $\frac{p(x)}{q(x)}$. Therefore $p(x)$ and $q(x)$ must not differ too much. With this in mind, let us revisit the policy-gradient objective $J(\theta)$.
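As a quick numerical illustration of the two formulas above, here is a minimal NumPy sketch (the Gaussian choices for $p(x)$ and $q(x)$ and the function $f$ are hypothetical, not from the text). It compares the direct estimate of $\mathbb{E}_{x\sim p}[f(x)]$ with the importance-sampled estimate, and shows how the variance of the latter grows as $q$ drifts away from $p$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                      # test function f(x)

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2); p(x) = N(0, 1), q(x) = N(mu_q, 1) below."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

n = 100_000
x_p = rng.normal(0.0, 1.0, n)             # samples from p
direct = f(x_p)                           # direct estimate of E_{x~p}[f(x)]

for mu_q in [0.5, 2.0]:                   # q close to p, then far from p
    x_q = rng.normal(mu_q, 1.0, n)        # samples from q
    w = gaussian_pdf(x_q, 0.0) / gaussian_pdf(x_q, mu_q)   # importance weights p/q
    is_est = f(x_q) * w                   # importance-sampled estimate
    print(f"mu_q={mu_q}: direct mean={direct.mean():.3f} var={direct.var():.3f} | "
          f"IS mean={is_est.mean():.3f} var={is_est.var():.3f}")
```

Both estimators agree on the mean, but the importance-sampled variance blows up once $q$ is far from $p$, which is exactly why TRPO keeps the two policies close.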

Denote a trajectory as:
$$\tau = \{s_1,a_1,s_2,a_2,\dots,s_T,a_T\}$$
Its probability under parameters $\theta$ and its return are:
$$\begin{split} \pi_{\theta}(\tau) &= \pi(s_1)\prod_{t=1}^{T} \pi_{\theta}(a_t\mid s_t)\,\pi(s_{t+1}\mid s_t,a_t)\\ R(\tau) &= \sum_{t=1}^{T}r_t \end{split}$$
Note that in $\pi_{\theta}(\tau)$ only the policy factors $\pi_{\theta}(a_t\mid s_t)$ depend on $\theta$; the initial-state distribution $\pi(s_1)$ and the transition dynamics $\pi(s_{t+1}\mid s_t,a_t)$ do not. Our goal is to maximize the expected return, and gradient ascent requires its gradient:
$$\begin{split} \theta^* &=\underset{\theta}{\arg\max}\, J(\theta)\\ &=\underset{\theta}{\arg\max} \sum_{\tau} R(\tau)\,\pi_{\theta}(\tau)\\ \nabla J(\theta)&= \sum_{\tau} R(\tau)\,\nabla\pi_{\theta}(\tau)\\ &=\sum_{\tau} R(\tau)\,\pi_{\theta}(\tau)\frac{\nabla \pi_{\theta}(\tau)}{\pi_{\theta}(\tau)}\\ &=\sum_{\tau} R(\tau)\,\pi_{\theta}(\tau)\,\nabla \log \pi_{\theta}(\tau)\\ &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[R(\tau)\,\nabla \log \pi_{\theta}(\tau)\right] \end{split}$$
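The last line is exactly the REINFORCE estimator: sample trajectories from $\pi_{\theta}$ and average $R(\tau)\nabla\log\pi_{\theta}(\tau)$. Below is a minimal PyTorch sketch of how this gradient is formed in practice; the tiny linear `policy` and the random "environment" are hypothetical placeholders:

```python
import torch

torch.manual_seed(0)
n_states, n_actions, T = 4, 2, 10
policy = torch.nn.Linear(n_states, n_actions)    # toy policy network (hypothetical)

def sample_trajectory():
    """Roll out one episode in a made-up random environment."""
    log_probs, rewards = [], []
    state = torch.randn(n_states)
    for _ in range(T):
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))   # log pi_theta(a_t | s_t)
        rewards.append(torch.randn(()))           # fake reward from the environment
        state = torch.randn(n_states)             # fake next state
    return torch.stack(log_probs), torch.stack(rewards)

log_probs, rewards = sample_trajectory()
R_tau = rewards.sum()                             # R(tau) = sum_t r_t
loss = -(R_tau.detach() * log_probs.sum())        # -R(tau) * log pi_theta(tau)
loss.backward()                                   # grad = R(tau) * grad log pi_theta(tau)
print(policy.weight.grad.shape)
```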

Gradient optimization

Suppose we use a policy network $\pi_{\theta'}$ with different parameters to sample data, and use that data to update the parameters of the policy network $\pi_{\theta}$. Applying the importance sampling formula together with a Monte Carlo approximation gives:
$$\begin{split} \nabla \log \pi_{\theta}(\tau)&=\nabla \log \left[ \pi(s_1)\prod_{t=1}^{T} \pi_{\theta}(a_t\mid s_t)\,\pi(s_{t+1}\mid s_t,a_t) \right]\\ &=\nabla \left[\log \pi(s_1)+\sum_{t=1}^T\log \pi_{\theta}(a_t\mid s_t)+\sum_{t=1}^T \log\pi(s_{t+1}\mid s_t,a_t) \right]\\ &=\sum_{t=1}^T\nabla\log \pi_{\theta}(a_t\mid s_t) \\ \nabla J_{\theta'}(\theta)&= \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(s_t,a_t)}{\pi_{\theta'}(s_t,a_t)}R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right]\\ &=\mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)\,\pi_{\theta}(s_t)}{\pi_{\theta'}(a_t\mid s_t)\,\pi_{\theta'}(s_t)}R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right]\\ &\approx \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right]\\ &=\mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\nabla \pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\right] \end{split}$$
where $R^{\theta'}(s_t,a_t)=\sum_{i=t}^{T} \gamma^{i-t} r_i$ is the discounted return-to-go computed along the trajectory collected by $\pi_{\theta'}$. The $\approx$ step drops the ratio of state probabilities $\pi_{\theta}(s_t)/\pi_{\theta'}(s_t)$, assuming the two policies visit states with similar probabilities. The surrogate objective then becomes:
$$J(\theta)= \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\right]$$
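A minimal PyTorch sketch of this surrogate objective; the `policy` module and the tensors standing in for rollouts collected by $\pi_{\theta'}$ are hypothetical placeholders:

```python
import torch

def surrogate_objective(policy, states, actions, returns, old_log_probs):
    """J(theta) ≈ mean over samples of [pi_theta(a|s) / pi_theta'(a|s)] * R^{theta'}(s, a)."""
    dist = torch.distributions.Categorical(logits=policy(states))
    log_probs = dist.log_prob(actions)                 # log pi_theta(a_t | s_t)
    ratio = torch.exp(log_probs - old_log_probs)       # pi_theta / pi_theta'
    return (ratio * returns).mean()

# toy usage with random data standing in for rollouts collected by pi_theta'
policy = torch.nn.Linear(4, 2)
states = torch.randn(32, 4)
actions = torch.randint(0, 2, (32,))
returns = torch.randn(32)
with torch.no_grad():
    old_log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -surrogate_objective(policy, states, actions, returns, old_log_probs)
loss.backward()
```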
Recall that importance sampling requires the two distributions not to differ too much, so TRPO adds a constraint that keeps $\pi_{\theta}$ close to $\pi_{\theta'}$:
$$\begin{split} &J(\theta)= \mathbb{E}_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}R^{\theta'}(s_t,a_t)\right]\\ &\text{s.t.}\quad \mathrm{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right)< \delta \end{split}$$
In other words, we define a trust region so that the parameter update never steps outside this range. The related PPO algorithm instead adds a regularization-like penalty term:
$$J_{\mathrm{PPO}}(\theta)= J(\theta)-\beta\,\mathrm{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right)$$
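A hedged sketch of this PPO-penalty variant for discrete actions, assuming the old policy's logits were stored alongside the rollout data (all names here are illustrative):

```python
import torch

def ppo_penalty_objective(policy, states, actions, returns,
                          old_log_probs, old_logits, beta=0.01):
    """J_ppo(theta) = J(theta) - beta * KL(pi_theta || pi_theta')."""
    dist_new = torch.distributions.Categorical(logits=policy(states))
    dist_old = torch.distributions.Categorical(logits=old_logits)
    ratio = torch.exp(dist_new.log_prob(actions) - old_log_probs)   # pi_theta / pi_theta'
    surrogate = (ratio * returns).mean()                            # J(theta)
    kl = torch.distributions.kl_divergence(dist_new, dist_old).mean()
    return surrogate - beta * kl
```

In the adaptive-KL variant of PPO, the coefficient $\beta$ is adjusted after each update depending on whether the measured KL divergence overshoots or undershoots a target value.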

Perspective 2: numerical optimization

Vanilla policy gradient converges quickly but is unstable and sensitive to hyperparameters, whereas TRPO converges stably and is far less sensitive to hyperparameters. Suppose our goal is to maximize $J(\theta)$. Gradient ascent would work, but in some cases the exact gradient is unavailable, for example when the objective is
$$J(\theta)=\mathbb{E}_S\left[V(S;\theta)\right]$$
Computing this expectation, and therefore the exact gradient, requires integrating over $S$, and that integral generally has no closed-form solution. In practice we therefore use stochastic gradient ascent: sample a state $s$ from $S$, compute the gradient $g=\frac{\partial V(s;\theta)}{\partial \theta}$, and use it to update the parameters.
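A minimal sketch of this procedure on a toy objective $V(s;\theta)=-\lVert\theta-s\rVert^2$ (chosen purely for illustration), where stochastic gradient ascent drives $\theta$ toward $\mathbb{E}[S]$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)
lr = 0.1

def grad_V(s, theta):
    """Gradient of the toy V(s; theta) = -||theta - s||^2 (hypothetical)."""
    return -2.0 * (theta - s)

for step in range(1000):
    s = rng.normal(1.0, 0.5, size=2)   # sample one state s from S
    g = grad_V(s, theta)               # stochastic gradient dV(s;theta)/dtheta
    theta += lr * g                    # gradient *ascent* step
print(theta)                           # converges toward E[S] = [1, 1]
```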

Trust region optimization

Define
$$\mathcal{N}(\theta) = \left\{ \theta' \mid \left\| \theta' - \theta \right\| \le \delta \right\}$$
as a neighborhood of the parameter $\theta$. We then construct a known, differentiable surrogate function $L(\theta'\mid\theta)$ that stays sufficiently close to $J(\theta')$ within this neighborhood, so that maximizing $J(\theta')$ over the trust region reduces to maximizing $L(\theta'\mid\theta)$. Trust region optimization usually consists of two key steps:

  • Approximation within the trust region: $L(\theta\mid\theta_{old})\approx J(\theta)$
  • Maximization of the surrogate within the trust region: $\theta_{new}=\underset{\theta\in \mathcal{N}(\theta_{old})}{\arg\max}\, L(\theta\mid\theta_{old})$

The first step can use Monte Carlo approximation, a Taylor expansion, or other approximation methods; the second step is a constrained optimization problem, which is relatively complicated to solve. By repeatedly alternating these two steps (the constrained maximization in the second step itself typically requires several iterations), the trust region algorithm reaches a local optimum of $J(\theta)$.
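A schematic sketch of this two-step loop under the simple Euclidean trust region $\lVert\theta-\theta_{old}\rVert\le\delta$, using projected gradient ascent for the inner constrained maximization (the surrogate and its gradient are illustrative placeholders):

```python
import numpy as np

def trust_region_maximize(grad_L, theta0, delta=0.5, outer_steps=50,
                          inner_steps=20, lr=0.1):
    """Repeat the two steps: re-center the surrogate L(.|theta_old) at theta_old,
    then maximize it inside the ball ||theta - theta_old|| <= delta
    via projected gradient ascent."""
    theta_old = np.asarray(theta0, dtype=float)
    for _ in range(outer_steps):
        theta = theta_old.copy()
        for _ in range(inner_steps):                  # step 2: constrained maximization
            theta = theta + lr * grad_L(theta, theta_old)
            step = theta - theta_old
            norm = np.linalg.norm(step)
            if norm > delta:                          # project back into the trust region
                theta = theta_old + delta * step / norm
        theta_old = theta                             # accept the step, re-center the region
    return theta_old

# toy example: J(theta) = -||theta - target||^2, and L(.|theta_old) is taken to be J itself
target = np.array([3.0, -2.0])
grad_L = lambda theta, theta_old: -2.0 * (theta - target)
print(trust_region_maximize(grad_L, np.zeros(2)))     # moves toward target in delta-sized hops
```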

In reinforcement learning, we have
$$\begin{split} V_{\pi}(s)&=\mathbb{E}_{A\sim \pi(\cdot\mid s;\theta)}\left[Q_{\pi}(s,A)\right]\\ &=\sum_a \pi(a\mid s;\theta)\,Q_{\pi}(s,a)\\ &=\sum_a \pi(a\mid s;\theta_{old})\,\frac{\pi(a\mid s;\theta)}{\pi(a\mid s;\theta_{old})}\,Q_{\pi}(s,a) \\ &=\mathbb{E}_{A\sim \pi(\cdot\mid s;\theta_{old})}\left[\frac{\pi(A\mid s;\theta)}{\pi(A\mid s;\theta_{old})}Q_{\pi}(s,A)\right]\\ J(\theta)&=\mathbb{E}_{S}\left[V_{\pi}(S)\right]\\ &=\mathbb{E}_{S}\left[\mathbb{E}_{A\sim \pi(\cdot\mid S;\theta_{old})}\left[\frac{\pi(A\mid S;\theta)}{\pi(A\mid S;\theta_{old})}Q_{\pi}(S,A)\right]\right] \end{split}$$

Monte Carlo approximation

We approximate the objective $J(\theta)$ by Monte Carlo sampling. Suppose we collect one trajectory
$$\tau = \{s_1,a_1,s_2,a_2,\dots,s_T,a_T\}$$
Then the surrogate function is
$$L(\theta\mid\theta_{old})=\frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i\mid s_i;\theta)}{\pi(a_i\mid s_i;\theta_{old})}\,Q_{\pi}(s_i,a_i)$$
Since $Q_{\pi}(s_i,a_i)$ cannot be computed directly, we use the discounted return-to-go as an approximation of the action-value function:
$$\begin{split} R_i&=\sum_{k=i}^{T} \gamma^{k-i}\, r_k\\ L(\theta\mid\theta_{old})&\approx \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i\mid s_i;\theta)}{\pi(a_i\mid s_i;\theta_{old})}\,R_i \end{split}$$
We then maximize the surrogate within the trust region (e.g., by gradient ascent):
$$\begin{split} &\theta_{new} =\underset{\theta}{\arg\max}\, L(\theta\mid\theta_{old})\\ &\text{s.t.}\quad \theta\in \mathcal{N}(\theta_{old}) \end{split}$$
where the trust region $\mathcal{N}(\theta_{old})$ can be defined by either
$$\mathrm{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta_{old}}\right)\le \delta \quad\text{or}\quad \left\| \theta- \theta_{old} \right\| \le \delta$$
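Putting the pieces together, here is a minimal sketch that computes the discounted returns-to-go $R_i$ from one sampled trajectory and evaluates the surrogate $L(\theta\mid\theta_{old})$; the policy networks and the rollout data are hypothetical placeholders, and the trust-region constraint itself would be enforced as in the earlier sketch:

```python
import torch

def returns_to_go(rewards, gamma=0.99):
    """R_i = sum_{k >= i} gamma^(k - i) * r_k, computed backwards in O(T)."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return torch.tensor(list(reversed(out)))

def surrogate_L(policy, old_policy, states, actions, rewards, gamma=0.99):
    """L(theta | theta_old) ≈ (1/N) * sum_i ratio_i * R_i."""
    R = returns_to_go(rewards, gamma)
    logp_new = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    with torch.no_grad():
        logp_old = torch.distributions.Categorical(logits=old_policy(states)).log_prob(actions)
    ratio = torch.exp(logp_new - logp_old)        # pi(a|s;theta) / pi(a|s;theta_old)
    return (ratio * R).mean()

# toy usage with random data standing in for one sampled trajectory
T = 10
policy, old_policy = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
states = torch.randn(T, 4)
actions = torch.randint(0, 2, (T,))
rewards = [float(r) for r in torch.randn(T)]
print(surrogate_L(policy, old_policy, states, actions, rewards))
```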
