In standard policy-gradient training, the agent performs one full interaction with the environment to obtain a Monte Carlo sampled trajectory, and the policy network's parameters are updated once per trajectory; this is the on-policy approach. Its advantage is low estimation error, but its sample efficiency is poor, since each sample can be used for only one update. To improve sample efficiency, we can instead let an old network with parameters $\theta'$ collect samples and use them to update the current network's parameters $\theta$; this is the off-policy approach. But can samples collected by the old network be used directly to update the current network?
$$
\begin{aligned}
\mathbb{E}_{x\sim p(x)}\left[f(x)\right] &= \int f(x)\,p(x)\,dx \\
&= \int f(x)\frac{p(x)}{q(x)}\,q(x)\,dx \\
&= \mathbb{E}_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right]
\end{aligned}
$$
This shows that we can compute the expectation of $f(x)$ under $p(x)$ by sampling from $q(x)$ instead. Although the two expectations are identical, the variances differ.
$$
\mathrm{Var}_{x\sim p(x)}\left[f(x)\right] = \mathbb{E}_{x\sim p(x)}\left[f^2(x)\right] - \left[\mathbb{E}_{x\sim p(x)}\left[f(x)\right]\right]^2
$$
$$
\begin{aligned}
\mathrm{Var}_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right] &= \mathbb{E}_{x\sim q(x)}\left[\left(f(x)\frac{p(x)}{q(x)}\right)^2\right] - \left[\mathbb{E}_{x\sim q(x)}\left[f(x)\frac{p(x)}{q(x)}\right]\right]^2 \\
&= \int f^2(x)\frac{p^2(x)}{q^2(x)}\,q(x)\,dx - \left[\mathbb{E}_{x\sim p(x)}\left[f(x)\right]\right]^2 \\
&= \mathbb{E}_{x\sim p(x)}\left[f^2(x)\frac{p(x)}{q(x)}\right] - \left[\mathbb{E}_{x\sim p(x)}\left[f(x)\right]\right]^2
\end{aligned}
$$
We can see that the two variances differ only in the first term, by a factor of $\frac{p(x)}{q(x)}$, so the distributions $p(x)$ and $q(x)$ must not be too far apart. With this in mind, let us return to the policy-gradient objective $J(\theta)$.
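As a quick sanity check of the identity and of the variance blow-up, here is a minimal NumPy sketch (the Gaussian distributions, the test function `f`, and the sample size are arbitrary choices for illustration, not anything prescribed in the text): it estimates $\mathbb{E}_{x\sim p}[f(x)]$ with weighted samples from a $q$ that is close to $p$ and from a $q$ that is far from it.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) + x**2           # arbitrary test function

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Target p(x) = N(0, 1); ground truth estimated by sampling p directly.
x_p = rng.normal(0.0, 1.0, 100_000)
print("direct estimate:", f(x_p).mean())

for mu_q in (0.5, 3.0):                  # q close to p vs. q far from p
    x_q = rng.normal(mu_q, 1.0, 100_000)
    w = normal_pdf(x_q, 0.0, 1.0) / normal_pdf(x_q, mu_q, 1.0)   # p(x) / q(x)
    est = w * f(x_q)
    print(f"q = N({mu_q}, 1): IS estimate = {est.mean():.3f}, "
          f"IS variance = {est.var():.1f}")
```

The mean of the weighted estimate stays close to the direct one in both cases, but its variance grows sharply once $q$ drifts away from $p$, which is exactly the caveat above.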
Write a trajectory as:
$$
\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}
$$
Its probability under parameters $\theta$ and its return are:
$$
\begin{aligned}
\pi_{\theta}(\tau) &= \pi(s_1)\prod_{t=1}^{T} \pi_{\theta}(a_t\mid s_t)\,\pi(s_{t+1}\mid s_t,a_t) \\
R(\tau) &= \sum_{t=1}^{T} r_t
\end{aligned}
$$
Note that in $\pi_{\theta}(\tau)$ only the policy factors $\pi_{\theta}(a_t\mid s_t)$ depend on the parameters $\theta$; the initial-state distribution and the transition dynamics do not. Our goal is to maximize the expected return, and gradient ascent requires the gradient:
$$
\begin{aligned}
\theta^* &= \arg\max_{\theta} J(\theta) = \arg\max_{\theta} \sum_{\tau} R(\tau)\,\pi_{\theta}(\tau) \\
\nabla J(\theta) &= \sum_{\tau} R(\tau)\,\nabla\pi_{\theta}(\tau) \\
&= \sum_{\tau} \left[R(\tau)\,\pi_{\theta}(\tau)\,\frac{\nabla \pi_{\theta}(\tau)}{\pi_{\theta}(\tau)}\right] \\
&= \sum_{\tau} \left[R(\tau)\,\pi_{\theta}(\tau)\,\nabla \log \pi_{\theta}(\tau)\right] \\
&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[R(\tau)\,\nabla \log \pi_{\theta}(\tau)\right]
\end{aligned}
$$
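The last line is the familiar REINFORCE estimator: sample trajectories from $\pi_{\theta}$ and weight $\nabla\log\pi_{\theta}(\tau)$ by the return. Below is a minimal PyTorch sketch of one on-policy update; the tiny network, the state/action dimensions, and the dummy trajectory data are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

# Hypothetical tiny policy network: state (dim 4) -> logits over 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """One on-policy update: loss = -mean[ R * log pi_theta(a_t | s_t) ]."""
    logits = policy(states)                                   # (T, n_actions)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(returns * log_probs).mean()                      # ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy data standing in for one sampled trajectory.
T = 5
states = torch.randn(T, 4)
actions = torch.randint(0, 2, (T,))
returns = torch.ones(T)        # R(tau) broadcast to every step for simplicity
reinforce_update(states, actions, returns)
```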
Now suppose we use a policy network $\pi_{\theta'}$ with different parameters to sample data and then update the parameters of the policy network $\pi_{\theta}$. Using Monte Carlo approximation for the sampling and applying the importance-sampling identity above, we get:
$$
\begin{aligned}
\nabla \log \pi_{\theta}(\tau) &= \nabla \log \left[\pi(s_1)\prod_{t=1}^{T} \pi_{\theta}(a_t\mid s_t)\,\pi(s_{t+1}\mid s_t,a_t)\right] \\
&= \nabla \left[\log \pi(s_1) + \sum_{t=1}^{T}\log \pi_{\theta}(a_t\mid s_t) + \sum_{t=1}^{T}\log \pi(s_{t+1}\mid s_t,a_t)\right] \\
&= \sum_{t=1}^{T}\nabla\log \pi_{\theta}(a_t\mid s_t) \\
\nabla J_{\theta'}(\theta) &= \mathbb{E}_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(s_t,a_t)}{\pi_{\theta'}(s_t,a_t)}\,R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right] \\
&= \mathbb{E}_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)\,\pi_{\theta}(s_t)}{\pi_{\theta'}(a_t\mid s_t)\,\pi_{\theta'}(s_t)}\,R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right] \\
&\approx \mathbb{E}_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,R^{\theta'}(s_t,a_t)\,\nabla \log \pi_{\theta}(a_t\mid s_t)\right] \\
&= \mathbb{E}_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\nabla \pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,R^{\theta'}(s_t,a_t)\right]
\end{aligned}
$$
where $R^{\theta'}(s_t,a_t)=\sum_{i=t}^{T}\gamma^{i-t} r_i$ is the discounted return along the trajectory collected by $\pi_{\theta'}$. The objective function then becomes:
$$
J(\theta) = \mathbb{E}_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,R^{\theta'}(s_t,a_t)\right]
$$
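In code, this surrogate is simply the probability ratio times the return, with the old policy's log-probabilities treated as constants. A hedged sketch, reusing the hypothetical `policy` network and the dummy batch from the previous example, with a frozen copy `policy_old` standing in for $\pi_{\theta'}$:

```python
import copy
import torch

policy_old = copy.deepcopy(policy)       # frozen behavior policy pi_theta'

def surrogate_objective(states, actions, returns):
    """J(theta) = E_{pi_theta'}[ pi_theta(a|s) / pi_theta'(a|s) * R ]."""
    logp_new = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    with torch.no_grad():                # the old policy is never updated
        logp_old = torch.distributions.Categorical(logits=policy_old(states)).log_prob(actions)
    ratio = torch.exp(logp_new - logp_old)   # pi_theta / pi_theta'
    return (ratio * returns).mean()
```

Differentiating `(ratio * returns).mean()` with respect to the new policy's parameters gives exactly the $\frac{\nabla\pi_{\theta}}{\pi_{\theta'}}R^{\theta'}$ term from the derivation above.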
Recall that importance sampling is only usable when the two distributions do not differ too much; the TRPO algorithm therefore adds a constraint that keeps $\pi_{\theta}$ close to $\pi_{\theta'}$:
$$
\begin{aligned}
&J(\theta) = \mathbb{E}_{(s_t,a_t)\sim \pi_{\theta'}}\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}\,R^{\theta'}(s_t,a_t)\right] \\
&\text{s.t.}\quad \mathrm{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right) < \delta
\end{aligned}
$$
In other words, it defines a trust region so that the parameter update never moves outside that region. The corresponding PPO algorithm instead adds a regularization-like penalty term:
$$
J_{\mathrm{PPO}}(\theta) = J(\theta) - \beta\,\mathrm{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta'}\right)
$$
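The KL-penalized objective is then the surrogate minus $\beta$ times the KL divergence between the two action distributions, estimated on the same batch. Another sketch under the same assumptions as above (a fixed $\beta$ is used here purely for illustration):

```python
def ppo_kl_objective(states, actions, returns, beta=0.01):
    """J_ppo(theta) = J(theta) - beta * KL(pi_theta || pi_theta')."""
    dist_new = torch.distributions.Categorical(logits=policy(states))
    with torch.no_grad():
        dist_old = torch.distributions.Categorical(logits=policy_old(states))
    ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
    kl = torch.distributions.kl_divergence(dist_new, dist_old).mean()
    return (ratio * returns).mean() - beta * kl

# Gradient ascent on J_ppo = gradient descent on its negative.
loss = -ppo_kl_objective(states, actions, returns)
```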
Vanilla policy gradient converges quickly but is unstable and sensitive to hyperparameters, whereas TRPO converges stably and is much less sensitive to hyperparameters. Suppose the goal is to maximize $J(\theta)$. We could solve this with gradient ascent, but in some cases the gradient cannot be obtained directly, for example when the objective is
$$
J(\theta) = \mathbb{E}_{S}\left[V(S;\theta)\right]
$$
Evaluating this expectation requires an integral, from which the gradient could then be computed, but that integral does not necessarily have a closed-form solution, so stochastic gradient ascent is generally used instead. Stochastic gradient ascent draws one sample $s$ from $S$, computes the gradient $g=\frac{\partial V(s;\theta)}{\partial \theta}$, and uses it to update the parameters.
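A toy NumPy sketch of that update rule follows; the quadratic $V(s;\theta) = -(\theta - s)^2$ and the Gaussian state distribution are made-up placeholders chosen so that the maximizer of $J(\theta)$ is simply $\mathbb{E}[S]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy objective: V(s; theta) = -(theta - s)^2,
# so J(theta) = E_S[V(S; theta)] is maximized at theta = E[S].
def grad_V(s, theta):
    return -2.0 * (theta - s)

theta, lr = 0.0, 0.05
for _ in range(2000):
    s = rng.normal(3.0, 1.0)           # draw one sample s ~ S
    theta += lr * grad_V(s, theta)     # stochastic gradient ascent step

print(theta)                           # converges near E[S] = 3.0
```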
Define
$$
\mathcal{N}(\theta) = \left\{\theta' \mid \left\|\theta' - \theta\right\| \le \delta\right\}
$$
as a neighborhood of the parameters $\theta$. We then construct a known, differentiable surrogate function $L(\theta'\mid\theta)$ that is sufficiently close to $J(\theta')$ inside this neighborhood, so that maximizing $J(\theta')$ within the trust region becomes the problem of maximizing $L(\theta'\mid\theta)$. Trust-region optimization usually consists of two key steps: approximation and maximization.
The approximation step can use Monte Carlo approximation, Taylor expansion, or similar techniques; the maximization step is a constrained optimization problem and is relatively complicated to solve. By repeatedly alternating these two steps (the constrained maximization in the second step itself usually requires several iterations), the trust-region algorithm reaches a local optimum of $J(\theta)$.
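The generic loop can be written schematically as below; `build_surrogate` and `constrained_maximize` are placeholder callables standing in for whatever approximation method and constrained solver one chooses, so this is a structural sketch rather than a concrete algorithm.

```python
def trust_region_optimize(J, theta, build_surrogate, constrained_maximize, n_iters=100):
    """Generic trust-region loop: approximate J near theta, maximize the
    surrogate inside the trust region N(theta), then repeat."""
    for _ in range(n_iters):
        # Step 1 (approximation): L(. | theta), a local approximation of J
        L = build_surrogate(J, theta)
        # Step 2 (maximization): argmax of L subject to staying inside N(theta);
        # this inner solve itself usually takes several iterations.
        theta = constrained_maximize(L, theta)
    return theta
```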
In reinforcement learning, we have
$$
\begin{aligned}
V_{\pi}(s) &= \mathbb{E}_{A\sim \pi(\cdot\mid s;\theta)}\left[Q_{\pi}(s,A)\right] \\
&= \sum_a \pi(a\mid s;\theta)\,Q_{\pi}(s,a) \\
&= \sum_a \pi(a\mid s;\theta_{old})\,\frac{\pi(a\mid s;\theta)}{\pi(a\mid s;\theta_{old})}\,Q_{\pi}(s,a) \\
&= \mathbb{E}_{A\sim \pi(\cdot\mid s;\theta_{old})}\left[\frac{\pi(a\mid s;\theta)}{\pi(a\mid s;\theta_{old})}\,Q_{\pi}(s,a)\right] \\
J(\theta) &= \mathbb{E}_{s}\left[V_{\pi}(s)\right] \\
&= \mathbb{E}_{s}\left[\mathbb{E}_{A\sim \pi(\cdot\mid s;\theta_{old})}\left[\frac{\pi(a\mid s;\theta)}{\pi(a\mid s;\theta_{old})}\,Q_{\pi}(s,a)\right]\right]
\end{aligned}
$$
We now apply a Monte Carlo approximation to the objective $J(\theta)$. Suppose sampling yields a trajectory
$$
\tau = \{s_1, a_1, s_2, a_2, \dots, s_T, a_T\}
$$
which gives the surrogate function
$$
L(\theta\mid\theta_{old}) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\pi(a_i\mid s_i;\theta)}{\pi(a_i\mid s_i;\theta_{old})}\,Q_{\pi}(s_i,a_i)\right)
$$
Since $Q_{\pi}(s_i,a_i)$ cannot be computed directly, we use the discounted return as an approximation of the action-value function:
$$
\begin{aligned}
R_i &= \sum_{k=i}^{T}\gamma^{k-i} r_k \\
L(\theta\mid\theta_{old}) &\approx \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\pi(a_i\mid s_i;\theta)}{\pi(a_i\mid s_i;\theta_{old})}\,R_i\right)
\end{aligned}
$$
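Computing the discounted return for every step of a sampled trajectory is a simple backward pass over the reward sequence; a small self-contained sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """R_i = sum_{k=i}^{T} gamma^(k-i) * r_k, computed right-to-left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))   # [1.75, 1.5, 1.0]
```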
We then perform gradient ascent to solve the constrained maximization:
$$
\begin{aligned}
\theta_{new} &= \arg\max_{\theta}\, L(\theta\mid\theta_{old}) \\
\text{s.t.}\quad &\theta \in \mathcal{N}(\theta_{old})
\end{aligned}
$$
where the trust region $\mathcal{N}(\theta_{old})$ can be defined by either
$$
\mathrm{KL}\left(\pi_{\theta}\,\|\,\pi_{\theta_{old}}\right) \le \delta
\quad\text{or}\quad
\left\|\theta - \theta_{old}\right\| \le \delta
$$
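In code, the two choices amount to two different acceptance tests on a candidate $\theta$. A minimal PyTorch sketch (the parameter lists, the action distributions, and the threshold $\delta$ are assumed inputs, not part of any particular library API):

```python
import torch

def inside_l2_region(params_new, params_old, delta):
    """||theta - theta_old|| <= delta, over flattened parameter tensors."""
    diff = torch.cat([(p - q).flatten() for p, q in zip(params_new, params_old)])
    return diff.norm() <= delta

def inside_kl_region(dist_new, dist_old, delta):
    """Mean KL(pi_theta || pi_theta_old) over a batch of states <= delta."""
    return torch.distributions.kl_divergence(dist_new, dist_old).mean() <= delta
```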