$$
\text{RL algorithms}
\begin{cases}
\text{Policy-based approach: Policy Gradient — learning an Actor/Policy } \pi \\[2ex]
\text{Value-based approach: Critic}
\begin{cases}
\text{State value function } V^\pi(s) \\[1ex]
\text{State–action value function } Q^\pi(s,a) \implies \text{Q-Learning}
\end{cases} \\[2ex]
\text{Actor + Critic}
\end{cases}
$$
OpenAI uses PPO as its default reinforcement learning algorithm.
The on-policy approach in the preceding section is actually a compromise—it learns action values not for the optimal policy, but for a near-optimal policy that still explores. In other words, an on-policy method keeps exploring other actions while following a near-optimal policy, and to preserve that exploration it necessarily gives up some chances to pick the optimal action.
A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. That is, two policies are maintained during learning: a strongly exploratory one that generates the actions, and another, derived from the value function, that converges to the optimal policy. This method is called off-policy.
The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data "off" the target policy, and the overall process is termed off-policy learning.
Using Importance Sampling, the on-policy Policy Gradient algorithm can be turned into an off-policy algorithm.
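The conversion relies on the standard importance-sampling identity (a short reminder, not part of the original derivation): an expectation under a distribution $p$ can be rewritten as an expectation under another distribution $q$ by re-weighting each sample with the ratio $p(x)/q(x)$:

$$
E_{x\sim p}\big[f(x)\big]=\int f(x)\,p(x)\,dx=\int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx=E_{x\sim q}\!\left[\frac{p(x)}{q(x)}f(x)\right]
$$

In the derivation below, $p$ corresponds to the state–action distribution under the current policy $\pi_\theta$ and $q$ to the one under the data-collecting policy $\pi_{\theta'}$, so the re-weighting ratio becomes $P_\theta(s_t,a_t)/P_{\theta'}(s_t,a_t)$.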
From the Policy Gradient derivation, the gradient after applying "Add a Baseline" and "Assign Suitable Credit" is:
$$
\begin{aligned}
\nabla_\theta\overline{R}_\theta
&\approx\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_i}\Big[A^{\theta}(s_t,a_t)\,\nabla_\theta\log p_\theta(a^i_t\mid s^i_t)\Big]\\
&=E_{(s_t,a_t)\sim\pi_\theta}\Big[A^{\theta}(s_t,a_t)\,\nabla_\theta\log P_\theta(a_t\mid s_t)\Big]\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{P_\theta(s_t,a_t)}{P_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\,\nabla_\theta\log P_\theta(a_t\mid s_t)\right]\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{P_\theta(a_t\mid s_t)\,\color{violet}{P_\theta(s_t)}}{P_{\theta'}(a_t\mid s_t)\,\color{violet}{P_{\theta'}(s_t)}}\color{black}{A^{\theta'}(s_t,a_t)\,\nabla_\theta\log P_\theta(a_t\mid s_t)}\right]\\
&\xlongequal{\text{assume }P_\theta(s_t)=P_{\theta'}(s_t)}E_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}A^{\theta'}(s_t,a_t)\,\nabla_\theta\log P_\theta(a_t\mid s_t)\right]\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}\left\{\left[\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}A^{\theta'}(s_t,a_t)\right]\nabla_\theta\log\left[\frac{P_\theta(a_t\mid s_t)}{\color{violet}{P_{\theta'}(a_t\mid s_t)}}\color{violet}{A^{\theta'}(s_t,a_t)}\right]\right\}
\end{aligned}
$$

(The violet factors do not depend on $\theta$.)
$$
\because\quad \nabla_x\log f(x)=\frac{1}{f(x)}\nabla_x f(x)
\qquad\therefore\quad \nabla_x f(x)=f(x)\,\nabla_x\log f(x)
$$
$\therefore$ the objective function, whose gradient is exactly the estimator above, is:
$$
\color{red}{J^{\theta'}(\theta)=\overline{R}^{\theta'}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}A^{\theta'}(s_t,a_t)\right]}
$$
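As a concrete illustration, here is a minimal PyTorch sketch of this surrogate objective written as a loss to minimise; the tensor names (`logp_new`, `logp_old`, `advantages`) are assumed placeholders for a batch sampled with $\pi_{\theta'}$, not code from the original post. Its gradient matches the importance-weighted estimator above, because differentiating the ratio reproduces the $\nabla_\theta\log P_\theta(a_t\mid s_t)$ factor.

```python
import torch

def off_policy_pg_loss(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       advantages: torch.Tensor) -> torch.Tensor:
    """Off-policy surrogate J(theta) = E[P_theta/P_theta' * A], negated as a loss.

    logp_new   : log P_theta(a_t|s_t), differentiable w.r.t. theta
    logp_old   : log P_theta'(a_t|s_t), from the policy that collected the data
    advantages : A^{theta'}(s_t, a_t), treated as constants
    """
    # Importance ratio P_theta(a|s) / P_theta'(a|s); detach the old terms so
    # gradients only flow through the current policy theta.
    ratio = torch.exp(logp_new - logp_old.detach())
    # Maximising E[ratio * A] is the same as minimising its negative mean.
    return -(ratio * advantages.detach()).mean()
```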
PPO (Proximal Policy Optimization) adds a KL-divergence penalty to this objective so that the learned policy $\theta$ does not drift too far from the data-collecting policy $\theta'$:
$$
\begin{aligned}
J^{\theta'}_{PPO}(\theta)&=J^{\theta'}(\theta)-\beta\,KL(\theta,\theta')\\
&=E_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}A^{\theta'}(s_t,a_t)\right]-\beta\,KL(\theta,\theta')
\end{aligned}
$$
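A minimal sketch of the KL-penalized version, assuming the same placeholder tensors and using the simple sample-based estimate $E_{\pi_{\theta'}}\big[\log P_{\theta'}(a_t\mid s_t)-\log P_\theta(a_t\mid s_t)\big]$ for the KL term; the exact KL estimator and the schedule for $\beta$ are implementation choices, not specified in this section.

```python
import torch

def ppo_kl_penalty_loss(logp_new, logp_old, advantages, beta: float = 1.0):
    """J_PPO(theta) = E[ratio * A] - beta * KL penalty, negated as a loss."""
    logp_old = logp_old.detach()
    ratio = torch.exp(logp_new - logp_old)
    surrogate = (ratio * advantages.detach()).mean()
    # Sample-based KL estimate: actions were drawn from the old policy theta',
    # so E[log P_theta' - log P_theta] estimates KL(pi_theta' || pi_theta).
    kl = (logp_old - logp_new).mean()
    return -(surrogate - beta * kl)
```

In the PPO paper, $\beta$ is adapted between updates: it is increased when the measured KL exceeds a target value and decreased when it falls well below it.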
TRPO, the predecessor of PPO, optimizes the same surrogate objective but imposes the KL divergence as a hard constraint rather than a penalty term:
$$
J^{\theta'}_{TRPO}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}\left[\frac{P_\theta(a_t\mid s_t)}{P_{\theta'}(a_t\mid s_t)}A^{\theta'}(s_t,a_t)\right],
\qquad KL(\pi_\theta,\pi_{\theta'})<\delta
$$
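TRPO solves this constrained problem with a natural-gradient style update (conjugate gradient plus a backtracking line search), which is too involved for a short sketch; the trust-region test itself, however, is just a KL check on the batch. The helper below is hypothetical and reuses the sample-based KL estimate from above.

```python
import torch

def within_trust_region(logp_old: torch.Tensor,
                        logp_new: torch.Tensor,
                        delta: float) -> bool:
    """Check the TRPO constraint KL(pi_theta', pi_theta) < delta on a batch."""
    kl = (logp_old - logp_new).mean()  # sample-based KL estimate
    return bool(kl.item() < delta)
```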
PPO2 drops the KL term altogether and instead clips the importance ratio directly:

$$
J^{\theta_k}_{PPO2}(\theta)\approx\sum_{(s_t,a_t)}\min\left\{\frac{P_\theta(a_t\mid s_t)}{P_{\theta_k}(a_t\mid s_t)}A^{\theta_k}(s_t,a_t),\ \operatorname{clip}\!\left[\frac{P_\theta(a_t\mid s_t)}{P_{\theta_k}(a_t\mid s_t)},\,1-\varepsilon,\,1+\varepsilon\right]A^{\theta_k}(s_t,a_t)\right\}
$$

where $\theta_k$ is the policy used to collect the current batch of data and $\varepsilon$ is a small clipping threshold such as 0.2.
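A minimal sketch of the clipped (PPO2) objective as a loss to minimise, with the same assumed placeholder tensors:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps: float = 0.2):
    """Clipped surrogate: -E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)]."""
    ratio = torch.exp(logp_new - logp_old.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps], which keeps theta close to theta_k.
    return -torch.min(unclipped, clipped).mean()
```

The clipping plays the same role as the KL penalty or the TRPO constraint: it limits how far a single round of updates can move the new policy away from the one that collected the data.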
In most cases PPO performs well: if it is not the best algorithm on a task, it is usually the second best.
References:
A source-code walkthrough of the PPO (Proximal Policy Optimization) deep reinforcement learning algorithm (深度增强学习PPO算法源码走读)
[RL Series] On-Policy vs. Off-Policy in Reinforcement Learning (【RL系列】强化学习之On-Policy与Off-Policy)
Sutton's Google Drive
Proximal Policy Optimization Algorithms