Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form:

$$\hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$

where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ is an estimator of the advantage function at timestep $t$. The corresponding loss function is:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$
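As a concrete illustration (my own sketch, not code from the paper), a minimal PyTorch-style version of $L^{PG}$, assuming `log_probs` holds $\log \pi_\theta(a_t \mid s_t)$ for the sampled actions and `advantages` holds the estimates $\hat{A}_t$:

```python
import torch

def pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negative of L^PG, so it can be minimized with a standard optimizer.

    log_probs:  log pi_theta(a_t | s_t) for the actions actually taken, shape (T,)
    advantages: advantage estimates A_hat_t, shape (T,), treated as constants
    """
    # The gradient of this loss is the negated policy gradient estimator above.
    return -(log_probs * advantages.detach()).mean()
```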
The objective that TRPO optimizes is:

$$\underset{\theta}{\text{maximize}} \quad \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right]$$

$$\text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big]\right] \leq U$$
Let $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so that $r_t(\theta_{old}) = 1$. TRPO maximizes the surrogate objective:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t\right]$$

The superscript $CPI$ refers to conservative policy iteration, as used in TRPO. Without a constraint, maximizing $L^{CPI}$ would lead to excessively large policy updates.
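As a quick illustration (my own, not from the paper), the probability ratio $r_t(\theta)$ and the unclipped surrogate $L^{CPI}$ might be computed from stored log-probabilities like this:

```python
import torch

def cpi_surrogate(new_log_probs, old_log_probs, advantages):
    """Unclipped surrogate L^CPI = E_t[r_t(theta) * A_hat_t]."""
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages.detach()).mean()
```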
To penalize policy changes that move $r_t(\theta)$ away from 1 (i.e., to keep the KL divergence between the new and old policies small), PPO uses the following clipped objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$$

The original paper uses $\epsilon = 0.2$. Intuitively:

- When $\hat{A}_t > 0$: if $r_t(\theta) > 1+\epsilon$, then $L^{CLIP}(\theta) = (1+\epsilon)\hat{A}_t$; if $r_t(\theta) < 1+\epsilon$, then $L^{CLIP}(\theta) = r_t(\theta)\hat{A}_t$.
- When $\hat{A}_t < 0$: if $r_t(\theta) > 1-\epsilon$, then $L^{CLIP}(\theta) = r_t(\theta)\hat{A}_t$; if $r_t(\theta) < 1-\epsilon$, then $L^{CLIP}(\theta) = (1-\epsilon)\hat{A}_t$.
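For concreteness, here is a minimal PyTorch-style sketch of $L^{CLIP}$ (my own illustration; `new_log_probs`, `old_log_probs`, and `advantages` are assumed to come from the rollout buffer, with the advantages already detached):

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """L^CLIP = E_t[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise min is a pessimistic bound on the unclipped objective,
    # which removes the incentive for overly large policy updates.
    return torch.min(unclipped, clipped).mean()
```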
As an alternative to (or in addition to) the clipped surrogate objective, the paper also describes a variant that constrains the KL divergence through an adaptive penalty coefficient $\beta$. In the simplest instantiation of this algorithm, each policy update performs the following steps:

- Using several epochs of minibatch SGD, optimize the KL-penalized objective
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t - \beta\,\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\big]\right]$$
- Compute $d = \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\big[\pi_{\theta_{old}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\big]\right]$
  - If $d < d_{targ}/1.5$, then $\beta \leftarrow \beta/2$
  - If $d > d_{targ} \times 1.5$, then $\beta \leftarrow \beta \times 2$
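The $\beta$ adaptation rule can be written as a tiny helper (a sketch of the heuristic above; `d_targ` is the target KL value, and the constants 1.5 and 2 are the heuristically chosen values from the paper):

```python
def update_beta(beta: float, d: float, d_targ: float) -> float:
    """Adapt the KL penalty coefficient after each policy update."""
    if d < d_targ / 1.5:
        beta = beta / 2.0      # KL too small: penalize less
    elif d > d_targ * 1.5:
        beta = beta * 2.0      # KL too large: penalize more
    return beta
```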
In experiments, the two variants do not perform equally well: the original paper reports that the adaptive KL penalty performed worse than the clipped surrogate objective, but keeps it as an important baseline.
If the policy and value function share network parameters, the loss combines the policy surrogate, a value-function error term, and an entropy bonus:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$

where $c_1$ and $c_2$ are coefficients, $S$ denotes the entropy bonus, and $L_t^{VF}$ is the squared-error loss $(V_\theta(s_t) - V_t^{targ})^2$.
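A minimal sketch of this combined objective (illustrative only; the coefficient values `c1=0.5` and `c2=0.01` are common defaults I am assuming rather than values prescribed by the equation, `entropy` is assumed to hold the per-state policy entropy, and `value_targets` is assumed to be a detached regression target):

```python
import torch
import torch.nn.functional as F

def ppo_total_loss(new_log_probs, old_log_probs, advantages,
                   values, value_targets, entropy,
                   eps=0.2, c1=0.5, c2=0.01):
    """Negative of L_t^{CLIP+VF+S}, suitable for a gradient-descent optimizer."""
    # Clipped policy surrogate L^CLIP
    ratio = torch.exp(new_log_probs - old_log_probs)
    policy_obj = torch.min(ratio * advantages,
                           torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    # L^VF = (V_theta(s_t) - V_t^targ)^2
    value_loss = F.mse_loss(values, value_targets)
    # Entropy bonus S[pi_theta](s_t) encourages exploration
    entropy_bonus = entropy.mean()
    # The paper maximizes L^CLIP - c1*L^VF + c2*S; negate so an optimizer can minimize.
    return -(policy_obj - c1 * value_loss + c2 * entropy_bonus)
```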
The advantage estimator used is:

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$$

More generally, $\hat{A}_t$ can be replaced by a truncated version of generalized advantage estimation (GAE), which reduces to the equation above when $\lambda = 1$:

$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\,\delta_{T-1}$$

$$\text{where} \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
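A sketch of this truncated GAE computation over a single $T$-step segment (my own illustration, assuming no episode boundaries inside the segment; `last_value` plays the role of $V(s_T)$, and $\gamma = 0.99$, $\lambda = 0.95$ are typical values):

```python
import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE; lam = 1.0 recovers the simple estimator above."""
    T = rewards.shape[0]
    with torch.no_grad():                         # advantages are fixed targets
        values = torch.cat([values, last_value.reshape(1)])
        advantages = torch.zeros(T)
        gae = torch.zeros(())
        for t in reversed(range(T)):
            # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            # A_t = delta_t + (gamma * lam) * A_{t+1}
            gae = delta + gamma * lam * gae
            advantages[t] = gae
    return advantages
```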
The full algorithm (actor-critic style) is as follows:

```
Algorithm: PPO, Actor-Critic Style
for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy pi_theta_old in environment for T timesteps
        Compute advantage estimates A_hat_1, ..., A_hat_T
    end for
    Optimize surrogate L wrt theta, with K epochs and minibatch size M <= NT
    theta_old <- theta
end for
```
李宏毅机器学习系列-强化学习之近端策略优化PPO (Hung-yi Lee's machine learning series: proximal policy optimization in reinforcement learning)
This post describes the evolution of PPO from the perspective of importance sampling.
个人认为写得最好的TRPO讲解 (in my view, the best-written explanation of TRPO)
This post presents the derivation of PPO from a mathematical standpoint.