Deep Reinforcement Learning: Notes on CS285 Lectures 5–9

- 1. Policy Gradient
  - 1.1 REINFORCE
  - 1.2 Improvements
    - 1.2.1 Causality
    - 1.2.2 Baselines
    - 1.2.3 Importance Sampling
- 2. Actor-Critic
  - 2.1 Advantage Function $A^\pi(s_t,a_t)$
  - 2.2 Fitting the Value Function $V^\pi(s)$
    - 2.2.1 Monte-Carlo Policy Evaluation
    - 2.2.2 Bootstrapped Value Estimation
    - 2.2.3 Target Value Summary
  - 2.3 Actor-Critic with Combined Improvements
- 3. Value-based Methods
  - 3.1 Policy Iteration
  - 3.2 Value Iteration
  - 3.3 Fitted Value Iteration
  - 3.4 Fitted Q-iteration
  - 3.5 Summary
- 4. Q-Value based Methods
  - 4.1 Improvements to Q-iteration
    - 4.1.1 Replay Buffer
    - 4.1.2 Target Network
  - 4.2 Three Forms of Q-learning
    - 4.2.1 Online Q-learning
    - 4.2.2 DQN (N=1, K=1)
    - 4.2.3 Fitted Q-learning
    - 4.2.4 A Brief Summary
  - 4.3 Practical Tips for Deep Q-networks
    - 4.3.1 Double Q-learning
    - 4.3.2 N-step Returns
    - 4.3.3 Dueling Structure
    - A Concrete Double DQN Example
  - 4.4 Q-learning with Continuous Actions
    - 4.4.1 Stochastic Optimization
    - 4.4.2 Easy-to-optimize Function Classes
    - 4.4.3 Learning a Second Actor
- 5. Advanced Policy Gradient
  - 5.1 Natural Policy Gradient
  - 5.2 TRPO (Trust Region Policy Optimization)
  - 5.3 PPO (Proximal Policy Optimization)
- References
- Supplement
1. Policy Gradient

1.1 REINFORCE

- Optimization objective $J(\theta)$

$$
\begin{aligned}
\theta^* &= \arg\max_\theta J(\theta) \\
&= \arg\max_\theta E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t,a_t)\Big] \\
&= \arg\max_\theta \sum_{t=1}^T E_{(s_t,a_t) \sim p_\theta(s_t,a_t)}\big[r(s_t,a_t)\big] \quad (\text{finite horizon}) \\
&= \arg\max_\theta E_{s_1 \sim p(s_1)}\big[V(s_1)\big] \\
&= \arg\max_\theta E_{s_1 \sim p(s_1)}\Big[E_{a \sim \pi_\theta(a|s)}\big[Q(s,a)\big]\Big]
\end{aligned}
$$

The objective is usually to maximize the expected cumulative reward $\sum_t r(s_t,a_t)$ along a trajectory $\tau$; equivalently, when the initial state is drawn from $p(s_1)$, it is to maximize the expected state value $V(s_1)$.
- Gradient of the objective

Write $\pi_\theta(\tau)=p_\theta(\tau)$ and $r(\tau)=\sum_t r(s_t,a_t)$. From lectures 1–4, $p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$, so the objective $J(\theta)$ becomes

$$
\begin{aligned}
J(\theta) &= E_{\tau \sim \pi_\theta(\tau)}\Big[\sum_t r(s_t,a_t)\Big] \\
&= E_{\tau \sim \pi_\theta(\tau)}\big[r(\tau)\big] \\
&= \int \pi_\theta(\tau)\,r(\tau)\,d\tau
\end{aligned}
$$

Gradient of the objective: using the identity $\nabla_\theta\pi_\theta(\tau)=\pi_\theta(\tau)\nabla_\theta \log\pi_\theta(\tau)$,

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \int \nabla_\theta \pi_\theta(\tau)\,r(\tau)\,d\tau \\
&= \int \pi_\theta(\tau)\,\nabla_\theta \log\pi_\theta(\tau)\,r(\tau)\,d\tau \\
&= E_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log\pi_\theta(\tau)\,r(\tau)\big]
\end{aligned}
$$

Since $\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$,

$$
\nabla_\theta \log\pi_\theta(\tau)=\nabla_\theta\Big[\log p(s_1)+\sum_{t=1}^T\log\pi_\theta(a_t|s_t)+\sum_{t=1}^T\log p(s_{t+1}|s_t,a_t)\Big]=\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t|s_t),
$$

and therefore

$$
\nabla_\theta J(\theta)=E_{\tau \sim \pi_\theta(\tau)}\underbrace{\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_t|s_t)\Big]}_{\text{policy related}}\underbrace{\Big[\sum_{t=1}^T r(s_t,a_t)\Big]}_{\text{supervision signal}}
$$

The expectation is approximated by Monte-Carlo sampling of trajectories: draw $N$ samples $\tau_i=(s_{i,1},a_{i,1},\dots,s_{i,T},a_{i,T}),\ i=1,2,\dots,N$, giving

$$
\nabla_\theta J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]
$$

This yields the most basic policy gradient algorithm, REINFORCE.
- REINFORCE (the original policy gradient algorithm)
  - Sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$
  - $\nabla_\theta J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]$
  - $\theta\leftarrow\theta+\alpha\nabla_\theta J(\theta)$
- Run the current policy $\pi_\theta$ to collect trajectory samples;
- Use the reward signals returned by the environment to compute the supervision signal $\nabla_\theta J(\theta)$ that updates the current policy parameters;
- Apply the update with an adjustable learning rate $\alpha$.

As the steps show, every REINFORCE update must interact with the environment to obtain fresh supervision, and the return term $\sum_t r(s_{i,t},a_{i,t})$ has high variance, so several improvements are needed to make the method more effective and stable. A minimal sketch of the gradient estimate is given below.
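The estimator above maps almost directly to code. The following is a minimal sketch (my own, not from the lecture), assuming `policy(states)` returns a `torch.distributions` object and `trajectories` holds $N$ sampled rollouts:

```python
import torch

def reinforce_loss(policy, trajectories):
    """trajectories: list of (states, actions, rewards) tensors, one tuple per rollout."""
    losses = []
    for states, actions, rewards in trajectories:
        log_probs = policy(states).log_prob(actions)   # log pi_theta(a_t | s_t) for each step
        total_return = rewards.sum()                   # sum_t r(s_t, a_t)
        # Minimizing the negative product performs gradient ascent on J(theta).
        losses.append(-(log_probs.sum() * total_return))
    return torch.stack(losses).mean()                  # average over the N rollouts
```

Calling `reinforce_loss(...).backward()` then populates the policy parameters' gradients with exactly the $\frac{1}{N}\sum_i\big[\sum_t\nabla_\theta\log\pi_\theta\big]\big[\sum_t r\big]$ estimate.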
1.2 Improvements

1.2.1 Causality

- Motivation: the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$.
- Monte-Carlo policy gradient:

$$
\nabla_\theta J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]
$$

- Gradient after applying causality:

$$
\begin{aligned}
\nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big[\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\Big] \\
&= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}
\end{aligned}
$$

- Remark: in $J(\theta)=\sum_\tau\pi_\theta(\tau)r(\tau)$, when computing the supervision term $r(\tau)$, note that a decision made at time $t$ cannot change rewards collected before $t$. Summing rewards only from time $t$ onward (the "reward to go" $\hat{Q}_{i,t}$) therefore gives a more accurate, lower-variance estimator; this is the causality improvement (a helper sketch follows).
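A small sketch (my own helper, not from the notes) of the reward-to-go term $\hat{Q}_{i,t}=\sum_{t'\ge t} r_{i,t'}$ for one trajectory:

```python
import numpy as np

def reward_to_go(rewards):
    """rewards: array of shape (T,) for one trajectory; returns Q_hat of shape (T,)."""
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate from the end of the episode
        running += rewards[t]
        q_hat[t] = running
    return q_hat
```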
1.2.2 Baselines

- Motivation: make the estimated gradient more stable (lower variance).

$$
\nabla_\theta J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\big[r(\tau_i)-b\big]
$$

- Choices of the reward term $r(\tau)$ and the baseline $b$:

$$
r(\tau)=\sum_{t=1}^T r(s_t,a_t) \quad \text{or} \quad \hat{Q}_{i,t}
$$

$$
b=\frac{1}{N}\sum_{i=1}^N r(\tau_i) \quad \text{or} \quad V^\pi(s_t)
$$

- Remark: the returns $r(\tau_i)$ of individual trajectories can differ a lot, and the gradient estimate $\nabla J(\theta)$ depends on which trajectories happen to be sampled. Subtracting a baseline such as the mean return $b=\frac{1}{N}\sum_{i=1}^N r(\tau_i)$ centers the reward signal and stabilizes the updates without biasing the gradient. Many baselines are possible; the usual choice is $V^\pi(s_t)$ (see the small sketch below).
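A trivial sketch of the mean-return baseline (names are my own):

```python
import numpy as np

def centered_returns(returns):
    """returns: array of shape (N,) holding the total reward r(tau_i) of each rollout."""
    b = returns.mean()        # b = 1/N * sum_i r(tau_i)
    return returns - b        # (r(tau_i) - b) then multiplies each grad-log-prob term
```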
1.2.3 Importance Sampling

- Motivation: reuse samples collected under previous policies, removing the need to interact with the environment for fresh trajectories at every gradient step.

$$
\begin{aligned}
J(\theta) &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}\,r(\tau)\Big] \\
&= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T\bar{\pi}_{\theta'}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}\,r(\tau)\Big] \\
&= E_{\tau \sim \bar{\pi}(\tau)}\Big[\prod_{t=1}^T\frac{\pi_\theta(a_t|s_t)}{\bar{\pi}_{\theta'}(a_t|s_t)}\,r(\tau)\Big]
\end{aligned}
$$

- Remark: $\bar{\pi}(\tau)$ is the trajectory distribution of an earlier policy. When old trajectories are reused, each one is reweighted by a ratio measuring how far the current trajectory distribution has moved from the previous one; if the change is small, $\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}$ is close to 1 and the old samples remain useful.

$$
\begin{aligned}
\nabla_{\theta}J(\theta) &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{\nabla_\theta\pi_\theta(\tau)}{\bar{\pi}(\tau)}\,r(\tau)\Big] \\
&= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{\pi_\theta(\tau)\nabla_\theta \log\pi_\theta(\tau)}{\bar{\pi}(\tau)}\,r(\tau)\Big] \\
&= E_{\tau \sim \bar{\pi}(\tau)}\Big[\prod_{t=1}^T\frac{\pi_\theta(a_t|s_t)}{\bar{\pi}_{\theta'}(a_t|s_t)}\,r(\tau)\Big[\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\Big]\Big] \\
&= E_{\tau\sim\bar{\pi}(\tau)}\Big[\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\prod_{t'=1}^{t}\frac{\pi_\theta(a_{t'}|s_{t'})}{\bar{\pi}_{\theta'}(a_{t'}|s_{t'})}\Big[\sum_{t'=t}^T r(s_{t'},a_{t'})\Big]\cancel{\prod_{t''=t}^{t'}\frac{\pi_\theta(a_{t''}|s_{t''})}{\bar{\pi}_{\theta'}(a_{t''}|s_{t''})}}\Big]
\end{aligned}
$$

Writing the cumulative ratio with the shorthand $\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\bar{\pi}_{\theta'}(a_{i,t}|s_{i,t})}:=\prod_{t'=1}^{t}\frac{\pi_\theta(a_{i,t'}|s_{i,t'})}{\bar{\pi}_{\theta'}(a_{i,t'}|s_{i,t'})}$, the sample estimator is

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\bar{\pi}_{\theta'}(a_{i,t}|s_{i,t})}\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}
$$
2. Actor-Critic

- Policy Gradient works directly with an estimate of the objective gradient, whereas Actor-Critic first evaluates the current policy (policy evaluation) and then updates it in a direction that improves on it (policy improvement). The Actor executes the policy (the "run the policy" step of the loop), the Critic evaluates it (the "fit a model" step), and the "improve the policy" step can be viewed as a learner that extracts an update rule from the Critic and hands the improved policy back to the Actor.
2.1 Advantage Function $A^\pi(s_t,a_t)$

- Value functions

$$
V^\pi(s_t)=\sum_{t'=t}^T E_{\tau\sim\pi_\theta(\tau)}\big[r(s_{t'},a_{t'})\,\big|\,s_t\big]
$$

$$
Q^\pi(s_t,a_t)=\sum_{t'=t}^T E_{\tau\sim\pi_\theta(\tau)}\big[r(s_{t'},a_{t'})\,\big|\,s_t,a_t\big], \qquad
V^\pi(s_t)=E_{a_t\sim\pi_\theta(a_t|s_t)}\big[Q^\pi(s_t,a_t)\big]
$$

Remark: $V^\pi(s_t)$ is the total return expected when following the current policy $\pi_\theta$ from state $s_t$; $Q^\pi(s_t,a_t)$ is the total return expected when starting in $s_t$, taking action $a_t$, and then following $\pi_\theta$. Intuitively they attach a value judgment to the current state or state-action pair; once $V$ and $Q$ are known, a policy can be read off directly by choosing the state or action with the highest value.

- Advantage function

$$
\begin{aligned}
A^\pi(s_t,a_t) &= Q^\pi(s_t,a_t)-V^\pi(s_t) \\
&\approx r(s_t,a_t)+\sum_{t'=t+1}^T E_{\pi_\theta}\big[r(s_{t'},a_{t'})\big]-V^\pi(s_t) \\
&= r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)
\end{aligned}
$$

- Remark: advantage = state-action value minus state value; it measures how much better an action is than average, rather like normalizing the Q-values in state $s_t$ by subtracting their mean over actions. After the approximation it reduces to the reward plus a temporal-difference term, so to form the policy gradient $\nabla_\theta J(\theta)$ we only need an estimate of the state value $V^\pi(s_t)$ (see the sketch below).
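A small sketch (assumed helper) of the approximate advantage $A(s_t,a_t)\approx r_t+\gamma V(s_{t+1})-V(s_t)$ over one rollout, given a fitted value estimate:

```python
import numpy as np

def td_advantage(rewards, v_hat, gamma=1.0):
    """rewards: (T,); v_hat: (T+1,) fitted values with v_hat[T] = 0 at episode end."""
    return rewards + gamma * v_hat[1:] - v_hat[:-1]
```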
2.2 Fitting the Value Function $V^\pi(s)$

2.2.1 Monte-Carlo Policy Evaluation

- The simplest estimates of $V^\pi(s_t)$:

$$
V^\pi(s_t)\approx \sum_{t'=t}^T r(s_{t'},a_{t'}), \qquad
V^\pi(s_t) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})
$$

- Function-approximation estimate: build the training set $\{(s_{i,t},y_{i,t})\},\ i=1,\dots,N,\ t=1,\dots,T$ with

$$
y_{i,t}=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}), \qquad
L(\phi)=\frac{1}{2}\sum_{i}\big\|\hat{V}^\pi_\phi(s_i)-y_i\big\|^2
$$

- The actor runs the policy to collect trajectory samples;
- Choose a model and fit $\hat{V}_\phi^\pi(s)$ by supervised regression as above (see the sketch after this list);
- Use $\hat{V}_\phi^\pi(s)$ to compute the advantage $\hat{A}^\pi(s_i,a_i)$ of each sample;
- Compute the policy gradient $\nabla J(\theta)$;
- Update the policy parameters.
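A minimal sketch of the value-fitting step in the loop above (the critic's regression), assuming a small PyTorch value network `v_net` and batched Monte-Carlo targets:

```python
import torch

def fit_value_function(v_net, states, mc_returns, lr=1e-3, epochs=50):
    """states: (B, obs_dim) tensor; mc_returns: (B,) tensor of targets y_{i,t}."""
    opt = torch.optim.Adam(v_net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = 0.5 * ((v_net(states).squeeze(-1) - mc_returns) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v_net
```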
2.2.2 Bootstrapped Value Estimation
$$
y_{i,t}=r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})
$$
2.2.3 Target Value Summary
$$
\begin{aligned}
\text{Ideal target:}\quad & y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}\big[r(s_{i,t'},a_{i,t'})\big] \\
\text{MC target:}\quad & y_{i,t}=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) \\
\text{Bootstrapped target:}\quad & y_{i,t}=r(s_{i,t},a_{i,t})+\gamma \hat{V}_\phi^\pi(s_{i,t+1})
\end{aligned}
$$
2.3 Actor-Critic with Combined Improvements

$$
\nabla J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\bar{\pi}_{\theta'}(a_{i,t}|s_{i,t})}\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,A^\pi_n(s_{i,t},a_{i,t})
$$

- Discount factor, e.g. $\gamma=0.99$

$$
A^\pi(s_{i,t},a_{i,t})=\sum_{t'=t}^T\gamma^{t'-t} r(s_{i,t'},a_{i,t'})
$$

$$
A^\pi(s_{i,t},a_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma \hat{V}_\phi(s_{i,t+1})
$$

- Baseline

$$
A^\pi(s_{i,t},a_{i,t})=\sum_{t'=t}^T\gamma^{t'-t} r(s_{i,t'},a_{i,t'})-\hat{V}_\phi(s_{i,t})
$$

$$
A^\pi(s_{i,t},a_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma \hat{V}_\phi(s_{i,t+1})-\hat{V}_\phi(s_{i,t})
$$

- n-step returns ($n=1$ recovers the bootstrapped estimate)

$$
A_n^\pi(s_{i,t},a_{i,t})\approx \sum_{t'=t}^{t+n}\gamma^{t'-t} r(s_{t'},a_{t'})+\gamma^n\hat{V}^\pi_\phi(s_{t+n})-\hat{V}^\pi_\phi(s_t)
$$

- Remark: the discount factor makes infinite-horizon trajectories tractable, and blending n-step returns over all $n$ with exponentially decaying weights yields GAE (Generalized Advantage Estimation).
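A sketch (assumed helper) of the n-step advantage estimate above, summing $n$ discounted rewards and bootstrapping with the fitted value:

```python
import numpy as np

def n_step_advantage(rewards, v_hat, n, gamma=0.99):
    """rewards: (T,); v_hat: (T+1,) fitted values with v_hat[T] = 0 at episode end."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)                       # truncate at the end of the episode
        discounts = gamma ** np.arange(horizon - t)
        adv[t] = np.sum(discounts * rewards[t:horizon]) \
                 + gamma ** (horizon - t) * v_hat[horizon] - v_hat[t]
    return adv
```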
3. Value-based Methods

- Remark: value-based methods try to skip the explicit gradient step $\nabla_\theta J(\theta)$ for updating the policy. Instead they use value iteration to obtain directly the value function $V^*(s)$ or $Q^*(s,a)$ of the optimal policy $\pi^*$, and then read the policy off the value function.
3.1 Policy Iteration

- Evaluate the advantage $A^\pi(s,a)$ (policy evaluation)

$$
\begin{aligned}
A^\pi(s,a) &= Q^\pi(s,a)-V^\pi(s) \\
&= r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\big[V^\pi(s')\big]-V^\pi(s) \\
&\approx r(s,a)+\gamma V^\pi(s')-V^\pi(s)
\end{aligned}
$$

- Use $A^\pi(s,a)$ to obtain a better policy $\pi'$, for example (policy improvement)

$$
\pi'(s)=\arg\max_a A^\pi(s,a), \qquad \pi\leftarrow\pi'
$$

- Remark: policy iteration alternates two steps, policy evaluation and policy improvement. The first step estimates the advantage from samples gathered by interacting with the environment; the second step selectively raises the value (or the probability) of good actions. Interaction should respect an exploration/exploitation trade-off, otherwise high-reward samples may never be found. A tabular sketch follows.
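A tabular policy-iteration sketch on a small illustrative MDP with known dynamics `P` and rewards `R` (my own example, not from the notes):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_iters=200):
    """P: (S, A, S) transition probabilities; R: (S, A) rewards."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    V = np.zeros(S)
    for _ in range(1000):                      # outer PI loop, bounded for safety
        # Policy evaluation: V(s) <- r(s, pi(s)) + gamma * E_{s' ~ p(.|s,pi(s))}[V(s')]
        for _ in range(eval_iters):
            V = R[np.arange(S), policy] + gamma * (P[np.arange(S), policy] @ V)
        # Policy improvement: argmax_a A(s,a) equals argmax_a Q(s,a), since V(s) is constant in a
        Q = R + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```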
3.2 Value Iteration

Policy evaluation:

$$
V^\pi(s)\leftarrow E_{a\sim\pi(a|s)}\Big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\big[V^\pi(s')\big]\Big]
$$

For a chosen action $a$:

$$
Q^\pi(s,a)\leftarrow r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\big[V^\pi(s')\big]
$$

- Evaluate $Q^\pi(s,a)$ with the update above;
- Set $V^\pi(s)\leftarrow \max_a Q^\pi(s,a)$.
- Remark: the first step, updating the Q-values, relies on the environment dynamics $p(s'|s,a)$; the second step simply takes the largest Q-value as the state value, closing the iteration loop. The problem is that in general the dynamics are unknown. A tabular sketch follows.
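A tabular value-iteration sketch with known dynamics (illustrative only; real problems rarely expose $p(s'|s,a)$):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """P: (S, A, S) transition probabilities; R: (S, A) rewards."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V                    # Q(s,a) = r(s,a) + gamma * E_{s'}[V(s')]
        V_new = Q.max(axis=1)                    # V(s) <- max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)       # greedy policy read off the Q-table
        V = V_new
```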
3.3 Fitted Value Iteration

- Remark: besides iterating on the state value $V^\pi_\phi(s)$, one can also iterate directly on the Q-value $Q^\pi_\phi(s,a)$. Fitted value iteration is more stable than plain value iteration but is not an on-policy method; both still require the environment dynamics model, and when the dynamics are unknown one must move to Q-value iteration.
3.4 Fitted Q-iteration

- Fitted Q-iteration
- Online Q-learning (online Q-iteration)
- Remark: both replace the dynamics-dependent term $E_{s'\sim p(s'|s,a)}[V_\phi(s')]$ with $\max_{a'}Q_\phi(s_i',a_i')$, which costs some accuracy. Full fitted Q-iteration can run off-policy and its fitted Q-values are comparatively stable, whereas online Q-learning takes one gradient step on the Q-value for every collected sample, which is very unstable, and with a non-linear function approximator there is no theoretical convergence guarantee. Chapter 4 describes how to make online Q-learning practical. A sketch of the fitted Q-iteration loop follows.
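A rough sketch of full fitted Q-iteration (generic PyTorch pseudocode; `q_net`, `dataset`, and the tuple layout are my own assumptions):

```python
import torch

def fitted_q_iteration(q_net, dataset, n_iters=10, gamma=0.99, lr=1e-3):
    """dataset: tensors (s, a, r, s2) of off-policy transitions; a is a LongTensor of action indices."""
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    s, a, r, s2 = dataset
    for _ in range(n_iters):
        with torch.no_grad():                            # y_i = r + gamma * max_a' Q_phi(s', a')
            y = r + gamma * q_net(s2).max(dim=1).values
        for _ in range(100):                             # inner supervised regression on the fixed targets
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = 0.5 * ((q_sa - y) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_net
```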
3.5 Summary

- Policy iteration is the more general framework. When the policy-improvement step is $\pi'=\arg\max_a Q^\pi(s,a)$ it degenerates into value iteration, i.e. a value-based method; when the improvement step is gradient ascent it becomes Actor-Critic; when Q-values are estimated and the update is a gradient step with a constrained step size, it becomes the advanced policy gradient methods of Chapter 5.
- The environment dynamics required by value iteration are avoided in Q-iteration, at the price of losing theoretical convergence; fitted values are more stable than values updated by single gradient steps.
4. Q-Value based Methods

4.1 Improvements to Q-iteration

Online Q-iteration has two problems:

- In the sample-collection step, samples from the same trajectory are highly correlated; they should be shuffled and mixed with samples from other trajectories. (Solution: replay buffer)
- In the regression step, the target value $y_i$ changes at every update, and no gradient is taken through the target. (Solution: target network)
4.1.1 Replay Buffer

Key idea: store transitions collected online in a buffer and, at every update, sample a batch from the replay buffer rather than using only the most recent transitions. A minimal sketch follows.
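A minimal replay-buffer sketch (my own helper, not from the notes):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, s, a, r, s2, done):
        self.storage.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)   # breaks within-trajectory correlation
        return list(zip(*batch))                          # tuple of lists: (s, a, r, s2, done)
```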
4.1.2 Target Network

Key idea: compute the regression target with a separate Q-network $Q_{\phi'}(s_i',a_i')$ whose parameters are held fixed for a while, and periodically update the target parameters with $\phi'\leftarrow\phi$. A sketch of the update rules follows.
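A sketch of the two common target-network update rules for PyTorch modules (assumed helpers; the Polyak variant is an alternative not discussed in the notes):

```python
def hard_update(q_net, target_net):
    """phi' <- phi, applied every C steps."""
    target_net.load_state_dict(q_net.state_dict())

def soft_update(q_net, target_net, tau=0.005):
    """Polyak averaging: phi' <- tau * phi + (1 - tau) * phi'."""
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```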
4.2 Three Forms of Q-learning
4.2.1 Online Q-learning
4.2.2 DQN(N=1,K=1)
4.2.3 Fitted Q-learning
4.2.4 A Brief Summary
4.3 Practical Tips for Deep Q-networks
4.3.1 Double Q-learning

Key idea: switch action selection from the target network to the current (online) network, while still evaluating the selected action with the target network; this counteracts the over-estimation introduced by the max operator. A sketch of the target follows.
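A sketch of the Double DQN target, with actions chosen by the online network and evaluated by the target network (tensor shapes and names are my assumptions):

```python
import torch

def double_dqn_target(q_net, target_net, r, s2, done, gamma=0.99):
    """r, done: (B,) tensors; s2: batch of next states; done is 1.0 for terminal transitions."""
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)          # argmax_a' Q_phi(s', a')
        q_next = target_net(s2).gather(1, a_star).squeeze(1)    # Q_phi'(s', a*)
        return r + gamma * (1.0 - done) * q_next
```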
4.3.2 N-step Returns
$$
y_{j,t}=\sum_{t'=t}^{t+N-1}\gamma^{t'-t} r_{j,t'}+\gamma^N \max_{a_{j,t+N}} Q_{\phi'}(s_{j,t+N},a_{j,t+N})
$$
4.3.3 Dueling Structure
Key idea: instead of using a single network head to fit $Q^\pi(s,a)$ directly, the dueling structure splits the network into a state-value part $V^\pi(s)$ and an action-advantage part $A^\pi(s,a)$ and combines them as $Q^\pi(s,a)=V^\pi(s)+A^\pi(s,a)$; see the paper below and the sketch after it.
Dueling Network Architectures for Deep Reinforcement Learning
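A minimal dueling-head sketch in PyTorch (architecture details are my assumptions; the mean subtraction is the identifiability trick used in the paper above):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)          # V(s)
        self.a_head = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.v_head(h), self.a_head(h)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)
```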
A Concrete Double DQN Example
4.4 Q-learning with Continuous Actions

The Q-learning methods above assume discrete actions, which makes $Q^\pi(s,a)$ easy to represent and iterate on. With continuous actions the difficulty is concentrated in the max operator, and there are three common ways to handle it.
4.4.1 Stochastic Optimization

The main idea of stochastic (gradient-free) optimization is to sample a set of candidate solutions, score them with an evaluation criterion, and keep the candidates that score best. A strong representative from evolution strategies is CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which can be derived as a natural gradient within the IGO (Information Geometric Optimization) framework. Gradient-free methods generally need more samples or computation than gradient-based ones; CMA-ES works well on complex problems of moderate size (roughly 3 to 300 variables). Evolution strategies can also serve as a good, scalable complement to deep reinforcement learning algorithms; see the two papers below, followed by a small sketch.
Evolution Strategies as a Scalable Alternative to Reinforcement Learning 2017
The CMA-ES: A Tutorial 2016
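A sketch of the simplest stochastic-optimization approximation to $\max_a Q(s,a)$: "random shooting" over sampled candidate actions (CMA-ES or CEM would instead refine the sampling distribution iteratively). All names are my own, and the Q-network is assumed to take state-action pairs:

```python
import torch

def approx_max_q(q_net, s, action_low, action_high, n_candidates=64):
    """s: (obs_dim,) tensor; action_low/high: (act_dim,) tensors bounding a box action space."""
    acts = torch.rand(n_candidates, action_low.numel()) * (action_high - action_low) + action_low
    states = s.expand(n_candidates, -1)          # repeat the state for every candidate action
    q_vals = q_net(states, acts).squeeze(-1)     # Q(s, a_k) for each sampled candidate
    best = q_vals.argmax()
    return acts[best], q_vals[best]              # approximate argmax_a Q and max_a Q
```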
4.4.2 Easy-to-optimize Function Classes

Replace the Q-function with a form that is easy to maximize analytically, e.g. NAF (Normalized Advantage Function), which makes Q-learning applicable to continuous actions. The details of NAF for continuous deep Q-learning are in the ICML 2016 paper below.
Continuous Deep Q-Learning with Model-based Acceleration 2016
4.4.3 Learning a Second Actor

Train a separate actor $\mu(s)$ to stand in for the max operator, so that the actor and the Q-network are updated together:

$$
\mu_\theta(s) \approx \arg\max_a Q_\phi(s,a), \qquad
\max_a Q_\phi(s,a)=Q_\phi\big(s,\arg\max_a Q_\phi(s,a)\big)\approx Q_\phi\big(s,\mu_\theta(s)\big)
$$
Continuous Control With Deep Reinforcement Learning
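A DDPG-style sketch of this actor update, my own simplification of the idea in the paper above (network classes and optimizers are assumptions; the Q-network is assumed to take state-action pairs):

```python
def actor_update(actor, q_net, states, actor_opt):
    actions = actor(states)                        # mu_theta(s)
    actor_loss = -q_net(states, actions).mean()    # ascend Q(s, mu(s)) by descending its negative
    actor_opt.zero_grad()
    actor_loss.backward()                          # gradient flows through the Q-network into the actor
    actor_opt.step()
```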
5. Advanced Policy Gradient

Goal: starting from the objective $J(\theta)$, find a better parameter $\theta'$ such that $J(\theta')-J(\theta)\geq 0$, with the improvement as large as possible.

We know that $J(\theta)=E_{\tau\sim p_\theta(\tau)}\big[\sum_t \gamma^t r(s_t,a_t)\big]=E_{s_0\sim p_\theta(s_0)}\big[V^{\pi_\theta}(s_0)\big]$,

so:

$$
\begin{aligned}
J(\theta')-J(\theta) &= J(\theta')-E_{s_0\sim p_\theta(s_0)}\big[V^{\pi_\theta}(s_0)\big] \quad (1)\\
&= J(\theta')-E_{\tau\sim p_{\theta'}(\tau)}\big[V^{\pi_\theta}(s_0)\big] \quad (2)\\
&= J(\theta')-E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t V^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^t V^{\pi_\theta}(s_t)\Big]\\
&= J(\theta')+E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t\big(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\big)\Big]\\
&= E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t r(s_t,a_t)\Big]+E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t\big(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\big)\Big]\\
&= E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t\big(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\big)\Big]\\
&= E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\\
&= \sum_{t=0}^\infty E_{s_t\sim p_{\theta'}(s_t)}\Big[E_{a_t\sim \pi_{\theta'}(a_t|s_t)}\big[\gamma^t A^{\pi_\theta}(s_t,a_t)\big]\Big] \quad (3)\\
&= \sum_{t=0}^\infty E_{s_t\sim p_{\theta'}(s_t)}\Big[E_{a_t\sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\Big] \quad (4)\\
&\approx \sum_{t=0}^\infty E_{s_t\sim p_{\theta}(s_t)}\Big[E_{a_t\sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\Big] \quad (5)\\
&= \bar{A}(\theta')
\end{aligned}
$$

(1) to (2): the initial-state distribution is determined by the environment, so it is the same under $\theta$ and $\theta'$; the expectation over $s_0$ can therefore be written as an expectation over trajectories $\tau\sim p_{\theta'}(\tau)$.

(3) to (4): importance sampling.

Our goal therefore becomes:

$$
\theta'\leftarrow \arg\max_{\theta'} \bar{A}(\theta') \approx J(\theta')-J(\theta)
$$

How can the approximation in (5) be made to hold? If the new policy $\pi_{\theta'}(a_t|s_t)$ stays close to the old policy $\pi_\theta(a_t|s_t)$, then $p_{\theta'}(s_t)\approx p_\theta(s_t)$. This amounts to adding a constraint $|p_{\theta'}(s_t)-p_\theta(s_t)|\leq\epsilon$, which corresponds to the policy-level constraint $|\pi_{\theta'}(a_t|s_t)-\pi_\theta(a_t|s_t)|\leq\epsilon$. A small sketch of the sample-based surrogate follows.
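A hedged sketch (my own, not from the notes) of estimating the surrogate $\bar{A}(\theta')$ from samples collected by the old policy, together with a crude check on how far the new policy has moved, as a simplified stand-in for the trust-region constraint used by TRPO and PPO:

```python
import torch

def surrogate_objective(new_logp, old_logp, advantages, max_kl=0.01):
    """new_logp/old_logp: log pi_{theta'}(a_t|s_t) and log pi_theta(a_t|s_t); advantages: A^{pi_theta}."""
    ratio = torch.exp(new_logp - old_logp)         # pi_{theta'} / pi_theta per sample
    surrogate = (ratio * advantages).mean()        # sample estimate of A_bar(theta')
    approx_kl = (old_logp - new_logp).mean()       # crude estimate of KL(pi_theta || pi_{theta'})
    return surrogate, bool(approx_kl <= max_kl)    # objective value and "still inside the trust region" flag
```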