Deep Reinforcement Learning, CS285 lec5-lec9 (long post warning)

Study notes and takeaways from CS285 lectures 5-9

  • 1. Policy Gradient
    • 1.1 REINFORCE
    • 1.2 Improvements
      • 1.2.1 Causality
      • 1.2.2 Baselines
      • 1.2.3 Importance Sampling
  • 2. Actor-Critic
    • 2.1 Advantage Function $A^\pi(s_t,a_t)$
    • 2.2 Fitting the Value Function $V^\pi(s)$
      • 2.2.1 Monte-Carlo Policy Evaluation
      • 2.2.2 Bootstrapped Value Estimation
      • 2.2.3 Target Value Summary
    • 2.3 Actor-Critic with the Combined Improvements
  • 3. Value-based Methods
    • 3.1 Policy Iteration
    • 3.2 Value Iteration
    • 3.3 Fitted Value Iteration
    • 3.4 Fitted Q-Iteration
    • 3.5 Summary
  • 4. Q-Value-based Methods
    • 4.1 Improvements to Q-Iteration
      • 4.1.1 Replay Buffer
      • 4.1.2 Target Network
    • 4.2 Three Forms of Q-Learning
      • 4.2.1 Online Q-learning
      • 4.2.2 DQN (N=1, K=1)
      • 4.2.3 Fitted Q-learning
      • 4.2.4 A Brief Summary
    • 4.3 Practical Improvements to Deep Q-Networks
      • 4.3.1 Double Q-learning
      • 4.3.2 N-step Returns
      • 4.3.3 Dueling Structure
      • A Concrete Double DQN Example
    • 4.4 Q-Learning with Continuous Actions
      • 4.4.1 Stochastic Optimization
      • 4.4.2 Function Classes That Are Easy to Optimize
      • 4.4.3 Learning a Second Actor
  • 5. Advanced Policy Gradient
    • 5.1 Natural Policy Gradient
    • 5.2 TRPO (Trust Region Policy Optimization)
    • 5.3 PPO (Proximal Policy Optimization)
  • References
  • Additional Notes

1. Policy Gradient


1.1 REINFORCE

  • Optimization objective $J(\theta)$
    $$
    \begin{aligned}
    \theta^* &= \argmax_\theta J(\theta) \\
    &= \argmax_\theta E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t,a_t)\Big] \\
    &= \argmax_\theta \sum_{t=1}^T E_{(s_t,a_t) \sim p_\theta(s_t,a_t)}\big[r(s_t,a_t)\big] \quad (\text{finite horizon}) \\
    &= \argmax_\theta E_{s_1 \sim p(s_1)}\big[V(s_1)\big] \\
    &= \argmax_\theta E_{s_1 \sim p(s_1)}\Big[E_{a \sim \pi_\theta(a|s)}\big[Q(s,a)\big]\Big]
    \end{aligned}
    $$
    The objective is usually to maximize the expected cumulative reward $\sum_t r(s_t,a_t)$ along a trajectory $\tau$; equivalently, with the initial state drawn from $s_1 \sim p(s_1)$, it is to maximize the expected state value $V(s_1)$.
  • Objective gradient
    Write $\pi_\theta(\tau)=p_\theta(\tau)$ and $r(\tau)=\sum_t r(s_t,a_t)$. From lec1-lec4, $p_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$, so the objective $J(\theta)$ becomes
    $$
    \begin{aligned}
    J(\theta) &= E_{\tau \sim \pi_\theta(\tau)}\Big[\sum_t r(s_t,a_t)\Big] \\
    &= E_{\tau \sim \pi_\theta(\tau)}\big[r(\tau)\big] \\
    &= \int \pi_\theta(\tau)\,r(\tau)\,d\tau
    \end{aligned}
    $$
    Objective gradient: using the identity $\nabla_\theta\pi_\theta(\tau)=\pi_\theta(\tau)\nabla_\theta \log\pi_\theta(\tau)$,
    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &= \int \nabla_\theta \pi_\theta(\tau)\,r(\tau)\,d\tau \\
    &= \int \pi_\theta(\tau)\,\nabla_\theta \log\pi_\theta(\tau)\,r(\tau)\,d\tau \\
    &= E_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log\pi_\theta(\tau)\,r(\tau)\big]
    \end{aligned}
    $$
    Because $\pi_\theta(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)$, we have
    $$
    \nabla_\theta \log\pi_\theta(\tau)=\nabla_\theta\Big[\log p(s_1)+\sum_{t=1}^T \log\pi_\theta(a_t|s_t)+\sum_{t=1}^T \log p(s_{t+1}|s_t,a_t)\Big]=\sum_{t=1}^T \nabla_\theta \log\pi_\theta(a_t|s_t),
    $$
    since the initial-state distribution and the dynamics do not depend on $\theta$. Therefore
    $$
    \nabla_\theta J(\theta)=E_{\tau \sim \pi_\theta(\tau)}\underbrace{\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_t|s_t)\Big]}_{\text{policy related}}\underbrace{\Big[\sum_{t=1}^T r(s_t,a_t)\Big]}_{\text{supervised info}}
    $$
    The expectation is approximated by Monte-Carlo (MC) sampling of trajectories: draw $N$ samples $\tau_i=(s_{i,1},a_{i,1},\dots,s_{i,T},a_{i,T}),\ i=1,2,\dots,N$, giving
    $$
    \nabla_\theta J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]
    $$
    This yields the most basic policy-gradient algorithm, REINFORCE.
  • REINFORCE (the most basic policy-gradient algorithm)
    • sample $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$
    • $\nabla_\theta J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]$
    • $\theta \leftarrow \theta+\alpha\nabla_\theta J(\theta)$
  1. Sample trajectories from the current policy $\pi_\theta$;
  2. Use the reward signal returned by the environment to compute the supervision signal $\nabla_\theta J(\theta)$ for updating the current policy parameters;
  3. Apply the update with an adjustable learning rate $\alpha$.
    As this shows, every REINFORCE update requires interacting with the environment to obtain fresh supervision, and the supervision signal $\sum_t r(s_{i,t},a_{i,t})$ has high variance, so several improvements are needed to make it more effective and more stable. A minimal sketch of one update is given below.
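The following is a minimal sketch of one REINFORCE update, assuming PyTorch, a Gymnasium-style environment with discrete actions, and a small categorical policy network defined here purely for illustration (none of these choices come from the lecture itself):

```python
# Minimal REINFORCE sketch (assumptions: PyTorch, a Gymnasium-style env whose
# reset() returns (obs, info) and step() returns a 5-tuple, discrete actions).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, env, num_trajs=10):
    """One REINFORCE step: sample N trajectories, form the Monte-Carlo gradient
    estimate (1/N) * sum_i [sum_t grad log pi(a|s)] * [sum_t r], then ascend."""
    loss = 0.0
    for _ in range(num_trajs):
        obs, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        total_return = sum(rewards)                                  # sum_t r(s_t, a_t)
        loss = loss - torch.stack(log_probs).sum() * total_return    # negate for gradient ascent
    loss = loss / num_trajs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```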

1.2 Improvements


1.2.1 Causality

  • Motivation: the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$.
  • The MC objective gradient:
    $$
    \nabla_\theta J(\theta)\approx \frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]
    $$
  • The gradient after applying causality:
    $$
    \begin{aligned}
    \nabla_\theta J(\theta) &\approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big[\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})\Big] \\
    &= \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}
    \end{aligned}
    $$
  • Note: in $J(\theta)=\sum_\tau\pi_\theta(\tau)r(\tau)$, when computing the supervision signal $r(\tau)$, a decision made in the future cannot change rewards already received in the past, so it is more accurate to accumulate $r(\tau)$ only from the current step $t$ onward (the "reward to go"); this is the causality trick. A short sketch follows.
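A small sketch of the reward-to-go computation described above; plain NumPy, with the 1-D array layout as an illustrative assumption:

```python
# Reward-to-go ("causality") sketch: replace the full-trajectory return with
# the return accumulated from time t onward.
import numpy as np

def rewards_to_go(rewards):
    """rewards: 1-D array of r(s_t, a_t) for one trajectory.
    Returns Q_hat[t] = sum_{t' >= t} r(s_t', a_t')."""
    q_hat = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        q_hat[t] = running
    return q_hat

# Example: rewards_to_go(np.array([1.0, 0.0, 2.0])) -> array([3., 2., 2.])
```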

1.2.2 Baselines

  • Motivation: make the estimated gradient more stable (lower variance).
  • $\nabla_\theta J(\theta)=\frac{1}{N}\sum_{i=1}^N\Big[\sum_{t=1}^T\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\Big]\big[r(\tau)-b\big]$
  • Choices for the reward term $r(\tau)$ and the baseline $b$:

$$r(\tau)=\sum_{t=1}^T r(s_t,a_t) \quad \text{or} \quad \hat{Q}_{i,t}$$

$$b=\frac{1}{N}\sum_{i=1}^N r(\tau_i) \quad \text{or} \quad V^\pi(s_t)$$

  • Note: the reward signal $r(\tau_i)$ can differ greatly from one trajectory to another, and the objective gradient $\nabla_\theta J(\theta)$ is estimated from sampled trajectories, so we center the reward by subtracting a baseline such as the batch mean $b=\frac{1}{N}\sum_{i=1}^N r(\tau_i)$. Subtracting a baseline keeps the gradient estimate unbiased while stabilizing the updates. Many baselines are possible; the most common choice is $V^\pi(s_t)$. A small sketch follows.
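A tiny sketch of the batch-mean baseline described above (NumPy; array shapes are illustrative assumptions):

```python
# Baseline sketch: centre the returns with the batch mean b = (1/N) sum_i r(tau_i)
# before weighting the log-probabilities in the gradient estimate.
import numpy as np

def center_returns(returns):
    """returns: length-N array of total trajectory rewards r(tau_i).
    Returns r(tau_i) - b with b the batch mean."""
    b = returns.mean()
    return returns - b
```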

1.2.3 Importance Sampling

  • Motivation: reuse samples collected under previous policies, reducing the need to interact with the environment for fresh trajectories at every gradient step.
  • $$
    \begin{aligned}
    J(\theta) &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}\,r(\tau)\Big] \\
    &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}{p(s_1)\prod_{t=1}^T\bar{\pi}_{\theta'}(a_t|s_t)\,p(s_{t+1}|s_t,a_t)}\,r(\tau)\Big] \\
    &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\prod_{t=1}^T\frac{\pi_\theta(a_t|s_t)}{\bar{\pi}_{\theta'}(a_t|s_t)}\,r(\tau)\Big]
    \end{aligned}
    $$
  • Note: $\bar{\pi}(\tau)$ is the trajectory distribution of a previous policy. When reusing old trajectories, each one is weighted by the ratio between the new and old trajectory distributions; if the difference is small, $\frac{\pi_\theta(\tau)}{\bar{\pi}(\tau)}$ is close to 1 and the previous step's trajectory samples can be reused.
  • $$
    \begin{aligned}
    \nabla_\theta J(\theta) &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{\nabla_\theta\pi_\theta(\tau)}{\bar{\pi}(\tau)}\,r(\tau)\Big] \\
    &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\frac{\pi_\theta(\tau)\nabla_\theta \log\pi_\theta(\tau)}{\bar{\pi}(\tau)}\,r(\tau)\Big] \\
    &= E_{\tau \sim \bar{\pi}(\tau)}\Big[\prod_{t=1}^T\frac{\pi_\theta(a_t|s_t)}{\bar{\pi}_{\theta'}(a_t|s_t)}\,r(\tau)\Big[\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\Big]\Big] \\
    &= E_{\tau\sim\bar{\pi}(\tau)}\Big[\sum_{t=1}^T\nabla_\theta\log\pi_\theta(a_t|s_t)\Big(\prod_{t'=1}^{t}\frac{\pi_\theta(a_{t'}|s_{t'})}{\bar{\pi}_{\theta'}(a_{t'}|s_{t'})}\Big)\Big(\sum_{t'=t}^T r(s_{t'},a_{t'})\Big)\cancel{\prod_{t''=t}^{t'}\frac{\pi_\theta(a_{t''}|s_{t''})}{\bar{\pi}_{\theta'}(a_{t''}|s_{t''})}}\Big]
    \end{aligned}
    $$
  • Writing $\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\bar{\pi}_{\theta'}(a_{i,t}|s_{i,t})} := \prod_{t'=1}^{t}\frac{\pi_\theta(a_{i,t'}|s_{i,t'})}{\bar{\pi}_{\theta'}(a_{i,t'}|s_{i,t'})}$ as a shorthand for the cumulative ratio, we obtain
    $$
    \nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\bar{\pi}_{\theta'}(a_{i,t}|s_{i,t})}\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,\hat{Q}_{i,t}
    $$
    A short sketch of the cumulative ratio follows.
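A small sketch of the cumulative importance weight used above; PyTorch tensors of per-step log-probabilities are assumed for illustration:

```python
# Importance-sampling sketch: weight old-policy samples by the cumulative ratio
# prod_{t' <= t} pi_new(a_t'|s_t') / pi_old(a_t'|s_t') along one trajectory.
import torch

def cumulative_is_weights(logp_new, logp_old):
    """logp_new, logp_old: (T,) tensors of log pi_theta(a_t|s_t) and
    log pi_bar(a_t|s_t) along one old-policy trajectory."""
    log_ratio = logp_new - logp_old
    return torch.exp(torch.cumsum(log_ratio, dim=0))  # (T,) cumulative ratios
```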

2. Actor-Critic

(Figure: the three-step loop — (1) run the policy to generate samples, (2) fit a model / estimate the return, (3) improve the policy.)

  • Policy gradient works directly on estimating the objective gradient. Actor-Critic instead first evaluates the current policy (policy evaluation) and then updates it in a direction that is better than the current policy (policy improvement). The actor executes the policy, i.e. step 1 in the figure, "run the policy"; the critic evaluates the policy, i.e. step 2, "fit a model"; and step 3, "improve the policy", can be viewed as a learner that uses the critic's evaluation to improve the policy before handing it back to the actor to execute.

2.1 Advantage Function $A^\pi(s_t,a_t)$


  • Value functions
    • $V^\pi(s_t)=\sum_{t'=t}^T E_{\tau\sim\pi_\theta(\tau)}\big[r(s_{t'},a_{t'})\,\big|\,s_t\big]$

    • $Q^\pi(s_t,a_t)=\sum_{t'=t}^T E_{\tau\sim\pi_\theta(\tau)}\big[r(s_{t'},a_{t'})\,\big|\,s_t,a_t\big]$, and the two are related by $V^\pi(s_t)=E_{a_t\sim\pi_\theta(a_t|s_t)}\big[Q^\pi(s_t,a_t)\big]$.

    • Note: $V^\pi(s_t)$ is the total return obtained by following the current policy $\pi_\theta$ starting from state $s_t$; $Q^\pi(s_t,a_t)$ is the total return obtained by following $\pi_\theta$ after starting in state $s_t$ and taking action $a_t$. Intuitively, they assign a value judgment to the current state or state-action pair; once $V$ and $Q$ are known, one can pick the state or action with the highest value, so a policy can be read off directly from the value estimates.

  • Advantage function
    $$
    \begin{aligned}
    A^\pi(s_t,a_t) &= Q^\pi(s_t,a_t)-V^\pi(s_t) \\
    &\approx r(s_t,a_t)+\sum_{t'=t+1}^T E_{\pi_\theta}\big[r(s_{t'},a_{t'})\big]-V^\pi(s_t) \\
    &\approx r(s_t,a_t)+V^\pi(s_{t+1})-V^\pi(s_t)
    \end{aligned}
    $$
    • Note: advantage = state-action value minus state value; it measures how much better an action is than average, similar to normalizing the Q-values in state $s_t$ by subtracting their average over actions. After the approximation, the supervision signal becomes the reward plus a temporal-difference (TD) term, so to obtain the objective gradient $\nabla_\theta J(\theta)$ it suffices to estimate the state value $V^\pi(s_t)$.

2.2 Fitting the Value Function $V^\pi(s)$


2.2.1 Monte-Carlo Policy Evaluation

  • The simplest estimates of $V^\pi(s_t)$:
    $$
    V^\pi(s_t)\approx \sum_{t'=t}^T r(s_{t'},a_{t'}), \qquad
    V^\pi(s_t) \approx \frac{1}{N}\sum_{i=1}^N\sum_{t'=t}^T r(s_{i,t'},a_{i,t'})
    $$
  • Function-approximation estimate: build a training set $\{(s_{i,t},y_{i,t})\},\ i=1,\dots,N,\ t=1,\dots,T$ with
    $$
    y_{i,t}=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}), \qquad
    L(\phi)=\frac{1}{2}\sum_i \big\|\hat{V}^\pi_\phi(s_i)-y_i\big\|^2
    $$
    (Figure: the batch actor-critic loop built on this supervised regression.)
    • Steps (a minimal fitting sketch follows after this list):
    1. The actor runs the policy and collects trajectory samples;
    2. Choose a model and fit $\hat{V}_\phi^\pi(s)$ by regression as above;
    3. Use $\hat{V}_\phi^\pi(s)$ to compute the advantage $\hat{A}^\pi(s_i,a_i)$ of each trajectory sample;
    4. Compute the objective gradient $\nabla_\theta J(\theta)$;
    5. Update the policy with the gradient.
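A minimal sketch of the regression step above; PyTorch, with the network shape and optimizer settings as illustrative assumptions:

```python
# Monte-Carlo policy evaluation sketch: regress a value network V_phi(s) onto
# Monte-Carlo reward-to-go targets with the squared-error loss L(phi).
import torch
import torch.nn as nn

def fit_value_mc(value_net, optimizer, states, mc_returns, epochs=50):
    """states: (M, obs_dim) tensor of visited states s_{i,t};
    mc_returns: (M,) tensor of targets y_{i,t} = sum_{t' >= t} r(s_{i,t'}, a_{i,t'})."""
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)      # V_hat_phi(s)
        loss = 0.5 * loss_fn(pred, mc_returns)    # L(phi) = 1/2 ||V_hat - y||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return value_net
```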

2.2.2 Bootstrapped Value Estimation

$$y_{i,t}=r(s_{i,t},a_{i,t})+\hat{V}^\pi_\phi(s_{i,t+1})$$
(Figure: the same regression loop as above, with the bootstrapped target replacing the Monte-Carlo target.)

2.2.3 Target Value Summary

$$
\begin{aligned}
\text{Ideal target:} \quad & y_{i,t}=\sum_{t'=t}^T E_{\pi_\theta}\big[r(s_{i,t'},a_{i,t'})\big] \\
\text{MC target:} \quad & y_{i,t}=\sum_{t'=t}^T r(s_{i,t'},a_{i,t'}) \\
\text{Bootstrapped target:} \quad & y_{i,t}=r(s_{i,t},a_{i,t})+\gamma \hat{V}_\phi^\pi(s_{i,t+1})
\end{aligned}
$$
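A short sketch comparing the two practical targets; NumPy, with array shapes as illustrative assumptions (the discount defaults to 1 here, matching the undiscounted MC target above; section 2.3 adds the discount):

```python
# Regression targets for V_phi: Monte-Carlo reward-to-go vs. one-step bootstrap.
import numpy as np

def mc_targets(rewards, gamma=1.0):
    """y_t = sum_{t' >= t} gamma^(t'-t) r_t' for one trajectory."""
    y = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        y[t] = running
    return y

def bootstrapped_targets(rewards, next_values, gamma=0.99):
    """y_t = r_t + gamma * V_hat(s_{t+1}); next_values[t] = V_hat(s_{t+1})."""
    return rewards + gamma * next_values
```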

2.3 Actor-Critic with the Combined Improvements

$$
\nabla J(\theta)\approx\frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\bar{\pi}_{\theta'}(a_{i,t}|s_{i,t})}\nabla_\theta \log\pi_\theta(a_{i,t}|s_{i,t})\,A^\pi_n(s_{i,t},a_{i,t})
$$

  • Discount factor, e.g. $\gamma=0.99$:
    $$A^\pi(s_{i,t},a_{i,t})=\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})$$
    $$A^\pi(s_{i,t},a_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma \hat{V}_\phi(s_{i,t+1})$$
  • Baseline:
    $$A^\pi(s_{i,t},a_{i,t})=\sum_{t'=t}^T\gamma^{t'-t}r(s_{i,t'},a_{i,t'})-\hat{V}_\phi(s_{i,t})$$
    $$A^\pi(s_{i,t},a_{i,t})\approx r(s_{i,t},a_{i,t})+\gamma \hat{V}_\phi(s_{i,t+1})-\hat{V}_\phi(s_{i,t})$$
  • n-step return ($n=1$ recovers the bootstrapped estimate; see the sketch after this list):
    $$A_n^\pi(s_{i,t},a_{i,t})\approx \sum_{t'=t}^{t+n}\gamma^{t'-t}r(s_{t'},a_{t'})+\gamma^n\hat{V}^\pi_\phi(s_{t+n})-\hat{V}^\pi_\phi(s_t)$$
  • Note: the discount factor makes infinite-horizon trajectories tractable, and the resulting family of n-step estimators is closely related to GAE (Generalized Advantage Estimation), which blends them with exponential weights.
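A sketch of the n-step advantage and, as a labelled extra, the GAE blend mentioned in the note; NumPy, with shapes and the $\lambda$ parameter as illustrative assumptions. The n-step helper follows the standard convention of summing rewards over $t,\dots,t+n-1$, which differs by one reward term from the slide-style index above:

```python
# n-step advantage and GAE sketches.
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """A_n(s_t,a_t) ~= sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n}) - V(s_t).
    values[k] = V_hat(s_k) and must have one entry past the last reward."""
    n = min(n, len(rewards) - t)
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    return ret + gamma**n * values[t + n] - values[t]

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: exponentially weighted blend of n-step
    estimators, A_t = sum_l (gamma*lam)^l delta_{t+l}, delta_t = r_t + gamma V_{t+1} - V_t."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```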

3. Value-based Methods

(Figure: the same three-step loop, with the policy-improvement step replaced by acting greedily with respect to the fitted value function.)

  • Note: value-based methods try to skip the step of updating the policy via the objective gradient $\nabla_\theta J(\theta)$. Instead, they iterate on value functions directly to obtain the optimal policy's value functions $V^*(s)$ or $Q^*(s,a)$, and then read the policy off the value function.

3.1 Policy Iteration

  1. Evaluate the advantage $A^\pi(s,a)$ (policy evaluation):
    $$
    \begin{aligned}
    A^\pi(s,a) &= Q^\pi(s,a)-V^\pi(s) \\
    &= r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\big[V^\pi(s')\big]-V^\pi(s) \\
    &\approx r(s,a)+\gamma V^\pi(s')-V^\pi(s)
    \end{aligned}
    $$
  2. Use $A^\pi(s,a)$ to derive a better policy $\pi'$ (policy improvement), for example
    $$\pi'(s)=\arg\max_a A^\pi(s,a), \qquad \pi\leftarrow\pi'$$
  • Note: policy iteration (PI) alternates between policy evaluation and policy improvement (see the sketch below). The first step estimates the advantage from samples collected by interacting with the environment; the second step selectively increases the value or probability of good actions. When interacting with the environment, the exploration-exploitation trade-off must be respected, otherwise high-reward samples may never be discovered.
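A tabular policy-iteration sketch, assuming a small known MDP given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (rewards); this tabular setting is an illustrative assumption, not the fitted setting of the later sections:

```python
# Tabular policy iteration: alternate approximate policy evaluation and greedy
# policy improvement until the policy stops changing.
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_iters=100):
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)   # deterministic policy pi(s)
    V = np.zeros(S)
    while True:
        # Policy evaluation: V(s) <- r(s, pi(s)) + gamma * E_s'[V(s')]
        for _ in range(eval_iters):
            P_pi = P[np.arange(S), policy]                        # (S, S')
            V = R[np.arange(S), policy] + gamma * (P_pi * V).sum(axis=1)
        # Policy improvement: pi'(s) = argmax_a A(s,a) = argmax_a Q(s,a)
        Q = R + gamma * P @ V                                     # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```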

3.2 Value Iteration

Policy evaluation:
$$V^\pi(s)\leftarrow E_{a\sim\pi(a|s)}\Big[r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\big[V^\pi(s')\big]\Big]$$
Picking a particular action $a$:
$$Q^\pi(s,a)\leftarrow r(s,a)+\gamma E_{s'\sim p(s'|s,a)}\big[V^\pi(s')\big]$$

  1. Evaluate $Q^\pi(s,a)$ using the update above;
  2. Set $V^\pi(s)\leftarrow \max_a Q^\pi(s,a)$.
    (Figure: the two-step value-iteration loop.)
  • Note: the first step, updating the Q-values, uses the environment dynamics $p(s'|s,a)$; the second step simply takes the largest Q-value as the state value, which closes the iteration loop (see the tabular sketch below). The problem is that the dynamics are usually unknown.
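A tabular value-iteration sketch for a small known MDP, using the same `P`, `R` arrays as in the policy-iteration sketch above; purely illustrative:

```python
# Tabular value iteration: step 1 computes Q from the dynamics, step 2 takes
# the max over actions, repeated until V stops changing.
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V                     # Q(s,a) = r + gamma * E_s'[V(s')]
        V_new = Q.max(axis=1)                     # V(s) = max_a Q(s,a)
        if np.abs(V_new - V).max() < tol:
            return Q.argmax(axis=1), V_new        # greedy policy and values
        V = V_new
```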

3.3 Fitted Value Iteration

(Figure: fitted value iteration — compute targets $y_i=\max_{a_i}\big(r(s_i,a_i)+\gamma E[V_\phi(s_i')]\big)$ and regress $V_\phi$ onto them.)

  • Note: besides iterating the state value $V^\pi_\phi(s)$, one can also iterate the Q-value $Q^\pi_\phi(s,a)$ directly. Fitted value iteration is more stable than value iteration, but it is not an on-policy method; both still rely on the environment dynamics model, and when the dynamics are unknown we need Q-value iteration instead.

3.4 Fitted Q-Iteration

  • Fitted Q-iteration
    (Figure: full fitted Q-iteration — collect a dataset, set $y_i=r(s_i,a_i)+\gamma\max_{a'}Q_\phi(s_i',a')$, and fit $Q_\phi$ by regression, repeating for several inner iterations.)
  • Online Q-learning (online Q-iteration)
    (Figure: online Q-learning — take one action, observe $(s_i,a_i,s_i',r_i)$, compute the one-step target, and take one gradient step on $Q_\phi$.)
  • Note: in place of the dynamics-dependent term $E_{s'\sim p(s'|s,a)}\big[V_\phi(s')\big]$, both use $\max_{a'}Q_\phi(s_i',a_i')$ computed on the single sampled transition, which costs some accuracy. Full fitted Q-iteration can be run off-policy, and the fitted Q-values are relatively stable; online Q-learning takes one gradient step per collected sample, which is very unstable, and with a non-linear function approximator there is no theoretical convergence guarantee. Chapter 4 discusses how to make online Q-learning practical; a sketch of the fitted loop follows.
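A sketch of the full fitted Q-iteration loop; PyTorch, with the Q-network interface (state in, per-action values out) and the batch layout as illustrative assumptions:

```python
# Full fitted Q-iteration: alternate recomputing targets y = r + gamma * max_a' Q(s', a')
# with regressing Q_phi onto them; the batch may come from any behaviour policy.
import torch

def fitted_q_iteration(q_net, optimizer, batch,
                       num_target_updates=5, grad_steps=100, gamma=0.99):
    """batch: dict of tensors states (M,d), actions (M,), rewards (M,),
    next_states (M,d), dones (M,) collected off-policy."""
    for _ in range(num_target_updates):
        with torch.no_grad():
            max_next_q = q_net(batch["next_states"]).max(dim=1).values
            targets = batch["rewards"] + gamma * (1.0 - batch["dones"]) * max_next_q
        for _ in range(grad_steps):
            q_sa = q_net(batch["states"]).gather(
                1, batch["actions"].long().unsqueeze(1)).squeeze(1)
            loss = 0.5 * ((q_sa - targets) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return q_net
```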

3.5 Summary

  • Policy iteration is the more general scheme. When the policy-improvement step is $\pi'=\argmax_a Q^\pi(s,a)$, it reduces to value iteration, i.e. a value-based method; when policy improvement is done by gradient ascent, it becomes Actor-Critic; when we estimate Q-values, update by gradient, and constrain the step size, we get the advanced policy-gradient methods of Chapter 5.
  • The dependence on the environment dynamics in value iteration is removed in Q-iteration, at the cost of losing the theoretical convergence guarantee; fitted (regression-based) values are more stable than values obtained by single gradient steps.

4. Q-Value-based Methods

4.1 Improvements to Q-Iteration

(Figure: the online Q-learning loop, repeated here as the starting point.)
It has two problems:

  1. In step 1, the samples within one trajectory are highly correlated; they should be decorrelated by estimating with samples mixed from different trajectories. (Solution: replay buffer)
  2. In step 3, the target value $y_i$ changes with every update, and the update does not differentiate through the target, so it is not true gradient descent on a fixed objective. (Solution: target network)

4.1.1 Replay Buffer

(Figure: Q-learning with a replay buffer — transitions collected online are added to the buffer, and mini-batches are sampled from it for each update.)
Key point: store transitions in a buffer as they are collected online, and sample from the replay buffer at every update; a minimal sketch follows.
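A minimal replay-buffer sketch: a fixed-capacity FIFO store of transitions with uniform random mini-batch sampling. Pure Python/NumPy; the capacity and tuple layout are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)   # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)
```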

4.1.2 Target Network

(Figure: Q-learning with replay buffer and target network — targets are computed with the frozen parameters $\phi'$.)
Key point: compute the target with a separate target network $Q_{\phi'}(s_i',a_i')$ so that the target stays fixed for a while, and periodically update the target-network parameters with $\phi'\leftarrow\phi$; a sketch of both common update styles follows.
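A sketch of the target-network update: a hard copy every fixed number of steps (as in DQN), plus a Polyak/soft update as a labelled common alternative. PyTorch; illustrative only:

```python
import torch

def hard_update(target_net, online_net):
    """phi' <- phi, performed every fixed number of steps."""
    target_net.load_state_dict(online_net.state_dict())

def soft_update(target_net, online_net, tau=0.005):
    """phi' <- tau * phi + (1 - tau) * phi' (Polyak averaging)."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```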

4.2 Three Forms of Q-Learning

4.2.1 Online Q-learning

(Figure: the online Q-learning algorithm — one environment step, one target, one gradient step.)

4.2.2 DQN(N=1,K=1)

(Figure: the classic DQN algorithm — replay buffer plus target network, with one environment step and one gradient step per iteration, i.e. N=1, K=1.)

4.2.3 Fitted Q-learning

(Figure: fitted Q-learning — collect a larger batch of transitions, then alternate several target recomputations with many regression steps.)

4.2.4 A Brief Summary

(Figures: the general view — data collection into the buffer, target updates, and Q-function regression run as three interleaved processes; online Q-learning, DQN, and fitted Q-learning differ only in how often each process runs. A compact DQN-style loop sketch follows.)
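A compact DQN-style training-step sketch that ties together the replay buffer and target network above (N=1 environment step and K=1 gradient step per iteration). PyTorch plus a Gymnasium-style environment; the epsilon-greedy schedule, batch size, and loss choice are illustrative assumptions:

```python
import random
import torch
import torch.nn.functional as F

def dqn_step(env, obs, q_net, target_net, buffer, optimizer,
             epsilon=0.1, gamma=0.99, batch_size=64):
    # 1) Collect one transition with an epsilon-greedy behaviour policy.
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = q_net(torch.as_tensor(obs, dtype=torch.float32)).argmax().item()
    next_obs, reward, terminated, truncated, _ = env.step(action)
    buffer.add(obs, action, reward, next_obs, float(terminated))

    # 2)-3) Sample a mini-batch, compute targets with the frozen network, regress.
    if len(buffer) >= batch_size:
        s, a, r, s2, d = (torch.as_tensor(x, dtype=torch.float32)
                          for x in buffer.sample(batch_size))
        with torch.no_grad():
            y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
        q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = F.smooth_l1_loss(q_sa, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Hand back the next observation, resetting at episode end.
    return next_obs if not (terminated or truncated) else env.reset()[0]
```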

4.3 Practical Improvements to Deep Q-Networks

4.3.1 Double Q-learning

(Figure: the standard target vs. the Double-DQN target.)
Key point: the action in the target is selected by the current network rather than the target network, while the target network still evaluates that action; decoupling selection from evaluation reduces the overestimation bias introduced by the max operator. A sketch follows.
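A sketch of the Double-DQN target: the online network chooses $a'=\arg\max_a Q_\phi(s',a)$ and the target network evaluates it. PyTorch; tensor shapes are illustrative assumptions:

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection: phi
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation: phi'
        return rewards + gamma * (1.0 - dones) * next_q
```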

4.3.2 N-step Returns

$$y_{j,t}=\sum_{t'=t}^{t+N-1}\gamma^{t'-t}r_{j,t'}+\gamma^N\max_{a_{j,t+N}}Q_{\phi'}(s_{j,t+N},a_{j,t+N})$$

4.3.3 Dueling Structure

(Figure: a single Q-network head vs. the dueling architecture with separate value and advantage streams.)
Key point: instead of a single network head that directly outputs $Q^\pi(s,a)$, the dueling structure splits the network into a state-value stream $V^\pi(s)$ and an action-advantage stream $A^\pi(s,a)$ and combines them as $Q^\pi(s,a)=V^\pi(s)+A^\pi(s,a)$; in the paper the advantage stream is centred, e.g. by subtracting its mean over actions, to make the decomposition identifiable. See the paper below and the sketch after it.
Dueling Network Architectures for Deep Reinforcement Learning
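A sketch of a dueling head: shared torso, separate $V(s)$ and $A(s,a)$ streams, combined with the mean-centred advantage. PyTorch; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)        # V(s)
        self.adv_head = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, obs):
        h = self.torso(obs)
        v = self.value_head(h)                        # (B, 1)
        a = self.adv_head(h)                          # (B, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)    # Q(s,a) = V + (A - mean_a A)
```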

A Concrete Double DQN Example

(Figure: full pseudocode for Double DQN with a replay buffer and a target network.)

4.4 Q-Learning with Continuous Actions

(Figure: where the max over actions appears — in the greedy policy and in the target.)
The Q-learning variants above assume discrete actions, which makes $Q^\pi(s,a)$ easy to represent and iterate. With continuous actions, the difficulty lies precisely in the max over actions, and there are three ways to handle it.

4.4.1 Stochastic Optimization

(Figure: approximating the max with sampled actions, e.g. $\max_a Q(s,a)\approx\max\{Q(s,a_1),\dots,Q(s,a_N)\}$.)
The main idea of stochastic optimization is to sample a set of candidate parameters or actions, score them with an evaluation criterion, and keep the candidates that score best. For example, CMA-ES (Covariance Matrix Adaptation Evolution Strategy), one of the strongest evolution strategies, can be derived as a natural-gradient method within the IGO (Information Geometric Optimization) framework. Gradient-free methods generally need more samples or computation than gradient-based ones, but CMA-ES works well on moderately sized, complex optimization problems (roughly 3 to 300 variables). Beyond that, evolution strategies can serve as a scalable complement to deep RL algorithms; see the two papers below, and the simple sampling sketch after them.

Evolution Strategies as a Scalable Alternative to Reinforcement Learning 2017

The CMA-ES: A Tutorial 2016
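A sketch of approximating $\max_a Q(s,a)$ for continuous actions by stochastic optimization: simple random shooting plus an optional CEM-style refinement (a lighter-weight cousin of CMA-ES). NumPy; the `q_fn` interface, action bounds, and population sizes are illustrative assumptions:

```python
import numpy as np

def max_q_random_shooting(q_fn, state, act_low, act_high, num_samples=64):
    """q_fn(state, actions) -> (num_samples,) Q-values; actions sampled uniformly."""
    actions = np.random.uniform(act_low, act_high,
                                size=(num_samples, len(act_low)))
    q_vals = q_fn(state, actions)
    best = np.argmax(q_vals)
    return actions[best], q_vals[best]

def max_q_cem(q_fn, state, act_dim, iters=3, pop=64, elite_frac=0.1):
    """Cross-entropy method: refit a Gaussian to the elite actions each iteration."""
    mean, std = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(iters):
        actions = np.random.randn(pop, act_dim) * std + mean
        q_vals = q_fn(state, actions)
        elites = actions[np.argsort(q_vals)[-int(pop * elite_frac):]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```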

4.4.2 Function Classes That Are Easy to Optimize

Replace the Q-function with a functional form that is easy to maximize over actions, e.g. NAF (Normalized Advantage Functions), where the advantage is quadratic in the action so the argmax is available in closed form; this makes Q-learning applicable to continuous actions. For the details of NAF in continuous deep Q-learning, see the ICML 2016 paper below.
Continuous Deep Q-Learning with Model-based Acceleration 2016

4.4.3 Learning a Second Actor

Train a second network, an actor $\mu_\theta(s)$, to stand in for the max over actions, and update the actor and the Q-function together:
$$\mu_\theta(s) \approx \arg\max_a Q_\phi(s,a), \qquad \max_a Q_\phi(s,a)=Q_\phi\big(s,\arg\max_a Q_\phi(s,a)\big)\approx Q_\phi\big(s,\mu_\theta(s)\big)$$
(Figure: the resulting DDPG-style algorithm — Q-learning with a deterministic actor, replay buffer, and target networks; see the paper below and the sketch after it.)
Continuous Control With Deep Reinforcement Learning
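A sketch of the "second actor" update in the DDPG style: the critic target uses $Q_{\phi'}(s',\mu_{\theta'}(s'))$ in place of the max, and the actor ascends $Q_\phi(s,\mu_\theta(s))$. PyTorch; the actor/critic interfaces and batch layout are illustrative assumptions:

```python
import torch

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    s, a, r, s2, d = batch  # tensors: states, actions, rewards, next_states, dones

    # Critic: y = r + gamma * Q_phi'(s', mu_theta'(s'))
    with torch.no_grad():
        y = r + gamma * (1 - d) * target_critic(s2, target_actor(s2)).squeeze(-1)
    critic_loss = ((critic(s, a).squeeze(-1) - y) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q_phi(s, mu_theta(s)) (minimize its negative)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```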

5. Advanced Policy Gradient

Goal: starting from the objective $J(\theta)$, find a better $\theta'$ such that $J(\theta')-J(\theta)\geq 0$, with the improvement as large as possible.
We know $J(\theta)=E_{\tau\sim p_\theta(\tau)}\big[\sum_t \gamma^t r(s_t,a_t)\big]=E_{s_0\sim p(s_0)}\big[V^{\pi_\theta}(s_0)\big]$.
Therefore:
$$
\begin{aligned}
J(\theta')-J(\theta) &= J(\theta')-E_{s_0\sim p(s_0)}\big[V^{\pi_\theta}(s_0)\big] \quad (1) \\
&= J(\theta')-E_{\tau\sim p_{\theta'}(\tau)}\big[V^{\pi_\theta}(s_0)\big] \quad (2) \\
&= J(\theta')-E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t V^{\pi_\theta}(s_t)-\sum_{t=1}^\infty\gamma^t V^{\pi_\theta}(s_t)\Big] \\
&= J(\theta')+E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t\big(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\big)\Big] \\
&= E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t r(s_t,a_t)\Big]+E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t\big(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\big)\Big] \\
&= E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t\big(r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\big)\Big] \\
&= E_{\tau\sim p_{\theta'}(\tau)}\Big[\sum_{t=0}^\infty\gamma^t A^{\pi_\theta}(s_t,a_t)\Big] \\
&= \sum_{t=0}^\infty E_{s_t\sim p_{\theta'}(s_t)}\Big[E_{a_t\sim \pi_{\theta'}(a_t|s_t)}\big[\gamma^t A^{\pi_\theta}(s_t,a_t)\big]\Big] \quad (3) \\
&= \sum_{t=0}^\infty E_{s_t\sim p_{\theta'}(s_t)}\Big[E_{a_t\sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\Big] \quad (4) \\
&\approx \sum_{t=0}^\infty E_{s_t\sim p_{\theta}(s_t)}\Big[E_{a_t\sim \pi_{\theta}(a_t|s_t)}\Big[\frac{\pi_{\theta'}(a_t|s_t)}{\pi_\theta(a_t|s_t)}\gamma^t A^{\pi_\theta}(s_t,a_t)\Big]\Big] \quad (5) \\
&= \bar{A}(\theta')
\end{aligned}
$$
(1) to (2): $s_0\sim p(s_0)$ is the initial-state distribution, which is the same under $\theta$ and $\theta'$; since the inner expression depends only on $s_0$, the expectation can equally be taken over full trajectories $\tau\sim p_{\theta'}(\tau)$.
(3) to (4): importance sampling.
Our goal therefore becomes
$$\theta'\leftarrow \argmax_{\theta'} \bar{A}(\theta')\approx J(\theta')-J(\theta)$$
How can we make the approximation in (5) hold? If the new policy $\pi_{\theta'}(a_t|s_t)$ stays close to the old policy $\pi_\theta(a_t|s_t)$, then $p_{\theta'}(s_t)\approx p_\theta(s_t)$. This amounts to adding a constraint $|p_{\theta'}(s_t)-p_\theta(s_t)|\leq\epsilon$, which can be enforced through the equivalent constraint $|\pi_{\theta'}(a_t|s_t)-\pi_\theta(a_t|s_t)|\leq\epsilon$ on the policies themselves; a short sketch of estimating the surrogate $\bar{A}(\theta')$ follows.
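A sketch of estimating the surrogate objective $\bar{A}(\theta')$ from samples drawn under the old policy: the mean over $(s_t,a_t)$ of the importance ratio times the advantage, which is the quantity TRPO/PPO maximize subject to a closeness constraint (commonly a KL bound). PyTorch; the per-sample tensors are illustrative assumptions:

```python
import torch

def surrogate_objective(logp_new, logp_old, advantages):
    """logp_new: log pi_theta'(a_t|s_t), differentiable w.r.t. theta';
    logp_old: log pi_theta(a_t|s_t), treated as constant;
    advantages: A^{pi_theta}(s_t, a_t) estimates for the same samples."""
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * advantages).mean()

def kl_estimate(logp_new, logp_old):
    """Sample-based estimate of KL(pi_old || pi_new), used to keep theta' close
    to theta when maximizing the surrogate."""
    return (logp_old.detach() - logp_new).mean()
```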
