Reinforcement Learning: Policy Gradient with a Baseline (REINFORCE and A2C)

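  • 1. Deriving the baseline
  • 2. Monte Carlo approximation of the policy gradient
  • 3. Choosing the baseline
  • 4. The REINFORCE algorithm
    • 4.1 Basic concepts
    • 4.2 Training procedure
  • 5. The A2C algorithm (Advantage Actor-Critic)
    • 5.1 Network structure and training procedure
    • 5.2 Mathematical derivation
      • 5.2.1 Definitions
      • 5.2.2 Theorem 1 (relation between the action value and the state value)
      • 5.2.3 Theorem 2 (relation between state values at consecutive time steps)
      • 5.2.4 Updating the policy network
      • 5.2.5 Updating the value network
    • 5.3 Interpreting the policy gradient
  • 6. REINFORCE vs. A2C
    • 6.1 A2C
      • 6.1.1 One-step TD target
      • 6.1.2 Multi-step TD target
    • 6.2 REINFORCE
    • 6.3 REINFORCE as a special case of A2C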

1. Deriving the baseline

  • Policy network: $\pi(a|s;\theta)$
  • State-value function: $V_\pi(s)=E_{A\sim\pi}[Q_\pi(s,A)]=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)$
  • Policy gradient: $\frac{\partial V_\pi(s)}{\partial \theta}=E_{A\sim\pi}\left[Q_\pi(s,A)\cdot\frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\right]$
  • Let $b$ be any function that does not depend on the action $A$ (it may depend on the state $s$). Then:
    $$\begin{aligned}
    E_{A\sim\pi}\left[b\cdot \frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\right]
    &= b\cdot E_{A\sim\pi}\left[\frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\right]\\
    &= b\cdot \sum_a \pi(a|s;\theta)\cdot \frac{\partial \log\pi(a|s;\theta)}{\partial \theta}\\
    &= b\cdot \sum_a \pi(a|s;\theta)\cdot \frac{1}{\pi(a|s;\theta)}\cdot \frac{\partial \pi(a|s;\theta)}{\partial \theta}\\
    &= b\cdot \frac{\partial \sum_a \pi(a|s;\theta)}{\partial \theta}\\
    &= b\cdot\frac{\partial 1}{\partial \theta}\\
    &= 0
    \end{aligned}$$
    Hence, if $b$ is independent of the action $A$: $E_{A\sim\pi}\left[b\cdot\frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\right]=0$
  • The policy gradient with a baseline is therefore:
    $$\begin{aligned}
    \frac{\partial V_\pi(s)}{\partial \theta}
    &= E_{A\sim\pi}\left[Q_\pi(s,A)\cdot\frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\right]-E_{A\sim\pi}\left[b\cdot\frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\right]\\
    &= E_{A\sim\pi}\left[\frac{\partial \log\pi(A|s;\theta)}{\partial \theta}\cdot(Q_\pi(s,A)-b)\right]
    \end{aligned}$$
    The baseline $b$ does not change the expectation, but a well-chosen $b$ reduces the variance of the Monte Carlo approximation and speeds up convergence.
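The zero-expectation property can be checked numerically. The sketch below is only an illustration under stated assumptions: the policy is a softmax over three hypothetical logits (so $\theta$ is the logit vector itself), and the names `softmax`, `grad_log_pi`, and `expected_baseline_term` are made up for this example. It enumerates all actions rather than sampling.

```python
import math

def softmax(logits):
    """Return softmax probabilities for a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(logits, a):
    """Gradient of log pi(a) w.r.t. the logits: one_hot(a) - pi."""
    pi = softmax(logits)
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(logits))]

def expected_baseline_term(logits, b):
    """E_{A~pi}[ b * d log pi(A) / d theta ], computed by enumerating actions."""
    pi = softmax(logits)
    total = [0.0] * len(logits)
    for a, p in enumerate(pi):
        g = grad_log_pi(logits, a)
        for k in range(len(logits)):
            total[k] += p * b * g[k]
    return total

theta = [0.3, -1.2, 2.0]  # hypothetical logits for a single state
print(expected_baseline_term(theta, b=5.0))  # every entry is zero up to rounding
```

Any constant `b` gives the same result, which is why subtracting it leaves the expected gradient unchanged.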

2. Monte Carlo approximation of the policy gradient

  • The policy gradient with a baseline is $\frac{\partial V_\pi(s_t)}{\partial \theta}=E_{A_t\sim\pi}[g(A_t)]$, where $g(A_t)=\frac{\partial \log\pi(A_t|s_t;\theta)}{\partial \theta}\cdot(Q_\pi(s_t,A_t)-b)$
  • Sample the action at time $t$ from the policy: $a_t\sim\pi(\cdot|s_t;\theta)$
  • Then $g(a_t)$ is an unbiased estimate of the policy gradient.
  • Stochastic policy gradient: $g(a_t)=(Q_\pi(s_t,a_t)-b)\cdot\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}$
  • Gradient ascent step: $\theta\gets\theta+\beta\cdot g(a_t)$
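As a rough illustration of the sampling step, the sketch below draws actions from a softmax policy and averages the resulting $g(a_t)$. Everything concrete here is an assumption made for the example: the logits, the table of $Q$ values (treated as known and fixed), and the function names. For a softmax over logits the exact gradient has the closed form $\pi_k\,(Q_k - V)$, which the average should approach.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_gradient(logits, q_values, b, rng):
    """Draw a ~ pi and return g(a) = (Q(a) - b) * d log pi(a) / d theta."""
    pi = softmax(logits)
    a = rng.choices(range(len(pi)), weights=pi)[0]
    return [(q_values[a] - b) * ((1.0 if k == a else 0.0) - pi[k])
            for k in range(len(pi))]

def exact_gradient(logits, q_values):
    """Closed form for softmax logits: dV/dtheta_k = pi_k * (Q_k - V)."""
    pi = softmax(logits)
    v = sum(p * q for p, q in zip(pi, q_values))
    return [pi[k] * (q_values[k] - v) for k in range(len(pi))]

def average_gradient(logits, q_values, b, n_samples, seed=0):
    """Average many single-sample estimates g(a_t)."""
    rng = random.Random(seed)
    total = [0.0] * len(logits)
    for _ in range(n_samples):
        g = sample_gradient(logits, q_values, b, rng)
        total = [t + x for t, x in zip(total, g)]
    return [t / n_samples for t in total]

theta = [0.5, -0.2, 0.1]   # hypothetical logits
q = [1.0, 2.0, 3.0]        # hypothetical action values, assumed known
estimate = average_gradient(theta, q, b=2.0, n_samples=20000)
beta = 0.1                 # hypothetical learning rate
theta_new = [t + beta * g for t, g in zip(theta, estimate)]  # gradient ascent
print(estimate, exact_gradient(theta, q))
```

In practice a single sample is used per update; the averaging here only makes the unbiasedness visible.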

3. Choosing the baseline

  1. Standard policy gradient: $b=0$
  2. The state-value function, which does not depend on the action $A_t$ and is close to the action-value function: $b=V_\pi(s_t)$, where $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)]$
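To make the effect of the two choices concrete, the following sketch computes the exact variance of the single-sample estimator $g(A_t)$ for $b=0$ and $b=V_\pi(s_t)$ on a toy example. The uniform softmax policy and the $Q$ values are hypothetical numbers chosen for illustration, and the result holds for this example only: $V_\pi$ is a convenient baseline, not necessarily the variance-minimizing one.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_variance(logits, q_values, b):
    """Total variance (trace of the covariance) of g(A) = (Q(A)-b) * grad log pi(A)."""
    pi = softmax(logits)
    n = len(pi)
    # g(a) for every action a, as a vector over the logits
    grads = [[(q_values[a] - b) * ((1.0 if k == a else 0.0) - pi[k])
              for k in range(n)] for a in range(n)]
    mean = [sum(pi[a] * grads[a][k] for a in range(n)) for k in range(n)]
    return sum(pi[a] * sum((grads[a][k] - mean[k]) ** 2 for k in range(n))
               for a in range(n))

logits = [0.0, 0.0, 0.0]        # hypothetical uniform policy
q_values = [10.0, 11.0, 12.0]   # hypothetical action values
pi = softmax(logits)
v = sum(p * q for p, q in zip(pi, q_values))  # V_pi(s) = E_A[Q(s, A)]

var_no_baseline = estimator_variance(logits, q_values, 0.0)
var_state_value = estimator_variance(logits, q_values, v)
print(var_no_baseline, var_state_value)
```

The mean of the estimator is identical in both cases; only the spread around it changes.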

4. The REINFORCE algorithm

4.1 Basic concepts

  • Discounted return: $U_t=R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot R_{t+2}+\cdots$
  • Action-value function: $Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]$
  • State-value function: $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)|s_t]$
  • Policy gradient with the baseline $b=V_\pi(s_t)$: $\frac{\partial V_\pi(s_t)}{\partial\theta}=E_{A_t\sim\pi}[g(A_t)]=E_{A_t\sim\pi}\left[\frac{\partial \log\pi(A_t|s_t;\theta)}{\partial \theta}\cdot(Q_\pi(s_t,A_t)-V_\pi(s_t))\right]$
  • Sample the action to obtain a Monte Carlo approximation, which is unbiased: $a_t\sim\pi(\cdot|s_t;\theta)$, $g(a_t)=(Q_\pi(s_t,a_t)-V_\pi(s_t))\cdot\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}$
  • Approximate the action-value function by Monte Carlo (the key step of REINFORCE). Since $Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]$, observe the trajectory $s_t,a_t,r_t,s_{t+1},a_{t+1},r_{t+1},\ldots,s_n,a_n,r_n$ up to the end of the episode and use the observed return: $Q_\pi(s_t,a_t)\approx u_t=\sum_{i=t}^n\gamma^{i-t}r_i$
  • Approximate the state-value function with a neural network: $v(s_t;W)\approx V_\pi(s_t)$
  • The approximate policy gradient is: $\frac{\partial V_\pi(s_t)}{\partial\theta}\approx\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}\cdot(u_t-v(s_t;W))$
    The derivation above involves three approximations:
  1. Sampling the action is a Monte Carlo approximation.
  2. Replacing the action-value function by the observed return is a Monte Carlo approximation.
  3. The state-value function is approximated by a neural network.
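The observed returns $u_t$ for every step of an episode can be computed in one backward pass using $u_t=r_t+\gamma\,u_{t+1}$. A minimal sketch (the function name `discounted_returns` is chosen here for illustration):

```python
def discounted_returns(rewards, gamma):
    """Return [u_1, ..., u_n] with u_t = sum_{i=t}^{n} gamma^(i-t) * r_i."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each u_t reuses u_{t+1}: u_t = r_t + gamma * u_{t+1}
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```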

4.2 Training procedure

  • Policy network: $\pi(a_t|s_t;\theta)$
  • Value network: $v(s_t;W)$
  • The two networks can share parameters.

  1. Play one complete episode and record the trajectory: $\{(s_1,a_1,r_1),(s_2,a_2,r_2),\ldots,(s_n,a_n,r_n)\}$
  2. Compute the return, which approximates the action value, and the error: $u_t=\sum_{i=t}^n \gamma^{i-t}\cdot r_i$, $\delta_t=v(s_t;W)-u_t$
  3. Update the policy network with the policy gradient: $\theta\gets\theta-\beta\cdot\delta_t\cdot \frac{\partial \log\pi(a_t|s_t;\theta)}{\partial\theta}$ (the minus sign appears because $\delta_t=v(s_t;W)-u_t$ is the negative of the advantage estimate $u_t-v(s_t;W)$, so this is still gradient ascent)
  4. Update the value network by gradient descent: $W\gets W-\alpha\cdot \delta_t\cdot\frac{\partial v(s_t;W)}{\partial W}$
  5. Because the trajectory has length $n$, steps 2 to 4 can be carried out for each $t=1,\ldots,n$, giving $n$ updates of the networks per episode.
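The training loop can be sketched in plain Python. To stay self-contained, this example swaps the two neural networks for tables, which is an assumption made purely for illustration: `theta[s]` holds softmax logits per state and `w[s]` holds the value estimate, so $\partial\log\pi(a|s)/\partial\theta_{s,k}=\mathbb{1}[k=a]-\pi(k|s)$ and $\partial v/\partial w_s=1$. The function names and default hyperparameters are likewise hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_returns(rewards, gamma):
    returns, running = [0.0] * len(rewards), 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

def reinforce_update(theta, w, trajectory, gamma=0.99, beta=0.1, alpha=0.1):
    """One pass of REINFORCE with baseline over a finished episode.

    theta: dict state -> list of action logits (tabular policy, an assumption)
    w:     dict state -> value estimate v(s; w) (tabular critic, an assumption)
    trajectory: list of (state, action, reward) tuples
    """
    returns = discounted_returns([r for _, _, r in trajectory], gamma)
    for (s, a, _), u in zip(trajectory, returns):
        delta = w[s] - u                      # delta_t = v(s_t) - u_t
        pi = softmax(theta[s])
        for k in range(len(pi)):
            grad_log = (1.0 if k == a else 0.0) - pi[k]
            theta[s][k] -= beta * delta * grad_log   # policy step
        w[s] -= alpha * delta                 # value step (dv/dw = 1)
    return theta, w

theta = {0: [0.0, 0.0]}
w = {0: 0.0}
reinforce_update(theta, w, [(0, 0, 1.0)])   # action 0 earned reward 1
print(theta, w)
```

After this single update the logit of the rewarded action rises and the value estimate moves toward the observed return.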

5. The A2C algorithm (Advantage Actor-Critic)

5.1 Network structure and training procedure

  • Policy network (actor): $\pi(a_t|s_t;\theta)$
  • Value network (critic): $v(s_t;W)$
  • The two networks can share parameters.

  1. Observe a transition: $(s_t,a_t,r_t,s_{t+1})$
  2. TD target: $y_t=r_t+\gamma\cdot v(s_{t+1};W)$
  3. TD error: $\delta_t = v(s_t;W)-y_t$
  4. Update the policy network: $\theta\gets\theta-\beta\cdot\delta_t\cdot\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial\theta}$
  5. Update the value network: $W\gets W-\alpha\cdot\delta_t\cdot\frac{\partial v(s_t;W)}{\partial W}$
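The five steps map onto a short update function. As a simplifying assumption, this sketch uses tables instead of neural networks (`theta[s]` are softmax logits, `w[s]` is the value estimate), and it adds a `done` flag, not part of the steps above, that treats the value of a terminal state as zero. All names and hyperparameter values are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def a2c_update(theta, w, transition, gamma=0.99, beta=0.1, alpha=0.1, done=False):
    """One A2C step on a single transition (s_t, a_t, r_t, s_{t+1}).

    theta: dict state -> list of action logits (tabular actor, an assumption)
    w:     dict state -> value estimate (tabular critic, an assumption)
    done:  hypothetical flag; if True the bootstrap value is taken as 0
    Returns the TD error delta_t.
    """
    s, a, r, s_next = transition
    bootstrap = 0.0 if done else w[s_next]
    y = r + gamma * bootstrap          # TD target
    delta = w[s] - y                   # TD error
    pi = softmax(theta[s])
    for k in range(len(pi)):
        grad_log = (1.0 if k == a else 0.0) - pi[k]
        theta[s][k] -= beta * delta * grad_log   # actor update
    w[s] -= alpha * delta              # critic update (dv/dw = 1)
    return delta

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
w = {0: 0.0, 1: 2.0}
delta = a2c_update(theta, w, (0, 1, 1.0, 1), gamma=0.5)
print(delta, theta[0], w[0])
```

Unlike REINFORCE, the update needs only one transition, so it can run at every step without waiting for the episode to end.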

5.2 Mathematical derivation

5.2.1 Definitions

  • Discounted return: $U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\cdots$
  • Action-value function: $Q_\pi(s_t,a_t)=E[U_t|s_t,a_t]$
  • State-value function: $V_\pi(s_t)=E_{A_t}[Q_\pi(s_t,A_t)|s_t]$

5.2.2 Theorem 1 (relation between the action value and the state value)

$$\begin{aligned}
Q_\pi(s_t,a_t)&=E[U_t|s_t,a_t]\\
&=E_{S_{t+1},A_{t+1}}[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]\\
&=E_{S_{t+1}}\left[R_t+\gamma\cdot E_{A_{t+1}}[Q_\pi(S_{t+1},A_{t+1})]\right]\\
&=E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]
\end{aligned}$$
Its Monte Carlo approximation, $Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1})$, can be used to train the policy network.
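The step from $E[U_t|s_t,a_t]$ to the expectation over $(S_{t+1},A_{t+1})$ relies on the recursive form of the discounted return. Spelled out, using only the definitions from 5.2.1:

```latex
\begin{aligned}
U_t &= R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots
     = R_t + \gamma\,\bigl(R_{t+1} + \gamma R_{t+2} + \cdots\bigr)
     = R_t + \gamma\,U_{t+1},\\
Q_\pi(s_t,a_t) &= E\bigl[R_t + \gamma\,U_{t+1} \,\big|\, s_t,a_t\bigr]\\
 &= E_{S_{t+1},A_{t+1}}\Bigl[R_t + \gamma\,E\bigl[U_{t+1} \,\big|\, S_{t+1},A_{t+1}\bigr]\Bigr]\\
 &= E_{S_{t+1},A_{t+1}}\bigl[R_t + \gamma\,Q_\pi(S_{t+1},A_{t+1})\bigr].
\end{aligned}
```

The inner conditional expectation is exactly the definition of $Q_\pi$ one time step later.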

5.2.3 Theorem 2 (relation between state values at consecutive time steps)

$$\begin{aligned}
V_\pi(s_t)&=E_{A_t}[Q_\pi(s_t,A_t)]\\
&=E_{A_t}\left[E_{S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]\right]\\
&=E_{A_t,S_{t+1}}[R_t+\gamma\cdot V_\pi(S_{t+1})]
\end{aligned}$$
Its Monte Carlo approximation, $V_\pi(s_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1})$, can be used to train the value network.

5.2.4 Updating the policy network

  • The stochastic policy gradient is $g(a_t)=\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}\cdot(Q_\pi(s_t,a_t)-V_\pi(s_t))$. By Theorem 1 and the value network, $Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1})\approx r_t+\gamma\cdot v(s_{t+1};W)=y_t$ and $V_\pi(s_t)\approx v(s_t;W)$, which gives the update $\theta\gets \theta +\beta\cdot(y_t-v(s_t;W))\cdot \frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}$. With $\delta_t=v(s_t;W)-y_t$ this is the same as $\theta\gets\theta-\beta\cdot\delta_t\cdot\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}$.

5.2.5 Updating the value network

By Theorem 2, $V_\pi(s_t)\approx r_t+\gamma\cdot V_\pi(s_{t+1})$. Replacing $V_\pi$ with the value network gives $v(s_t;W)\approx r_t+\gamma\cdot v(s_{t+1};W)=y_t$.

  1. TD error: $\delta_t=v(s_t;W)-y_t$
  2. Gradient, treating the TD target $y_t$ as a constant: $\frac{\partial\, \frac{1}{2}\delta_t^2}{\partial W}=\delta_t\cdot \frac{\partial v(s_t;W)}{\partial W}$
  3. Gradient-descent update: $W\gets W-\alpha\cdot \delta_t\cdot \frac{\partial v(s_t;W)}{\partial W}$
    These notes were written while following a tutorial video on Bilibili.

5.3 Interpreting the policy gradient

$g(a_t)=(r_t+\gamma\cdot v(s_{t+1};W)-v(s_t;W))\cdot\frac{\partial \log\pi(a_t|s_t;\theta)}{\partial \theta}$
Here $r_t+\gamma\cdot v(s_{t+1};W)$ is the value estimate made after taking action $a_t$ and observing the reward, while $v(s_t;W)$ is the estimate made before acting. Their difference reflects the advantage of the action that was taken.

6. REINFORCE vs. A2C

6.1 A2C

6.1.1 One-step TD target

  • Observe one transition: $(s_t,a_t,r_t,s_{t+1})$
  • $y_t=r_t+\gamma\cdot v(s_{t+1};W)$

6.1.2 Multi-step TD target

  • Observe $m$ transitions: $\{(s_{t+i},a_{t+i},r_{t+i},s_{t+i+1})\}_{i=0}^{m-1}$
  • $y_t = \sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot v(s_{t+m};W)$
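The multi-step target is a short sum plus a discounted bootstrap value. In this sketch the function name and argument layout are chosen for illustration: `rewards` holds $r_t,\ldots,r_{t+m-1}$ and `v_bootstrap` is the critic's estimate for the state reached after those $m$ transitions.

```python
def multi_step_td_target(rewards, v_bootstrap, gamma):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * v_bootstrap."""
    m = len(rewards)
    target = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return target + (gamma ** m) * v_bootstrap

# m = 1 reduces to the one-step target r_t + gamma * v(s_{t+1})
print(multi_step_td_target([1.0], 4.0, 0.5))       # 3.0
# m = 2: 1 + 0.5 * 2 + 0.25 * 4
print(multi_step_td_target([1.0, 2.0], 4.0, 0.5))  # 3.0
```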

6.2 REINFORCE

  • Return: $u_t = \sum_{i=t}^n \gamma^{i-t}\cdot r_i$
  • Error: $\delta_t=v(s_t;W)-u_t$

6.3 REINFORCE as a special case of A2C

The TD target of multi-step A2C is: $y_t = \sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot v(s_{t+m};W)$
When all rewards up to the end of the episode are used, no state remains to bootstrap from, so the value term drops out and: $y_t=u_t=\sum_{i=t}^n \gamma^{i-t}\cdot r_i$
Hence REINFORCE is a special case of A2C.
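The equivalence is easy to confirm on a toy episode. The sketch below assumes that the value after the final step is zero (a terminal state); the reward sequence and helper names are made up for the example.

```python
def multi_step_td_target(rewards, v_bootstrap, gamma):
    """Multi-step TD target: discounted rewards plus discounted bootstrap value."""
    m = len(rewards)
    return sum((gamma ** i) * r for i, r in enumerate(rewards)) + (gamma ** m) * v_bootstrap

def discounted_return(rewards, gamma):
    """REINFORCE return u_t for rewards r_t, ..., r_n."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

rewards_to_end = [1.0, 0.0, 2.0]   # hypothetical r_t, ..., r_n until the episode ends
gamma = 0.9

# Using every remaining reward, with zero value after the terminal step (assumption)
y_full = multi_step_td_target(rewards_to_end, 0.0, gamma)
u = discounted_return(rewards_to_end, gamma)
print(y_full, u)   # the two numbers coincide
```

With fewer than all remaining rewards the two quantities differ, and the gap is what the critic's bootstrap value is meant to fill.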

by CyrusMay 2022 04 11
