Reinforcement Learning: Deriving the (Optimal) Bellman Equations and Understanding the (Optimal) Action-Value and State-Value Functions

In reinforcement learning, the first thing to understand is the definition of the (discounted) return:
$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{n-t} R_n$
Here $R$ denotes the reward. $R_t$ depends on the current state $S_t$, the current action $A_t$, and the next state $S_{t+1}$ (or, in a simplified view, only on the current state $S_t$ and the current action $A_t$). That is,
$R_t = r(S_t, A_t, S_{t+1})$. Uppercase letters denote random variables; lowercase letters denote quantities that have already been observed and carry no randomness.
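As a concrete illustration of this definition, here is a minimal Python sketch (not from the original post; the reward list and $\gamma = 0.9$ are made-up numbers) that computes the discounted return from a sequence of observed rewards by accumulating backwards, which is exactly the recursion $U_t = R_t + \gamma U_{t+1}$ used later in the derivation.

```python
# A minimal sketch (toy numbers): computing the discounted return U_t
# from a list of observed rewards, assuming gamma = 0.9.
def discounted_return(rewards, gamma=0.9):
    """Compute U_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    u = 0.0
    # Iterate from the last reward backwards: U_k = r_k + gamma * U_{k+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```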

The action-value function $Q_{\pi}(s_t,a_t)$ depends on the current policy $\pi$, the current action $a_t$, and the current state $s_t$. It is the expectation of the return: $Q_{\pi}(s_t,a_t)=\mathbb{E}_{S_{t+1},A_{t+1},\dots,S_n,A_n}[U_t \mid S_t=s_t, A_t=a_t]$. In other words, the expectation is taken over the future states and actions $S_{t+1},A_{t+1},\dots,S_n,A_n$, which removes the randomness of those states and actions.

The state-value function $V_{\pi}(s_t)$ depends on the current policy $\pi$ and the current state $s_t$. It evaluates how good the state $s_t$ is under the current policy $\pi$; taking the expectation removes the randomness of the action:
$V_{\pi}(s_t)=\sum\limits_{a\in\mathcal{A}}\pi(a\mid s_t)\, Q_{\pi}(s_t,a)$
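In the tabular case this formula is just a weighted average over actions. Below is a minimal sketch (the $Q$ and $\pi$ tables are invented toy numbers, not from the post) that computes $V_{\pi}$ from $Q_{\pi}$ and the policy.

```python
import numpy as np

# A minimal sketch (toy numbers): V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
# for a tabular problem with 2 states and 3 actions.
Q = np.array([[1.0, 2.0, 0.5],    # Q_pi(s=0, a)
              [0.0, 1.5, 3.0]])   # Q_pi(s=1, a)
pi = np.array([[0.2, 0.5, 0.3],   # pi(a | s=0)
               [0.1, 0.1, 0.8]])  # pi(a | s=1)

V = (pi * Q).sum(axis=1)          # expectation over actions, one value per state
print(V)                          # [1.35 2.55]
```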

The optimal action-value function $Q_*(s_t,a_t)$ is the value obtained by taking action $a_t$ in state $s_t$ when the policy is optimal. This value is necessarily the highest value achievable by any policy for taking action $a_t$ in state $s_t$, because the policy is optimal: $Q_*(s_t,a_t)=\max\limits_{\pi}Q_{\pi}(s_t,a_t)$

The optimal state-value function $V_*(s_t)$ is defined as $V_*(s_t)=\max\limits_{a}Q_*(s_t,a)$; that is, $V_*(s_t)$ is the highest optimal action value $Q_*(s_t,a)$ attainable in state $s_t$ under the optimal policy.
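Numerically, this is a row-wise maximum over a $Q_*$ table. A minimal sketch with invented toy numbers:

```python
import numpy as np

# A minimal sketch (toy numbers): given a table of optimal action values
# Q_*(s, a), the optimal state value is V_*(s) = max_a Q_*(s, a).
Q_star = np.array([[1.0, 2.0, 0.5],
                   [0.0, 1.5, 3.0]])
V_star = Q_star.max(axis=1)       # maximize over the action dimension
greedy = Q_star.argmax(axis=1)    # the greedy (optimal) action in each state
print(V_star)   # [2. 3.]
print(greedy)   # [1 2]
```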

Understanding the action-value and state-value functions
For further intuition, see the blog post 强化学习中状态价值函数和动作价值函数的理解 (Understanding the state-value and action-value functions in reinforcement learning), which explains them clearly and accessibly.

Bellman Equation 1
$Q_{\pi}(s_t,a_t)=\mathbb{E}_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{\pi}(S_{t+1},A_{t+1}) \mid S_t=s_t,A_t=a_t]$
Proof:

  1. $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{n-t} R_n$
  2. $U_t = R_t + \gamma U_{t+1}$
  3. $Q_{\pi}(s_t,a_t)=\mathbb{E}_{S_{t+1}:,A_{t+1}:}[U_t \mid S_t=s_t,A_t=a_t]$, where the colon abbreviates "and all later states/actions".
    Substituting 2 into 3 gives
  4. $Q_{\pi}(s_t,a_t)=\mathbb{E}_{S_{t+1}:,A_{t+1}:}[R_t+\gamma U_{t+1} \mid S_t=s_t,A_t=a_t]$
    Split this expression into two parts:
  5. $\mathbb{E}_{S_{t+1}:,A_{t+1}:}[R_t \mid S_t=s_t,A_t=a_t]$ and $\gamma\,\mathbb{E}_{S_{t+1}:,A_{t+1}:}[U_{t+1} \mid S_t=s_t,A_t=a_t]$
  6. For $\mathbb{E}_{S_{t+1}:,A_{t+1}:}[R_t \mid S_t=s_t,A_t=a_t]$: since $R_t$ depends only on the current state $S_t$, the current action $A_t$, and $S_{t+1}$, it reduces to $\mathbb{E}_{S_{t+1}}[R_t \mid S_t=s_t,A_t=a_t]$.
  7. $\mathbb{E}_{S_{t+1}:,A_{t+1}:}[U_{t+1} \mid S_t=s_t,A_t=a_t]$
  8. $=\mathbb{E}_{S_{t+1},A_{t+1}}\big[\mathbb{E}_{S_{t+2}:,A_{t+2}:}[U_{t+1} \mid S_{t+1},A_{t+1}]\,\big|\,S_t=s_t,A_t=a_t\big]$ (law of iterated expectations)
  9. $=\mathbb{E}_{S_{t+1},A_{t+1}}[Q_{\pi}(S_{t+1},A_{t+1}) \mid S_t=s_t,A_t=a_t]$
    Plugging 6 and 9 back into 4 gives
    $Q_{\pi}(s_t,a_t)=\mathbb{E}_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{\pi}(S_{t+1},A_{t+1}) \mid S_t=s_t,A_t=a_t]$
    This completes the proof. (A numerical sanity check follows below.)
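The sketch below is a numerical sanity check of Bellman equation 1 on a small MDP that is entirely invented for illustration (the transition table, reward table, policy, and $\gamma = 0.9$ are all assumptions, not from the post). It estimates $Q_{\pi}(s,a)$ two ways and confirms they agree: (a) by Monte Carlo averaging of the return $U_t$, and (b) by iterating the Bellman equation to its fixed point.

```python
import numpy as np

# A minimal sketch (invented toy MDP): verify Q_pi(s, a) computed by
# (a) Monte Carlo averaging of returns and
# (b) fixed-point iteration of Q = E[R + gamma * Q(S', A')].
rng = np.random.default_rng(0)
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] transition probs
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])  # R[s, a, s'] rewards
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi(a | s)

# (b) Fixed-point iteration of the Bellman equation.
Q = np.zeros((n_s, n_a))
for _ in range(1000):
    V = (pi * Q).sum(axis=1)               # V_pi(s') = E_{a'} Q(s', a')
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q(s,a) = E_{s'}[R + gamma * V(s')]

# (a) Monte Carlo estimate of Q_pi(0, 0) from truncated rollouts.
def rollout(s, a, horizon=200):
    u, disc = 0.0, 1.0
    for _ in range(horizon):
        s_next = rng.choice(n_s, p=P[s, a])
        u += disc * R[s, a, s_next]
        disc *= gamma
        s = s_next
        a = rng.choice(n_a, p=pi[s])
    return u

mc = np.mean([rollout(0, 0) for _ in range(5000)])
print(Q[0, 0], mc)   # the two estimates should be close
```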

Bellman Equation 2
Since $V_{\pi}(S_{t+1})=\mathbb{E}_{A_{t+1}}[Q_{\pi}(S_{t+1},A_{t+1})]$,
Bellman equation 1 above can be rewritten as
$Q_{\pi}(s_t,a_t)=\mathbb{E}_{S_{t+1}}[R_t+\gamma V_{\pi}(S_{t+1}) \mid S_t=s_t,A_t=a_t]$
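In tabular form, Bellman equation 2 says that $Q_{\pi}$ can be recovered from $V_{\pi}$ by averaging over the next state. A minimal sketch, reusing the same invented toy MDP as above (the $V_{\pi}$ values here are assumed numbers for illustration):

```python
import numpy as np

# A minimal sketch (invented toy MDP): given V_pi, Bellman equation 2
# gives Q_pi(s, a) = E_{s'}[ R(s,a,s') + gamma * V_pi(s') ].
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])  # R[s, a, s']
V_pi = np.array([7.0, 6.5])                # assumed values for illustration

# Expectation over the next state S' ~ p(. | s, a):
Q_pi = (P * (R + gamma * V_pi)).sum(axis=2)
print(Q_pi)    # Q_pi[s, a]
```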

Bellman Equation 3
Since $V_{\pi}(s_t)=\mathbb{E}_{A_t}[Q_{\pi}(s_t,A_t)]$,
Bellman equation 2 above can be rewritten as
$V_{\pi}(s_t)=\mathbb{E}_{A_t,S_{t+1}}[R_t+\gamma V_{\pi}(S_{t+1}) \mid S_t=s_t]$
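Bellman equation 3 is exactly the update used in iterative policy evaluation: apply it repeatedly as a fixed-point update until $V_{\pi}$ stops changing. A minimal sketch on the same invented toy MDP:

```python
import numpy as np

# A minimal sketch of iterative policy evaluation (invented toy MDP):
# repeatedly apply Bellman equation 3,
#   V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a) [ R(s,a,s') + gamma * V(s') ].
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])  # R[s, a, s']
pi = np.array([[0.6, 0.4], [0.3, 0.7]])    # pi(a | s)

V = np.zeros(2)
for _ in range(1000):
    Q = (P * (R + gamma * V)).sum(axis=2)  # Bellman equation 2: Q from V
    V_new = (pi * Q).sum(axis=1)           # expectation over A_t ~ pi(.|s)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V)   # converged V_pi
```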

The Optimal Bellman Equation
$Q_*(s_t,a_t)=\mathbb{E}_{S_{t+1}\sim p(\cdot\mid s_t,a_t)}\big[R_t+\gamma \max\limits_{A\in\mathcal{A}}Q_*(S_{t+1},A) \,\big|\, S_t=s_t,A_t=a_t\big]$
The optimal policy satisfies $\pi^*=\operatorname*{argmax}_{\pi}Q_{\pi}(s,a)$ for every state-action pair $(s,a)$.
By Bellman equation 1,
$Q_{\pi^*}(s_t,a_t)=\mathbb{E}_{S_{t+1},A_{t+1}}[R_t+\gamma Q_{\pi^*}(S_{t+1},A_{t+1}) \mid S_t=s_t,A_t=a_t]$
Since $Q_{\pi^*}(s_t,a_t)=Q_*(s_t,a_t)$, we get
$Q_*(s_t,a_t)=\mathbb{E}_{S_{t+1},A_{t+1}}[R_t+\gamma Q_*(S_{t+1},A_{t+1}) \mid S_t=s_t,A_t=a_t]$
Under $\pi^*$ the action $A_{t+1}=\operatorname*{argmax}_A Q_*(S_{t+1},A)$ is a deterministic function of the state $S_{t+1}$ (the best action in that state), so
$Q_*(s_t,a_t)=\mathbb{E}_{S_{t+1}\sim p(\cdot\mid s_t,a_t)}\big[R_t+\gamma \max\limits_{A\in\mathcal{A}}Q_*(S_{t+1},A) \,\big|\, S_t=s_t,A_t=a_t\big]$
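When the transition and reward tables are known, the optimal Bellman equation can be applied directly as a fixed-point update (Q-value iteration); because the update is a $\gamma$-contraction, it converges to $Q_*$. A minimal sketch on the same invented toy MDP:

```python
import numpy as np

# A minimal sketch of Q-value iteration (invented toy MDP): apply the
# optimal Bellman equation as a fixed-point update,
#   Q(s, a) <- E_{s'}[ R(s,a,s') + gamma * max_a' Q(s', a') ].
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, -1.0]]])  # R[s, a, s']

Q = np.zeros((2, 2))
for _ in range(1000):
    V = Q.max(axis=1)                      # V_*(s') = max_a' Q(s', a')
    Q_new = (P * (R + gamma * V)).sum(axis=2)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new
print(Q)                 # approximates Q_*(s, a)
print(Q.argmax(axis=1))  # greedy optimal policy pi_*(s)
```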
