In reinforcement learning, the first thing to understand is the definition of the (discounted) return:

$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{n-t} R_n$
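The definition above can be sketched in a few lines of Python; the reward sequence and discount factor here are made-up illustration values:

```python
def discounted_return(rewards, gamma):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a finite episode."""
    u = 0.0
    # Accumulate backwards, using the recursion U_t = R_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

The backward accumulation is the same recursion $U_t = R_t + \gamma U_{t+1}$ that the Bellman equations below are built on.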
Here $R$ is the reward. $R_t$ depends on the current state $S_t$, the current action $A_t$, and the next state $S_{t+1}$ (or, in a simplified view, only on $S_t$ and $A_t$); that is, $R_t = r(S_t, A_t, S_{t+1})$. Uppercase letters denote random variables; lowercase letters denote observed values, which carry no randomness.
The action-value function $Q_\pi(s_t, a_t)$ depends on the current policy $\pi$, the current action $a_t$, and the current state $s_t$. It is the expectation of the return:

$Q_\pi(s_t, a_t) = \mathbb{E}_{S_{t+1}, A_{t+1}, \ldots, S_n, A_n}[\,U_t \mid S_t = s_t, A_t = a_t\,]$

That is, we take the expectation over the subsequent $S_{t+1}, A_{t+1}, \ldots, S_n, A_n$, which eliminates the randomness of those states and actions.
The state-value function $V_\pi(s_t)$ depends on the current policy $\pi$ and the current state $s_t$. It evaluates how good the state $s_t$ is under the policy $\pi$; taking the expectation eliminates the randomness of the action.
$V_\pi(s_t) = \sum\limits_{a \in \mathcal{A}} \pi(a \mid s_t)\, Q_\pi(s_t, a)$
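As a minimal sketch, $V_\pi$ is just the policy-weighted average of the Q-values; the policy, actions, and Q-table below are hypothetical:

```python
def state_value(pi, q, s):
    """V_pi(s) = sum over a of pi(a|s) * Q_pi(s, a)."""
    return sum(p_a * q[(s, a)] for a, p_a in pi[s].items())

# Hypothetical policy and Q-table for a single state "s0".
pi = {"s0": {"left": 0.4, "right": 0.6}}
q = {("s0", "left"): 1.0, ("s0", "right"): 2.0}
print(state_value(pi, q, "s0"))  # 0.4 * 1.0 + 0.6 * 2.0 = 1.6
```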
The optimal action-value function $Q_*(s_t, a_t)$ is the value obtained by taking action $a_t$ in state $s_t$ when the policy is optimal. This value is necessarily the highest value achievable by any policy for taking $a_t$ in $s_t$, since the policy is optimal: $Q_*(s_t, a_t) = \max\limits_{\pi} Q_\pi(s_t, a_t)$.
The optimal state-value function $V_*(s_t)$ is defined by $V_*(s_t) = \max\limits_{a} Q_*(s_t, a)$: under the optimal policy, it is the largest optimal action value achievable in state $s_t$.
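Given a table of optimal action values, both $V_*(s)$ and the greedy (optimal) action fall out of the same max; the Q*-table below is a hypothetical example:

```python
def v_star_and_greedy(q_star, s, actions):
    """V_*(s) = max_a Q_*(s, a); the argmax is the greedy optimal action."""
    best = max(actions, key=lambda a: q_star[(s, a)])
    return q_star[(s, best)], best

q_star = {("s0", "left"): 1.2, ("s0", "right"): 3.4}  # hypothetical values
print(v_star_and_greedy(q_star, "s0", ["left", "right"]))  # (3.4, 'right')
```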
Understanding the action-value and state-value functions
For more intuition, see the blog post 强化学习中状态价值函数和动作价值函数的理解 (on understanding the state-value and action-value functions in reinforcement learning); it explains them clearly and accessibly.
Bellman equation 1

$Q_\pi(s_t, a_t) = \mathbb{E}_{S_{t+1}, A_{t+1}}[\,R_t + \gamma Q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s_t, A_t = a_t\,]$
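Bellman equation 1 characterizes $Q_\pi$ as a fixed point, so repeatedly applying the right-hand side as an update converges to $Q_\pi$ (the update is a $\gamma$-contraction). A sketch on a made-up two-state MDP, with transitions and policy chosen purely for illustration:

```python
GAMMA = 0.9
# Hypothetical MDP: P[(s, a)] lists (prob, next_state, reward) outcomes.
P = {
    ("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s0", "b"): [(1.0, "s1", 0.5)],
    ("s1", "a"): [(1.0, "s1", 0.0)],
    ("s1", "b"): [(1.0, "s1", 0.0)],
}
# Hypothetical policy pi(a|s).
PI = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 1.0, "b": 0.0}}

def bellman_q_update(q):
    """One sweep of Q(s,a) <- E[R + gamma * Q(S', A')], with A' ~ pi(.|S')."""
    new_q = {}
    for (s, a), outcomes in P.items():
        total = 0.0
        for prob, s2, r in outcomes:
            # Inner expectation over the next action A' under the policy at s2.
            q_next = sum(p_a2 * q[(s2, a2)] for a2, p_a2 in PI[s2].items())
            total += prob * (r + GAMMA * q_next)
        new_q[(s, a)] = total
    return new_q

q = {sa: 0.0 for sa in P}
for _ in range(200):          # iterate to the fixed point
    q = bellman_q_update(q)
print(round(q[("s0", "a")], 4))  # ≈ 0.9286 (= 0.845 / 0.91 analytically)
```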
Proof: the return satisfies the recursion $U_t = R_t + \gamma U_{t+1}$. Taking the conditional expectation of both sides given $S_t = s_t$, $A_t = a_t$, and using $\mathbb{E}[\,U_{t+1} \mid S_{t+1}, A_{t+1}\,] = Q_\pi(S_{t+1}, A_{t+1})$ together with the tower property, yields the equation above.
Bellman equation 2
Since $V_\pi(S_{t+1}) = \mathbb{E}_{A_{t+1}}[\,Q_\pi(S_{t+1}, A_{t+1})\,]$,
Bellman equation 1 above can be rewritten as

$Q_\pi(s_t, a_t) = \mathbb{E}_{S_{t+1}}[\,R_t + \gamma V_\pi(S_{t+1}) \mid S_t = s_t, A_t = a_t\,]$
Bellman equation 3
Since $V_\pi(s_t) = \mathbb{E}_{A_t}[\,Q_\pi(s_t, A_t)\,]$,
taking the expectation over $A_t$ on both sides of Bellman equation 2 gives

$V_\pi(s_t) = \mathbb{E}_{A_t, S_{t+1}}[\,R_t + \gamma V_\pi(S_{t+1}) \mid S_t = s_t\,]$
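Bellman equation 3 is the basis of iterative policy evaluation: sweep the right-hand side as an update until $V_\pi$ stops changing. A sketch on a made-up two-state MDP (all names and numbers are illustration only):

```python
GAMMA = 0.9
# Hypothetical MDP: P[(s, a)] lists (prob, next_state, reward) outcomes.
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "move"): [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 0.0)],
    ("s1", "move"): [(1.0, "s1", 0.0)],
}
# Hypothetical policy pi(a|s).
PI = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 1.0, "move": 0.0}}

def policy_evaluation_sweep(v):
    """One sweep of V(s) <- E_{A ~ pi, S' ~ p}[R + gamma * V(S')]."""
    new_v = {}
    for s, action_probs in PI.items():
        total = 0.0
        for a, p_a in action_probs.items():
            for prob, s2, r in P[(s, a)]:
                total += p_a * prob * (r + GAMMA * v[s2])
        new_v[s] = total
    return new_v

v = {"s0": 0.0, "s1": 0.0}
for _ in range(200):          # iterate to the fixed point
    v = policy_evaluation_sweep(v)
print(round(v["s0"], 4))  # ≈ 0.9091 (= 0.5 / 0.55 analytically)
```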
Optimal Bellman equation

$Q_*(s_t, a_t) = \mathbb{E}_{S_{t+1} \sim p(\cdot \mid s_t, a_t)}[\,R_t + \gamma \max\limits_{A \in \mathcal{A}} Q_*(S_{t+1}, A) \mid S_t = s_t, A_t = a_t\,]$

Let $\pi^* = \operatorname{argmax}_{\pi} Q_\pi(s, a)$ denote the optimal policy.
From Bellman equation 1,

$Q_{\pi^*}(s_t, a_t) = \mathbb{E}_{S_{t+1}, A_{t+1}}[\,R_t + \gamma Q_{\pi^*}(S_{t+1}, A_{t+1}) \mid S_t = s_t, A_t = a_t\,]$

Since $Q_{\pi^*}(s_t, a_t) = Q_*(s_t, a_t)$, we get

$Q_*(s_t, a_t) = \mathbb{E}_{S_{t+1}, A_{t+1}}[\,R_t + \gamma Q_*(S_{t+1}, A_{t+1}) \mid S_t = s_t, A_t = a_t\,]$

Under the optimal policy, the action $A_{t+1} = \operatorname{argmax}_A Q_*(S_{t+1}, A)$ is a deterministic function of the state $S_{t+1}$ (the best action in that state), so
$Q_*(s_t, a_t) = \mathbb{E}_{S_{t+1} \sim p(\cdot \mid s_t, a_t)}[\,R_t + \gamma \max\limits_{A \in \mathcal{A}} Q_*(S_{t+1}, A) \mid S_t = s_t, A_t = a_t\,]$
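Replacing the policy expectation with a max turns the same fixed-point iteration into Q-value iteration, which converges to $Q_*$; the greedy policy can then be read off with an argmax. Again a made-up MDP for illustration:

```python
GAMMA = 0.9
ACTIONS = ("stay", "move")
# Hypothetical MDP: P[(s, a)] lists (prob, next_state, reward) outcomes.
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "move"): [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "move"): [(1.0, "s0", 0.0)],
}

def optimal_bellman_update(q):
    """One sweep of Q(s,a) <- E[R + gamma * max_A Q(S', A)]."""
    return {
        (s, a): sum(
            prob * (r + GAMMA * max(q[(s2, a2)] for a2 in ACTIONS))
            for prob, s2, r in outcomes
        )
        for (s, a), outcomes in P.items()
    }

q = {sa: 0.0 for sa in P}
for _ in range(500):          # iterate to (numerical) convergence
    q = optimal_bellman_update(q)

# The greedy policy reads off argmax_a Q_*(s, a) in each state.
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in ("s0", "s1")}
print(greedy)  # "move" out of s0 toward the reward loop, then "stay" in s1
```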