One RL Theory Basic a Day (2): VI & PI

CS6789-2

  • 1. Value Iteration
    • 1.1 VI
    • 1.2 Convergence proof of VI
    • 1.3 Quantitative analysis of VI
  • 2. Policy Iteration
    • 2.1 PI
    • 2.2 Policy Evaluation
    • 2.3 Policy Improvement
    • 2.4 Convergence analysis of PI
  • 3. Supplements
    • 3.1 Invertibility proof for PE
    • 3.2 Behind the invertibility proof
      • 3.2.1 Understanding $P^\pi$
      • 3.2.2 A useful lemma derived from $P^\pi$
    • 3.3 A general formula for evaluating policy performance
  • 4. Summary

  • Source: https://wensun.github.io/CS6789_fall_2021.html
  • Topic: Planning in MDPs
  • Setting: infinite-horizon discounted MDP
  • Problem: given $\mathcal M=(S,A,P,r,\gamma)$ with the transition kernel $P$ known, how do we find the optimal policy $\pi^\star$ (deterministic & stationary)?
  • Theoretical tool: Bellman Optimality

A brief review of Bellman Optimality:

  1. Property 1: in the setting above, there exists a deterministic & stationary optimal policy $\pi^\star$ whose value functions satisfy
$$V^\star(s)=\max_a \left[r(s,a) + \gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\right],\qquad Q^\star(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\max_{a'}Q^\star (s',a')\right]$$
  2. Property 2:
    • (V-version) For any value function $V$, if $V(s)=\max_a\left[r(s,a)+\gamma\mathbb E_{s'\sim P(\cdot\mid s,a)}[V(s')]\right]$ for all $s$, then
$$V(s)=V^\star(s)\quad\forall s$$
    • (Q-version) For any Q-function $Q$, if $Q(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\max_{a'}Q (s',a')\right]$ for all $s,a$, then
$$Q(s,a)=Q^\star(s,a)\quad\forall s,a$$

1. Value Iteration

1.1 VI

The VI algorithm (a minimal numpy sketch follows the list below):

  1. Initialize a Q-function $Q_0$ with $\|Q_0\|_{\infty}\in \big(0,\tfrac{1}{1-\gamma}\big)$
  2. Iterate until convergence: $Q_{n+1}(s,a)= r(s,a) + \gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[\max_{a'}Q_n(s',a')]\quad \forall (s,a)\in S\times A$
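
Below is a minimal tabular VI sketch in numpy, assuming the MDP is given as a reward array `r` of shape `(S, A)` and a transition array `P` of shape `(S, A, S)`; the function name, array layout, and stopping tolerance are illustrative choices, not part of the lecture notes.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=100_000):
    """Tabular value iteration.

    P[s, a, s'] = p(s' | s, a), shape (S, A, S); r[s, a], shape (S, A).
    Returns an (approximately) optimal Q-function and its greedy policy.
    """
    S, A = r.shape
    Q = np.zeros((S, A))                      # any bounded initialization works
    for _ in range(max_iter):
        # Bellman optimality backup: Q_{n+1}(s,a) = r(s,a) + gamma * E_{s'}[max_{a'} Q_n(s',a')]
        Q_next = r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_next - Q)) < tol:  # sup-norm stopping rule
            Q = Q_next
            break
        Q = Q_next
    pi = Q.argmax(axis=1)                     # greedy deterministic policy
    return Q, pi
```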

1.2 Convergence proof of VI

  • Goal: find $\pi^\star$
  • Analysis
    1. From the proof of Bellman Optimality Property 1, $\pi^\star(s)=\argmax_aQ^\star(s,a)$, so once we know $Q^\star$ we can read off $\pi^\star$
    2. By Bellman Optimality Property 2, any $Q$ satisfying $Q(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\max_{a'}Q (s',a')\right]$ for all $s,a$ must equal $Q^\star$
    3. Define the Bellman Optimality Operator $\mathcal B$: $\mathcal BQ(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[\max_{a'}Q(s',a')]$. It therefore suffices to find a $Q$ with $\mathcal B Q=Q$
    4. Consider fixed-point theory: to find a fixed point of $f(x)=x$, iterate $x_0,\ x_1=f(x_0),\ x_2=f(f(x_0)),\ \dots,\ x_n=\underbrace{f\circ\cdots\circ f}_{n}(x_0)$. What property must $f$ have so that, after enough applications of $f$, the initial point $x_0$ gets mapped to a point $x^\star$ with $x^\star = f(x^\star)$?
$$|x_n-x^\star|=|f(x_{n-1})-f(x^\star)|\leq L|x_{n-1}-x^\star|\leq\cdots\leq L^n|x_0-x^\star|$$
      It suffices that $L<1$; here $L$ is the Lipschitz constant of $f$, i.e. $f$ is a contraction
    5. Does the Bellman Optimality Operator $\mathcal B$ have this property? We prove it below (replacing the point $x$ by a Q-function requires a metric on Q-functions: the sup-norm $\|\cdot\|_{\infty}$)
  • VI claim 1 (contraction): for any $Q,Q'$, $\|\mathcal BQ-\mathcal BQ'\|_{\infty}\leq \gamma \|Q-Q'\|_{\infty}$
$$\begin{aligned} \|\mathcal BQ-\mathcal BQ'\|_{\infty}&=\max_{s,a}\left| r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[\max_{a'}Q(s',a')]-\left(r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[\max_{a'}Q'(s',a')]\right)\right|\\ &=\gamma \max_{s,a}\left| \mathbb E_{s'\sim P(\cdot\mid s,a)}\left[\max_{a'}Q(s',a')- \max_{a'}Q'(s',a')\right]\right|\\ &\leq \gamma \max_{s,a}\mathbb E_{s'\sim P(\cdot\mid s,a)}\left| \max_{a'}Q(s',a')-\max_{a'}Q'(s',a')\right|\\ &\leq \gamma \max_{s,a}\mathbb E_{s'\sim P(\cdot\mid s,a)} \max_{a'} \left| Q(s',a')-Q'(s',a')\right|\quad\big(|\max_{a'} f-\max_{a'} g|\leq \max_{a'}|f-g|\big)\\ &\leq \gamma \max_{s'}\max_{a'} \left| Q(s',a')-Q'(s',a')\right|\quad\text{(expectation}\leq\text{maximum)}\\ &=\gamma \|Q-Q'\|_{\infty} \end{aligned}$$
  • VI convergence: after enough applications of the Bellman operator $\mathcal B$, an arbitrarily initialized $Q_0$ converges to $Q^\star$:
$$\|Q_n-Q^\star\|_{\infty}=\|\mathcal BQ_{n-1}-\mathcal BQ^\star\|_\infty\leq \gamma \|Q_{n-1}-Q^\star\|_\infty\leq \cdots\leq \gamma^n \|Q_0-Q^\star\|_{\infty}$$
    After $n$ Bellman backups, the distance from $Q_n$ to $Q^\star$ is bounded by the geometric factor $\gamma^n$ with $\gamma<1$, while $\|Q_0-Q^\star\|_{\infty}$ is a constant, so for $n$ large enough we recover $Q^\star$ from any initialization $Q_0$; the optimal policy is then $\pi^\star(s) = \argmax_a Q^\star(s,a)$

1.3 Quantitative analysis of VI

The theory above tells us that iterating $Q$ this way converges, and that the rate is qualitatively geometric, $\gamma^n$. But how do we make this quantitative, and quantitative about what exactly?
The quantitative analysis answers the following questions:

  1. After $n$ iterations, how far is the policy $\pi_n$ induced by $Q_n$ from the optimal policy $\pi^\star$?
  2. If we only need 95% of the optimal policy's performance, how many iterations are required? (a worked example is given right after the proof below)

A small note: the "performance" of a policy $\pi'$ is measured by its value function $V^{\pi'}(s)$,
because the objective $\mathbb E\big[\sum_{t=0}^\infty{\gamma^t}r(s_t,a_t)\big]$ with $a_t\sim\pi'(\cdot\mid s_t)$ and $s_{t+1}\sim P(\cdot\mid s_t,a_t)$ can be written as $\mathbb E_{s_0}[V^{\pi'}(s_0)]$; since the initial state distribution over $s_0$ can be specified, policy performance is exactly the optimization objective.

Quantitative VI theorem
After $n$ VI iterations producing $Q_n$, the greedy policy $\pi_n(s)=\argmax_a Q_n(s,a)$ satisfies:
$$V^{\pi_n}(s)\geq V^\star(s)-\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty} \quad \forall s\in S$$

Proof

$$\begin{aligned} V^{\pi_n}(s)-V^\star(s)&=Q^{\pi_n}(s,\pi_n(s))-Q^\star(s,\pi^\star(s)) \quad \text{(deterministic policies)}\\ &= \underbrace{Q^{\pi_n}(s,\pi_n(s))-Q^\star(s,\pi_n(s))}+Q^\star(s,\pi_n(s)) - Q^\star(s,\pi^\star(s)) \\ &=\gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] + Q^\star(s,\pi_n(s)) - Q^\star(s,\pi^\star(s))\\ &=\gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] + Q^\star(s,\pi_n(s)) \underbrace{- Q_n(s,\pi_n(s))+Q_n(s,\pi_n(s))}-Q^\star(s,\pi^\star(s))\\ &\geq\gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] + Q^\star(s,\pi_n(s))- Q_n(s,\pi_n(s))+Q_n(s,\pi^\star(s))-Q^\star(s,\pi^\star(s))\quad\big(\pi_n(s)=\argmax_aQ_n(s,a)\Rightarrow Q_n(s,\pi_n(s))\geq Q_n(s,\pi^\star(s))\big)\\ &\geq \gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] -2\gamma^n \|Q_0-Q^\star\|_{\infty}\quad\big(\text{VI convergence: }\|Q_n-Q^\star\|_{\infty}\leq \gamma^n\|Q_0-Q^\star\|_{\infty}\big)\\ &\geq \gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_n(s))}\Big[\gamma \mathbb E_{s''\sim P(\cdot\mid s',\pi_n(s'))}\left[V^{\pi_n}(s'')-V^\star(s'')\right] -2\gamma^n \|Q_0-Q^\star\|_{\infty}\Big] -2\gamma^n \|Q_0-Q^\star\|_{\infty}\quad\text{(recurse over next states)}\\ &\geq \cdots\\ &\geq -2\gamma^n \|Q_0-Q^\star\|_{\infty}\big(1+\gamma+\gamma^2+\cdots\big)\\ &= -\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty} \end{aligned}$$
(In the two middle inequalities the inserted zero term is $\pm Q_n$, the $n$-th VI iterate, so that both the greedy property of $\pi_n$ with respect to $Q_n$ and the bound $\|Q_n-Q^\star\|_{\infty}\leq \gamma^n\|Q_0-Q^\star\|_{\infty}$ can be applied.)
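
As a hedged worked answer to question 2 in 1.3 (the "95%" question): suppose we want the gap $\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty}$ to be at most some target $\epsilon$. Solving for $n$,
$$\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty}\leq \epsilon \iff n \geq \frac{\ln\big(2\|Q_0-Q^\star\|_{\infty}/((1-\gamma)\epsilon)\big)}{\ln(1/\gamma)}.$$
For a concrete instance (illustrative assumptions, not from the notes): assume rewards in $[0,1]$ and $Q_0=0$, so $\|Q_0-Q^\star\|_\infty\leq \frac{1}{1-\gamma}$; take $\gamma=0.9$ and read "95% of optimal performance" as a gap of at most $\epsilon=0.05\cdot\frac{1}{1-\gamma}=0.5$. Then $n\geq \ln(400)/\ln(1/0.9)\approx 57$ iterations suffice.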

2. Policy Iteration

2.1 PI

The PI algorithm (a minimal numpy sketch follows the list below):

  1. Initialize an explicit (deterministic) policy $\pi_0$
  2. (Policy Evaluation) Evaluate the current policy: $Q^{\pi_n}=(I-\gamma P^{\pi_n})^{-1}r$
  3. (Policy Improvement) Improve the policy: $\pi_{n+1}(s)=\argmax_a Q^{\pi_n}(s,a) \quad\forall s$
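
Below is a minimal tabular PI sketch in numpy, assuming the same `(S, A)` reward array `r` and `(S, A, S)` transition array `P` as in the VI sketch; the names and the "policy unchanged" stopping rule are illustrative choices.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iter=1_000):
    """Tabular policy iteration with exact (closed-form) policy evaluation.

    P[s, a, s'] = p(s' | s, a), shape (S, A, S); r[s, a], shape (S, A).
    """
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                     # arbitrary initial deterministic policy
    r_vec = r.reshape(S * A)                        # flatten (s, a) -> row index s*A + a
    for _ in range(max_iter):
        # Build P^pi in R^{SA x SA}: P_pi[(s,a), (s',a')] = P[s, a, s'] * 1{a' = pi(s')}
        P_pi = np.zeros((S * A, S * A))
        for s_next in range(S):
            P_pi[:, s_next * A + pi[s_next]] = P[:, :, s_next].reshape(S * A)
        # Policy evaluation: Q^pi = (I - gamma * P^pi)^{-1} r
        Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r_vec).reshape(S, A)
        # Policy improvement: greedy w.r.t. Q^pi
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):              # no change => greedy policy is optimal
            return Q, pi
        pi = pi_new
    return Q, pi
```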

2.2 Policy Evaluation

  1. The Bellman consistency equation (Q-to-Q) gives $Q^{\pi_n}(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a),\,a'\sim \pi_n(\cdot\mid s')}[Q^{\pi_n}(s',a')]$
  2. View $Q^{\pi_n}$ as a vector in $\mathbb R^{|S||A|}$ and $r$ as a vector in $\mathbb R^{|S||A|}$, and view the expectation $\mathbb E_{s'\sim P(\cdot\mid s,a),\,a'\sim \pi_n(\cdot\mid s')}$ as a matrix $P^{\pi_n}\in \mathbb R^{|S||A|\times |S||A|}$ whose entry $P^{\pi_n}_{(s,a),(s',a')}=P(s'\mid s,a)\,\pi_n(a'\mid s')$ is the probability of moving from the pair $(s,a)$ to the pair $(s',a')$. The Bellman consistency equation then becomes: $Q^{\pi_n}=r+\gamma P^{\pi_n} Q^{\pi_n}$
  3. Hence the closed-form Policy Evaluation step of PI is
$$Q^{\pi_n}=(I-\gamma P^{\pi_n})^{-1}r$$

(The invertibility proof is deferred to Section 3.)

2.3 Policy Improvement

  1. For the PI update $\pi_{n+1}(s)=\argmax_a Q^{\pi_n}(s,a)\ \forall s$, the key is to show that the improved policy $\pi_{n+1}$ is really no worse than $\pi_n$
  2. (Monotonic Improvement) It suffices to prove $Q^{\pi_{n+1}}(s,a)\geq Q^{\pi_n}(s,a) \quad \forall s,a$

We now prove $Q^{\pi_{n+1}}(s,a)\geq Q^{\pi_n}(s,a) \quad \forall s,a$:
$$\begin{aligned} Q^{\pi_{n+1}}(s,a)-Q^{\pi_n}(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim P,a'\sim \pi_{n+1}}[Q^{\pi_{n+1}}(s',a')]-r(s,a)-\gamma \mathbb E_{s'\sim P,a'\sim \pi_{n}}[Q^{\pi_{n}}(s',a')]\\ &=\gamma \mathbb E_{s'\sim P} \big[Q^{\pi_{n+1}}(s',\pi_{n+1}(s'))-Q^{\pi_{n}}(s',\pi_n(s'))\big] \quad\text{(deterministic policies)}\\ &= \gamma \mathbb E_{s'\sim P} \big[Q^{\pi_{n+1}}(s',\pi_{n+1}(s'))- Q^{\pi_n}(s',\pi_{n+1}(s'))+\underbrace{Q^{\pi_n}(s',\pi_{n+1}(s'))-Q^{\pi_{n}}(s',\pi_n(s'))}_{\geq 0 \text{ since }\pi_{n+1}(s')=\argmax_aQ^{\pi_n}(s',a)}\big] \\ &\geq \gamma \mathbb E_{s'\sim P} \big[Q^{\pi_{n+1}}(s',\pi_{n+1}(s'))- Q^{\pi_n}(s',\pi_{n+1}(s'))\big] \quad\text{(now recurse over next states)}\\ &\geq \gamma \mathbb E_{s'\sim P} \Big[\gamma \mathbb E_{s''\sim P} \big[Q^{\pi_{n+1}}(s'',\pi_{n+1}(s''))- Q^{\pi_n}(s'',\pi_{n+1}(s''))\big] \Big] \\ &\geq \cdots \geq \gamma^k \min_{s,a}\big[Q^{\pi_{n+1}}(s,a)-Q^{\pi_n}(s,a)\big] \quad \forall k \end{aligned}$$
Since Q-functions are bounded in the discounted setting, letting $k\to\infty$ sends the right-hand side to $0$, hence $Q^{\pi_{n+1}}(s,a)-Q^{\pi_n}(s,a)\geq 0$.

(Note that these derivations also highlight why the infinite-horizon discounted setting matters: $\gamma<1$, so $\gamma^k\to 0$ as $k\to\infty$.)

Similarly, $V^{\pi_{n+1}}(s)\geq V^{\pi_n}(s) \quad \forall s$:
$$V^{\pi_{n+1}}(s)=Q^{\pi_{n+1}}(s,{\pi_{n+1}(s)})\geq Q^{\pi_n}(s,\pi_{n+1}(s))=\max_a Q^{\pi_n}(s,a)\geq Q^{\pi_n}(s,\pi_n(s))=V^{\pi_n}(s)$$

2.4 Convergence analysis of PI

PE evaluates the current policy, and Policy Improvement guarantees that each update is a genuine improvement, $Q^{\pi_{n+1}}(s,a)\geq Q^{\pi_n}(s,a)\ \forall s,a$. What remains is to show that PI actually converges, i.e. that $\pi_n$ reaches the optimal policy $\pi^\star$ after finitely many iterations (there are only finitely many deterministic stationary policies, and the argument below shows the values converge geometrically to $V^\star$).

  1. We know $\pi_{n+1}(s)=\argmax_a Q^{\pi_n}(s,a) \quad \forall s$
  2. It suffices to prove $\|V^{\pi_{n+1}}-V^{\star}\|_\infty \leq \gamma \|V^{\pi_n}-V^\star\|_\infty$
  3. Then $\|V^{\pi_{n+1}}-V^{\star}\|_\infty\leq \gamma \|V^{\pi_n}-V^\star\|_\infty\leq \cdots\leq \gamma^{n+1} \|V^{\pi_0}-V^\star\|_\infty$, which guarantees convergence!

$$\begin{aligned} V^{\star}(s)-V^{\pi_{n+1}}(s)&=\max_a \Big[r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\Big]-r(s,\pi_{n+1}(s))-\gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_{n+1}(s))}[V^{\pi_{n+1}}(s')]\\ &\leq \max_a \Big[r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\Big] - r(s,\pi_{n+1}(s))-\gamma \mathbb E_{s'\sim P(\cdot\mid s,\pi_{n+1}(s))}[V^{\pi_{n}}(s')] \quad\text{(using $V^{\pi_{n+1}}(s)\geq V^{\pi_n}(s)$)}\\ &=\max_a \Big[r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')]\Big]-\max_a\Big[r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^{\pi_{n}}(s')]\Big]\quad\text{(using the definition $\pi_{n+1}(s)=\argmax_a Q^{\pi_n}(s,a)$)}\\ &\leq \max_a \gamma \mathbb E_{s'\sim P(\cdot\mid s,a)}[V^\star(s')-V^{\pi_{n}}(s')]\\ &\leq \gamma \max_{s'} \big[V^\star(s')-V^{\pi_{n}}(s')\big] \quad \text{(expectation $\leq$ maximum)}\\ &=\gamma \|V^\star-V^{\pi_n}\|_{\infty}\quad \text{(definition of the $\infty$-norm, since $V^\star\geq V^{\pi_n}$ pointwise)} \end{aligned}$$

Explanation of the second-to-third step: $\pi_{n+1}(s)=\argmax_a Q^{\pi_n}(s,a)$ implies $Q^{\pi_n}(s,\pi_{n+1}(s))=\max_a Q^{\pi_n}(s,a)=\max_a\Big[r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a )}[V^{\pi_n}(s')]\Big]=r(s,\pi_{n+1}(s))+\gamma \mathbb E_{s'\sim P(\cdot\mid s, \pi_{n+1}(s))}[V^{\pi_n}(s')]$
In more detail, for every $s$ we have $V^{\star}(s)-V^{\pi_{n+1}}(s)\leq \gamma \|V^\star-V^{\pi_n}\|_{\infty}$; taking the maximum over $s$ gives $\|V^{\star}-V^{\pi_{n+1}}\|_\infty\leq \gamma \|V^\star-V^{\pi_n}\|_{\infty}$.

3. Supplements

3.1 Invertibility proof for PE

For any nonzero $x\in \mathbb R^{|S||A|}$, with $P^\pi\in \mathbb R^{|S||A|\times |S||A|}$ and $I$ the identity matrix, the invertibility argument is:
$$\begin{aligned} \|(I-\gamma P^{\pi})x\|_{\infty}&=\|x-\gamma P^{\pi}x\|_{\infty}\\ &\geq \|x\|_{\infty}-\gamma\|P^\pi x\|_{\infty}\quad\text{(reverse triangle inequality)}\\ &\geq \|x\|_{\infty}-\gamma \|x\|_\infty \quad\text{($P^\pi$ is a row-stochastic transition matrix)}\\ &=(1-\gamma)\|x\|_\infty >0 \end{aligned}$$

Since $(I-\gamma P^{\pi})x\neq 0$ for every nonzero vector $x$, the matrix $I-\gamma P^{\pi}$ has full rank, i.e. it is invertible.
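
The step labelled "row-stochastic" is the only place the structure of $P^\pi$ is used; a one-line justification (each row of $P^\pi$ is a probability distribution over next state-action pairs, so its entries are nonnegative and sum to 1):
$$\|P^\pi x\|_\infty=\max_{s,a}\Big|\sum_{s',a'}P^\pi_{(s,a),(s',a')}\,x(s',a')\Big|\leq \max_{s,a}\sum_{s',a'}P^\pi_{(s,a),(s',a')}\,|x(s',a')|\leq \Big(\max_{s,a}\sum_{s',a'}P^\pi_{(s,a),(s',a')}\Big)\|x\|_\infty=\|x\|_\infty.$$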

3.2 Behind the invertibility proof for PE

3.2.1 Understanding $P^\pi$

First, an intuitive explanation of why $Q^{\pi}(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim P(\cdot\mid s,a),\,a'\sim \pi(\cdot\mid s')}[Q^{\pi}(s',a')]$ can be written as $Q^{\pi}=r+\gamma P^{\pi} Q^{\pi}$. What exactly is $P^\pi$?

Suppose the state space has $m$ discrete values $\{s_1,s_2,\dots,s_m\}$ and the action space has $n$ discrete values $\{a_1,a_2,\dots,a_n\}$. Then $P^\pi$ has dimension $\mathbb R^{mn\times mn}$ and $Q^\pi$ is a vector in $\mathbb R^{mn}$.
In the product $P^\pi Q^\pi$, the vector $Q^\pi$ holds the Q-values of all possible next state-action pairs, and each row of $P^\pi$ averages over them: for instance, the entry in the row indexed by $(s_m,a_n)$ and the column indexed by $(s_i,a_j)$ is the probability, under the current policy $\pi$, of moving from the current pair $(s_m,a_n)$ to the next pair $(s_i,a_j)$, namely $P(s_i\mid s_m,a_n)\,\pi(a_j\mid s_i)$ (a small construction in numpy is sketched below).
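
As a hedged illustration of this indexing, here is one way to build $P^\pi$ in numpy for a stochastic policy; the array names (`P` of shape `(S, A, S)`, `pi` of shape `(S, A)`) and the row-major pair index `s*A + a` are assumptions made for the example.

```python
import numpy as np

def build_P_pi(P, pi):
    """Build P^pi with entries P_pi[(s,a), (s',a')] = P[s, a, s'] * pi[a' | s'].

    P: (S, A, S) transition probabilities; pi: (S, A) stochastic policy.
    The pair (s, a) is flattened to the index s * A + a (row-major).
    """
    S, A = pi.shape
    P_pi = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            for s_next in range(S):
                for a_next in range(A):
                    P_pi[s * A + a, s_next * A + a_next] = P[s, a, s_next] * pi[s_next, a_next]
    return P_pi

# Sanity check on a random MDP: each row of P^pi is a distribution over next (s', a') pairs.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S), last axis sums to 1
pi = rng.dirichlet(np.ones(A), size=S)       # (S, A), rows sum to 1
assert np.allclose(build_P_pi(P, pi).sum(axis=1), 1.0)
```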

3.2.2 A useful lemma derived from $P^\pi$

$$\big[(1-\gamma)(I-\gamma P^\pi)^{-1}\big]_{(s,a),(s',a')}=(1-\gamma)\sum_{h=0}^\infty \gamma^h\, {\mathbb P}^\pi(s_h=s',a_h=a'\mid s_0=s,a_0=a)$$

The right-hand side is the normalized, discounted probability of visiting $(s',a')$ under policy $\pi$ when starting from $(s_0,a_0)=(s,a)$.
The left-hand side can be obtained by expanding the inverse using the definition of $P^\pi$.
(A detailed proof is left for later; a brief sketch is given below.)
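
A hedged sketch of that expansion (the standard Neumann-series argument, added here for completeness rather than taken from the notes): since $\gamma\|P^\pi\|_\infty=\gamma<1$, the geometric series converges and
$$(I-\gamma P^\pi)^{-1}=\sum_{h=0}^{\infty}\gamma^h (P^\pi)^h,\qquad \big[(P^\pi)^h\big]_{(s,a),(s',a')}=\mathbb P^\pi(s_h=s',a_h=a'\mid s_0=s,a_0=a),$$
because multiplying $P^\pi$ by itself $h$ times chains the one-step transition probabilities over $h$ steps. Multiplying by $(1-\gamma)$ and reading off the $(s,a),(s',a')$ entry gives the lemma.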

3.3 A general formula for evaluating policy performance

Define $d^\pi_{s_0}(s,a)$, the (normalized, discounted) probability that a trajectory starting from $s_0$ and following $\pi$ visits $(s,a)$:
$$d^\pi_{s_0}(s,a)=(1-\gamma)\sum_{h=0}^\infty \gamma^h\,\mathbb P^\pi(s_h=s,a_h=a\mid s_0)$$
where $\mathbb P^\pi(s_h=s,a_h=a\mid s_0)$ is the probability that such a trajectory visits $(s,a)$ at time step $h$.

For any two policies $\pi,\pi'$, their performance difference can be written as:
$$V^{\pi'}(s_0)-V^{\pi}(s_0)=\frac{1}{1-\gamma} \mathbb E_{(s,a)\sim d^{\pi'}_{s_0}}[Q^{\pi}(s,a)-V^{\pi}(s)]=\frac{1}{1-\gamma} \mathbb E_{(s,a)\sim d^{\pi'}_{s_0}}[A^{\pi}(s,a)]$$

Proof (in what follows, read $s$ as the $s_0$ of the statement and $s'$ as the generic state $s$):
$$\begin{aligned} V^{\pi^{\prime}}(s)-V^{\pi}(s) &=\sum_{a} \pi^{\prime}(a \mid s) \cdot Q^{\pi^{\prime}}(s, a)-\sum_{a} \pi(a \mid s) \cdot Q^{\pi}(s, a) \\ &=\sum_{a} \pi^{\prime}(a \mid s) \cdot\left(Q^{\pi^{\prime}}(s, a)-Q^{\pi}(s, a)\right)+\sum_{a}\left(\pi^{\prime}(a \mid s)-\pi(a \mid s)\right) \cdot Q^{\pi}(s, a) \\ &=\sum_{a}\left(\pi^{\prime}(a \mid s)-\pi(a \mid s)\right) \cdot Q^{\pi}(s, a)+\gamma \sum_{a} \pi^{\prime}(a \mid s) \sum_{s^{\prime}} P\left(s^{\prime} \mid s, a\right) \cdot\left[V^{\pi^{\prime}}\left(s^{\prime}\right)-V^{\pi}\left(s^{\prime}\right)\right] \\ &=\frac{1}{1-\gamma} \sum_{s^{\prime}} d_{s}^{\pi^{\prime}}\left(s^{\prime}\right) \sum_{a^{\prime}}\left(\pi^{\prime}\left(a^{\prime} \mid s^{\prime}\right)-\pi\left(a^{\prime} \mid s^{\prime}\right)\right) \cdot Q^{\pi}\left(s^{\prime}, a^{\prime}\right)\quad\text{(unroll the recursion over states)} \\ &=\frac{1}{1-\gamma} \sum_{s^{\prime}} d_{s}^{\pi^{\prime}}\left(s^{\prime}\right) \sum_{a^{\prime}} \pi^{\prime}\left(a^{\prime} \mid s^{\prime}\right) \cdot\left(Q^{\pi}\left(s^{\prime}, a^{\prime}\right)-V^{\pi}\left(s^{\prime}\right)\right) \\ &=\frac{1}{1-\gamma} \sum_{s^{\prime}} d_{s}^{\pi^{\prime}}\left(s^{\prime}\right) \sum_{a^{\prime}} \pi^{\prime}\left(a^{\prime} \mid s^{\prime}\right) \cdot A^{\pi}\left(s^{\prime}, a^{\prime}\right)\\ &=\frac{1}{1-\gamma} \mathbb E_{(s',a')\sim d^{\pi^\prime}_s}\big[A^{\pi}\left(s^{\prime}, a^{\prime}\right)\big] \end{aligned}$$
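
A hedged numerical check of this performance-difference formula on a random MDP, reusing the matrix form of $P^\pi$ from 3.2.1 and the occupancy lemma from 3.2.2; the random-MDP setup and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, s0 = 5, 3, 0.9, 0

P   = rng.dirichlet(np.ones(S), size=(S, A))       # transitions, shape (S, A, S)
r   = rng.uniform(size=(S, A))                     # rewards, shape (S, A)
pi1 = rng.dirichlet(np.ones(A), size=S)            # policy pi(a|s)
pi2 = rng.dirichlet(np.ones(A), size=S)            # policy pi'(a|s)

def P_mat(pol):                                    # P^pol in R^{SA x SA}, as in 3.2.1
    return np.einsum("sat,tb->satb", P, pol).reshape(S * A, S * A)

def Q_of(pol):                                     # exact PE: Q^pol = (I - gamma P^pol)^{-1} r
    return np.linalg.solve(np.eye(S * A) - gamma * P_mat(pol), r.reshape(-1)).reshape(S, A)

Q1, Q2 = Q_of(pi1), Q_of(pi2)
V1 = (pi1 * Q1).sum(axis=1)                        # V^pi(s)  = E_{a~pi}[Q^pi(s,a)]
V2 = (pi2 * Q2).sum(axis=1)                        # V^pi'(s)
A1 = Q1 - V1[:, None]                              # advantage A^pi(s,a)

# occupancy d^{pi'}_{s0}(s,a) via the lemma in 3.2.2: (1-gamma) * mu0^T (I - gamma P^{pi'})^{-1}
mu0 = np.zeros(S * A)
mu0[s0 * A: s0 * A + A] = pi2[s0]                  # start at s0, first action drawn from pi'
d = (1 - gamma) * mu0 @ np.linalg.inv(np.eye(S * A) - gamma * P_mat(pi2))

lhs = V2[s0] - V1[s0]                              # V^pi'(s0) - V^pi(s0)
rhs = (d * A1.reshape(-1)).sum() / (1 - gamma)     # (1/(1-gamma)) E_{(s,a)~d}[A^pi(s,a)]
assert np.isclose(lhs, rhs)
```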

4. Summary

  1. Keep in mind that every derivation and proof so far lives in the infinite-horizon discounted MDP setting, where a deterministic & stationary optimal policy exists; VI and PI are two classes of methods that help us find that optimal policy $\pi^\star$
  2. When the MDP is fully known, i.e. its dynamics are given, VI iterates a Q-function with both qualitative and quantitative convergence guarantees, and $\pi(s)=\argmax_a Q(s,a)$ yields a deterministic & stationary policy
  3. PI: explicitly model the policy, evaluate its Q-function, improve the policy, and repeat; the iteration converges to the optimal policy
  4. VI: the policy is modeled implicitly through the Q-function; iterate directly in Q-space toward $Q^\star$, then recover the optimal policy via $\pi^\star(s)=\argmax_aQ^\star(s,a)$
