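A brief review of Bellman optimality:
The VI (value iteration) algorithm:
The theory above shows that iterating $Q$ in this way converges, and qualitatively the convergence is geometric, at rate $\gamma^n$. How can the convergence rate be stated quantitatively, and which quantity is being measured?
The quantitative analysis answers the following question:
Note: the "performance" of a policy $\pi'$ is measured by its value function $V^{\pi'}(s)$.
This is because the objective $\mathbb E_{s_0,a_t\sim \pi',s_{t+1}\sim p}[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)]$ can be written as $\mathbb E_{s_0}[V^{\pi'}(s_0)]$. The distribution of the initial state $s_0$ can be specified freely, so the performance of a policy is determined by the optimization objective.
Quantitative theorem for VI:
Let $Q_n$ be the result of $n$ iterations of VI, and let $\pi_n(s)=\arg\max_a Q_n(s,a)$ be its greedy policy. Then the following theorem holds: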
$$V^{\pi_n}(s)\geq V^\star(s)-\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty} \quad \forall s\in S$$
Proof (the auxiliary terms added and subtracted in the second half use the VI iterate $Q_n$, since $\pi_n$ is greedy with respect to $Q_n$ and the VI convergence property bounds $\|Q_n-Q^\star\|_\infty$):
$$
\begin{aligned}
V^{\pi_n}(s)-V^\star(s)&=Q^{\pi_n}(s,\pi_n(s))-Q^\star(s,\pi^\star(s)) \quad \text{(deterministic policy)}\\
&= Q^{\pi_n}(s,\pi_n(s))\underbrace{-Q^\star(s,\pi_n(s))+Q^\star(s,\pi_n(s))}_{=0} - Q^\star(s,\pi^\star(s)) \\
&=\gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] + Q^\star(s,\pi_n(s)) - Q^\star(s,\pi^\star(s))\\
&=\gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] + Q^\star(s,\pi_n(s)) \underbrace{-Q_n(s,\pi_n(s))+Q_n(s,\pi_n(s))}_{=0}-Q^\star(s,\pi^\star(s))\\
&\geq\gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] + Q^\star(s,\pi_n(s))- Q_n(s,\pi_n(s))+\underbrace{Q_n(s,\pi^\star(s))}_{\leq\, Q_n(s,\pi_n(s))}-Q^\star(s,\pi^\star(s))\\
&= \gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] - \big(Q_n(s,\pi_n(s))-Q^\star(s,\pi_n(s))\big)+\big(Q_n(s,\pi^\star(s))-Q^\star(s,\pi^\star(s))\big)\\
&\geq \gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\left[V^{\pi_n}(s')-V^\star(s')\right] -2\gamma^n \|Q_0-Q^\star\|_{\infty} \quad \text{(VI convergence: $\|Q_n-Q^\star\|_{\infty}\leq \gamma^n\|Q_0-Q^\star\|_{\infty}$)}\\
&\geq \gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\Big[\gamma \mathbb E_{s''\sim p(\cdot\mid s',\pi_n(s'))}\left[V^{\pi_n}(s'')-V^\star(s'')\right] -2\gamma^n \|Q_0-Q^\star\|_{\infty}\Big] -2\gamma^n \|Q_0-Q^\star\|_{\infty} \quad \text{(apply the same bound at $s'$)}\\
&=\gamma^2 \mathbb E_{s'\sim p(\cdot\mid s,\pi_n(s))}\Big[ \mathbb E_{s''\sim p(\cdot\mid s',\pi_n(s'))}\left[V^{\pi_n}(s'')-V^\star(s'')\right] \Big] -2\gamma^n \|Q_0-Q^\star\|_{\infty}-2\gamma^{n+1} \|Q_0-Q^\star\|_{\infty}\\
&\geq \cdots \quad \text{(keep unrolling; the remainder $\gamma^k\,\mathbb E[\cdot]$ vanishes as $k\to\infty$)}\\
&\geq -2\gamma^n \|Q_0-Q^\star\|_{\infty}-2\gamma^{n+1} \|Q_0-Q^\star\|_{\infty}-2\gamma^{n+2} \|Q_0-Q^\star\|_{\infty}-\cdots\\
&= -\frac{2\gamma^n}{1-\gamma}\|Q_0-Q^\star\|_{\infty}
\end{aligned}
$$
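The bound can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the 3-state, 2-action MDP (`P`, `R`), the discount `GAMMA`, and all helper names are made up here and do not come from the original text. A long run of VI stands in for $Q^\star$, and $V^{\pi_n}$ is obtained by repeated Bellman expectation backups.

```python
# Toy MDP with 3 states and 2 actions; every number here is made up for illustration.
GAMMA = 0.9
# P[s][a][t] = p(t | s, a), R[s][a] = r(s, a)
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
]
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]]
S, A = 3, 2


def vi_step(Q):
    """One value-iteration backup: r(s,a) + gamma * E_{s'}[max_a' Q(s',a')]."""
    V = [max(row) for row in Q]
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(S))
             for a in range(A)] for s in range(S)]


def greedy(Q):
    """Deterministic greedy policy pi(s) = argmax_a Q(s,a)."""
    return [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]


def evaluate(pi, sweeps=2000):
    """V^pi via repeated Bellman expectation backups (converges since gamma < 1)."""
    V = [0.0] * S
    for _ in range(sweeps):
        V = [R[s][pi[s]] + GAMMA * sum(P[s][pi[s]][t] * V[t] for t in range(S))
             for s in range(S)]
    return V


def sup_dist(Q1, Q2):
    """Infinity-norm distance between two Q tables."""
    return max(abs(Q1[s][a] - Q2[s][a]) for s in range(S) for a in range(A))


Q0 = [[0.0] * A for _ in range(S)]
Q_star = Q0
for _ in range(2000):  # a long VI run used as a stand-in for Q*
    Q_star = vi_step(Q_star)
V_star = [max(row) for row in Q_star]


def bound_holds(n, tol=1e-9):
    """Check V^{pi_n}(s) >= V*(s) - 2 gamma^n / (1 - gamma) * ||Q_0 - Q*||_inf for all s."""
    Qn = Q0
    for _ in range(n):
        Qn = vi_step(Qn)
    V_pin = evaluate(greedy(Qn))
    slack = 2 * GAMMA ** n / (1 - GAMMA) * sup_dist(Q0, Q_star)
    return all(V_pin[s] >= V_star[s] - slack - tol for s in range(S))


print([bound_holds(n) for n in (0, 1, 5, 20, 50)])
```

Because the bound is a worst-case guarantee, the greedy policy on a small MDP like this one usually becomes optimal long before the slack term gets small.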
The PI (policy iteration) algorithm:
(The invertibility proof is given in Chapter 3.)
We now prove that $Q^{\pi_{n+1}}(s,a)\geq Q^{\pi_n}(s,a) \quad \forall s,a$:
$$
\begin{aligned}
Q^{\pi_{n+1}}(s,a)-Q^{\pi_n}(s,a)&=r(s,a)+\gamma \mathbb E_{s'\sim p,a'\sim \pi_{n+1}}[Q^{\pi_{n+1}}(s',a')]-r(s,a)-\gamma \mathbb E_{s'\sim p,a'\sim \pi_{n}}[Q^{\pi_{n}}(s',a')]\\
&=\gamma \mathbb E_{s'\sim p} \big[Q^{\pi_{n+1}}(s',\pi_{n+1}(s'))-Q^{\pi_{n}}(s',\pi_n(s'))\big] \quad \text{(deterministic policy)}\\
&= \gamma \mathbb E_{s'\sim p} \big[Q^{\pi_{n+1}}(s',\pi_{n+1}(s'))- Q^{\pi_n}(s',\pi_{n+1}(s'))+\underbrace{Q^{\pi_n}(s',\pi_{n+1}(s'))-Q^{\pi_{n}}(s',\pi_n(s'))}_{\geq 0}\big] \\
&\geq \gamma \mathbb E_{s'\sim p} \big[Q^{\pi_{n+1}}(s',\pi_{n+1}(s'))- Q^{\pi_n}(s',\pi_{n+1}(s'))\big] \quad \text{(start unrolling over the state space)}\\
&\geq \gamma \mathbb E_{s'\sim p} \Big[\gamma \mathbb E_{s''\sim p} \big[Q^{\pi_{n+1}}(s'',\pi_{n+1}(s''))- Q^{\pi_n}(s'',\pi_{n+1}(s''))\big] \Big] \\
&\geq \cdots \\
&\geq \lim_{k\to\infty}\gamma^k\, \mathbb E\big[Q^{\pi_{n+1}}(s_k,\pi_{n+1}(s_k))- Q^{\pi_n}(s_k,\pi_{n+1}(s_k))\big] = 0 \quad \text{(the Q-values are bounded and $\gamma<1$)}
\end{aligned}
$$
(Note that these derivations also show why the infinite-horizon discounted setting matters: they rely on $\gamma^\infty=0$, which holds because $\gamma<1$.)
Similarly, $V^{\pi_{n+1}}(s)\geq V^{\pi_n}(s) \quad \forall s$:
$$V^{\pi_{n+1}}(s)=Q^{\pi_{n+1}}(s,\pi_{n+1}(s))\geq Q^{\pi_n}(s,\pi_{n+1}(s))=\max_a Q^{\pi_n}(s,a)\geq Q^{\pi_n}(s,\pi_n(s))=V^{\pi_n}(s)$$
PE (policy evaluation) evaluates the current policy, and policy improvement guarantees that each updated policy really is an improvement, i.e. $Q^{\pi_{n+1}}(s,a)\geq Q^{\pi_n}(s,a) \quad \forall s,a$. Next we show that PI indeed converges, i.e. that $\pi_n$ reaches the optimal policy $\pi^\star$ after finitely many iterations.
$$
\begin{aligned}
V^{\star}(s)-V^{\pi_{n+1}}(s)&=\max_a \Big[r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\Big]-r(s,\pi_{n+1}(s))-\gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_{n+1}(s))}[V^{\pi_{n+1}}(s')]\\
&\leq \max_a \Big[r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\Big] - r(s,\pi_{n+1}(s))-\gamma \mathbb E_{s'\sim p(\cdot\mid s,\pi_{n+1}(s))}[V^{\pi_{n}}(s')] \quad \text{(using $V^{\pi_{n+1}}(s)\geq V^{\pi_n}(s)$)}\\
&=\max_a \Big[r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[V^\star(s')]\Big]-\max_a\Big[r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[V^{\pi_{n}}(s')]\Big] \quad \text{(by the definition $\pi_{n+1}(s)=\arg\max_a Q^{\pi_n}(s,a)$)}\\
&\leq \max_a \gamma \mathbb E_{s'\sim p(\cdot\mid s,a)}[V^\star(s')-V^{\pi_{n}}(s')]\\
&\leq \gamma \max_{s'} \big(V^\star(s')-V^{\pi_{n}}(s')\big) \quad \text{(an expectation is at most the maximum)}\\
&=\gamma \|V^\star-V^{\pi_n}\|_{\infty}\quad \text{(definition of the $\infty$-norm, since $V^\star\geq V^{\pi_n}$)}
\end{aligned}
$$
Explanation of the step from the second line to the third: $\pi_{n+1}(s)=\arg\max_a Q^{\pi_n}(s,a)$ means that
$$Q^{\pi_n}(s,\pi_{n+1}(s))=\max_a Q^{\pi_n}(s,a)=\max_a\Big[r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot\mid s,a )}[V^{\pi_n}(s')]\Big]=r(s,\pi_{n+1}(s))+\gamma \mathbb E_{s'\sim p(\cdot\mid s, \pi_{n+1}(s))}[V^{\pi_n}(s')]$$
In a little more detail: for every $s$ we have $V^{\star}(s)-V^{\pi_{n+1}}(s)\leq \gamma \|V^\star-V^{\pi_n}\|_{\infty}$, and the left-hand side is nonnegative, so $\|V^{\star}-V^{\pi_{n+1}}\|_\infty\leq \gamma \|V^\star-V^{\pi_n}\|_{\infty}$.
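Both PI properties, monotone improvement and the $\gamma$-contraction of the optimality gap, can be observed on a small example. This is a minimal sketch under assumptions: the toy MDP (`P`, `R`), the starting policy `[0, 1, 0]`, and the helper names are hypothetical, and policy evaluation is done by iterating the Bellman expectation backup rather than by a matrix inverse.

```python
# Toy MDP with 3 states and 2 actions; every number here is made up for illustration.
GAMMA = 0.9
# P[s][a][t] = p(t | s, a), R[s][a] = r(s, a)
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
]
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]]
S, A = 3, 2


def backup(V, s, a):
    """r(s,a) + gamma * E_{s' ~ p(.|s,a)}[V(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(S))


def evaluate(pi, sweeps=2000):
    """Policy evaluation (PE): V^pi for a deterministic policy pi."""
    V = [0.0] * S
    for _ in range(sweeps):
        V = [backup(V, s, pi[s]) for s in range(S)]
    return V


def improve(V):
    """Policy improvement: pi_{n+1}(s) = argmax_a Q^{pi_n}(s,a)."""
    return [max(range(A), key=lambda a: backup(V, s, a)) for s in range(S)]


def policy_iteration(pi0, iters=10):
    """Return [V^{pi_0}, V^{pi_1}, ...] from alternating PE and improvement."""
    pi, history = list(pi0), []
    for _ in range(iters):
        V = evaluate(pi)
        history.append(V)
        pi = improve(V)
    return history


def optimal_values(sweeps=2000):
    """V* via value iteration, used only as a reference point."""
    V = [0.0] * S
    for _ in range(sweeps):
        V = [max(backup(V, s, a) for a in range(A)) for s in range(S)]
    return V


history = policy_iteration([0, 1, 0])
V_star = optimal_values()
# gaps[n] = ||V* - V^{pi_n}||_inf (V* >= V^{pi_n}, so no absolute value is needed)
gaps = [max(V_star[s] - V[s] for s in range(S)) for V in history]
print(gaps)
```

With only $2^3$ deterministic policies in this toy problem, the gap reaches (numerically) zero after a handful of iterations, which matches the finite-convergence claim.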
For any nonzero vector $x\in \mathbb R^{|S||A|}$, where $P^\pi\in \mathbb R^{|S||A|\times |S||A|}$ and $I$ is the identity matrix, the invertibility of $I-\gamma P^\pi$ is shown as follows:
$$
\begin{aligned}
\|(I-\gamma P^{\pi})x\|_{\infty}&=\|x-\gamma P^{\pi}x\|_{\infty}\\
&\geq \|x\|_{\infty}-\|\gamma P^\pi x\|_{\infty} \quad \text{(the norm of a difference is at least the difference of the norms)}\\
&\geq \|x\|_{\infty}-\gamma \|x\|_\infty \quad \text{($P^\pi$ is a transition matrix, so $\|P^\pi x\|_\infty\leq\|x\|_\infty$)}\\
&=(1-\gamma)\|x\|_\infty >0
\end{aligned}
$$
Since every nonzero vector $x$ is mapped by $(I-\gamma P^{\pi})$ to a vector with strictly positive norm, the null space of the matrix is trivial, so it has full rank and is invertible.
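The inequality $\|(I-\gamma P^\pi)x\|_\infty\geq(1-\gamma)\|x\|_\infty$ is easy to probe numerically. The following is an illustrative sketch only: the 3x3 row-stochastic matrix `P_pi` is a made-up stand-in for $P^\pi$, and the random test vectors are drawn with a fixed seed.

```python
import random

GAMMA = 0.9
# Any row-stochastic matrix can stand in for P^pi here (made-up numbers).
P_pi = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]


def sup_norm(x):
    """Infinity norm of a vector."""
    return max(abs(v) for v in x)


def apply_I_minus_gamma_P(x):
    """Return (I - gamma * P_pi) x."""
    n = len(x)
    return [x[i] - GAMMA * sum(P_pi[i][j] * x[j] for j in range(n)) for i in range(n)]


def lower_bound_holds(x, tol=1e-12):
    """Check ||(I - gamma P) x||_inf >= (1 - gamma) ||x||_inf up to rounding."""
    return sup_norm(apply_I_minus_gamma_P(x)) >= (1 - GAMMA) * sup_norm(x) - tol


rng = random.Random(0)  # fixed seed so the check is reproducible
samples = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(1000)]
all_ok = all(lower_bound_holds(x) for x in samples)
print(all_ok)
```

The all-ones vector is a useful edge case: every row of `P_pi` sums to one, so it attains the bound with equality.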
First, an intuitive explanation of why $Q^{\pi}(s,a)=r(s,a)+\gamma \mathbb E_{s'\sim p(\cdot\mid s,a),a'\sim \pi(\cdot\mid s')}[Q^{\pi}(s',a')]$ can be written in the form $Q^{\pi}=r+\gamma P^{\pi} Q^{\pi}$. What exactly is $P^\pi$?
Suppose the state space has $m$ discrete values $\{s_1,s_2,\dots,s_m\}$ and the action space has $n$ discrete values $\{a_1,a_2,\dots,a_n\}$. Then $P^\pi\in\mathbb R^{mn\times mn}$ and $Q^\pi\in\mathbb R^{mn}$.
In the product $P^\pi Q^\pi$, the vector $Q^\pi$ holds the Q-values of all possible next state-action pairs. As for $P^\pi$, take for example the entry in row $(s_m,a_n)$ and column $(s_i,a_j)$: it is the probability that, under the current policy $\pi$, the current state-action pair $(s_m,a_n)$ transitions to the next state-action pair $(s_i,a_j)$, namely $p(s_i\mid s_m,a_n)\,\pi(a_j\mid s_i)$.
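To make the construction of $P^\pi$ concrete, here is a minimal sketch under assumptions: the toy MDP (`P`, `R`), the stochastic policy `PI`, and the flattening convention `idx(s, a) = s * A + a` are all choices made for this illustration. It builds the $mn\times mn$ matrix entry by entry and then solves $Q^\pi=r+\gamma P^\pi Q^\pi$ by fixed-point iteration.

```python
# Toy MDP with 3 states and 2 actions; every number here is made up for illustration.
GAMMA = 0.9
# P[s][a][t] = p(t | s, a), R[s][a] = r(s, a), PI[s][a] = pi(a | s)
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
]
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]]
PI = [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
S, A = 3, 2
N = S * A


def idx(s, a):
    """Flatten a state-action pair into a row/column index (a convention chosen here)."""
    return s * A + a


def build_P_pi():
    """P^pi[(s,a),(s',a')] = p(s' | s, a) * pi(a' | s')."""
    M = [[0.0] * N for _ in range(N)]
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for a2 in range(A):
                    M[idx(s, a)][idx(s2, a2)] = P[s][a][s2] * PI[s2][a2]
    return M


P_pi = build_P_pi()
r = [R[s][a] for s in range(S) for a in range(A)]  # same ordering as idx


def solve_Q(sweeps=2000):
    """Fixed-point iteration for Q^pi = r + gamma * P^pi Q^pi."""
    Q = [0.0] * N
    for _ in range(sweeps):
        Q = [r[i] + GAMMA * sum(P_pi[i][j] * Q[j] for j in range(N)) for i in range(N)]
    return Q


Q_pi = solve_Q()
print([round(q, 4) for q in Q_pi])
```

Each row of `P_pi` is a probability distribution over next state-action pairs, which is exactly the property the invertibility argument relies on.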
$$[(1-\gamma)(I-\gamma P^\pi)^{-1}]_{(s,a),(s',a')}=(1-\gamma)\sum_{h=0}^\infty \gamma^h\, {\mathbb P}^\pi(s_h=s',a_h=a'\mid s_0=s,a_0=a)$$
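The right-hand side is the (discounted) probability of visiting $(s',a')$ under policy $\pi$ when starting from $(s_0,a_0)=(s,a)$.
The left-hand side can be obtained by expanding it according to the definition of $P^\pi$.
A detailed proof will be added later.
Define $d^\pi_{s_0}(s,a)$ as the probability that a trajectory starting from the initial state $s_0$ and following policy $\pi$ visits $(s,a)$: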
$$d^\pi_{s_0}(s,a)=(1-\gamma)\sum_{h=0}^\infty \gamma^h\,\mathbb P^\pi(s_h=s,a_h=a\mid s_0)$$
where $\mathbb P^\pi(s_h=s,a_h=a\mid s_0)$ is the probability that a trajectory starting from $s_0$ and following $\pi$ visits $(s,a)$ at time step $h$.
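As a rough illustration, $d^\pi_{s_0}$ can be computed by propagating the state distribution forward and accumulating the discounted weights. The sketch below rests on assumptions: the toy MDP (`P`, `R`), the policy `PI`, and the truncation of the infinite sum at `horizon` steps are all choices made here. It also checks the standard identity $V^\pi(s_0)=\frac{1}{1-\gamma}\sum_{s,a}d^\pi_{s_0}(s,a)\,r(s,a)$, which is not stated in the text above but follows directly from the definition.

```python
# Toy MDP with 3 states and 2 actions; every number here is made up for illustration.
GAMMA = 0.9
# P[s][a][t] = p(t | s, a), R[s][a] = r(s, a), PI[s][a] = pi(a | s)
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
]
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]]
PI = [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]
S, A = 3, 2


def occupancy(s0, pi, horizon=2000):
    """d^pi_{s0}(s,a) = (1-gamma) * sum_h gamma^h Pr(s_h=s, a_h=a | s0), truncated."""
    d = [[0.0] * A for _ in range(S)]
    state_dist = [1.0 if s == s0 else 0.0 for s in range(S)]  # distribution of s_h
    weight = 1.0 - GAMMA  # equals (1 - gamma) * gamma^h at step h
    for _ in range(horizon):
        nxt = [0.0] * S
        for s in range(S):
            for a in range(A):
                p_sa = state_dist[s] * pi[s][a]  # Pr(s_h = s, a_h = a | s0)
                d[s][a] += weight * p_sa
                for t in range(S):
                    nxt[t] += p_sa * P[s][a][t]
        state_dist = nxt
        weight *= GAMMA
    return d


def evaluate(pi, sweeps=2000):
    """V^pi for a stochastic policy via repeated Bellman expectation backups."""
    V = [0.0] * S
    for _ in range(sweeps):
        V = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(S)))
                 for a in range(A)) for s in range(S)]
    return V


d0 = occupancy(0, PI)
total_mass = sum(d0[s][a] for s in range(S) for a in range(A))
value_from_d = sum(d0[s][a] * R[s][a] for s in range(S) for a in range(A)) / (1 - GAMMA)
print(total_mass, value_from_d, evaluate(PI)[0])
```

The factor $(1-\gamma)$ is what normalizes $d^\pi_{s_0}$ into a probability distribution over state-action pairs.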
For any two policies $\pi,\pi'$, the difference in performance can be expressed as:
$$V^{\pi'}(s_0)-V^{\pi}(s_0)=\frac{1}{1-\gamma}\, \mathbb E_{(s,a)\sim d^{\pi'}_{s_0}}[Q^{\pi}(s,a)-V^{\pi}(s)]=\frac{1}{1-\gamma}\, \mathbb E_{(s,a)\sim d^{\pi'}_{s_0}}[A^{\pi}(s,a)]$$
Proof (below, read $s$ as $s_0$ and $s'$ as $s$):
$$
\begin{aligned}
V^{\pi'}(s)-V^{\pi}(s) &=\sum_{a} \pi'(a \mid s) \cdot Q^{\pi'}(s, a)-\sum_{a} \pi(a \mid s) \cdot Q^{\pi}(s, a) \\
&=\sum_{a} \pi'(a \mid s) \cdot\left(Q^{\pi'}(s, a)-Q^{\pi}(s, a)\right)+\sum_{a}\left(\pi'(a \mid s)-\pi(a \mid s)\right) \cdot Q^{\pi}(s, a) \\
&=\sum_{a}\left(\pi'(a \mid s)-\pi(a \mid s)\right) \cdot Q^{\pi}(s, a)+\gamma \sum_{a} \pi'(a \mid s) \sum_{s'} \mathcal{P}\left(s' \mid s, a\right) \cdot\left[V^{\pi'}(s')-V^{\pi}(s')\right] \\
&=\frac{1}{1-\gamma} \sum_{s'} d_{s}^{\pi'}(s') \sum_{a'}\left(\pi'(a' \mid s')-\pi(a' \mid s')\right) \cdot Q^{\pi}(s', a') \quad \text{(unrolling recursively over the state space)} \\
&=\frac{1}{1-\gamma} \sum_{s'} d_{s}^{\pi'}(s') \sum_{a'} \pi'(a' \mid s') \cdot\left(Q^{\pi}(s', a')-V^{\pi}(s')\right) \\
&=\frac{1}{1-\gamma} \sum_{s'} d_{s}^{\pi'}(s') \sum_{a'} \pi'(a' \mid s') \cdot A^{\pi}(s', a')\\
&=\frac{1}{1-\gamma}\, \mathbb E_{(s',a')\sim d^{\pi'}_s}\left[A^{\pi}(s', a')\right]
\end{aligned}
$$
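The performance difference identity can also be checked numerically by computing both sides independently. This is a sketch under assumptions, not a definitive implementation: the toy MDP (`P`, `R`) and the two policies `PI_OLD` (playing the role of $\pi$) and `PI_NEW` (playing the role of $\pi'$) are made up, and the infinite sums are truncated at a large horizon.

```python
# Toy MDP with 3 states and 2 actions; every number here is made up for illustration.
GAMMA = 0.9
# P[s][a][t] = p(t | s, a), R[s][a] = r(s, a)
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]],
    [[0.3, 0.3, 0.4], [0.0, 0.2, 0.8]],
]
R = [[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]]
PI_OLD = [[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]  # pi(a | s)
PI_NEW = [[0.0, 1.0], [0.3, 0.7], [0.0, 1.0]]  # pi'(a | s)
S, A = 3, 2


def backup(V, s, a):
    """Q(s,a) = r(s,a) + gamma * E_{s' ~ p(.|s,a)}[V(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(S))


def evaluate(pi, sweeps=2000):
    """V^pi via repeated Bellman expectation backups."""
    V = [0.0] * S
    for _ in range(sweeps):
        V = [sum(pi[s][a] * backup(V, s, a) for a in range(A)) for s in range(S)]
    return V


def occupancy(s0, pi, horizon=2000):
    """d^pi_{s0}(s,a) = (1-gamma) * sum_h gamma^h Pr(s_h=s, a_h=a | s0), truncated."""
    d = [[0.0] * A for _ in range(S)]
    state_dist = [1.0 if s == s0 else 0.0 for s in range(S)]
    weight = 1.0 - GAMMA
    for _ in range(horizon):
        nxt = [0.0] * S
        for s in range(S):
            for a in range(A):
                p_sa = state_dist[s] * pi[s][a]
                d[s][a] += weight * p_sa
                for t in range(S):
                    nxt[t] += p_sa * P[s][a][t]
        state_dist = nxt
        weight *= GAMMA
    return d


def value_side(s0):
    """Left-hand side: V^{pi'}(s0) - V^{pi}(s0)."""
    return evaluate(PI_NEW)[s0] - evaluate(PI_OLD)[s0]


def advantage_side(s0):
    """Right-hand side: (1/(1-gamma)) * E_{(s,a) ~ d^{pi'}_{s0}}[A^pi(s,a)]."""
    V_old = evaluate(PI_OLD)
    d = occupancy(s0, PI_NEW)
    return sum(d[s][a] * (backup(V_old, s, a) - V_old[s])
               for s in range(S) for a in range(A)) / (1 - GAMMA)


print([(value_side(s0), advantage_side(s0)) for s0 in range(S)])
```

Note that the advantage is taken under the old policy $\pi$ while the visitation distribution comes from the new policy $\pi'$; swapping the two roles breaks the identity.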