Important Convergence Results in Reinforcement Learning (2): Convergence of Common RL Algorithms

The theoretical foundation of reinforcement learning is the MDP (Markov Decision Process); once the policy $\pi$ of an MDP is fixed, the MDP reduces to an ordinary Markov process. We first recall a few basic MDP concepts:

(1) The cumulative discounted return under policy $\pi$: $G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k}$, where $\gamma\in(0,1)$ is the discount factor and $R_t$ denotes the reward at time $t$.

(2) The action-value function under policy $\pi$, $q_{\pi}(s,a)$, defined as $q_{\pi}(s,a)=\mathbf{E}[G_t\mid s_t=s,\ a_t=a,\ \pi]$.

Its recursive form: $q_{\pi}(s,a)=r(s,a)+\gamma\sum_{s'}\mathbf{Pr}(s'\mid s,a)\,v_{\pi}(s')$, where $r(s,a)$ is the expected reward for taking action $a$ in state $s$.

(3) The state-value function under policy $\pi$, $v_{\pi}(s)$, defined as $v_{\pi}(s)=\mathbf{E}[G_t\mid s_t=s,\ \pi]$.

Its recursive form: $v_{\pi}(s)=\mathbf{E}_{a\sim\pi(\cdot\mid s)}[q_{\pi}(s,a)]=\sum_{a}\pi(a\mid s)\,q_{\pi}(s,a)$.

(4) The Bellman equation: $v_{\pi}(s)=\sum_{a}\pi(a\mid s)\Bigl(r(s,a)+\gamma\sum_{s'}\mathbf{Pr}(s'\mid s,a)\,v_{\pi}(s')\Bigr)$, whose matrix form is $v_{\pi}=r_{\pi}+\gamma P_{\pi}v_{\pi}$.
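As a quick worked example of the matrix form, here is a minimal sketch (Python with NumPy; the 3-state transition matrix `P_pi`, reward vector `r_pi`, and discount `gamma` are made up for illustration) that solves $v_{\pi}=(I-\gamma P_{\pi})^{-1}r_{\pi}$ directly:

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi: policy-induced transition
# matrix P_pi (rows sum to 1) and expected-reward vector r_pi, made up for illustration.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Closed-form solution of the matrix Bellman equation v = r_pi + gamma * P_pi v.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print("v_pi =", v_pi)

# Sanity check: v_pi is indeed a fixed point of the Bellman equation.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
```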

(Convergence of the TD algorithm) Under policy $\pi$ the agent interacts with the environment and generates a random trajectory $(s_0,a_0,r_1,s_1,a_1,r_2,s_2,\dots)$. Suppose the value function at time $t$ is updated by
$$v_{t+1}(s_t)=v_t(s_t)-\alpha_t(s_t)\bigl[v_t(s_t)-\bigl(r_{t+1}+\gamma v_t(s_{t+1})\bigr)\bigr],\qquad v_{t+1}(s)=v_t(s),\ \forall s\neq s_t$$
If the following conditions are satisfied:

(1) the state space $S$ is finite;

(2) $\forall s\in S$: $\sum_t\alpha_t(s)=\infty$ and $\sum_t\alpha_t^2(s)<\infty$;

then $\forall s\in S$, $v_t(s)\rightarrow v_{\pi}(s)$ with probability 1.
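The theorem can be illustrated with a small tabular TD(0) sketch. This is only a sketch under assumed quantities: it reuses the hypothetical `P_pi`, `r_pi`, and `gamma` from the example above, uses the expected reward $r(s)$ as a stand-in for sampled rewards, and takes $\alpha_t(s)=1/N_t(s)$, which satisfies condition (2):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical policy-induced dynamics (same as in the previous sketch).
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma, n_states = 0.9, 3

v = np.zeros(n_states)       # v_0
visits = np.zeros(n_states)  # visit counts N_t(s), used for alpha_t(s) = 1 / N_t(s)
s = 0
for t in range(100_000):
    s_next = rng.choice(n_states, p=P_pi[s])
    r = r_pi[s]                      # expected reward as a stand-in for the sampled r_{t+1}
    visits[s] += 1
    alpha = 1.0 / visits[s]          # satisfies sum_t alpha = inf, sum_t alpha^2 < inf
    # TD(0) update: only the visited state s_t changes; all other states are left as-is.
    v[s] = v[s] - alpha * (v[s] - (r + gamma * v[s_next]))
    s = s_next

v_true = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print("TD estimate:", v)
print("true v_pi  :", v_true)        # the two should be close
```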

Proof. At the times $t$ when the current state is $s_t=s$ we have $\alpha_t(s)>0$, and $\alpha_t(s)=0$ whenever $s\neq s_t$. The update can then be written uniformly for all states as
$$v_{t+1}(s)=v_t(s)-\alpha_t(s)\bigl[v_t(s)-\bigl(r_{t+1}+\gamma v_t(s')\bigr)\bigr]=(1-\alpha_t(s))\,v_t(s)+\alpha_t(s)\bigl[r_{t+1}+\gamma v_t(s')\bigr],\quad\forall s\in S,\ t=0,1,2,\dots$$
where $s'$ is the state reached from $s$ at time $t$. Let $\Delta_{k+1}(s)=v_{k+1}(s)-v_{\pi}(s)$; substituting this into the update gives
$$\Delta_{k+1}(s)=(1-\alpha_k(s))\,\Delta_k(s)+\alpha_k(s)\bigl[r_{k+1}+\gamma v_k(s_{k+1})-v_{\pi}(s)\bigr]$$
where $s_k=s$ is the state at time $k$ and $s_{k+1}$ is the state at time $k+1$. Define $e_k(s)=r_{k+1}+\gamma v_k(s_{k+1})-v_{\pi}(s)$, and stack the quantities over states as $\Delta_k=[\Delta_k(s_1),\dots,\Delta_k(s_{|S|})]^T$, $e_k=[e_k(s_1),\dots,e_k(s_{|S|})]^T$, $v_{\pi}=[v_{\pi}(s_1),\dots,v_{\pi}(s_{|S|})]^T$, with the history $H_k=\{\Delta_k,\Delta_{k-1},\dots,e_{k-1},\dots,\alpha_{k-1},\dots\}$. Since $\mathbf{E}[v_k(s_{k+1})\mid H_k]=\mathbf{E}_{s_{k+1}}[v_k(s_{k+1})\mid s_k=s]=\sum_{s'}\mathbf{Pr}(s'\mid s)\,v_k(s')$, we obtain
$$\|\mathbf{E}[e_k\mid H_k]\|_{\infty}=\|r_{\pi}+\gamma P_{\pi}v_k-v_{\pi}\|_{\infty}=\|r_{\pi}+\gamma P_{\pi}v_k-(r_{\pi}+\gamma P_{\pi}v_{\pi})\|_{\infty}=\gamma\|P_{\pi}(v_k-v_{\pi})\|_{\infty}\leq\gamma\|v_k-v_{\pi}\|_{\infty}=\gamma\|\Delta_k\|_{\infty}$$
Similarly one can show that $\mathbf{Var}[e_k\mid H_k]$ is bounded. By the extension of Dvoretzky's convergence theorem, $\Delta_k(s)\rightarrow 0$ w.p. 1, i.e. $v_k(s)\rightarrow v_{\pi}(s)$ w.p. 1 for every $s\in S$.
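The key inequality above relies only on $P_{\pi}$ being row-stochastic, which makes it non-expansive in the $\infty$-norm; a tiny numerical check of this step (again with the hypothetical `P_pi` from the earlier sketches) is:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical row-stochastic transition matrix P_pi from the earlier sketch.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])

# Because each row of P_pi sums to 1, ||P_pi x||_inf <= ||x||_inf for any x;
# this is exactly the step gamma*||P_pi(v_k - v_pi)||_inf <= gamma*||v_k - v_pi||_inf.
for _ in range(1000):
    x = rng.normal(size=3)
    assert np.linalg.norm(P_pi @ x, np.inf) <= np.linalg.norm(x, np.inf) + 1e-12
print("non-expansiveness in the inf-norm holds for all sampled vectors")
```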

(Convergence of TD with linear value-function approximation) Suppose the value function is approximated linearly as $\hat{v}(s;w)=\phi(s)^Tw$ with feature vector $\phi(s)\in\mathbb{R}^m$, and the TD algorithm is used to update $w$ so that $\hat{v}(s;w)$ approximates $v_{\pi}(s)$, i.e.
$$\min_{w}\mathbf{E}_{s\sim d(\cdot)}\bigl[(\hat{v}(s;w)-v_{\pi}(s))^2\bigr],$$
where the unknown $v_{\pi}(s)$ is replaced by the bootstrapped TD target, giving the surrogate objective
$$\min_{w}\mathbf{E}_{s_t\sim d(\cdot)}\bigl[\bigl(\hat{v}(s_t;w_t)-(r_{t+1}+\gamma\hat{v}(s_{t+1};w_t))\bigr)^2\bigr].$$
The parameters are updated by the following iteration:
$$w_{t+1}=w_t+\alpha_t\,\mathbf{E}_t\bigl[\bigl(r_{t+1}+\gamma\phi^T(s_{t+1})w_t-\phi^T(s_t)w_t\bigr)\phi(s_t)\bigr]$$
Then the following conclusions hold:

(1) The expectation in the iteration can be written as
$$\mathbf{E}_t\bigl[\bigl(r_{t+1}+\gamma\phi^T(s_{t+1})w_t-\phi^T(s_t)w_t\bigr)\phi(s_t)\bigr]=b-Aw_t$$
where $A=\Phi^TD(I-\gamma P_{\pi})\Phi\in\mathbb{R}^{m\times m}$ and $b=\Phi^TDr_{\pi}\in\mathbb{R}^m$, with
$$\Phi=\begin{pmatrix}\vdots\\ \phi^T(s)\\ \vdots\end{pmatrix}\in\mathbb{R}^{|S|\times m},\qquad D=\mathrm{diag}\bigl(d_{\pi}(s)\bigr)\in\mathbb{R}^{|S|\times|S|},$$
where $d_{\pi}$ is the stationary state distribution under $\pi$.
(2) If the gradient-descent-style iteration $w_{t+1}=w_t+\alpha_t(b-Aw_t)$ is applied and $\sum_t\alpha_t=\infty$, $\sum_t\alpha_t^2<\infty$ (or certain other step-size conditions hold), then $w_t\rightarrow w^*=A^{-1}b$, the TD fixed point, and $\Phi w^*$ is the corresponding linear approximation of $v_{\pi}$.
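To make conclusions (1) and (2) concrete, the sketch below (a sketch only: the feature matrix `Phi` is made up, and `d_pi` is taken as the stationary distribution of the hypothetical `P_pi`, computed as its left Perron eigenvector) builds $A=\Phi^TD(I-\gamma P_{\pi})\Phi$ and $b=\Phi^TDr_{\pi}$, then runs the expected update $w_{t+1}=w_t+\alpha(b-Aw_t)$ with a constant step size:

```python
import numpy as np

# Same hypothetical MDP as before.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.2, 0.2, 0.6]])
r_pi = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Hypothetical feature matrix Phi (|S| x m), here m = 2 < |S| = 3.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

# Stationary distribution d_pi of P_pi: left eigenvector for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_pi = d_pi / d_pi.sum()
D = np.diag(d_pi)

A = Phi.T @ D @ (np.eye(3) - gamma * P_pi) @ Phi   # A = Phi^T D (I - gamma P_pi) Phi
b = Phi.T @ D @ r_pi                               # b = Phi^T D r_pi
w_star = np.linalg.solve(A, b)                     # TD fixed point w* = A^{-1} b

# Expected-update iteration w_{t+1} = w_t + alpha (b - A w_t), constant step size.
w, alpha = np.zeros(2), 0.1
for _ in range(5000):
    w = w + alpha * (b - A @ w)
print("iterated w :", w)
print("w* = A^-1 b:", w_star)                      # the two should agree
```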

Proof. (1) The proof is omitted here; see the original book for the details.

(2) Let $\delta_t=w_t-w^*$ with $w^*=A^{-1}b$. Substituting into $w_{t+1}=w_t+\alpha_t(b-Aw_t)$ gives $\delta_{t+1}=(I-\alpha_tA)\delta_t$, i.e.
$$\delta_{t+1}=\prod_{k=0}^{t}(I-\alpha_kA)\,\delta_0$$
α t = α , ∀ t \alpha_t=\alpha,\forall t αt=α,t ∣ ∣ δ t + 1 ∣ ∣ ≤ ∣ ∣ ( I − α A ) ∣ ∣ t + 1 ∣ ∣ δ 0 ∣ ∣ ||\delta_{t+1}||\leq||(I-\alpha A)||^{t+1}||\delta_0|| ∣∣δt+1∣∣∣∣(IαA)t+1∣∣δ0∣∣,若 α > 0 \alpha >0 α>0 ρ ( I − α A ) < 1 \rho(I-\alpha A)<1 ρ(IαA)<1,则: δ t → 0 \delta_t \rightarrow 0 δt0,即: w t → w ∗ w_t \rightarrow w^* wtw.
