A note up front: this article assumes a fairly strong mathematical background; most of the convergence proofs are drawn from *Mathematical Foundation of Reinforcement Learning*. Understanding how the key convergence results of reinforcement learning are proved is valuable both for designing good RL algorithms and for seeing where the field's basic conclusions come from. This section presents several important convergence theorems from stochastic approximation theory, which will serve as the theoretical foundation for the convergence analysis of the major RL algorithms discussed later.
(Dvoretzky's convergence theorem) Consider the stochastic process $\omega_{k+1}=(1-\alpha_k)\omega_k+\beta_k\eta_k$, where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences and $\alpha_k,\beta_k\geq 0$ for all $k$. Suppose the following conditions hold:

(1) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, $\sum_{k=1}^{\infty}\beta_k^2<\infty$ uniformly with probability 1;

(2) $\mathbf{E}[\eta_k|H_k]=0$, $\mathbf{E}[\eta_k^2|H_k]\leq C$;

where $H_k=\{\omega_k,\omega_{k-1},\dots,\eta_{k-1},\eta_{k-2},\dots,\alpha_{k-1},\dots,\beta_{k-1},\dots\}$. Then $\omega_k\rightarrow 0$ with probability 1.
Proof. Assume $\alpha_k,\beta_k$ are fully determined by $H_k$, i.e. $\alpha_k=\alpha_k(H_k)$, $\beta_k=\beta_k(H_k)$. Then:

$$\mathbf E[\alpha_k|H_k]=\alpha_k,\quad \mathbf E[\beta_k|H_k]=\beta_k$$

Define $h_k=\omega_k^2$. Expanding $\omega_{k+1}^2=((1-\alpha_k)\omega_k+\beta_k\eta_k)^2$ gives:

$$\begin{aligned}
\mathbf E[h_{k+1}-h_k|H_k]&=\mathbf E[\omega_{k+1}^2-\omega_k^2|H_k]\\
&=\mathbf E[-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2\eta_k^2+(2-2\alpha_k)\beta_k\eta_k\omega_k|H_k]\\
&=-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2\,\mathbf E[\eta_k^2|H_k]+(2-2\alpha_k)\beta_k\omega_k\,\mathbf E[\eta_k|H_k]\\
&\leq-\alpha_k(2-\alpha_k)\omega_k^2+\beta_k^2 C
\end{aligned}$$
Since $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, we have $\alpha_k\rightarrow 0$. Hence, by the definition of the limit, there exists $N$ such that $\alpha_k<1$ for all $k>N$. Then $-\alpha_k(2-\alpha_k)\omega_k^2\leq 0$, and therefore $\mathbf E[h_{k+1}-h_k|H_k]\leq\beta_k^2 C$ for $k>N$.
Moreover, $\sum_{k=1}^{\infty}\beta_k^2=C_{\beta^2}<\infty$, hence:

$$\begin{aligned}
\sum_{k=1}^{\infty}\mathbf E[h_{k+1}-h_k|H_k]&=\Big(\sum_{k=1}^{N}+\sum_{k=N+1}^{\infty}\Big)\mathbf E[h_{k+1}-h_k|H_k]\\
&\leq\sum_{k=1}^{N}\mathbf E[h_{k+1}-h_k|H_k]+\sum_{k=N+1}^{\infty}\beta_k^2 C\\
&\leq\sum_{k=1}^{N}\mathbf E[h_{k+1}-h_k|H_k]+C_{\beta^2}C<\infty
\end{aligned}$$
Continuing, note that for $k>N$ we have $2-\alpha_k>1$, hence $\alpha_k\omega_k^2\leq\alpha_k(2-\alpha_k)\omega_k^2$:

$$\begin{aligned}
\sum_{k=1}^{\infty}\alpha_k\omega_k^2&=\sum_{k=1}^{N}\alpha_k\omega_k^2+\sum_{k=N+1}^{\infty}\alpha_k\omega_k^2\\
&\leq\sum_{k=1}^{N}\alpha_k\omega_k^2+\sum_{k=N+1}^{\infty}\alpha_k(2-\alpha_k)\omega_k^2\\
&\leq\sum_{k=1}^{N}\alpha_k\omega_k^2-\sum_{k=N+1}^{\infty}\mathbf E[h_{k+1}-h_k|H_k]+\sum_{k=N+1}^{\infty}\beta_k^2 C
\end{aligned}$$

where the last step uses the term-wise bound $\alpha_k(2-\alpha_k)\omega_k^2\leq-\mathbf E[h_{k+1}-h_k|H_k]+\beta_k^2 C$ derived above. The first sum is finite, and the two remaining series are finite by the previous step, so:

$$\sum_{k=1}^{\infty}\alpha_k\omega_k^2<\infty$$

Since $\sum_{k=1}^{\infty}\alpha_k=\infty$, this is only possible if $\omega_k^2$ approaches $0$ along a subsequence; combined with the almost-sure convergence of $h_k=\omega_k^2$ (guaranteed by the quasimartingale convergence theorem applied to the bound above), this gives $\omega_k\rightarrow 0$ with probability 1, which completes the proof.
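The theorem is easy to illustrate numerically. Below is a minimal Python sketch (not from the source; the step sizes $\alpha_k=\beta_k=1/k$ and the standard Gaussian noise are illustrative choices that satisfy conditions (1) and (2)):

```python
import random

def dvoretzky_process(w0=5.0, steps=200_000, seed=0):
    """Simulate w_{k+1} = (1 - a_k) w_k + b_k eta_k with a_k = b_k = 1/k
    and zero-mean, unit-variance noise; w_k should approach 0."""
    rng = random.Random(seed)
    w = w0
    for k in range(1, steps + 1):
        a = 1.0 / k                # sum a_k = inf, sum a_k^2 < inf
        b = 1.0 / k                # sum b_k^2 < inf
        eta = rng.gauss(0.0, 1.0)  # E[eta|H_k] = 0, E[eta^2|H_k] = 1 <= C
        w = (1 - a) * w + b * eta
    return w

print(abs(dvoretzky_process()))  # small, near 0
```

With these step sizes the iterate reduces to the average of the noise terms, whose standard deviation shrinks like $1/\sqrt{k}$, so the final value sits very close to 0.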
(Extension of Dvoretzky's convergence theorem) Let $X$ be a set with elements $x\in X$, and consider the stochastic process:

$$\Delta_{k+1}(x)=(1-\alpha_k(x))\Delta_k(x)+\beta_k(x)e_k(x)$$

Suppose the following conditions hold:

(1) the set $X$ is finite;

(2) $\sum_k\alpha_k(x)=\infty$, $\sum_k\alpha_k^2(x)<\infty$, $\sum_k\beta_k^2(x)<\infty$;

(3) $\mathbf E[\beta_k(x)|H_k]\leq\mathbf E[\alpha_k(x)|H_k]$ uniformly with probability 1;

(4) $\|\mathbf E[e_k|H_k]\|_{\infty}\leq\gamma\|\Delta_k\|_{\infty}$ with $\gamma\in(0,1)$, where $e_k=[e_k(x)]_{x\in X}^T$ and $\Delta_k=[\Delta_k(x)]_{x\in X}^T$;

(5) there exists $C\geq 0$ such that $\mathbf{Var}[e_k(x)|H_k]\leq C(1+\|\Delta_k\|_{\infty})^2$;

where $H_k=\{\Delta_k,\Delta_{k-1},\dots,e_{k-1},\dots,\alpha_{k-1},\dots,\beta_{k-1},\dots\}$. Then $\Delta_k(x)\rightarrow 0$ with probability 1 for every $x\in X$.
Proof. The proof is rather involved; see Jaakkola, T., M. I. Jordan and S. Singh, "On the Convergence of Stochastic Iterative Dynamic Programming Algorithms," Neural Computation, 1994, 6: pp. 1185–1201.
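This is the theorem behind the convergence of Q-learning-style updates. A minimal sketch of the process on a finite set (my own illustrative construction, not from the paper): here $e_k(x)$ is built with conditional mean $\gamma\Delta_k(x)$, which satisfies the contraction condition (4), plus bounded-variance noise, and $\beta_k=\alpha_k$ satisfies condition (3).

```python
import random

def extended_process(n_states=4, steps=100_000, gamma=0.5, seed=1):
    """Simulate Delta_{k+1}(x) = (1 - a_k) Delta_k(x) + a_k e_k(x) on a
    finite set, where E[e_k(x)|H_k] = gamma * Delta_k(x), so that
    ||E[e_k|H_k]||_inf <= gamma * ||Delta_k||_inf with gamma < 1."""
    rng = random.Random(seed)
    delta = [10.0] * n_states
    for k in range(1, steps + 1):
        a = 1.0 / k  # alpha_k = beta_k = 1/k
        for x in range(n_states):
            # contraction-bounded mean plus zero-mean, bounded-variance noise
            e = gamma * delta[x] + rng.gauss(0.0, 1.0)
            delta[x] = (1 - a) * delta[x] + a * e
    return max(abs(d) for d in delta)

print(extended_process())  # max |Delta_k(x)| over x, near 0
```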
(Robbins-Monro theorem) Consider the iteration $\omega_{k+1}=\omega_k-\alpha_k\bar g(\omega_k,\eta_k)$, where $\eta_k$ is a random variable and $\bar g(\omega_k,\eta_k)=g(\omega_k)+\eta_k$ is a noisy observation of $g$. Suppose the following conditions hold:

(1) $0<c_1\leq\nabla_{\omega}g(\omega)\leq c_2$ for all $\omega$;

(2) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$;

(3) $\mathbf E[\eta_k|H_k]=0$, $\mathbf E[\eta_k^2|H_k]<\infty$;

where $H_k=\{\omega_k,\omega_{k-1},\dots\}$. Then $\omega_k\rightarrow\omega^*$ with probability 1, where $\omega^*$ satisfies $g(\omega^*)=0$.
Proof. We have:

$$\begin{aligned}
\omega_{k+1}-\omega^*&=\omega_k-\omega^*-\alpha_k\bar g(\omega_k,\eta_k)\\
&=\omega_k-\omega^*-\alpha_k(g(\omega_k)+\eta_k)\\
&=\omega_k-\omega^*-\alpha_k(g(\omega_k)-g(\omega^*)+\eta_k)\\
&=\omega_k-\omega^*-\alpha_k\nabla_{\omega}g(\omega_k')(\omega_k-\omega^*)+\alpha_k(-\eta_k)\\
&=(1-\alpha_k\nabla_{\omega}g(\omega_k'))(\omega_k-\omega^*)+\alpha_k(-\eta_k)
\end{aligned}$$

where $\omega_k'=\theta\omega_k+(1-\theta)\omega^*$ for some $\theta\in[0,1]$, by the mean value theorem. Let $\Delta_k=\omega_k-\omega^*$; then $\Delta_{k+1}=(1-\alpha_k\nabla_{\omega}g(\omega_k'))\Delta_k+\alpha_k(-\eta_k)$.
Since $\sum_{k=1}^{\infty}\alpha_k\nabla_{\omega}g(\omega_k')\geq c_1\sum_{k=1}^{\infty}\alpha_k=\infty$, we have $\sum_{k=1}^{\infty}\alpha_k\nabla_{\omega}g(\omega_k')=\infty$. Meanwhile, $\sum_{k=1}^{\infty}(\alpha_k\nabla_{\omega}g(\omega_k'))^2\leq c_2^2\sum_{k=1}^{\infty}\alpha_k^2<\infty$, $\mathbf E[-\eta_k|H_k]=0$, and $\mathbf E[\eta_k^2|H_k]<\infty$. Hence Dvoretzky's convergence theorem applies (with $\alpha_k\nabla_{\omega}g(\omega_k')$ in the role of $\alpha_k$ and $\alpha_k$ in the role of $\beta_k$), giving $\Delta_k\rightarrow 0$ with probability 1, i.e. $\omega_k\rightarrow\omega^*$ with probability 1.
The Robbins-Monro theorem gives an easy way to estimate the expectation of a random variable: for i.i.d. random variables $\{x_k\}_{k=1}^{\infty}$ with expectation $\mathbf E[X]$, use the iteration $w_{k+1}=(1-\alpha_k)w_k+\alpha_k x_k$. If $\sum_k\alpha_k=\infty$ and $\sum_k\alpha_k^2<\infty$, then $w_k\rightarrow\mathbf E[X]$. (For the proof, simply take $g(w_k)=w_k-\mathbf E[X]$ and $\eta_k=\mathbf E[X]-x_k$.)
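This mean-estimation scheme is easy to check in code. With $\alpha_k=1/k$ the iterate is exactly the running sample mean, so it converges to $\mathbf E[X]$. A minimal sketch (illustrative; the distribution and function names are my own):

```python
import random

def rm_mean_estimate(samples):
    """w_{k+1} = (1 - a_k) w_k + a_k x_k with a_k = 1/k; this reproduces
    the running sample mean, which converges to E[X]."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        a = 1.0 / k  # sum a_k = inf, sum a_k^2 < inf
        w = (1 - a) * w + a * x
    return w

rng = random.Random(42)
data = [rng.gauss(3.0, 1.0) for _ in range(50_000)]
print(rm_mean_estimate(data))  # close to E[X] = 3
```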
(Convergence of stochastic gradient descent (SGD)) For the optimization problem $\min_w J(w)=\mathbf E_X[f(w,X)]$ with the parameter update $w_{k+1}=w_k-\alpha_k\nabla_w f(w_k,x_k)$, suppose the following conditions hold:

(1) $0<c_1\leq\nabla_w^2 f(w,X)\leq c_2$;

(2) $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$;

(3) $\{x_k\}_{k=1}^{\infty}$ are i.i.d. samples of $X$.

Then $w_k\rightarrow w^*$ with probability 1, where $\nabla_w\mathbf E_X[f(w^*,X)]=0$.
Proof. Let $g(w_k)=\nabla_w\mathbf E_X[f(w_k,X)]$, $\eta_k=\nabla_w f(w_k,x_k)-\nabla_w\mathbf E_X[f(w_k,X)]$, and $\bar g(w_k,\eta_k)=g(w_k)+\eta_k=\nabla_w f(w_k,x_k)$.

Since $0<c_1\leq\nabla_w^2 f(w,X)\leq c_2$, we have

$$c_1\leq\nabla_w g(w_k)=\nabla_w^2\mathbf E_X[f(w_k,X)]\leq c_2$$

Moreover, $\mathbf E[\eta_k|H_k]=\mathbf E[\nabla_w f(w_k,x_k)-\nabla_w\mathbf E_X[f(w_k,X)]\,|\,H_k]=0$, since the $x_k$ are i.i.d.; similarly, $\mathbf E[\eta_k^2|H_k]=\mathbf E[(\nabla_w f(w_k,x_k)-\nabla_w\mathbf E_X[f(w_k,X)])^2\,|\,H_k]<\infty$.

Therefore, by the Robbins-Monro theorem, $w_k\rightarrow w^*$ with probability 1, where $g(w^*)=\nabla_w\mathbf E_X[f(w^*,X)]=0$.
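As a sanity check of this result, consider the illustrative choice $f(w,x)=\frac{1}{2}(w-x)^2$ (my own example, not from the source), for which $\nabla_w f(w,x)=w-x$ and $\nabla_w^2 f=1$, so condition (1) holds with $c_1=c_2=1$ and the minimizer is $w^*=\mathbf E[X]$:

```python
import random

def sgd_quadratic(samples, w0=0.0):
    """SGD on J(w) = E[(w - X)^2 / 2]: grad f(w, x) = w - x, Hessian = 1,
    so c1 = c2 = 1 and the unique minimizer is w* = E[X]."""
    w = w0
    for k, x in enumerate(samples, start=1):
        alpha = 1.0 / k        # sum = inf, sum of squares < inf
        w -= alpha * (w - x)   # stochastic gradient step
    return w

rng = random.Random(7)
data = [rng.gauss(-2.0, 1.0) for _ in range(50_000)]
print(sgd_quadratic(data))  # close to w* = E[X] = -2
```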
(Contraction mapping theorem) Let $(X,d)$ be a non-empty complete metric space and $T:X\rightarrow X$ a contraction mapping, i.e. one satisfying

$$d(Tx_1,Tx_2)\leq C\,d(x_1,x_2),\quad x_1,x_2\in X,\ 0<C<1$$

Then $T$ has a unique fixed point $x_0\in X$ satisfying $Tx_0=x_0$, and it can be obtained by the iteration $x_{n+1}=Tx_n$, which converges: $x_n\rightarrow x_0$.

Proof. Omitted; this is the famous Banach fixed-point theorem, covered in any standard functional analysis textbook.
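The fixed-point iteration is straightforward to demonstrate. A minimal sketch (my own example): $\cos$ is a contraction on $[0,1]$ because $|\cos'| = |\sin|\leq\sin(1)<1$ there, so iterating it converges to its unique fixed point (the Dottie number).

```python
import math

def fixed_point(T, x0, tol=1e-12, max_iter=10_000):
    """Iterate x_{n+1} = T(x_n); for a contraction T this converges
    to the unique fixed point x0 with T(x0) = x0."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

x_star = fixed_point(math.cos, 0.5)
print(x_star)  # ~0.739085, satisfying cos(x*) = x*
```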
(Stationary distribution theorem for Markov chains) Let a Markov process have state space $S$ with $|S|$ states, and let $P_\pi\in\mathbb R^{|S|\times|S|}$ be the state transition probability matrix under policy $\pi$. Define the $k$-step transition probability matrix $P_\pi^k=\{p_{ij,\pi}^{(k)}\}_{|S|\times|S|}$, where

$$p_{ij,\pi}^{(k)}=\mathbf{Prob}(S_{t_k}=j\,|\,S_{t_0}=i,\pi)$$

It satisfies $P_\pi^k=P_\pi P_\pi^{k-1}$. For any initial state distribution $d_0\in\mathbb R^{|S|}$, the state distribution after $k$ steps under policy $\pi$ is $d_k$, with $d_k^T=d_0^T P_\pi^k$.

Suppose the following condition holds:

there exists a finite $k$ such that $[P_\pi^k]_{ij}>0$ for all pairs of states $i,j\in S$.

Then:

(1) $P_\pi^k\rightarrow\mathbf 1_{|S|}d_\pi^T$;

(2) $d_k^T\rightarrow d_0^T\mathbf 1_{|S|}d_\pi^T=d_\pi^T$;

(3) $d_\pi^T$ satisfies $d_\pi^T=d_\pi^T P_\pi$.

A Markov process satisfying this condition is called regular.

Proof. Omitted; this is the classical ergodic theorem for Markov processes, covered in any standard stochastic processes textbook.
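The three conclusions can be checked numerically by iterating $d_k^T=d_0^T P_\pi^k$ for a small regular chain. The transition matrix below is an illustrative example of my own (every entry is positive, so the regularity condition holds with $k=1$):

```python
def step_distribution(d, P):
    """One step of d^T P for a row-stochastic matrix P (nested lists)."""
    n = len(P)
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

# A regular 3-state transition matrix: all entries positive.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

d = [1.0, 0.0, 0.0]      # arbitrary initial distribution d_0
for _ in range(200):     # d_k^T = d_0^T P^k
    d = step_distribution(d, P)

print(d)  # approaches the stationary distribution d_pi with d_pi^T = d_pi^T P
```

The printed vector no longer changes under further multiplication by $P$, which is exactly conclusion (3).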
(Cauchy sequences converge in complete metric spaces) In a complete metric space $(X,d)$, every Cauchy sequence $\{x_n\}\subset X$ converges to some $x\in X$. A subspace $(Y,d|_{Y\times Y})\subset(X,d)$ is a complete metric space if and only if $Y$ is closed in $X$.

Proof. Omitted; this theorem is covered in any standard functional analysis textbook.
(Squeeze theorem) Suppose the sequences $\{X_n\},\{Y_n\},\{Z_n\}$ satisfy:

(1) $Y_n\leq X_n\leq Z_n$ for all $n>N_0\in\mathbb N^*$;

(2) $\lim_{n\rightarrow\infty}Y_n=\lim_{n\rightarrow\infty}Z_n=a<\infty$.

Then the limit of $\{X_n\}$ exists and $\lim_{n\rightarrow\infty}X_n=a$.

Proof. Since $\lim_{n\rightarrow\infty}Y_n=\lim_{n\rightarrow\infty}Z_n=a$, by the definition of the limit:

$\forall\varepsilon>0,\ \exists N_1$ such that $|Y_n-a|<\varepsilon$ for all $n>N_1$;

$\forall\varepsilon>0,\ \exists N_2$ such that $|Z_n-a|<\varepsilon$ for all $n>N_2$.

For any $\varepsilon>0$, take $N=\max\{N_0,N_1,N_2\}$; then for $n>N$ we have $X_n\geq Y_n>a-\varepsilon$ and $X_n\leq Z_n<a+\varepsilon$, so $|X_n-a|<\varepsilon$. Hence $\lim_{n\rightarrow\infty}X_n=a$.
(Convergence of running averages) If $\{a_n\}_{n=1}^{\infty}\subset\mathbb R$ is a convergent sequence with $\lim_{n\rightarrow\infty}a_n=a^*$, then $\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{k=1}^n a_k=a^*$.

Proof. Let $b_n=\frac{1}{n}\sum_{k=1}^n a_k$ and $\Delta_n=b_n-a^*=\frac{1}{n}\sum_{k=1}^n(a_k-a^*)$.

Given $\varepsilon>0$, since $a_n\rightarrow a^*$ there exists $M$ such that $|a_k-a^*|<\frac{\varepsilon}{2}$ for all $k>M$. Then for $n>M$:

$$|\Delta_n|\leq\frac{1}{n}\sum_{k=1}^{M}|a_k-a^*|+\frac{1}{n}\sum_{k=M+1}^{n}|a_k-a^*|\leq\frac{C_M}{n}+\frac{\varepsilon}{2}$$

where $C_M=\sum_{k=1}^{M}|a_k-a^*|$ is a fixed constant. Taking $N=\max\{M,\lceil\frac{2C_M}{\varepsilon}\rceil\}$, for $n>N$ we get $|\Delta_n|<\varepsilon$.

This shows $\Delta_n\rightarrow 0$, i.e. $b_n=\frac{1}{n}\sum_{k=1}^n a_k\rightarrow a^*$, as claimed.
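A quick numerical check of this running-average property (the sequence $a_n=2+1/n$ is my own illustrative choice):

```python
def cesaro_means(seq):
    """Running averages b_n = (1/n) * sum_{k<=n} a_k of a sequence."""
    total = 0.0
    out = []
    for n, a in enumerate(seq, start=1):
        total += a
        out.append(total / n)
    return out

# a_n = 2 + 1/n converges to 2; its running averages converge to 2 as well.
a = [2.0 + 1.0 / n for n in range(1, 100_001)]
b = cesaro_means(a)
print(b[-1])  # close to 2
```

Note that the averages converge more slowly than the sequence itself: here $b_n-2=H_n/n$ decays like $\ln n/n$, whereas $a_n-2=1/n$.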