A Markov decision process (MDP) is a mathematical framework for modeling sequential decision problems and is the foundation of reinforcement learning. In an MDP, a decision maker faces a sequence of states and actions; taking different actions in a state yields different rewards, and the decision maker's goal is to find a policy that maximizes the long-run cumulative reward.
The defining feature of an MDP is the Markov property: the distribution of the next state and reward depends only on the current state and action, not on the earlier history.

An MDP can be represented by the five-tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where:

- $\mathcal{S}$ is the set of states;
- $\mathcal{A}$ is the set of actions;
- $p$ is the dynamics (transition) function;
- $r$ is the reward;
- $\gamma$ is the discount factor.
At time $t$, the probability of moving from state $S_t = s$ under action $A_t = a$ to next state $S_{t+1} = s'$ with reward $R_{t+1} = r$ is

$$\Pr\left[S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a\right]$$
In an MDP, the decision maker follows a policy $\pi$, which maps each state to an action (or to a distribution over actions). Given a policy, the state-value function $v_\pi(s)$ and the action-value function $q_\pi(s, a)$ can be computed to evaluate how good the policy is, and algorithms such as value iteration and policy iteration can be used to find an optimal policy that maximizes the long-run cumulative reward.
For a finite Markov decision process, the dynamics of the MDP can be defined as the function $p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$:

$$p\left(s', r \mid s, a\right) = \Pr\left[S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a\right]$$
The vertical bar "$\mid$" in $p$ is borrowed from the bar used in conditional-probability notation.
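As a concrete illustration (a made-up two-state MDP, not from the text), the dynamics $p(s', r \mid s, a)$ can be stored as a table mapping each state-action pair to a distribution over (next state, reward) pairs. Since $p$ is a conditional probability, each row must sum to 1:

```python
# Hypothetical dynamics p(s', r | s, a) for a tiny two-state MDP,
# stored as {(s, a): {(s_next, r): probability}}. All numbers are made up.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 2.0): 0.5, ("s0", 0.0): 0.5},
}

# For every (s, a), the probabilities over (s', r) must sum to 1.
for (s, a), dist in p.items():
    total = sum(dist.values())
    assert abs(total - 1.0) < 1e-12, (s, a, total)
print("all (s, a) rows sum to 1")
```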
Using the definition of the dynamics, the following derived quantities can be obtained.
State transition probability (1.1):

$$p\left(s' \mid s, a\right) = \Pr\left[S_{t+1}=s' \mid S_t=s,\ A_t=a\right] = \sum_{r \in \mathcal{R}} p\left(s', r \mid s, a\right), \quad s \in \mathcal{S},\ a \in \mathcal{A},\ s' \in \mathcal{S}$$
Expected reward for a given state-action pair (1.2):

$$r(s, a) = \mathrm{E}\left[R_{t+1} \mid S_t=s,\ A_t=a\right] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p\left(s', r \mid s, a\right), \quad s \in \mathcal{S},\ a \in \mathcal{A}$$
Expected reward for a given state-action-next-state triple (1.3):

$$r\left(s, a, s'\right) = \mathrm{E}\left[R_{t+1} \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right] = \sum_{r \in \mathcal{R}} r\, \frac{p\left(s', r \mid s, a\right)}{p\left(s' \mid s, a\right)}, \quad s \in \mathcal{S},\ a \in \mathcal{A},\ s' \in \mathcal{S}$$
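The three derived quantities (1.1)–(1.3) can be computed directly from such a dynamics table. A minimal Python sketch, using a hypothetical one-row table (all names and numbers are illustrative):

```python
# Hypothetical dynamics p(s', r | s, a): {(s, a): {(s_next, r): probability}}.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
}

def transition_prob(p, s, a, s_next):
    # Formula (1.1): p(s' | s, a) = sum_r p(s', r | s, a)
    return sum(pr for (sn, r), pr in p[(s, a)].items() if sn == s_next)

def expected_reward(p, s, a):
    # Formula (1.2): r(s, a) = sum_{s', r} r * p(s', r | s, a)
    return sum(r * pr for (sn, r), pr in p[(s, a)].items())

def expected_reward_given_next(p, s, a, s_next):
    # Formula (1.3): r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)
    denom = transition_prob(p, s, a, s_next)
    return sum(r * pr for (sn, r), pr in p[(s, a)].items() if sn == s_next) / denom

print(transition_prob(p, "s0", "a0", "s1"))            # 0.3
print(expected_reward(p, "s0", "a0"))                  # 0.3  (= 0.7*0 + 0.3*1)
print(expected_reward_given_next(p, "s0", "a0", "s1")) # 1.0
```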
Derivation of formula (1.3). By the definition of conditional expectation,

$$\mathrm{E}\left[R_{t+1} \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right] = \sum_{r} r \cdot \Pr\left(R_{t+1}=r \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right)$$
Using the product rule for conditional probability, $P(A, B) = P(A \mid B) \cdot P(B)$, the conditional probability above can be rewritten as

$$\begin{aligned} & \Pr\left(R_{t+1}=r \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right) \\ &= \frac{\Pr\left(R_{t+1}=r,\ S_t=s,\ A_t=a,\ S_{t+1}=s'\right)}{\Pr\left(S_t=s,\ A_t=a,\ S_{t+1}=s'\right)} \\ &= \frac{\Pr\left(R_{t+1}=r,\ S_{t+1}=s' \mid S_t=s,\ A_t=a\right) \cdot \Pr\left(S_t=s,\ A_t=a\right)}{\Pr\left(S_{t+1}=s' \mid S_t=s,\ A_t=a\right) \cdot \Pr\left(S_t=s,\ A_t=a\right)} \\ &= \frac{\Pr\left(R_{t+1}=r,\ S_{t+1}=s' \mid S_t=s,\ A_t=a\right)}{\Pr\left(S_{t+1}=s' \mid S_t=s,\ A_t=a\right)} \end{aligned}$$
That is,

$$\Pr\left(R_{t+1}=r \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right) = \frac{\Pr\left(S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a\right)}{\Pr\left(S_{t+1}=s' \mid S_t=s,\ A_t=a\right)}$$

Substituting this back into the expectation gives

$$\mathrm{E}\left[R_{t+1} \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right] = \sum_{r} r \cdot \frac{\Pr\left(S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a\right)}{\Pr\left(S_{t+1}=s' \mid S_t=s,\ A_t=a\right)}$$

By the definition of the dynamics,

$$\Pr\left(S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a\right) = p\left(s', r \mid s, a\right)$$

and likewise, by the definition of the state transition probability,

$$\Pr\left(S_{t+1}=s' \mid S_t=s,\ A_t=a\right) = p\left(s' \mid s, a\right)$$
Substituting these two identities yields

$$\mathrm{E}\left[R_{t+1} \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right] = \sum_{r} r \cdot \frac{p\left(s', r \mid s, a\right)}{p\left(s' \mid s, a\right)}, \quad s \in \mathcal{S},\ a \in \mathcal{A},\ s' \in \mathcal{S}$$

which completes the derivation of $r(s, a, s')$.
Suppose an episode terminates at step $T$. The return $G_t$ from step $t$ ($t < T$) is defined as the sum of the subsequent rewards:

$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$$
Introducing a discount factor $\gamma \in [0, 1]$, the return $G_t$ can be written as

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{\tau=0}^{+\infty} \gamma^\tau R_{t+\tau+1}$$
Here $R_t$ is the reward at step $t$, $\gamma$ is the discount factor, and $t$ is the current step.
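A sketch of computing the discounted return from a finite reward sequence (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    # G_t = sum_{tau >= 0} gamma^tau * R_{t+tau+1},
    # where rewards = [R_{t+1}, R_{t+2}, ...] is a finite episode tail.
    return sum(gamma**tau * r for tau, r in enumerate(rewards))

rewards = [1.0, 2.0, 3.0]  # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=1.0))  # 6.0  (undiscounted sum)
print(discounted_return(rewards, gamma=0.5))  # 2.75 (= 1 + 0.5*2 + 0.25*3)
```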
A policy is defined as the probability of choosing each action in each state:

$$\pi(a \mid s) = \Pr\left[A_t=a \mid S_t=s\right], \quad s \in \mathcal{S},\ a \in \mathcal{A}$$
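A stochastic policy of this form can be sketched as a table of per-state action distributions, from which $A_t \sim \pi(\cdot \mid S_t)$ is sampled (the states, actions, and probabilities below are hypothetical):

```python
import random

# Hypothetical stochastic policy pi(a | s), stored as {s: {a: probability}}.
pi = {"s0": {"a0": 0.8, "a1": 0.2}, "s1": {"a0": 1.0}}

def sample_action(pi, s, rng=random):
    # Draw A_t ~ pi(. | S_t = s).
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

random.seed(0)
counts = {"a0": 0, "a1": 0}
for _ in range(10_000):
    counts[sample_action(pi, "s0")] += 1
print(counts["a0"] / 10_000)  # empirical frequency, close to pi("a0" | "s0") = 0.8
```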
Based on the definition of the return, value functions can be defined. For a given policy $\pi$, the state-value function is $v_\pi(s) = \mathrm{E}_\pi\left[G_t \mid S_t=s\right]$ and the action-value function is $q_\pi(s, a) = \mathrm{E}_\pi\left[G_t \mid S_t=s,\ A_t=a\right]$.
Here the identity $\Pr\left[G_{t+1}=g \mid S_t=s,\ A_t=a,\ S_{t+1}=s'\right] = \Pr\left[G_{t+1}=g \mid S_{t+1}=s'\right]$ uses the Markov property.
Recall the definition given earlier:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{\tau=0}^{+\infty} \gamma^\tau R_{t+\tau+1}$$
Inspecting the terms shows that

$$G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots = \frac{G_t - R_{t+1}}{\gamma}$$
In other words, $G_{t+1}$ and $G_t$ satisfy the recursion

$$G_t = R_{t+1} + \gamma G_{t+1}$$
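This recursion lets all returns of an episode be computed in a single backward sweep instead of re-summing the tail for every $t$. A sketch with made-up rewards:

```python
def returns_backward(rewards, gamma):
    # One backward pass using G_t = R_{t+1} + gamma * G_{t+1},
    # with G_T = 0 after termination.
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))  # [G_0, G_1, ..., G_{T-1}]

rewards = [1.0, 2.0, 3.0]  # R_1, R_2, R_3 of a short episode
G = returns_backward(rewards, gamma=0.5)
print(G)  # [2.75, 3.5, 3.0]
```

Each entry indeed satisfies the recursion: for example, `G[0] == rewards[0] + 0.5 * G[1]`.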
Recalling formula (1.2),
$$\begin{aligned} q_\pi(s, a) &= \mathrm{E}_\pi\left[G_t \mid S_t=s,\ A_t=a\right] \\ &= \mathrm{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t=s,\ A_t=a\right] \\ &= \mathrm{E}_\pi\left[R_{t+1} \mid S_t=s,\ A_t=a\right] + \mathrm{E}_\pi\left[\gamma G_{t+1} \mid S_t=s,\ A_t=a\right] \\ &= \mathrm{E}_\pi\left[R_{t+1} \mid S_t=s,\ A_t=a\right] + \gamma\, \mathrm{E}_\pi\left[G_{t+1} \mid S_t=s,\ A_t=a\right] \\ &= \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p\left(s', r \mid s, a\right) + \gamma \sum_{s'} p\left(s' \mid s, a\right) v_\pi\left(s'\right) \\ &= \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \gamma v_\pi\left(s'\right)\right] \end{aligned}$$
This gives the result

$$q_\pi(s, a) = \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \gamma v_\pi\left(s'\right)\right], \quad s \in \mathcal{S},\ a \in \mathcal{A}$$

and, expanding $v_\pi(s') = \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')$,

$$q_\pi(s, a) = \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \gamma \sum_{a'} \pi\left(a' \mid s'\right) q_\pi\left(s', a'\right)\right], \quad s \in \mathcal{S},\ a \in \mathcal{A}$$
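These Bellman expectation equations can be solved numerically by fixed-point iteration (iterative policy evaluation). A minimal sketch on a hypothetical two-state MDP (all states, actions, probabilities, and rewards are made up):

```python
# Iterative policy evaluation: repeatedly apply
#   v_pi(s)    = sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]
# and then read off
#   q_pi(s,a)  = sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')].
gamma = 0.9
p = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s0", "a1"): {("s1", 1.0): 1.0},
    ("s1", "a0"): {("s0", 0.0): 1.0},
    ("s1", "a1"): {("s1", 2.0): 0.5, ("s0", 0.0): 0.5},
}
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.5, "a1": 0.5}}
states = ["s0", "s1"]

v = {s: 0.0 for s in states}
for _ in range(1000):  # the gamma-contraction converges geometrically
    v = {
        s: sum(
            pi[s][a] * sum(pr * (r + gamma * v[sn]) for (sn, r), pr in p[(s, a)].items())
            for a in pi[s]
        )
        for s in states
    }

q = {
    (s, a): sum(pr * (r + gamma * v[sn]) for (sn, r), pr in p[(s, a)].items())
    for s in states for a in pi[s]
}
print({s: round(v[s], 3) for s in states})
```

At the fixed point, $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$ holds, which serves as a consistency check on the computed values.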