Deep Reinforcement Learning Summary [1]

  • Introduction
  • Deep Learning Basics
  • Markov Decision Processes
    • Action-Value Function
    • Derivation of the Action-Value Function
    • State-Value Function
    • Testing the Simulation Environment
    • Analysis of the Inverted-Pendulum Environment
  • Appendix 1: Required Packages
  • References

Introduction

As analyzed earlier, a function can be approximated by either continuous or discontinuous functions. Such approximations are usually written as an infinite sum of functions, which cannot be implemented in a program, so the series has to be truncated. For a continuous polynomial approximation, truncation means keeping only the first few terms and determining the coefficients by least squares. For a single-variable function, for example, the polynomial approximation is:

$$f(x)=a_0+a_1(x-x_0)+\cdots+a_n(x-x_0)^n+\cdots$$

Keeping only the first three terms gives:

$$f(x)=a_0+a_1\times(x-x_0)+a_2\times(x-x_0)^2$$
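
For concreteness, here is a minimal sketch of that truncated least-squares fit (the target function $e^x$, the sample grid, and the use of NumPy's `lstsq` are illustrative assumptions, not from the original):

```python
# A minimal sketch: fit the truncated degree-2 polynomial
# f(x) ~= a0 + a1*(x - x0) + a2*(x - x0)^2 to samples of a smooth function
# by least squares, as described above.
import numpy as np

x0 = 0.0
x = np.linspace(-1.0, 1.0, 50)
y = np.exp(x)                     # the function being approximated (an arbitrary example)

# Design matrix whose columns are (x - x0)^0, (x - x0)^1, (x - x0)^2
A = np.vstack([(x - x0) ** k for k in range(3)]).T
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

approx = A @ coeffs
print("coefficients a0, a1, a2:", coeffs)
print("max absolute error on the sample grid:", np.abs(approx - y).max())
```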

Deep Learning Basics

Logistic regression as used here is only the most basic form of logistic classification, because it is essentially a linear function used for classification rather than a window function. Based on my earlier analysis of approximating arbitrary functions with step functions, the form $\epsilon(\omega x+b)$ greatly reduces the number of parameters to estimate: the original function no longer has to be described as an infinite sum of terms, which is intuitive and easy to understand. This is why most approximations are built from step functions defined by such boundary curves.
Reference:

  • Summary of step-function classes
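
As a rough illustration of this idea, here is a hedged sketch in Python: a logistic unit $\sigma(\omega x+b)$ standing in for the soft step $\epsilon(\omega x+b)$, with the transition point and the weights below being my own illustrative choices, not from the original.

```python
# Minimal sketch: a logistic unit sigma(w*x + b) approaches a hard step at
# x = -b/w as |w| grows; this is the "soft step" used to build piecewise
# approximations of a function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-1.0, 1.0, 9)
for w in (1.0, 10.0, 100.0):
    b = -w * 0.2                  # places the transition near x = 0.2
    print(f"w={w:6.1f}:", np.round(sigmoid(w * x + b), 3))
```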

Markov Decision Processes

The policy function is a stochastic policy: which action is taken in the current state is random. The state-transition function is stochastic as well: given the same state and action, the resulting next state can still differ.
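
The following is a minimal sketch of these two sources of randomness (the two-state, two-action MDP and its numbers are illustrative assumptions, not from the original):

```python
# Minimal sketch: both the policy pi(a|s) and the transition p(s'|s, a)
# are sampled, so the same starting state can produce different trajectories.
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([[0.7, 0.3],        # pi[s, a] = P(A=a | S=s)
               [0.4, 0.6]])
p  = np.array([[[0.9, 0.1],       # p[s, a, s'] = P(S'=s' | S=s, A=a)
                [0.2, 0.8]],
               [[0.5, 0.5],
                [0.1, 0.9]]])

s = 0
for t in range(5):
    a = rng.choice(2, p=pi[s])          # stochastic policy
    s_next = rng.choice(2, p=p[s, a])   # stochastic transition
    print(f"t={t}: s={s}, a={a}, s'={s_next}")
    s = s_next
```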

Action-Value Function

The action-value function of a Markov decision process is the expected return received when the current state and the action taken are both fixed.
For any state $S_t$ and action $A_t$, the resulting reward is $R_t$, and the expected reward is:

$$E(R_t)=\int_{(S_t,A_t)\in\Omega}R_t\,P(S_t,A_t)\,d(S_t,A_t)$$

The state and action at the current time step depend on the state and action at the previous step, that is:

$$(S_t,A_t)\sim(S_{t-1},A_{t-1})$$

Hence:

$$P(S_t,A_t)=P(S_t,A_t|s_{t-1},a_{t-1})$$

So, given the state $s_t$ and the action $a_t$ at time $t$, the reward at the next step, $R_{t+1}$, is random, and its mean is:

$$\begin{split}
E(R_{t+1})&=\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+1}P(S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\\
&=\sum_{S_{t+1}}\sum_{A_{t+1}}R_{t+1}(S_{t+1},A_{t+1})P(S_{t+1}|S_t,A_t)P(A_{t+1}|S_{t+1})
\end{split}$$
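
A minimal numeric check of this one-step formula (the toy MDP, reward table, and Monte-Carlo comparison are illustrative assumptions, not from the original):

```python
# Minimal sketch: check
#   E[R_{t+1}] = sum_{s'} sum_{a'} r(s', a') * P(s'|s_t, a_t) * P(a'|s')
# against a Monte-Carlo estimate obtained by sampling the chain one step.
import numpy as np

rng = np.random.default_rng(1)

pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                       # pi[s, a]
p  = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])         # p[s, a, s']
r  = np.array([[1.0, 0.0],
               [0.0, 2.0]])                       # r[s, a]

s_t, a_t = 0, 1

# Exact value by enumerating (s_{t+1}, a_{t+1}).
exact = sum(p[s_t, a_t, s1] * pi[s1, a1] * r[s1, a1]
            for s1 in range(2) for a1 in range(2))

# Monte-Carlo estimate.
samples = []
for _ in range(100_000):
    s1 = rng.choice(2, p=p[s_t, a_t])
    a1 = rng.choice(2, p=pi[s1])
    samples.append(r[s1, a1])

print("enumeration:", exact, " monte-carlo:", np.mean(samples))
```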

$$\begin{split}
E(R_{t+2})&=\int_{(S_{t+2},A_{t+2})\in\Omega}\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2},S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\,d(S_{t+2},A_{t+2})\\
&=\int_{(S_{t+2},A_{t+2})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2}|s_t,a_t)\,d(S_{t+2},A_{t+2})\\
&=\sum_{S_{t+1}}\sum_{A_{t+1}}\sum_{S_{t+2}}\sum_{A_{t+2}}R_{t+2}(S_{t+2},A_{t+2})P(S_{t+1}|s_t,a_t)P(A_{t+1}|S_{t+1})P(S_{t+2}|S_{t+1},A_{t+1})P(A_{t+2}|S_{t+2})
\end{split}$$

My long-standing point of confusion was how $P(S_{t+2},A_{t+2},S_{t+1},A_{t+1}|s_t,a_t)$ can be turned into a product. On that reading, the middle step above looks wrong, i.e., the expectation is not simply

$$\neq\int_{(S_{t+2},A_{t+2})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2}|s_t,a_t)\,d(S_{t+2},A_{t+2})$$

Or rather, what that expression really means is:

$$P(S_{t+2},A_{t+2}|s_t,a_t)=\sum_{S_{t+1}}\sum_{A_{t+1}}P(S_{t+1}|s_t,a_t)P(A_{t+1}|S_{t+1})P(S_{t+2}|S_{t+1},A_{t+1})P(A_{t+2}|S_{t+2})$$

This expression clearly does not let us eliminate $S_{t+1},A_{t+1}$ outright; that is, it cannot simply be replaced by $P(S_{t+2}|s_t,a_t)P(A_{t+2}|S_{t+2})$ without carrying out the marginalization.
The same holds for rewards further in the future:

$$E(R_{t+n})=\int_{(S_{t+n},A_{t+n})\in\Omega}R_{t+n}P(S_{t+n},A_{t+n}|s_t,a_t)\,d(S_{t+n},A_{t+n})$$

In other words, the conditioning variables in the conditional probability can only be quantities that are already known, and the mean is then taken accordingly. So taking the expectation of the whole return $U_t$ gives:

$$\begin{split}
E(U_t|S_t=s_t,A_t=a_t)&=E(R_t+\gamma R_{t+1}+\cdots+\gamma^nR_{t+n}|S_t=s_t,A_t=a_t)\\
&=E(R_t|S_t=s_t,A_t=a_t)+E(\gamma R_{t+1}|S_t=s_t,A_t=a_t)+\cdots+E(\gamma^nR_{t+n}|S_t=s_t,A_t=a_t)\\
&=r_t+\gamma\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+1}P(S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\\
&\quad+\gamma^2\int_{(S_{t+2},A_{t+2})\in\Omega}\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2},S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\,d(S_{t+2},A_{t+2})\\
&\quad+\cdots+\gamma^n\int_{(S_{t+n},A_{t+n})\in\Omega}\cdots\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+n}P(S_{t+n},A_{t+n},\cdots,S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\cdots d(S_{t+n},A_{t+n})\\
&=r_t+\int_{(S_{t+n},A_{t+n})\in\Omega}\cdots\int_{(S_{t+1},A_{t+1})\in\Omega}(\gamma R_{t+1}+\cdots+\gamma^nR_{t+n})P(S_{t+n},A_{t+n},\cdots,S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\cdots d(S_{t+n},A_{t+n})
\end{split}$$

In other words, taking the expectation of the return means taking the expectation under the joint conditional distribution given the current state and action. Note that the formulas above omit the fact that $R_{t+1}=R_{t+1}(S_{t+1},A_{t+1})$, i.e., each reward is a function of the corresponding state and action. The weakness of the analysis above is that the probability expressions are not explicit; the picture becomes much clearer when we reason directly from state-action trajectories, as done next.

Derivation of the Action-Value Function

(Figure 1)
From the figure we can read off the following relations:

$$R_t=R(S_t,A_t)$$

$$A_t\sim P(A_t|S_t)$$

$$S_{t+1}\sim P(S_{t+1}|S_t,A_t)$$

$$U_t=R_t+\gamma R_{t+1}+\cdots+\gamma^nR_{t+n}$$

So once the state $s_t$ and action $a_t$ at the starting time $t$ are given, the subsequent states and actions are still uncertain and can be written as:

$$s_t,\,a_t,\,S_{t+1}\sim P(S_{t+1}|S_t,A_t),\,R(S_t,A_t),\,A_{t+1}\sim P(A_{t+1}|S_{t+1}),\,\cdots,\,S_{t+n}\sim P(S_{t+n}|S_{t+n-1},A_{t+n-1}),\,R(S_{t+n-1},A_{t+n-1}),\,A_{t+n}\sim P(A_{t+n}|S_{t+n}),\,R(S_{t+n},A_{t+n})$$
For one particular realized sequence:

$$s_t,a_t,r_t,s_{t+1},a_{t+1},r_{t+1},\cdots,s_{t+n},a_{t+n},r_{t+n}$$

the probability of this sequence is:

$$P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})$$

and the return of this sequence is:

$$U(t)=r(s_t,a_t)+\gamma^1 r(s_{t+1},a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})$$

Taking the weighted average over all possible sequences gives the expectation:

$$\sum_{s_{t+1}}\sum_{a_{t+1}}\sum_{s_{t+2}}\sum_{a_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+n}}\big[r(s_t,a_t)+\gamma^1 r(s_{t+1},a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})\big]\big[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})\big]$$
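
This weighted sum can be computed directly in a small finite MDP. The sketch below (toy two-state, two-action MDP with illustrative numbers, not from the original) enumerates every trajectory of a short horizon and sums return times probability, exactly as the formula above describes:

```python
# Minimal sketch: brute-force the expected return
#   Q(s_t, a_t) = sum over trajectories of [return of trajectory] * [probability of trajectory]
import itertools
import numpy as np

pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                       # pi[s, a]
p  = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])         # p[s, a, s']
r  = np.array([[1.0, 0.0],
               [0.0, 2.0]])                       # r[s, a]
gamma, n = 0.9, 3                                 # horizon t .. t+n

def q_bruteforce(s_t, a_t):
    total = 0.0
    # each trajectory is a tuple (s_{t+1}, a_{t+1}, ..., s_{t+n}, a_{t+n})
    for traj in itertools.product(range(2), repeat=2 * n):
        prob, ret = 1.0, r[s_t, a_t]
        s_prev, a_prev = s_t, a_t
        for k in range(n):
            s_k, a_k = traj[2 * k], traj[2 * k + 1]
            prob *= p[s_prev, a_prev, s_k] * pi[s_k, a_k]
            ret  += gamma ** (k + 1) * r[s_k, a_k]
            s_prev, a_prev = s_k, a_k
        total += ret * prob
    return total

print("Q(s=0, a=0) over a 3-step horizon:", q_bruteforce(0, 0))
```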

To verify that the expression above really is the expected return, we need to check whether it accounts for every possible return together with its probability. Start with the case where a single action changes, say $a_{t+1}$; the sequence then becomes:

$$s_t,a_t,r(s_t,a_t),s_{t+1},a_{t+1}^*,r(s_{t+1},a_{t+1}^*),s_{t+2},a_{t+2},\cdots,s_{t+n},a_{t+n},r(s_{t+n},a_{t+n})$$

The probability of this sequence is:

$$P(s_{t+1}|s_t,a_t)\times P(a_{t+1}^*|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})$$

Compared with the original sequence, changing one action changes two factors: the state-to-action factor, i.e. the policy $P(a_{t+1}^*|s_{t+1})$, and the state-transition factor $P(s_{t+2}|s_{t+1},a_{t+1}^*)$.
The return becomes:

$$U(t)=r(s_t,a_t)+\gamma r(s_{t+1},a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})$$

Only one term of the return changes, while two factors of the corresponding probability change. Therefore the expected return must contain at least these two terms:

$$\big[r(s_t,a_t)+\gamma r(s_{t+1},a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})\big]\times\big[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}^*|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})\big]+\big[r(s_t,a_t)+\gamma^1 r(s_{t+1},a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})\big]\times\big[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})\big]$$

Both of these terms can indeed be found in the expectation written above. Moreover, the same holds for every $a_{t+1}$ in the action space, so the expectation must include all possible values of $a_{t+1}$, that is:

$$\sum_{a_{t+1}^*}\big[r(s_t,a_t)+\gamma r(s_{t+1},a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})\big]\times\big[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}^*|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})\big]$$

When the action at some other position $t+k$ changes, the expression is analogous:

$$\sum_{a_{t+k}^*}\big[r(s_t,a_t)+\cdots+\gamma^k r(s_{t+k},a_{t+k}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})\big]\times\big[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(a_{t+k}^*|s_{t+k})\times P(s_{t+k+1}|s_{t+k},a_{t+k}^*)\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})\big]$$

Next, consider a change in a state. Suppose the state at some intermediate position changes; the trajectory becomes:

$$s_t,a_t,r(s_t,a_t),s_{t+1}^*,a_{t+1},r(s_{t+1}^*,a_{t+1}),\cdots,s_{t+n},a_{t+n},r(s_{t+n},a_{t+n})$$

The return is now:

$$U(t)=r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})$$

and the corresponding probability is:

$$P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}|s_{t+1}^*)\times P(s_{t+2}|s_{t+1}^*,a_{t+1})\times\cdots\times P(a_{t+n}|s_{t+n})$$

We can see that changing a state affects three factors, which is consistent with the diagram above: a state influences the action taken, the next state, and the reward received, whereas an action influences only two things, the reward received and the next state. In terms of probabilities, changing a state changes the probability of generating the action, the probability of generating the next state, and the probability of reaching the current state from the previous one; changing an action only changes the state-to-action probability and the state-action-to-next-state probability.
When a state and an action change simultaneously:

$$s_t,a_t,r(s_t,a_t),s_{t+1}^*,a_{t+1}^*,r(s_{t+1}^*,a_{t+1}^*),s_{t+2},a_{t+2},r(s_{t+2},a_{t+2}),\cdots,r(s_{t+n},a_{t+n})$$

The return is:

$$U(t)=r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})$$

and the corresponding probability is:

$$P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(a_{t+n}|s_{t+n})$$

The factors that change in the probability product are similar to the state-only case. When every state and action after time $t$ varies:

$$s_t,a_t,r(s_t,a_t),s_{t+1}^*,a_{t+1}^*,r(s_{t+1}^*,a_{t+1}^*),\cdots,s_{t+n}^*,a_{t+n}^*,r(s_{t+n}^*,a_{t+n}^*)$$

the return is:

$$U(t)=r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)$$

and the corresponding probability is:

$$P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)$$

Therefore the expectation is the sum, over all possible states $s$ and actions $a$, of each return multiplied by its probability:

$$\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}\big[r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)\big]\times\big[P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)\big]$$

We can also argue in the forward direction. A trajectory consists of a sequence of states and actions, and it is easy to see that there are cases in which only one action changes while everything else stays fixed, and likewise cases in which only one state changes; in other words, the state and action at every position can vary independently. The total number of sequences is therefore the product of the sizes of the ranges enumerated by $\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}$. The expectation is thus the sum of every return multiplied by its probability, which is exactly the form given above. The formula above,

$$\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}\big[r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)\big]\times\big[P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)\big]$$

can be rearranged further by substituting the reward terms and summing each one separately:

$$\begin{split}
E[U(t)]&=r(s_t,a_t)+\sum_{s_{t+1}^*}\sum_{a_{t+1}^*}\gamma r(s_{t+1}^*,a_{t+1}^*)\,P(s_{t+1}^*|s_t,a_t)\,P(a_{t+1}^*|s_{t+1}^*)\\
&\quad+\sum_{s_{t+1}^*}\sum_{s_{t+2}^*}\sum_{a_{t+1}^*}\sum_{a_{t+2}^*}\gamma^2 r(s_{t+2}^*,a_{t+2}^*)\,P(s_{t+1}^*|s_t,a_t)\,P(a_{t+1}^*|s_{t+1}^*)\,P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\,P(a_{t+2}^*|s_{t+2}^*)\\
&\quad+\cdots+\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}\gamma^n r(s_{t+n}^*,a_{t+n}^*)\,\big[P(s_{t+1}^*|s_t,a_t)\,P(a_{t+1}^*|s_{t+1}^*)\,P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\,P(a_{t+2}^*|s_{t+2}^*)\cdots P(a_{t+n}^*|s_{t+n}^*)\big]
\end{split}$$

We can write this in recursive form. For the next time step, with $s_{t+1},a_{t+1}$ given:

$$E[U(t+1)]=r(s_{t+1},a_{t+1})+\sum_{s_{t+2}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+2}^*}\cdots\sum_{a_{t+n}^*}\big[\gamma r(s_{t+2}^*,a_{t+2}^*)+\cdots+\gamma^{n-1} r(s_{t+n}^*,a_{t+n}^*)\big]\times\big[P(s_{t+2}^*|s_{t+1},a_{t+1})\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)\big]$$

which gives the recursion:

$$E[U(t)]=r(s_t,a_t)+\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma\,E[U(t+1)]\times P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})$$

If the policy, i.e. the state-to-action function, maximizes $E[U(t)]$, then it must also maximize $\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma E[U(t+1)]\times P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})$, and hence $\sum_{s_{t+1}}\sum_{a_{t+1}}E[U(t+1)]\times P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})$. This is the decomposition behind the Bellman equation; bringing in the state-value function yields the Bellman equation in its other forms.
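
The recursion above can be checked numerically. The sketch below (same toy two-state, two-action MDP with illustrative numbers as in the earlier sketches, not from the original) compares direct trajectory enumeration with the recursive form; the two printed values should agree:

```python
# Minimal sketch: check the recursion
#   Q_n(s,a) = r(s,a) + gamma * sum_{s'} sum_{a'} p(s'|s,a) * pi(a'|s') * Q_{n-1}(s',a')
# against direct trajectory enumeration over a finite horizon.
import itertools
import numpy as np

pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                       # pi[s, a]
p  = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])         # p[s, a, s']
r  = np.array([[1.0, 0.0],
               [0.0, 2.0]])                       # r[s, a]
gamma = 0.9

def q_enum(s, a, n):
    """Expected return over horizon n by enumerating all trajectories."""
    total = 0.0
    for traj in itertools.product(range(2), repeat=2 * n):
        prob, ret, sp, ap = 1.0, r[s, a], s, a
        for k in range(n):
            sk, ak = traj[2 * k], traj[2 * k + 1]
            prob *= p[sp, ap, sk] * pi[sk, ak]
            ret  += gamma ** (k + 1) * r[sk, ak]
            sp, ap = sk, ak
        total += ret * prob
    return total

def q_recursive(s, a, n):
    """Same quantity via the recursion derived above."""
    if n == 0:
        return r[s, a]
    return r[s, a] + gamma * sum(p[s, a, s1] * pi[s1, a1] * q_recursive(s1, a1, n - 1)
                                 for s1 in range(2) for a1 in range(2))

print(q_enum(0, 0, 3), q_recursive(0, 0, 3))   # should agree
```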

State-Value Function

Next we consider the return when the state is fixed but the action is not.
Suppose the state-action trajectory is:

$$s_t,a_t^*,s_{t+1}^*,\cdots,s_{t+n}^*,a_{t+n}^*$$

As before, any single action may vary on its own while everything else stays fixed; the state and action at every position vary independently. The mean is therefore:

$$E=\sum_{a_t^*}\sum_{s_{t+1}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}\big[r(s_t,a_t^*)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)\big]P(a_t^*|s_t)P(s_{t+1}^*|s_t,a_t^*)\cdots P(a_{t+n}^*|s_{t+n}^*)$$

$$E=\sum_{a_t^*}r(s_t,a_t^*)P(a_t^*|s_t)+\sum_{a_t^*}\sum_{s_{t+1}^*}\sum_{a_{t+1}^*}\gamma r(s_{t+1}^*,a_{t+1}^*)P(a_t^*|s_t)P(s_{t+1}^*|s_t,a_t^*)P(a_{t+1}^*|s_{t+1}^*)+\cdots+\sum_{a_t^*}\cdots\sum_{a_{t+n}^*}\gamma^n r(s_{t+n}^*,a_{t+n}^*)P(a_t^*|s_t)\cdots P(a_{t+n}^*|s_{t+n}^*)$$
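
In other words, the state value is the policy-weighted average of the action values. A minimal sketch of this relation (same toy MDP with illustrative numbers, not from the original):

```python
# Minimal sketch: V_n(s) = sum_a pi(a|s) * Q_n(s, a), i.e. the state value is
# the action value averaged under the policy.
import numpy as np

pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                       # pi[s, a]
p  = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.5, 0.5], [0.1, 0.9]]])         # p[s, a, s']
r  = np.array([[1.0, 0.0],
               [0.0, 2.0]])                       # r[s, a]
gamma = 0.9

def q(s, a, n):
    if n == 0:
        return r[s, a]
    return r[s, a] + gamma * sum(p[s, a, s1] * pi[s1, a1] * q(s1, a1, n - 1)
                                 for s1 in range(2) for a1 in range(2))

def v(s, n):
    return sum(pi[s, a] * q(s, a, n) for a in range(2))

print("V(0) over a 3-step horizon:", v(0, 3))
```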

To remove the dependence on the policy, one takes the policy that maximizes this expectation as the optimal policy.
We can also count how many parameters need to be estimated. Suppose there are $n$ states and $m$ actions; then the state-to-action (policy) table has $nm$ parameters, and the state-action-to-state (transition) table has $n^2m$ parameters.
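For instance (illustrative numbers, not from the original), with $n=10$ states and $m=4$ actions the policy table holds $10\times4=40$ entries, while the transition table holds $10\times4\times10=400$.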
