As analyzed earlier, a function can be approximated with either continuous or discontinuous functions, but such approximations are usually written as an infinite sum of functions, which cannot be realized in a program, so the series has to be truncated. For a continuous polynomial approximation, truncation means keeping only the first few terms and determining their coefficients by least squares. For example, for a function of a single variable, the polynomial approximation is:
$$
f(x)=a_0+a_1(x-x_0)+\cdots+a_n(x-x_0)^n+\cdots
$$
Truncating to the first 3 terms gives:
$$
f(x)=a_0+a_1\times(x-x_0)+a_2\times(x-x_0)^2
$$
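As a quick illustration, here is a minimal sketch (assuming NumPy; the target function `np.exp` and the expansion point `x0 = 1.0` are arbitrary choices for demonstration, not taken from the text) that fits the three retained coefficients by least squares:

```python
import numpy as np

# Fit the truncated expansion f(x) ≈ a0 + a1*(x - x0) + a2*(x - x0)^2
# to samples of an arbitrary target function (here exp) by least squares.
x0 = 1.0
x = np.linspace(0.0, 2.0, 50)
y = np.exp(x)                                   # the function being approximated

# Design matrix with columns 1, (x - x0), (x - x0)^2
X = np.vstack([(x - x0) ** k for k in range(3)]).T
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)  # estimated a0, a1, a2
print(coeffs)
```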
Logistic regression here is only the most basic form of a logistic classifier, because it is essentially a linear function used for classification rather than a window function. Based on my earlier analysis of approximating arbitrary functions with step functions, the form $\epsilon(\omega x+b)$ greatly reduces the number of parameters to estimate: the original function no longer has to be described as an infinite sum of terms, which is intuitive and easy to understand. This is why most approaches approximate with step functions built from such boundary curves.
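For intuition, a tiny sketch (NumPy assumed; the boundary parameters $\omega=5$ and $b=-2$ are made-up values) comparing the hard step $\epsilon(\omega x+b)$ with the logistic function, the smooth version of that step that logistic regression actually uses:

```python
import numpy as np

def step(z):
    """Unit step epsilon(z): 0 for z < 0, 1 for z >= 0."""
    return (z >= 0).astype(float)

def sigmoid(z):
    """Logistic function, a smooth approximation of the unit step."""
    return 1.0 / (1.0 + np.exp(-z))

w, b = 5.0, -2.0                        # made-up boundary parameters
x = np.linspace(-2.0, 2.0, 9)
print(step(w * x + b))                  # hard decision boundary at x = -b/w
print(np.round(sigmoid(w * x + b), 3))  # smooth version of the same boundary
```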
The policy function is a stochastic policy, because which action is taken in the current state is random. The state transition is also stochastic: starting from the same state and action, different next states can be reached.
The action-value function of a Markov decision process is the expected return received when the current state and the action taken are both fixed.
We know that for any state $S_t$ and action $A_t$ the resulting reward is $R_t$, and the expectation of the reward is:

$$
E(R_t)=\int_{(S_t,A_t)\in\Omega}R_t\times P(S_t,A_t)\,d(S_t,A_t)
$$
The state and action at the current time depend on those at the previous time, that is:

$$
(S_t,A_t)\sim(S_{t-1},A_{t-1})
$$

Therefore:

$$
P(S_t,A_t)=P(S_t,A_t|s_{t-1},a_{t-1})
$$
So when the state $s_t$ and the action $a_t$ at time $t$ are known, the reward $R_{t+1}$ at the next time step is still random, and its mean is:
$$
\begin{split}
E(R_{t+1})&=\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+1}P(S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\\
&=\sum_{S_{t+1}}\sum_{A_{t+1}}R_{t+1}(S_{t+1},A_{t+1})P(S_{t+1}|s_t,a_t)P(A_{t+1}|S_{t+1})
\end{split}
$$
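As a concrete check of this formula, a minimal NumPy sketch on a made-up two-state, two-action MDP (all transition, policy, and reward numbers below are hypothetical):

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = P(a | s), the policy
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],                 # R[s, a] = r(s, a)
              [0.5, 2.0]])

s_t, a_t = 0, 1                           # the given state and action at time t

# E[R_{t+1} | s_t, a_t] = sum_{s'} P(s'|s_t,a_t) sum_{a'} P(a'|s') r(s', a')
expected_r_next = sum(P[s_t, a_t, s1] * pi[s1, a1] * R[s1, a1]
                      for s1 in range(2) for a1 in range(2))
print(expected_r_next)
```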
$$
\begin{split}
E(R_{t+2})&=\int_{(S_{t+2},A_{t+2})\in\Omega}\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2},S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\,d(S_{t+2},A_{t+2})\\
&=\int_{(S_{t+2},A_{t+2})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2}|s_t,a_t)\,d(S_{t+2},A_{t+2})\\
&=\sum_{S_{t+1}}\sum_{A_{t+1}}\sum_{S_{t+2}}\sum_{A_{t+2}}R_{t+2}(S_{t+2},A_{t+2})P(S_{t+1}|s_t,a_t)P(A_{t+1}|S_{t+1})P(S_{t+2}|S_{t+1},A_{t+1})P(A_{t+2}|S_{t+2})
\end{split}
$$
My long-standing confusion was how $P(S_{t+2},A_{t+2},S_{t+1},A_{t+1}|s_t,a_t)$ can be turned into a product, and whether the middle step above,

$$
\int_{(S_{t+2},A_{t+2})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2}|s_t,a_t)\,d(S_{t+2},A_{t+2}),
$$

is really valid. The meaning of that step is:

$$
P(S_{t+2},A_{t+2}|s_t,a_t)=\sum_{S_{t+1}}\sum_{A_{t+1}}P(S_{t+1}|s_t,a_t)P(A_{t+1}|S_{t+1})P(S_{t+2}|S_{t+1},A_{t+1})P(A_{t+2}|S_{t+2})
$$

At first glance it seems that $S_{t+1},A_{t+1}$ cannot be eliminated from this sum. In fact they can: the factor $P(A_{t+2}|S_{t+2})$ does not involve the summation variables, so it can be pulled out, and what remains is exactly the marginal $P(S_{t+2}|s_t,a_t)$. The right-hand side therefore equals $P(S_{t+2}|s_t,a_t)P(A_{t+2}|S_{t+2})$, and the middle step is valid; the confusion came entirely from the probability notation being unclear.
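A quick numerical check of this factorization on the same kind of made-up two-state, two-action example (P and pi are hypothetical):

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = P(a | s)
               [0.2, 0.8]])
s_t, a_t = 0, 1

# Left side: sum over the intermediate (s1, a1)
lhs = np.zeros((2, 2))                    # lhs[s2, a2] = P(s2, a2 | s_t, a_t)
for s1 in range(2):
    for a1 in range(2):
        for s2 in range(2):
            for a2 in range(2):
                lhs[s2, a2] += P[s_t, a_t, s1] * pi[s1, a1] * P[s1, a1, s2] * pi[s2, a2]

# Right side: P(s2 | s_t, a_t) * P(a2 | s2)
p_s2 = np.einsum('i,ij,ijk->k', P[s_t, a_t], pi, P)   # two-step state marginal
rhs = p_s2[:, None] * pi
print(np.allclose(lhs, rhs))              # True: the marginal does factor
```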
The same form holds for rewards further in the future:

$$
E(R_{t+n})=\int_{(S_{t+n},A_{t+n})\in\Omega}R_{t+n}P(S_{t+n},A_{t+n}|s_t,a_t)\,d(S_{t+n},A_{t+n})
$$
In other words, the conditioning part of a conditional probability can only contain what is already known; the mean is then computed under that conditional distribution. So taking the mean of the whole return $U_t$ gives:
$$
\begin{split}
E(U_t|S_t=s_t,A_t=a_t)&=E(R_t+\gamma R_{t+1}+\cdots+\gamma^nR_{t+n}|S_t=s_t,A_t=a_t)\\
&=E(R_t|S_t=s_t,A_t=a_t)+E(\gamma R_{t+1}|S_t=s_t,A_t=a_t)+\cdots+E(\gamma^nR_{t+n}|S_t=s_t,A_t=a_t)\\
&=r_t+\gamma\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+1}P(S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\\
&\quad+\gamma^2\int_{(S_{t+2},A_{t+2})\in\Omega}\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+2}P(S_{t+2},A_{t+2},S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\,d(S_{t+2},A_{t+2})\\
&\quad+\cdots+\gamma^n\int_{(S_{t+n},A_{t+n})\in\Omega}\cdots\int_{(S_{t+1},A_{t+1})\in\Omega}R_{t+n}P(S_{t+n},A_{t+n},\cdots,S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\cdots d(S_{t+n},A_{t+n})\\
&=r_t+\int_{(S_{t+n},A_{t+n})\in\Omega}\cdots\int_{(S_{t+1},A_{t+1})\in\Omega}(\gamma R_{t+1}+\cdots+\gamma^nR_{t+n})P(S_{t+n},A_{t+n},\cdots,S_{t+1},A_{t+1}|s_t,a_t)\,d(S_{t+1},A_{t+1})\cdots d(S_{t+n},A_{t+n})
\end{split}
$$
In other words, taking the expectation of the return means taking the expectation under the joint conditional distribution given the current state and action. Note that the formulas above leave implicit that $R_{t+1}=R_{t+1}(S_{t+1},A_{t+1})$, i.e., each reward is a function of the corresponding state and action. The weakness of the analysis above is that the probability expressions are unclear; analyzing the probabilities and the return from the state-action trajectory is much clearer.
From the diagram we can read off the following relations (a sampling sketch based on them follows the list):
$$
R_t=R(S_t,A_t)
$$

$$
A_t\sim P(A_t|S_t)
$$

$$
S_{t+1}\sim P(S_{t+1}|S_t,A_t)
$$

$$
U_t=R_t+\gamma R_{t+1}+\cdots+\gamma^nR_{t+n}
$$
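A minimal sampling sketch built directly from these relations (NumPy assumed; the two-state, two-action MDP numbers are made up): it draws trajectories with $A\sim P(A|S)$ and $S'\sim P(S'|S,A)$, accumulates the discounted return, and averages over many samples.

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = P(a | s)
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],                 # R[s, a] = R(s, a)
              [0.5, 2.0]])
gamma, n = 0.9, 3                         # discount and horizon

def sample_return(s_t, a_t):
    """Sample one trajectory starting from (s_t, a_t) and return U_t."""
    s, a, U = s_t, a_t, 0.0
    for k in range(n + 1):
        U += gamma ** k * R[s, a]         # R_{t+k} = R(S_{t+k}, A_{t+k})
        s = rng.choice(2, p=P[s, a])      # S_{t+k+1} ~ P(. | S_{t+k}, A_{t+k})
        a = rng.choice(2, p=pi[s])        # A_{t+k+1} ~ P(. | S_{t+k+1})
    return U

# Monte-Carlo estimate of E[U_t | s_t = 0, a_t = 1]
print(np.mean([sample_return(0, 1) for _ in range(20000)]))
```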
So once the state $s_t$ and action $a_t$ at the starting time $t$ are given, the subsequent sequence of states and actions is still uncertain and can be written as:

$$
s_t,\,a_t,\,S_{t+1}\sim P(S_{t+1}|S_t,A_t),\,R(S_t,A_t),\,A_{t+1}\sim P(A_{t+1}|S_{t+1}),\,\cdots,\,S_{t+n}\sim P(S_{t+n}|S_{t+n-1},A_{t+n-1}),\,R(S_{t+n-1},A_{t+n-1}),\,A_{t+n}\sim P(A_{t+n}|S_{t+n}),\,R(S_{t+n},A_{t+n})
$$
So for a particular sequence

$$
s_t,a_t,r_t,\,s_{t+1},a_{t+1},r_{t+1},\,\cdots,\,s_{t+n},a_{t+n},r_{t+n}
$$
the probability of generating this sequence is:

$$
P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})
$$
and the return of this sequence is:

$$
U(t)=r(s_t,a_t)+\gamma^1 r(s_{t+1},a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})
$$
Then taking the weighted average over all possible sequences, i.e., the expectation:
$$
\sum_{s_{t+1}}\sum_{a_{t+1}}\sum_{s_{t+2}}\sum_{a_{t+2}}\cdots\sum_{s_{t+n}}\sum_{a_{t+n}}[r(s_t,a_t)+\gamma^1 r(s_{t+1},a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})]\,[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})]
$$
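The same weighted sum can be evaluated by brute force on a made-up two-state, two-action MDP (hypothetical numbers), enumerating every possible trajectory of the finite horizon:

```python
import itertools
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = P(a | s)
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],                 # R[s, a] = r(s, a)
              [0.5, 2.0]])
gamma, n = 0.9, 3
s_t, a_t = 0, 1

expected_U = 0.0
# Enumerate every trajectory (s_{t+1}, a_{t+1}, ..., s_{t+n}, a_{t+n}).
for traj in itertools.product(range(2), repeat=2 * n):
    states = (s_t,) + traj[0::2]
    actions = (a_t,) + traj[1::2]
    ret = sum(gamma ** k * R[states[k], actions[k]] for k in range(n + 1))
    prob = 1.0
    for k in range(n):
        prob *= P[states[k], actions[k], states[k + 1]]    # P(s_{t+k+1} | s_{t+k}, a_{t+k})
        prob *= pi[states[k + 1], actions[k + 1]]          # P(a_{t+k+1} | s_{t+k+1})
    expected_U += ret * prob
print(expected_U)
```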
To decide whether this weighted sum really is the expectation of the return, we need to check that it accounts for every possible return and its probability. Start with the case where a single action changes, say $a_{t+1}$; the sequence then becomes:

$$
s_t,a_t,r(s_t,a_t),\,s_{t+1},a_{t+1}^*,r(s_{t+1},a_{t+1}^*),\,s_{t+2},a_{t+2},\,\cdots,\,s_{t+n},a_{t+n},r(s_{t+n},a_{t+n})
$$
The probability of generating this sequence is:

$$
P(s_{t+1}|s_t,a_t)\times P(a_{t+1}^*|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})
$$
Compared with the original sequence, changing one action changes two factors: the state-to-action function, i.e., the policy $P(a_{t+1}^*|s_{t+1})$, and the state transition function $P(s_{t+2}|s_{t+1},a_{t+1}^*)$.
The return becomes:

$$
U(t)=r(s_t,a_t)+\gamma r(s_{t+1},a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})
$$
Only one term in the return changes, while two factors in the corresponding probability change. The expectation of the return must therefore contain at least these two terms:

$$
\begin{split}
&[r(s_t,a_t)+\gamma r(s_{t+1},a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})]\times[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}^*|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})]\\
&+[r(s_t,a_t)+\gamma^1 r(s_{t+1},a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})]\times[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})]
\end{split}
$$
Both of these terms can indeed be found in the expectation expression. Moreover, such a term exists for every $a_{t+1}$ in the action space, so the expectation must include all choices of $a_{t+1}$, that is:

$$
\sum_{a_{t+1}^*}[r(s_t,a_t)+\gamma r(s_{t+1},a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})]\times[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}^*|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})]
$$
A change of the action at any other position gives a similar expression:

$$
\sum_{a_{t+k}^*}[r(s_t,a_t)+\cdots+\gamma^k r(s_{t+k},a_{t+k}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})]\times[P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})\times P(s_{t+2}|s_{t+1},a_{t+1})\times P(a_{t+2}|s_{t+2})\times\cdots\times P(a_{t+k}^*|s_{t+k})\times P(s_{t+k+1}|s_{t+k},a_{t+k}^*)\times\cdots\times P(s_{t+n}|s_{t+n-1},a_{t+n-1})\times P(a_{t+n}|s_{t+n})]
$$
Next, consider a change of state. Suppose the state at some intermediate position changes; the trajectory becomes:

$$
s_t,a_t,r(s_t,a_t),\,s_{t+1}^*,a_{t+1},r(s_{t+1}^*,a_{t+1}),\,\cdots,\,s_{t+n},a_{t+n},r(s_{t+n},a_{t+n})
$$
The return is then:

$$
U(t)=r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1})+\cdots+\gamma^n r(s_{t+n},a_{t+n})
$$

and the corresponding probability is:

$$
P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}|s_{t+1}^*)\times P(s_{t+2}|s_{t+1}^*,a_{t+1})\times\cdots\times P(a_{t+n}|s_{t+n})
$$
A change of state thus affects three factors, which is consistent with the earlier diagram: a state influences the action taken, the next state, and the reward received, whereas an action only influences two things, the reward received and the next state. In terms of probabilities, a changed state changes the probability of generating the action, the probability of generating the next state, and the probability of reaching the changed state from the previous state; a changed action only changes the state-to-action probability and the state-action-to-next-state probability.
When the state and the action change at the same time:

$$
s_t,a_t,r(s_t,a_t),\,s_{t+1}^*,a_{t+1}^*,r(s_{t+1}^*,a_{t+1}^*),\,s_{t+2},a_{t+2},r(s_{t+2},a_{t+2}),\,\cdots,\,r(s_{t+n},a_{t+n})
$$

the return is:

$$
U(t)=r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n},a_{t+n})
$$

and the corresponding probability is:

$$
P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}|s_{t+2})\times\cdots\times P(a_{t+n}|s_{t+n})
$$
The change in the probability factors is similar to the state-only case. When every state and action is allowed to vary:

$$
s_t,a_t,r(s_t,a_t),\,s_{t+1}^*,a_{t+1}^*,r(s_{t+1}^*,a_{t+1}^*),\,\cdots,\,s_{t+n}^*,a_{t+n}^*,r(s_{t+n}^*,a_{t+n}^*)
$$
the return is:

$$
U(t)=r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)
$$

and the corresponding probability is:

$$
P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)
$$
So the expectation is the sum, over all the $s$ and $a$, of each return multiplied by its probability:

$$
\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}[r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)]\times[P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)]
$$
The same conclusion can be reached by reasoning forward. A trajectory consists of a state sequence and an action sequence, and it is easy to see that a single action can change while everything else stays fixed, and likewise a single state can change while everything else stays fixed; that is, the state and action at every position can vary independently. The total number of sequences is therefore enumerated by the nested sums $\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}$, i.e., the counts at each position multiply. The expectation is then the sum of every return multiplied by its probability, which is exactly the form given above. The formula above,
$$
\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}[r(s_t,a_t)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)]\times[P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)]
$$
can be rearranged further by splitting the return into its individual reward terms and carrying out the sums for each term separately:
$$
\begin{split}
E[U(t)]&=r(s_t,a_t)+\sum_{s_{t+1}^*}\sum_{a_{t+1}^*}\gamma\, r(s_{t+1}^*,a_{t+1}^*)\times P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\\
&\quad+\sum_{s_{t+1}^*}\sum_{s_{t+2}^*}\sum_{a_{t+1}^*}\sum_{a_{t+2}^*}\gamma^2 r(s_{t+2}^*,a_{t+2}^*)\,P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\\
&\quad+\cdots+\sum_{s_{t+1}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}\gamma^n r(s_{t+n}^*,a_{t+n}^*)\times[P(s_{t+1}^*|s_t,a_t)\times P(a_{t+1}^*|s_{t+1}^*)\times P(s_{t+2}^*|s_{t+1}^*,a_{t+1}^*)\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)]
\end{split}
$$
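The term-by-term form above can also be evaluated by propagating the marginal $P(s_{t+k},a_{t+k}|s_t,a_t)$ forward one step at a time and accumulating $\gamma^k$ times the expected reward; a minimal sketch on the same hypothetical MDP:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = P(a | s)
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],                 # R[s, a] = r(s, a)
              [0.5, 2.0]])
gamma, n = 0.9, 3
s_t, a_t = 0, 1

expected_U = R[s_t, a_t]                  # the k = 0 term r(s_t, a_t)
p_sa = np.zeros((2, 2))
p_sa[s_t, a_t] = 1.0                      # degenerate marginal at time t
for k in range(1, n + 1):
    p_s = np.einsum('ij,ijk->k', p_sa, P)  # marginal of s_{t+k} given (s_t, a_t)
    p_sa = p_s[:, None] * pi               # joint marginal of (s_{t+k}, a_{t+k})
    expected_U += gamma ** k * np.sum(p_sa * R)
print(expected_U)                          # matches the brute-force enumeration
```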
This can also be written recursively. Writing the same kind of expansion for the next time step (note the horizon is one step shorter, so the last power of $\gamma$ is $n-1$ and only $s_{t+2},a_{t+2},\dots$ are summed over):

$$
E[U(t+1)]=r(s_{t+1},a_{t+1})+\sum_{s_{t+2}^*}\cdots\sum_{s_{t+n}^*}\sum_{a_{t+2}^*}\cdots\sum_{a_{t+n}^*}[\gamma\, r(s_{t+2}^*,a_{t+2}^*)+\cdots+\gamma^{n-1}r(s_{t+n}^*,a_{t+n}^*)]\times[P(s_{t+2}^*|s_{t+1},a_{t+1})\times P(a_{t+2}^*|s_{t+2}^*)\times\cdots\times P(a_{t+n}^*|s_{t+n}^*)]
$$
So we obtain the recursion:
$$
E[U(t)]=r(s_t,a_t)+\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma\, E[U(t+1)]\times P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})
$$
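A minimal sketch of this recursion as a backward pass over the finite horizon (same hypothetical MDP numbers as before); it should agree with the brute-force enumeration and the forward-marginal sketches above:

```python
import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],                # pi[s, a] = P(a | s)
               [0.2, 0.8]])
R = np.array([[1.0, 0.0],                 # R[s, a] = r(s, a)
              [0.5, 2.0]])
gamma, n = 0.9, 3

# Backward recursion: Q(s, a) = r(s, a) + gamma * sum_{s', a'} P(s'|s,a) P(a'|s') Q_next(s', a')
Q = R.copy()                              # at the final step only the immediate reward remains
for _ in range(n):
    v_next = np.einsum('ka,ka->k', pi, Q)              # sum_a' P(a'|s') Q_next(s', a')
    Q = R + gamma * np.einsum('ijk,k->ij', P, v_next)  # expectation over the next state s'
print(Q[0, 1])                            # E[U(t) | s_t = 0, a_t = 1]
```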
If the policy, i.e., the state-to-action function, maximizes $E[U(t)]$, then, since $r(s_t,a_t)$ is fixed, it must also maximize $\sum_{s_{t+1}}\sum_{a_{t+1}}\gamma E[U(t+1)]\times P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})$, which is the same as maximizing $\sum_{s_{t+1}}\sum_{a_{t+1}}E[U(t+1)]\times P(s_{t+1}|s_t,a_t)\times P(a_{t+1}|s_{t+1})$. This is the decomposition behind the Bellman equation; bringing in the state-value function as well yields the other forms of the Bellman equation.
Next, consider the return when the state is given but the action is not. Suppose the state-action trajectory is:

$$
s_t,a_t^*,\,s_{t+1}^*,a_{t+1}^*,\,\cdots,\,s_{t+n}^*,a_{t+n}^*
$$
As before, a single action can change while everything else stays fixed; each position varies independently of the others. So the mean is:

$$
E=\sum_{a_{t}^*}\sum_{s_{t+1}^*}\sum_{a_{t+1}^*}\cdots\sum_{a_{t+n}^*}[r(s_t,a_t^*)+\gamma r(s_{t+1}^*,a_{t+1}^*)+\cdots+\gamma^n r(s_{t+n}^*,a_{t+n}^*)]\,P(a_t^*|s_t)P(s_{t+1}^*|s_t,a_t^*)\cdots P(a_{t+n}^*|s_{t+n}^*)
$$
$$
E=\sum_{a_{t}^*}r(s_t,a_t^*)P(a_t^*|s_t)+\sum_{a_{t}^*}\sum_{s_{t+1}^*}\sum_{a_{t+1}^*}\gamma\, r(s_{t+1}^*,a_{t+1}^*)P(a_t^*|s_t)P(s_{t+1}^*|s_t,a_t^*)P(a_{t+1}^*|s_{t+1}^*)+\cdots+\sum_{a_t^*}\cdots\sum_{a_{t+n}^*}\gamma^n r(s_{t+n}^*,a_{t+n}^*)P(a_t^*|s_t)\cdots P(a_{t+n}^*|s_{t+n}^*)
$$
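In the terms used above, this quantity (state given, first action averaged out by the policy) is the state value, $V(s_t)=\sum_{a}P(a|s_t)\,Q(s_t,a)$. A minimal sketch, assuming a hypothetical action-value table Q such as the output of the backward recursion above:

```python
import numpy as np

pi = np.array([[0.6, 0.4],        # pi[s, a] = P(a | s)
               [0.2, 0.8]])
Q = np.array([[3.1, 2.7],         # hypothetical action values, stand-ins for the
              [2.4, 4.0]])        # output of the backward recursion sketch above
V = np.einsum('sa,sa->s', pi, Q)  # V(s) = sum_a P(a|s) Q(s, a)
print(V)
```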
Removing the influence of the policy amounts to taking, as the optimal policy, the one that maximizes this expectation.
We can also count how many parameters need to be estimated. Suppose there are $n$ states and $m$ actions. The state-to-action mapping (the policy) then has $nm$ parameters, and the state-action-to-state transition has $n^2m$ parameters.
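For a concrete, purely hypothetical instance with $n=10$ states and $m=4$ actions: the policy table $P(a|s)$ has $nm=40$ entries, while the transition table $P(s'|s,a)$ has $n\times m\times n=n^2m=400$ entries, so the transition model dominates the parameter count.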