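Reinforcement Learning in Practice: The Bellman Expectation Equation

MDP: Bellman Expectation Equation

Introduction to MDP theory
Building on the theory covered earlier, we can now work through a programming example to get a feel for the Bellman expectation equation.

First we import the packages we need. We use sympy here because it lets us write formulas symbolically.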

import pandas as pd
import sympy
from sympy import symbols

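Consider the following scenario: Xiao Ming takes an exam, so there are two states, "fail" and "passed". In each state he can choose between two actions, "learn" and "play". This is enough to build an MDP. We start with the dynamics of the environment: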

# dynamic system: taking exams
dynamic = {
    's_': ['fail', 'fail', 'passed', 'passed', 'fail', 'passed'],
    'r': [-3, -1, 1, 3, -2, 1],
    's': ['fail', 'fail', 'fail', 'passed', 'passed', 'passed'],
    'a': ['play', 'learn', 'learn', 'play', 'play', 'learn'],
    'P(s_, r|s, a)': ['1', '1-m', 'm', '1-n', 'n', '1'],
}
df = pd.DataFrame(data=dynamic)

# center the text
d = dict(selector="th",
       props=[('text-align', 'center')])
df.style.set_properties(**{'width':'10em', 'text-align':'center'}).set_table_styles([d])

[Image: rendered table of the environment dynamics]
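A quick sanity check for a dynamics table like the one above is that, for every state-action pair, the probabilities over (s_, r) add up to 1. The sketch below does this with sympy so that it also works for the symbolic entries. The names `rows` and `totals` are introduced here only for illustration.

```python
from collections import defaultdict

import sympy

# Same rows as the dynamics table: (s, a, s_, r, P(s_, r | s, a)).
rows = [
    ('fail',   'play',  'fail',   -3, '1'),
    ('fail',   'learn', 'fail',   -1, '1-m'),
    ('fail',   'learn', 'passed',  1, 'm'),
    ('passed', 'play',  'passed',  3, '1-n'),
    ('passed', 'play',  'fail',   -2, 'n'),
    ('passed', 'learn', 'passed',  1, '1'),
]

# Add up the probabilities of all outcomes for each (state, action) pair.
totals = defaultdict(lambda: sympy.Integer(0))
for s, a, s_next, r, prob in rows:
    totals[(s, a)] += sympy.sympify(prob)

for (s, a), total in totals.items():
    # (1 - m) + m and (1 - n) + n must both reduce to 1
    assert sympy.simplify(total - 1) == 0, (s, a, total)
    print(f'sum of P(s_, r | {s}, {a}) = {sympy.simplify(total)}')
```

If a row were missing or a probability mistyped, the assertion would name the offending state-action pair.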

and the corresponding policy:

# policy
policy = {
    's': ['fail', 'fail', 'passed', 'passed'],
    'a': ['play','learn', 'play', 'learn'],
    'pi(a|s)': ['1-x', 'x', 'y', '1-y'],
}
df2 = pd.DataFrame(data=policy)
df2.style.set_properties(**{'width':'10em', 'text-align':'center'}).set_table_styles([d])

[Image: rendered table of the policy]
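To make the policy table concrete: once x and y are given numeric values, pi(a|s) is simply a distribution to sample actions from. A minimal sketch, where x = 0.8 and y = 0.3 are arbitrary illustrative values (not from the original post) and `sample_action` is a hypothetical helper:

```python
import random

x, y = 0.8, 0.3  # arbitrary illustrative values

# pi[s][a] = pi(a|s), following the policy table
pi = {
    'fail':   {'play': 1 - x, 'learn': x},
    'passed': {'play': y,     'learn': 1 - y},
}


def sample_action(state, rng=random):
    """Draw one action for `state` according to pi(.|state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that runs are repeatable
counts = {'play': 0, 'learn': 0}
for _ in range(10_000):
    counts[sample_action('fail', rng)] += 1
print(counts)  # roughly 20% play and 80% learn
```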

Recall the Bellman expectation equations:

$$v_{\pi}(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \cdot q_{\pi}(s, a), \quad s \in \mathcal{S}$$

$$\begin{aligned}q_{\pi}(s, a)&=r(s,a)+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)\cdot v_{\pi}\left(s^{\prime}\right)\\&= \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right) \cdot \left[r+\gamma \cdot v_{\pi}(s^{\prime})\right], \quad s\in \mathcal{S},\ a \in \mathcal{A}\end{aligned}$$
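Substituting the second equation into the first eliminates q and leaves an equation in the state values alone, which makes it easy to see that everything is linear in v:

$$v_{\pi}(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r \mid s, a\right) \cdot \left[r+\gamma \cdot v_{\pi}(s^{\prime})\right], \quad s \in \mathcal{S}$$

For a finite state space this is one linear equation per state, so a fixed policy can in principle be evaluated by solving a linear system.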

Together, these two formulas give six equations: two for v and four for q.

'''
Bellman Equation

v(fail) = pi(play|fail)*q(fail,play) + pi(learn|fail)*q(fail,learn)
        = (1-x)*q(fail,play) + x*q(fail,learn)

v(passed) = pi(play|passed)*q(passed,play) + pi(learn|passed)*q(passed,learn)
          = y*q(passed,play) + (1-y)*q(passed,learn)

q(fail,play) = P(fail,r|fail,play)*[r + gamma*v(fail)] + P(passed,r|fail,play)*[r + gamma*v(passed)]
             = 1*[-3 + gamma*v(fail)] + 0
             = -3 + gamma*v(fail)

q(fail,learn) = P(fail,r|fail,learn)*[r + gamma*v(fail)] + P(passed,r|fail,learn)*[r + gamma*v(passed)]
              = (1-m)*[-1 + gamma*v(fail)] + m*[1 + gamma*v(passed)]

q(passed,play) = P(fail,r|passed,play)*[r + gamma*v(fail)] + P(passed,r|passed,play)*[r + gamma*v(passed)]
               = n*[-2 + gamma*v(fail)] + (1-n)*[3 + gamma*v(passed)]

q(passed,learn) = P(fail,r|passed,learn)*[r + gamma*v(fail)] + P(passed,r|passed,learn)*[r + gamma*v(passed)]
                = 0 + 1*[1 + gamma*v(passed)]
                = 1 + gamma*v(passed)

6 variables, 6 equations; m, n and gamma are parameters in (0,1), x and y are policy probabilities in [0,1]
'''

# automatically enable the best printer available in your environment
sympy.init_printing()

# define variables and parameters
# (raw string, so that "\," reaches sympy as an escaped comma inside a symbol name)
v_fail, v_passed = symbols('v_(fail) v_(passed)')
q_fail_play, q_fail_learn, q_passed_play, q_passed_learn = symbols(
    r'q_(fail\,play) q_(fail\,learn) q_(passed\,play) q_(passed\,learn)')
m, n, gamma, x, y = symbols('m n gamma x y')

# define the augmented matrix: one row per equation, columns ordered as
# v_fail, v_passed, q_fail_play, q_fail_learn, q_passed_play, q_passed_learn | constant
system = sympy.Matrix((
    (1, 0, x-1, -x, 0, 0, 0),
    (0, 1, 0, 0, -y, y-1, 0),
    (-gamma, 0, 1, 0, 0, 0, -3),
    ((m-1)*gamma, -m*gamma, 0, 1, 0, 0, 2*m-1),
    (-n*gamma, (n-1)*gamma, 0, 0, 1, 0, 3-5*n),
    (0, -gamma, 0, 0, 0, 1, 1),
))
system

This is the augmented matrix of the linear system; solving it gives the value functions.

$$\left[\begin{matrix}1 & 0 & x - 1 & - x & 0 & 0 & 0\\0 & 1 & 0 & 0 & - y & y - 1 & 0\\- \gamma & 0 & 1 & 0 & 0 & 0 & -3\\\gamma \left(m - 1\right) & - \gamma m & 0 & 1 & 0 & 0 & 2 m - 1\\- \gamma n & \gamma \left(n - 1\right) & 0 & 0 & 1 & 0 & 3 - 5 n\\0 & - \gamma & 0 & 0 & 0 & 1 & 1\end{matrix}\right]$$

sympy.solve_linear_system(
    system,
    v_fail, v_passed,
    q_fail_play, q_fail_learn, q_passed_play, q_passed_learn
)

The result is a dictionary that maps each unknown to a rational function of m, n, x, y and γ. In compact form, with the common denominator

$$D = (1-\gamma)\left(1-\gamma+\gamma\left(m x+n y\right)\right),$$

the state values are

$$\begin{aligned}v_{(fail)} &= \frac{\left(2x(1+m)-3\right)\left(1-\gamma+\gamma n y\right)+\gamma m x\left(1+2y-5ny\right)}{D}\\v_{(passed)} &= \frac{\left(1+2y-5ny\right)\left(1-\gamma+\gamma m x\right)+\gamma n y\left(2x(1+m)-3\right)}{D}\end{aligned}$$

and the action values follow by substituting them back:

$$\begin{aligned}q_{(fail,play)} &= -3+\gamma\, v_{(fail)}\\q_{(fail,learn)} &= 2m-1+\gamma\left[(1-m)\, v_{(fail)}+m\, v_{(passed)}\right]\\q_{(passed,play)} &= 3-5n+\gamma\left[n\, v_{(fail)}+(1-n)\, v_{(passed)}\right]\\q_{(passed,learn)} &= 1+\gamma\, v_{(passed)}\end{aligned}$$
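The six equations can also be assembled mechanically from the two tables instead of by hand, which is less error-prone. The following is a sketch under that assumption, not the original post's code. The dictionaries `p`, `pi`, `v`, `q` and the numeric values substituted at the end are introduced here for illustration only.

```python
import sympy

m, n, gamma, x, y = sympy.symbols('m n gamma x y')
states, actions = ['fail', 'passed'], ['play', 'learn']

# p[(s, a)] lists (s_, r, P(s_, r | s, a)), copied from the dynamics table
p = {
    ('fail', 'play'):    [('fail', -3, 1)],
    ('fail', 'learn'):   [('fail', -1, 1 - m), ('passed', 1, m)],
    ('passed', 'play'):  [('passed', 3, 1 - n), ('fail', -2, n)],
    ('passed', 'learn'): [('passed', 1, 1)],
}
# pi[(s, a)] = pi(a|s), copied from the policy table
pi = {
    ('fail', 'play'): 1 - x, ('fail', 'learn'): x,
    ('passed', 'play'): y, ('passed', 'learn'): 1 - y,
}

# One unknown per state value and per action value
v = {s: sympy.Symbol(f'v_{s}') for s in states}
q = {(s, a): sympy.Symbol(f'q_{s}_{a}') for s in states for a in actions}

equations = []
for s in states:
    # v(s) = sum_a pi(a|s) * q(s, a)
    equations.append(sympy.Eq(v[s], sum(pi[s, a] * q[s, a] for a in actions)))
for (s, a), outcomes in p.items():
    # q(s, a) = sum_{s_, r} P(s_, r | s, a) * (r + gamma * v(s_))
    equations.append(sympy.Eq(
        q[s, a], sum(prob * (r + gamma * v[s_]) for s_, r, prob in outcomes)))

unknowns = list(v.values()) + list(q.values())
solution = sympy.solve(equations, unknowns, dict=True)[0]

# Plug in arbitrary illustrative numbers to get concrete values
half = sympy.Rational(1, 2)
values = {m: half, n: half, x: half, y: half, gamma: half}
for var in unknowns:
    print(var, '=', solution[var].subs(values))
```

With all five parameters set to 1/2, this gives v_fail = -9/4 and v_passed = 3/4, and the action values -33/8, -3/8, 1/8 and 11/8 in the order fail/play, fail/learn, passed/play, passed/learn.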

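Summary

This completes a full example of policy evaluation by solving the Bellman expectation equations by brute force. Even for an example this small, the equations are tedious to set up and awkward to solve. Finding an optimal policy from the Bellman optimality equations runs into the same problem, and this is what pushes us toward other methods: DP, MC, TD and so on, which come up later.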
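As a small preview of the DP approach, iterative policy evaluation avoids solving the system symbolically: it applies the Bellman expectation equation repeatedly as an update rule until the values stop changing. A minimal sketch for this example, assuming the arbitrary numeric values m = n = x = y = gamma = 0.5 (not from the original post) and a hypothetical helper `evaluate_policy`:

```python
def evaluate_policy(m, n, x, y, gamma, tol=1e-12, max_iter=10_000):
    """Iterative policy evaluation for the exam MDP; returns {state: value}."""
    # p[(s, a)] lists (s_, r, P(s_, r | s, a)), as in the dynamics table
    p = {
        ('fail', 'play'):    [('fail', -3, 1.0)],
        ('fail', 'learn'):   [('fail', -1, 1 - m), ('passed', 1, m)],
        ('passed', 'play'):  [('passed', 3, 1 - n), ('fail', -2, n)],
        ('passed', 'learn'): [('passed', 1, 1.0)],
    }
    # pi[(s, a)] = pi(a|s), as in the policy table
    pi = {
        ('fail', 'play'): 1 - x, ('fail', 'learn'): x,
        ('passed', 'play'): y, ('passed', 'learn'): 1 - y,
    }
    v = {'fail': 0.0, 'passed': 0.0}  # arbitrary starting guess
    for _ in range(max_iter):
        # One sweep: v(s) <- sum_a pi(a|s) sum_{s_, r} P(s_, r|s, a) [r + gamma v(s_)]
        new_v = {
            s: sum(pi[s, a] * sum(prob * (r + gamma * v[s_])
                                  for s_, r, prob in p[s, a])
                   for a in ('play', 'learn'))
            for s in v
        }
        delta = max(abs(new_v[s] - v[s]) for s in v)
        v = new_v
        if delta < tol:  # values have stopped changing
            break
    return v


v = evaluate_policy(m=0.5, n=0.5, x=0.5, y=0.5, gamma=0.5)
print(v)
```

For these values the iteration settles at v(fail) ≈ -2.25 and v(passed) ≈ 0.75, the exact solution of the Bellman expectation equations for the dynamics and policy tables with the same numbers.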

References

Xiao Zhiqing (肖智清), 《强化学习原理与Python实现》
