【RLaI】Computing the optimal policy with value iteration (Example 4.4)

Problem

Example 4.3: Gambler’s Problem
A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of $100, or loses by running out of money. On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars. This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler's capital, s ∈ {1, 2, ..., 99}, and the actions are stakes, a ∈ {0, 1, ..., min(s, 100−s)}. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let p_h denote the probability of the coin coming up heads. If p_h is known, then the entire problem is known and it can be solved, for instance, by value iteration. Figure 4.3 shows the change in the value function over successive sweeps of value iteration, and the final policy found, for the case of p_h = 0.4. This policy is optimal, but not unique. In fact, there is a whole family of optimal policies, all corresponding to ties for the argmax action selection with respect to the optimal value function. Can you guess what the entire family looks like?


Screenshot of Example 4.3 (Gambler's Problem) from the book

Problem Abstraction

This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler's capital, s ∈ {1, 2, ..., 99}, and the actions are stakes, a ∈ {0, 1, ..., min(s, 100−s)}. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is +1.

Coin flips
The gambler's holding is his capital, and the probability of the coin coming up heads is p_h = 0.4.
The reward is 0 on every transition until the goal is reached (capital = 100), when it is +1.

  • state: the gambler's current capital
  • action: the amount of money staked on the flip
  • reward: 0 on every transition, +1 when state' = 100
  • dynamics: p(s', r | s, a) (see the worked example below)
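
As a concrete example of these dynamics, with p_h = 0.4: staking 25 from a capital of 50 gives p(s' = 75, r = 0 | s = 50, a = 25) = 0.4 and p(s' = 25, r = 0 | s = 50, a = 25) = 0.6, while staking 50 gives p(s' = 100, r = +1 | s = 50, a = 50) = 0.4 and p(s' = 0, r = 0 | s = 50, a = 50) = 0.6.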

Python Implementation

Data structure definition

Defining and recording the environment model
The environment model records all of the information about the environment: for every state, the set of available actions, and for each action the corresponding rewards and transition probabilities. It is stored as a Python dict whose keys are states; each value is another dict, keyed by action, holding the environment data for that action.

dict({state: dict({        # the current state
     action : list[        # one possible action
                pvalue,                               # value of (state, action), i.e. q(state, action)
                list[reward, state', probability],    # first possible outcome (next state)
                list[reward, state', probability]     # second possible outcome (next state)
               ]
     })})
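
As a concrete illustration of this layout (written out by hand here for clarity rather than taken from the setup code, and assuming p_h = 0.4), the entries for a capital of 50 with stakes of 25 and 50 would look like:

environment = {
    50: {                      # current capital: 50
        25: [                  # stake 25 dollars
            0.0,               # value of (50, 25), updated during value iteration
            [0, 75, 0.4],      # heads: reward 0, next capital 75, probability 0.4
            [0, 25, 0.6],      # tails: reward 0, next capital 25, probability 0.6
        ],
        50: [                  # stake 50 dollars
            0.0,               # value of (50, 50)
            [1, 100, 0.4],     # heads: the goal is reached, reward +1
            [0, 0, 0.6],       # tails: ruin, next capital 0
        ],
    },
}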

Value iteration algorithm

Rough outline of the algorithm:

# Loop until the required precision is reached:
#     sweep over all states; for each state:
#         loop over every action available in that state
#               compute its q-value and update the environment model
#               keep track of the action with the largest q-value
#         update value[state] to that largest q-value
#
#         record how much value[state] changed, so the stopping condition can be checked
# Once the precision requirement is met, the loop ends
# Sweep over all states
#         for each state, store the action(s) that maximize its value into policy

Main implementation

def valueIteration():
    pre = precision      # largest value change seen in the current sweep
    times = 0            # number of sweeps performed
    # keep sweeping and updating the value array until the precision requirement is met
    while pre >= precision:
        times += 1
        # reset the measured change at the start of each sweep
        pre = 0
        for state in range(1, 100):
            v_old = value[state]
            # compute the value of every action still available in this state
            for action in actions(state):
                # some actions may already have been pruned, so skip those
                if action not in environment[state]:
                    continue
                tmp = environment[state][action]
                # update environment[state][action][0], the value of (state, action)
                environment[state][action][0] = (tmp[1][2] * (tmp[1][0] + value[tmp[1][1]])
                                                 + tmp[2][2] * (tmp[2][0] + value[tmp[2][1]]))
            # the new value of this state is the largest of its action values
            max_value = 0
            for action in actions(state):
                if action not in environment[state]:
                    continue
                if environment[state][action][0] > max_value:
                    max_value = environment[state][action][0]
            value[state] = max_value
            # value[state] is now the expected return of the best action in this state
            # prune: keep only the maximizing action(s) and delete the rest
            for action in actions(state):
                if action not in environment[state]:
                    continue
                if environment[state][action][0] != max_value:
                    del environment[state][action]
            # record how much this state's value changed, for the stopping condition
            pre = max(pre, abs(value[state] - v_old))

    # for every state, store the action(s) that attain its value into policy
    for state in range(1, 100):
        policy[state] = []
        for action in actions(state):
            if action not in environment[state]:
                continue
            if environment[state][action][0] == value[state]:
                policy[state].append(action)

    print("iteration times: ", times, "precision: ", precision)

In addition, plots are made to record the values across states, along with their trends and patterns (the matplotlib setup these snippets assume is sketched after the list).

  • the final value of each state (fig3)
    plt.figure(3)
    plt.plot(range(100), value[:-1])
    plt.grid()
    plt.title('state-value')
  • how the value of one or more states changes over the time steps (fig2)
    # inside the value-iteration loop, append a snapshot of value after every sweep
        value_s_t.append(value.copy())

    plt.figure(2)
    for i in [80, 90, 97, 98, 99, 100]:
        plt.plot([t for t in range(times)], [value_s_t[t][i] for t in range(times)], label='state' + str(i) + "'s value")
    plt.legend()
    plt.title("state's value for each time steps")
  • scatter plot of the policy at each state (fig1)
    for state in range(1, 100):
        policy[state] = []
        for action in actions(state):
            if action not in environment[state]:
                continue
            if environment[state][action][0] == value[state]:
                policy[state].append(action)
                plt.figure(1)
                plt.scatter(state, action)
                plt.title("state-policy")

Results

The figure given in the book (Figure 4.3)


With the algorithm and environment model in place, the program is run.

1st precision = 0.000000000001

  • The final value of every state is very close to the curve given in the book: the more capital the gambler holds, the higher the probability of eventually winning (capital reaching 100) on the coin flips. This result was obtained after 23 sweeps; with more sweeps the curve would presumably become even smoother.
  • [Note: fig2 tracks the values of a few states close to 100; the states close to 0 behave quite differently.] Looking at the values of the states just below 100: their relative ordering never changes, and they already reach a substantial level after the first sweep, improving only slightly afterwards.
    • This is because after the first sweep each of these states already takes the value of its best (state, action) pair; the later sweeps only make small adjustments as the values of the related states are themselves fine-tuned.
  • The policy we obtain does not quite match the one described in the book, especially for the states above 50. The main reason is the sweep order: the inner loop visits the states from small to large, so each state's backup is a weighted average over one successor below it and one successor above it, and at the time of the backup the successor above it has not yet been updated in the current sweep, which affects the computed value.
    • e.g. the backup of (state 50, stake 1) combines value(49) and value(51); value(49) has already been updated in this sweep, but value(51) still holds its old value (initially 0), which inevitably makes the result less accurate.
    • If the states were swept in random order, or from large to small, the policy array would change accordingly. Presumably, sweeping from large to small would mirror this plot about state = 50 (a sketch of that one-line change follows this list).
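
A quick way to check this is to flip the sweep direction inside valueIteration(); a sketch of that one-line change (not code from the post):

# inside valueIteration(), sweep the states from large to small instead of small to large
for state in range(99, 0, -1):     # was: for state in range(1, 100)
    v_old = value[state]
    # ... the rest of the sweep body is unchanged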
fig3 value(state)

fig2 state's value

fig1 state's policy

Iteration count

2nd precision = 0.01

  • The runs with precision 1e-12 and 0.01 give very similar results; the only real difference is the number of sweeps.


    fig3 value(state)

    fig2 state's value

    fig1 state's policy

    Iteration count

3rd precision = 0.1

  • With a precision of 0.1, only 4 sweeps are needed, but the result is not satisfactory.
  • The values of some states are identical over small ranges; for example, for state ∈ [1, 5] the value is 0 and the policy allows every action.
    • This is because the value array is initialized to 0 and each state's backup depends only on the values of its successors; as long as those successor values have not yet been updated to something non-zero, these states' values cannot change (a worked first-sweep backup illustrating this follows the list).
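
For example, in the first sweep the backup for state 5 with a stake of 5 is 0.4 · (0 + value[10]) + 0.6 · (0 + value[0]) = 0, because both successor values are still at their initial value of 0; the same holds for every stake available in states 1–5 over the first few sweeps, so their values stay at 0 until non-zero values propagate back from the states that can reach 100.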


      fig3 value(state)

      fig2 state's value

      fig1 state's policy

      Iteration count

4th precision = 1

  • This also explains the last point of the previous case. After the first sweep, the states below 50 cannot reach 100 in a single flip, so every successor state' they back up from still has value 0, and their own values therefore remain 0.
    • In the later sweeps these smaller states are gradually adjusted and corrected, and their values become increasingly accurate.


      value(state)
