Reinforcement Learning Series 8: Model-Free Temporal Difference Methods

1. Modifying the Update Formula

The Monte Carlo method's update formulas are as follows:
$$q^i = r_0 + \gamma r^i_1 + \gamma^2 r^i_2 + \cdots$$
$$Q^i = Q^{i-1} + \frac{1}{i}\left(q^i - Q^{i-1}\right)$$
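
The second formula is just a running average written incrementally. A quick sanity check with made-up returns (plain Python, illustrative values only):

# Hypothetical sampled returns q^1 ... q^5 for a single (s, a) pair
returns = [3.0, 5.0, 4.0, 6.0, 2.0]

Q = 0.0
for i, q in enumerate(returns, start=1):
    Q += (q - Q) / i                    # Q^i = Q^{i-1} + (q^i - Q^{i-1}) / i

print(Q)                                # 4.0
print(sum(returns) / len(returns))      # 4.0, same as the batch average
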
Since $q^i$ is a full long-horizon return, its variance is large. Here we use the temporal difference (TD) method to compute $Q$ instead, which alleviates the variance problem. The formulas become:
$$q^i = r^{i-1} + \gamma \pi Q^{i-1}$$
$$Q^i = Q^{i-1} + \frac{1}{i}\left(q^i - Q^{i-1}\right)$$
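
To make the two targets concrete, here is a small sketch with made-up rewards and an empty Q table (all names and values below are illustrative):

import numpy as np

gamma = 0.8

# Monte Carlo: wait for the whole episode, then sum the discounted rewards
episode_rewards = [1.0, 0.0, 0.0, 10.0]             # hypothetical episode
mc_return = sum(gamma ** k * r for k, r in enumerate(episode_rewards))

# TD(0): one real reward plus the current estimate of the next state-action value
value_q = np.zeros((100, 2))                        # hypothetical tabular Q
reward, next_state, next_act = 1.0, 7, 1
td_target = reward + gamma * value_q[next_state][next_act]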

SARSA estimates the sampled $q$ value using the $Q$ of the next time step under the sampling policy, so it is an on-policy algorithm: the policy $\pi$ in the target is the same policy used to collect the samples.
Q-Learning estimates the sampled $q$ value using the $Q$ of the next time step under a different policy, so it is an off-policy algorithm: sampling is typically done with an $\epsilon$-greedy behavior policy, while the target uses the greedy $\max$ policy. Because the target replaces the value of the action actually taken with the value of the best estimated action, the value function suffers from overestimation. Moreover, under a purely greedy policy some $q$ values would never be updated, so a random element must be added to the behavior policy; the $\epsilon$ is exactly what drives exploration.
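
In code form, the only difference between the two is the bootstrap term. A minimal sketch, assuming a tabular value_q array and a hypothetical epsilon_greedy helper (neither is part of the implementation in the next section):

import numpy as np

def epsilon_greedy(value_q, state, epsilon):
    # behavior policy: with probability epsilon explore, otherwise act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(value_q.shape[1])
    return int(np.argmax(value_q[state, :]))

def sarsa_target(value_q, reward, next_state, next_act, gamma):
    # on-policy: bootstrap with the action the behavior policy actually sampled
    return reward + gamma * value_q[next_state][next_act]

def q_learning_target(value_q, reward, next_state, gamma):
    # off-policy: bootstrap with the greedy (max) action, whatever was actually sampled
    return reward + gamma * np.max(value_q[next_state, :])
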
Q-Learning therefore involves two models: one model is used for sampling and produces the sequence of $s$, $a$, and $q$; the other is used for computation and produces the $q$ targets that update the policy. These two models are called the Behavior Network and the Target Network. To reduce oscillation, the Target Network can be held fixed for a period of time and only synchronized from the Behavior Network after a set number of updates.
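
A minimal sketch of the Behavior/Target idea in the same tabular setting (the class name, sync_interval, and gamma below are illustrative, not part of the code in the next section; a deep-RL version such as DQN copies network weights instead of a table):

import numpy as np

class TwoTableLearner(object):
    # Tabular stand-in: behavior_q is updated every step, target_q stays frozen between syncs.
    def __init__(self, n_states, n_actions, gamma=0.8, sync_interval=100):
        self.behavior_q = np.zeros((n_states, n_actions))
        self.target_q = self.behavior_q.copy()
        self.gamma = gamma
        self.sync_interval = sync_interval
        self.steps = 0

    def td_target(self, reward, next_state, terminate):
        # targets come from the frozen table, which keeps them stable between syncs
        return reward + self.gamma * (0 if terminate else np.max(self.target_q[next_state, :]))

    def maybe_sync(self):
        # copy the behavior table into the target table every sync_interval steps
        self.steps += 1
        if self.steps % self.sync_interval == 0:
            self.target_q = self.behavior_q.copy()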

2. Code Implementation

import numpy as np


class SARSA(object):
    def __init__(self, epsilon=0.0):
        self.epsilon = epsilon

    def sarsa_eval(self, agent, env):
        # one episode of on-policy TD(0) evaluation (SARSA)
        state = env.reset()
        prev_state = -1
        prev_act = -1
        while True:
            act = agent.play(state, self.epsilon)
            next_state, reward, terminate, _ = env.step(act)
            if prev_act != -1:
                # TD target: current reward plus discounted Q of the (state, act) pair just sampled
                return_val = reward + agent.gamma * (0 if terminate else agent.value_q[state][act])
                agent.value_n[prev_state][prev_act] += 1
                agent.value_q[prev_state][prev_act] += (return_val - \
                    agent.value_q[prev_state][prev_act]) / \
                    agent.value_n[prev_state][prev_act]
            prev_act = act  
            prev_state = state
            state = next_state
            if terminate:
                break

    def policy_improve(self, agent):
        new_policy = np.zeros_like(agent.pi)
        for i in range(1, agent.s_len):
            new_policy[i] = np.argmax(agent.value_q[i,:])
        if np.all(np.equal(new_policy, agent.pi)):
            return False
        else:
            agent.pi = new_policy
            return True

    # sarsa: alternate policy evaluation and policy improvement
    def sarsa(self, agent, env):
        for i in range(10):
            for j in range(2000):
                self.sarsa_eval(agent, env)
            self.policy_improve(agent)

class QLearning(object):
    def __init__(self, epsilon=0.0):
        self.epsilon = epsilon

    def q_learn_eval(self, agent, env):
        state = env.reset()
        prev_state = -1
        prev_act = -1
        while True:
            act = agent.play(state, self.epsilon)
            next_state, reward, terminate, _ = env.step(act)
            if prev_act != -1:
                # Q-Learning target: current reward plus the discounted max Q over actions of the state just sampled
                return_val = reward + agent.gamma * (0 if terminate else np.max(agent.value_q[state,:]))
                agent.value_n[prev_state][prev_act] += 1
                agent.value_q[prev_state][prev_act] += (return_val - \
                    agent.value_q[prev_state][prev_act]) / \
                    agent.value_n[prev_state][prev_act]
            prev_act = act  
            prev_state = state
            state = next_state
            if terminate:
                break
                
    def policy_improve(self, agent):
        new_policy = np.zeros_like(agent.pi)
        for i in range(1, agent.s_len):
            new_policy[i] = np.argmax(agent.value_q[i,:])
        if np.all(np.equal(new_policy, agent.pi)):
            return False
        else:
            agent.pi = new_policy
            return True

    # q-learning: alternate policy evaluation and policy improvement
    def q_learning(self, agent, env):
        for i in range(10):
            for j in range(3000):
                self.q_learn_eval(agent, env)
            self.policy_improve(agent)
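
For completeness, a hedged sketch of how these classes might be driven. SnakeEnv, TableAgent, and eval_game below are assumed helpers in the spirit of the earlier posts in this series (a tabular agent exposing value_q, value_n, pi, gamma, s_len and play(state, epsilon), plus an evaluation routine); they are not defined here:

env = SnakeEnv()                     # assumed environment from the earlier posts
agent = TableAgent(env)              # assumed tabular agent
SARSA(epsilon=0.1).sarsa(agent, env)
print('SARSA return_pi =', eval_game(env, agent.pi))

agent2 = TableAgent(env)
QLearning(epsilon=0.1).q_learning(agent2, env)
print('Q-Learning return_pi =', eval_game(env, agent2.pi))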

Let's look at how the algorithms perform, with the number of episodes in the policy-evaluation phase set to 3000 for each method:

ladders info:
{18: 31, 99: 58, 17: 40, 92: 37, 72: 37, 12: 55, 21: 52, 44: 82, 6: 31, 98: 56, 31: 6, 58: 99, 40: 17, 37: 72, 55: 12, 52: 21, 82: 44, 56: 98}
Timer MonteCarlo COST:14.503098726272583
return_pi=22
SARSA COST:17.623368978500366
return_pi=61
Q Learning COST:33.774227142333984
return_pi=90

ladders info:
{72: 57, 29: 17, 57: 72, 62: 20, 84: 75, 28: 8, 75: 84, 51: 69, 68: 6, 53: 31, 20: 62, 17: 29, 49: 84, 8: 28, 69: 51, 6: 68, 31: 53}
Timer MonteCarlo COST:4.651626825332642
return_pi=64
SARSA COST:5.257284879684448
return_pi=34
Q Learning COST:8.114835023880005
return_pi=67

Overall, Q-Learning takes the longest to run but also achieves the best result.
