The Monte Carlo method computes the action value as follows:
Let $q^i = r_0 + \gamma r^i_1 + \gamma^2 r^i_2 + \cdots$ be the discounted return collected in the $i$-th sampled episode; the running average is then updated incrementally:
$Q^i = Q^{i-1} + \frac{1}{i}\left(q^i - Q^{i-1}\right)$
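As a concrete illustration of this incremental form, the following sketch (names are illustrative and not part of the code later in this section) computes one episode's discounted return and folds it into the running average:

def mc_update(Q, N, rewards, gamma):
    """Incremental Monte Carlo update for a single (s, a) pair.

    rewards: rewards observed after taking the action, in time order.
    Q, N: current average return and visit count for that (s, a).
    """
    q = 0.0
    for r in reversed(rewards):      # q = r_0 + gamma*r_1 + gamma^2*r_2 + ...
        q = r + gamma * q
    N += 1
    Q += (q - Q) / N                 # Q^i = Q^{i-1} + (q^i - Q^{i-1}) / i
    return Q, N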
Because $q^i$ is a full long-horizon return, its variance is large. Temporal-difference (TD) learning can be used instead to compute $Q$ and reduce this variance. The formulas become:
$q^i = r^{i-1} + \gamma\,\pi Q^{i-1}$
$Q^i = Q^{i-1} + \frac{1}{i}\left(q^i - Q^{i-1}\right)$
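The SARSA and Q-Learning classes below embed this update inside an episode loop; as a standalone illustration, a one-step TD update might look like the following sketch (function and argument names are illustrative):

def td_update(value_q, value_n, s, a, r, s_next, a_next, gamma, terminate):
    """One-step TD update: target = r + gamma * Q(s', a'), averaged with a 1/N step size."""
    target = r + gamma * (0 if terminate else value_q[s_next][a_next])
    value_n[s][a] += 1
    value_q[s][a] += (target - value_q[s][a]) / value_n[s][a]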
SARSA uses the $Q$ value of the next time step under the sampling policy to estimate the sampled $q$ value, so it is an on-policy algorithm: in SARSA, the policy $\pi$ is the very policy used to collect the samples.
Q-Learning uses the $Q$ value of the next time step under a new policy to estimate the sampled $q$ value, so it is an off-policy algorithm; the sampling itself generally follows an $\epsilon$-greedy policy. When computing $q$, Q-Learning applies the greedy $\max$ operator, substituting the value of the best action for the value actually observed in the interaction, so the value function tends to suffer from overestimation. Moreover, because the greedy policy is used, some $q$ values would never be updated, so randomness has to be injected into the policy, and $\epsilon$ is exactly the parameter that drives this exploration.
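The agent.play(state, epsilon) call used in the code below is not shown in this section; an $\epsilon$-greedy version might look like the following sketch (the value_q attribute and the method's placement on the agent are assumptions about the agent's interface):

import numpy as np

def play(self, state, epsilon=0.0):
    # epsilon-greedy selection over a tabular Q (assumed agent interface):
    # with probability epsilon pick a random action, otherwise the greedy one
    if np.random.rand() < epsilon:
        return np.random.randint(self.value_q.shape[1])
    return int(np.argmax(self.value_q[state, :]))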
Q-Learning therefore involves two models: one is used for sampling and produces the sequence of $s$, $a$, and $q$; the other is used for computation and produces the $q$ values that update the policy. These two models are called the Behavior Network and the Target Network. To reduce oscillation, the Target Network can be kept fixed for a period of time and only synchronized from the Behavior Network after a certain number of updates.
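The tabular implementation below does not need a separate Target Network, but the periodic synchronization described above can be sketched in the same tabular style (this sketch is illustrative and not part of the code below; behavior_q and target_q are assumed to be NumPy Q tables):

# Hypothetical sketch of the Behavior/Target split in tabular form: updates are written
# into behavior_q, while TD targets are computed from a frozen copy target_q that is
# only refreshed every SYNC_INTERVAL updates.
SYNC_INTERVAL = 1000  # illustrative value

def compute_target(target_q, reward, next_state, gamma, terminate):
    # the target comes from the frozen Target Network (here, a frozen Q table)
    return reward + gamma * (0 if terminate else target_q[next_state].max())

def maybe_sync(step, behavior_q, target_q):
    if step % SYNC_INTERVAL == 0:
        target_q[:] = behavior_q  # copy from the Behavior Network; stays fixed until the next sync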
import numpy as np

class SARSA(object):
    def __init__(self, epsilon=0.0):
        self.epsilon = epsilon

    def sarsa_eval(self, agent, env):
        # run one episode and update Q with the on-policy (SARSA) TD target
        state = env.reset()
        prev_state = -1
        prev_act = -1
        while True:
            act = agent.play(state, self.epsilon)
            next_state, reward, terminate, _ = env.step(act)
            if prev_act != -1:
                # every-visit update of Q(prev_state, prev_act);
                # the target uses the Q value of the action actually sampled in the current state
                return_val = reward + agent.gamma * (0 if terminate else agent.value_q[state][act])
                agent.value_n[prev_state][prev_act] += 1
                agent.value_q[prev_state][prev_act] += (return_val -
                    agent.value_q[prev_state][prev_act]) / agent.value_n[prev_state][prev_act]
            prev_act = act
            prev_state = state
            state = next_state
            if terminate:
                break

    def policy_improve(self, agent):
        # greedy policy improvement with respect to the current Q table
        new_policy = np.zeros_like(agent.pi)
        for i in range(1, agent.s_len):
            new_policy[i] = np.argmax(agent.value_q[i, :])
        if np.all(np.equal(new_policy, agent.pi)):
            return False
        else:
            agent.pi = new_policy
            return True

    # generalized policy iteration: 2000 episodes of SARSA evaluation per improvement step
    def sarsa(self, agent, env):
        for i in range(10):
            for j in range(2000):
                self.sarsa_eval(agent, env)
            self.policy_improve(agent)
class QLearning(object):
    def __init__(self, epsilon=0.0):
        self.epsilon = epsilon

    def q_learn_eval(self, agent, env):
        # run one episode and update Q with the off-policy (max) TD target
        state = env.reset()
        prev_state = -1
        prev_act = -1
        while True:
            act = agent.play(state, self.epsilon)
            next_state, reward, terminate, _ = env.step(act)
            if prev_act != -1:
                # the target uses the best action in the current state, not the sampled one
                return_val = reward + agent.gamma * (0 if terminate else np.max(agent.value_q[state, :]))
                agent.value_n[prev_state][prev_act] += 1
                agent.value_q[prev_state][prev_act] += (return_val -
                    agent.value_q[prev_state][prev_act]) / agent.value_n[prev_state][prev_act]
            prev_act = act
            prev_state = state
            state = next_state
            if terminate:
                break

    def policy_improve(self, agent):
        # greedy policy improvement with respect to the current Q table
        new_policy = np.zeros_like(agent.pi)
        for i in range(1, agent.s_len):
            new_policy[i] = np.argmax(agent.value_q[i, :])
        if np.all(np.equal(new_policy, agent.pi)):
            return False
        else:
            agent.pi = new_policy
            return True

    # generalized policy iteration: 3000 episodes of Q-Learning evaluation per improvement step
    def q_learning(self, agent, env):
        for i in range(10):
            for j in range(3000):
                self.q_learn_eval(agent, env)
            self.policy_improve(agent)
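The timing results below were produced by a driver that is not shown in this section; a possible sketch, assuming hypothetical SnakeEnv and TableAgent classes compatible with the code above (neither is defined here), might look like this:

# Hypothetical driver; SnakeEnv, TableAgent, and the constructor arguments are assumptions.
import time

def eval_game(env, policy_agent):
    """Play one episode greedily with the learned policy and return the total reward."""
    state = env.reset()
    total = 0
    while True:
        act = policy_agent.play(state, 0)   # epsilon = 0: purely greedy evaluation
        state, reward, terminate, _ = env.step(act)
        total += reward
        if terminate:
            break
    return total

env = SnakeEnv(10, [3, 6])                  # illustrative environment setup
agent = TableAgent(env)
start = time.time()
QLearning(epsilon=0.1).q_learning(agent, env)
print('Q Learning COST:{}'.format(time.time() - start))
print('return_pi={}'.format(eval_game(env, agent)))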
Now let's look at how the algorithms perform, with the number of iterations in the policy evaluation stage set to 3000 for all of them:
ladders info:
{18: 31, 99: 58, 17: 40, 92: 37, 72: 37, 12: 55, 21: 52, 44: 82, 6: 31, 98: 56, 31: 6, 58: 99, 40: 17, 37: 72, 55: 12, 52: 21, 82: 44, 56: 98}
Timer MonteCarlo COST:14.503098726272583
return_pi=22
SARSA COST:17.623368978500366
return_pi=61
Q Learning COST:33.774227142333984
return_pi=90
ladders info:
{72: 57, 29: 17, 57: 72, 62: 20, 84: 75, 28: 8, 75: 84, 51: 69, 68: 6, 53: 31, 20: 62, 17: 29, 49: 84, 8: 28, 69: 51, 6: 68, 31: 53}
Timer MonteCarlo COST:4.651626825332642
return_pi=64
SARSA COST:5.257284879684448
return_pi=34
Q Learning COST:8.114835023880005
return_pi=67
Overall, Q-Learning takes the longest to run, but it also achieves the best return in both experiments.