周博磊《强化学习纲要》(Zhou Bolei, "Intro to Reinforcement Learning")
Study notes
Course materials: https://github.com/zhoubolei/introRL
Textbook: Sutton and Barto,
《Reinforcement Learning: An Introduction》
In practice, the MDP behind many real problems is unknown, so model-free methods have to be used.
MC simulation: compute the actual return of every sampled trajectory, then average over many trajectories to obtain the value of each state.
MC policy evaluation estimates values from the empirical mean return rather than the expected return,
so it requires neither the transition function nor the reward function of the MDP, and it has no bootstrapping step as dynamic programming does.
It can only be used for MDPs that terminate (episodic tasks).
Algorithm summary
To estimate the value of a state, we start from that state and simply count: how many times the state has been visited, and how much total return has been collected from it onward; taking the empirical mean then gives the value of every state.
By the law of large numbers, once enough trajectories have been collected, this estimate converges to the value function of the policy (a minimal sketch of the procedure is given below).
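As a concrete illustration, here is a minimal first-visit MC policy-evaluation sketch. It assumes an environment with the same interface as the GridWorld used later in this note (`reset()` returns a state id, `step(action)` returns `(next_state, reward, done)`), and `policy` is any callable mapping a state to an action; none of this is the course's reference code.

```python
import numpy as np

def mc_policy_evaluation(env, policy, num_states, num_episodes=1000, gamma=0.9):
    """First-visit MC: average the returns observed after the first visit to each state."""
    values = np.zeros(num_states)
    counts = np.zeros(num_states)
    for _ in range(num_episodes):
        # Roll out one full episode with the given policy (MC needs episodes that terminate)
        episode = []
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk backwards to accumulate the return G_t, then update first-visit states
        g = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            if all(episode[k][0] != s for k in range(t)):  # first visit of s in this episode
                counts[s] += 1
                values[s] += (g - values[s]) / counts[s]   # incremental empirical mean
    return values
```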
By sampling many times we can take the average. The sum is then decomposed into the newest sample $x_t$ plus the sum of the samples from the beginning up to $t-1$; substituting the mean of the previous step, those first $t-1$ terms become $(t-1)$ times the previous mean, which establishes a relation between the previous mean and the current one.
With this relation, whenever a new sample $x_t$ arrives we subtract the previous mean from it, treat the difference as a residual, multiply it by a learning rate, and add it to the previous mean to update the current estimate. In this way the value at the previous step and the updated value at the current step are linked.
The Monte Carlo method can then also be written as an incremental MC update.
Replacing $1/N(S_t)$ (the counting term) by $\alpha$, also called the learning rate, lets us choose how fast the value estimates are updated.
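Written out, this is the standard incremental-mean algebra, followed by the incremental MC update it motivates ($\mu_t$ is the mean of the first $t$ samples, $G_t$ the sampled return, $N(S_t)$ the visit count):

$$
\mu_t = \frac{1}{t}\sum_{j=1}^{t} x_j
      = \frac{1}{t}\Big(x_t + \sum_{j=1}^{t-1} x_j\Big)
      = \frac{1}{t}\big(x_t + (t-1)\,\mu_{t-1}\big)
      = \mu_{t-1} + \frac{1}{t}\big(x_t - \mu_{t-1}\big)
$$

$$
N(S_t) \leftarrow N(S_t) + 1,\qquad
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)
\quad\Longrightarrow\quad
V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)
$$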
Differences between MC and DP
DP (dynamic programming)
MC (Monte Carlo)
What MC produces is one actual trajectory from the start to a terminal state (the blue path in the lecture's backup diagram): every action taken and every state reached along it is fixed. We use only the actually observed return to update the states on that trajectory; states unrelated to the trajectory are not updated at all.
Advantages of MC over DP
Algorithm framework
Differences between TD and MC
TD only takes one step down the backup tree, whereas MC runs the whole tree to the terminal state and then backs up the return to update the value of every state along the way (the two update rules are contrasted below).
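For reference, the two update rules being contrasted, with step size $\alpha$ and discount factor $\gamma$:

$$
\text{MC:}\quad V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big),
\qquad G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T
$$

$$
\text{TD(0):}\quad V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\big)
$$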
n-step TD
Bootstrapping and Sampling for DP, MC, and TD
Visualization
Policy iteration
Two parts: policy evaluation and policy improvement.
Generalized Policy Iteration with Action-Value Function
We can directly use MC instead of DP to estimate the q-function; once the q-function is available, the policy can be improved greedily.
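The greedy improvement step on the action-value function, in standard notation:

$$
\pi'(s) = \arg\max_{a} q^{\pi}(s, a)
$$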
How do we make sure MC explores enough while still acting greedily?
Monte Carlo with $\epsilon$-Greedy Exploration
When we follow the $\epsilon$-greedy policy, the whole q-function, and hence the value function, is monotonically non-decreasing.
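For reference, the $\epsilon$-greedy policy over $|\mathcal{A}|$ actions: with probability $1-\epsilon$ act greedily with respect to $Q$, otherwise pick an action uniformly at random:

$$
\pi(a \mid s) =
\begin{cases}
1 - \epsilon + \epsilon/|\mathcal{A}| & \text{if } a = \arg\max_{a'} Q(s, a') \\
\epsilon/|\mathcal{A}| & \text{otherwise}
\end{cases}
$$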
$\epsilon$-Greedy algorithm (pseudocode)
Notes on the pseudocode (a minimal sketch follows this list):
Line 1: the Q-table is randomly initialized at the start;
Line 4: the core of MC is to use the current policy to explore the environment and collect trajectories;
Line 7: once trajectories are collected, compute the returns and update the Q-table with the incremental-mean rule; the Q-table is indexed by two quantities, state and action;
Line 10: with the updated Q-table, perform policy improvement to obtain the policy for the next stage, and then use this better policy to collect more data.
Iterating this loop gives generalized policy iteration.
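A minimal sketch of this loop, combining MC estimation of the Q-table with $\epsilon$-greedy improvement. The environment interface matches the GridWorld below; the function name `mc_control_epsilon_greedy` and the every-visit update are illustrative choices, not the course's reference implementation.

```python
import numpy as np

def mc_control_epsilon_greedy(env, num_states, num_actions,
                              num_episodes=5000, gamma=0.9, epsilon=0.1):
    """Every-visit MC control: sample an episode with the epsilon-greedy policy,
    then update Q with incremental means; improvement is implicit in acting greedily."""
    q = np.zeros((num_states, num_actions))       # line 1: Q-table initialization
    counts = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        # line 4: roll out one episode with the current epsilon-greedy policy
        episode = []
        state, done = env.reset(), False
        while not done:
            if np.random.random() < epsilon:
                action = np.random.choice(num_actions)
            else:
                action = np.argmax(q[state])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # line 7: back up the return and update Q with the incremental mean
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            counts[state, action] += 1
            q[state, action] += (g - q[state, action]) / counts[state, action]
        # line 10: the improved policy is the epsilon-greedy policy w.r.t. the new Q,
        # which is exactly what the next rollout follows
    return q
```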
MC vs. TD for Prediction and Control
Review of the steps of TD prediction
On-policy means that there is only one policy: we use it to collect data, and it is also the policy that we are optimizing.
The Sarsa algorithm in detail
We first initialize the Q-table; sample an action A from it; take A and obtain reward R and the next state S'; sample A' from the Q-table again; once all of this data is collected, the Q-table can be updated; after the update we move one step forward, S becomes S' and A becomes A'; and we keep iterating step by step (the update rule is written out below).
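The Sarsa update applied at each step to the tuple $(S, A, R, S', A')$, which is where the algorithm's name comes from:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma\, Q(S', A') - Q(S, A)\big)
$$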
n-step Sarsa
As mentioned earlier, the TD algorithm can be extended to more steps, which gives n-step Sarsa.
One-step Sarsa updates its TD target after moving one step forward; two-step Sarsa takes two actually observed rewards and then bootstraps from the Q-value to form the TD target; extended all the way to the end of the episode, Sarsa turns into the MC style of update (the n-step target is written out below).
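The n-step Sarsa target described here, in the usual notation; $n = 1$ recovers one-step Sarsa and running to the end of the episode recovers the MC update:

$$
q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n}),
\qquad
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(q_t^{(n)} - Q(S_t, A_t)\big)
$$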
Sarsa is an on-policy learning method.
Off-policy Learning
The Q-learning algorithm
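The Q-learning update, which bootstraps from the maximizing next action rather than from the action the behavior policy actually takes next (this is what makes it off-policy); it is the same rule implemented in the code below:

$$
Q(S, A) \leftarrow Q(S, A) + \alpha\big(R + \gamma \max_{a'} Q(S', a') - Q(S, A)\big)
$$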
Example on Cliff Walk
https://github.com/cuhkrlcourse/RLexample/blob/master/modelfree/cliffwalk.py
Summary of DP and TD
https://github.com/cuhkrlcourse/RLexample/tree/master/modelfree
cliffwalk.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb
def change_range(values, vmin=0, vmax=1):
    start_zero = values - np.min(values)
    return (start_zero / (np.max(start_zero) + 1e-7)) * (vmax - vmin) + vmin

class GridWorld:
    terrain_color = dict(normal=[127/360, 0, 96/100],
                         objective=[26/360, 100/100, 100/100],
                         cliff=[247/360, 92/100, 70/100],
                         player=[344/360, 93/100, 100/100])

    def __init__(self):
        self.player = None
        self._create_grid()
        self._draw_grid()
        self.num_steps = 0

    def _create_grid(self, initial_grid=None):
        self.grid = self.terrain_color['normal'] * np.ones((4, 12, 3))
        self._add_objectives(self.grid)

    def _add_objectives(self, grid):
        grid[-1, 1:11] = self.terrain_color['cliff']
        grid[-1, -1] = self.terrain_color['objective']

    def _draw_grid(self):
        self.fig, self.ax = plt.subplots(figsize=(12, 4))
        self.ax.grid(which='minor')
        self.q_texts = [self.ax.text(*self._id_to_position(i)[::-1], '0',
                                     fontsize=11, verticalalignment='center',
                                     horizontalalignment='center') for i in range(12 * 4)]
        self.im = self.ax.imshow(hsv_to_rgb(self.grid), cmap='terrain',
                                 interpolation='nearest', vmin=0, vmax=1)
        self.ax.set_xticks(np.arange(12))
        self.ax.set_xticks(np.arange(12) - 0.5, minor=True)
        self.ax.set_yticks(np.arange(4))
        self.ax.set_yticks(np.arange(4) - 0.5, minor=True)

    def reset(self):
        self.player = (3, 0)
        self.num_steps = 0
        return self._position_to_id(self.player)

    def step(self, action):
        # Possible actions
        if action == 0 and self.player[0] > 0:
            self.player = (self.player[0] - 1, self.player[1])
        if action == 1 and self.player[0] < 3:
            self.player = (self.player[0] + 1, self.player[1])
        if action == 2 and self.player[1] < 11:
            self.player = (self.player[0], self.player[1] + 1)
        if action == 3 and self.player[1] > 0:
            self.player = (self.player[0], self.player[1] - 1)
        self.num_steps = self.num_steps + 1
        # Rules
        if all(self.grid[self.player] == self.terrain_color['cliff']):
            reward = -100
            done = True
        elif all(self.grid[self.player] == self.terrain_color['objective']):
            reward = 0
            done = True
        else:
            reward = -1
            done = False
        return self._position_to_id(self.player), reward, done

    def _position_to_id(self, pos):
        ''' Maps a position in x,y coordinates to a unique ID '''
        return pos[0] * 12 + pos[1]

    def _id_to_position(self, idx):
        return (idx // 12), (idx % 12)

    def render(self, q_values=None, action=None, max_q=False, colorize_q=False):
        assert self.player is not None, 'You first need to call .reset()'
        if colorize_q:
            assert q_values is not None, 'q_values must not be None for using colorize_q'
            grid = self.terrain_color['normal'] * np.ones((4, 12, 3))
            values = change_range(np.max(q_values, -1)).reshape(4, 12)
            grid[:, :, 1] = values
            self._add_objectives(grid)
        else:
            grid = self.grid.copy()
        grid[self.player] = self.terrain_color['player']
        self.im.set_data(hsv_to_rgb(grid))
        if q_values is not None:
            xs = np.repeat(np.arange(12), 4)
            ys = np.tile(np.arange(4), 12)
            for i, text in enumerate(self.q_texts):
                if max_q:
                    q = max(q_values[i])
                    txt = '{:.2f}'.format(q)
                    text.set_text(txt)
                else:
                    actions = ['U', 'D', 'R', 'L']
                    txt = '\n'.join(['{}: {:.2f}'.format(k, q) for k, q in zip(actions, q_values[i])])
                    text.set_text(txt)
        if action is not None:
            self.ax.set_title(action, color='r', weight='bold', fontsize=32)
        plt.pause(0.01)

def egreedy_policy(q_values, state, epsilon=0.1):
    '''
    Choose an action based on an epsilon-greedy policy.
    A random action is selected with probability epsilon; otherwise the best action is selected.
    '''
    if np.random.random() < epsilon:
        return np.random.choice(4)
    else:
        return np.argmax(q_values[state])

def q_learning(env, num_episodes=500, render=True, exploration_rate=0.1,
               learning_rate=0.5, gamma=0.9):
    q_values = np.zeros((num_states, num_actions))
    ep_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        reward_sum = 0
        while not done:
            # Choose action
            # The first action, produced by the epsilon-greedy policy
            action = egreedy_policy(q_values, state, exploration_rate)
            # Do the action
            # Move one step forward in the environment
            next_state, reward, done = env.step(action)
            reward_sum += reward
            # Update q_values
            # Bootstrap with the max over next-state Q values to build the current TD target
            td_target = reward + gamma * np.max(q_values[next_state])
            td_error = td_target - q_values[state][action]
            # Once the TD target is available, the Q value can be updated right away;
            # no second action needs to be executed
            q_values[state][action] += learning_rate * td_error
            # Update state
            # Move on to the next state
            state = next_state
            if render:
                env.render(q_values, action=actions[action], colorize_q=True)
        ep_rewards.append(reward_sum)
    return ep_rewards, q_values

def sarsa(env, num_episodes=500, render=True, exploration_rate=0.1,
          learning_rate=0.5, gamma=0.9):
    q_values_sarsa = np.zeros((num_states, num_actions))
    ep_rewards = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        reward_sum = 0
        # Choose action
        # The first action
        action = egreedy_policy(q_values_sarsa, state, exploration_rate)
        while not done:
            # Do the action
            next_state, reward, done = env.step(action)
            reward_sum += reward
            # Choose next action
            # The second action, also obtained by sampling from the epsilon-greedy policy
            next_action = egreedy_policy(q_values_sarsa, next_state, exploration_rate)
            # Next q value is the value of the next action
            # Build the TD target
            td_target = reward + gamma * q_values_sarsa[next_state][next_action]
            # Compute the TD error
            td_error = td_target - q_values_sarsa[state][action]
            # Update q value
            q_values_sarsa[state][action] += learning_rate * td_error
            # Update state and action
            state = next_state
            action = next_action
            if render:
                env.render(q_values_sarsa, action=actions[action], colorize_q=True)
        ep_rewards.append(reward_sum)
    return ep_rewards, q_values_sarsa

def play(q_values):
    # Simulate the environment using the learned Q values
    env = GridWorld()
    state = env.reset()
    done = False
    while not done:
        # Select action greedily (epsilon = 0)
        action = egreedy_policy(q_values, state, 0.0)
        # Do the action
        next_state, reward, done = env.step(action)
        # Update state and action
        state = next_state
        env.render(q_values=q_values, action=actions[action], colorize_q=True)

UP = 0
DOWN = 1
RIGHT = 2
LEFT = 3
actions = ['UP', 'DOWN', 'RIGHT', 'LEFT']
### Define the environment
env = GridWorld()
num_states = 4 * 12  # The number of states is simply the number of "squares" in our grid world, in this case 4 * 12
num_actions = 4 # We have 4 possible actions, up, down, right and left
### Q-learning for cliff walk
q_learning_rewards, q_values = q_learning(env, gamma=0.9, learning_rate=1, render=False)
env.render(q_values, colorize_q=True)
q_learning_rewards, _ = zip(*[q_learning(env, render=False, exploration_rate=0.1,
learning_rate=1) for _ in range(10)])
avg_rewards = np.mean(q_learning_rewards, axis=0)
mean_reward = [np.mean(avg_rewards)] * len(avg_rewards)
fig, ax = plt.subplots()
ax.set_xlabel('Episodes using Q-learning')
ax.set_ylabel('Rewards')
ax.plot(avg_rewards)
ax.plot(mean_reward, 'g--')
print('Mean Reward using Q-Learning: {}'.format(mean_reward[0]))
### Sarsa learning for cliff walk
sarsa_rewards, q_values_sarsa = sarsa(env, render=False, learning_rate=0.5, gamma=0.99)
sarsa_rewards, _ = zip(*[sarsa(env, render=False, exploration_rate=0.2) for _ in range(10)])
avg_rewards = np.mean(sarsa_rewards, axis=0)
mean_reward = [np.mean(avg_rewards)] * len(avg_rewards)
fig, ax = plt.subplots()
ax.set_xlabel('Episodes using Sarsa')
ax.set_ylabel('Rewards')
ax.plot(avg_rewards)
ax.plot(mean_reward, 'g--')
print('Mean Reward using Sarsa: {}'.format(mean_reward[0]))
# Visualize an episode at inference time for Q-learning and Sarsa
play(q_values)
play(q_values_sarsa)