Zhou Bolei, "Intro to Reinforcement Learning" (《强化学习纲要》)
Study Notes
Course materials: https://github.com/zhoubolei/introRL
Textbook: Sutton and Barto, Reinforcement Learning: An Introduction
So far we have only dealt with small-scale problems that have few states, but many real-world problems are large-scale, e.g. chess ($10^{47}$ states) and Go ($10^{170}$ states). How can we use model-free methods to estimate and control the value functions of such large-scale problems?
Types of value function approximation
Type 1: the input is a state and the output is the approximate value of that state.
Type 2: for the q-function, both the state and the action are inputs, and the output is the value of that state-action pair.
Type 3: for the q-function, the input is a state and the output is the q-value of every possible action; taking the argmax over these outputs then selects the best action.
Function approximation
There are many possible forms of representation; these notes focus on differentiable approximators (linear combinations of features and neural networks).
Value Function Approximation with an Oracle
Value function approximation is also optimized with gradient descent. Suppose an oracle told us the true value of every state: how should we then optimize (fit) the value function?
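A sketch of the standard formulation (following Sutton and Barto): minimize the mean-squared error between the oracle value $v_\pi(S)$ and the approximation $\hat{v}(S, w)$, and update the weights by stochastic gradient descent:
$$J(w) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, w))^2\big]$$
$$\Delta w = \alpha\,\big(v_\pi(S) - \hat{v}(S, w)\big)\,\nabla_w \hat{v}(S, w)$$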
Describing states with feature vectors
Linear Value Function Approximation with Table Lookup Features
A table-lookup feature is a one-hot vector:
the vector is all zeros except for a single 1; the entry corresponding to the current state is 1 and every other entry is 0.
The one-hot vector is combined with a linear model:
the parameter vector $w_1, \dots, w_n$ is multiplied with the state's feature vector.
Because the feature is one-hot, the fitted value function reduces to the weight $w_k$ at the corresponding position, so optimizing the approximation amounts to estimating $w_k$.
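Written out as formulas (a restatement of the above):
$$x^{\text{table}}(S) = \big(\mathbf{1}(S = s_1), \dots, \mathbf{1}(S = s_n)\big)^{\top}, \qquad \hat{v}(S, w) = x(S)^{\top} w$$
so that if $S = s_k$ then $\hat{v}(S, w) = w_k$, which recovers the usual table-lookup value function.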
Control with value function approximation is achieved via generalized policy iteration: approximate policy evaluation of $\hat{q}(\cdot, \cdot, w)$, alternated with ε-greedy policy improvement.
Linear Action-Value Function Approximation
In practice there is no oracle, so a target is substituted for the true action-value (the return for MC, or the TD target for TD learning). This yields a gradient, which is then used to update the parameters of the approximate q-function.
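A sketch of the resulting (semi-)gradient updates, substituting the MC return $G_t$ or the TD(0) target for the unknown $q_\pi(S_t, A_t)$:
$$\Delta w = \alpha\,\big(G_t - \hat{q}(S_t, A_t, w)\big)\,\nabla_w \hat{q}(S_t, A_t, w)$$
$$\Delta w = \alpha\,\big(R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S_t, A_t, w)\big)\,\nabla_w \hat{q}(S_t, A_t, w)$$
For a linear approximator $\hat{q}(S, A, w) = x(S, A)^{\top} w$, the gradient is simply $\nabla_w \hat{q}(S, A, w) = x(S, A)$.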
Semi-gradient Sarsa for VFA Control
Initialize the weights $w$ to be optimized. At each step: if $S'$ is terminal, use the reward as the target; otherwise take one more step, sample $A'$, construct the TD target as the oracle, and compute the gradient from it. $w$ is updated after every single step, and both $S$ and $A$ are then updated as well (a sketch in code follows).
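A minimal Python sketch of semi-gradient Sarsa with a linear $\hat{q}$, assuming an old-style Gym environment (as in the DQN code later in these notes) and a hypothetical feature function phi(s, a) that returns a NumPy vector; this is an illustration, not the course's reference implementation:

import numpy as np

def semi_gradient_sarsa(env, phi, n_features, num_actions,
                        alpha=0.01, gamma=0.99, eps=0.1, episodes=500):
    w = np.zeros(n_features)                    # weights to be optimized
    q = lambda s, a: phi(s, a) @ w              # linear q-hat

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(num_actions)
        return int(np.argmax([q(s, a) for a in range(num_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            if done:
                target = r                                # terminal: target is just the reward
            else:
                a_next = eps_greedy(s_next)               # sample A'
                target = r + gamma * q(s_next, a_next)    # TD target as the "oracle"
            w += alpha * (target - q(s, a)) * phi(s, a)   # semi-gradient update of w
            if not done:
                s, a = s_next, a_next                     # update both S and A
    return w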
Convergence issues
The Deadly Triad for the Danger of Instability and Divergence
Potential sources of instability and divergence: function approximation, bootstrapping, and off-policy training; the danger arises when all three are combined.
Convergence of control algorithms
Least Squares Prediction
Stochastic Gradient Descent (sampling) with Experience Replay
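A sketch of the idea: store the experience as state-value pairs $\mathcal{D} = \{\langle s_1, v_1^{\pi}\rangle, \dots, \langle s_T, v_T^{\pi}\rangle\}$, then repeatedly sample a pair $\langle s, v^{\pi}\rangle \sim \mathcal{D}$ and apply one SGD step
$$\Delta w = \alpha\,\big(v^{\pi} - \hat{v}(s, w)\big)\,\nabla_w \hat{v}(s, w),$$
which converges toward the least-squares solution $w^{\pi} = \arg\min_w \sum_{t=1}^{T}\big(v_t^{\pi} - \hat{v}(s_t, w)\big)^2$.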
How can a nonlinear function be used to fit the value function?
Linear value function approximation
Linear vs. nonlinear value function approximation
Convolutional neural networks
The combination of deep learning and reinforcement learning.
Review: Action-Value Function Approximation
Review: Incremental Control Algorithm
DQN for Playing Atari Games
Q-learning with Value Function Approximation
Why fixed target
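A sketch of the objective this refers to: with experience replay and a separate target network whose parameters $w^{-}$ are frozen and only periodically copied from the online parameters $w$, DQN minimizes
$$\mathcal{L}(w) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} \hat{Q}(s', a'; w^{-}) - \hat{Q}(s, a; w)\big)^{2}\Big].$$
Freezing $w^{-}$ keeps the regression target stable between updates, which avoids chasing a moving target and reduces oscillation and divergence.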
Results of DQNs on Atari
Ablation Study on DQNs
Demo of Breakout by DQN:
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Demo of Flappy Bird by DQN:
https://www.youtube.com/watch?v=xM62SpKAZHU
Code of DQN in PyTorch:
https://github.com/cuhkrlcourse/DeepRL-Tutorials/blob/master/01.DQN.ipynb
Code of Flappy Bird:
https://github.com/xmfbit/DQN-FlappyBird
Code of DQN in PyTorch:
Imports
import gym
from gym import wrappers
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
from IPython.display import clear_output
from matplotlib import pyplot as plt
%matplotlib inline
import random
from timeit import default_timer as timer
from datetime import timedelta
import math
from utils.wrappers import make_atari, wrap_deepmind, wrap_pytorch
from utils.hyperparameters import Config
from agents.BaseAgent import BaseAgent
Hyperparameters
config = Config()
config.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#epsilon variables
config.epsilon_start = 1.0
config.epsilon_final = 0.01
config.epsilon_decay = 30000
config.epsilon_by_frame = lambda frame_idx: config.epsilon_final + (config.epsilon_start - config.epsilon_final) * math.exp(-1. * frame_idx / config.epsilon_decay)
#misc agent variables
config.GAMMA=0.99
config.LR=1e-4
#memory
config.TARGET_NET_UPDATE_FREQ = 1000
config.EXP_REPLAY_SIZE = 100000
config.BATCH_SIZE = 32
#Learning control variables
config.LEARN_START = 10000
config.MAX_FRAMES=1000000
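For intuition (my own quick check, values approximate): ε decays exponentially from epsilon_start toward epsilon_final with time constant epsilon_decay.

print(round(config.epsilon_by_frame(0), 3))       # 1.0
print(round(config.epsilon_by_frame(30000), 3))   # ~0.374
print(round(config.epsilon_by_frame(90000), 3))   # ~0.059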
Replay Memory
class ExperienceReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []

    def push(self, transition):
        self.memory.append(transition)
        if len(self.memory) > self.capacity:
            del self.memory[0]

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
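A quick check of the interface (toy transitions, purely illustrative):

memory = ExperienceReplayMemory(capacity=3)
for t in range(5):
    memory.push((f"s{t}", 0, 1.0, f"s{t+1}"))  # (state, action, reward, next_state)
print(len(memory))       # 3 -- the two oldest transitions were evicted
print(memory.sample(2))  # two random transitions from the buffer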
Network Declaration
class DQN(nn.Module):
    def __init__(self, input_shape, num_actions):
        super(DQN, self).__init__()
        self.input_shape = input_shape
        self.num_actions = num_actions

        self.conv1 = nn.Conv2d(self.input_shape[0], 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(self.feature_size(), 512)
        # output layer: one Q-value per action
        self.fc2 = nn.Linear(512, self.num_actions)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def feature_size(self):
        return self.conv3(self.conv2(self.conv1(torch.zeros(1, *self.input_shape)))).view(1, -1).size(1)
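A quick shape check (my own illustration; it assumes a single 84×84 input frame, which is what the wrappers in the training loop below produce with frame_stack=False, and 6 actions as in Pong):

net = DQN(input_shape=(1, 84, 84), num_actions=6)
q_values = net(torch.zeros(1, 1, 84, 84))
print(q_values.shape)  # torch.Size([1, 6]) -- one Q-value per action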
Agent
class Model(BaseAgent):
    def __init__(self, static_policy=False, env=None, config=None):
        super(Model, self).__init__()
        self.device = config.device

        self.gamma = config.GAMMA
        self.lr = config.LR
        self.target_net_update_freq = config.TARGET_NET_UPDATE_FREQ
        self.experience_replay_size = config.EXP_REPLAY_SIZE
        self.batch_size = config.BATCH_SIZE
        self.learn_start = config.LEARN_START

        self.static_policy = static_policy
        self.num_feats = env.observation_space.shape
        self.num_actions = env.action_space.n
        self.env = env

        self.declare_networks()
        # build the target model and initialize it with the online model's weights
        self.target_model.load_state_dict(self.model.state_dict())
        self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr)

        # move to correct device
        # the online (original) model
        self.model = self.model.to(self.device)
        # the target model
        self.target_model.to(self.device)

        if self.static_policy:
            self.model.eval()
            self.target_model.eval()
        else:
            self.model.train()
            self.target_model.train()

        self.update_count = 0
        self.declare_memory()

    def declare_networks(self):
        self.model = DQN(self.num_feats, self.num_actions)
        self.target_model = DQN(self.num_feats, self.num_actions)

    def declare_memory(self):
        self.memory = ExperienceReplayMemory(self.experience_replay_size)

    def append_to_replay(self, s, a, r, s_):
        self.memory.push((s, a, r, s_))

    def prep_minibatch(self):
        # random transition batch is taken from experience replay memory
        transitions = self.memory.sample(self.batch_size)
        batch_state, batch_action, batch_reward, batch_next_state = zip(*transitions)

        shape = (-1,) + self.num_feats

        batch_state = torch.tensor(batch_state, device=self.device, dtype=torch.float).view(shape)
        batch_action = torch.tensor(batch_action, device=self.device, dtype=torch.long).squeeze().view(-1, 1)
        batch_reward = torch.tensor(batch_reward, device=self.device, dtype=torch.float).squeeze().view(-1, 1)

        non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch_next_state)), device=self.device, dtype=torch.bool)
        try:  # sometimes all next states are None (every transition in the batch is terminal)
            non_final_next_states = torch.tensor([s for s in batch_next_state if s is not None], device=self.device, dtype=torch.float).view(shape)
            empty_next_state_values = False
        except Exception:
            non_final_next_states = None
            empty_next_state_values = True

        return batch_state, batch_action, batch_reward, non_final_next_states, non_final_mask, empty_next_state_values

    def compute_loss(self, batch_vars):
        batch_state, batch_action, batch_reward, non_final_next_states, non_final_mask, empty_next_state_values = batch_vars

        # estimate: Q(s, a) from the online network
        current_q_values = self.model(batch_state).gather(1, batch_action)

        # build the TD target
        with torch.no_grad():
            # container for max_a' Q_target(s', a'); stays 0 for terminal next states
            max_next_q_values = torch.zeros(self.batch_size, device=self.device, dtype=torch.float).unsqueeze(dim=1)
            if not empty_next_state_values:
                max_next_action = self.get_max_next_state_action(non_final_next_states)
                # max_next_q_values is computed with the target network:
                # feed the next states into the target network and take the value of the max action
                max_next_q_values[non_final_mask] = self.target_model(non_final_next_states).gather(1, max_next_action)
            expected_q_values = batch_reward + (self.gamma * max_next_q_values)

        # loss
        diff = (expected_q_values - current_q_values)
        loss = self.huber(diff)
        loss = loss.mean()

        return loss

    def update(self, s, a, r, s_, frame=0):
        if self.static_policy:
            return None

        self.append_to_replay(s, a, r, s_)

        if frame < self.learn_start:
            return None

        batch_vars = self.prep_minibatch()
        loss = self.compute_loss(batch_vars)

        # Optimize the model
        self.optimizer.zero_grad()
        loss.backward()
        for param in self.model.parameters():
            param.grad.data.clamp_(-1, 1)
        self.optimizer.step()

        # every target_net_update_freq updates, copy the online model into the target model,
        # so the two networks start out identical and the target network then lags behind the online one
        self.update_target_model()
        self.save_loss(loss.item())
        self.save_sigma_param_magnitudes()

    def get_action(self, s, eps=0.1):
        with torch.no_grad():
            if np.random.random() >= eps or self.static_policy:
                X = torch.tensor([s], device=self.device, dtype=torch.float)
                a = self.model(X).max(1)[1].view(1, 1)
                return a.item()
            else:
                return np.random.randint(0, self.num_actions)

    def update_target_model(self):
        self.update_count += 1
        self.update_count = self.update_count % self.target_net_update_freq
        if self.update_count == 0:
            self.target_model.load_state_dict(self.model.state_dict())

    def get_max_next_state_action(self, next_states):
        return self.target_model(next_states).max(dim=1)[1].view(-1, 1)

    def huber(self, x):
        cond = (x.abs() < 1.0).to(torch.float)
        return 0.5 * x.pow(2) * cond + (x.abs() - 0.5) * (1 - cond)
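Side note (my own observation, not part of the original notebook): with PyTorch's default beta of 1.0, the element-wise Huber loss above computes the same values as the built-in smooth L1 loss, so the following would be an equivalent drop-in inside compute_loss:

# equivalent element-wise loss (default beta=1.0):
# loss = F.smooth_l1_loss(current_q_values, expected_q_values, reduction='none').mean()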
Plot Results
def plot(frame_idx, rewards, losses, sigma, elapsed_time):
    clear_output(True)
    plt.figure(figsize=(20, 5))
    plt.subplot(131)
    plt.title('frame %s. reward: %s. time: %s' % (frame_idx, np.mean(rewards[-10:]), elapsed_time))
    plt.plot(rewards)
    if losses:
        plt.subplot(132)
        plt.title('loss')
        plt.plot(losses)
    if sigma:
        plt.subplot(133)
        plt.title('noisy param magnitude')
        plt.plot(sigma)
    plt.show()
Training Loop
start = timer()

env_id = "PongNoFrameskip-v4"
env = make_atari(env_id)
env = wrap_deepmind(env, frame_stack=False)
env = wrap_pytorch(env)
model = Model(env=env, config=config)

episode_reward = 0

observation = env.reset()
for frame_idx in range(1, config.MAX_FRAMES + 1):
    epsilon = config.epsilon_by_frame(frame_idx)

    action = model.get_action(observation, epsilon)
    prev_observation = observation
    observation, reward, done, _ = env.step(action)
    # store a terminal next state as None so the replay batch can mask it out
    observation = None if done else observation

    model.update(prev_observation, action, reward, observation, frame_idx)
    episode_reward += reward

    if done:
        observation = env.reset()
        model.save_reward(episode_reward)
        episode_reward = 0

        # stop once the average reward of the last 10 episodes exceeds +19
        if np.mean(model.rewards[-10:]) > 19:
            plot(frame_idx, model.rewards, model.losses, model.sigma_parameter_mag, timedelta(seconds=int(timer() - start)))
            break

    if frame_idx % 10000 == 0:
        plot(frame_idx, model.rewards, model.losses, model.sigma_parameter_mag, timedelta(seconds=int(timer() - start)))

model.save_w()
env.close()
Summary of DQNs
Improving DQN
March 31, 2020: Agent57
A collection of the improvements made to DQN over the preceding five years.
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Optional Homework