This tutorial shows how to use PyTorch to train a Deep Q-Network (DQN) agent on the CartPole-v0 task from the gym library.
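Task
The agent has to choose between two actions, moving the cart left or right, so that the pole attached to it stays upright. The Gym website hosts an official leaderboard with various algorithms and visualizations.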
CartPole
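As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and returns a reward that indicates the consequences of the action. In this task, the reward is +1 for every time step, and the environment terminates if the pole falls over too far or the cart moves more than 2.4 units away from the center. Better-performing scenarios therefore run longer and accumulate a larger return.
The CartPole task is designed so that the inputs to the agent are 4 real values representing the environment state (position, velocity, etc.). However, a neural network can solve the task purely by looking at the scene, that is, from screenshots.
We will therefore use a patch of the screen centered on the cart as the input. Because of this, our results are not directly comparable to those on the official leaderboard: our task is much harder. Unfortunately, this also slows down training, because we have to render every frame.
We represent the state as the difference between the current screen patch and the previous one.
This allows the agent to take the velocity of the pole into account from a single image.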
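As a rough illustration of why a difference image carries motion information, here is a minimal sketch with plain Python lists. The helper `frame_difference` and the tiny 1x5 "screens" are hypothetical stand-ins for the rendered tensors used later in the tutorial.

```python
def frame_difference(current, previous):
    """Element-wise difference of two equally sized 2D frames (lists of rows)."""
    return [[c - p for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]

# Hypothetical 1x5 "screens": a single bright pixel moves one step to the right.
previous = [[0, 1, 0, 0, 0]]
current = [[0, 0, 1, 0, 0]]

diff = frame_difference(current, previous)
print(diff)  # [[0, -1, 1, 0, 0]]
```

The negative entry marks where the object was and the positive entry marks where it is now, so both direction and speed can be read from one array. A static scene gives an all-zero difference.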
Packages
First, let's import the packages we need. We need gym for the environment, and the following from PyTorch:
- neural networks (torch.nn)
- optimization (torch.optim)
- automatic differentiation (torch.autograd)
- utilities for vision tasks (torchvision, a separate package)
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple, deque
from itertools import count
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
env = gym.make('CartPole-v0').unwrapped
# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display
plt.ion()
# if gpu is to be used
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Replay Memory
We will use experience replay memory to train our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. Because batches are sampled from it at random, the transitions that make up a batch are decorrelated, which stabilizes and improves the DQN training procedure.
For this, we need two classes:
Transition: a named tuple representing a single transition in our environment. It essentially maps (state, action) pairs to their (next_state, reward) result, with the state being the screen difference image described later on.
ReplayMemory: a cyclic buffer of bounded size that holds the recently observed transitions. It also implements a .sample() method for selecting a random batch of transitions for training.
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):

    def __init__(self, capacity):
        self.memory = deque([], maxlen=capacity)

    def push(self, *args):
        """Save a transition"""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
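The bounded, cyclic behaviour comes from `deque(maxlen=...)`. The following sketch shows it with a deliberately tiny capacity. The integer states and the capacity of 3 are made up for illustration; they stand in for the screen-difference tensors and the much larger buffer used in the tutorial.

```python
import random
from collections import deque, namedtuple

# Same field layout as the tutorial's Transition tuple.
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

# A bounded deque drops its oldest entry once maxlen is reached.
memory = deque([], maxlen=3)
for step in range(5):
    # Hypothetical integer states stand in for the screen-difference tensors.
    memory.append(Transition(step, 0, step + 1, 1.0))

stored_states = [t.state for t in memory]
print(stored_states)  # [2, 3, 4]

random.seed(0)  # seeded only to make the demo repeatable
batch = random.sample(list(memory), 2)
print(len(batch))  # 2
```

Only the three most recent transitions survive, and the sampled batch is drawn without replacement from whatever is currently stored.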
Now, let's define our model. But first, let's quickly recap what a DQN is.
DQN algorithm
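Our environment is deterministic, so all equations presented here are also formulated deterministically for the sake of simplicity. In the reinforcement learning literature, they would also contain expectations over stochastic transitions in the environment.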
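Our aim is to train a policy that tries to maximize the discounted, cumulative reward, also known as the return:
$$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$$
The discount γ should be a constant between 0 and 1 that ensures the sum converges. It makes rewards from the uncertain, far future less important to our agent than rewards from the near future.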
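To make the effect of the discount concrete, here is a minimal sketch that evaluates the return for a short, finite reward sequence. The function name `discounted_return` and the value γ = 0.5 are chosen only for illustration; the tutorial itself uses a γ much closer to 1.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite list of rewards, k starting at 0."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three steps of CartPole-style reward (+1 per step) with a hypothetical gamma of 0.5.
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

Each later reward is weighted by one more factor of γ, so here the three rewards contribute 1, 0.5 and 0.25.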
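The main idea behind Q-learning is that if we had a function Q*: State × Action → ℝ that could tell us what our return would be if we took a given action in a given state, we could easily construct a policy that maximizes our rewards:
$$\pi^*(s) = \arg\max_a Q^*(s, a)$$
However, we don't know everything about the world, so we don't have access to Q*. But since neural networks are universal function approximators, we can simply create one and train it to resemble Q*.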
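Constructing a policy from a Q function amounts to picking the action with the largest value in each state. A minimal sketch, assuming the Q-values for one state are held in a plain dictionary (the helper `greedy_action` and the numbers are hypothetical):

```python
def greedy_action(q_values):
    """Return the action whose Q-value is largest for one state."""
    return max(q_values, key=q_values.get)

# Hypothetical Q-values for a single state; the numbers are made up for illustration.
q_for_state = {"left": 0.2, "right": 0.7}
print(greedy_action(q_for_state))  # right
```

In the tutorial the same choice is made on the network's output tensor rather than on a dictionary.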
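For our training update rule, we use the fact that every Q function for some policy obeys the Bellman equation:
$$Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))$$
The difference between the two sides of the equality is known as the temporal difference error, δ:
$$\delta = Q(s, a) - (r + \gamma \max_a Q(s', a))$$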
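The temporal difference error can be checked by hand on a single transition. The sketch below uses scalar, made-up Q-values. The helper `td_error` is hypothetical, and treating an empty list of next-state values as a terminal state with value 0 is an assumption of this example.

```python
def td_error(q_sa, reward, next_q_values, gamma):
    """delta = Q(s, a) - (r + gamma * max_a' Q(s', a'))."""
    # An empty next_q_values stands for a terminal next state, whose value is 0.
    next_value = max(next_q_values) if next_q_values else 0.0
    return q_sa - (reward + gamma * next_value)

# Made-up numbers: Q(s, a) = 3.0, r = 1.0, Q(s', .) = [2.0, 4.0], gamma = 0.5.
print(td_error(3.0, 1.0, [2.0, 4.0], 0.5))  # 0.0
```

An error of zero means the current estimate already agrees with the one-step target. Any other value is what training tries to shrink.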
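To minimize this error, we use the Huber loss. The Huber loss acts like the mean squared error when the error is small, and like the mean absolute error when the error is large. This makes it more robust to outliers when the estimates of Q are very noisy.
For reference, over n samples the mean squared error is $\frac{1}{n}\sum_i (\text{actual}_i - \text{forecast}_i)^2$ and the mean absolute error is $\frac{1}{n}\sum_i |\text{actual}_i - \text{forecast}_i|$.
We calculate the loss over a batch of transitions, B, sampled from the replay memory:
$$\mathcal{L} = \frac{1}{|B|} \sum_{(s, a, s', r) \in B} \mathcal{L}(\delta)$$
$$\text{where} \quad \mathcal{L}(\delta) = \begin{cases} \frac{1}{2}\delta^2 & \text{for } |\delta| \le 1, \\ |\delta| - \frac{1}{2} & \text{otherwise.} \end{cases}$$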
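The piecewise behaviour of the Huber loss is easy to see on scalars. This is a pure-Python sketch with a threshold of 1. The function names `huber` and `batch_huber` and the sample errors are made up for illustration, and the training code in the tutorial relies on PyTorch's built-in loss instead.

```python
def huber(delta):
    """Huber loss with threshold 1: quadratic near zero, linear in the tails."""
    if abs(delta) <= 1.0:
        return 0.5 * delta * delta
    return abs(delta) - 0.5

def batch_huber(deltas):
    """Mean Huber loss over a batch of TD errors."""
    return sum(huber(d) for d in deltas) / len(deltas)

# Made-up TD errors: one small, one large.
print(huber(0.5))               # 0.125
print(huber(3.0))               # 2.5
print(batch_huber([0.5, 3.0]))  # 1.3125
```

For comparison, a squared-error loss of the same form would give 4.5 for the error of 3.0, so a single outlier would dominate the batch far more strongly.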
Q-network
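Our model is a convolutional neural network that takes in the difference between the current and previous screen patches. It has two outputs, representing Q(s, left) and Q(s, right), where s is the input to the network.
In effect, the network is trying to predict the expected return of taking each action given the current input.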
class DQN(nn.Module):

    def __init__(self, h, w, outputs):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)

        # Number of Linear input connections depends on output of conv2d layers
        # and therefore the input image size, so compute it.
        def conv2d_size_out(size, kernel_size=5, stride=2):
            return (size - (kernel_size - 1) - 1) // stride + 1
        convw = conv2d_size_out(conv2d_size_out(conv2d_size_out(w)))
        convh = conv2d_size_out(conv2d_size_out(conv2d_size_out(h)))
        linear_input_size = convw * convh * 32
        self.head = nn.Linear(linear_input_size, outputs)

    # Called with either one element to determine next action, or a batch
    # during optimization. Returns tensor([[left0exp,right0exp]...]).
    def forward(self, x):
        x = x.to(device)
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))
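The size of the final linear layer follows from applying the same output-size formula three times. As a worked example, the sketch below evaluates it for an assumed 40x90 input, the approximate patch size mentioned in the tutorial's comments. The helper `size_after_three_convs` is introduced here only for illustration.

```python
def conv2d_size_out(size, kernel_size=5, stride=2):
    """Output length of an unpadded, undilated convolution along one dimension."""
    return (size - (kernel_size - 1) - 1) // stride + 1

def size_after_three_convs(size):
    """Apply the same 5x5, stride-2 layer three times, as the network does."""
    for _ in range(3):
        size = conv2d_size_out(size)
    return size

# Assumed input of height 40 and width 90 (the "close to 3x40x90" case).
convh = size_after_three_convs(40)
convw = size_after_three_convs(90)
linear_input_size = convw * convh * 32  # 32 channels leave the last conv layer
print(convh, convw, linear_input_size)  # 2 8 512
```

The height shrinks 40 → 18 → 7 → 2 and the width 90 → 43 → 20 → 8, so the flattened feature vector has 2 × 8 × 32 = 512 entries.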
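Input extraction
The code below contains utilities for extracting and processing rendered images from the environment. It uses the torchvision package, which makes it easy to compose image transforms. Once you run the code, it displays an example of the patch it extracted.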
# Image.BICUBIC is the same filter as the older Image.CUBIC alias,
# which newer Pillow releases no longer provide.
resize = T.Compose([T.ToPILImage(),
                    T.Resize(40, interpolation=Image.BICUBIC),
                    T.ToTensor()])


def get_cart_location(screen_width):
    world_width = env.x_threshold * 2
    scale = screen_width / world_width
    return int(env.state[0] * scale + screen_width / 2.0)  # MIDDLE OF CART


def get_screen():
    # Returned screen requested by gym is 400x600x3, but is sometimes larger
    # such as 800x1200x3. Transpose it into torch order (CHW).
    screen = env.render(mode='rgb_array').transpose((2, 0, 1))
    # Cart is in the lower half, so strip off the top and bottom of the screen
    _, screen_height, screen_width = screen.shape
    screen = screen[:, int(screen_height*0.4):int(screen_height * 0.8)]
    view_width = int(screen_width * 0.6)
    cart_location = get_cart_location(screen_width)
    if cart_location < view_width // 2:
        slice_range = slice(view_width)
    elif cart_location > (screen_width - view_width // 2):
        slice_range = slice(-view_width, None)
    else:
        slice_range = slice(cart_location - view_width // 2,
                            cart_location + view_width // 2)
    # Strip off the edges, so that we have a square image centered on a cart
    screen = screen[:, :, slice_range]
    # Convert to float, rescale, convert to torch tensor
    # (this doesn't require a copy)
    screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
    screen = torch.from_numpy(screen)
    # Resize, and add a batch dimension (BCHW)
    return resize(screen).unsqueeze(0)


env.reset()
plt.figure()
plt.imshow(get_screen().cpu().squeeze(0).permute(1, 2, 0).numpy(),
           interpolation='none')
plt.title('Example extracted screen')
plt.show()
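The cropping logic can be examined without rendering anything. The sketch below restates the two pieces of arithmetic as pure functions. The names `cart_pixel` and `crop_range` are hypothetical, and the 600-pixel screen width and the 2.4-unit threshold are taken from the values mentioned in the tutorial.

```python
def cart_pixel(x, screen_width, x_threshold=2.4):
    """Map a cart position in world units to a horizontal pixel coordinate."""
    world_width = x_threshold * 2
    scale = screen_width / world_width
    return int(x * scale + screen_width / 2.0)

def crop_range(cart_location, screen_width, view_fraction=0.6):
    """Pick the horizontal window around the cart, clamped at the screen edges."""
    view_width = int(screen_width * view_fraction)
    if cart_location < view_width // 2:
        return slice(view_width)
    if cart_location > screen_width - view_width // 2:
        return slice(-view_width, None)
    return slice(cart_location - view_width // 2,
                 cart_location + view_width // 2)

# Assumed 600-pixel-wide render, matching the comment in get_screen().
print(cart_pixel(0.0, 600))   # 300
print(crop_range(300, 600))   # slice(120, 480, None)
print(crop_range(50, 600))    # slice(None, 360, None)
print(crop_range(580, 600))   # slice(-360, None, None)
```

A cart in the middle of the track gets a window centered on it. Near either edge the window is pinned to that edge, so the cropped patch always has the same width.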
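Training
Hyperparameters and utilities
This cell instantiates our model and its optimizer, and defines some utilities:
1. select_action: selects an action according to an epsilon-greedy policy. Simply put, we sometimes use our model to choose the action, and sometimes just sample one uniformly. The probability of choosing a random action starts at EPS_START and decays exponentially towards EPS_END. EPS_DECAY controls the rate of the decay.
2. plot_durations: a helper for plotting the duration of episodes, along with an average over the last 100 episodes (the measure used in the official evaluations). The plot appears underneath the cell containing the main training loop and updates after every episode.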
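To get a feel for how quickly exploration fades, the decay schedule can be evaluated on its own. This sketch reuses the tutorial's constants and formula. The wrapper function `eps_threshold` and the chosen step counts are only for illustration.

```python
import math

EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200

def eps_threshold(steps_done):
    """Probability of picking a random action after steps_done steps."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1.0 * steps_done / EPS_DECAY)

for steps in (0, 200, 1000):
    print(steps, round(eps_threshold(steps), 4))
```

The gap to EPS_END shrinks by a factor of e every EPS_DECAY steps, so after a few hundred steps most actions come from the network rather than from random sampling.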
BATCH_SIZE = 128
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10

# Get screen size so that we can initialize layers correctly based on shape
# returned from AI gym. Typical dimensions at this point are close to 3x40x90
# which is the result of a clamped and down-scaled render buffer in get_screen()
init_screen = get_screen()
_, _, screen_height, screen_width = init_screen.shape

# Get number of actions from gym action space
n_actions = env.action_space.n

policy_net = DQN(screen_height, screen_width, n_actions).to(device)
target_net = DQN(screen_height, screen_width, n_actions).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.RMSprop(policy_net.parameters())
memory = ReplayMemory(10000)

steps_done = 0


def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            # t.max(1) will return largest column value of each row.
            # second column on max result is index of where max element was
            # found, so we pick action with the larger expected reward.
            return policy_net(state).max(1)[1].view(1, 1)
    else:
        return torch.tensor([[random.randrange(n_actions)]], device=device, dtype=torch.long)


episode_durations = []


def plot_durations():
    plt.figure(2)
    plt.clf()
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        display.clear_output(wait=True)
        display.display(plt.gcf())
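The averaged curve in plot_durations is a sliding-window mean that is left-padded with zeros so it lines up with the raw durations. A pure-Python sketch of the same idea, using a window of 3 instead of 100 and made-up durations (the helper `padded_moving_average` is hypothetical):

```python
def padded_moving_average(values, window):
    """Mean of every full window, left-padded with zeros to keep the length."""
    if len(values) < window:
        return []
    means = [sum(values[i:i + window]) / window
             for i in range(len(values) - window + 1)]
    return [0.0] * (window - 1) + means

# Made-up episode durations and a window of 3 instead of 100.
print(padded_moving_average([10, 20, 30, 40], 3))  # [0.0, 0.0, 20.0, 30.0]
```

The leading zeros are placeholders, not real averages, which is why the averaged curve in the plot sits at zero for the first 99 episodes.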
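Training loop
Finally, the code for training our model.
Here, you can find an optimize_model function that performs a single step of the optimization.
It first samples a batch, concatenates all the tensors into a single one, computes Q(s_t, a_t) and V(s_{t+1}) = max_a Q(s_{t+1}, a), and combines them into our loss.
By definition, we set V(s) = 0 if s is a terminal state.
We also use a target network to compute V(s_{t+1}) for added stability. The target network has its own weights, which are kept frozen most of the time and are periodically overwritten with the policy network's weights.
This is usually done after a set number of steps, but for simplicity we do it every few episodes instead.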
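The handling of terminal states is the part of the target computation that is easiest to get wrong. Here is a scalar sketch, assuming `None` marks a terminal next state. The helper `td_targets` and the numbers are made up, and in the tutorial the next-state values come from the target network.

```python
def td_targets(rewards, next_values, gamma):
    """r + gamma * V(s') per transition; None marks a terminal next state."""
    return [r + gamma * (0.0 if v is None else v)
            for r, v in zip(rewards, next_values)]

# Made-up batch of two transitions; the second one ended the episode.
print(td_targets([1.0, 1.0], [2.0, None], 0.5))  # [2.0, 1.0]
```

For the terminal transition the target collapses to the immediate reward, because there is no future return left to bootstrap from.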
def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see https://stackoverflow.com/a/19343/3343043 for
    # detailed explanation). This converts batch-array of Transitions
    # to Transition of batch-arrays.
    batch = Transition(*zip(*transitions))

    # Compute a mask of non-final states and concatenate the batch elements
    # (a final state would've been the one after which simulation ended)
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None,
                                            batch.next_state)), device=device, dtype=torch.bool)
    non_final_next_states = torch.cat([s for s in batch.next_state
                                       if s is not None])
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken. These are the actions which would've been taken
    # for each batch state according to policy_net
    state_action_values = policy_net(state_batch).gather(1, action_batch)

    # Compute V(s_{t+1}) for all next states.
    # Expected values of actions for non_final_next_states are computed based
    # on the "older" target_net; selecting their best reward with max(1)[0].
    # This is merged based on the mask, such that we'll have either the expected
    # state value or 0 in case the state was final.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    criterion = nn.SmoothL1Loss()
    loss = criterion(state_action_values, expected_state_action_values.unsqueeze(1))

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()
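The batch "transpose" at the top of optimize_model is compact enough to be puzzling at first sight. The sketch below runs it on two made-up transitions, with strings standing in for state tensors, so the regrouping is visible.

```python
from collections import namedtuple

Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward'))

# Made-up transitions; None marks a terminal next state.
transitions = [
    Transition('s0', 0, 's1', 1.0),
    Transition('s1', 1, None, 1.0),
]

# zip(*transitions) regroups the tuples field by field.
batch = Transition(*zip(*transitions))
print(batch.state)       # ('s0', 's1')
print(batch.next_state)  # ('s1', None)

# The same mask that optimize_model builds, here as a plain tuple of booleans.
non_final_mask = tuple(s is not None for s in batch.next_state)
print(non_final_mask)    # (True, False)
```

After the transpose, each field of `batch` holds one entry per sampled transition, which is the layout needed to concatenate states, actions and rewards into single tensors.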
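Below, you can find the main training loop. At the beginning, we reset the environment and initialize the state tensor. Then we sample an action, execute it, observe the next screen and the reward (always 1), and optimize our model once. When the episode ends (our model fails), we restart the loop.
Below, num_episodes is set small. You should download the notebook and run a lot more episodes, such as 300+, for meaningful duration improvements.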
num_episodes = 50
for i_episode in range(num_episodes):
    # Initialize the environment and state
    env.reset()
    last_screen = get_screen()
    current_screen = get_screen()
    state = current_screen - last_screen
    for t in count():
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())
        reward = torch.tensor([reward], device=device)

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
    # Update the target network, copying all weights and biases in DQN
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

print('Complete')
env.render()
env.close()
plt.ioff()
plt.show()
Here is the diagram that illustrates the overall resulting data flow.
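Actions are chosen either randomly or based on the policy, and the next step sample is obtained from the gym environment. We record the results in the replay memory and run an optimization step on every iteration. Optimization picks a random batch from the replay memory to train the policy network, and uses the older target_net to compute the expected Q values. target_net is updated occasionally to keep it current.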