So far, you’ve solved many of your own reinforcement learning problems, using solution methods that represent the action values in a small table. Earlier in the nanodegree, we referred to this table as a Q-table.
You have already worked with deep neural networks for classification, detection, and segmentation. However, those deep learning applications use labeled training data for supervised learning, and the inference engine produces a best-guess label, not an action, as its output. When an RL agent handles the entire end-to-end pipeline, it's called pixels-to-action, referring to the network's ability to take raw sensor data and choose the action it thinks will best maximize its reward. Over time, RL agents have an uncanny knack for developing intuitive, human-like behaviors, such as learning to walk or peeking around corners when they're unsure.
They naturally incorporate elements of exploration and knowledge gathering, which makes them good at imitating behaviors and performing path planning. Robots operating in unstructured environments tend to benefit greatly from RL agents.
As you'll learn in this lesson, the Deep Q-Learning algorithm represents the optimal action-value function $q_*$ as a neural network (instead of a table).
Unfortunately, reinforcement learning is notoriously unstable when neural networks are used to represent the action values. In this lesson, you'll learn all about the Deep Q-Learning algorithm, which addresses these instabilities by using two key features:
- Experience Replay
- Fixed Q-Targets
There are a few very useful additions and tweaks in DQN, though. The first addition is the use of a rolling history of past data via a replay pool. By using the replay pool, the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations. This has the advantage that each step of experience is potentially used in many weight updates.
The other big idea is the use of a target network to represent the Q-function, which is used to compute the loss for every action during training. Why not use a single network? The issue is that at each step of training, the Q-function's values change, and the value estimates can easily spiral out of control. These additions enable RL agents to converge more reliably during training.
By using the experience replay method, we can make reinforcement learning look much more like supervised learning, where we learn from a stored dataset of samples.
When the agent interacts with the environment, the sequence of experience tuples can be highly correlated. A naive Q-learning algorithm that learns from these experience tuples in sequential order runs the risk of getting swayed by the effects of this correlation. By instead keeping a replay buffer and using experience replay to sample from it at random, we can prevent the action values from oscillating or diverging catastrophically.
The replay buffer contains a collection of experience tuples $(S, A, R, S')$. The tuples are gradually added to the buffer as we interact with the environment.
The act of sampling a small batch of tuples from the replay buffer in order to learn is known as experience replay. In addition to breaking harmful correlations, experience replay allows us to learn from individual tuples multiple times, recall rare occurrences, and in general make better use of our experience.
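A minimal sketch of the idea (the full, batched implementation appears in `dqn_agent.py` later in this post): the buffer can simply be a bounded deque that we push tuples into and sample from uniformly at random.

```python
import random
from collections import deque

# Minimal replay buffer sketch: a bounded deque of (S, A, R, S', done) tuples.
buffer = deque(maxlen=100_000)

def store(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def sample(batch_size=64):
    # Uniform random sampling breaks the correlation between consecutive tuples.
    return random.sample(buffer, k=batch_size)
```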
Experience replay helps us address one type of correlation: that between consecutive experience tuples. There is another kind of correlation that Q-learning is susceptible to. Q-learning is a form of Temporal Difference (TD) learning.
Our goal is to reduce the difference between the TD target and the currently predicted Q-value (the TD prediction); this difference is the TD error. The TD target here is supposed to be a replacement for the true value function $q_\pi(s,a)$, which is unknown to us.
We originally used $q_\pi$ to define a squared error loss:
$$J(\omega) = \mathbb{E}_\pi\left[\left(q_\pi(S,A) - \hat{q}(S,A,\omega)\right)^2\right]$$
And differentiated that with respect to $\omega$ to get the gradient descent update rule (i.e., to obtain the direction of the gradient):
$$\nabla_\omega J(\omega) = -2\left(q_\pi(S,A) - \hat{q}(S,A,\omega)\right)\nabla_\omega \hat{q}(S,A,\omega)$$
We then update the weights by taking a step in the direction opposite to this gradient:
$$\Delta\omega = -\alpha\,\frac{1}{2}\nabla_\omega J(\omega) = \alpha\left(q_\pi(S,A) - \hat{q}(S,A,\omega)\right)\nabla_\omega \hat{q}(S,A,\omega)$$
That is, after substituting the TD target for the unknown $q_\pi(S,A)$:
$$\Delta\omega = \alpha \cdot \overbrace{\Big(\underbrace{R + \gamma \max_a \hat{q}(S',a,\omega^-)}_{\text{TD Target}} - \underbrace{\hat{q}(S,A,\omega)}_{\text{TD Predict}}\Big)}^{\text{TD error}}\;\nabla_\omega \hat{q}(S,A,\omega)$$
where $\omega^-$ are the weights of a separate target network that are not changed during the learning step, and $(S, A, R, S')$ is an experience tuple.
In practice, we use $\omega^-$ to generate targets while changing $\omega$ for a certain number of learning steps. Then, we update $\omega^-$ with the latest $\omega$. This decouples the target from the parameters, making the learning algorithm much more stable and less likely to diverge or fall into oscillations.
Here, $q_\pi(S,A)$ is not dependent on our function approximation or its parameters, but the TD target is dependent on those parameters, which means that simply replacing the true value function $q_\pi$ with a target is not mathematically rigorous. We can get away with it in practice because every update results in only a small change to the parameters, which is generally in the right direction.
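Putting the pieces together, here is a rough PyTorch sketch of one update with fixed Q-targets; `q_local` (weights $\omega$) and `q_target` (the frozen weights $\omega^-$) are illustrative names rather than the notebook's exact API:

```python
import torch
import torch.nn.functional as F

def td_update(q_local, q_target, optimizer, batch, gamma=0.99):
    """One gradient step of the Delta-omega rule above (illustrative sketch)."""
    states, actions, rewards, next_states, dones = batch

    # TD target: R + gamma * max_a q_hat(S', a, w-), computed with the frozen
    # target weights w- so it does not move while we adjust w.
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
        td_target = rewards + gamma * q_next * (1 - dones)

    # TD prediction: q_hat(S, A, w) for the actions that were actually taken.
    td_predict = q_local(states).gather(1, actions)

    # Minimizing the squared TD error implements the update rule above.
    loss = F.mse_loss(td_predict, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```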
There are two main processes that are interleaved in this algorithm. One is where we sample the environment by performing actions and store the observed experience tuples in a replay memory.
The other is where we select a small batch of tuples from this memory at random, and learn from that batch using a gradient descent update step.
These two processes are not directly dependent on each other. So, you could perform multiple sampling steps then one learning step, or even multiple learning steps with different random batches.
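As a hedged sketch of this flexibility (with `env_step` and `learn_from_batch` as hypothetical stand-ins for the two processes, not functions from the notebook):

```python
# Sketch: sampling and learning are decoupled, so their ratio is a free choice.
def env_step():
    pass  # act in the environment and store (S, A, R, S', done) in the buffer

def learn_from_batch():
    pass  # draw a random batch from the buffer and take one gradient step

UPDATE_EVERY = 4  # several sampling steps for every learning step
for t in range(1, 10_001):
    env_step()
    if t % UPDATE_EVERY == 0:
        learn_from_batch()
```

The agent implementation later in this post uses exactly this pattern via its `UPDATE_EVERY` constant.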
There are many other techniques and optimizations used in the DQN paper, such as reward clipping, error clipping, storing past actions as part of the state vector, dealing with terminal states, decaying epsilon over time, and so on.
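For example, reward clipping and error clipping could be sketched as follows; the $[-1, 1]$ reward range follows the DQN paper, and the Huber loss is a common way to achieve the effect of error clipping (neither trick is used by the LunarLander agent below):

```python
import torch.nn.functional as F

# Hedged sketch of two DQN-paper tricks (not used by the LunarLander agent below).

def clip_reward(reward):
    # Reward clipping: squash every reward into [-1, 1] before storing the tuple.
    return max(-1.0, min(1.0, reward))

def clipped_td_loss(td_predict, td_target):
    # Error clipping: the Huber (smooth L1) loss is quadratic for small TD errors
    # and linear for large ones, which bounds the gradient of the error term.
    return F.smooth_l1_loss(td_predict, td_target)
```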
(Notes on the learning step: a batch of sampled `states`, `actions`, `rewards`, `next_states`, and `dones` is drawn from the replay buffer; the `next_states` are used to obtain the action values `Q_targets_next` from the target network under its current parameters, and the loss is then computed from these quantities.)
# Deep Q-Network (DQN)
---
In this notebook, you will implement a DQN agent with OpenAI Gym's LunarLander-v2 environment.
### 1. Import the Necessary Packages
```python
import gym
!pip3 install box2d
import random
import torch
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

!python -m pip install pyvirtualdisplay
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()
```
### 2. Instantiate the Environment and Agent
Initialize the environment in the code cell below.
```python
env = gym.make('LunarLander-v2')
env.seed(0)
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)

from dqn_agent import Agent

agent = Agent(state_size=8, action_size=4, seed=0)
```
```python
# watch an untrained agent
state = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))
for j in range(200):
    action = agent.act(state)
    img.set_data(env.render(mode='rgb_array'))
    plt.axis('off')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    state, reward, done, _ = env.step(action)
    if done:
        break

env.close()
```
### 3. Train the Agent with DQN
Run the code cell below to train the agent from scratch. You are welcome to amend the supplied values of the parameters in the function, to try to see if you can get better performance!
Alternatively, you can skip to the next step below (**4. Watch a Smart Agent!**), to load the saved model weights from a pre-trained agent.
```python
def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.

    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    for i_episode in range(1, n_episodes+1):
        state = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_window.append(score)          # save most recent score
        scores.append(score)                 # save most recent score
        eps = max(eps_end, eps_decay * eps)  # decrease epsilon
        print('\rEpisode {} \tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end='')
        if i_episode % 100 == 0:
            print('\rEpisode {} \tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window) >= 200:
            print('\nEnvironment solved in {:d} episodes! \t Average Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.q_network_local.state_dict(), 'checkpoint.pth')
            break
    return scores

scores = dqn()
```
```python
# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()
```
### 4. Watch a Smart Agent!
In the next code cell, you will load the trained weights from file to watch a smart agent!
```python
# load the weights from file
agent.q_network_local.load_state_dict(torch.load('checkpoint.pth'))

for i in range(3):
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))
    for j in range(200):
        action = agent.act(state)
        img.set_data(env.render(mode='rgb_array'))
        plt.axis('off')
        display.display(plt.gcf())
        display.clear_output(wait=True)
        state, reward, done, _ = env.step(action)
        if done:
            break

env.close()
```
### 5. Explore
In this exercise, you have implemented a DQN agent and demonstrated how to use it to solve an OpenAI Gym environment. To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Amend the various hyperparameters and network architecture to see if you can get your agent to solve the environment faster. Once you build intuition for the hyperparameters that work well with this environment, try solving a different OpenAI Gym task with discrete actions!
- You may like to implement some improvements such as prioritized experience replay, Double DQN, or Dueling DQN!
- Write a blog post explaining the intuition behind the DQN algorithm and demonstrating how to use it to solve an RL environment of your choosing.
Contents of model.py:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, seed, fc1_units=64, fc2_units=64):
        """Initialize parameters and build model.

        Params
        ======
            state_size (int): Dimension of each state
            action_size (int): Dimension of each action
            seed (int): Random seed
            fc1_units (int): Number of nodes in first hidden layer
            fc2_units (int): Number of nodes in second hidden layer
        """
        super(QNetwork, self).__init__()
        self.seed = torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        """Build a network that maps state -> action values."""
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```
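As a quick, hypothetical sanity check (not part of the original notebook), the network maps a batch of state vectors to one value per action:

```python
import torch
from model import QNetwork  # assumes model.py above is on the Python path

# For LunarLander-v2 the state has 8 dimensions and there are 4 discrete actions,
# so the network maps a (batch, 8) tensor to a (batch, 4) tensor of action values.
net = QNetwork(state_size=8, action_size=4, seed=0)
dummy_state = torch.rand(1, 8)     # a batch containing a single random state
action_values = net(dummy_state)
print(action_values.shape)         # torch.Size([1, 4])
```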
Contents of dqn_agent.py:
```python
#! /usr/bin/env python3
# coding=utf-8
import numpy as np
import random
from collections import namedtuple, deque

from model import QNetwork

import torch
import torch.nn.functional as F
import torch.optim as optim

BUFFER_SIZE = int(1e5)   # replay buffer size
BATCH_SIZE = 64          # minibatch size
GAMMA = 0.99             # discount factor
TAU = 1e-3               # for soft update of target parameters
LEARNING_RATE = 5e-4     # learning rate
UPDATE_EVERY = 4         # how often to update the network

device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")
print(device)


class Agent():
    """Interacts with and learns from the environment."""

    def __init__(self, state_size, action_size, seed):
        """Initialize an Agent object.

        Params
        ======
            state_size (int): dimension of each state
            action_size (int): dimension of each action
            seed (int): random seed
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = random.seed(seed)

        # Q-Network
        self.q_network_local = QNetwork(state_size, action_size, seed).to(device)
        self.q_network_target = QNetwork(state_size, action_size, seed).to(device)
        self.optimizer = optim.Adam(self.q_network_local.parameters(), lr=LEARNING_RATE)

        # Replay memory
        self.memory = ReplayBuffer(action_size, BUFFER_SIZE, BATCH_SIZE, seed)
        # Initialize time step (for updating every UPDATE_EVERY steps)
        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.add(state, action, reward, next_state, done)

        # Learn every UPDATE_EVERY time steps.
        self.t_step = (self.t_step + 1) % UPDATE_EVERY
        if self.t_step == 0:
            # If enough samples are available in memory, get random subset and learn
            if len(self.memory) > BATCH_SIZE:
                experiences = self.memory.sample()
                self.learn(experiences, GAMMA)

    def act(self, state, eps=0):
        """Returns actions for given state as per current policy.

        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        self.q_network_local.eval()
        with torch.no_grad():
            action_values = self.q_network_local(state)
        self.q_network_local.train()

        # Epsilon-greedy action selection
        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))

    def soft_update(self, local_model, target_model, tau):
        """Soft update model parameters.
        θ_target = τ*θ_local + (1 - τ)*θ_target

        Params
        ======
            local_model (PyTorch model): weights will be copied from
            target_model (PyTorch model): weights will be copied to
            tau (float): interpolation parameter
        """
        for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
            target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)

    def learn(self, experiences, gamma):
        """Update value parameters using given batch of experience tuples.

        Params
        ======
            experiences (Tuple[torch.Variable]): tuple of (s, a, r, s', done) tuples
            gamma (float): discount factor
        """
        states, actions, rewards, next_states, dones = experiences

        # Get max predicted Q values (for next states) from target model
        Q_targets_next = self.q_network_target(next_states).detach().max(1)[0].unsqueeze(1)
        # Compute Q targets for current states
        Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))

        # Get expected Q values from local model
        Q_expected = self.q_network_local(states).gather(1, actions)

        # Compute loss
        loss = F.mse_loss(Q_expected, Q_targets)
        # Minimize the loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update the target network toward the local network
        self.soft_update(self.q_network_local, self.q_network_target, TAU)


class ReplayBuffer():
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.

        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple("Experience", field_names=['state', 'action', 'reward', 'next_state', 'done'])
        self.seed = random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        e = self.experience(state, action, reward, next_state, done)
        self.memory.append(e)

    def sample(self):
        """Randomly sample a batch of experiences from memory."""
        experiences = random.sample(self.memory, k=self.batch_size)

        states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
        actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
        rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
        next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
        dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)

        return (states, actions, rewards, next_states, dones)

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
```
**Double DQN**
Deep Q-Learning tends to overestimate action values. Double Q-Learning has been shown to work well in practice to help with this.
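A sketch of the change, reusing the local/target network naming from `dqn_agent.py` above (this helper is illustrative and not part of the agent): plain DQN both selects and evaluates the greedy next action with the target network, while Double DQN selects it with the local network and evaluates it with the target network.

```python
import torch

def double_dqn_targets(q_local, q_target, rewards, next_states, dones, gamma=0.99):
    """Illustrative Double DQN target computation."""
    with torch.no_grad():
        # Action selection with the local (online) network ...
        best_actions = q_local(next_states).argmax(dim=1, keepdim=True)
        # ... action evaluation with the target network.
        q_next = q_target(next_states).gather(1, best_actions)
    return rewards + (gamma * q_next * (1 - dones))
```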
**Prioritized Experience Replay**
Deep Q-Learning samples experience transitions uniformly from a replay memory. Prioritized experience replay is based on the idea that the agent can learn more effectively from some transitions than from others, and the more important transitions should be sampled with higher probability.
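The sampling rule can be sketched in a few lines of NumPy: each transition gets a priority $p_i$ (for example, its absolute TD error), transitions are sampled with probability proportional to $p_i^\alpha$, and importance-sampling weights correct the resulting bias (a standalone, illustrative helper that is not wired into the ReplayBuffer above):

```python
import numpy as np

def prioritized_sample(priorities, batch_size, alpha=0.6, beta=0.4):
    """Illustrative prioritized sampling of buffer indices."""
    priorities = np.asarray(priorities, dtype=np.float64)
    # Sampling probability P(i) proportional to p_i^alpha (alpha=0 is uniform).
    probs = priorities ** alpha
    probs /= probs.sum()
    indices = np.random.choice(len(priorities), size=batch_size, p=probs)
    # Importance-sampling weights correct the bias of non-uniform sampling;
    # normalizing by the maximum weight keeps updates bounded.
    weights = (len(priorities) * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```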
**Dueling DQN**
Currently, in order to determine which states are (or are not) valuable, we have to estimate the corresponding action values for each action. However, by replacing the traditional Deep Q-Network (DQN) architecture with a dueling architecture, we can assess the value of each state, without having to learn the effect of each action.
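A dueling head could replace the final layer of the `QNetwork` in `model.py` roughly as follows (an illustrative variant, not the model used in the notebook above): one stream estimates the state value $V(s)$ and the other the advantages $A(s,a)$, which are then recombined into action values.

```python
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNetwork(nn.Module):
    """Illustrative dueling variant of the QNetwork above."""

    def __init__(self, state_size, action_size, fc_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc_units)
        self.value = nn.Linear(fc_units, 1)                 # state value V(s)
        self.advantage = nn.Linear(fc_units, action_size)   # advantages A(s, a)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        v = self.value(x)          # shape: (batch, 1)
        a = self.advantage(x)      # shape: (batch, action_size)
        # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        return v + a - a.mean(dim=1, keepdim=True)
```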