Reinforcement learning is mainly applied in scenarios such as game AI, autonomous driving, and bionic robots. In an environment (a game, a road being driven on), executing a particular action changes the state and produces a result with a corresponding positive or negative reward; any scenario of this kind can be tackled with reinforcement learning.
The DQN algorithm was proposed by DeepMind (a Google subsidiary); DeepMind's AlphaGo (the Go-playing program) builds on the same line of deep reinforcement learning research.
Q-learning is an off-policy algorithm. It keeps a Q-table recording the Q value (weight) of every action in every state, and updates the Q value of the current state-action pair using the largest Q value among the next state's actions. By propagating the final reward backwards in this way, the agent learns.
$$Q(s_t,a_t)\leftarrow(1-\alpha)\,Q(s_t,a_t)+\alpha\left[r_{t+1}+\gamma \max_{a} Q(s_{t+1},a)\right]$$

$\alpha$: learning rate, $\gamma$: discount factor, $r$: reward
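As a concrete illustration, here is a minimal tabular Q-learning update written in plain NumPy (the function and variable names are only for this sketch and do not appear in the tutorial code below):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: bootstrap from the best action in the next state."""
    target = r + gamma * np.max(Q[s_next])            # max over next actions (off-policy)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

# Example: 5 states, 4 actions, all Q values initialised to zero
Q = np.zeros((5, 4))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
```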
SARSA is an on-policy algorithm. It also keeps a Q-table recording the Q value (weight) of every action in every state, but it updates the current state-action Q value with the Q value of the action that will actually be executed in the next state.
$$Q(s_t,a_t)\leftarrow(1-\alpha)\,Q(s_t,a_t)+\alpha\left[r_{t+1}+\gamma\, Q(s_{t+1},a_{t+1})\right]$$

$\alpha$: learning rate, $\gamma$: discount factor, $r$: reward
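For contrast, a minimal SARSA update on the same hypothetical Q-table as above; the only change from Q-learning is that the bootstrap value comes from the action the policy actually picks next, not from the greedy maximum:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: bootstrap from the action actually chosen in the next state."""
    target = r + gamma * Q[s_next, a_next]            # on-policy next action
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = np.zeros((5, 4))
Q = sarsa_update(Q, s=0, a=2, r=1.0, s_next=1, a_next=3)
```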
The DQN algorithm was introduced by DeepMind at NIPS in 2013. Its main ingredient is Experience Replay: the transitions collected while the system explores the environment are stored, and the deep neural network's parameters are updated on randomly sampled batches of them.
The motivation for Experience Replay is that (1) a deep neural network, as a supervised learning model, expects its training data to be independent and identically distributed, whereas (2) the successive samples produced by Q-learning are correlated with one another. The store-then-sample scheme breaks this correlation.
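A minimal sketch of such a buffer (this is not the tutorial's implementation, which uses a fixed-size NumPy matrix and is shown further below; the class and method names here are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and hand back random mini-batches."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)
```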
In early 2015 DeepMind published a follow-up article in Nature that introduced the concept of a Target Q network to break the data correlation further. The idea is to compute the target values with an older copy of the network, whose weights are denoted w^{-}. The optimization objective of Q-learning with a Target Q network is shown below.
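$$L(w)=\mathbb{E}\Big[\big(r_{t+1}+\gamma \max_{a} Q(s_{t+1},a;w^{-})-Q(s_t,a_t;w)\big)^2\Big]$$

where $w$ are the weights of the online (evaluation) network being trained and $w^{-}$ are the periodically updated weights of the frozen target network.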
DQN models the Q function with a neural network in place of the Q-table; it is derived from the Q-learning algorithm.
The changes to the basic algorithm are as follows:
The Q value update is based on the target network's output $Q(s_{t+1},a)$. The target value below is computed by this frozen network, so it differs somewhat from the current prediction; this cuts the correlation and allows the model to converge.
$$Q_{target}(s_t,a_t)\leftarrow r_{t+1}+\gamma \max_{a} Q(s_{t+1},a)$$

$\gamma$: discount factor, $r$: reward. The learning rate $\alpha$ no longer appears explicitly; it is handled by the gradient-descent optimizer used to train the model.
The squared error between $Q_{target}$ and $Q_{eval}$ is computed and used to update the weights of the eval model.
Every fixed number of steps, the eval model's parameters are copied into the target model. A minimal sketch of this training loop follows; the full implementation comes after it.
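The sketch below shows one such update in TensorFlow 2, assuming two small Keras models `eval_net` and `target_net` with identical architecture and an existing optimizer (the function name and arguments are illustrative, not part of the tutorial code):

```python
import numpy as np
import tensorflow as tf

def dqn_learn_step(eval_net, target_net, optimizer, s, a, r, s_next, gamma=0.9):
    """One DQN update; s, a, r, s_next are NumPy arrays, a holds integer action indices."""
    q_next = target_net(s_next).numpy()                        # frozen target network output
    with tf.GradientTape() as tape:
        q_eval = eval_net(s)                                   # current predictions
        q_target = q_eval.numpy()                              # start from the predictions...
        idx = np.arange(len(a))
        q_target[idx, a] = r + gamma * np.max(q_next, axis=1)  # ...overwrite only the taken actions
        loss = tf.keras.losses.MeanSquaredError()(q_target, q_eval)
    grads = tape.gradient(loss, eval_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, eval_net.trainable_variables))
    return loss

# Every replace_target_iter learn steps, sync the target network:
#     target_net.set_weights(eval_net.get_weights())
```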
The tutorial source code below is adapted from Morvan Zhou's demo and runs on TensorFlow 2.0.
Driver script, source: https://github.com/tfwcn/Reinforcement-learning-with-tensorflow/blob/master/contents/5_Deep_Q_Network/run_this.py
from maze_env import Maze
from RL_brain_tf2 import DeepQNetwork


def run_maze():
    step = 0
    for episode in range(300):
        # initial observation
        observation = env.reset()

        while True:
            # fresh env
            env.render()

            # RL choose action based on observation
            # Pick an action for the current state: 10% of the time at random,
            # otherwise the action with the largest predicted Q value
            action = RL.choose_action(observation)

            # RL take action and get next observation and reward
            # Execute the action; get the next state, the reward and the done flag
            observation_, reward, done = env.step(action)

            # Store (state, action, reward, next state) in the replay memory
            RL.store_transition(observation, action, reward, observation_)

            # After 50 steps, train on a random batch from memory every 5 steps
            if (step > 50) and (step % 5 == 0):
                RL.learn()

            # swap observation
            # Move on to the next state
            observation = observation_

            # break while loop when end of this episode
            # Episode finished: start the next round of learning
            if done:
                break
            step += 1

    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    # maze game
    env = Maze()
    RL = DeepQNetwork(env.n_actions, env.n_features,
                      learning_rate=0.01,
                      reward_decay=0.9,
                      e_greedy=0.9,
                      replace_target_iter=50,
                      memory_size=2000,
                      # output_graph=True
                      )
    env.after(100, run_maze)
    env.mainloop()
    RL.plot_cost()
The DQN code, also adapted from Morvan Zhou's original and implemented with TensorFlow 2.0, source: https://github.com/tfwcn/Reinforcement-learning-with-tensorflow/blob/master/contents/5_Deep_Q_Network/RL_brain_tf2.py
"""
This part of code is the DQN brain, which is a brain of the agent.
All decisions are made in here.
Using Tensorflow to build the neural network.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
Using:
Tensorflow: 2.0
gym: 0.7.3
"""
import numpy as np
import pandas as pd
import tensorflow as tf
np.random.seed(1)
tf.random.set_seed(1)
# Deep Q Network off-policy
class DeepQNetwork:
def __init__(
self,
n_actions,
n_features,
learning_rate=0.01,
reward_decay=0.9,
e_greedy=0.9,
replace_target_iter=300,
memory_size=500,
batch_size=32,
e_greedy_increment=None,
output_graph=False,
):
'''
n_actions:4,动作数量(上下左右)
n_features:2,状态数量(x,y)
'''
print('n_actions:', n_actions)
print('n_features:', n_features)
print('learning_rate:', learning_rate)
print('reward_decay:', reward_decay)
print('e_greedy:', e_greedy)
self.n_actions = n_actions
self.n_features = n_features
self.lr = learning_rate
self.gamma = reward_decay
self.epsilon_max = e_greedy
self.replace_target_iter = replace_target_iter
self.memory_size = memory_size
self.batch_size = batch_size
self.epsilon_increment = e_greedy_increment
self.epsilon = 0 if e_greedy_increment is not None else self.epsilon_max
# total learning step
self.learn_step_counter = 0
# initialize zero memory [s, a, r, s_]
self.memory = np.zeros((self.memory_size, n_features * 2 + 2))
# consist of [target_net, evaluate_net]
self._build_net()
self.cost_his = []
def _build_net(self):
'''建立预测模型和target模型'''
# ------------------ build evaluate_net ------------------
s = tf.keras.Input([None, self.n_features], name='s')
q_target = tf.keras.Input([None, self.n_actions], name='Q_target')
# 预测模型
x = tf.keras.layers.Dense(20, activation=tf.keras.activations.relu, name='l1')(s)
x = tf.keras.layers.Dense(self.n_actions, name='l2')(x)
self.eval_net = tf.keras.Model(inputs=s, outputs=x)
# 损失计算函数
self.loss = tf.keras.losses.MeanSquaredError()
# 梯度下降方法
self._train_op = tf.keras.optimizers.RMSprop(learning_rate=self.lr)
# ------------------ build target_net ------------------
s_ = tf.keras.Input([None, self.n_features], name='s_')
# target模型
x = tf.keras.layers.Dense(20, activation=tf.keras.activations.relu, name='l1')(s_)
x = tf.keras.layers.Dense(self.n_actions, name='l2')(x)
self.target_net = tf.keras.Model(inputs=s_, outputs=x)
def replace_target(self):
'''预测模型权重更新到target模型权重'''
self.target_net.get_layer(name='l1').set_weights(self.eval_net.get_layer(name='l1').get_weights())
self.target_net.get_layer(name='l2').set_weights(self.eval_net.get_layer(name='l2').get_weights())
def store_transition(self, s, a, r, s_):
'''存入记忆库'''
if not hasattr(self, 'memory_counter'):
self.memory_counter = 0
transition = np.hstack((s, [a, r], s_))
# replace the old memory with new memory
index = self.memory_counter % self.memory_size
self.memory[index, :] = transition
self.memory_counter += 1
def choose_action(self, observation):
'''根据state选择action'''
# to have batch dimension when feed into tf placeholder
observation = observation[np.newaxis, :]
if np.random.uniform() < self.epsilon:
# forward feed the observation and get q value for every actions
actions_value = self.eval_net(observation).numpy()
action = np.argmax(actions_value)
else:
action = np.random.randint(0, self.n_actions)
return action
def learn(self):
'''从记忆库学习'''
# check to replace target parameters
if self.learn_step_counter % self.replace_target_iter == 0:
self.replace_target()
print('\ntarget_params_replaced\n')
# sample batch memory from all memory
if self.memory_counter > self.memory_size:
sample_index = np.random.choice(self.memory_size, size=self.batch_size)
else:
sample_index = np.random.choice(self.memory_counter, size=self.batch_size)
batch_memory = self.memory[sample_index, :]
with tf.GradientTape() as tape:
# 生成标签,是以前的模型(50次前)
q_next = self.target_net(batch_memory[:, -self.n_features:]).numpy()
# 生成预测结果,是当前的模型
q_eval = self.eval_net(batch_memory[:, :self.n_features])
# change q_target w.r.t q_eval's action
q_target = q_eval.numpy()
batch_index = np.arange(self.batch_size, dtype=np.int32)
eval_act_index = batch_memory[:, self.n_features].astype(int)
reward = batch_memory[:, self.n_features + 1]
# 根据激励值更新当前的预测结果,对应的state-action=激励值+衰减系数*下一个最大action概率。这里只更新对应action,其余保留预测action,计算loss时只有对应action有loss。
q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)
"""
For example in this batch I have 2 samples and 3 actions:
q_eval =
[[1, 2, 3],
[4, 5, 6]]
q_target = q_eval =
[[1, 2, 3],
[4, 5, 6]]
Then change q_target with the real q_target value w.r.t the q_eval's action.
For example in:
sample 0, I took action 0, and the max q_target value is -1;
sample 1, I took action 2, and the max q_target value is -2:
q_target =
[[-1, 2, 3],
[4, 5, -2]]
So the (q_target - q_eval) becomes:
[[(-1)-(1), 0, 0],
[0, 0, (-2)-(6)]]
We then backpropagate this error w.r.t the corresponding action to network,
leave other action as error=0 cause we didn't choose it.
"""
# train eval network
self.cost = self.loss(y_true=q_target,y_pred=q_eval)
# print('loss:', self.cost)
gradients = tape.gradient(
self.cost, self.eval_net.trainable_variables)
self._train_op.apply_gradients(
zip(gradients, self.eval_net.trainable_variables))
self.cost_his.append(self.cost)
# increasing epsilon
# 随机概率随训练次数减少,默认不变
self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
self.learn_step_counter += 1
def plot_cost(self):
'''打印损失变化记录'''
import matplotlib.pyplot as plt
plt.plot(np.arange(len(self.cost_his)), self.cost_his)
plt.ylabel('Cost')
plt.xlabel('training steps')
plt.show()