References:
莫烦PYTHON (Morvan Python tutorials)
DeepMind
《强化学习精要》 (Essentials of Reinforcement Learning)
Deep Reinforcement Learning basics (the DQN side)
Playing Flappy Bird with Deep Q-Learning (DQN) in TensorFlow
Human-level control through deep reinforcement learning (Mnih et al., Nature 2015)
Q-learning needs a Q table to hold the Q values. When the numbers of states and actions are huge, the table becomes huge as well, and storing it and looking values up cost an enormous amount of time and space. The idea behind DQN is to replace the table with a neural network. Think of the network as a function, q_values = f(state, action); or, going one step further, give it only the state as input, because Q-learning is greedy when picking actions: it simply takes the action with the largest Q value. So we feed the state into the network, the network outputs a Q value for every action, and we pick the action whose Q value in that output tensor is largest. That completes action selection.
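As a quick illustration of this "state in, one Q value per action out, take the argmax" idea, here is a minimal NumPy sketch; `q_network` is a made-up stand-in for a trained model, not part of the code later in this post:

```python
import numpy as np

def q_network(state):
    # Hypothetical stand-in for a trained network: maps a state vector
    # to one Q value per action (here faked with a fixed weight matrix).
    fake_weights = np.array([[0.2, 1.5, -0.3],
                             [0.7, 0.1,  0.9]])  # shape (n_features, n_actions)
    return state @ fake_weights

state = np.array([0.5, -1.0])      # a 2-feature state
q_values = q_network(state)        # one Q value per action
action = int(np.argmax(q_values))  # greedy action selection
```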
So how do we update the Q values, i.e. how do we train this network? Two problems come up: supervised learning needs a large amount of data, and reinforcement learning produces its data as a time-ordered, correlated sequence, whereas neural-network training assumes independently and identically distributed (i.i.d.) samples.
The fix is a replay buffer: a store for the transitions (s, a, r, s_) generated as the agent interacts with the environment. Its capacity is large, and once it is full new data overwrites the oldest entries. Every training step draws a random batch from the buffer. This breaks the correlation between consecutive samples, pushing the data closer to i.i.d., and it also lets each transition be reused, improving data efficiency.
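A minimal replay-buffer sketch in plain Python (the class name and the deque-based storage are just for illustration; the implementation later in this post uses a fixed-size NumPy array instead):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen: once full, appending drops the oldest transition
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform random sampling breaks the temporal correlation of the data
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=500)
for step in range(1000):
    buf.store(step, 0, 1.0, step + 1)  # dummy transitions
batch = buf.sample(32)
```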
The other problem to solve is instability. In Q-learning, each update is driven by the current Q value and the Q target; in a naive network version both would come from the same set of weights, so every parameter update also shifts the target being chased. DQN therefore keeps a second network of identical structure, the target network, whose parameters are only copied over from the online (eval) network every fixed number of steps; targets are computed with this frozen network, which keeps them stable between copies.
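Written out (this is the standard DQN objective, consistent with the pseudocode below), the loss minimized at each training step is

$$
L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\Big[\big(r + \gamma \max_{a'} \hat{Q}(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big]
$$

where θ are the parameters of the online (eval) network and θ⁻ those of the frozen target network.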
Deep Q-learning with experience replay:
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do:
- Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
- For t = 1, T do:
- With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(φ_t, a; θ)
- Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
- Set s_{t+1} = x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
- Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
- Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
- Set y_j = r_j if the episode terminates at step j+1, otherwise y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)
- Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
- Every C steps reset Q̂ = Q, i.e. θ⁻ = θ
In code this means: a container D of capacity N that stores a large number of transitions (s, a, r, s'), plus a batch_size controlling how many samples are drawn from it at a time; and two networks with the same structure, Q and Q', whose parameters start out identical.
# coding:utf-8
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
class DeepQNet:
def __init__(
self,
n_features,
n_actions,
learning_rate=0.01,
reward_decay=0.9,
e_greedy_max=0.9,
e_greedy_increment=None,
replace_target_iter=300,
memory_pool_size=500,
batch_size=32,
output_graph=False,
):
        # reinforcement-learning hyperparameters
        self.gamma = reward_decay  # discount factor
        self.e_greedy_max = e_greedy_max  # e.g. 0.9: 90% exploitation, 10% exploration; e stands for epsilon
        self.e_greedy_increment = e_greedy_increment  # per-step increment of epsilon
        self.e_greedy = 0 if e_greedy_increment is not None else e_greedy_max  # with an increment, anneal from pure exploration towards e_greedy_max; otherwise use the fixed value
        # neural-network hyperparameters
        self.lr = learning_rate  # learning rate alpha
        self.n_features = n_features  # number of features in state / state_
        self.n_actions = n_actions  # number of actions
        self.replace_target_iter = replace_target_iter  # copy the eval-net parameters into the target net every replace_target_iter learning steps
        self.learn_step_counter = 0  # counts learning steps, used to decide when to refresh the target net
        # replay memory
        self.memory_pool_size = memory_pool_size  # capacity of the replay memory (usually large, e.g. one million)
        self.memory_pool_counter = 0
        self.memory_pool = np.zeros((memory_pool_size, n_features * 2 + 2))  # zero-initialized replay memory; each row stores (s, a, r, s_)
        self.batch_size = batch_size  # number of transitions sampled per learning step
        self._build_net()  # build the q_target net and the q_eval net
        target_params = tf.get_collection('target_net_params')  # fetch the target-net parameters from their collection
        eval_params = tf.get_collection('eval_net_params')  # fetch the eval-net parameters
        self.replace_q_target_op_params = [tf.assign(t, e) for t, e in zip(target_params, eval_params)]
        self.cost_history = []  # history of the training cost, used to monitor learning
self.sess = tf.Session()
if output_graph:
            # the log directory needs a full path from the project root
            # FIXME: write the graph only once instead of on every run (skip if it already exists)
            # tensorboard --logdir=name1:/Users/tu/PycharmProjects/myFirstPythonDir/DQN/logs
tf.summary.FileWriter('logs/', self.sess.graph)
self.sess.run(tf.global_variables_initializer())
    # build the eval net and the target net
def _build_net(self):
self.state = tf.placeholder(tf.float32, [None, self.n_features], name='state')
self.state_ = tf.placeholder(tf.float32, [None, self.n_features], name='state_')
        self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q_target')  # the TD targets computed with the target net, fed in during learn()
w_initializer, b_initializer = tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)
with tf.variable_scope('eval_net'):
my_collections = ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
with tf.variable_scope('l1'):
w1 = tf.get_variable('w1', [self.n_features, 20], initializer=w_initializer, collections=my_collections)
b1 = tf.get_variable('b1', [1, 20], initializer=b_initializer, collections=my_collections)
l1 = tf.nn.relu(tf.matmul(self.state, w1) + b1)
            with tf.variable_scope('l2'):
                w2 = tf.get_variable('w2', [20, self.n_actions], initializer=w_initializer, collections=my_collections)
                b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=my_collections)
                self.q_eval = tf.matmul(l1, w2) + b2  # Q values of every action for the input state
with tf.variable_scope('loss'):
self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))
with tf.variable_scope('train'):
self.train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)
        with tf.variable_scope('target_net'):  # FIXME: the two nets start with different random parameters; they are meant to be identical (the first run of the replace op synchronizes them)
my_collections = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
with tf.variable_scope('l1'):
w1 = tf.get_variable('w1', [self.n_features, 20], initializer=w_initializer, collections=my_collections)
b1 = tf.get_variable('b1', [1, 20], initializer=b_initializer, collections=my_collections)
l1 = tf.nn.relu(tf.matmul(self.state_, w1) + b1)
with tf.variable_scope('l2'):
w2 = tf.get_variable('w2', [20, self.n_actions], initializer=w_initializer, collections=my_collections)
b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=my_collections)
                self.q_next = tf.matmul(l1, w2) + b2  # Q values of every action for state_, from the target net
    # store a transition in the replay memory
def store_memory(self, state, action, reward, state_):
transition = np.hstack((state, action, reward, state_))
        index = self.memory_pool_counter % self.memory_pool_size  # overwrite the oldest transition once the memory is full
self.memory_pool[index, :] = transition
self.memory_pool_counter += 1
    # choose an action with the epsilon-greedy policy
    def choose_action(self, observation):
        observation = observation[np.newaxis, :]  # add a batch dimension so the shape matches the placeholder
        if np.random.uniform() < self.e_greedy:
            # exploit: pick the action with the largest estimated Q value
            actions_value = self.sess.run(self.q_eval, feed_dict={self.state: observation})
            action = np.argmax(actions_value)
        else:
            # explore: pick a random action
            action = np.random.randint(0, self.n_actions)
        return action
    # update the eval net (and periodically the target net)
def learn(self):
        # sample a random batch of transitions from the replay memory
if self.memory_pool_counter >= self.memory_pool_size:
sample_index = np.random.choice(self.memory_pool_size, self.batch_size)
else:
sample_index = np.random.choice(self.memory_pool_counter, self.batch_size)
batch_memory = self.memory_pool[sample_index, :]
        # run both nets: q_eval for the current states, q_next (target net) for the next states
q_eval, q_next = self.sess.run([self.q_eval, self.q_next],
feed_dict={self.state: batch_memory[:, :self.n_features],
self.state_: batch_memory[:, -self.n_features:]})
        # build the TD targets: start from a copy of q_eval so the error is zero for
        # actions that were not taken, then overwrite the taken action's entry with
        # reward + gamma * max_a' Q_target(s_, a')
        q_target = q_eval.copy()
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        eval_action_index = batch_memory[:, self.n_features].astype(int)
        reward = batch_memory[:, self.n_features + 1]
        q_target[batch_index, eval_action_index] = reward + self.gamma * np.max(q_next, axis=1)  # axis=1: max over actions for each transition
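        # Example with made-up numbers (hypothetical batch of 2 transitions, 3 actions):
        #   q_eval  = [[0.1, 0.5, 0.2], [0.3, 0.0, 0.4]]
        #   q_next  = [[0.2, 0.9, 0.1], [0.6, 0.3, 0.5]]
        #   actions = [1, 2], rewards = [1.0, 0.0], gamma = 0.9
        #   q_target = [[0.1, 1.81, 0.2], [0.3, 0.0, 0.54]]
        # only the entries of the taken actions change, so the squared error is zero elsewhere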
        # the loss between targets and predictions; one gradient step updates the eval-net parameters
_, cost = self.sess.run([self.train_op, self.loss], feed_dict={self.state: batch_memory[:, :self.n_features],
self.q_target: q_target})
        # every replace_target_iter learning steps, copy the eval-net parameters into the target net
if (self.learn_step_counter % self.replace_target_iter) == 0:
self.sess.run(self.replace_q_target_op_params)
            print('q target net has been updated')
        # record the loss
self.cost_history.append(cost)
        # anneal epsilon towards e_greedy_max
self.e_greedy = self.e_greedy + self.e_greedy_increment \
if self.e_greedy < self.e_greedy_max else self.e_greedy_max
self.learn_step_counter += 1
    # plot the training cost curve
def plot_cost(self):
plt.plot(np.arange(len(self.cost_history)), self.cost_history)
plt.xlabel('my training steps')
plt.ylabel('my cost')
plt.show()
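The post does not show the training loop itself, so here is a rough sketch of how the class might be driven. Everything environment-related is made up: `DummyEnv`, its reset()/step(action) interface returning (state_, reward, done), and the hyperparameter values are illustrative only, not part of the original code.

```python
import numpy as np

class DummyEnv:
    """Hypothetical stand-in environment: 4-dimensional observations,
    2 actions, episodes of 50 steps, arbitrary rewards."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return np.random.uniform(-1, 1, size=4)
    def step(self, action):
        self.t += 1
        state_ = np.random.uniform(-1, 1, size=4)
        reward = 1.0 if action == 1 else 0.0  # made-up reward, for illustration only
        done = self.t >= 50
        return state_, reward, done

env = DummyEnv()
dqn = DeepQNet(n_features=4, n_actions=2,
               e_greedy_increment=0.001, memory_pool_size=2000)

total_steps = 0
for episode in range(20):
    state = env.reset()
    while True:
        action = dqn.choose_action(state)
        state_, reward, done = env.step(action)
        dqn.store_memory(state, action, reward, state_)
        if total_steps > 200:  # start learning once the memory has some data
            dqn.learn()
        state = state_
        total_steps += 1
        if done:
            break
dqn.plot_cost()
```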
The cost curve:
You can see that it does indeed fluctuate.
Wins and losses:
The game being played is fairly simple:
Flappy Bird