Implementing a Reinforcement Learning DQN with TensorFlow Agents

In an earlier post on my CSDN blog, 强化学习笔记(4)-深度Q学习, I built a DQN model with TensorFlow Keras to solve the mountain-car problem. In that code I had to implement experience replay, sampling, and the other machinery myself, which was fairly tedious.

TensorFlow ships an agents library (TF-Agents) that implements many reinforcement learning algorithms and utilities, so I tried using it to build a DQN model for the mountain-car problem. The official TF-Agents DQN tutorial solves the CartPole problem; if you copy that code unchanged for mountain car, the model fails to converge. After some digging, I found the cause: by default the environment loaded by the agents library caps each episode at 200 steps, so the car never reaches the position that ends the episode and the total return the model observes never changes.
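As a quick check (a minimal sketch, assuming the standalone gym package is installed), the 200-step cap can be seen directly in the Gym registration that suite_gym inherits:

import gym

# MountainCar-v0 is registered with a 200-step time limit,
# which is what suite_gym.load picks up by default.
print(gym.spec('MountainCar-v0').max_episode_steps)  # expected: 200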

The following code loads the training and evaluation environments. Note that max_episode_steps must be set to 0, i.e. the number of steps per episode is not limited:

from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.policies import random_tf_policy
from tf_agents.utils import common
from tf_agents.drivers import dynamic_step_driver
from tf_agents.policies import EpsilonGreedyPolicy
import tensorflow as tf
from tqdm import trange
from tf_agents.policies.q_policy import QPolicy
import seaborn as sns
from matplotlib.ticker import MultipleLocator
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook

env_name = 'MountainCar-v0'
env = suite_gym.load(env_name)
train_py_env = suite_gym.load(env_name, max_episode_steps=0)
eval_py_env = suite_gym.load(env_name, max_episode_steps=0)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
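Optionally, you can print the environment specs to confirm what the networks will see; the comments reflect what MountainCar-v0 is expected to report (a 2-dimensional observation of position and velocity, and 3 discrete actions):

# Optional sanity check of the environment specs.
print(train_env.time_step_spec().observation)  # float spec of shape (2,): car position and velocity
print(train_env.action_spec())                 # 3 discrete actions: push left, no push, push right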

Next we create a DQN agent. The agent contains a Q network and a target network with identical architectures. The Q network learns the Q value of each state-action pair, while the target network, whose weights track the Q network's, supplies the maximum Q value of the next state when the TD target is computed. The target_update_tau and target_update_period parameters control how the target network's weights are updated; with the settings below, every training step applies W_target = (1 - 0.005) * W_target + 0.005 * W_q. The gamma parameter determines how much of the next state's Q value is counted into the TD target, and epsilon_greedy is the probability of choosing a random action instead of the one with the highest Q value.
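To make the soft-update rule concrete, here is the same formula written out with plain NumPy (purely illustrative; TF-Agents applies it internally to every weight tensor):

import numpy as np

tau = 0.005
w_q = np.array([1.0, 2.0, 3.0])       # example Q-network weights
w_target = np.array([0.0, 0.0, 0.0])  # example target-network weights

# One soft update step: W_target = (1 - tau) * W_target + tau * W_q
w_target = (1 - tau) * w_target + tau * w_q
print(w_target)  # [0.005 0.01  0.015]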

q_net = q_network.QNetwork(
    train_env.time_step_spec().observation,
    train_env.action_spec(),
    fc_layer_params=(64,))

target_q_net = q_network.QNetwork(
    train_env.time_step_spec().observation,
    train_env.action_spec(),
    fc_layer_params=(64,))

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    target_q_network=target_q_net,
    target_update_tau=0.005,
    target_update_period=1,
    gamma=0.99,
    epsilon_greedy=0.1,
    td_errors_loss_fn=common.element_wise_squared_loss,
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=0.001))

# Initialize the agent, which also copies the Q-network weights into the target network.
agent.initialize()

Set up a replay buffer to store and replay past experience:

replay_buffer_capacity = 10000

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_capacity)

# Add an observer that adds to the replay buffer:
replay_observer = [replay_buffer.add_batch]

First, use a random policy to collect some initial experience into the buffer:

random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())
initial_driver = dynamic_step_driver.DynamicStepDriver(
      train_env,
      random_policy,
      observers=replay_observer,
      num_steps=1)

# Run the random policy for one episode (capped at 1000 steps) to fill the buffer.
time_step = train_env.reset()
step = 0
while not time_step.is_last():
    step += 1
    if step > 1000:
        break
    time_step, _ = initial_driver.run(time_step)

After collecting data, we convert the replay_buffer into a dataset so the data can be read conveniently. Here num_steps=2 means each sample consists of two adjacent transitions, because computing the TD target requires the Q value of the next step.

dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=128,
    num_steps=2).prefetch(3)

iterator = iter(dataset)
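As a quick sanity check (optional; the shapes in the comments assume the settings above), one batch can be drawn and inspected:

# Draw one sample batch and inspect the trajectory shapes.
experience, info = next(iterator)
print(experience.observation.shape)  # expected: (128, 2, 2) -> batch, 2 adjacent steps, 2-dim observation
print(experience.action.shape)       # expected: (128, 2)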

Define an evaluation function to measure training performance:

def compute_avg_return(environment, policy, num_episodes=10):
    total_return = 0.0
    for _ in range(num_episodes):
        time_step = environment.reset()
        episode_return = 0.0
        step = 0
        while not time_step.is_last():
            step += 1
            if step>1000:
                break
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
        total_return += episode_return
    avg_return = total_return / num_episodes
    return avg_return.numpy()[0]
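As an optional usage example, the function can be run once with the random policy defined earlier to get a pre-training baseline. MountainCar-v0 gives a reward of -1 per step, so with the 1000-step cap above the untrained return is typically close to -1000:

# Optional baseline: average return of the untrained random policy.
baseline_return = compute_avg_return(eval_env, random_policy, num_episodes=5)
print('Average return before training:', baseline_return)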

Define a class to plot the per-episode training reward and the evaluation return during training:

class Chart:
    def __init__(self):
        self.fig, self.ax = plt.subplots(figsize = (8, 6))
        x_major_locator = MultipleLocator(1)
        self.ax.xaxis.set_major_locator(x_major_locator)
        self.ax.set_xlim(0.5, 50.5)

    def plot(self, data):
        self.ax.clear()
        sns.lineplot(data=data, x=data['episode'], y=data['reward'], hue=data['type'], ax=self.ax)
        self.fig.canvas.draw()

Finally, the training and evaluation code. Here the epsilon_greedy value is gradually reduced as training progresses: in the early stages the probability of choosing a random action is relatively high, and as training goes on the Q network reflects the true Q values better and better, so the probability of random actions should shrink accordingly. Also note that in the first episodes the agent has to find the right actions by chance before an episode can end, so an episode may run for a great many steps without terminating; I once had a first episode run for more than 15,000 steps without finishing, in which case you can simply restart training.

train_episodes = 50
num_eval_episodes = 5
epsilon = 0.1
chart = Chart()

for episode in range(1, train_episodes+1):
    episode_reward = 0
    if epsilon>0.01:
        train_policy = EpsilonGreedyPolicy(agent.policy, epsilon=epsilon)
        train_driver = dynamic_step_driver.DynamicStepDriver(
              train_env,
              train_policy,
              observers=replay_observer,
              num_steps=1)
        epsilon -= 0.01
    time_step = train_env.reset()
    total_loss = 0
    step = 0
    while not time_step.is_last():
        step += 1
        time_step, _ = train_driver.run(time_step)
        experience, unused_info = next(iterator)
        train_loss = agent.train(experience).loss
        total_loss += train_loss
        episode_reward += time_step.reward.numpy()[0]
        if step%100==0:
            print("Epsiode_{}, step_{}, loss:{}".format(episode, step, total_loss/step))
    if episode==1:
        rewards_df = pd.DataFrame([[episode, episode_reward, 'train']], columns=['episode','reward','type'])
    else:
        rewards_df = pd.concat(
            [rewards_df, pd.DataFrame([[episode, episode_reward, 'train']], columns=['episode','reward','type'])],
            ignore_index=True)

    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    rewards_df = pd.concat(
        [rewards_df, pd.DataFrame([[episode, avg_return, 'eval']], columns=['episode','reward','type'])],
        ignore_index=True)
    chart.plot(rewards_df)
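Optionally, the trained policy can be exported with TF-Agents' PolicySaver so it can be reloaded later without rebuilding the agent (a minimal sketch; the directory name 'trained_policy' is arbitrary):

from tf_agents.policies import policy_saver

# Export the trained greedy policy as a SavedModel.
saver = policy_saver.PolicySaver(agent.policy)
saver.save('trained_policy')
# It can be reloaded later with tf.saved_model.load('trained_policy').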

After training, the following code runs the trained policy in the evaluation environment and generates a video so you can see the result:

import imageio
import base64
import IPython

def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  video = open(filename,'rb').read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)

def create_policy_eval_video(policy, filename, num_episodes=1, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)

create_policy_eval_video(agent.policy, "trained-agent")

The resulting video is embedded below:

[Video: trained-agent.mp4]
