Deep Reinforcement Learning (4): Extensions and Improvements of DQN

1. Preliminary Tools

1.1 Gym

Gym is a general-purpose test platform for reinforcement learning algorithms developed by OpenAI. It integrates a large number of simulated environments, so developers can call ready-made environments directly, without having to deal with their internal logic, and can concentrate on the algorithm itself.

import gym
env = gym.make("CartPole-v1")
observation = env.reset()
for _ in range(1000):
  env.render()
  action = env.action_space.sample() # your agent here (this takes random actions)
  observation, reward, done, info = env.step(action)
  if done:
    observation = env.reset()
env.close()

The code above is a small but complete reinforcement learning example using Gym: the task is to keep an inverted pendulum (CartPole) balanced, and the figure below shows the interaction loop (interested readers can try plugging in their own policy; a sketch is given after the figure). This little example illustrates the basic usage; on top of it, more complex and interesting development is possible, and other OpenAI open-source projects such as universe, roboschool and baselines can be used as well.
Reinforcement learning interaction loop based on Gym
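For readers who want to try their own policy, here is a minimal sketch that replaces the random action with a simple hand-written rule: push the cart in the direction the pole is falling. It assumes the classic Gym API used above, where reset() returns only the observation and step() returns four values.

import gym

env = gym.make("CartPole-v1")
observation = env.reset()
total_reward = 0.0
for _ in range(1000):
    # observation = [cart position, cart velocity, pole angle, pole angular velocity]
    angle, angle_velocity = observation[2], observation[3]
    # Push the cart towards the side the pole is falling to.
    action = 0 if angle + 0.5 * angle_velocity < 0 else 1
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        print("Episode reward:", total_reward)
        total_reward = 0.0
        observation = env.reset()
env.close()

Even this crude rule keeps the pole up far longer than random actions, which gives a feeling for how much a learned policy has to beat.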

1.2 PTAN

To make the DQN code reusable and to highlight the changes and differences between variants, the deep reinforcement learning code needs one more layer of wrapping. PTAN is such a tool: it is built on PyTorch, easy to learn, flexible and extensible, and allows functions to be reused. Its source code is available at https://github.com/Shmuma/ptan . The main modules provided by PTAN are briefly introduced below
(for the sake of precision in what follows, a considerable amount of code appears; feel free to treat this section as a code-reading exercise).

Agent

PTAN provides a base DQN agent class, ptan.agent.DQNAgent. Given a PyTorch nn.Module implementing the DQN network and an action-selection policy, it returns the actions together with the agent states, and the computation device can be chosen (CPU or GPU; on my machine the two differ in speed by roughly a factor of 10).

class DQNAgent(BaseAgent):
    """
    DQNAgent is a memoryless DQN agent which calculates Q values
    from the observations and  converts them into the actions using action_selector
    """
    def __init__(self, dqn_model, action_selector, device="cpu", preprocessor=default_states_preprocessor):
        self.dqn_model = dqn_model
        self.action_selector = action_selector
        self.preprocessor = preprocessor
        self.device = device
    def __call__(self, states, agent_states=None):
        if agent_states is None:
            agent_states = [None] * len(states)
        if self.preprocessor is not None:
            states = self.preprocessor(states)
            if torch.is_tensor(states):
                states = states.to(self.device)
        q_v = self.dqn_model(states)
        q = q_v.data.cpu().numpy()
        actions = self.action_selector(q)
        return actions, agent_states
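A minimal usage sketch (the toy two-layer network, the input size of 4 and the batch of random observations are illustrative assumptions, not part of PTAN):

import numpy as np
import torch.nn as nn
import ptan

# A toy Q-network for illustration only; any nn.Module that returns Q values works.
net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
selector = ptan.actions.ArgmaxActionSelector()
agent = ptan.agent.DQNAgent(net, selector, device="cpu")

states = np.random.rand(3, 4).astype(np.float32)   # a batch of 3 fake observations
actions, agent_states = agent(states)
print(actions)                                     # e.g. array([1, 0, 1])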
Actions

In PTAN, ptan.actions.ArgmaxActionSelector and ptan.actions.EpsilonGreedyActionSelector correspond to the greedy policy and the ε-greedy policy respectively. Their job is to convert the network's output into actions.

class EpsilonGreedyActionSelector(ActionSelector):
    def __init__(self, epsilon=0.05, selector=None):
        self.epsilon = epsilon
        self.selector = selector if selector is not None else ArgmaxActionSelector()
    def __call__(self, scores):
        assert isinstance(scores, np.ndarray)
        batch_size, n_actions = scores.shape
        actions = self.selector(scores)
        mask = np.random.random(size=batch_size) < self.epsilon
        rand_actions = np.random.choice(n_actions, sum(mask))
        actions[mask] = rand_actions
        return actions
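A quick sketch of how the selectors turn a batch of Q values into actions (the Q values here are made up):

import numpy as np
import ptan

q_values = np.array([[1.0, 2.0, 3.0],
                     [1.0, -1.0, 0.0]])

argmax_selector = ptan.actions.ArgmaxActionSelector()
print(argmax_selector(q_values))          # [2 0] - always the greedy action

eps_selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=0.5)
print(eps_selector(q_values))             # greedy actions, each replaced by a random one with probability 0.5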
Agent's experience

PTAN also has a component that produces the transitions used for value iteration, called the experience source, for example ptan.experience.ExperienceSourceFirstLast. This is a wrapper around the experience source for the case where only the first and last states are needed, which avoids storing full trajectories in the experience buffer; when an episode ends, the last state is None. Note that ExperienceSourceFirstLast is a subclass of ExperienceSource and uses super() to delegate to the parent class.

class ExperienceSourceFirstLast(ExperienceSource):
    def __init__(self, env, agent, gamma, steps_count=1, steps_delta=1, vectorized=False):
        assert isinstance(gamma, float)
        super(ExperienceSourceFirstLast, self).__init__(env, agent, steps_count+1, steps_delta, vectorized=vectorized)
        self.gamma = gamma
        self.steps = steps_count
    def __iter__(self):
        for exp in super(ExperienceSourceFirstLast, self).__iter__():
            if exp[-1].done and len(exp) <= self.steps:
                last_state = None
                elems = exp
            else:
                last_state = exp[-1].state
                elems = exp[:-1]
            total_reward = 0.0
            for e in reversed(elems):
                total_reward *= self.gamma
                total_reward += e.reward
            yield ExperienceFirstLast(state=exp[0].state, action=exp[0].action,
                                      reward=total_reward, last_state=last_state)

There is also the experience storage component, ptan.experience.ExperienceReplayBuffer.

class ExperienceReplayBuffer:
    def __init__(self, experience_source, buffer_size):
        assert isinstance(experience_source, (ExperienceSource, type(None)))
        assert isinstance(buffer_size, int)
        self.experience_source_iter = None if experience_source is None else iter(experience_source)
        self.buffer = []
        self.capacity = buffer_size
        self.pos = 0
    def __len__(self):
        return len(self.buffer)
    def __iter__(self):
        return iter(self.buffer)
    def sample(self, batch_size):
        """
        Get one random batch from experience replay
        TODO: implement sampling order policy
        :param batch_size:
        :return:
        """
        if len(self.buffer) <= batch_size:
            return self.buffer
        # Warning: replace=False makes random.choice O(n)
        keys = np.random.choice(len(self.buffer), batch_size, replace=True)
        return [self.buffer[key] for key in keys]
    def _add(self, sample):
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            self.buffer[self.pos] = sample
            self.pos = (self.pos + 1) % self.capacity
    def populate(self, samples):
        """
        Populates samples into the buffer
        :param samples: how many samples to populate
        """
        for _ in range(samples):
            entry = next(self.experience_source_iter)
            self._add(entry)
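To see how these pieces fit together, the following sketch wires an agent, an experience source and a replay buffer around CartPole. The tiny network and the use of ptan.agent.float32_preprocessor are illustrative assumptions, and the classic Gym API is assumed as before.

import gym
import ptan
import torch.nn as nn

# Illustrative only: a tiny network and CartPole instead of an Atari environment.
env = gym.make("CartPole-v1")
net = nn.Sequential(nn.Linear(env.observation_space.shape[0], 64),
                    nn.ReLU(),
                    nn.Linear(64, env.action_space.n))
agent = ptan.agent.DQNAgent(net, ptan.actions.ArgmaxActionSelector(),
                            preprocessor=ptan.agent.float32_preprocessor)
exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=0.99, steps_count=1)
buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=1000)

buffer.populate(100)        # play 100 steps and store the transitions
batch = buffer.sample(32)   # a list of 32 ExperienceFirstLast tuples
print(batch[0])             # state, action, discounted reward, last_state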

To make this clearer, I drew the functions of the individual components and the relationships between them, as shown in the figure below:


Relationships between the components

1.3 Basic DQN

We now use a basic DQN implementation to demonstrate these tools, using Gym's PongNoFrameskip-v4 environment with the following hyperparameters:

HYPERPARAMS = {
    'pong': {
        'env_name':         "PongNoFrameskip-v4",
        'stop_reward':      18.0,
        'run_name':         'pong',
        'replay_size':      100000,
        'replay_initial':   10000,
        'target_net_sync':  1000,
        'epsilon_frames':   10**5,
        'epsilon_start':    1.0,
        'epsilon_final':    0.02,
        'learning_rate':    0.0001,
        'gamma':            0.99,
        'batch_size':       32
    },
}

Computing the loss: starting from the first state, the value function Q is computed according to the Bellman equation, and the loss is the MSE (mean squared error) between the Q values and the expected Q values.
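Written out, the target and loss that calc_loss_dqn below computes are (with $\theta$ the parameters of the online network, $\theta^-$ the parameters of the target network, and $d$ the done flag):

$$y = r + \gamma\,(1 - d)\,\max_{a'} Q(s', a'; \theta^-), \qquad L(\theta) = \big(Q(s, a; \theta) - y\big)^2$$

In the code, the $(1-d)$ factor appears as zeroing next_state_values for terminal transitions.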

def calc_loss_dqn(batch, net, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = unpack_batch(batch)
    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.ByteTensor(dones).to(device)
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    next_state_values = tgt_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0
    expected_state_action_values = next_state_values.detach() * gamma + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)

RewardTracker collects the total reward after every episode, computes the mean reward over the most recent episodes, and sends the current values to TensorBoard. It also measures the processing speed and checks whether the problem has been solved.

class RewardTracker:
    def __init__(self, writer, stop_reward):
        self.writer = writer
        self.stop_reward = stop_reward

    def __enter__(self):
        self.ts = time.time()
        self.ts_frame = 0
        self.total_rewards = []
        return self

    def __exit__(self, *args):
        self.writer.close()

    def reward(self, reward, frame, epsilon=None):
        self.total_rewards.append(reward)
        speed = (frame - self.ts_frame) / (time.time() - self.ts)
        self.ts_frame = frame
        self.ts = time.time()
        mean_reward = np.mean(self.total_rewards[-100:])
        epsilon_str = "" if epsilon is None else ", eps %.2f" % epsilon
        print("%d: done %d games, mean reward %.3f, speed %.2f f/s%s" % (
            frame, len(self.total_rewards), mean_reward, speed, epsilon_str
        ))
        sys.stdout.flush()
        if epsilon is not None:
            self.writer.add_scalar("epsilon", epsilon, frame)
        self.writer.add_scalar("speed", speed, frame)
        self.writer.add_scalar("reward_100", mean_reward, frame)
        self.writer.add_scalar("reward", reward, frame)
        if mean_reward > self.stop_reward:
            print("Solved in %d frames!" % frame)
            return True
        return False

Below is the DQN main program. To show its flow more clearly, I drew a flowchart of the main function and annotated the packages, classes and functions it calls and what they do.

import gym
import ptan
import argparse
import torch
import torch.optim as optim
from tensorboardX import SummaryWriter
from lib import dqn_model, common

if __name__ == "__main__":
    params = common.HYPERPARAMS['pong']
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
    args = parser.parse_args()
    device = torch.device("cuda" if args.cuda else "cpu")

    env = gym.make(params['env_name'])
    env = ptan.common.wrappers.wrap_dqn(env)

    writer = SummaryWriter(comment="-" + params['run_name'] + "-basic")
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)

    tgt_net = ptan.agent.TargetNet(net)
    selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=params['epsilon_start'])
    epsilon_tracker = common.EpsilonTracker(selector, params)
    agent = ptan.agent.DQNAgent(net, selector, device=device)

    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=1)
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=params['replay_size'])
    optimizer = optim.Adam(net.parameters(), lr=params['learning_rate'])
    frame_idx = 0

    with common.RewardTracker(writer, params['stop_reward']) as reward_tracker:
        while True:
            frame_idx += 1
            buffer.populate(1)
            epsilon_tracker.frame(frame_idx)

            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:
                if reward_tracker.reward(new_rewards[0], frame_idx, selector.epsilon):
                    break

            if len(buffer) < params['replay_initial']:
                continue

            optimizer.zero_grad()
            batch = buffer.sample(params['batch_size'])
            loss_v = common.calc_loss_dqn(batch, net, tgt_net.target_model, gamma=params['gamma'], device=device)
            loss_v.backward()
            optimizer.step()

            if frame_idx % params['target_net_sync'] == 0:
                tgt_net.sync()
Flowchart of the main function

Below are the results of running the basic DQN.

Training performance (GPU)
Reward per episode

Median reward over the last 100 episodes

2. Double DQN

The Double DQN algorithm comes from the paper Deep Reinforcement Learning with Double Q-learning and addresses DQN's tendency to over-estimate the value function. As described in the previous post, the value function is updated as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big]$$

where the max operator makes the estimated values larger than the true ones. Since the goal is to find the optimal policy, a uniform over-estimation of the value function would not affect the optimal policy. In practice, however, the over-estimation is not uniform, and some settings also need more accurate value estimates, so the original DQN algorithm has to be improved.

In the update above, the q-target is computed as

$$y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^-)$$

where $\theta^-$ denotes the parameters of the Target Network. The formula can be expanded further as

$$y_t = r_{t+1} + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta^-);\, \theta^-\big)$$

which shows that the model uses the same parameters both to select the best action and to compute the target value, leading to overly optimistic value estimates. To reduce the impact of over-estimation, a simple idea is to decouple the selection of the best action from its evaluation, which gives the Double DQN algorithm. The q-target then becomes

$$y_t = r_{t+1} + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta);\, \theta^-\big)$$

i.e. the Behavior Network (parameters $\theta$) takes over the job of selecting the best action, while the Target Network evaluates it.
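A tiny numerical illustration of why the max operator over-estimates (this toy experiment is not from the original text): if the true value of every action is 0 but the estimates carry zero-mean noise, the max of the estimates is biased upwards, while Double-DQN-style decoupling is not.

import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 4, 100000

# True Q value of every action is 0; estimates are corrupted by zero-mean noise.
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))
print("mean of max over actions:", noisy_q.max(axis=1).mean())   # clearly > 0

# Decoupling: select with one noisy estimate, evaluate with an independent one.
other_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))
selected = noisy_q.argmax(axis=1)
print("decoupled estimate:", other_q[np.arange(n_trials), selected].mean())  # close to 0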

The core change is simple and only touches the loss function: if the parameter double is true, the action for the next state is chosen by the main (online) network, while its value is computed by the target network.

def calc_loss(batch, net, tgt_net, gamma, device="cpu", double=True):
    states, actions, rewards, dones, next_states = common.unpack_batch(batch)
    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.ByteTensor(dones).to(device)

    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    if double:
        next_state_actions = net(next_states_v).max(1)[1]
        next_state_values = tgt_net(next_states_v).gather(1, next_state_actions.unsqueeze(-1)).squeeze(-1)
    else:
        next_state_values = tgt_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0

    expected_state_action_values = next_state_values.detach() * gamma + rewards_v
    return nn.MSELoss()(state_action_values, expected_state_action_values)

This part differs little from the basic DQN; it only adds a command-line switch for enabling double.

STATES_TO_EVALUATE = 1000
EVAL_EVERY_FRAME = 100

if __name__ == "__main__":
    params = common.HYPERPARAMS['pong']
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
    parser.add_argument("--double", default=False, action="store_true", help="Enable double DQN")
    args = parser.parse_args()
    device = torch.device("cuda" if args.cuda else "cpu")

    env = gym.make(params['env_name'])
    env = ptan.common.wrappers.wrap_dqn(env)

The modules used and the parameters chosen are the same as for the basic DQN. After the initial replay buffer has been filled, the eval_states variable is populated with held-out states.

    writer = SummaryWriter(comment="-" + params['run_name'] + "-double=" + str(args.double))
    net = dqn_model.DQN(env.observation_space.shape, env.action_space.n).to(device)

    tgt_net = ptan.agent.TargetNet(net)
    selector = ptan.actions.EpsilonGreedyActionSelector(epsilon=params['epsilon_start'])
    epsilon_tracker = common.EpsilonTracker(selector, params)
    agent = ptan.agent.DQNAgent(net, selector, device=device)

    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, gamma=params['gamma'], steps_count=1)
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, buffer_size=params['replay_size'])
    optimizer = optim.Adam(net.parameters(), lr=params['learning_rate'])

    frame_idx = 0
    eval_states = None

Here we create the initial set of states that will be evaluated during training. The constant STATES_TO_EVALUATE is defined at the beginning of the program and equals 1000, which is large enough to give a representative set of states. The rest changes very little.

    with common.RewardTracker(writer, params['stop_reward']) as reward_tracker:
        while True:
            frame_idx += 1
            buffer.populate(1)
            epsilon_tracker.frame(frame_idx)

            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:
                if reward_tracker.reward(new_rewards[0], frame_idx, selector.epsilon):
                    break

            if len(buffer) < params['replay_initial']:
                continue
            if eval_states is None:
                eval_states = buffer.sample(STATES_TO_EVALUATE)
                eval_states = [np.array(transition.state, copy=False) for transition in eval_states]
                eval_states = np.array(eval_states, copy=False)

            optimizer.zero_grad()
            batch = buffer.sample(params['batch_size'])
            loss_v = calc_loss(batch, net, tgt_net.target_model, gamma=params['gamma'], device=device,
                               double=args.double)
            loss_v.backward()
            optimizer.step()

            if frame_idx % params['target_net_sync'] == 0:
                tgt_net.sync()
            if frame_idx % EVAL_EVERY_FRAME == 0:
                mean_val = calc_values_of_states(eval_states, net, device=device)
                writer.add_scalar("values_mean", mean_val, frame_idx)

The figures below compare Double DQN with the basic DQN: on the one hand Double DQN converges faster, and on the other hand its value estimates are smaller (less over-estimated).

Comparison of rewards: Double DQN vs. DQN

Comparison of median rewards: Double DQN vs. DQN

3. Prioritized Replay Buffer

The replay buffer improves sample efficiency while reducing correlation between samples. But one problem remains: when sampling, every transition is drawn with the same probability, i.e. learned at the same frequency, yet transitions differ in difficulty and in how much can be learned from them. Treating every transition equally means spending a lot of time on the easy ones, so the learning potential is not fully exploited.
The Priority Replay Buffer solves this problem (see the paper Prioritized Experience Replay). It assigns each transition a weight based on how well the model currently handles it, and the probability of sampling a transition depends on this weight: the worse the model does on a transition, the higher its weight and the more likely it is to be sampled; conversely, the better the model does, the lower the weight and the lower the sampling probability.
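Concretely (following the Prioritized Experience Replay paper), a transition $i$ with priority $p_i$ is drawn with probability $P(i)$, and its loss is corrected by an importance-sampling weight $w_i$:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad w_i = \frac{\big(N \cdot P(i)\big)^{-\beta}}{\max_j w_j}$$

Here $\alpha$ (prob_alpha in the code below) controls how strongly prioritization is applied, and $\beta$ is annealed from BETA_START towards 1 over BETA_FRAMES frames.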

Implementing the Priority Replay Buffer with an array

Implementing the Priority Replay Buffer with a segment tree

This post implements the Priority Replay Buffer with an array; the parameters and initialization are shown below. The priority replay buffer class stores samples in a circular buffer and keeps the priorities in a NumPy array. We also store an iterator over the experience source, from which samples are pulled out of the environment.

PRIO_REPLAY_ALPHA = 0.6
BETA_START = 0.4
BETA_FRAMES = 100000

class PrioReplayBuffer:
    def __init__(self, exp_source, buf_size, prob_alpha=0.6):
        self.exp_source_iter = iter(exp_source)
        self.prob_alpha = prob_alpha
        self.capacity = buf_size
        self.pos = 0
        self.buffer = []
        self.priorities = np.zeros((buf_size, ), dtype=np.float32)

The populate() method pulls the given number of transitions from the ExperienceSource object and stores them in the buffer. Since the storage is implemented as a circular buffer, there are two cases: while the buffer has not yet reached full capacity, a new transition is simply appended; once the buffer is full, the oldest transition, tracked by the pos field, is overwritten, and pos is advanced modulo the buffer size.

    def __len__(self):
        return len(self.buffer)

    def populate(self, count):
        max_prio = self.priorities.max() if self.buffer else 1.0
        for _ in range(count):
            sample = next(self.exp_source_iter)
            if len(self.buffer) < self.capacity:
                self.buffer.append(sample)
            else:
                self.buffer[self.pos] = sample
            self.priorities[self.pos] = max_prio
            self.pos = (self.pos + 1) % self.capacity

In the sample method, the prob_alpha parameter is used to convert priorities into probabilities.

    def sample(self, batch_size, beta=0.4):
        if len(self.buffer) == self.capacity:
            prios = self.priorities
        else:
            prios = self.priorities[:self.pos]
        probs = prios ** self.prob_alpha
        probs /= probs.sum()

Using these probabilities, samples are drawn from the buffer. The importance-sampling weights of the samples in the batch are also computed, and three objects are returned: batch, indices and weights.

        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        samples = [self.buffer[idx] for idx in indices]
        total = len(self.buffer)
        weights = (total * probs[indices]) ** (-beta)
        weights /= weights.max()
        return samples, indices, np.array(weights, dtype=np.float32)

The last method of the Priority Replay Buffer updates the priorities.

    def update_priorities(self, batch_indices, batch_priorities):
        for idx, prio in zip(batch_indices, batch_priorities):
            self.priorities[idx] = prio

In the initialization part of the main loop, the only change is that PrioReplayBuffer replaces ExperienceReplayBuffer.

if __name__ == "__main__":
    ...
    buffer = PrioReplayBuffer(exp_source, params['replay_size'], PRIO_REPLAY_ALPHA)
    ...

    frame_idx = 0
    beta = BETA_START

    with common.RewardTracker(writer, params['stop_reward']) as reward_tracker:
        while True:
            ...
            beta = min(1.0, BETA_START + frame_idx * (1.0 - BETA_START) / BETA_FRAMES)
            ...

The optimizer step differs from the basic DQN version. First, sampling from the buffer now returns three values instead of a single batch: batch, indices and weights. The indices and weights are passed to the loss function, which returns two things: the accumulated loss value to back-propagate, and a tensor with the individual loss value of every sample in the batch. The accumulated loss is back-propagated, and the individual losses are used to update the sample priorities in the Priority Replay Buffer.
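The calc_loss used here (taking the weights and returning both the mean loss and the per-sample losses) is not listed above. A minimal sketch consistent with the description, assuming the same unpack_batch helper as before and adding a small constant so that priorities stay strictly positive:

import torch
from lib import common

def calc_loss(batch, batch_weights, net, tgt_net, gamma, device="cpu"):
    states, actions, rewards, dones, next_states = common.unpack_batch(batch)
    states_v = torch.tensor(states).to(device)
    next_states_v = torch.tensor(next_states).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.ByteTensor(dones).to(device)
    batch_weights_v = torch.tensor(batch_weights).to(device)

    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    next_state_values = tgt_net(next_states_v).max(1)[0]
    next_state_values[done_mask] = 0.0
    expected_state_action_values = next_state_values.detach() * gamma + rewards_v

    # Importance-sampling weights scale each squared error individually.
    losses_v = batch_weights_v * (state_action_values - expected_state_action_values) ** 2
    # Mean loss for back-propagation; per-sample losses (plus epsilon) become the new priorities.
    return losses_v.mean(), losses_v + 1e-5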

            optimizer.zero_grad()
            batch, batch_indices, batch_weights = buffer.sample(params['batch_size'], beta)
            loss_v, sample_prios_v = calc_loss(batch, batch_weights, net, tgt_net.target_model,
                                               params['gamma'], device=device)
            loss_v.backward()
            optimizer.step()
            buffer.update_priorities(batch_indices, sample_prios_v.data.cpu().numpy())

            if frame_idx % params['target_net_sync'] == 0:
                tgt_net.sync()

The comparison between the Priority Replay Buffer and the basic DQN is shown in the figures below; convergence is clearly faster.


Comparison of rewards: Priority Replay Buffer vs. DQN

Comparison of median rewards: Priority Replay Buffer vs. DQN

The complete code can be found here; Dueling DQN, Categorical DQN and Rainbow will be covered in follow-up posts.

References

[1] Maxim Lapan. Deep Reinforcement Learning Hands-On. Packt Publishing, 2018.
[2] Feng Chao. The Essence of Reinforcement Learning: Core Algorithms and TensorFlow Implementations. Beijing: Publishing House of Electronics Industry, 2018.

"Who would have thought that steel tempered a hundred times could become soft enough to wind around a finger?" — Liu Kun (Western Jin), "Again Presented to Lu Chen"
