Deep Reinforcement Learning in Practice: Implementing the A2C Algorithm

Contents

  • Key Points of the A2C Implementation
      • Networks
      • Loss Functions
  • Algorithm Implementation
      • Building the Network
      • Building the Environment and Agent
      • Training the Model
      • Monitoring
  • Appendix
      • Running in Google Colab
      • Complete Code


Key Points of the A2C Implementation

A2C belongs to the policy-based family of algorithms: it builds on Policy Gradient by splitting the model into two networks, a Critic and an Actor. The key points of the implementation are as follows:

Networks

  • actor network: takes the state as input and outputs a probability distribution over actions, from which the action to execute is sampled
  • critic network: takes the state as input and estimates its value, which serves as the baseline for the actor

Loss Functions

  • actor network: the Policy Gradient loss with a baseline subtracted from the Q-value, so that the feedback can be positive or negative, which reduces the variance of the updates (both losses are written out below).
  • critic network: estimates the Q-value of a state; the loss is the error between the "true" value and the estimate, where the "true" value is obtained from the Bellman equation, Q(s) = r + gamma*Q(s'), similar to the DQN implementation
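
For reference, the two loss terms can be written out explicitly. This is a sketch in standard A2C notation rather than symbols from the original post: $\pi_\theta$ is the actor's policy, $V_\theta$ the critic's value estimate, $Q_{\text{ref}}$ the Bellman target, and $B$ the batch size; the advantage $Q_{\text{ref}} - V_\theta$ is treated as a constant (detached) in the policy loss, exactly as the training code later does.

$$L_{\text{policy}} = -\frac{1}{B}\sum_{t=1}^{B}\big(Q_{\text{ref}}(s_t) - V_\theta(s_t)\big)\,\log \pi_\theta(a_t \mid s_t)$$

$$L_{\text{value}} = \frac{1}{B}\sum_{t=1}^{B}\big(Q_{\text{ref}}(s_t) - V_\theta(s_t)\big)^2, \qquad Q_{\text{ref}}(s_t) = r_t + \gamma\, V_\theta(s_{t+1})$$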

Algorithm Implementation

Building the Network

The actor and critic networks both take the state as input; only their outputs differ. They can therefore share the same backbone, training it jointly so that information and parameters are shared, while keeping separate output layers. The network code is as follows:

import numpy as np
import torch
import torch.nn as nn

# Build the model: a shared convolutional backbone with two output heads,
# so a single forward pass returns both the policy and the estimated value
class A2CModel(nn.Module):
    def __init__(self, input_shape, n_actions):
        super().__init__()

        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)
        # policy head: outputs the policy (logits of a probability distribution over actions)
        self.policy = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )
        # value head: outputs the estimated state value
        self.value = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

    def _get_conv_out(self, shape):
        # pass a dummy tensor through the conv layers to infer the flattened feature size
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        # scale uint8 pixel observations to roughly the [0, 1) range
        fx = x.float() / 256
        conv_out = self.conv(fx).view(fx.size()[0], -1)
        return self.policy(conv_out), self.value(conv_out)
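
As a quick, optional sanity check (not part of the original post), we can feed a dummy batch of Atari-shaped observations through the model and inspect the shapes of the two heads' outputs:

test_net = A2CModel(input_shape=(4, 84, 84), n_actions=6)
dummy_obs = torch.zeros(2, 4, 84, 84)      # a batch of 2 stacked-frame observations
policy_logits, value = test_net(dummy_obs)
print(policy_logits.shape)                 # torch.Size([2, 6]): one logit per action
print(value.shape)                         # torch.Size([2, 1]): one value estimate per state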

Building the Environment and Agent

(Building the agent, and the experience source used later, relies on the helper library PTAN; see 深度强化学习高级包PTAN-1. Agent, Experience for details.)

  • To speed up training, multiple environments are run in parallel
  • A PolicyAgent is used to sample actions from the policy head
import gym
import ptan
import torch

NUM_ENVS = 70

make_env = lambda: ptan.common.wrappers.wrap_dqn(gym.make("PongNoFrameskip-v4"))
envs = [make_env() for _ in range(NUM_ENVS)]
obs_shape = envs[0].observation_space.shape
n_actions = envs[0].action_space.n
print("Observation Space Shape: {}\nNumber of Action: {}".format(obs_shape, n_actions))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# `net` is the A2CModel instance created in the training section below;
# the agent applies softmax to its policy output and samples actions from the resulting distribution
agent = ptan.agent.PolicyAgent(lambda x: net(x)[0], apply_softmax=True, device=device)

-- Output --
Observation Space Shape: (4, 84, 84)
Number of Action: 6
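
A minimal sketch of how the agent turns observations into actions (this snippet is not from the original post and assumes the `net` instance created in the training section below already exists):

# collect the first observation from every parallel environment
obs = [env.reset() for env in envs]
# PolicyAgent applies softmax to the policy logits and samples one action per observation
actions, _ = agent(obs)
print(actions.shape)    # (70,): one action index per environment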

Training the Model

  1. Instantiate the network net and the optimizer. The code is as follows:
LEARNING_RATE = 0.003
net = A2CModel(envs[0].observation_space.shape, envs[0].action_space.n).to(device)
print(net)
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE, eps=1e-3)

-- Output --
A2CModel(
  (conv): Sequential(
    (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
    (1): ReLU()
    (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (3): ReLU()
    (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
  )
  (policy): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=6, bias=True)
  )
  (value): Sequential(
    (0): Linear(in_features=3136, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=1, bias=True)
  )
)
  2. Generate the agent-environment interaction records and store them in an ExperienceSource (a sketch of what each record looks like follows step 3 below). The code is as follows:
GAMMA = 0.99
REWARD_STEPS = 4
exp_source = ptan.experience.ExperienceSourceFirstLast(envs, agent, gamma=GAMMA, steps_count=REWARD_STEPS)
  3. Train the model. Each iteration of the training loop:
    a. pulls transitions (state, action, reward, next state) from the ExperienceSource
    b. passes them through net in a forward pass
    c. computes the loss
    d. backpropagates the gradients
    e. updates the parameters of net
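
Before looking at the loop itself, it helps to see what a single record yielded by exp_source contains. The sketch below is not from the original post; it assumes PTAN's ExperienceFirstLast namedtuple with the fields state, action, reward and last_state:

exp = next(iter(exp_source))
print(np.array(exp.state).shape)   # (4, 84, 84): the stacked-frame observation
print(exp.action)                  # the action index chosen by the agent
print(exp.reward)                  # discounted sum of the next REWARD_STEPS rewards
print(exp.last_state)              # the state REWARD_STEPS later, or None if the episode ended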

The code is as follows:

import numpy as np
import torch.nn.functional as F
import torch.nn.utils as nn_utils

# BATCH_SIZE, ENTROPY_BETA and CLIP_GRAD are hyperparameters defined in the complete code (see the appendix)
batch = []
for step_idx, exp in enumerate(exp_source):
    batch.append(exp)

    if len(batch) < BATCH_SIZE:
        continue
    # convert states and actions to tensors, and compute each state's Q-value target for the critic
    states_v, actions_t, vals_ref_v = process_exp(batch, net, device=device)
    batch.clear()

    optimizer.zero_grad()
    logits_v, value_v = net(states_v)
    loss_value_v = F.mse_loss(value_v.squeeze(-1), vals_ref_v)

    log_prob_v = F.log_softmax(logits_v, dim=1)
    # note: value_v has shape [batch, 1], so it must be squeezed before the subtraction
    adv_v = vals_ref_v - value_v.squeeze(-1).detach()
    log_prob_actions_v = adv_v * log_prob_v[range(BATCH_SIZE), actions_t]
    loss_policy_v = -log_prob_actions_v.mean()

    prob_v = F.softmax(logits_v, dim=1)
    entropy_loss_v = ENTROPY_BETA * (prob_v * log_prob_v).sum(dim=1).mean()

    loss_policy_v.backward(retain_graph=True)
    grads = np.concatenate([p.grad.data.cpu().numpy().flatten()
                            for p in net.parameters()
                            if p.grad is not None])

    loss_v = entropy_loss_v + loss_value_v
    loss_v.backward()
    nn_utils.clip_grad_norm_(net.parameters(), CLIP_GRAD)
    optimizer.step()
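
The grads array above is collected only for monitoring. As a sketch (not from the original post), its statistics could be written to the same SummaryWriter introduced in the monitoring section below:

    # inside the training loop, right after the grads array is built
    writer.add_scalar("grad_l2", np.sqrt(np.mean(np.square(grads))), step_idx)
    writer.add_scalar("grad_max", np.max(np.abs(grads)), step_idx)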

The loop above calls a custom process_exp() method to process the interaction records; its implementation is shown below:

def process_exp(batch, net, device='cpu'):
    """
    Convert states and actions to tensors and compute each state's Q-value target
    :param batch: a list of experience records
    :param net: the network defined above
    :return: states tensor, actions tensor, state Q-value targets (with discount)
    """
    states = []
    actions = []
    rewards = []
    not_done_idx = []
    last_states = []
    for idx, exp in enumerate(batch):
        states.append(np.array(exp.state, copy=False))
        actions.append(int(exp.action))
        rewards.append(exp.reward)
        if exp.last_state is not None:
            # not_done_idx: indices of experiences whose episode has not ended yet
            not_done_idx.append(idx)
            last_states.append(np.array(exp.last_state, copy=False))

    # convert states and actions to tensors
    states_v = torch.FloatTensor(np.array(states, copy=False)).to(device)
    actions_t = torch.LongTensor(actions).to(device)

    rewards_np = np.array(rewards, dtype=np.float32)
    if not_done_idx:
        last_states_v = torch.FloatTensor(np.array(last_states, copy=False)).to(device)
        # estimated value of each last state
        last_vals_v = net(last_states_v)[1]
        # data.cpu().numpy(): convert the tensor to a numpy array
        # .data extracts the tensor's data, .cpu() moves it to the CPU, .numpy() converts it to numpy
        last_vals_np = last_vals_v.data.cpu().numpy()[:, 0]
        # the N-step reward from REWARD_STEPS is already discounted, so Q(s') must be scaled by GAMMA**REWARD_STEPS as well
        last_vals_np *= GAMMA ** REWARD_STEPS
        rewards_np[not_done_idx] += last_vals_np

    ref_vals_v = torch.FloatTensor(rewards_np).to(device)

    return states_v, actions_t, ref_vals_v
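
In other words, the target returned by process_exp is the N-step Bellman estimate (written here in LaTeX for clarity; the notation is not from the original post), with N = REWARD_STEPS:

$$Q_{\text{ref}}(s_t) = \sum_{i=0}^{N-1} \gamma^{\,i}\, r_{t+i} \;+\; \gamma^{N}\, V_\theta(s_{t+N})$$

The first sum is exactly what exp.reward already contains, which is why the code only adds GAMMA ** REWARD_STEPS times the critic's estimate of the last state.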

Monitoring

During training we need to monitor some signals in order to know how the optimization is going and whether the model is gradually converging. There are two ways to do this:

  • Print the information. In reinforcement learning the step count and the reward are usually printed, so we can check whether the reward is gradually increasing
  • Printed logs are not very intuitive, and once there is too much output they become hard to follow, so TensorBoard is used to visualize the information. Besides the reward, the losses and other important quantities are usually also written to TensorBoard so their trends can be monitored and used to guide later hyperparameter tuning.

The class that prints and records the reward is implemented as follows:

import sys
import time

class RewardTracker:
    def __init__(self, writer, stop_reward):
        self.writer = writer
        self.stop_reward = stop_reward

    def __enter__(self):
        self.ts = time.time()
        self.ts_frame = 0
        self.total_rewards = []
        return self

    def __exit__(self, *args):
        self.writer.close()

    # log progress during training to show how well the model is currently doing
    def reward(self, reward, frame):
        self.total_rewards.append(reward)
        speed = (frame - self.ts_frame) / (time.time() - self.ts)
        self.ts_frame = frame
        self.ts = time.time()
        mean_reward = np.mean(self.total_rewards[-100:])
        n_games = len(self.total_rewards)
        if n_games % 50 == 0:
            print("%d: done %d games, mean reward %.3f, speed %.2f f/s" % (
                frame, n_games, mean_reward, speed
            ))
        sys.stdout.flush()
        # write the values to the log so they can be visualized in TensorBoard
        self.writer.add_scalar("speed", speed, frame)
        self.writer.add_scalar("reward_100", mean_reward, frame)
        self.writer.add_scalar("reward", reward, frame)
        if mean_reward > self.stop_reward:
            print("Solved in %d frames!" % frame)
            return True
        return False
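
A sketch of how RewardTracker could wrap the training loop (not from the original post; stop_reward=18 is an assumed target score for Pong, and writer is the SummaryWriter created below):

with RewardTracker(writer, stop_reward=18) as tracker:
    for step_idx, exp in enumerate(exp_source):
        # ... the training step shown above ...
        new_rewards = exp_source.pop_total_rewards()
        if new_rewards and tracker.reward(new_rewards[0], step_idx):
            break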

Here writer is a SummaryWriter object that records values to a log directory for display in TensorBoard. Besides the reward, the loss trends can also be monitored in TensorBoard, with code like the following:

from tensorboardX import SummaryWriter
writer = SummaryWriter(comment="-pong-a2c")
####
# model training ......
####
# write the loss values to the log so they are displayed in TensorBoard
writer.add_scalar("Policy Loss", loss_policy_v, step_idx)
writer.add_scalar("Value Loss", loss_value_v, step_idx)

The results look like the following:

  • Printed output
    68114: done 50 games, mean reward -20.640, speed 367.11 f/s
    124979: done 100 games, mean reward -20.440, speed 324.80 f/s
    179129: done 150 games, mean reward -20.150, speed 389.28 f/s
    207251: done 200 games, mean reward -20.220, speed 351.81 f/s
  • TensorBoard visualization
    [Figure: TensorBoard charts from the original post]

Appendix

Running in Google Colab

Reaching a satisfactory score requires more than a million training steps and thousands of games. On a single machine this consumes a lot of resources, takes a long time, and may even crash the machine, so it is recommended to run the training in Google Colab. Google Colaboratory provides free GPU resources for model training; see the Colaboratory introduction for details.

The following code needs to be added when running in Colab:

  1. Install the required packages
# install the PTAN helper library
!pip install ptan
# install tensorboardX
!pip install tensorboardX
  2. Start TensorBoard inside the notebook:
# display TensorBoard inside Google Colab
%load_ext tensorboard
# runs is the log directory and can be customized
%tensorboard --logdir runs

Complete Code

The complete code is available at 深度强化学习A2C算法实现; it has been tested and runs in Google Colab.
