A Walkthrough of Policy Gradient Methods in Reinforcement Learning: SOTA, Part 2 (A2C, A3C Code)

  • Policy gradient SOTA
    • Distributed actor-learner
      • A2C
        • advantage & lambda return
        • Maximum entropy
      • batched A2C
      • A3C(Asynchronous Advantage Actor critic)
        • worker
        • optimiser
        • train
      • IMPALA

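Policy gradient SOTA

This part mainly follows the order of Lecture 9 of Bolei Zhou's course.
Main reference course: Intro to Reinforcement Learning, Bolei Zhou
Code for this article:
https://github.com/ThousandOfWind/RL-basic-alg.git
I also consulted pytorch a3c and another version of it.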

Distributed actor-learner

A2C

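A2C itself is not distributed yet; the later methods are distributed versions of A2C. Starting from QAC (Q actor-critic), we make two changes.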

advantage & lambda return

I was a bit lazy here and copied this straight from my PPO code.

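        # advantage (generalized advantage estimation), computed backwards over one trajectory
        advantage = th.zeros_like(reward)
        returns = th.zeros_like(reward)
        deltas = th.zeros_like(reward)
        # quantities "after the last step" start at 0, i.e. the trajectory is
        # assumed to end in a terminal state
        pre_return = 0
        pre_value = 0
        pre_advantage = 0
        for i in range(advantage.shape[0]-1, -1, -1):
            # discounted Monte Carlo return
            returns[i] = reward[i] + self.gamma * pre_return
            # one-step TD error: r_i + gamma * V(s_{i+1}) - V(s_i)
            deltas[i] = reward[i] + self.gamma * pre_value - value[i]
            # sum of TD errors, discounted by gamma * lambda
            advantage[i] = deltas[i] + self.gamma * self.lamda * pre_advantage
            pre_return = returns[i]
            pre_value = value[i]
            pre_advantage = advantage[i]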
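To check the recursion by hand, here is a minimal plain-Python sketch of the same backward loop. It is not the code from the repository: the function name `gae` and the default values of `gamma` and `lam` are assumptions made for illustration, and like the snippet above it assumes a single trajectory that ends in a terminal state.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Return (returns, deltas, advantages) for one finished episode.

    Hypothetical helper that mirrors the backward loop above with plain lists.
    """
    n = len(rewards)
    returns = [0.0] * n
    deltas = [0.0] * n
    advantages = [0.0] * n
    # Everything after the last step is 0 because the episode has ended.
    next_return = next_value = next_advantage = 0.0
    for t in reversed(range(n)):
        returns[t] = rewards[t] + gamma * next_return
        deltas[t] = rewards[t] + gamma * next_value - values[t]
        advantages[t] = deltas[t] + gamma * lam * next_advantage
        next_return, next_value, next_advantage = returns[t], values[t], advantages[t]
    return returns, deltas, advantages


if __name__ == "__main__":
    rets, dels, advs = gae([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], gamma=0.9, lam=1.0)
    print(rets, advs)
```

Two limiting cases make the role of lambda visible: with `lam=1` the advantage equals the Monte Carlo return minus the value, and with `lam=0` it equals the one-step TD error.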

Maximum entropy

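        # entropy of the policy at each state: H = -sum_a pi(a|s) * log pi(a|s)
        entropies = -(log_pi * log_pi.exp()).sum(dim=1, keepdim=True)
        # maximize advantage-weighted log-probability plus entropy, so minimize the negative
        J = - (advantage.detach() * log_pi + entropies).mean()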
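The arithmetic behind this loss can be spelled out without tensors. The sketch below is a plain-Python illustration under my own assumptions: the names `entropy` and `a2c_policy_loss` are hypothetical, the policy term uses the log-probability of the action actually taken, and `entropy_coef` is an extra weight that the snippet above does not have (it effectively uses a weight of 1).

```python
import math


def entropy(log_probs):
    """Entropy of a categorical distribution given its log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def a2c_policy_loss(log_probs_batch, actions, advantages, entropy_coef=0.01):
    """Batch mean of -(A * log pi(a|s) + entropy_coef * H(pi(.|s))).

    log_probs_batch: one list of per-action log-probabilities per state
    actions:         index of the action taken in each state
    advantages:      advantage estimate per state, treated as a constant
    """
    total = 0.0
    for log_probs, action, adv in zip(log_probs_batch, actions, advantages):
        total += -(adv * log_probs[action] + entropy_coef * entropy(log_probs))
    return total / len(actions)


if __name__ == "__main__":
    uniform = [math.log(0.25)] * 4
    print(entropy(uniform))                        # entropy of a uniform policy
    print(a2c_policy_loss([uniform], [1], [2.0]))  # loss for one sample
```

A larger `entropy_coef` pushes the policy toward uniform and so encourages exploration; the right value is problem dependent.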

batched A2C

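Multiple threads collect experience in parallel, but the update is synchronous: each step waits for all of them, so efficiency is still fairly low.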
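The synchronous pattern can be sketched on a toy problem. This is not A2C itself: every name here (`TARGET`, `worker_gradient`, `synchronous_training`) is hypothetical, and the "environment" is just a quadratic loss whose gradient each worker evaluates on slightly different data. The point is only the barrier before each update.

```python
from concurrent.futures import ThreadPoolExecutor

TARGET = 3.0  # hypothetical optimum of the toy loss (w - TARGET)^2


def worker_gradient(args):
    """Gradient of the toy loss for one worker; `offset` mimics per-worker data noise."""
    w, offset = args
    return 2.0 * (w - (TARGET + offset))


def synchronous_training(num_workers=4, steps=200, lr=0.1):
    w = 0.0
    # Zero-mean offsets, so the averaged gradient points at TARGET.
    offsets = [0.1 * (i - (num_workers - 1) / 2) for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(steps):
            # Barrier: list(pool.map(...)) returns only after every worker is done.
            grads = list(pool.map(worker_gradient, [(w, o) for o in offsets]))
            # One update from the averaged gradient of all workers.
            w -= lr * sum(grads) / len(grads)
    return w


if __name__ == "__main__":
    print(synchronous_training())
```

Because every step waits for the slowest worker, adding workers improves the gradient estimate but not the time per update.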

A3C(Asynchronous Advantage Actor critic)

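A3C only adds the asynchronous part on top of A2C:

  1. Multiple actors run in parallel on the CPU, each interacting with its own local environment.
  2. Because the actors provide diverse experience, no replay buffer is needed.
  3. Each worker computes gradients locally and applies them to the shared (global) network.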
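The three steps above can be sketched on a toy problem with plain Python threads. This is an illustration under my own assumptions, not A3C: the names `TARGET`, `worker` and `asynchronous_training` are hypothetical, the loss is a simple quadratic, and I guard the update with a lock to keep the toy well behaved, whereas A3C-style implementations often skip the lock.

```python
import threading

TARGET = 3.0  # hypothetical optimum of the toy loss (w - TARGET)^2


def worker(shared, lock, steps, lr):
    for _ in range(steps):
        local_w = shared["w"]            # sync the local copy from the shared parameter
        grad = 2.0 * (local_w - TARGET)  # compute the gradient locally
        with lock:
            # Apply the (possibly slightly stale) gradient to the shared parameter.
            shared["w"] -= lr * grad


def asynchronous_training(num_workers=4, steps=300, lr=0.05):
    shared = {"w": 0.0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(shared, lock, steps, lr))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["w"]


if __name__ == "__main__":
    print(asynchronous_training())
```

No worker waits for the others, so a gradient may have been computed from a parameter value that is already out of date when it is applied. With a small learning rate this staleness is tolerated.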

worker

First, when a worker is initialized, it also has to be given the network that actually gets updated (the shared model).

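    def __init__(self, param_set, writer, share_model:DNNAgent, optimizer):
        self.obs_shape = param_set['obs_shape'][0]
        self.gamma = param_set['gamma']
        self.learning_rate = param_set['learning_rate']
        self.clone_share_model = param_set['clone_share_model']
        self.id = param_set['worker_id']

        # local copy of the network, used by this worker to act and to compute gradients
        self.ac = copy.deepcopy(share_model)
        self.soft_clone = param_set['soft_clone']
        if self.soft_clone:
            self.tau = param_set['tau']

        # parameters of the shared network; kept as a list because parameters()
        # returns a generator that is exhausted after a single pass
        self.params = list(share_model.parameters())
        # optimizer passed in from outside (expected to update the shared network)
        self.optimiser = optimizer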

When backpropagating, the local gradients have to be handed over to the global network

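        self.optimiser.zero_grad()
        # also clear the local gradients (assumes loss was computed with self.ac)
        self.ac.zero_grad()
        loss.backward()
        # clip the local gradients, which are the ones loss.backward() just filled
        grad_norm = th.nn.utils.clip_grad_norm_(self.ac.parameters(), 10)
        self.ensure_shared_grads()
        self.optimiser.step()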

using this helper function

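    def ensure_shared_grads(self):
        # point the shared parameters' gradients at the local gradients
        for param, shared_param in zip(self.ac.parameters(),
                                       self.params):
            # already linked in this process: nothing more to do
            if shared_param.grad is not None:
                return
            shared_param._grad = param.grad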

optimiser

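I have not fully understood this part yet. The optimizer that ships with PyTorch can probably be used directly; I am just pasting a short custom version here for reference.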

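import torch
import torch.optim as optim


class GlobalAdam(optim.Adam):
    """Adam whose moment estimates live in shared memory, so all processes see them."""

    def __init__(self, params, lr):
        super(GlobalAdam, self).__init__(params, lr=lr)
        for group in self.param_groups:
            for p in group['params']:
                # create the optimizer state up front instead of lazily on the first step
                state = self.state[p]
                # recent PyTorch releases keep 'step' as a tensor, so this line may need adapting
                state['step'] = 0
                state['exp_avg'] = torch.zeros_like(p.data)
                state['exp_avg_sq'] = torch.zeros_like(p.data)

                # move the moment estimates into shared memory
                state['exp_avg'].share_memory_()
                state['exp_avg_sq'].share_memory_()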

train

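    # assumes `import torch.multiprocessing as mp`
    writer, param_set = init()
    # the shared network; its parameters must live in shared memory
    # (for an nn.Module, share_model.share_memory() does this)
    share_model = ac_sharenet(param_set)
    processes = []

    # start one process per worker, all pointing at the same shared model
    for i in range(param_set['num_processes']):
        p = mp.Process(target=run, args=(writer, param_set, share_model))
        p.start()
        processes.append(p)
    # wait until every worker has finished
    for p in processes:
        p.join()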

IMPALA

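  1. Both the actors and the learners are distributed.
  2. Actors do not compute gradients; they only sample trajectories and send them to the learners.
  3. Learners pass gradients to one another.
  4. Importance sampling is used so that trajectories can be reused.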
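Point 4 is the part that is easiest to get wrong, so here is a minimal plain-Python sketch of truncated importance sampling applied to value targets, in the spirit of the V-trace correction from the IMPALA paper. The function name `vtrace_targets`, the argument names and the defaults are my own assumptions, and the sketch covers a single trajectory only.

```python
import math


def vtrace_targets(rewards, values, bootstrap_value,
                   target_log_probs, behaviour_log_probs,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Value targets for one trajectory sampled by an older (behaviour) policy.

    target_log_probs:    log pi(a_t|x_t) under the learner's current policy
    behaviour_log_probs: log mu(a_t|x_t) under the actor's policy at sampling time
    """
    n = len(rewards)
    targets = [0.0] * n
    next_value = bootstrap_value  # V(x_{t+1})
    next_correction = 0.0         # v_{t+1} - V(x_{t+1})
    for t in reversed(range(n)):
        ratio = math.exp(target_log_probs[t] - behaviour_log_probs[t])  # pi / mu
        rho = min(rho_bar, ratio)  # truncated weight on the TD error
        c = min(c_bar, ratio)      # truncated weight on the propagated correction
        delta = rho * (rewards[t] + gamma * next_value - values[t])
        correction = delta + gamma * c * next_correction
        targets[t] = values[t] + correction
        next_value, next_correction = values[t], correction
    return targets


if __name__ == "__main__":
    on_policy = vtrace_targets([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], 0.0,
                               [-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0], gamma=0.9)
    print(on_policy)
```

When the two policies agree, every ratio is 1 and the targets reduce to the ordinary n-step return. The truncation keeps a very large ratio from blowing up the variance of the target.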
