DDPG Source Code Walkthrough

Source of the code

The main function has two modes: test and train. In test mode, no random noise is added to the actions and the transitions are not stored in the replay buffer. During training, the networks are not updated within an episode; the update happens after the episode ends. The most important parts, the loss functions and gradient computations, are concentrated in agent.update().

def main():
    agent = DDPG(state_dim, action_dim, max_action)
    ep_r = 0
    if args.mode == 'test':
        agent.load()
        for i in range(args.test_iteration):
            state = env.reset()
            for t in count():
                action = agent.select_action(state)
                next_state, reward, done, info = env.step(np.float32(action))
                ep_r += reward
                env.render()
                if done or t >= args.max_length_of_trajectory:
                    print("Ep_i \t{}, the ep_r is \t{:0.2f}, the step is \t{}".format(i, ep_r, t))
                    ep_r = 0
                    break
                state = next_state

    elif args.mode == 'train':
        if args.load: agent.load()
        total_step = 0
        for i in range(args.max_episode):
            total_reward = 0
            step = 0
            state = env.reset()
            for t in count():
                action = agent.select_action(state) # a=u(s), no other tricks
                action = (action + np.random.normal(0, args.exploration_noise, size=env.action_space.shape[0])).clip(
                    env.action_space.low, env.action_space.high)

                next_state, reward, done, info = env.step(action)
                if args.render and i >= args.render_interval: env.render()
                agent.replay_buffer.push((state, next_state, action, reward, np.float32(done)))

                state = next_state
                if done:
                    break
                step += 1
                total_reward += reward
            total_step += step+1
            print("Total T:{} Episode: \t{} Total Reward: \t{:0.2f}".format(total_step, i, total_reward))
            agent.update() #update the network after an episode, the most important part
           # "Total T: %d Episode Num: %d Episode T: %d Reward: %f

            if i % args.log_interval == 0:
                agent.save()
    else:
        raise NameError("mode wrong!!!")
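
The select_action() used in both branches is not shown above; as the comment a = u(s), no other tricks suggests, it simply evaluates the deterministic policy. The fragment below is a minimal sketch of a typical implementation in this style (the method body and the module-level device are assumptions, not the exact source):

def select_action(self, state):
    # a = mu(s): a single forward pass through the actor network.
    # Exploration noise is added by the training loop, not here.
    state = torch.FloatTensor(state.reshape(1, -1)).to(device)
    return self.actor(state).cpu().data.numpy().flatten()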

The agent.update() function can be divided into several parts: sampling a batch from the replay buffer, updating the critic Q network, updating the actor network, and soft-updating the target Q and target actor networks.
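
The sampling step is not excerpted below; the fragment here is a sketch of what it typically looks like, assuming replay_buffer.sample(args.batch_size) returns NumPy arrays in the same (state, next_state, action, reward, done) order used by push() in main(); the variable names x, y, u, r, d and the device object are illustrative assumptions.

x, y, u, r, d = self.replay_buffer.sample(args.batch_size)
state = torch.FloatTensor(x).to(device)
next_state = torch.FloatTensor(y).to(device)
action = torch.FloatTensor(u).to(device)
reward = torch.FloatTensor(r).to(device)
# The stored flag is 1.0 at episode end; flipping it here makes the bootstrap
# term (done * gamma * target_Q) in the critic update vanish at terminal states.
done = torch.FloatTensor(1 - d).to(device)
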
Updating the critic Q network is essentially the same as in DQN; the difference is that the TD target is not computed with a max over next actions ($r+\gamma \max_{a'} Q^{target}(s',a')$) but directly as
$r+\gamma Q^{target}(s',\mu^{target}(s'))$

# Compute the target Q value
target_Q = self.critic_target(next_state, self.actor_target(next_state))
target_Q = reward + (done * args.gamma * target_Q).detach()

# Get current Q estimate
current_Q = self.critic(state, action)

# Compute critic loss
critic_loss = F.mse_loss(current_Q, target_Q)
self.writer.add_scalar('Loss/critic_loss', critic_loss, global_step=self.num_critic_update_iteration)
# Optimize the critic
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

Updating the actor network. In the Actor-Critic framework, the idea behind the actor is to parameterize the policy directly and then improve it by maximizing a performance measure. The commonly used measure is
$$J(\theta)=\mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)]=\sum_{s \in \mathcal{S}} d^{\pi}(s) V^{\pi}(s)=\sum_{s \in \mathcal{S}} d^{\pi}(s) \sum_{a \in \mathcal{A}} \pi_{\theta}(a \mid s) Q^{\pi}(s, a)$$
This value is then maximized by gradient ascent, which requires its gradient with respect to the policy parameters, known as the policy gradient. The related derivation can be found here.
The policy gradient is $\nabla_{\theta} J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(s, a)\, Q^{\pi}(s, a)\right]$, so the corresponding loss function is $-\log \pi_{\theta}(s, a)\, Q^{\pi}(s, a)$. Minimizing this loss is equivalent to maximizing $\log \pi_{\theta}(s, a)\, Q^{\pi}(s, a)$, and hence $J(\theta)$ (this argument is only a rough sketch).
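
For reference, the gradient above comes from the log-derivative trick; a one-step sketch of the idea is
$$\nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)] = \int R(\tau)\,\nabla_{\theta} p_{\theta}(\tau)\,\mathrm{d}\tau = \int R(\tau)\, p_{\theta}(\tau)\,\nabla_{\theta} \log p_{\theta}(\tau)\,\mathrm{d}\tau = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[R(\tau)\,\nabla_{\theta} \log p_{\theta}(\tau)\right]$$
where $p_{\theta}(\tau)$ is the trajectory probability under $\pi_{\theta}$; expanding $\log p_{\theta}(\tau)$ and dropping the terms that do not depend on $\theta$ gives the per-step form above.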

In DDPG the policy changes from a stochastic policy to a deterministic one, and in the actual network the loss function is $-Q(s,\mu(s))$.
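
Minimizing this loss by backpropagation computes the deterministic policy gradient, which chains the critic's gradient with respect to the action through the actor:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s}\!\left[\nabla_{a} Q(s, a)\big|_{a=\mu_{\theta}(s)}\,\nabla_{\theta} \mu_{\theta}(s)\right]$$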

# Compute actor loss
actor_loss = -self.critic(state, self.actor(state)).mean()
self.writer.add_scalar('Loss/actor_loss', actor_loss, global_step=self.num_actor_update_iteration)

# Optimize the actor
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()

Soft-updating the target Q network and the target actor network.
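
This step is not included in the excerpts above; a sketch in the same style, assuming a soft-update coefficient args.tau, would be:

# Soft update: theta_target <- tau * theta + (1 - tau) * theta_target
for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
    target_param.data.copy_(args.tau * param.data + (1 - args.tau) * target_param.data)

for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
    target_param.data.copy_(args.tau * param.data + (1 - args.tau) * target_param.data)

A small tau keeps the target networks slowly tracking the online networks, which stabilizes the TD target used in the critic update.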

DDPG no longer depends on complete episode trajectories; instead it reuses experience by sampling from the replay buffer, which significantly improves data efficiency and extends the ideas of DQN to continuous action spaces.
