Learning stable-baselines3: the Logger


1. What is the Logger?

It prints real-time log data during training.

[Figure 1: example of the logger output printed during training]

To override the default logger, you can pass a new one to the algorithm. The available output formats are ["stdout", "csv", "log", "tensorboard", "json"].

from stable_baselines3 import A2C
from stable_baselines3.common.logger import configure

tmp_path = "/tmp/sb3_log/"
# set up logger
new_logger = configure(tmp_path, ["stdout", "csv", "tensorboard"])

model = A2C("MlpPolicy", "CartPole-v1", verbose=1)
# Set new logger
model.set_logger(new_logger)
model.learn(10000)

[Figure 2: output of the training run above]

After training completes:

[Figure 3: contents of the log directory after training]

tensorboard --logdir /tmp/sb3_log

[Figure 4: the logged metrics viewed in TensorBoard]
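
Besides TensorBoard, the "csv" format configured above also writes every logged value to a file in the log folder; in current SB3 versions that file is named progress.csv. A minimal sketch for inspecting it with pandas, assuming the training run above has finished:

import pandas as pd

# The "csv" output format writes one row per logger dump to progress.csv
log_df = pd.read_csv("/tmp/sb3_log/progress.csv")

# Column names match the logger keys explained in the next section,
# e.g. "rollout/ep_rew_mean" or "time/total_timesteps"
print(log_df.columns.tolist())
print(log_df[["time/total_timesteps", "rollout/ep_rew_mean"]].tail())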

2. What the logger output means

What exactly do the parameters printed inside the logger's box during training mean?

Take the following example, where an agent is trained with PPO:

[Figure 5: logger output when training an agent with PPO]

eval/

All eval/ values are computed by the EvalCallback (a usage sketch follows the list below).

  • mean_ep_length: Mean episode length (during evaluation)
  • mean_reward: Mean episodic reward (during evaluation)
  • success_rate: Mean success rate during evaluation (1.0 means 100% success); the environment's info dict must contain the is_success key for this value to be computed
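
A minimal sketch of attaching an EvalCallback so that eval/ values show up in the logs, assuming a separate CartPole-v1 evaluation environment; eval_freq=1000 and n_eval_episodes=5 are illustrative choices, not required defaults (with recent SB3 releases you would import gymnasium instead of gym):

import gym

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback

# Separate environment used only for periodic evaluation
eval_env = gym.make("CartPole-v1")

# Evaluate every 1000 steps, averaging over 5 episodes;
# the results appear as eval/mean_reward and eval/mean_ep_length
eval_callback = EvalCallback(eval_env, eval_freq=1000, n_eval_episodes=5)

model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000, callback=eval_callback)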

rollout/

  • ep_len_mean: Mean episode length (averaged over the last 100 episodes)

  • ep_rew_mean: Mean episodic reward (averaged over the last 100 episodes)

    A Monitor wrapper is required to compute this value (it is automatically added by make_vec_env)

  • exploration_rate: Current exploration rate when using DQN; it corresponds to the fraction of actions taken at random (the epsilon of the "epsilon-greedy" exploration)

  • success_rate: Mean success rate during training (over the last 100 episodes). You must pass an extra argument (info_keywords=("is_success",)) to the Monitor wrapper to log this value, and the environment must provide info["is_success"]=True/False on the last step of each episode (see the sketch after this list).
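
A minimal sketch of logging success_rate during training, assuming a made-up toy environment (ToySuccessEnv is a hypothetical name used only for illustration) that sets info["is_success"] on the last step of each episode; the old Gym API is used here, matching SB3 1.x:

import gym
import numpy as np
from gym import spaces

from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor


class ToySuccessEnv(gym.Env):
    """Toy environment that reports is_success at the end of each episode."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(2, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        done = self.steps >= 10
        # "is_success" on the terminal step is what success_rate is computed from
        info = {"is_success": bool(action == 1)} if done else {}
        return np.zeros(2, dtype=np.float32), float(action), done, info


# info_keywords tells the Monitor wrapper to record the extra "is_success" key
env = Monitor(ToySuccessEnv(), info_keywords=("is_success",))
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=2048)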

time/

  • episodes: Total number of episodes
  • fps: Number of frames per second (includes the time taken by the gradient update)
  • iterations: Number of iterations (data collection + policy update for A2C/PPO)
  • time_elapsed: Time in seconds since the beginning of training
  • total_timesteps: Total number of timesteps (steps in the environments)

train/

  • actor_loss: Current value for the actor loss for off-policy algorithms
  • approx_kl: Approximate mean KL divergence between the old and new policy (for PPO); it is an estimate of how much the policy changed in the update
  • clip_fraction: Mean fraction of the surrogate loss that was clipped (above the clip_range threshold) for PPO
  • clip_range: Current value of the clipping factor for the surrogate loss of PPO
  • critic_loss: Current value for the critic function loss for off-policy algorithms, usually the error between the value function output and the TD(0) (temporal difference) estimate
  • ent_coef: Current value of the entropy coefficient (when using SAC)
  • ent_coef_loss: Current value of the entropy coefficient loss (when using SAC)
  • entropy_loss: Mean value of the entropy loss (negative of the average policy entropy)
  • explained_variance: Fraction of the return variance explained by the value function
  • learning_rate: Current learning rate value
  • loss: Current total loss value
  • n_updates: Number of gradient updates applied so far
  • policy_gradient_loss: Current value of the policy gradient loss (its value does not have much meaning)
  • value_loss: Current value for the value function loss for on-policy algorithms, usually the error between the value function output and the Monte Carlo estimate (or TD(lambda) estimate)
  • std: Current standard deviation of the noise when using generalized State-Dependent Exploration (gSDE)
