The REINFORCE Algorithm


Original article: https://rllab.readthedocs.io/en/latest/user/implement_algo_basic.html
Translation: https://tigerneil.wordpress.com/2016/05/23/reinforce-algorithm/

In this section, we will walk through an implementation of the classic REINFORCE algorithm, also known as the "vanilla" policy gradient method. We will start with an implementation that works with a fixed policy and environment. The next section implements a more advanced version that uses the functionality provided by the framework to make the whole project more structured and command-line friendly.

Preliminaries

First, let's briefly review the algorithm along with some notation. We work with an MDP defined by the tuple (\mathcal{S},\mathcal{A},P,r,\mu_0,\gamma,T), where \mathcal{S} is the set of states, \mathcal{A} is the set of actions, P:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\rightarrow [0,1] is the transition probability, r:\mathcal{S}\times \mathcal{A} \rightarrow \mathbb{R} is the reward function, \mu_0 is the initial state distribution, \gamma \in [0,1] is the discount factor, and T\in \mathbb{N} is the horizon (episode length). The REINFORCE algorithm directly optimizes a parameterized stochastic policy \pi_\theta: \mathcal{S}\times\mathcal{A}\rightarrow [0,1] by performing gradient ascent on the expected-return objective:

\eta(\theta) = \mathbb{E}\Bigg[\sum\limits_{t=0}^{T} \gamma^t r(s_t, a_t)\Bigg]

where the expectation is implicitly taken over all possible trajectories, following the sampling procedure s_0\sim \mu_0, a_t\sim \pi_\theta(\cdot|s_t), and s_{t+1}\sim P(\cdot|s_t, a_t). By the likelihood ratio trick, the gradient of the objective with respect to \theta is given by:

\nabla_\theta \eta(\theta) = \mathbb{E}\Bigg[(\sum\limits_{t=0}^{T} \gamma^t r(s_t, a_t))(\sum\limits_{t=0}^{T}\nabla_\theta \log \pi_\theta (a_t|s_t))\Bigg]

Note that for all t' < t,

\mathbb{E}[r(s_{t'},a_{t'}) \nabla_\theta \log\pi_\theta (a_t|s_t)] = 0,

since, conditioned on the trajectory up to time t, the reward r(s_{t'},a_{t'}) is already determined and the score function \nabla_\theta \log\pi_\theta(a_t|s_t) has zero expectation. This observation lets us reduce the variance of the estimator.

Hence,

\nabla_\theta\eta(\theta) = \mathbb{E}\Bigg[\sum\limits_{t=0}^{T}\nabla_\theta\log \pi_\theta(a_t|s_t)\sum\limits_{t'=t}^{T}\gamma^{t'}r(s_{t'},a_{t'})\Bigg]

In practice, we typically use the following estimator instead:

\nabla_\theta\eta(\theta) = \mathbb{E}\Bigg[\sum\limits_{t=0}^{T}\nabla_\theta\log \pi_\theta(a_t|s_t)\sum\limits_{t'=t}^{T}\gamma^{t'-t}r(s_{t'},a_{t'})\Bigg]

where \gamma^{t'} is replaced by \gamma^{t'-t}. If we view the discount factor as a variance-reduction factor for an otherwise undiscounted objective, this estimator has lower bias, at the cost of somewhat higher variance. We define R_t := \sum\limits_{t'=t}^{T} \gamma^{t'-t} r(s_{t'},a_{t'}) as the empirical discounted return.
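
As a quick sanity check of this definition, consider a hypothetical three-step episode (not part of the original text). The returns can be computed backwards in a single pass, exactly as the sampling code below will do:

# Hypothetical example: three rewards of 1.0 each, discount 0.99.
rewards = [1.0, 1.0, 1.0]
discount = 0.99

returns = []
return_so_far = 0.0
for r in reversed(rewards):
    return_so_far = r + discount * return_so_far
    returns.append(return_so_far)
# The returns were accumulated backwards in time, so reverse them
returns = returns[::-1]

print(returns)  # approximately [2.9701, 1.99, 1.0]: R_0 = 1 + 0.99*1 + 0.99^2*1, etc.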

The formula above is the core of our implementation. The pseudocode for the whole algorithm is as follows (a schematic Python skeleton of this loop is given right after it):

Initialize the policy \pi with parameters \theta_1.
For iteration k = 1,2,\dots:
    Sample N trajectories \tau_1,\dots,\tau_N under the current policy \theta_k, where \tau_i = (s_t^i, a_t^i, R_t^i)_{t=0}^{T-1}. Note that the last state is dropped, since no action is taken after the final state is observed.
    Compute the empirical policy gradient: \widehat{\nabla_\theta\eta(\theta)} = \frac{1}{NT}\sum\limits_{i=1}^{N}\sum\limits_{t=0}^{T-1} \nabla_\theta\log \pi_\theta(a_t^i|s_t^i)R_t^i
    Take a gradient step: \theta_{k+1} = \theta_k + \alpha\widehat{\nabla_\theta\eta(\theta)}
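
Before diving into the implementation, here is a rough Python skeleton of the loop above; all of the names used here (initialize_policy_parameters, sample_trajectory, estimate_policy_gradient, n_iterations, alpha) are placeholders for the code developed in the sections below, not rllab functions:

# Schematic sketch only -- the helpers are stand-ins for the code shown step by step below.
theta = initialize_policy_parameters()
for k in range(n_iterations):
    # 1. Sample N trajectories of length at most T under the current policy
    paths = [sample_trajectory(theta, T) for _ in range(N)]
    # 2. Form the empirical policy gradient from the log-likelihood gradients
    #    weighted by the empirical discounted returns R_t
    grad = estimate_policy_gradient(paths, theta)
    # 3. Take a gradient ascent step with step size alpha
    theta = theta + alpha * grad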

Setup

To get started, let's try to solve the cartpole balancing problem with a neural network policy. We will later generalize the algorithm to accept configuration parameters; for now, let's look at it in its simplest form.

from __future__ import print_function
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.envs.normalized_env import normalize
import numpy as np
import theano
import theano.tensor as TT
from lasagne.updates import adam
 
# normalize() makes sure that the actions for the environment lies
# within the range [-1, 1] (only works for environments with continuous actions)
env = normalize(CartpoleEnv())
# Initialize a neural network policy with a single hidden layer of 8 hidden units
policy = GaussianMLPPolicy(env.spec, hidden_sizes=(8,))
 
# We will collect 100 trajectories per iteration
N = 100
# Each trajectory will have at most 100 time steps
T = 100
# Number of iterations
n_itr = 100
# Set the discount factor for the problem
discount = 0.99
# Learning rate for the gradient update
learning_rate = 0.01
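
If you are running this interactively, it can help to inspect the observation and action spaces before moving on (this is just a small sanity check, not part of the original example):

# Inspect the spaces that the policy and the symbolic variables below rely on.
print(env.observation_space)
print(env.action_space)  # continuous actions, scaled to [-1, 1] by normalize()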

Collecting Samples

Now, let's collect samples under the current policy for one iteration.

paths = []
 
for _ in xrange(N):
    observations = []
    actions = []
    rewards = []

    observation = env.reset()

    for _ in xrange(T):
        # policy.get_action() returns a pair of values. The second one returns a dictionary, whose values contains
        # sufficient statistics for the action distribution. It should at least contain entries that would be
        # returned by calling policy.dist_info(), which is the non-symbolic analog of policy.dist_info_sym().
        # Storing these statistics is useful, e.g., when forming importance sampling ratios. In our case it is
        # not needed.
        action, _ = policy.get_action(observation)
        # Recall that the last entry of the tuple stores diagnostic information about the environment. In our
        # case it is not needed.
        next_observation, reward, terminal, _ = env.step(action)
        observations.append(observation)
        actions.append(action)
        rewards.append(reward)
        observation = next_observation
        if terminal:
            # Finish rollout if terminal state reached
            break

    # We need to compute the empirical return for each time step along the
    # trajectory
    returns = []
    return_so_far = 0
    for t in xrange(len(rewards) - 1, -1, -1):
        return_so_far = rewards[t] + discount * return_so_far
        returns.append(return_so_far)
    # The returns are stored backwards in time, so we need to revert it
    returns = returns[::-1]

    paths.append(dict(
        observations=np.array(observations),
        actions=np.array(actions),
        rewards=np.array(rewards),
        returns=np.array(returns)
    ))

Following the formula for the empirical policy gradient, we can concatenate all the collected data, which helps us vectorize the implementation.

observations = np.concatenate([p["observations"] for p in paths])
actions = np.concatenate([p["actions"] for p in paths])
returns = np.concatenate([p["returns"] for p in paths])
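
At this point the three arrays are aligned along a single time-step axis whose length is the total number of steps collected across all trajectories. An optional shape check (not in the original) looks like this:

# All three arrays share the same leading dimension (total time steps).
print(observations.shape)  # (total_steps, observation_dim)
print(actions.shape)       # (total_steps, action_dim)
print(returns.shape)       # (total_steps,)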

Constructing the Computation Graph

We will use Theano for the implementation, and we assume the reader has some familiarity with it. If not, please consult some tutorials.

First, we construct symbolic variables for the input data:

# Create a Theano variable for storing the observations
# We could have simply written `observations_var = TT.matrix('observations')` instead for this example. However,
# doing it in a slightly more abstract way allows us to delegate to the environment for handling the correct data
# type for the variable. For instance, for an environment with discrete observations, we might want to use integer
# types if the observations are represented as one-hot vectors.
observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)
actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)
returns_var = TT.vector('returns')

Observe that we can rewrite the policy gradient formula as follows:

\widehat{\nabla_\theta\eta(\theta)} = \nabla_\theta\Bigg(\frac{1}{NT}\sum\limits_{i=1}^{N}\sum\limits_{t=0}^{T-1}\log\pi_\theta(a_t^i|s_t^i)R_t^i \Bigg) = \nabla_\theta L(\theta)

where L(\theta) = \frac{1}{NT} \sum\limits_{i=1}^{N}\sum\limits_{t=0}^{T-1} \log \pi_\theta(a_t^i|s_t^i)R_t^i is called the surrogate function. Hence, we can first construct the computation graph for L(\theta), and then take its gradient to obtain the empirical policy gradient.

# policy.dist_info_sym returns a dictionary, whose values are symbolic expressions for quantities related to the
# distribution of the actions. For a Gaussian policy, it contains the mean and (log) standard deviation.
dist_info_vars = policy.dist_info_sym(observations_var, actions_var)
 
# policy.distribution returns a distribution object under rllab.distributions. It contains many utilities for computing
# distribution-related quantities, given the computed dist_info_vars. Below we use dist.log_likelihood_sym to compute
# the symbolic log-likelihood. For this example, the corresponding distribution is an instance of the class
# rllab.distributions.DiagonalGaussian
dist = policy.distribution
 
# Note that we negate the objective, since most optimizers assume a
# minimization problem
surr = - TT.mean(dist.log_likelihood_sym(actions_var, dist_info_vars) * returns_var)
 
# Get the list of trainable parameters.
params = policy.get_params(trainable=True)
grads = theano.grad(surr, params)

Gradient Update and Diagnostics

We are almost done! Now, you can use your favorite stochastic optimization algorithm to perform the parameter update. We use ADAM:

f_train = theano.function(
    inputs=[observations_var, actions_var, returns_var],
    outputs=None,
    updates=adam(grads, params, learning_rate=learning_rate),
    allow_input_downcast=True
)
f_train(observations, actions, returns)

Since the algorithm is on-policy, we can evaluate its performance by inspecting the collected samples:

print('Average Return:', np.mean([sum(path["rewards"]) for path in paths]))

The complete code so far is available in examples/vpg_1.py.
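
For reference, here is roughly how the pieces above fit together into a full training loop. This is a condensed sketch, not a verbatim copy of examples/vpg_1.py:

# Condensed sketch of the overall training loop; the rollout code is the
# sampling snippet shown earlier, abbreviated here as a comment.
for itr in range(n_itr):
    paths = []
    for _ in range(N):
        # ... roll out one trajectory with policy.get_action() / env.step(),
        # compute the discounted returns, and append the path dict to paths ...
        pass
    observations = np.concatenate([p["observations"] for p in paths])
    actions = np.concatenate([p["actions"] for p in paths])
    returns = np.concatenate([p["returns"] for p in paths])
    f_train(observations, actions, returns)
    print('Average Return:', np.mean([sum(p["rewards"]) for p in paths]))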

The variance of the policy gradient can be further reduced by adding a baseline. The refined formula is as follows:


\widehat{\nabla_\theta\eta(\theta)} = \frac{1}{NT}\sum\limits_{i=1}^{N}\sum\limits_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t^i|s_t^i)(R_t^i-b(s_t^i))

We are able to do this because

\mathbb{E}[\nabla_\theta\log\pi_\theta(a_t^i|s_t^i)b(s_t^i)]=0,

since the baseline b(s_t^i) does not depend on the action a_t^i and the score function has zero expectation.

The baseline is typically implemented as an estimator of V^\pi(s). Here, R_t^i - b(s_t^i) is then an estimator of the advantage A^\pi(s_t^i, a_t^i). The framework implements a few different choices for the baseline; a linear baseline using features of the state gives a good balance between performance and accuracy, and is available in rllab/baselines/linear_feature_baseline.py. The relevant code for using this implementation is as follows:

# ... initialization code ...
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
baseline = LinearFeatureBaseline(env.spec)
# ... inside the loop for each episode, after the samples are collected
path = dict(
    observations=np.array(observations),
    actions=np.array(actions),
    rewards=np.array(rewards),
)
path_baseline = baseline.predict(path)
advantages = []
returns = []
return_so_far = 0
for t in xrange(len(rewards) - 1, -1, -1):
    return_so_far = rewards[t] + discount * return_so_far
    returns.append(return_so_far)
    advantage = return_so_far - path_baseline[t]
    advantages.append(advantage)
# The advantages are stored backwards in time, so we need to revert it
advantages = np.array(advantages[::-1])
# And we need to do the same thing for the list of returns
returns = np.array(returns[::-1])
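
One detail worth spelling out: with a baseline, the advantages take over the role that the returns played in the surrogate loss, so the symbolic returns_var above would be replaced by an advantages variable, and the compiled training function would be fed the concatenated advantages instead of the returns, i.e. something along the lines of:

# Sketch: the advantages replace the returns as the weights in the surrogate.
f_train(observations, actions, advantages)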

Normalizing the Returns

Currently, the learning rate is sensitive to the scale of the reward values. We can reduce this dependency by whitening the advantages before computing the gradients. In code:

advantages = (advantages - np.mean(advantages)) / (np.std(advantages) + 1e-8)

Training the Baseline

On each iteration, we use the newly collected trajectories to train the baseline:

baseline.fit(paths)

This step is performed after computing the baseline values above because, in the extreme case, if we only had one trajectory starting from each state and the baseline could fit the data perfectly, then all of the advantages would be zero and there would be no gradient signal at all.
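
Putting these two points together, the per-iteration ordering is: predict with the baseline fitted on the previous iteration's data, compute (and normalize) the advantages, and only then refit the baseline. A rough sketch, where collect_paths is a placeholder for the sampling code above:

# Per-iteration ordering, sketched: the baseline used for prediction is the
# one fitted on the previous iteration's paths.
for itr in range(n_itr):
    paths = collect_paths()  # placeholder for the rollout code above
    for path in paths:
        path_baseline = baseline.predict(path)
        # ... compute advantages = discounted returns - path_baseline ...
    # ... normalize the advantages, concatenate the data, call f_train(...) ...
    baseline.fit(paths)  # refit for the next iteration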

Now we can train the policy much faster (note that the learning rate needs to be adjusted because of the rescaling). The complete code is available in examples/vpg_2.py.
