A policy network is a neural network model that observes the environment state and directly predicts the policy that should currently be executed, such that following this policy yields the maximum expected return. Instead of using only the current reward as the expected return, the policy network uses the discounted future reward: future rewards are multiplied by a decay factor γ (a number slightly smaller than 1), so the expected return is $r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots + \gamma^{n-1} r_n$.
Policy Gradients refers to the process by which the model learns from the feedback its Actions receive in the Environment and updates its parameters with gradients. During training, the model sees good Actions together with the high expected values they bring, and bad Actions with their low expected values; by learning from these samples, it increases the probability of choosing good Actions and decreases the probability of choosing bad ones.
What a policy network learns is not the expected value of a particular Action, but directly the policy to follow in the current state: for example, the probability of choosing each Action (when the set of Actions is finite, good Actions get higher probabilities), or the concrete value of an Action (when the Action is continuous; for instance, in a racing game the action is to steer toward some direction, which gives a continuous 0–360 degree space to choose from).
Compared with value-based methods, policy-based methods have better convergence properties (they can usually be guaranteed to converge to at least a local optimum and do not diverge), are much more efficient for high-dimensional and continuous Actions (both training and inference), and can learn stochastic policies. For example, in rock-paper-scissors any deterministic strategy can be learned and exploited by the opponent, so a random strategy is the one that cannot be beaten; in this case, a policy network can learn to assign equal probability to the three Actions.
Example: CartPole
Code outline:
1. After each complete episode, the [observation, action, discounted_rewards] data generated during that episode is used to train the neural network.
2. Within a complete episode, the earlier an Action is, the larger its expected value, because it kept the Pole upright for a long time; the later an Action is, the smaller its expected value, because it may be what caused the Pole to fall. For actions 1 through n, the discounted_reward of the i-th action is $r_i + \gamma r_{i+1} + \gamma^2 r_{i+2} + \cdots + \gamma^{n-i} r_n$,
so the expected values of actions 1 through n increase step by step from the last action to the first (see the small numerical sketch after this list).
3. The network's input is observation and its output is action. The output uses the Softmax + cross_entropy form, and the action is sampled according to the output probabilities. discounted_norm_rewards are the discounted per-action rewards after normalization; in the network, the loss is not the plain cross_entropy but the cross_entropy weighted by discounted_norm_rewards: the larger an action's discounted_reward, the larger its weight in the loss and the larger the gradient step it receives.
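Before the full program, here is a minimal NumPy sketch (an illustration added here, not part of the program below) of how the discounted rewards from step 2 and the normalized loss weights from step 3 come out on a toy four-step episode:

import numpy as np

gamma = 0.95
rewards = [1.0, 1.0, 1.0, 1.0]   # CartPole-style: +1 for every surviving step

# backward recursion: G_t = r_t + gamma * G_{t+1}
discounted = np.zeros(len(rewards))
running_add = 0.0
for t in reversed(range(len(rewards))):
    running_add = running_add * gamma + rewards[t]
    discounted[t] = running_add
print(discounted)   # approximately [3.71, 2.85, 1.95, 1.0]: earlier actions get larger values

# normalization; these are the vt weights that multiply the cross entropy in the loss
vt = (discounted - discounted.mean()) / discounted.std()
print(vt)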
import numpy as np
import tensorflow as tf
import gym
class PolicyGradient(object):
    def __init__(self,
                 n_actions,
                 n_features,
                 learning_rate=0.01,
                 reward_decay=0.95):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        # buffers for the observations, actions and rewards of one episode
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features])
        self.tf_acts = tf.placeholder(tf.int32, [None, ])
        self.tf_vt = tf.placeholder(tf.float32, [None, ])
        hidden_layer = tf.layers.dense(
            inputs=self.tf_obs,
            units=10,
            activation=tf.nn.tanh,  # works a bit better than relu here
            kernel_initializer=tf.random_normal_initializer(mean=0.0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1)
        )
        output_layer = tf.layers.dense(
            inputs=hidden_layer,
            units=self.n_actions,
            kernel_initializer=tf.random_normal_initializer(mean=0.0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1),
        )
        self.action_prob = tf.nn.softmax(output_layer)
        # cross entropy weighted by the normalized discounted rewards
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output_layer, labels=self.tf_acts)
        self.loss = tf.reduce_mean(cross_entropy * self.tf_vt)
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
    # choose an action by sampling from the predicted probabilities
    def choose_action(self, observation):
        prob_weights = self.sess.run(self.action_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})
        action = np.random.choice(np.arange(prob_weights.shape[1]), p=prob_weights.ravel())
        return action

    # store one transition of the current episode
    def store_transition(self, s, a, r):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)
    def learn(self):
        discounted_ep_rs_norm = self.discount_norm_rewards()
        self.sess.run(self.train_op, feed_dict={
            self.tf_obs: np.vstack(self.ep_obs),   # shape=[None, n_features]
            self.tf_acts: np.array(self.ep_as),    # shape=[None, ]
            self.tf_vt: discounted_ep_rs_norm,     # shape=[None, ]
        })
        # clear the episode buffers after each update
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        return discounted_ep_rs_norm
    def discount_norm_rewards(self):
        # discounted value of each action: immediate reward plus decayed future rewards
        discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)
        running_add = 0
        for t in reversed(range(len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add
        # normalize episode rewards
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        discounted_ep_rs /= np.std(discounted_ep_rs)
        return discounted_ep_rs
env = gym.make('CartPole-v0').unwrapped
rl = PolicyGradient(
    n_actions=env.action_space.n,
    n_features=env.observation_space.shape[0],
    learning_rate=0.02,
    reward_decay=0.99
)
max_episode = 120
for episode in range(max_episode):
    observation = env.reset()
    ep_rewards = 0
    while True:
        # env.render()
        action = rl.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        ep_rewards += reward
        rl.store_transition(observation, action, reward)
        if done:
            print('episode: ', episode, 'reward: ', ep_rewards)
            vt = rl.learn()
            break
        observation = observation_
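As a quick sanity check of the trained CartPole policy, one can run a single greedy episode, i.e. always take the most probable action instead of sampling. This is only a sketch reusing the rl and env objects defined above; the step cap of 1000 is an arbitrary choice since the env is unwrapped:

# sketch: one greedy evaluation episode with the trained policy
observation = env.reset()
total_reward = 0
for _ in range(1000):   # cap the episode length because the env is unwrapped
    probs = rl.sess.run(rl.action_prob, feed_dict={rl.tf_obs: observation[np.newaxis, :]})
    action = int(np.argmax(probs))   # greedy: pick the most probable action
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print('greedy episode reward: ', total_reward)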
Example 2: MountainCar
env = gym.make('MountainCar-v0')
env.seed(1)
env = env.unwrapped
rl = PolicyGradient(
    n_actions=env.action_space.n,
    n_features=env.observation_space.shape[0],
    learning_rate=0.02,
    reward_decay=0.99
)
max_episode = 1000
for episode in range(max_episode):
    observation = env.reset()
    ep_rewards = 0
    while True:
        # env.render()
        action = rl.choose_action(observation)
        observation_, reward, done, info = env.step(action)  # every step returns reward = -1
        ep_rewards += reward
        rl.store_transition(observation, action, reward)
        if done:
            print('episode: ', episode, 'reward: ', ep_rewards)
            vt = rl.learn()
            break
        observation = observation_
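The vt returned by rl.learn() holds the normalized discounted reward of every step in the last episode; plotting it shows what the network is actually weighting during the update. A small optional sketch, assuming matplotlib is available (the plot is not part of the original program):

import matplotlib.pyplot as plt

plt.plot(vt)   # one value per time step of the last episode
plt.xlabel('episode step')
plt.ylabel('normalized discounted reward (vt)')
plt.show()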