Reinforcement Learning - Policy Gradients

A policy network is a neural network model that observes the environment state and directly predicts the Policy that should be executed now, such that executing this policy yields the maximum expected return. The policy network does not use only the immediate reward as the expected return; it uses the discounted future reward, i.e., future rewards are multiplied by a decay factor γ, a number slightly smaller than 1. The expected return is

$r = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots + \gamma^{n-1} r_n$
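
For instance, with γ = 0.95 and four unit rewards, the discounted return works out as follows (a minimal sketch; the reward values are hypothetical):

rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical per-step rewards r_1..r_4
gamma = 0.95                    # decay factor slightly below 1
r = sum(gamma ** i * ri for i, ri in enumerate(rewards))
print(r)  # 1 + 0.95 + 0.95**2 + 0.95**3 ≈ 3.71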

Policy Gradients refers to the process of updating the model parameters with gradients, based on the feedback that Actions receive from the Environment. During training the model sees good Actions together with the high expected values they bring, and bad Actions together with their low expected values; by learning from these samples, the model increases the probability of choosing good Actions and decreases the probability of choosing bad ones.

The policy network does not learn the expected value of a particular Action; it directly learns the policy to follow in the current environment, for example the probability of choosing each Action (when the set of Actions is finite, good Actions get larger probabilities), or the concrete value of an Action (when the Action is continuous, e.g. in a racing game the action is to steer in some direction, giving a continuous 0–360 degree range to choose from). A small sketch of both output forms follows.
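
A minimal sketch of these two output heads, using the same TF 1.x API as the code below (the layer sizes and the steering-angle head are illustrative assumptions, not part of the original model):

import tensorflow as tf

obs = tf.placeholder(tf.float32, [None, 4])
hidden = tf.layers.dense(obs, 10, activation=tf.nn.tanh)

# Finite action set: output a probability for each action via a softmax head.
logits = tf.layers.dense(hidden, 2)
action_prob = tf.nn.softmax(logits)

# Continuous action (hypothetical steering angle in [0, 360)): squash a single
# output into the valid range instead of producing probabilities.
angle = 360.0 * tf.nn.sigmoid(tf.layers.dense(hidden, 1))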

Compared with value-based methods, policy-based methods have better convergence properties (they usually converge at least to a local optimum and do not diverge), are far more efficient for high-dimensional and continuous Actions (both training and inference are more efficient), and can learn stochastic policies. For example, in rock-paper-scissors any patterned strategy can be learned and exploited by the opponent, so a random strategy is actually unbeatable; in this case the policy network can learn to give the three Actions equal probability.

Example: CartPole

Code outline:

1. After each complete episode, the [observation, action, discounted_rewards] data generated during that episode is used to train the neural network.

2. Within a complete episode, earlier Actions have larger expected values, because they kept the Pole upright for a long time; later Actions have smaller expected values, because they may be what caused the Pole to fall. For actions 1 to n, the discounted_reward of the i-th action is $r_i + \gamma r_{i+1} + \gamma^2 r_{i+2} + \cdots + \gamma^{n-i} r_n$.

Thus, going from the last action back to the first, the expected values of actions 1 to n increase.

3. The network input is the observation and the output is the action; the output uses the Softmax + cross_entropy form, and the action is then sampled according to the predicted probabilities. discounted_norm_rewards are the per-action discounted rewards after normalization. In the network, the loss is not a plain cross_entropy; it is weighted by discounted_norm_rewards, so actions with larger discounted_rewards carry a larger error weight and receive a proportionally larger gradient update.

loss = cross_entropy * discounted_norm_rewards
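
Since action_prob is the softmax of output_layer, tf.nn.sparse_softmax_cross_entropy_with_logits returns -log π(a_t|s_t) for the action actually taken, so the loss above is the standard REINFORCE objective with v_t as the normalized discounted reward:

$$loss = \frac{1}{N}\sum_{t=1}^{N}\big(-\log \pi_\theta(a_t \mid s_t)\big)\, v_t$$

Minimizing it raises the log-probability of actions whose v_t is large and lowers it for actions whose normalized v_t is negative.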

import numpy as np
import tensorflow as tf
import gym


class PolicyGradient(object):
    def __init__(self,
                 n_actions,
                 n_features,
                 learning_rate=0.01,
                 reward_decay=0.95):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []

        self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features])
        self.tf_acts = tf.placeholder(tf.int32, [None, ])
        self.tf_vt = tf.placeholder(tf.float32, [None, ])

        hidden_layer = tf.layers.dense(
            inputs=self.tf_obs,
            units=10,
            activation=tf.nn.tanh,  # works a bit better than relu here
            kernel_initializer=tf.random_normal_initializer(mean=0.0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1)
        )

        output_layer = tf.layers.dense(
            inputs=hidden_layer,
            units=self.n_actions,
            kernel_initializer=tf.random_normal_initializer(mean=0.0, stddev=0.3),
            bias_initializer=tf.constant_initializer(0.1),
        )
        self.action_prob = tf.nn.softmax(output_layer)
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output_layer, labels=self.tf_acts)
        self.loss = tf.reduce_mean(cross_entropy * self.tf_vt)
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(self.loss)

        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    # choose an action by sampling from the predicted probabilities
    def choose_action(self, observation):
        prob_weights = self.sess.run(self.action_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})
        action = np.random.choice(np.arange(prob_weights.shape[1]), p=prob_weights.ravel())
        return action

    def store_transition(self, s, a, r):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def learn(self):
        discounted_ep_rs_norm = self.discount_norm_rewards()
        self.sess.run(self.train_op, feed_dict={
            self.tf_obs: np.vstack(self.ep_obs),  # shape=[None, n_obs]
            self.tf_acts: np.array(self.ep_as),  # shape=[None, ]
            self.tf_vt: discounted_ep_rs_norm,  # shape=[None, ]
        })
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        return discounted_ep_rs_norm

    def discount_norm_rewards(self):
        # discounted reward of each action: its immediate reward plus the decayed value of all future rewards
        discounted_ep_rs = np.zeros_like(self.ep_rs)
        running_add = 0
        for t in reversed(range(len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add
        # normalize episode rewards
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        discounted_ep_rs /= np.std(discounted_ep_rs)
        return discounted_ep_rs


env = gym.make('CartPole-v0').unwrapped
rl = PolicyGradient(
    n_actions=env.action_space.n,
    n_features=env.observation_space.shape[0],
    learning_rate=0.02,
    reward_decay=0.99
)

max_episode = 120
for episode in range(max_episode):
    observation = env.reset()
    ep_rewards = 0
    while True:
        # env.render()
        action = rl.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        ep_rewards += reward
        rl.store_transition(observation, action, reward)
        if done:
            print('episode: ', episode, 'reward: ', ep_rewards)
            vt = rl.learn()
            break
        observation = observation_
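
As a quick check on the reward shaping described in step 2 (a standalone sketch, separate from the script above), the backward pass in discount_norm_rewards reproduces r_i + γr_{i+1} + γ²r_{i+2} + ⋯ for every step, so earlier actions end up with larger values before normalization:

import numpy as np

rewards = [1.0, 1.0, 1.0]  # hypothetical 3-step episode
gamma = 0.95

discounted = np.zeros(len(rewards))
running_add = 0.0
for t in reversed(range(len(rewards))):
    running_add = running_add * gamma + rewards[t]  # r_t + gamma * (discounted tail)
    discounted[t] = running_add

print(discounted)  # approx. [2.8525, 1.95, 1.0] -- earlier actions get larger values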

Example 2: MountainCar

env = gym.make('MountainCar-v0')
env.seed(1)
env = env.unwrapped
rl = PolicyGradient(
    n_actions=env.action_space.n,
    n_features=env.observation_space.shape[0],
    learning_rate=0.02,
    reward_decay=0.99
)

max_episode = 1000
for episode in range(max_episode):
    observation = env.reset()
    ep_rewards = 0
    while True:
        # env.render()
        action = rl.choose_action(observation)
        observation_, reward, done, info = env.step(action)  # every step returns reward = -1
        ep_rewards += reward
        rl.store_transition(observation, action, reward)
        if done:
            print('episode: ', episode, 'reward: ', ep_rewards)
            vt = rl.learn()
            break
        observation = observation_
