Deep Reinforcement Learning: Policy Gradients and Optimization (Part 1) — Policy Gradient

Introduction

The reinforcement learning algorithms covered so far, such as DQN, DRQN, and A3C, all share the same goal: find the right policy so that the agent obtains the maximum reward. Since the Q function tells us which action is the best one to perform in a given state, those methods use the Q function to find the optimal policy. With policy gradient methods, however, we can obtain the optimal policy without using a Q function at all.

Policy Gradients

Policy gradient is a remarkable reinforcement learning (RL) algorithm that directly optimizes a parameterized policy with respect to its parameters. Previously we learned how to use the Q function to find the optimal policy; now we will learn how to find the optimal policy without a Q function. First, we define the policy function $\pi(a|s)$ as the probability of taking action $a$ given state $s$. We then parameterize the policy with a parameter vector $\theta$, writing it as $\pi(a|s;\theta)$, which lets the network determine the best action in a given state.
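To make the objective explicit (this is the standard REINFORCE formulation, not spelled out in the original text): we maximize the expected discounted return of trajectories sampled from the policy, and the gradient of that objective is estimated from sampled log-probabilities weighted by returns:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \gamma^{t} r_t\Big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi(a_t \mid s_t;\theta)\, G_t\Big],
$$

where $G_t$ is the discounted return from step $t$ onward. The "cross entropy times reward" loss used later in this post is one way of implementing this gradient estimate.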

Policy gradient methods have many advantages; in particular, they can handle continuous action spaces with an infinite number of actions and states. Say we are building a self-driving car: the car should drive without colliding with any other vehicle. It receives a negative reward when it hits another vehicle and a positive reward when it does not hit anything. We then update the model parameters so that the car collects only positive rewards, i.e. so that it never crashes into another vehicle. This is the basic idea of policy gradients: update the model parameters in a way that maximizes the reward. Let us now analyze this in detail.

Here we use a neural network to find the optimal policy, and we call it the policy network. The input to the policy network is a state, and the output is the probability of each action in that state. Once we have these probabilities, we sample an action from the resulting distribution and execute it in that state. The sampled action may not be the correct action for that state, and that is fine: we execute it and store the reward. Likewise, in every state we sample an action from the distribution, execute it, and store the reward. This collected information becomes our training data. We then perform gradient descent and update the parameters such that actions yielding a high reward in a state get a higher probability, while actions yielding a low reward get a lower probability. What is the loss function? Here we use the softmax cross-entropy loss and multiply it by the reward.
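As a minimal NumPy sketch of this loss (hypothetical helper name, independent of the TensorFlow graph built later), the per-step cross entropy between the policy's output distribution and the one-hot sampled action is weighted by that step's return:

import numpy as np

def reward_weighted_loss(logits, one_hot_actions, returns):
    # softmax over the action dimension
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # cross entropy = -log(probability of the action that was actually taken)
    neg_log_prob = -np.log((probs * one_hot_actions).sum(axis=1))
    # weighting by the return makes high-return actions more probable after the update
    return np.mean(neg_log_prob * returns)

# toy example: 2 time steps, 4 actions (as in Lunar Lander)
logits = np.array([[0.1, 0.5, -0.2, 0.0],
                   [0.3, -0.1, 0.2, 0.4]])
actions = np.array([[0., 1., 0., 0.],
                    [0., 0., 0., 1.]])
returns = np.array([1.2, -0.4])
print(reward_weighted_loss(logits, actions, returns))

Minimizing this quantity increases the log-probability of actions followed by large returns and decreases it for actions followed by small or negative returns, which is exactly the update behavior described above.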

Lunar Lander Using Policy Gradients

Suppose our agent is flying a space vehicle, and the goal is to land it correctly on the landing pad. If the agent (the lander) lands away from the landing pad, it loses reward, and the episode ends if the agent crashes or comes to rest. The four discrete actions available in the environment are: do nothing, fire the left engine, fire the main engine, and fire the right engine.

Now let us see how to train the agent with policy gradients so that it lands correctly on the landing pad.
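Before the full implementation, a quick look at the environment's spaces (assuming gym with the Box2D extras installed; the exact printed representation varies by gym version) shows where the n_x and n_y constructor arguments used below come from:

import gym

env = gym.make('LunarLander-v2')
print(env.observation_space)   # Box(8,)     -> n_x = 8 state features
print(env.action_space)        # Discrete(4) -> n_y = 4 discrete actions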

Code:

import tensorflow as tf
import numpy as np
from tensorflow.python.framework import ops
import gym
class PolicyGradient:   
    
    # first we define the __init__ method where we initialize all variables
    
    def __init__(self, n_x,n_y,learning_rate=0.01, reward_decay=0.95):
            
        # number of states in the environment
        self.n_x = n_x 
        
        # number of actions in the environment
        self.n_y = n_y
        
        # learning rate of the network
        self.lr = learning_rate
        
        # discount factor
        self.gamma = reward_decay 
    
        # initialize the lists for storing observations, actions and rewards
        self.episode_observations, self.episode_actions, self.episode_rewards = [], [], []
        
        # we define a function called build_network for building the neural network
        self.build_network()
        
        # stores the cost i.e loss
        self.cost_history = []
        
        # initialize tensorflow session
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())
        

    # next we define a function called store_transition which stores the transition information
    # i.e. state, action, and reward; we can use these transitions for training the network

    def store_transition(self, s, a, r):
        
        self.episode_observations.append(s)
        self.episode_rewards.append(r)

        # store actions as a list of one-hot arrays
        action = np.zeros(self.n_y)
        action[a] = 1
        self.episode_actions.append(action)
        
    # now, we define a function choose_action for choosing the action given the state

    def choose_action(self, observation):

        # reshape observation to (num_features, 1)
        observation = observation[:, np.newaxis]

        # run forward propagation to get softmax probabilities
        prob_weights = self.sess.run(self.outputs_softmax, feed_dict={self.X: observation})

        # sample an action from the softmax probability distribution;
        # this returns the index of the sampled action
        action = np.random.choice(range(len(prob_weights.ravel())), p=prob_weights.ravel())
        
        return action

    
    # we define build_network for creating our neural network
    
    def build_network(self):
        
        # self.X holds the observations (states) from the environment
        # self.Y holds the one-hot actions that were taken
        # self.discounted_episode_rewards_norm holds the discounted, normalized reward for each step
        self.X = tf.placeholder(tf.float32, shape=(self.n_x, None), name="X")
        self.Y = tf.placeholder(tf.float32, shape=(self.n_y, None), name="Y")
        self.discounted_episode_rewards_norm = tf.placeholder(tf.float32, [None, ], name="actions_value")

        # we build a 3-layer neural network with 2 hidden layers and 1 output layer
        
        # number of neurons in the hidden layer
        units_layer_1 = 10
        units_layer_2 = 10
        
        # number of neurons in the output layer
        units_output_layer = self.n_y
        
        # now let us initialize weights and bias value using tensorflow's tf.contrib.layers.xavier_initializer
        
        W1 = tf.get_variable("W1", [units_layer_1, self.n_x], initializer = tf.contrib.layers.xavier_initializer(seed=1))
        b1 = tf.get_variable("b1", [units_layer_1, 1], initializer = tf.contrib.layers.xavier_initializer(seed=1))
        W2 = tf.get_variable("W2", [units_layer_2, units_layer_1], initializer = tf.contrib.layers.xavier_initializer(seed=1))
        b2 = tf.get_variable("b2", [units_layer_2, 1], initializer = tf.contrib.layers.xavier_initializer(seed=1))
        W3 = tf.get_variable("W3", [self.n_y, units_layer_2], initializer = tf.contrib.layers.xavier_initializer(seed=1))
        b3 = tf.get_variable("b3", [self.n_y, 1], initializer = tf.contrib.layers.xavier_initializer(seed=1))

        # and then, we perform forward propagation

        Z1 = tf.add(tf.matmul(W1,self.X), b1)
        A1 = tf.nn.relu(Z1)
        Z2 = tf.add(tf.matmul(W2, A1), b2)
        A2 = tf.nn.relu(Z2)
        Z3 = tf.add(tf.matmul(W3, A2), b3)
        A3 = tf.nn.softmax(Z3)   # note: A3 is not used below; outputs_softmax is computed from the transposed logits


        # since we need probabilities, we apply the softmax activation function to the output layer
        logits = tf.transpose(Z3)
        labels = tf.transpose(self.Y)
        self.outputs_softmax = tf.nn.softmax(logits, name='A3')

        # next we define our loss function as cross entropy loss
        neg_log_prob = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
        
        # reward guided loss
        loss = tf.reduce_mean(neg_log_prob * self.discounted_episode_rewards_norm)  

        # we use adam optimizer for minimizing the loss
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)


    # define the discount_and_norm_rewards function, which returns the discounted and normalized rewards
    
    def discount_and_norm_rewards(self):
        # use a float array so the normalization below works even if rewards are integers
        discounted_episode_rewards = np.zeros_like(self.episode_rewards, dtype=np.float64)
        cumulative = 0
        for t in reversed(range(len(self.episode_rewards))):
            cumulative = cumulative * self.gamma + self.episode_rewards[t]
            discounted_episode_rewards[t] = cumulative

        discounted_episode_rewards -= np.mean(discounted_episode_rewards)
        discounted_episode_rewards /= np.std(discounted_episode_rewards)
        return discounted_episode_rewards
    
    # now we actually learn i.e train our network
    
    def learn(self):
        # discount and normalize episodic reward
        discounted_episode_rewards_norm = self.discount_and_norm_rewards()

        # train the network
        self.sess.run(self.train_op, feed_dict={
             self.X: np.vstack(self.episode_observations).T,
             self.Y: np.vstack(np.array(self.episode_actions)).T,
             self.discounted_episode_rewards_norm: discounted_episode_rewards_norm,
        })

        # reset the episodic data
        self.episode_observations, self.episode_actions, self.episode_rewards  = [], [], []

        return discounted_episode_rewards_norm

Initialize the environment

env = gym.make('LunarLander-v2')
env = env.unwrapped

Initialize variables

RENDER_ENV = False
EPISODES = 5000
rewards = []
RENDER_REWARD_MIN = 5000

Create an instance of the PolicyGradient class

PG = PolicyGradient(
    n_x = env.observation_space.shape[0],
    n_y = env.action_space.n,
    learning_rate=0.02,
    reward_decay=0.99,
)

Run the model

for episode in range(EPISODES):
    
    # get the state
    observation = env.reset()
    episode_reward = 0


    while True:
        
        if RENDER_ENV: env.render()

        # choose an action based on the state
        action = PG.choose_action(observation)

        # perform action in the environment and move to next state and receive reward
        observation_, reward, done, info = env.step(action)

        # store the transition information
        PG.store_transition(observation, action, reward)
        
        # sum the rewards obtained in each episode
        episode_rewards_sum = sum(PG.episode_rewards)
        
        # if the reward is less than -250 then terminate the episode
        if episode_rewards_sum < -250:
            done = True
    
        if done:
            episode_rewards_sum = sum(PG.episode_rewards)
            rewards.append(episode_rewards_sum)
            max_reward_so_far = np.amax(rewards)

            print("Episode: ", episode)
            print("Reward: ", episode_rewards_sum)
            print("Max reward so far: ", max_reward_so_far)

            # train the network
            discounted_episode_rewards_norm = PG.learn()

            # start rendering once the agent performs well enough
            if max_reward_so_far > RENDER_REWARD_MIN: RENDER_ENV = True


            break

        # update the next state as current state
        observation = observation_

https://github.com/PacktPublishing/Hands-On-Reinforcement-Learning-with-Python/blob/master/Chapter11/11.2%20Lunar%20Lander%20Using%20Policy%20Gradients.ipynb
