Reinforcement Learning Notes (1): An Implementation Based on OpenAI Gym CartPole-v0

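Reinforcement learning is a branch of machine learning used mainly to solve sequential decision-making problems. It can learn how to achieve a goal we set in a complex, uncertain environment. A reinforcement learning model has four main parts: the environment state, the agent, the action, and the reward function.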

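The set of environment states can be a finite set, and each state is described by its features. Besides the state information, the environment must also define how it changes when the agent interacts with it.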
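The interaction between these four parts can be sketched as a simple loop. The snippet below is only an illustration under stated assumptions: `ToyEnv`, `random_agent`, and `run_episode` are hypothetical names made up for this sketch and are not part of OpenAI Gym.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer position.

    The episode ends when the position leaves the range [-3, 3].
    """

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right
        self.position += 1 if action == 1 else -1
        done = abs(self.position) > 3
        reward = 1.0  # +1 for every step taken
        return self.position, reward, done


def random_agent(state, rng):
    """Placeholder policy: ignores the state and acts uniformly at random."""
    return rng.randint(0, 1)


def run_episode(env, policy, rng, max_steps=50):
    """Run one episode and return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)             # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
episode_reward = run_episode(ToyEnv(), random_agent, rng)
print("Total reward:", episode_reward)
```

A learning agent would replace `random_agent` with a policy that uses the state and is improved from the rewards it receives.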




1. Learning from the OpenAI Gym CartPole-v0 example



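The agent's goal is to keep going for as long as possible. Any action that does not cause the task to fail earns a reward of +1. This requires the agent to be farsighted: for every action it must consider not only the immediate reward but also the future payoff.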
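One common way to express "future payoff" is the discounted return, where a reward received k steps later is weighted by gamma to the power k. The sketch below assumes a discount factor of 0.99; `discounted_returns` is a hypothetical helper name used only for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


short_episode = discounted_returns([1.0] * 3)
long_episode = discounted_returns([1.0] * 10)
print(short_episode[0], long_episode[0])
```

With a reward of +1 per step, the first action of the 10-step episode gets a larger return than the first action of the 3-step episode, so actions that keep the pole up longer are valued more.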

The following is the description of the CartPole-v0 game from OpenAI Gym:

    Description:
        A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum starts upright, and the goal is to prevent it from falling over by increasing and reducing the cart's velocity.

    Source:
        This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson

    Observation:
        Type: Box(4)
        Num   Observation             Min        Max
        0     Cart Position           -4.8       4.8
        1     Cart Velocity           -Inf       Inf
        2     Pole Angle              -24 deg    24 deg
        3     Pole Velocity At Tip    -Inf       Inf

    Actions:
        Type: Discrete(2)
        Num   Action
        0     Push cart to the left
        1     Push cart to the right

        Note: The amount the velocity that is reduced or increased is not fixed; it depends on the angle the pole is pointing. This is because the center of gravity of the pole increases the amount of energy needed to move the cart underneath it

    Reward:
        Reward is 1 for every step taken, including the termination step

    Starting State:
        All observations are assigned a uniform random value in [-0.05..0.05]

    Episode Termination:
        Pole Angle is more than 12 degrees
        Cart Position is more than 2.4 (center of the cart reaches the edge of the display)
        Episode length is greater than 200

    Solved Requirements:
        Considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
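The termination and "solved" rules quoted above can be written down as two small checks. This is a sketch under assumptions, not Gym's own code: `is_terminated` and `is_solved` are hypothetical helpers, the angle and position limits are assumed to be symmetric around zero, and reaching 200 steps is assumed to end the episode.

```python
def is_terminated(cart_position, pole_angle_deg, steps_taken, max_steps=200):
    """Check the three termination conditions from the description.

    Assumption: the limits apply to the absolute value of angle and position,
    and the episode is cut off once `max_steps` steps have been taken.
    """
    return (
        abs(pole_angle_deg) > 12
        or abs(cart_position) > 2.4
        or steps_taken >= max_steps
    )


def is_solved(episode_rewards, window=100, threshold=195.0):
    """True if the mean reward of the last `window` episodes is at least `threshold`."""
    if len(episode_rewards) < window:
        return False
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold


print(is_terminated(cart_position=0.0, pole_angle_deg=5.0, steps_taken=10))
print(is_solved([200.0] * 100))
```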


2.1 Testing random actions in the CartPole environment as a baseline

# -*- coding: utf-8 -*-
"""
Created on Fri Apr  5 20:12:03 2019

@author: zhangfan
"""
# Import OpenAI Gym, NumPy, and TensorFlow
import gym
import numpy as np
import tensorflow as tf

env = gym.make('CartPole-v0')

env.reset()  # initialize the environment
random_episodes = 0
reward_sum = 0
while random_episodes < 10:
    # Take a uniformly random action: 0 (push left) or 1 (push right)
    observation, reward, done, _ = env.step(np.random.randint(0, 2))
    reward_sum += reward
    if done:
        random_episodes += 1
        print("Reward for this episode was:", reward_sum)
        reward_sum = 0  # reset the accumulated reward
        env.reset()  # start a new episode before stepping again

2.2 Building the policy network

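The policy network is an ANN with a single hidden layer. The hyperparameters are: nodes = 50, batch_size = 25, learning_rate = 0.1, discount_rate = 0.99. The network takes the agent's observation of the environment as input and outputs a probability, which is then used to choose the action.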
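As a framework-free illustration of the forward pass just described, here is a minimal NumPy sketch. The weights are random stand-ins rather than trained parameters, and `policy_forward` and `sample_action` are hypothetical names chosen for this example.

```python
import numpy as np

D, H = 4, 50  # observation dimension and hidden-layer size, as in the text

rng = np.random.default_rng(0)
# Hypothetical random weights standing in for trained parameters (no bias terms)
W1 = rng.normal(scale=0.1, size=(D, H))
W2 = rng.normal(scale=0.1, size=(H, 1))


def policy_forward(observation):
    """Return P(action = 1) for a single observation of shape (D,)."""
    x = observation.reshape(1, D)
    hidden = np.maximum(0.0, x @ W1)       # ReLU hidden layer
    score = hidden @ W2                    # single output score
    return 1.0 / (1.0 + np.exp(-score))    # sigmoid, shape (1, 1)


def sample_action(observation):
    """Choose action 1 with probability P(action = 1), otherwise action 0."""
    p_one = policy_forward(observation).item()
    return 1 if rng.uniform() < p_one else 0


obs = np.array([0.01, -0.02, 0.03, 0.04])
prob = policy_forward(obs).item()
action = sample_action(obs)
print(prob, action)
```

Sampling from the probability, rather than always taking the more likely action, keeps some exploration in the policy.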

# PART 2: AGENT. The agent is a simple ANN with one hidden layer.
H = 50  # number of hidden neurons
batch_size = 25  # number of episodes per parameter update
learning_rate = 0.1
D = 4  # dimension of the observation
gamma = 0.99  # discount rate

observations = tf.placeholder(tf.float32, [None, D], name = "input_x")
W1 = tf.get_variable("W1", shape = [D, H], initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations, W1))
W2 = tf.get_variable("W2", shape = [H, 1], initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1, W2)
probability = tf.nn.sigmoid(score)

# Define the optimizer and the gradient placeholders; parameters are updated by batch training
adam = tf.train.AdamOptimizer(learning_rate = learning_rate)
W1_grad = tf.placeholder(tf.float32,name = "batch_grad1")
W2_grad = tf.placeholder(tf.float32,name = "batch_grad2")
batchGrad = [W1_grad, W2_grad]
tvars = tf.trainable_variables()
updateGrads = adam.apply_gradients(zip(batchGrad, tvars))

def discount_rewards(r):
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

input_y = tf.placeholder(tf.float32, [None, 1], name = "input_y")
advantages = tf.placeholder(tf.float32, name = "reward_signal")
loglik = tf.log(input_y*(input_y - probability) + (1-input_y)*(input_y + probability))
# loglik is the log-probability of the action taken: P(act=1) = probability, P(act=0) = 1 - probability
# action = 1 (input_y = 0): loglik = tf.log(probability)
# action = 0 (input_y = 1): loglik = tf.log(1 - probability)
loss = -tf.reduce_mean(loglik * advantages)
newGrads = tf.gradients(loss, tvars)

xs = []  # list of observations
ys = []  # list of labels, label = 1 - action
drs = []  # reward for each action
reward_sum = 0  # accumulated reward
episode_num = 1  # index of the current episode
total_episodes = 10000  # total number of episodes

with tf.Session() as sess:
    rendering = False
    init = tf.global_variables_initializer()
    sess.run(init)  # initialize the variables
    observation = env.reset()  # reset the environment
    # Gradient buffer with the same shapes as the trainable variables
    gradBuffer = sess.run(tvars)
    for ix, grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0  # zero the accumulated gradients
    while episode_num <= total_episodes:
        # Flag set once the agent performs well; env.render() is not called in this listing
        if reward_sum / batch_size > 100 or rendering:
            rendering = True
        x = np.reshape(observation, [1, D])
        # Probability of choosing action 1 under the current policy
        tfprob = sess.run(probability, feed_dict={observations: x})
        action = 1 if np.random.uniform() < tfprob else 0
        xs.append(x)  # store the observation
        y = 1 - action
        ys.append(y)  # store the label
        observation, reward, done, info = env.step(action)
        reward_sum += reward
        drs.append(reward)  # store the reward
        # done == True means the episode has ended
        if done:
            episode_num += 1  # one episode finished, advance the index
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            xs, ys, drs = [], [], []
            # Discount the rewards, then standardize them to zero mean and unit variance
            discounted_epr = discount_rewards(epr)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)
            # Feed epx, epy, and the standardized returns to the network; newGrads computes the gradients
            tGrad = sess.run(newGrads, feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
            for ix, grad in enumerate(tGrad):
                gradBuffer[ix] += grad  # accumulate gradients over the batch
            if episode_num % batch_size == 0:
                # Apply the accumulated gradients, then clear the buffer
                sess.run(updateGrads, feed_dict={W1_grad: gradBuffer[0], W2_grad: gradBuffer[1]})
                for ix, grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0
                print('Average reward for episode %d : %f.' % (episode_num, reward_sum / batch_size))
                # CartPole-v0 episodes are capped at 200 steps, so the average can reach but never exceed 200
                if reward_sum / batch_size >= 200:
                    print("Task solved in", episode_num, 'episodes!')
                    break
                reward_sum = 0
            observation = env.reset()
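To see what the `loglik` expression and the standardization step in the listing compute, the same arithmetic can be reproduced on a few hand-picked numbers in NumPy. The probabilities, actions, and returns below are made-up illustration values, and `log_likelihood` and `standardize` are hypothetical helper names.

```python
import numpy as np


def log_likelihood(input_y, probability):
    """Log-probability of the action taken, where input_y = 1 - action."""
    return np.log(input_y * (input_y - probability)
                  + (1 - input_y) * (input_y + probability))


def standardize(returns):
    """Shift and scale the discounted returns to zero mean and unit variance."""
    return (returns - returns.mean()) / returns.std()


# Made-up values for a three-step episode
probability = np.array([[0.8], [0.8], [0.3]])  # P(action = 1) at each step
actions = np.array([[1], [0], [1]])            # actions actually taken
input_y = 1 - actions                          # labels as used in the listing

loglik = log_likelihood(input_y, probability)
advantages = standardize(np.array([[2.9701], [1.99], [1.0]]))
loss = -np.mean(loglik * advantages)
print(loglik.ravel(), loss)
```

For action 1 the expression reduces to log(probability), and for action 0 it reduces to log(1 - probability). Minimizing the loss therefore raises the probability of actions with above-average returns and lowers it for the rest.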

2.3 Results

[Figure 2: output of the training run]

2.4 Summary

