AlphaGo basics: policy gradient
Policy gradient mainly involves two networks:
a value network and a policy network.
Value network: estimates the best reward (or win rate) obtainable from the current state. This covers the reward at the state itself plus the rewards of the following steps, with later steps weighted by progressively smaller coefficients, i.e. a discounted return of the form future_reward = r_0 + decay*r_1 + decay^2*r_2 + ...
Policy network: estimates which action to take in the current state so that the agent collects the most reward. The gradient for the action taken in the current state is weighted by the difference between the actual reward observed in the training data and the reward predicted by the value network.
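As a rough picture (a minimal sketch with random placeholder weights, not the trained networks from the code below): both networks take the 4-dimensional CartPole state; the value network outputs a single estimated discounted future reward, and the policy network outputs a probability for each of the two actions.

import numpy as np

state = np.array([0.02, -0.01, 0.03, 0.01])  # [cart position, cart velocity, pole angle, pole angular velocity]

value_weights = np.random.randn(4, 1)        # placeholder parameters
value_estimate = (state @ value_weights)[0]  # scalar: estimated discounted future reward

policy_weights = np.random.randn(4, 2)       # placeholder parameters
logits = state @ policy_weights
action_probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the two actions (push left / push right)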
Take CartPole training as an example.
CartPole is about the simplest example there is:
a cart sits on a frictionless horizontal track with a pole standing upright on it. Clearly, if the cart moves, the pole tends to fall over, and all you can do is push the cart left or right. The cart starts at the origin; the episode counts as a failure once the cart moves beyond |2.4| or the pole tilts more than 12 degrees.
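To make the setup concrete, here is a quick look at the environment through gym (using the older gym version from the installation notes at the end, whose step() still returns four values): the observation is a 4-dimensional vector, the action is 0 or 1, and each step the pole stays up gives a reward of 1.

import gym

env = gym.make('CartPole-v0')
observation = env.reset()
print(observation)                 # 4 numbers: cart position, cart velocity, pole angle, pole angular velocity
observation, reward, done, info = env.step(env.action_space.sample())  # push left (0) or right (1)
print(reward, done)                # reward is 1.0 per step; done becomes True once the cart or pole goes out of bounds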
The training process for the value-based policy gradient:
1. Sample actions at random according to the policy network's probabilities (at test time the action with the higher probability under the policy network is taken directly). Each step's action yields a reward, which is either 1 or absent; no reward means the episode has ended. This produces a sequence of triples, each consisting of (current state, current action, current reward), and the length of the sequence is the time step at which the game ends (or it is cut off manually once it exceeds a preset length).
2. Build the training data for the value network. Given the triple sequence A above, of length N, compute for element i the discounted future reward of its state:
decay = 1.0
future_reward = 0
for x in range(N - i):
    future_reward += A[x + i].reward * decay
    decay = decay * 0.97
The resulting sequence of (state, future_reward) pairs is used to train the value network; the point of training the value network is to better estimate the win rate, i.e. the best obtainable reward, from the current state.
3. Build the training data for the policy network. Once the value network above is trained, it can estimate the future_reward obtainable from the current state, and the policy network is trained against that estimate. The policy network's input is the current state and its output is the probability of taking each action in that state. So how is the future_reward mentioned earlier actually used? future_reward is computed from the actual training samples, while the value network's job is to estimate future_reward_assessment for the current state. Assuming the value network is already well trained, future_reward_assessment is accurate, so future_reward - future_reward_assessment measures how good the chosen action was. If the difference is negative, the action was a poor choice: by the assessment you should have been able to collect more reward, yet the actual result was smaller, so multiplying the gradient update by this difference makes the same action less likely the next time the same situation comes up. If the difference is positive, the assessment said you could get future_reward_assessment but you actually got future_reward > future_reward_assessment, which means the action was a good one that harvested more reward than expected; multiplying the gradient update by this value makes the same state more likely to pick this action next time (the sketch after these steps walks through this computation on a toy episode).
4. Step 1 generates the data; step 2 lightly processes that data to train the value network; step 3 subtracts the value network's estimate future_reward_assessment of each state from the actual future_reward to obtain the weight used in the gradient update of the corresponding action. Repeating steps 1, 2 and 3 trains both the value network and the policy network.
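The sketch referenced in step 3: a small self-contained version of steps 2 and 3 that, given one episode's rewards and some value-network estimates (placeholder numbers here, not real network outputs), computes the discounted future rewards and the advantages used to weight the policy gradient.

import numpy as np

def discounted_future_rewards(rewards, decay=0.97):
    # future_reward[i] = rewards[i] + decay*rewards[i+1] + decay^2*rewards[i+2] + ...
    future = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + decay * running
        future[i] = running
    return future

rewards = [1.0, 1.0, 1.0, 1.0]          # CartPole gives a reward of 1 for every surviving step
value_estimates = [3.5, 2.8, 1.9, 1.0]  # placeholder outputs of the value network for the 4 states
future_rewards = discounted_future_rewards(rewards)
advantages = future_rewards - np.array(value_estimates)
# advantage > 0: the action did better than the value network expected, so raise its probability;
# advantage < 0: it did worse than expected, so lower its probability.
print(future_rewards)
print(advantages)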
Simple example code is given below.
Code address:
https://github.com/kvfrans/openai-cartpole
import tensorflow as tf
import numpy as np
import random
import gym
import math
import matplotlib.pyplot as plt
# numpy softmax (defined here but not actually used below; the TF graph uses tf.nn.softmax)
def softmax(x):
    e_x = np.exp(x - np.max(x))
    out = e_x / e_x.sum()
    return out
# policy network: produces action probabilities from the state
def policy_gradient():
    with tf.variable_scope("policy"):
        params = tf.get_variable("policy_parameters", [4, 2])
        state = tf.placeholder("float", [None, 4])
        actions = tf.placeholder("float", [None, 2])
        advantages = tf.placeholder("float", [None, 1])
        linear = tf.matmul(state, params)
        probabilities = tf.nn.softmax(linear)
        good_probabilities = tf.reduce_sum(tf.multiply(probabilities, actions), reduction_indices=[1])
        # The policy network is trained so that the reward obtained under this policy is maximized.
        # advantages = the future_reward obtained in the current state minus the reward estimated by
        # the value function (assess_reward).
        # A positive difference means the assessment said you should get assess_reward, but the reward
        # you actually got was larger than assess_reward; the action was a good one, so the same state
        # should pick this action with higher probability next time.
        # A negative difference means the reward you actually got was smaller than the estimated
        # assess_reward; this action is probably not good enough, so it should be picked with lower
        # probability next time.
        eligibility = tf.log(good_probabilities) * advantages
        loss = -tf.reduce_sum(eligibility)
        optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
        return probabilities, state, actions, advantages, optimizer
# value network: estimates the reward obtainable from the current state, i.e. the discounted
# sum of rewards from this state until the episode ends
def value_gradient():
    with tf.variable_scope("value"):
        state = tf.placeholder("float", [None, 4])
        # newvals holds the discounted future reward actually observed from each state
        newvals = tf.placeholder("float", [None, 1])
        w1 = tf.get_variable("w1", [4, 10])
        b1 = tf.get_variable("b1", [10])
        h1 = tf.nn.relu(tf.matmul(state, w1) + b1)
        w2 = tf.get_variable("w2", [10, 1])
        b2 = tf.get_variable("b2", [1])
        calculated = tf.matmul(h1, w2) + b2
        # Training this network makes `calculated` a good estimate of the future reward obtainable
        # from the current state; later that estimate is subtracted from the actual future reward
        # to judge how good a particular action was.
        diffs = calculated - newvals
        loss = tf.nn.l2_loss(diffs)
        optimizer = tf.train.AdamOptimizer(0.1).minimize(loss)
        return calculated, state, newvals, optimizer, loss
def run_episode(env, policy_grad, value_grad, sess, is_train=True):
    pl_calculated, pl_state, pl_actions, pl_advantages, pl_optimizer = policy_grad
    vl_calculated, vl_state, vl_newvals, vl_optimizer, vl_loss = value_grad
    observation = env.reset()
    totalreward = 0
    states = []
    actions = []
    advantages = []
    transitions = []
    update_vals = []
    for _ in range(200):
        # calculate the action probabilities for the current state with the policy network
        obs_vector = np.expand_dims(observation, axis=0)
        probs = sess.run(pl_calculated, feed_dict={pl_state: obs_vector})
        if is_train:
            # during training, sample the action according to the policy probabilities
            action = 0 if random.uniform(0, 1) < probs[0][0] else 1
        else:
            # during testing, take the more probable action directly
            action = 1 if probs[0][0] < probs[0][1] else 0
        # record the transition
        states.append(observation)
        actionblank = np.zeros(2)
        actionblank[action] = 1
        actions.append(actionblank)
        # take the action in the environment
        old_observation = observation
        observation, reward, done, info = env.step(action)
        transitions.append((old_observation, action, reward))
        totalreward += reward
        if done:
            break
    for index, trans in enumerate(transitions):
        obs, action, reward = trans
        # calculate the discounted monte-carlo return
        future_reward = 0
        future_transitions = len(transitions) - index
        decrease = 1
        for index2 in range(future_transitions):
            future_reward += transitions[(index2) + index][2] * decrease
            decrease = decrease * 0.97
        obs_vector = np.expand_dims(obs, axis=0)
        # the value function's estimate of the reward obtainable from the current state
        currentval = sess.run(vl_calculated, feed_dict={vl_state: obs_vector})[0][0]
        # advantage: how much better was this action than normal, i.e. how much better the actual
        # future_reward was than the reward the value function predicted. As training progresses,
        # currentval becomes an accurate estimate of the reward obtainable from the current state,
        # so future_reward - currentval serves as the label for the action: if the actual return
        # beats the estimate, the gradient is pushed towards this action; if it falls short, the
        # negative difference reverses the update, so that a similar state becomes less likely to
        # take this action again.
        advantages.append(future_reward - currentval)
        # print("future_reward:", future_reward)
        # print("currentval:", currentval)
        # update the value function towards the new return
        update_vals.append(future_reward)
    if is_train:
        # optimize the value function against the observed future rewards, so that it learns to
        # estimate the best obtainable reward from the current state, including the discounted
        # rewards of the later steps
        update_vals_vector = np.expand_dims(update_vals, axis=1)
        sess.run(vl_optimizer, feed_dict={vl_state: states, vl_newvals: update_vals_vector})
        # real_vl_loss = sess.run(vl_loss, feed_dict={vl_state: states, vl_newvals: update_vals_vector})
        # optimize the policy function with the actions actually taken and the advantages
        # (future_reward minus the value function's estimate); the value function is only needed
        # here so it can be subtracted from future_reward, which decides whether the same state
        # should take this action with higher or lower probability next time
        advantages_vector = np.expand_dims(advantages, axis=1)
        sess.run(pl_optimizer, feed_dict={pl_state: states, pl_advantages: advantages_vector, pl_actions: actions})
    return totalreward
env = gym.make('CartPole-v0')
# env.monitor.start('cartpole-hill/', force=True)
policy_grad = policy_gradient()
value_grad = value_gradient()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
# training: run episodes until the policy keeps the pole up for the full 200 steps
for i in range(2000):
    reward = run_episode(env, policy_grad, value_grad, sess)
    if reward == 200:
        print("reward 200")
        print(i)
        break
# testing: run the greedy policy and report the average total reward
t = 0
for _ in range(1000):
    env.render()
    reward = run_episode(env, policy_grad, value_grad, sess, False)
    t += reward
print(t / 1000)
Installing gym and tensorflow:
Install from git:
git clone https://github.com/openai/gym
cd gym
pip install -e . # minimal install
or
pip install -e .[all] # full install (this requires cmake and a recent pip version)
Install via pip:
pip install gym # minimal install
or
pip install gym[all] # full install, fetches gym as a package
Installing tensorflow:
pip install tensorflow==1.2
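Assuming the installs above succeeded, a quick sanity check that both packages import and the CartPole environment can be created:

import gym
import tensorflow as tf

print(tf.__version__)           # expect 1.2.x
env = gym.make('CartPole-v0')
print(env.observation_space)    # the 4-dimensional observation space
print(env.action_space)         # the 2 discrete actions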