Unlike many other forms of machine learning, the learning system is not told which actions it should take; it must discover through trial and error which actions yield the greatest reward. Moreover, the current action may affect not only the immediate reward but also the next reward and all subsequent rewards.
The five key terms of reinforcement learning:
The reinforcement learning process: in short, the agent observes, then acts, then observes again, as shown in the figure below:
The following diagram illustrates the principle of reinforcement learning:
A Markov decision process requires that:
1. The goal state can be detected.
2. The process can be tried many times (episodes can be repeated).
3. The next state of the system depends only on the current state, not on any earlier states, and on the action taken in the current state (formalized just below).
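Formally, requirement 3 is the Markov property: conditioning on the whole history gives the same transition distribution as conditioning on the current state and action alone,
$$P(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0) = P(S_{t+1} = s' \mid S_t, A_t).$$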
A Markov decision process consists of five elements (combined into a single tuple after this list):
S: the set of states.
A: the set of actions.
P: the state-transition probabilities $P_{sa}$, i.e. the distribution over successor states when action $a \in A$ is taken in the current state $s \in S$; the probability of transitioning to $s'$ after executing $a$ in $s$ is written $p(s' \mid s, a)$.
R: the reward function, the immediate reward the agent receives after taking an action.
$\gamma$: the discount factor, which makes present rewards count for more than future rewards; the discounted return is $\sum_{t=0}^{\infty} \gamma^{t} R(s_t)$, with $0 \le \gamma \le 1$.
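Taken together, the five elements form the tuple $\langle S, A, P, R, \gamma \rangle$, and the agent's goal is to act so as to maximize the expected discounted return $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t)\big]$.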
The reinforcement learning loop, step by step (a minimal code sketch follows the list):
1. The agent starts in an initial state $S_0$.
2. It selects an action $a_0$.
3. It transitions to the next state $S_1$ according to the transition probabilities $P_{sa}$.
4. The process then repeats: from $S_1$ it selects $a_1$, transitions to $S_2$, and so on, collecting rewards along the way.
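As a minimal sketch of this loop in Python (the env object with its reset()/step() methods and the policy function are hypothetical stand-ins, not part of the original text):

def run_episode(env, policy, gamma=0.9):
    """One episode of the observe -> act -> observe loop.

    Assumes env.reset() returns the initial state S0 and
    env.step(action) returns (next_state, reward, done)."""
    state = env.reset()                          # observe S0
    total_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                   # choose a_t given S_t
        state, reward, done = env.step(action)   # act, then observe S_{t+1} and the reward
        total_return += discount * reward        # accumulate the discounted return
        discount *= gamma
    return total_return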
State-value function: $v(s) = \mathbb{E}[U_t \mid S_t = s]$, the expected future return obtainable from state $s$ at time $t$. The value function measures how good a state (or state-action pair) is, i.e. the expected cumulative reward.
Optimal value function: the maximum expected cumulative reward over all policies, $v_*(s) = \max_{\pi} v_{\pi}(s)$.
Policy: the probability distribution over the actions that can be taken in a given state.
Bellman equation: the value of the current state is determined by the immediate reward (Reward) and the value of the next state; that is, the value function decomposes into two parts, the current reward plus the value of the next step.
Let us look at the following example:
$\pi$ is the probability distribution over actions $a$ given a state $s$. Because both the action space $A$ and the state space $S$ are finite sets, the expectation can be computed as a sum.
Here $\pi(a \mid s)$ is the probability of each action in the current state, $R_s^a$ is the immediate reward, and $P_{ss'}^a$ is the state-transition probability.
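Using this notation, the Bellman expectation equation takes its standard form:
$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s') \Big).$$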
Bellman optimality equation: the optimal value function satisfies an analogous self-consistency condition.
It can be rewritten in terms of action values, as shown below:
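A standard form of these two equations, using the notation defined above, is
$$v_*(s) = \max_{a \in A} \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \Big),$$
$$q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a'} q_*(s', a').$$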
3.1 Randomly choose an initial state $s$.
3.2 While the goal state has not been reached, repeat the following steps:
(1) Among the actions available in the current state $s$, select an action $a$.
(2) Take action $a$ and obtain the next state $\tilde{s}$.
(3) Update $Q(s, a)$ according to Eq. (1.1) (see the update rule spelled out after this list).
(4) Set $s = \tilde{s}$.
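The update referred to as Eq. (1.1) is presumably the usual tabular Q-learning rule for a deterministic environment of this kind:
$$Q(s, a) \leftarrow R(s, a) + \gamma \max_{\tilde{a}} Q(\tilde{s}, \tilde{a}).$$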
First, set the learning parameter $\gamma = 0.8$, take room 1 as the initial state, and initialize Q as an all-zero matrix, as shown below:
Look at the second row of matrix R (corresponding to room 1, i.e. state 1). It contains two non-negative values, so state 1 offers two possible actions: move to state 3 or to state 5. Suppose we randomly choose to move to state 5.
Now imagine what happens once our agent is in state 5. Look at the sixth row of matrix R: it offers three possible actions, namely moving to state 1, 4, or 5.
Next we run another episode, again starting from a randomly chosen initial state. This time we pick state 3 as the initial state.
Look at the fourth row of matrix R; it offers three possible actions: move to state 1, 2, or 4. Suppose we randomly choose to move to state 1; we then look at the second row of R, which offers two possible actions: move to state 3 or to state 5.
State 1 has now become the current state. Since state 1 is not the goal state, we must keep exploring. State 1 offers two possible actions, moving to state 3 or to state 5; suppose we choose state 5.
If we run more episodes, the matrix Q will eventually converge; the short script below reproduces this computation.
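As a sanity check, here is a minimal script that runs this episode-based update until Q converges. The reward matrix R is the one from the classic six-room example, which matches the row descriptions above (rows and columns are states 0-5, -1 marks a forbidden move, and entering the goal state 5 yields reward 100); treat it as an assumption, since the original text does not list the matrix explicitly.

import numpy as np

# Reward matrix of the classic six-room example (assumed): -1 = no door,
# 0 = a door, 100 = a door leading directly to the goal state 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8   # the learning parameter used in the text
GOAL = 5
Q = np.zeros_like(R, dtype=float)

rng = np.random.default_rng(0)
for episode in range(1000):
    s = rng.integers(0, 6)                   # 3.1 random initial state
    while s != GOAL:                         # 3.2 repeat until the goal is reached
        actions = np.flatnonzero(R[s] >= 0)  # (1) available actions in state s
        a = int(rng.choice(actions))         #     pick one at random
        # (3) Eq. (1.1): here the action is the room moved into, so s~ = a
        Q[s, a] = R[s, a] + GAMMA * Q[a].max()
        s = a                                # (4) s = s~

print(np.round(100 * Q / Q.max()))           # normalised converged Q matrix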
The Deep Q Network architecture is as follows:
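Reading the architecture off createNetwork() in the code below: the input is a stack of four preprocessed 80×80 grayscale frames; it passes through an 8×8, 32-filter convolution with stride 4 followed by 2×2 max pooling, then a 4×4, 64-filter convolution with stride 2, then a 3×3, 64-filter convolution with stride 1; the result is flattened to 1600 units, fed into a fully connected ReLU layer of 512 units, and a final fully connected layer outputs one Q-value per action (ACTIONS = 2 for Flappy Bird).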
Source code for the TensorFlow-based agent that plays Flappy Bird:
项目源码及网络模型链接:https://pan.baidu.com/s/1i5b7QhZ 密码:3jau
#!/usr/bin/env python
from __future__ import print_function
import tensorflow as tf
import cv2
import sys
sys.path.append("game/")
import wrapped_flappy_bird as game
import random
import numpy as np
from collections import deque
GAME = 'bird' # the name of the game being played for log files
ACTIONS = 2 # number of valid actions
GAMMA = 0.99 # decay rate of past observations
OBSERVE = 100000. # timesteps to observe before training
EXPLORE = 2000000. # frames over which to anneal epsilon
FINAL_EPSILON = 0.0001 # final value of epsilon
INITIAL_EPSILON = 0.0001 # starting value of epsilon
REPLAY_MEMORY = 50000 # number of previous transitions to remember
BATCH = 32 # size of minibatch
FRAME_PER_ACTION = 1
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.01)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.01, shape = shape)
    return tf.Variable(initial)

def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")
def createNetwork():
    # network weights
    W_conv1 = weight_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])

    W_conv2 = weight_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])

    W_conv3 = weight_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])

    W_fc1 = weight_variable([1600, 512])
    b_fc1 = bias_variable([512])

    W_fc2 = weight_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])

    # input layer
    s = tf.placeholder("float", [None, 80, 80, 4])

    # hidden layers
    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)

    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    #h_pool2 = max_pool_2x2(h_conv2)

    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
    #h_pool3 = max_pool_2x2(h_conv3)

    #h_pool3_flat = tf.reshape(h_pool3, [-1, 256])
    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])

    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)

    # readout layer
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2

    return s, readout, h_fc1
def trainNetwork(s, readout, h_fc1, sess):
    # define the cost function
    a = tf.placeholder("float", [None, ACTIONS])
    y = tf.placeholder("float", [None])
    readout_action = tf.reduce_sum(tf.multiply(readout, a), reduction_indices=1)
    cost = tf.reduce_mean(tf.square(y - readout_action))
    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)

    # open up a game state to communicate with emulator
    game_state = game.GameState()

    # store the previous observations in replay memory
    D = deque()

    # printing
    a_file = open("logs_" + GAME + "/readout.txt", 'w')
    h_file = open("logs_" + GAME + "/hidden.txt", 'w')

    # get the first state by doing nothing and preprocess the image to 80x80x4
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1
    x_t, r_0, terminal = game_state.frame_step(do_nothing)
    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, x_t = cv2.threshold(x_t, 1, 255, cv2.THRESH_BINARY)
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    # saving and loading networks
    saver = tf.train.Saver()
    sess.run(tf.initialize_all_variables())
    checkpoint = tf.train.get_checkpoint_state("saved_networks")
    if checkpoint and checkpoint.model_checkpoint_path:
        saver.restore(sess, checkpoint.model_checkpoint_path)
        print("Successfully loaded:", checkpoint.model_checkpoint_path)
    else:
        print("Could not find old network weights")

    # start training
    epsilon = INITIAL_EPSILON
    t = 0
    while "flappy bird" != "angry bird":
        # choose an action epsilon greedily
        readout_t = readout.eval(feed_dict={s : [s_t]})[0]
        a_t = np.zeros([ACTIONS])
        action_index = 0
        if t % FRAME_PER_ACTION == 0:
            if random.random() <= epsilon:
                print("----------Random Action----------")
                action_index = random.randrange(ACTIONS)
                a_t[action_index] = 1  # use the index just drawn, so the executed action matches the logged action_index
            else:
                action_index = np.argmax(readout_t)
                a_t[action_index] = 1
        else:
            a_t[0] = 1 # do nothing
        # scale down epsilon
        if epsilon > FINAL_EPSILON and t > OBSERVE:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

        # run the selected action and observe next state and reward
        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
        x_t1 = np.reshape(x_t1, (80, 80, 1))
        #s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)

        # store the transition in D
        D.append((s_t, a_t, r_t, s_t1, terminal))
        if len(D) > REPLAY_MEMORY:
            D.popleft()

        # only train if done observing
        if t > OBSERVE:
            # sample a minibatch to train on
            minibatch = random.sample(D, BATCH)

            # get the batch variables
            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            y_batch = []
            readout_j1_batch = readout.eval(feed_dict = {s : s_j1_batch})
            for i in range(0, len(minibatch)):
                terminal = minibatch[i][4]
                # if terminal, only equals reward
                if terminal:
                    y_batch.append(r_batch[i])
                else:
                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            # perform gradient step
            train_step.run(feed_dict = {
                y : y_batch,
                a : a_batch,
                s : s_j_batch}
            )

        # update the old values
        s_t = s_t1
        t += 1

        # save progress every 10000 iterations
        if t % 10000 == 0:
            saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step = t)

        # print info
        state = ""
        if t <= OBSERVE:
            state = "observe"
        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
            state = "explore"
        else:
            state = "train"

        print("TIMESTEP", t, "/ STATE", state, \
            "/ EPSILON", epsilon, "/ ACTION", action_index, "/ REWARD", r_t, \
            "/ Q_MAX %e" % np.max(readout_t))

        # write info to files
        '''
        if t % 10000 <= 100:
            a_file.write(",".join([str(x) for x in readout_t]) + '\n')
            h_file.write(",".join([str(x) for x in h_fc1.eval(feed_dict={s:[s_t]})[0]]) + '\n')
            cv2.imwrite("logs_tetris/frame" + str(t) + ".png", x_t1)
        '''
def playGame():
    sess = tf.InteractiveSession()
    s, readout, h_fc1 = createNetwork()
    trainNetwork(s, readout, h_fc1, sess)

def main():
    playGame()

if __name__ == "__main__":
    main()
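A few notes on the training loop above: actions are chosen epsilon-greedily, with epsilon annealed linearly from INITIAL_EPSILON to FINAL_EPSILON over EXPLORE frames; every transition (s_t, a_t, r_t, s_t1, terminal) is pushed into the replay memory D, capped at REPLAY_MEMORY entries; for the first OBSERVE timesteps the agent only collects experience, after which each step samples a minibatch of BATCH transitions and regresses the predicted Q-value of the taken action toward the target r + GAMMA * max Q(s', a') (or just r on terminal transitions) using the Adam optimizer; the network is checkpointed to saved_networks/ every 10000 iterations.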