Tang Yudi (唐宇迪) Reinforcement Learning Notes: Hands-On Project (Flappy Bird)

Reinforcement learning:

Unlike many other forms of machine learning, the learning system is not told which actions it should take; it must discover, by trying them, which actions yield the largest reward. The current action may affect not only the immediate reward but also the next reward and every reward after that.

The five key elements of reinforcement learning:

[Figure 1]

The learning process of reinforcement learning: in short, observe, then act, then observe again, as shown below:

[Figure 2]

The following diagram shows how reinforcement learning works:

[Figure 3]

Requirements of a Markov decision process:

1. The relevant states can be observed.
2. The process can be tried repeatedly.
3. The next state of the system depends only on the current state, not on any earlier state; during decision making it also depends on the action currently taken (see the equation below).
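
Written out, the Markov property (plus the dependence on the current action) says that the full history adds nothing beyond the current state and action:

$$P(S_{t+1}=s' \mid S_t=s, A_t=a, S_{t-1}, A_{t-1}, \ldots, S_0, A_0) = P(S_{t+1}=s' \mid S_t=s, A_t=a)$$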

A Markov decision process consists of five elements:

S: the set of states.
A: the set of actions.
P: the state-transition probabilities $P_{sa}$, i.e. the distribution over next states after applying action $a \in A$ in state $s \in S$; the probability of moving to $s'$ after taking action $a$ in state $s$ is written $p(s' \mid s, a)$.
R: the reward function, the immediate reward the agent receives after taking an action.
γ: the discount factor. It makes today's reward count more than future rewards: the return is $\sum_{t=0}^{\infty} \gamma^{t} R(s_t)$ with $0 \le \gamma \le 1$ (a small numerical sketch follows this list).
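
As a tiny numerical sketch of the discounted return (the reward sequence below is made up purely for illustration):

```python
# Minimal sketch of the return sum_t gamma^t * R(s_t); rewards are arbitrary.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]   # R(s_0), R(s_1), R(s_2), R(s_3)

discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)          # 1.0 + 0.9**3 * 5.0 = 4.645
```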

Reinforcement learning proceeds as follows (a minimal interaction-loop sketch follows the list):

1. The agent starts in the initial state $S_0$.
2. It chooses an action $a_0$.
3. It moves to the next state $S_1$ according to the transition probabilities $P_{sa}$.
The observe–act–transition cycle then repeats.
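
The toy environment below is entirely hypothetical (random transitions over three states), just to show the shape of the loop:

```python
import random

# Hypothetical toy MDP with 3 states and 2 actions, purely for illustration.
STATES = [0, 1, 2]
ACTIONS = [0, 1]

def step(state, action):
    """Sample a next state and reward; stands in for P_sa and the reward function."""
    next_state = random.choice(STATES)
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward

s = 0                               # 1. the agent starts in state S0
for t in range(5):
    a = random.choice(ACTIONS)      # 2. choose an action a_t
    s, r = step(s, a)               # 3. transition according to P_sa, collect the reward
    print(f"t={t} action={a} next_state={s} reward={r}")
```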


State value function: $v(s) = \mathbb{E}[U_t \mid S_t = s]$, the expected future return obtainable from state $s$ at time $t$. The value function measures how good a state (or state–action pair) is: the expectation of the cumulative reward.
Optimal value function: the best expected cumulative reward over all policies, $v^{*}(s) = \max_{\pi} v_{\pi}(s)$.
Policy: the probability distribution over actions given a state.
Bellman equation: the value of the current state depends on the immediate reward and the value of the next state; the value function decomposes into the current reward plus the (discounted) value of the next step.
Let us look at the following example:

[Figure 4]

$\pi$ is the probability distribution over actions $a$ given state $s$. Because the action space $A$ and the state space $S$ are both finite sets, we can compute the expectation as a sum:

$$v_{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a}\, v_{\pi}(s')\right)$$

Here $\pi(a \mid s)$ is the probability of taking action $a$ in the current state, $R_s^a$ is the immediate reward, and $P_{ss'}^{a}$ is the state-transition probability:

$$P_{ss'}^{a} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$$

Equivalently, in expectation form:

$$v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[R_s^a + \gamma\, v_{\pi}(S_{t+1}) \,\middle|\, S_t = s\right]$$

As shown below:
[Figure 5]
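
A minimal sketch of how this equation is used in practice: iterative policy evaluation just applies the right-hand side repeatedly until $v_{\pi}$ stops changing. The two-state MDP and the uniform policy below are made up for illustration only:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with a uniform random policy (illustrative numbers).
P = np.array([                      # P[a, s, s']: transition probabilities
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([                      # R[s, a]: immediate reward R_s^a
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.full((2, 2), 0.5)           # pi(a|s)
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # v_pi(s) = sum_a pi(a|s) * (R_s^a + gamma * sum_s' P_ss'^a * v_pi(s'))
    v_new = np.array([
        sum(pi[s, a] * (R[s, a] + gamma * P[a, s] @ v) for a in range(2))
        for s in range(2)
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)                            # value of each state under the random policy
```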

Bellman optimality equation:

$$v^{*}(s) = \max_{a}\left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^{a}\, v^{*}(s')\right)$$

The optimal value of a state is the largest cumulative expected reward the agent can obtain starting from that state.
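
A minimal sketch of value iteration, which turns the Bellman optimality equation into an update rule; it reuses the same hypothetical two-state MDP as the sketch above (numbers are illustrative only):

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the policy-evaluation sketch.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # v*(s) = max_a ( R_s^a + gamma * sum_s' P_ss'^a * v*(s') )
    v_new = np.array([
        max(R[s, a] + gamma * P[a, s] @ v for a in range(2))
        for s in range(2)
    ])
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)                            # optimal state values under these made-up dynamics
```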
We take the following rooms as an example:

[Figure 6]

We convert it into:

[Figure 7]
[Figure 8]

[Figure 9]

$$Q(s,a) = R(s,a) + \gamma \max_{\tilde{a}}\, Q(\tilde{s}, \tilde{a})$$

Here $s, a$ denote the current state and action, $\tilde{s}, \tilde{a}$ denote the next state and action, and the learning parameter $\gamma$ is a constant with $0 \le \gamma \le 1$.
step 1: Fix the parameter $\gamma$ and the reward matrix R.
step 2: Initialize Q = 0.
step 3: For each episode:

3.1 Randomly select an initial state s.
3.2 While the goal state has not been reached, repeat:
(1) Select an action a among the actions available in the current state s.
(2) Take action a and observe the next state $\tilde{s}$.
(3) Update Q(s, a) with the formula above.
(4) Set s = $\tilde{s}$.

A short script implementing these steps is given below.
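
Since the figures are not reproduced here, the reward matrix `R` in this sketch is an assumption: it is the standard matrix used in this classic room example (−1 marks a missing door, 0 a door, and 100 a door into the goal state 5), and $\gamma = 0.8$ matches the walkthrough that follows.

```python
import numpy as np

# Assumed reward matrix of the classic 6-room example (row: state s,
# column: action a = room to move to); -1 = no door, 100 = door to goal 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8
GOAL = 5

Q = np.zeros_like(R, dtype=float)                    # step 2: Q = 0
rng = np.random.default_rng(0)

for episode in range(1000):                          # step 3: for each episode
    s = rng.integers(0, 6)                           # 3.1 random initial state
    while s != GOAL:                                 # 3.2 until the goal is reached
        actions = np.flatnonzero(R[s] >= 0)          # (1) actions available in s
        a = rng.choice(actions)                      #     pick one at random
        s_next = a                                   # (2) the next state
        Q[s, a] = R[s, a] + GAMMA * Q[s_next].max()  # (3) the Q update above
        s = s_next                                   # (4) move on

print(np.round(Q / Q.max() * 100).astype(int))       # normalized converged Q
```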

First take the learning parameter $\gamma = 0.8$ and room 1 as the initial state, and initialize Q as a zero matrix, as shown below:

[Figure 10]

Look at the second row of the matrix R (the row for room 1, i.e. state 1). It contains two non-negative values, so state 1 has two possible actions: move to state 3 or to state 5. At random, we choose to move to state 5.

[Figure 11]

Imagine what happens once the agent is in state 5. Look at the sixth row of matrix R: it offers three possible actions, moving to state 1, 4 or 5.

[Figure 12]
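
With the assumed R matrix above and Q still all zeros, this first update gives

$$Q(1,5) = R(1,5) + 0.8 \times \max\{Q(5,1),\, Q(5,4),\, Q(5,5)\} = 100 + 0.8 \times 0 = 100$$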

Next we run another episode, again starting from a randomly chosen initial state; this time we pick state 3.
Looking at the fourth row of matrix R, there are three possible actions: move to state 1, 2 or 4. At random we choose to move to state 1, so we then look at the second row of R, which offers two possible actions: move to state 3 or to state 5.

[Figure 13]
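
Again with the assumed R matrix, the update for the move from state 3 to state 1 uses the Q values of state 1 (where $Q(1,5)=100$ from the previous episode):

$$Q(3,1) = R(3,1) + 0.8 \times \max\{Q(1,3),\, Q(1,5)\} = 0 + 0.8 \times 100 = 80$$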

State 1 is now the current state. Since state 1 is not the goal state, we keep exploring. State 1 has two possible actions: move to state 3 or to state 5. Suppose we choose state 5.

[Figure 14]
[Figure 15]

If we run more episodes, the matrix Q eventually converges to:

[Figure 16]
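
Once Q has converged, the way out of any room can be read off greedily: from the current state, keep moving to the action with the largest Q value until the goal is reached. A sketch, reusing `Q` and `GOAL` from the script above:

```python
# Greedy path extraction from the converged Q (continues the sketch above).
s = 2                          # e.g. start in room 2
path = [s]
while s != GOAL:
    s = int(np.argmax(Q[s]))   # follow the largest Q value
    path.append(s)
print(path)                    # e.g. [2, 3, 1, 5]
```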

Hands-on project:

The Deep Q Network architecture is as follows:

[Figure 17]

[Figure 18]
[Figure 19]

Source code for the TensorFlow-based Flappy Bird agent:
Project source code and trained network model: https://pan.baidu.com/s/1i5b7QhZ  password: 3jau
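
One note before the code: the network maps a stack of four 80×80 grayscale frames to one Q value per action (flap or do nothing), and each gradient step minimizes the squared temporal-difference error

$$L = \big(y - Q(s,a)\big)^2, \qquad y = r + \gamma \max_{a'} Q(s', a'),$$

with the target $y$ reduced to the bare reward $r$ on terminal frames. This is exactly what `y_batch` and `cost` compute in the script; transitions are sampled from the replay deque `D` rather than learned online.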

```python
#!/usr/bin/env python
from __future__ import print_function

import tensorflow as tf
import cv2
import sys
sys.path.append("game/")
import wrapped_flappy_bird as game
import random
import numpy as np
from collections import deque

GAME = 'bird' # the name of the game being played for log files
ACTIONS = 2 # number of valid actions
GAMMA = 0.99 # decay rate of past observations
OBSERVE = 100000. # timesteps to observe before training
EXPLORE = 2000000. # frames over which to anneal epsilon
FINAL_EPSILON = 0.0001 # final value of epsilon
INITIAL_EPSILON = 0.0001 # starting value of epsilon
REPLAY_MEMORY = 50000 # number of previous transitions to remember
BATCH = 32 # size of minibatch
FRAME_PER_ACTION = 1

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.01)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.01, shape = shape)
    return tf.Variable(initial)

def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")

def createNetwork():
    # network weights
    W_conv1 = weight_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])

    W_conv2 = weight_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])

    W_conv3 = weight_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])

    W_fc1 = weight_variable([1600, 512])
    b_fc1 = bias_variable([512])

    W_fc2 = weight_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])

    # input layer
    s = tf.placeholder("float", [None, 80, 80, 4])

    # hidden layers
    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)

    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    #h_pool2 = max_pool_2x2(h_conv2)

    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
    #h_pool3 = max_pool_2x2(h_conv3)

    #h_pool3_flat = tf.reshape(h_pool3, [-1, 256])
    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])

    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)

    # readout layer
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2

    return s, readout, h_fc1

def trainNetwork(s, readout, h_fc1, sess):
    # define the cost function
    a = tf.placeholder("float", [None, ACTIONS])
    y = tf.placeholder("float", [None])
    readout_action = tf.reduce_sum(tf.multiply(readout, a), reduction_indices=1)
    cost = tf.reduce_mean(tf.square(y - readout_action))
    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)

    # open up a game state to communicate with emulator
    game_state = game.GameState()

    # store the previous observations in replay memory
    D = deque()

    # printing
    a_file = open("logs_" + GAME + "/readout.txt", 'w')
    h_file = open("logs_" + GAME + "/hidden.txt", 'w')

    # get the first state by doing nothing and preprocess the image to 80x80x4
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1
    x_t, r_0, terminal = game_state.frame_step(do_nothing)
    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, x_t = cv2.threshold(x_t,1,255,cv2.THRESH_BINARY)
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    # saving and loading networks
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    checkpoint = tf.train.get_checkpoint_state("saved_networks")
    if checkpoint and checkpoint.model_checkpoint_path:
        saver.restore(sess, checkpoint.model_checkpoint_path)
        print("Successfully loaded:", checkpoint.model_checkpoint_path)
    else:
        print("Could not find old network weights")

    # start training
    epsilon = INITIAL_EPSILON
    t = 0
    while "flappy bird" != "angry bird":
        # choose an action epsilon greedily
        readout_t = readout.eval(feed_dict={s : [s_t]})[0]
        a_t = np.zeros([ACTIONS])
        action_index = 0
        if t % FRAME_PER_ACTION == 0:
            if random.random() <= epsilon:
                print("----------Random Action----------")
                action_index = random.randrange(ACTIONS)
                a_t[action_index] = 1
            else:
                action_index = np.argmax(readout_t)
                a_t[action_index] = 1
        else:
            a_t[0] = 1 # do nothing

        # scale down epsilon
        if epsilon > FINAL_EPSILON and t > OBSERVE:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

        # run the selected action and observe next state and reward
        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
        x_t1 = np.reshape(x_t1, (80, 80, 1))
        #s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)

        # store the transition in D
        D.append((s_t, a_t, r_t, s_t1, terminal))
        if len(D) > REPLAY_MEMORY:
            D.popleft()

        # only train if done observing
        if t > OBSERVE:
            # sample a minibatch to train on
            minibatch = random.sample(D, BATCH)

            # get the batch variables
            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            y_batch = []
            readout_j1_batch = readout.eval(feed_dict = {s : s_j1_batch})
            for i in range(0, len(minibatch)):
                terminal = minibatch[i][4]
                # if terminal, only equals reward
                if terminal:
                    y_batch.append(r_batch[i])
                else:
                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            # perform gradient step
            train_step.run(feed_dict = {
                y : y_batch,
                a : a_batch,
                s : s_j_batch}
            )

        # update the old values
        s_t = s_t1
        t += 1

        # save progress every 10000 iterations
        if t % 10000 == 0:
            saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step = t)

        # print info
        state = ""
        if t <= OBSERVE:
            state = "observe"
        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
            state = "explore"
        else:
            state = "train"

        print("TIMESTEP", t, "/ STATE", state, \
            "/ EPSILON", epsilon, "/ ACTION", action_index, "/ REWARD", r_t, \
            "/ Q_MAX %e" % np.max(readout_t))
        # write info to files
        '''
        if t % 10000 <= 100:
            a_file.write(",".join([str(x) for x in readout_t]) + '\n')
            h_file.write(",".join([str(x) for x in h_fc1.eval(feed_dict={s:[s_t]})[0]]) + '\n')
            cv2.imwrite("logs_tetris/frame" + str(t) + ".png", x_t1)
        '''

def playGame():
    sess = tf.InteractiveSession()
    s, readout, h_fc1 = createNetwork()
    trainNetwork(s, readout, h_fc1, sess)

def main():
    playGame()

if __name__ == "__main__":
    main()
```
