Reinforcement Learning: Stochastic Policy Gradient (Policy-Based)

Being a bit lazy here, I'm borrowing someone else's figure.
[Figure 1]
A policy $\pi$ is the probability distribution over actions $a$ taken in state $s$; it can be written as:
$$\pi_{\theta}(a|s)=\pi(a|s,\theta)=\Pr\{A_t=a \mid S_t=s,\theta_t=\theta\} \tag{1}$$
where, at time $t$, with environment state $s$ and parameters $\theta$, $\Pr$ is the probability of outputting action $a$.
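
To make Eq. (1) concrete, here is a minimal sketch of one common parameterization, a tabular softmax policy: $\theta$ is a table with one preference value per (state, action) pair, and $\pi_\theta(a|s)$ is the softmax of the row for state $s$. The names `softmax`, `pi`, and `theta` below are illustrative only, not part of the game code later in this post.

import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

# theta[s] holds one preference value per action
n_states, n_actions = 5, 3
theta = np.zeros((n_states, n_actions))

def pi(a, s, theta):
    """pi_theta(a|s) as in Eq. (1)."""
    return softmax(theta[s])[a]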

The trajectory of one episode of agent-environment interaction is:
$$\tau=\{s_1,a_1,s_2,a_2,\ldots,s_T,a_T,s_{T+1}\} \tag{2}$$
From Eq. (2), the probability of trajectory $\tau$ occurring is:
$$\begin{aligned} p(\tau)&=p\{s_1,a_1,s_2,a_2,\ldots,s_T,a_T,s_{T+1}\} \\ &=p(s_1)\,p(a_1|s_1)\,p(s_2|s_1,a_1)\,p(a_2|s_2)\,p(s_3|s_2,a_2)\cdots p(a_T|s_T)\,p(s_{T+1}|s_T,a_T) \\ &=p(s_1)\prod_{t=1}^T p(a_t|s_t)\,p(s_{t+1}|s_t,a_t) \end{aligned} \tag{3}$$
Using Eq. (1), Eq. (3) can be rewritten as:
$$p_\theta(\tau)=\pi_{\theta}(\tau)=p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t) \tag{4}$$
where $p_\theta(\tau)$ denotes the probability of the trajectory under policy $\pi_\theta$ with parameters $\theta$. In the expansion, $p(s_1)$ and $p(s_{t+1}|s_t,a_t),\; t\in\{1,2,\ldots,T\}$ are produced by the environment, and therefore do not depend on $\theta$.
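
As a hedged illustration of Eq. (4) (reusing the `pi` helper from the sketch above; the explicit transition model `p_trans` is hypothetical, since the maze example later in this post is model-free and never needs it), the probability of a recorded trajectory factorizes into policy terms and environment terms:

def trajectory_prob(theta, trajectory, p_s1, p_trans):
    """
    p_theta(tau) for trajectory = [(s1, a1), (s2, a2), ..., (sT, aT), s_{T+1}], Eq. (4).
    p_s1[s] is the initial-state distribution, p_trans[s][a][s_next] the transition model;
    both belong to the environment and do not depend on theta.
    """
    steps, s_last = trajectory[:-1], trajectory[-1]
    prob = p_s1[steps[0][0]]
    for t, (s, a) in enumerate(steps):
        s_next = steps[t + 1][0] if t + 1 < len(steps) else s_last
        prob *= pi(a, s, theta) * p_trans[s][a][s_next]   # policy term * environment term
    return prob

Only the `pi(a, s, theta)` factors depend on $\theta$, which is exactly what Eq. (12) will exploit.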

The total return of one episode of agent-environment interaction is:
$$R(\tau)=r_1+r_2+\cdots=\sum_{t=1}^T r_t=\sum_{t=1}^T r(s_t,a_t) \tag{5}$$
where $r_t$, i.e. $r(s_t,a_t)$, denotes the immediate reward obtained by taking action $a_t$ in state $s_t$.

Then, with policy $\pi$ parameterized by $\theta$, the expected total return of one episode against the environment is:
$$\begin{aligned} \overline{R_\theta}&=\sum_{\tau}\pi_\theta(\tau)\,R(\tau) \\ &=E_{\tau \sim\pi_{\theta}}[R(\tau)] \end{aligned} \tag{6}$$
Some references write the sum as an integral; the sum is just a special case of the integral:
$$\begin{aligned} \overline{R_\theta}&=\int \pi_\theta(\tau)\,R(\tau)\, d\tau \\ &=E_{\tau \sim\pi_{\theta}}[R(\tau)] \end{aligned} \tag{6.1}$$
where $E_{\tau \sim\pi_{\theta}(\tau)}[R(\tau)]$ is the expectation of the total return $R$ over trajectories $\tau$ drawn from the distribution $\pi_{\theta}(\tau)$.

Our optimization objective is therefore to find parameters $\theta$ that maximize the expected total return of one episode:
$$\theta^*=\underset{\theta}{\operatorname{argmax}}\; \overline{R_\theta} \tag{7}$$
How do we find this optimal $\theta$? We use gradient-based optimization (gradient ascent on $\overline{R_\theta}$, or equivalently gradient descent on its negative). First we compute the partial derivatives of $\overline{R_\theta}$ with respect to $\theta$, i.e. the gradient of the parameters (here $\theta$ is a parameter vector, not a scalar):
$$\begin{aligned} \nabla_{\theta} \overline{R_\theta}&=\nabla_{\theta} \sum_{\tau}\pi_\theta(\tau)\,R(\tau) \\ &=\sum_{\tau}\nabla_{\theta} \pi_\theta(\tau)\,R(\tau) \end{aligned} \tag{8}$$
Also, by the chain rule for composite functions:
$$\nabla_{\theta} \log\pi_{\theta}(\tau) = \frac{1}{\pi_\theta(\tau)}\,\nabla_{\theta} \pi_{\theta}(\tau) \;\Rightarrow\; \nabla_{\theta} \pi_{\theta}(\tau) = \pi_{\theta}(\tau)\, \nabla_{\theta} \log\pi_{\theta}(\tau) \tag{9}$$
Combining Eqs. (8) and (9):
$$\begin{aligned} \nabla_{\theta} \overline{R_\theta}&=\sum_{\tau}\nabla_{\theta} \pi_\theta(\tau)\,R(\tau) \\ &=\sum_{\tau} \pi_{\theta}(\tau)\, \nabla_{\theta} \log\pi_{\theta}(\tau)\,R(\tau) \\ &=E_{\tau \sim \pi_{\theta}(\tau)}[\nabla_{\theta} \log\pi_{\theta}(\tau)\,R(\tau)] \end{aligned} \tag{10}$$
Eq. (10) shows that the gradient of the expected total return is just the expectation, over trajectories $\tau$ drawn from $\pi_{\theta}(\tau)$, of $\nabla_{\theta} \log\pi_{\theta}(\tau)\,R(\tau)$.
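
The identity behind Eqs. (9) and (10) can be sanity-checked numerically on a toy example with made-up numbers (a self-contained check, unrelated to the maze code below): treating a 3-way softmax as the distribution over "trajectories", the direct gradient $\sum_\tau \nabla_\theta\pi_\theta(\tau)R(\tau)$ matches the log-derivative form $E_{\tau\sim\pi_\theta}[\nabla_\theta\log\pi_\theta(\tau)R(\tau)]$.

import numpy as np

theta = np.array([0.2, -0.4, 0.7])   # one parameter per "trajectory"
R = np.array([1.0, 3.0, -2.0])       # made-up returns R(tau)

e = np.exp(theta - theta.max())
p = e / e.sum()                      # pi_theta(tau)

# direct form: sum_tau grad_theta pi_theta(tau) * R(tau)
# for a softmax, d p_i / d theta_j = p_i * (delta_ij - p_j)
jac = np.diag(p) - np.outer(p, p)
grad_direct = jac.T @ R

# log-derivative form: sum_tau pi_theta(tau) * grad_theta log pi_theta(tau) * R(tau)
grad_log = sum(p[i] * (np.eye(3)[i] - p) * R[i] for i in range(3))

print(np.allclose(grad_direct, grad_log))   # True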

Next we evaluate $\nabla_{\theta} \log\pi_{\theta}(\tau)$:
$$\begin{aligned} \log\pi_{\theta}(\tau) &= \log\Big[p(s_1)\prod_{t=1}^T\pi_\theta(a_t|s_t)\,p(s_{t+1}|s_t,a_t)\Big] \\ &=\log p(s_1) + \sum_{t=1}^T\log\pi_{\theta}(a_t|s_t) + \sum_{t=1}^T\log p(s_{t+1}|s_t,a_t) \end{aligned} \tag{11}$$
From Eq. (11), when taking the gradient with respect to $\theta$, both $\log p(s_1)$ and $\sum_{t=1}^T\log p(s_{t+1}|s_t,a_t)$ are independent of $\theta$, so their derivatives are $0$, which gives:
$$\nabla_{\theta} \log\pi_{\theta}(\tau)= \xcancel{\nabla_{\theta}\log p(s_1)} + \sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) + \xcancel{\sum_{t=1}^T\nabla_{\theta}\log p(s_{t+1}|s_t,a_t)} = \sum_{t=1}^T\nabla_{\theta}\log\pi_{\theta}(a_t|s_t) \tag{12}$$
Combining Eqs. (5), (10), and (12), the gradient of the final objective is:
$$\nabla_{\theta}J(\theta)=\nabla_{\theta} \overline{R_{\theta}}=E_{\tau \sim \pi_{\theta}(\tau)}\Big[\Big(\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \tag{13}$$
In fact, $\overline{R_{\theta}}$ is exactly our objective function $J(\theta)$. Since Eq. (13) is the expectation of $\big(\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\big)\big(\sum_{t=1}^T r(s_t,a_t)\big)$ over trajectories, in practice we approximate it by sampling multiple trajectories $\tau_i$. With $n$ sampled trajectories, Eq. (13) becomes:
$$\begin{aligned} \nabla_{\theta}J(\theta)=\nabla_{\theta} \overline{R_{\theta}}&=E_{\tau \sim \pi_{\theta}(\tau)}\Big[\Big(\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a_t|s_t)\Big)\Big(\sum_{t=1}^T r(s_t,a_t)\Big)\Big] \\ &\approx\frac{1}{n}\sum_{i=1}^n \Big[\Big(\sum_{t=1}^T \nabla_{\theta}\log\pi_{\theta}(a_{it}|s_{it})\Big)\Big(\sum_{t=1}^T r(s_{it},a_{it})\Big)\Big] \end{aligned} \tag{14}$$

We can then use this gradient to update $\theta$:
$$\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\theta) \tag{15}$$
where $\alpha$ is the learning rate.
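
Combining Eqs. (14) and (15), here is a minimal sketch of one update step for the tabular softmax policy from the first sketch (again, the names are illustrative; the actual implementation used in this post is the TensorFlow code further below):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def reinforce_update(theta, episodes, alpha=0.01):
    """
    One policy-gradient update, Eqs. (14)-(15).
    episodes: n sampled trajectories, each a list of (s, a, r) tuples.
    """
    grad = np.zeros_like(theta)
    for episode in episodes:
        total_return = sum(r for _, _, r in episode)       # R(tau), Eq. (5)
        for s, a, _ in episode:
            p = softmax(theta[s])
            grad_log = -p                                   # grad of log softmax(theta[s])[a]
            grad_log[a] += 1.0                              # ... equals one_hot(a) - p
            grad[s] += grad_log * total_return
    grad /= len(episodes)                                   # 1/n average, Eq. (14)
    return theta + alpha * grad                             # gradient ascent step, Eq. (15)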

How do we apply this in deep learning? We can build a neural network whose input is the state $s$ and whose output is a probability for each action $a$, and take the loss function directly as $-\log\pi_{\theta}(a_t|s_t)\,R(\tau)$. Since neural-network frameworks perform gradient descent, a negative sign is added in front of the $\log$. Also, because the loss depends on $R(\tau)$, this policy gradient method can only update the parameters after each episode finishes, which is a difference from value-based methods.
$$L(\theta)=-\log\pi_{\theta}(a_t|s_t)\, R(\tau) \tag{16}$$
From the formula: when $R(\tau)$ is large, since $-\log x$ is monotonically decreasing, $\pi_{\theta}(a_t|s_t)$ must increase to make $L(\theta)$ smaller, so minimizing the loss pushes the probability of that action up. Conversely, when $R(\tau)$ is small (or negative), there is little (or opposite) pressure to increase $\pi_{\theta}(a_t|s_t)$. In this way, valuable actions get higher probability and worthless actions get lower probability.
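
A minimal TensorFlow 2 sketch of how Eq. (16) can be used as a training step for such a network; the layer sizes, `state_dim`, and function names here are placeholders I made up, not the maze implementation that follows:

import tensorflow as tf

state_dim, n_actions = 4, 8          # placeholder sizes
policy_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(state_dim,)),
    tf.keras.layers.Dense(n_actions, activation='softmax'),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def train_on_episode(states, actions, total_return):
    """states: (T, state_dim) floats, actions: (T,) ints, total_return: scalar R(tau)."""
    actions = tf.cast(actions, tf.int32)
    with tf.GradientTape() as tape:
        probs = policy_net(states)                               # pi_theta(.|s_t) for each step
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_pi = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)    # log pi_theta(a_t|s_t)
        loss = -tf.reduce_sum(log_pi) * total_return             # Eq. (16), summed over the episode
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))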

Code example:
We implement a maze treasure-hunt game by hand. The agent (blue square) spawns at a random position on the map, one or more treasures (red squares) are placed on the map, and several traps (black squares) are set. The agent can move in the 8 surrounding directions, and its goal is to reach a treasure.
The code and screenshots are as follows:
Define the common game interface, abstract_kernel.py:

import abc


class AbstractPlayer(abc.ABC):
    """
    Player (agent)
    """

    def __init__(self):
        """
        Initialization
        """
        pass

    @abc.abstractmethod
    def reinforce(self, observation, action, reward, done, info):
        """
        Learn from experience
        :param observation: observation of the environment
        :param action:      action taken under this observation
        :param reward:      reward feedback
        :param done:        whether the episode is finished
        :param info:        extra information
        :return:
        """
        pass

    @abc.abstractmethod
    def decide(self, observation):
        """
        Decide on an action after observing the environment
        :param observation:     observed environment
        :return:
        """
        pass


class AbstractGame(abc.ABC):
    """
    Unified interface for all test game platforms
    """

    def __init__(self, rf=None, episodes=1):
        """
        Initialization
        :param rf:       reward function
        :param episodes: number of episodes to play
        """
        self._rf = rf
        self._episodes = episodes

    @abc.abstractmethod
    def start(self):
        """
        Start the game
        :return:
        """
        pass


class AbstractGameUI(abc.ABC):
    """
    UI for the game
    """

    def __init__(self, game):
        """
        Bind to a game
        :param game:    the game to bind
        """
        self._game = game

    @abc.abstractmethod
    def show(self):
        """
        Start rendering
        :return:
        """
        pass

Implement the maze treasure-hunt game, maze_kernel.py:

import numpy as np
import sys
import pygame
from games.abstract_kernel import *


class Game(AbstractGame):
    """
    Maze game
    """

    def __init__(self, player, start_func, golds_func, traps_func=None,
                 scene_size=(10, 10), max_step=0, **kwargs):
        """
        Initialization
        :param player:      the player (agent)
        :param start_func:  spawn-point generator
        :param golds_func:  treasure generator
        :param traps_func:  trap generator
        :param scene_size:  size of the scene
        :param max_step:    maximum number of steps
        :param kwargs:
        """
        super().__init__(**kwargs)

        # ************* configuration ***************
        self.__player = player
        self.__golds_func = golds_func
        self.__traps_func = traps_func
        self.__start_func = start_func
        self.__scene_size = scene_size
        if max_step > 0:
            self.__max_step = max_step
        else:
            self.__max_step = (self.__scene_size[0] + self.__scene_size[1])
        # default reward function
        if self._rf is None:
            self._rf = self.__default_reward_function
        # ************* constants ***************
        self.__actions = {
            0: (-1, 0),   # up
            1: (0, 1),    # right
            2: (1, 0),    # down
            3: (0, -1),   # left
            4: (-1, -1),  # up-left
            5: (-1, 1),   # up-right
            6: (1, 1),    # down-right
            7: (1, -1)    # down-left
        }
        # ************* runtime state ***************
        # initialize the scene
        self.__env = np.zeros(self.__scene_size)
        # current episode count
        self.__episode = 0
        # current agent position
        self.__p = [0, 0]

    def start(self):
        """
        Start the game
        :return:
        """
        while self.__episode < self._episodes:
            # count this episode
            self.__episode += 1
            self.__env = np.zeros(self.__scene_size)
            # initialize the scene: place the treasures
            golds_locations = self.__golds_func()
            if golds_locations is None or len(golds_locations) == 0: return
            for location in golds_locations:
                self.__env[location[0], location[1]] = 1
            # initialize the scene: place the traps
            if self.__traps_func is not None:
                trap_locations = self.__traps_func()
                if trap_locations is not None and len(trap_locations) > 0:
                    for location in trap_locations:
                        if self.__env[location[0], location[1]] == 1: continue
                        self.__env[location[0], location[1]] = -1
            # initialize the agent's spawn position
            self.__p = self.__start_func()
            # step counter
            step = 0
            while True:
                # one more step
                step += 1
                # ask the player for a decision
                action = self.__player.decide([np.copy(self.__env), np.copy(self.__p)])
                # evaluate the action
                reward, done, win = self._rf([np.copy(self.__env), np.copy(self.__p)], action, step)
                # feed the experience back to the player
                self.__player.reinforce([np.copy(self.__env), np.copy(self.__p)], action, reward, done, win)
                if done: break
                # execute the action
                self.__p = self.__p + self.__actions.get(action)


    def get_env(self):
        """
        Get the current environment state
        :return:
        """
        return [np.copy(self.__env), np.copy(self.__p)]

    def __default_reward_function(self, s, a, step):
        """
        Default reward function
        :param s:       environment observation
        :param a:       action
        :param step:    total steps taken so far
        :return:
        """
        # environment grid and agent position
        env, p = s[0], s[1]
        # shortest distance to a treasure before the move
        d = self.__min_distance_for_gold(env, p)
        # simulate the move
        n_p = p + self.__actions.get(a)
        if n_p[0] < 0 or n_p[1] < 0 or n_p[0] > self.__scene_size[0] - 1 or n_p[1] > self.__scene_size[1] - 1 or step >= self.__max_step:
            # out of bounds, or too many steps
            reward, done, info = -100 - d, True, False
        else:
            # check whether a treasure was reached
            if env[n_p[0], n_p[1]] == 1:
                reward, done, info = 100, True, True
            elif env[n_p[0], n_p[1]] == -1:
                reward, done, info = -100 - d, True, False
            else:
                # reward the change in distance to the nearest treasure
                n_d = self.__min_distance_for_gold(env, n_p)
                reward, done, info = d - n_d, False, False
        return reward, done, info

    def __min_distance_for_gold(self, env, p):
        """
        Shortest distance from the agent to any treasure
        :param env: environment grid
        :param p:   agent position
        :return:
        """
        # treasure positions
        golds = np.array(np.where(env == 1)).transpose()
        min_distance = sys.maxsize
        for g in golds:
            # Euclidean distance
            distance = ((p[0] - g[0]) ** 2 + (p[1] - g[1]) ** 2) ** 0.5
            if distance < min_distance: min_distance = distance
        return min_distance



class GameUI(AbstractGameUI):
    """
    UI rendering
    """
    def __init__(self, game, cell_size=20):
        """
        Initialize the UI
        :param game:        the game to render
        :param cell_size:   size of one grid cell in pixels
        """
        super().__init__(game)
        self.__cell_size = cell_size

    def show(self):
        """
        Render loop
        :return:
        """
        pygame.init()
        pygame.font.init()
        # create a window
        size = self._game.get_env()[0].shape
        size = [size[0] * self.__cell_size, size[1] * self.__cell_size]
        screen = pygame.display.set_mode(size)
        # set the window title
        pygame.display.set_caption("Do Something")
        while True:
            # handle events
            for event in pygame.event.get():
                if event.type == pygame.QUIT:
                    print('exit game......')
                    pygame.quit()
                    return
            # clear the previous frame
            screen.fill((255, 255, 255))
            env, p = self._game.get_env()
            # draw the scene
            for row in range(len(env)):
                # horizontal grid line
                pygame.draw.line(screen, (0, 0, 0), (0, self.__cell_size * row), (size[0], self.__cell_size * row), 1)
                for column in range(len(env[row])):
                    # vertical grid line
                    pygame.draw.line(screen, (0, 0, 0), (self.__cell_size * column, 0), (self.__cell_size * column, size[1]), 1)
                    # draw a treasure
                    if env[row, column] == 1:
                        pygame.draw.rect(screen, (220, 20, 60), [self.__cell_size * column, self.__cell_size * row, self.__cell_size, self.__cell_size], 0)
                    # draw a trap
                    if env[row, column] == -1:
                        pygame.draw.rect(screen, (0, 0, 0), [self.__cell_size * column, self.__cell_size * row, self.__cell_size, self.__cell_size], 0)
                    # draw the agent's current position
                    if row == p[0] and column == p[1]:
                        pygame.draw.rect(screen, (30, 144, 255), [self.__cell_size * column, self.__cell_size * row, self.__cell_size, self.__cell_size], 0)
                    # draw the current action probabilities
                    # to do
            # cap the frame rate
            pygame.time.Clock().tick(24)
            pygame.display.update()

Test code using the stochastic policy gradient, pg.py:

"""
Approximate the optimal policy using the stochastic policy gradient
"""
import tensorflow as tf
import os
import time
import tensorflow.python.keras.backend as K
from rl.games.maze.maze_kernel import *


class Player(AbstractPlayer):
    """
    Solve the maze using the stochastic policy gradient
    """

    def __init__(self, alpha=0.02, gamma=0.9, think_time=0.1, prob_file=None):
        """
        Initialization
        :param alpha:           learning rate
        :param gamma:           reward discount factor
        :param think_time:      thinking time per step (seconds)
        :param prob_file:       file used to persist the parameters
        """
        super().__init__()
        # ********** configuration ********
        self.alpha = alpha
        self.gamma = gamma
        self.think_time = think_time
        self.prob_file = prob_file
        # ********** runtime state ********
        # parameter (preference) table
        self.prob = None
        if self.prob_file is not None and os.path.exists(self.prob_file):
            self.prob = tf.Variable(tf.constant(np.load(self.prob_file, allow_pickle=False)))
        # win/loss record
        self.record = [0, 0]
        # trajectory buffers
        self.observations, self.actions, self.rewards = [], [], []
        # parameter optimizer
        self.optimizer = tf.keras.optimizers.Adam(self.alpha)

    def decide(self, observation):
        """
        Make a decision
        :param observation:
        :return:
        """
        if self.prob is None:
            self.prob = tf.constant(np.zeros(shape=(observation[0].shape[0], observation[0].shape[1], 8)))
            self.prob = tf.Variable(self.prob)
        time.sleep(self.think_time)
        # agent position
        agent = observation[1]
        # preference values for this cell
        vt = self.prob[agent[0], agent[1]]
        # normalize to probabilities with softmax
        p = K.softmax(vt, axis=0)
        # sample an action from the distribution (stochastic policy)
        action = np.random.choice([0, 1, 2, 3, 4, 5, 6, 7], size=1, p=p)[0]
        return action

    def reinforce(self, observation, action, reward, done, info):
        """
        Learn from the feedback
        :param observation:
        :param action:
        :param reward:
        :param done:
        :param info:
        :return:
        """
        self.observations.append(observation)
        self.actions.append(action)
        self.rewards.append(reward)
        if done:
            self.optimizer.minimize(self.loss, [self.prob])
            # persist the parameter table
            if self.prob_file is not None:
                np.save(self.prob_file, self.prob.numpy())
            # record win/loss
            if info:
                self.record[0] += 1
            else:
                self.record[1] += 1
            print(self.record)

    def loss(self):
        """
        Policy gradient loss for the recorded episode
        :return:
        """
        # discounted return accumulated for each visited (state, action)
        vt = np.zeros(self.prob.shape)
        # compute the return-to-go for every step
        for step in range(len(self.observations)):
            agent = self.observations[step][1]
            action = self.actions[step]
            R = 0
            for i in range(step, len(self.rewards)):
                R = R + self.rewards[i] * (self.gamma ** (i - step))
            vt[agent[0], agent[1], action] += R

        # normalize the preferences to probabilities with softmax
        prob = K.softmax(self.prob)
        # -log(pi)
        loss = -K.log(prob)
        # weight by the returns
        loss = loss * vt
        loss = K.sum(loss)
        self.observations, self.actions, self.rewards = [], [], []
        return loss


def start_func():
    return np.array([np.random.randint(0, 19), np.random.randint(0, 19)])


# def start_func():
#     return np.array([0, 0])

def golds_func():
    return np.array([
        [19, 19]
    ])


def traps_func():
    return np.array([
        [6, 6],
        [9, 12],
        [4, 3],
        [13, 14],
        [17, 4],
        [3, 14],
        [6, 3],
        [1, 12],
        [14, 3],
        [1, 4],
        [17, 14],
        [19, 18],

        [10, 9],
        [8, 3],
        [6, 12],
        [9, 11],
        [11, 4],
        [15, 14],
        [17, 17],
    ])


import threading

if __name__ == '__main__':
    # create the player
    player = Player(think_time=0.1, gamma=0.65, prob_file=os.path.sep.join([os.path.dirname(__file__), 'pg.npy']))
    # create the game
    game = Game(player, episodes=sys.maxsize, scene_size=(20, 20), traps_func=traps_func, start_func=start_func,
                golds_func=golds_func)
    # create the UI
    ui = GameUI(game)
    # start the game (UI runs in a separate thread)
    t = threading.Thread(target=ui.show, args=())
    t.start()
    game.start()


Final run results:
[Figure 2: screenshot of the running game]
The console prints the counts of successful and failed attempts. Training is slow at first; you can set think_time to 0.001 or remove the thinking delay entirely. After roughly 5000 total attempts, the agent's success rate improves noticeably.

In this example all the traps and treasures are fixed, so the observation only needs the agent's current position to reach the optimal solution. If the treasures and traps were dynamic, the layout of the whole maze would have to be included in the optimization. We could use a neural network to fit the policy, taking the current trap and treasure positions plus the agent's position as input and outputting the probability of each of the agent's move directions, so that the agent can avoid traps and eventually reach the treasure in a dynamic maze.
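
A hedged sketch of what such a network could look like (layer sizes and names are arbitrary choices and have not been trained on this game): the maze grid and the agent's position go in, and a softmax over the 8 move directions comes out. Its loss can again be Eq. (16), as in the TensorFlow sketch earlier.

import tensorflow as tf

def build_dynamic_maze_policy(scene_size=(20, 20), n_actions=8):
    """Policy network for a dynamic maze: input = grid layout + agent position."""
    grid_in = tf.keras.Input(shape=scene_size, name='grid')      # -1 trap, 0 empty, 1 gold
    pos_in = tf.keras.Input(shape=(2,), name='agent_position')
    x = tf.keras.layers.Flatten()(grid_in)
    x = tf.keras.layers.Concatenate()([x, pos_in])
    x = tf.keras.layers.Dense(128, activation='relu')(x)
    x = tf.keras.layers.Dense(64, activation='relu')(x)
    out = tf.keras.layers.Dense(n_actions, activation='softmax')(x)
    return tf.keras.Model(inputs=[grid_in, pos_in], outputs=out)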
