[Reinforcement Learning] Q-Learning for 2D Space Exploration [Python Implementation]

Preface

This post builds on my earlier work, so if anything is unclear, please refer to the previous article.

  • [Reinforcement Learning] Q-Learning Explained, with a Python Implementation [80 Lines of Code]

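This post does two main things:

  1. It extends the trivially simple treasure-on-right problem from the previous article to two dimensions. The treasure is placed at a random position, and a few cells are marked as pits, meaning that reaching one of them yields a negative reward.
  2. It restructures the code so that the whole project is clearer and easier to reuse.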
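
The layout described in item 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used later in this post: the names `make_reward_grid`, `pit_count` and `seed` are hypothetical, and the sketch assumes the agent starts at cell (0, 0), which is therefore kept empty.

```python
import numpy as np


def make_reward_grid(shape=(3, 4), pit_count=2, seed=0):
    """Return a grid with pit_count cells set to -1 and one cell set to +1."""
    rng = np.random.default_rng(seed)
    grid = np.zeros(shape)
    # Draw distinct flat indices, skipping index 0 so the assumed
    # start cell (0, 0) stays empty.
    cells = rng.choice(np.arange(1, grid.size), size=pit_count + 1, replace=False)
    rows, cols = np.unravel_index(cells, shape)
    grid[rows[:-1], cols[:-1]] = -1  # pits: negative reward
    grid[rows[-1], cols[-1]] = 1     # treasure: positive reward
    return grid


grid = make_reward_grid()
print(grid)
```

Sampling without replacement guarantees that pits and treasure never share a cell, so no retry loop is needed.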

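In fact, mofan has already covered the same idea on GitHub, in Q_Learning_maze.

The details of each implementation may differ, but the underlying idea is essentially the same.

I also made a few small innovations or changes of my own: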

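  • Stepping out of bounds is also given a negative reward, -1.
    • Q-Learning falls into local optima fairly easily. When the map contains both positive and negative rewards, a few negative rewards are enough to make the agent prefer bumping into the boundary over and over, in other words staying where it is, because that move earns a reward of 0 rather than a penalty. Such moves accomplish nothing and simply waste time. Training is slow enough already, so it is worth removing this dead end.
  • I chose not to build a graphical interface, since I felt it would add little beyond looking nice. I still wrapped the environment in a class, though. This kind of abstraction through encapsulation is a good habit and worth learning.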
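
To make the out-of-bounds penalty concrete, here is a minimal standalone sketch of a single environment step. The names `MOVES`, `step` and `out_of_bounds_reward` are illustrative assumptions and are not part of the files below; the logic mirrors the idea described above, where a move off the grid leaves the agent in place but costs -1.

```python
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1)}


def step(position, action, rewards, out_of_bounds_reward=-1):
    """Return (next_position, reward, done) for one move on a 2D reward grid."""
    d_row, d_col = MOVES[action]
    row, col = position[0] + d_row, position[1] + d_col
    inside = 0 <= row < len(rewards) and 0 <= col < len(rewards[0])
    if not inside:
        # The agent stays where it is but is penalised for the wasted move.
        return position, out_of_bounds_reward, False
    reward = rewards[row][col]
    # Any nonzero cell (pit or treasure) ends the episode.
    return (row, col), reward, reward != 0


rewards = [
    [0, 0, -1],
    [0, 0, 1],
]
print(step((0, 0), 'u', rewards))  # tries to leave through the top edge
print(step((0, 0), 'r', rewards))  # ordinary move onto an empty cell
print(step((1, 1), 'r', rewards))  # reaches the treasure
```

With `out_of_bounds_reward=0` the first call would look exactly like a harmless move, which is the situation the penalty is meant to avoid.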

The full code is below.

Save the three files under the names given, then run the following command.

python treasure_maze_main.py
  • util.py
import time
import numpy as np


class Maze(object):
    def __init__(self, shape=None, hell_num=2):
        # Fall back to a 5x5 grid unless shape is a (rows, cols) pair.
        if (shape is None) or (not isinstance(shape, (tuple, list))) or (len(shape) != 2):
            shape = (5, 5)
        self.shape = shape
        self.map = np.zeros(shape)
        self.actions = {
            'u': [-1, 0],
            'd': [1, 0],
            'l': [0, -1],
            'r': [0, 1]
        }

        for _ in range(hell_num):
            self._random_num(shape, -1)
        self._random_num(shape, 1)

        self.point = None
        self.refresh()

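    def _random_num(self, shape, v):
        # Place value v on a randomly chosen empty cell.
        n = shape[0] * shape[1]
        while True:
            # randint excludes the upper bound, so this draws from 1..n-1.
            # Index 0 maps to the start cell (0, 0) and is skipped.
            rd_num = np.random.randint(1, n)
            y = rd_num // shape[0]
            x = rd_num % shape[0]
            if self.map[x][y] == 0:
                self.map[x][y] = v
                break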

    def refresh(self):
        self.point = [0, 0]

    def point_check(self, point):
        flags = [0, 1]
        for f in flags:
            if (point[f] < 0) or (point[f] >= self.shape[f]):
                return False
        return True

    def get_env_feedback(self, A):
        if A not in self.actions:
            raise Exception("Wrong Action")
        A = self.actions[A]
        point_ = [
            self.point[0] + A[0],
            self.point[1] + A[1]
        ]
        if self.point_check(point_):
            self.point = point_
            R = self.map[self.point[0]][self.point[1]]
            done = (R != 0)
        else:
            R, done = -1, False
        return self.point, R, done

    def show_matrix(self, m):
        for x in m:
            print(' '.join(list(map(lambda i: str(int(i)) if not isinstance(i, str) else i, x))))

    def update(self, done, episode, step, r=None):
        # os.system("cls")
        m = self.map.tolist()
        m[self.point[0]][self.point[1]] = 'x'
        self.show_matrix(m)
        print("==========")
        if done:
            print("episode: %s; step: %s; reward: %s" % (episode, step, r))
            time.sleep(3)
        else:
            time.sleep(0.3)

  • RL_Brain.py
import pandas as pd
import numpy as np


class RLBrain(object):
    def __init__(self, actions, lr=0.1, gamma=0.9, epsilon=0.9):
        self.actions = actions
        # One row per visited state, one column per action; rows are added lazily.
        self.q_table = pd.DataFrame(
            columns=self.actions,
            dtype=np.float64
        )
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon

    def check_state(self, s):
        # Add an all-zero row the first time a state is seen.
        # DataFrame.append was removed in pandas 2.0, so assign through .loc instead.
        if s not in self.q_table.index:
            self.q_table.loc[s] = [0.0] * len(self.actions)

    def choose_action(self, s):
        self.check_state(s)
        state_table = self.q_table.loc[s, :]

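        # epsilon is the probability of acting greedily, so explore with probability
        # 1 - epsilon, or whenever this state has no learned values yet.
        if (np.random.uniform() >= self.epsilon) or (state_table == 0).all():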
            return np.random.choice(self.actions)
        else:
            return np.random.choice(state_table[state_table == np.max(state_table)].index)

    def learn(self, s, s_, a, r, done):
        self.check_state(s_)
        q_old = self.q_table.loc[s, a]
        if done:
            q_new = r
        else:
            q_new = r + self.gamma * self.q_table.loc[s_, :].max()
        self.q_table.loc[s, a] += self.lr * (q_new - q_old)

  • treasure_maze_main.py
from RL_Brain import RLBrain
from util import Maze

if __name__ == '__main__':
    ALPHA = 0.1
    GAMMA = 0.9
    EPSILON = 0.9
    MAX_EPISODE = 15

    env = Maze(shape=(3, 4))
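    # Pass the hyperparameters defined above instead of relying on RLBrain's defaults.
    RL = RLBrain(actions=list(env.actions.keys()), lr=ALPHA, gamma=GAMMA, epsilon=EPSILON)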
    for episode in range(MAX_EPISODE):
        env.refresh()

        s = env.point
        step_counter = 0
        done = False

        env.update(done, episode, step_counter)

        while not done:
            a = RL.choose_action(str(s))
            s_, r, done = env.get_env_feedback(a)

            RL.learn(str(s), str(s_), a, r, done)
            s = s_
            step_counter += 1
            env.update(done, episode, step_counter, r)
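
To see what the update in `RLBrain.learn` does numerically, the sketch below traces three updates by hand with the same learning rate (0.1) and discount (0.9). The helper `q_update` is a hypothetical standalone function written for this illustration; it follows the same rule as `learn` but works on plain numbers instead of the Q-table, and it assumes all Q-values start at 0.

```python
def q_update(q_old, reward, next_max, done, lr=0.1, gamma=0.9):
    """One tabular Q-learning update on plain numbers."""
    # A terminal step has no future value, so the target is just the reward.
    target = reward if done else reward + gamma * next_max
    return q_old + lr * (target - q_old)


# Stepping onto the treasure: terminal, reward 1.
q_near = q_update(q_old=0.0, reward=1, next_max=0.0, done=True)

# One cell further away: no reward yet, but the next state is now worth q_near.
q_far = q_update(q_old=0.0, reward=0, next_max=q_near, done=False)

# Bumping into the wall: reward -1, and the state does not change.
q_wall = q_update(q_old=0.0, reward=-1, next_max=0.0, done=False)

print(q_near, q_far, q_wall)
```

The three values come out at roughly 0.1, 0.009 and -0.1. The value of the treasure spreads backwards by one cell per visit, shrinking by the discount each time, while the wall bump immediately becomes less attractive than any untried action.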
