Classic Reinforcement Learning Algorithm Notes (21): Notes on the gym-super-mario-bros Game Environment

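Notes on the gym-super-mario-bros Game Environment

I have recently been reading papers on intrinsic reward models, where super-mario-bros is practically a standard benchmark environment for evaluating algorithms. Until now most of my attention went to Atari, so I am writing these notes for future reference.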

Introduction

Project page: https://pypi.org/project/gym-super-mario-bros/

Installation

pip install nes-py
pip install gym-super-mario-bros

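I installed it on Ubuntu; I could not get it to install on Windows. nes-py compiles a C++ extension during installation, so a working C++ compiler must be available.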

Demo

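An episode appears to end under two conditions: all 3 lives are lost, or the timer runs out. In practice, you should also set a maximum episode length.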

Gym demo

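from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT

# Create the full-game environment and restrict it to the SIMPLE_MOVEMENT action set
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)

done = True
for step in range(5000):
    # Reset at the start and whenever an episode ends
    if done:
        state = env.reset()
    # Take a random action and render the resulting frame
    state, reward, done, info = env.step(env.action_space.sample())
    env.render()

env.close()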
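
The demo above runs for a fixed 5000 steps across episodes. To cap the length of each episode, as suggested earlier, something like the following sketch could be used. It assumes the same classic Gym API as the demo (step returns state, reward, done, info). The names run_capped_episode and StubEnv are hypothetical; StubEnv is only a stand-in so the sketch runs without the emulator.

```python
import random


class StubEnv:
    """Hypothetical stand-in for a Mario env so the sketch runs without the emulator."""

    class _ActionSpace:
        def sample(self):
            return random.randrange(7)  # 7 is an arbitrary action count for the stub

    def __init__(self, episode_length):
        self.action_space = self._ActionSpace()
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return None  # a real env would return an image observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return None, 1.0, done, {"x_pos": self.t}


def run_capped_episode(env, max_steps=1000):
    """Play one episode with random actions, stopping after max_steps steps."""
    env.reset()
    total_reward, steps, done, info = 0.0, 0, False, {}
    while not done and steps < max_steps:
        _, reward, done, info = env.step(env.action_space.sample())
        total_reward += reward
        steps += 1
    return total_reward, steps, info
```

With the real environment, the env object from the demo would be passed in place of StubEnv.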

Command-line demo

gym_super_mario_bros -e <the environment ID to play> -m <`human` or `random`>

-e defaults to SuperMarioBros-v0; -m defaults to human.
In human mode, the A and D keys move left and right, O jumps, and holding O makes a higher jump.

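Environments

The agent has 3 lives to play through all 32 stages. The emulator strips out every frame that is unrelated to the agent's actions, such as cutscenes.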

Environment          Game   ROM
SuperMarioBros-v0    SMB    standard
SuperMarioBros-v1    SMB    downsample
SuperMarioBros-v2    SMB    pixel
SuperMarioBros-v3    SMB    rectangle
SuperMarioBros2-v0   SMB2   standard
SuperMarioBros2-v1   SMB2   downsample

Individual stages

You can also train on a single stage by using an environment ID of the following form:

SuperMarioBros-<world>-<stage>-v<version>
  • <world> is a number in {1, 2, 3, 4, 5, 6, 7, 8} indicating the world
  • <stage> is a number in {1, 2, 3, 4} indicating the stage within a world
  • <version> is a number in {0, 1, 2, 3} specifying the ROM mode to use
    • 0: standard ROM
    • 1: downsampled ROM
    • 2: pixel ROM
    • 3: rectangle ROM

For example, to play 4-2 on the downsampled ROM, you would use the environment id SuperMarioBros-4-2-v1.
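
When sweeping over many stages, it can help to build these IDs programmatically. The helper below is a small sketch; make_env_id is a hypothetical name, not part of the library, and it only validates the ranges listed above.

```python
def make_env_id(world, stage, version=0):
    """Build a single-stage environment ID such as 'SuperMarioBros-4-2-v1'."""
    if world not in range(1, 9):
        raise ValueError("world must be in 1..8")
    if stage not in range(1, 5):
        raise ValueError("stage must be in 1..4")
    if version not in range(0, 4):
        raise ValueError("version must be in 0..3")
    return f"SuperMarioBros-{world}-{stage}-v{version}"


# All 32 stage IDs on the standard ROM
ALL_STAGE_IDS = [make_env_id(w, s) for w in range(1, 9) for s in range(1, 5)]
```

The resulting string would then be passed to gym_super_mario_bros.make, as in the Gym demo.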

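Random stage selection

This environment picks a stage at random and gives the agent a single life; after the agent dies and reset is called, another stage is picked at random. This feature is only available for SMB, not SMB2.
Example environment ID: SuperMarioBrosRandomStages-v0
To set the seed, call env.seed(1) before calling reset.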

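Reward function

The reward function assumes that the goal of the game is to move to the right as fast as possible (increasing the agent's x value) without dying. To model this, the reward is composed of three separate variables: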

  1. v: the difference in agent x values between states
    • in this case this is instantaneous velocity for the given step
    • v = x1 - x0
      • x0 is the x position before the step
      • x1 is the x position after the step
    • moving right ⇔ v > 0
    • moving left ⇔ v < 0
    • not moving ⇔ v = 0
  2. c: the difference in the game clock between frames
    • the penalty prevents the agent from standing still
    • c = c1 - c0 (the clock counts down, so a tick makes c negative)
      • c0 is the clock reading before the step
      • c1 is the clock reading after the step
    • no clock tick ⇔ c = 0
    • clock tick ⇔ c < 0
  3. d: a death penalty that penalizes the agent for dying in a state
    • this penalty encourages the agent to avoid death
    • alive ⇔ d = 0
    • dead ⇔ d = -15

r = v + c + d
The reward is clipped into the range (-15, 15).
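
The formula can be sketched in a few lines of Python. This is an illustration of the description above, not the library's own code. The function name mario_reward is hypothetical, c is taken as the after-minus-before clock difference so that a tick gives a negative value (an assumption made to match "clock tick ⇔ c < 0"), and the clipping bounds are treated as inclusive.

```python
def mario_reward(x0, x1, clock0, clock1, dead):
    """Per-step reward r = v + c + d, clipped to [-15, 15]."""
    v = x1 - x0              # horizontal progress during the step
    c = clock1 - clock0      # assumed sign: 0 if no tick, negative on a tick
    d = -15 if dead else 0   # death penalty
    r = v + c + d
    return max(-15, min(15, r))  # assumed inclusive clipping
```

For example, moving 3 pixels right with no clock tick gives a reward of 3, while standing still during a clock tick gives -1.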

Interpreting the info dictionary

The info dictionary returned by the step method contains the following keys:

Key        Type   Description
coins      int    The number of collected coins
flag_get   bool   True if Mario reached a flag or ax
life       int    The number of lives left, i.e., {3, 2, 1}
score      int    The cumulative in-game score
stage      int    The current stage, i.e., {1, …, 4}
status     str    Mario’s status, i.e., {‘small’, ‘tall’, ‘fireball’}
time       int    The time left on the clock
world      int    The current world, i.e., {1, …, 8}
x_pos      int    Mario’s x position in the stage (from the left)
y_pos      int    Mario’s y position in the stage (from the bottom)
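
These keys make it easy to detect events between two consecutive steps, for example when logging or shaping rewards. The sketch below compares two info dictionaries. The function name detect_events and the event labels are hypothetical; only the dictionary keys come from the table above.

```python
def detect_events(prev_info, info):
    """Return a list of event labels derived from two consecutive info dicts."""
    events = []
    if info["life"] < prev_info["life"]:
        events.append("lost_life")
    if info["flag_get"]:
        events.append("reached_flag")
    if info["coins"] > prev_info["coins"]:
        events.append("got_coin")
    if (info["world"], info["stage"]) != (prev_info["world"], prev_info["stage"]):
        events.append("new_stage")
    return events


# Made-up sample values, for illustration only
before = {"coins": 0, "flag_get": False, "life": 3, "world": 1, "stage": 1}
after = {"coins": 1, "flag_get": False, "life": 2, "world": 1, "stage": 1}
sample_events = detect_events(before, after)
```

In a training loop, prev_info would be the info returned by the previous call to step.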
