OpenAI's Gym has now become Gymnasium. This post is mainly a translation of the Gymnasium documentation. It is mostly machine translated, with light manual editing; there are certainly mistranslations, which can be corrected from context.
Documentation: https://gymnasium.farama.org/content/basic_usage/
Contents
1 INTRODUCTION
1.1 Basic Usage
1.1.1 Initializing Environments
1.1.2 Interacting with the Environment
Explaining the code
1.1.3 Action and observation spaces
1.1.4 Modifying the environment
Main text
Gymnasium is a project that provides an API for all single-agent reinforcement learning environments, and includes implementations of common environments: cartpole, pendulum, mountain-car, mujoco, atari, and more.
The API contains four key functions: make, reset, step and render, which this basic usage guide will introduce you to. At the core of Gymnasium is Env, a high-level Python class representing a Markov decision process (MDP) from reinforcement learning theory (this is not a perfect reconstruction, and is missing several components of MDPs). Within Gymnasium, environments (MDPs) are implemented as Env classes, along with Wrappers that can change the results passed to the user.
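To make the reset/step contract of an Env concrete, here is a minimal, self-contained sketch in plain Python. The CoinFlipEnv class and its dynamics are invented for illustration; it is not a gymnasium.Env subclass, it just mirrors the same interface described above.

```python
import random

class CoinFlipEnv:
    """Invented toy MDP sketching the Env-style reset/step contract.

    Not a gymnasium.Env subclass; it only mirrors the same call/return
    shapes that gymnasium environments follow.
    """

    def reset(self, seed=None):
        # Returns the first observation and an (empty) info dict.
        self._rng = random.Random(seed)
        self._steps = 0
        observation = 0
        return observation, {}

    def step(self, action):
        # Apply the action, advance one timestep, and report the result.
        self._steps += 1
        observation = self._rng.randint(0, 1)
        reward = 1.0 if action == observation else 0.0
        terminated = False              # this toy task has no terminal state
        truncated = self._steps >= 10   # but episodes are cut off after 10 steps
        return observation, reward, terminated, truncated, {}

env = CoinFlipEnv()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(1)
```

Real gymnasium environments return the same five-tuple from step and the same (observation, info) pair from reset.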
Initializing environments is very easy in Gymnasium and can be done via the make function:
import gymnasium as gym
env = gym.make('CartPole-v1')
This will return an Env for users to interact with. To see all environments you can create, use gymnasium.envs.registry.keys(). make includes a number of additional parameters for adding wrappers, specifying keywords for the environment, and more.
The classic "agent-environment loop" (pictured in the original documentation) is a simplified representation of reinforcement learning that Gymnasium implements.
This loop is implemented using the following Gymnasium code:
import gymnasium as gym
env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # agent policy that uses the observation and info
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()

env.close()
First, an environment is created using make with an additional keyword "render_mode" that specifies how the environment should be visualized. See render for details on the default meaning of different render modes. In this example, we use the "LunarLander" environment where the agent controls a spaceship that needs to land safely.
After initializing the environment, we reset the environment to get the first observation of the environment. For initializing the environment with a particular random seed or options (see the environment documentation for possible values), use the seed or options parameters with reset.
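The effect of seeding can be sketched without gymnasium itself: a seeded reset makes the episode's randomness reproducible. The ToyEnv below is an invented stand-in, not a gymnasium environment; with a real env you would call env.reset(seed=42) in the same way.

```python
import random

class ToyEnv:
    # Invented stand-in: a seeded reset fixes the stream of random observations.
    def reset(self, seed=None):
        self._rng = random.Random(seed)
        return self._rng.random(), {}

    def step(self, action):
        return self._rng.random(), 0.0, False, False, {}

env = ToyEnv()
obs_a, _ = env.reset(seed=42)
obs_b, _ = env.reset(seed=42)  # same seed -> same first observation
obs_c, _ = env.reset(seed=7)   # different seed -> different observation stream
```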
Next, the agent performs an action in the environment with step. This can be imagined as moving a robot or pressing a button on a game controller, causing a change within the environment. As a result, the agent receives a new observation from the updated environment along with a reward for taking the action. This reward could, for instance, be positive for destroying an enemy or negative for moving into lava. One such action-observation exchange is referred to as a timestep.
However, after some timesteps, the environment may end; this is called the terminal state. For instance, the robot may have crashed, or the agent may have succeeded in completing a task, and the environment will need to stop as the agent cannot continue. In Gymnasium, if the environment has terminated, this is returned by step. Similarly, we may also want the environment to end after a fixed number of timesteps; in this case, the environment issues a truncated signal. If either terminated or truncated is true, then reset should be called next to restart the environment.
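The distinction between the two signals can be sketched with an invented one-dimensional walk (not a gymnasium environment): reaching the goal is a genuine terminal state of the MDP and sets terminated, while hitting an artificial step budget sets truncated.

```python
class WalkEnv:
    """Invented 1-D walk: position starts at 0, the goal is at +3.

    terminated: the agent reached the goal (a true terminal state).
    truncated:  the episode hit an artificial step limit (here, 5 steps).
    """

    GOAL, MAX_STEPS = 3, 5

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos, {}

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.pos += action
        self.steps += 1
        terminated = self.pos == self.GOAL
        truncated = not terminated and self.steps >= self.MAX_STEPS
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}

env = WalkEnv()
obs, info = env.reset()
# Walking right three times reaches the goal -> terminated, not truncated.
for _ in range(3):
    obs, reward, terminated, truncated, info = env.step(+1)
```

Either signal means the episode is over and reset should be called before stepping again.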
Every environment specifies the format of valid actions and observations with the env.action_space and env.observation_space attributes. This is helpful for knowing both the expected input and output of the environment, as all valid actions and observations should be contained within their respective spaces.
In the example, we sampled random actions via env.action_space.sample() instead of using an agent policy that maps observations to actions, which is what users will actually want. See one of the agent tutorials for an example of creating and training an agent policy.
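The contract of a space can be sketched in plain Python. DiscreteSpace below is an invented stand-in mirroring the sample()/contains() behavior of gymnasium's Discrete space (which holds the integers 0..n-1): every sample() result must be contained in the space.

```python
import random

class DiscreteSpace:
    # Invented stand-in for gymnasium.spaces.Discrete(n):
    # holds the integers 0..n-1, supports sample() and contains().
    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        # Draw a uniformly random valid element of the space.
        return self._rng.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n

action_space = DiscreteSpace(2, seed=0)  # e.g. CartPole has two actions
action = action_space.sample()
```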
Every environment should have the attributes action_space and observation_space, both of which should be instances of classes that inherit from Space. Gymnasium supports the majority of spaces that users might need, including Box, Discrete, MultiBinary, MultiDiscrete, and the composite spaces Dict and Tuple.
For example usage of spaces, see their documentation along with the utility functions. There are also a couple of more niche spaces: Graph, Sequence and Text.
Wrappers are a convenient way to modify an existing environment without having to alter the underlying code directly. Using wrappers will allow you to avoid a lot of boilerplate code and make your environment more modular. Wrappers can also be chained to combine their effects. Most environments that are generated via gymnasium.make will already be wrapped by default using TimeLimit, OrderEnforcing and PassiveEnvChecker.
In order to wrap an environment, you must first initialize a base environment. Then you can pass this environment along with (possibly optional) parameters to the wrapper's constructor:
>>> import gymnasium as gym
>>> from gymnasium.wrappers import FlattenObservation
>>> env = gym.make("CarRacing-v2")
>>> env.observation_space.shape
(96, 96, 3)
>>> wrapped_env = FlattenObservation(env)
>>> wrapped_env.observation_space.shape
(27648,)
Gymnasium already provides many commonly used wrappers for you. Some examples:
TimeLimit: issues a truncated signal if a maximum number of timesteps has been exceeded.
ClipAction: clips any action passed to step so that it lies within the base environment's action space.
RescaleAction: rescales actions to lie within a different interval.
TimeAwareObservation: adds information about the index of the timestep to the observation.
For a full list of implemented wrappers in Gymnasium, see wrappers.
If you have a wrapped environment, and you want to get the unwrapped environment underneath all of the layers of wrappers (so that you can manually call a function or change some underlying aspect of the environment), you can use the .unwrapped attribute. If the environment is already a base environment, the .unwrapped attribute will just return itself.
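How .unwrapped walks back through the chain can be sketched with invented stand-ins. The BaseEnv and Wrapper classes below are not gymnasium classes; real wrappers subclass gymnasium.Wrapper but resolve .unwrapped analogously.

```python
class BaseEnv:
    # Invented stand-in for an unwrapped base environment.
    @property
    def unwrapped(self):
        return self  # a base environment just returns itself

class Wrapper:
    # Invented stand-in: holds the wrapped env, and .unwrapped
    # recurses through the chain until it reaches the base environment.
    def __init__(self, env):
        self.env = env

    @property
    def unwrapped(self):
        return self.env.unwrapped

base = BaseEnv()
wrapped = Wrapper(Wrapper(base))  # wrappers can be chained
```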
>>> wrapped_env
<FlattenObservation<TimeLimit<OrderEnforcing<PassiveEnvChecker<CarRacing<CarRacing-v2>>>>>>
>>> wrapped_env.unwrapped