Vectorized environments
A vectorized environment runs multiple independent copies of the same environment together: it takes a batch of actions as input and returns a batch of observations. Vectorizing environments is very useful during training. Two classes are provided:
gym.vector.SyncVectorEnv: the environment copies are executed sequentially.
gym.vector.AsyncVectorEnv: the environment copies are executed in parallel with Python's multiprocessing module, each copy running in its own process.
Such vectorized environments can be created with the gym.vector.make function, whose signature is as follows:
gym.vector.make(
id: str,
num_envs: int = 1,
asynchronous: bool = True,
    wrappers: Union[Callable, List[Callable], None] = None,
    disable_env_checker: Union[bool, None] = None,
**kwargs,
) -> gym.vector.vector_env.VectorEnv
num_envs: the number of environment copies to run together.
asynchronous: whether the copies interact asynchronously (True means asynchronous, parallel execution).
wrappers: wrapper(s) applied to each environment copy; either every copy gets them or none does (a sketch follows the first example below).
disable_env_checker: whether to disable gym's environment checker, which is only run on the first environment copy (True disables it, None falls back to the default in the environment spec).
kwargs: keyword arguments forwarded to each environment copy's constructor.
For example:
import gym
import numpy as np
#envs = gym.vector.make('CartPole-v1', num_envs=3, disable_env_checker=None)
envs = gym.vector.make('MyGymExamples:MyGymExamples/CliffWalkingEnv-v0',
                       num_envs=3,
                       disable_env_checker=False,
                       render_mode='rgb_array',  # from here on: the environment's own constructor arguments
                       map_size=(4,12),
                       pix_square_size=30)
observations, infos = envs.reset()
print('observations: ', observations)
print('infos: ', infos)
observations: OrderedDict([('agent', array([[0, 3],
[0, 3],
[0, 3]], dtype=int64)), ('target', array([[11, 3],
[11, 3],
[11, 3]], dtype=int64))])
infos: {'distance': array([11., 11., 11.]), '_distance': array([ True, True, True])}
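The wrappers argument is not used in the example above; here is a minimal sketch of applying the same wrapper to every copy, picking gym.wrappers.RecordEpisodeStatistics arbitrarily for illustration:

# sketch: the callable is applied to every sub-environment; a list of callables also works
wrapped_envs = gym.vector.make(
    "CartPole-v1",
    num_envs=3,
    wrappers=gym.wrappers.RecordEpisodeStatistics,
)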
# asynchronous (parallel)
asyn_env = gym.vector.AsyncVectorEnv([
    lambda: gym.make("Pendulum-v1", g=9.81),
    lambda: gym.make("Pendulum-v1", g=1.62)
])
# synchronous (sequential)
sync_env = gym.vector.SyncVectorEnv([
    lambda: gym.make("Pendulum-v1", g=9.81),
    lambda: gym.make("Pendulum-v1", g=1.62)
])
Note that if you create a group of environments for parallel training with gym.vector.AsyncVectorEnv, then because of how Python's multiprocessing works, the creation should be placed under the if __name__ == "__main__": guard.
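A minimal sketch of that layout, reusing the Pendulum environments from the example above:

import gym

def make_pendulum(g):
    # factory returning a thunk; each copy is built inside its own worker process
    return lambda: gym.make("Pendulum-v1", g=g)

if __name__ == "__main__":
    # creating the AsyncVectorEnv under the guard keeps the spawned worker
    # processes from re-executing this module-level code
    asyn_env = gym.vector.AsyncVectorEnv([make_pendulum(9.81), make_pendulum(1.62)])
    observations, infos = asyn_env.reset()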
Using a vectorized environment is almost identical to using an ordinary one, except that every variable gains a leading batch dimension, as the following example shows:
env = gym.vector.make('MyGymExamples:MyGymExamples/CliffWalkingEnv-v0',
num_envs=3,
disable_env_checker=False,
render_mode='rgb_array',
map_size=(4,12),
pix_square_size=30)
observation, info = env.reset()
>>> observation
OrderedDict([('agent',
array([[0, 3],
[0, 3],
[0, 3]], dtype=int64)),
('target',
array([[11, 3],
[11, 3],
[11, 3]], dtype=int64))])
>>> info
{'distance': array([11., 11., 11.]), '_distance': array([ True, True, True])}
As you can see, every field of the observation and of the info is batched along a leading dimension of length num_envs, one entry per environment copy.
Next, run a single step:
action_direction = {'noop': 0, 'right': 1, 'down': 2, 'left': 3, 'up': 4}
observation, reward, terminated, truncated, info = env.step(np.array([action_direction['up'],     # this copy moves one cell up
                                                                      action_direction['down'],   # this copy is blocked by the map's lower boundary and stays at the start
                                                                      action_direction['right']])) # this copy falls into the cliff
>>> observation
OrderedDict([('agent',
array([[0, 2],
[0, 3],
[0, 3]], dtype=int64)),
('target',
array([[11, 3],
[11, 3],
[11, 3]], dtype=int64))])
>>> reward
array([ -1, -1, -100])
>>> terminated
array([False, False, False])
>>> truncated
array([False, False, True])
>>> info
{'distance': array([12., 11., 11.]),
'_distance': array([ True, True, True]),
'final_observation': array([None, None, {'agent': array([1, 3]), 'target': array([11, 3])}],
dtype=object),
'_final_observation': array([False, False, True]),
'final_info': array([None, None, {'distance': 10.0}], dtype=object),
'_final_info': array([False, False, True])}
As you can see, the third copy, which fell into the cliff, has been reset automatically: its agent position in the returned observation is back at the start. In addition, the info returned by env.step contains several extra final_observation and final_info fields, which record the observation and info of the terminal state of each copy just before it was reset.
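A small sketch of how a training loop might pick up those terminal observations, assuming the info layout printed above:

import numpy as np

# the boolean mask _final_observation marks the copies that were auto-reset in this step
for i in np.where(info["_final_observation"])[0]:
    terminal_obs = info["final_observation"][i]   # observation of copy i just before its reset
    terminal_info = info["final_info"][i]
    print(f"copy {i} terminated with observation {terminal_obs} and info {terminal_info}")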
A vectorized environment also exposes batched observation_space and action_space attributes (instances of gym.Space subclasses); these spaces are inferred automatically from the wrapped environment copies:
>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
>>> envs.observation_space
Box([[-4.8 ...]], [[4.8 ...]], (3, 4), float32)
>>> envs.action_space
MultiDiscrete([2 2 2])
If the sub-environments do not all share the same observation and action spaces, construction fails:
>>> envs = gym.vector.AsyncVectorEnv([
... lambda: gym.make("CartPole-v1"),
... lambda: gym.make("MountainCar-v0")
...])
RuntimeError: Some environments have an observation space different from `Box([-4.8 ...], [4.8 ...], (4,), float32)`.
In order to batch observations, the observation spaces from all environments must be equal.
Use VectorEnv.single_observation_space and VectorEnv.single_action_space to get the observation and action spaces of a single sub-environment copy; these are commonly used to set parameter sizes of the policy model:
>>> envs = gym.vector.make("CartPole-v1", num_envs=3)
>>> envs.single_observation_space
Box([-4.8 ...], [4.8 ...], (4,), float32)
>>> envs.single_action_space
Discrete(2)
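For instance, a hypothetical sketch (using a small PyTorch MLP, which is not part of the original example) that sizes a policy network from these single-copy spaces:

import numpy as np
import torch.nn as nn

obs_dim = int(np.prod(envs.single_observation_space.shape))  # 4 for CartPole-v1
n_actions = envs.single_action_space.n                       # 2 for CartPole-v1
policy = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)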
A gym.vector.AsyncVectorEnv runs each environment copy in a separate process. On every call to AsyncVectorEnv.reset() or AsyncVectorEnv.step(), the observations of all parallel copies are sent back to the main process. This inter-process data transfer is expensive, and the problem is especially pronounced for high-dimensional observations. To reduce this cost, gym.vector.AsyncVectorEnv uses shared memory between processes by default (shared_memory=True), which minimizes the inter-process transfer and increases the throughput of the vectorized environment.
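A minimal sketch of toggling this behaviour through the shared_memory constructor argument:

# shared_memory=True (the default) writes observations into a shared buffer,
# while shared_memory=False sends them back through pipes instead
envs_shm = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)],
    shared_memory=True,
)
envs_pipe = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(4)],
    shared_memory=False,
)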
If an exception is raised inside one of the sub-environments, AsyncVectorEnv shuts down the corresponding worker and re-raises the exception in the main process, as the following example shows:
class ErrorEnv(gym.Env):
observation_space = gym.spaces.Box(-1., 1., (2,), np.float32)
action_space = gym.spaces.Discrete(2)
def reset(self):
return np.zeros((2,), dtype=np.float32), {}
def step(self, action):
if action == 1:
raise ValueError("An error occurred.")
observation = self.observation_space.sample()
        return (observation, 0., False, False, {})
>>> envs = gym.vector.AsyncVectorEnv([lambda: ErrorEnv()] * 3)
>>> observations, infos = envs.reset()
>>> observations, rewards, terminateds, truncateds, infos = envs.step(np.array([0, 0, 1]))
ERROR: Received the following error from Worker-2: ValueError: An error occurred.
ERROR: Shutting down Worker-2.
ERROR: Raising the last exception back to the main process.
ValueError: An error occurred.
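Since the exception surfaces in the main process, it can be caught there like any other exception; a small sketch:

envs = gym.vector.AsyncVectorEnv([lambda: ErrorEnv()] * 3)
observations, infos = envs.reset()
try:
    envs.step(np.array([0, 0, 1]))
except ValueError as err:
    # the worker's exception is re-raised here in the main process
    print("caught in main process:", err)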
This section compares the running speed of an ordinary environment (gym.Env), a synchronous vectorized environment (gym.vector.SyncVectorEnv) and an asynchronous vectorized environment (gym.vector.AsyncVectorEnv). Note that the slower a single interaction step of the base environment is (i.e. the longer .step() takes), the larger the efficiency gain from organizing copies to run in parallel, so we first define an environment whose single step is deliberately slow:
class SlowEnv(gym.Env):
    # arbitrary observation and action spaces, just for the demo
observation_space = gym.spaces.Dict({
"position": gym.spaces.Box(-1., 1., (3,), np.float32),
"velocity": gym.spaces.Box(-1., 1., (2,), np.float32)
})
action_space = gym.spaces.Dict({
"fire": gym.spaces.Discrete(2),
"jump": gym.spaces.Discrete(2),
"acceleration": gym.spaces.Box(-1., 1., (2,), np.float32)
})
def reset(self):
return self.observation_space.sample(), {}
def step(self, action):
i = 0
for _ in range(500000): i+= 1 # make it slow
observation = self.observation_space.sample()
return (observation, 0., False, False, {})
Create an ordinary environment and the two kinds of vectorized environments:
env = SlowEnv()
asyn_envs = gym.vector.AsyncVectorEnv([
lambda: SlowEnv(),
lambda: SlowEnv(),
lambda: SlowEnv(),
])
sync_envs = gym.vector.SyncVectorEnv([
lambda: SlowEnv(),
lambda: SlowEnv(),
lambda: SlowEnv(),
])
Most of the time in RL training is spent on environment interaction, so we now compare the interaction speed of the three environments, i.e. the speed of .step(). The %timeit magic of a Jupyter notebook makes this kind of timing comparison convenient; the results are as follows:
>>> %timeit -n 100 -r 2 env.step(env.action_space.sample())
26.5 ms ± 831 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
>>> %timeit -n 100 -r 2 asyn_envs.step(asyn_envs.action_space.sample())
30.4 ms ± 177 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
>>> %timeit -n 100 -r 2 sync_envs.step(sync_envs.action_space.sample())
78.9 ms ± 136 µs per loop (mean ± std. dev. of 2 runs, 100 loops each)
As you can see, stepping the three asynchronously vectorized environments takes only slightly longer than stepping a single environment, whereas stepping the three synchronously vectorized environments takes roughly three times as long as a single environment, so parallel execution can greatly improve training efficiency. For more about the %timeit magic, see "Jupyter Notebook %timeit 功能详解 Python 代码执行时间".