MAML-RL PyTorch Code Walkthrough (13) -- maml_rl/envs/mujoco/half_cheetah.py

Table of Contents

  • MAML-RL PyTorch Code Walkthrough (13) -- maml_rl/envs/mujoco/half_cheetah.py
      • Introduction
        • Source link
        • File path
      • `import` packages
      • `HalfCheetahEnv()` class
      • `HalfCheetahVelEnv()` class
      • `HalfCheetahDirEnv()` class

Introduction

Most of the MAML (meta-learning) code available online deals with images; there is relatively little on the reinforcement-learning side.

Since my own work is related to MAML-RL, I decided to read through some of the source code.

The original MAML code is based on TensorFlow. I found a PyTorch implementation on GitHub, and this series works through that package.

Source link

https://github.com/dragen1860/MAML-Pytorch-RL

File path

./maml_rl/envs/mujoco/half_cheetah.py

import

import numpy as np

#### Import the HalfCheetahEnv() class from the MuJoCo environments of the pip-installed gym package.
from gym.envs.mujoco import HalfCheetahEnv as HalfCheetahEnv_

HalfCheetahEnv()

class HalfCheetahEnv(HalfCheetahEnv_):
    
    #### Get the observation: the joint positions (dropping the root x coordinate), the joint velocities, and the center of mass of the "torso" body, concatenated into a single float32 numpy array.
	def _get_obs(self):
		return np.concatenate([
			self.sim.data.qpos.flat[1:],
			self.sim.data.qvel.flat,
			self.get_body_com("torso").flat,
		]).astype(np.float32).flatten()

    #### Set up the in-simulator camera. Look up the id of the 'track' camera, switch the viewer to fixed-camera mode (cam.type = 2), attach it to that camera, and place it at 0.35 times the model's extent; the on-screen overlay is hidden.
	def viewer_setup(self):
		camera_id = self.model.camera_name2id('track')
		self.viewer.cam.type = 2
		self.viewer.cam.fixedcamid = camera_id
		self.viewer.cam.distance = self.model.stat.extent * 0.35
		# Hide the overlay
		self.viewer._hide_overlay = True

    #### Rendering. In 'rgb_array' mode, render through the viewer, read back a 500x500 image (the window size used by old mujoco-py), and return the pixel data. In 'human' mode, just render the viewer window and return nothing.
	def render(self, mode='human'):
		if mode == 'rgb_array':
			self._get_viewer().render()
			# window size used for old mujoco-py:
			width, height = 500, 500
			data = self._get_viewer().read_pixels(width, height, depth=False)
			return data
		elif mode == 'human':
			self._get_viewer().render()
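
As a sanity check of _get_obs(): with the standard half_cheetah.xml model (9 position and 9 velocity coordinates), the observation has 8 + 9 + 3 = 20 dimensions. Below is a minimal sketch of driving this class directly; it assumes mujoco-py and an old-style gym are installed, and the import path is only an assumption about where this file lives in the repo:

# Assumed import path; adjust to wherever half_cheetah.py sits in your checkout.
from maml_rl.envs.mujoco.half_cheetah import HalfCheetahEnv

env = HalfCheetahEnv()
obs = env.reset()
# 8 (qpos[1:]) + 9 (qvel) + 3 (torso center of mass) = 20 with the standard model
print(obs.shape, obs.dtype)            # expected: (20,) float32

action = env.action_space.sample()     # random control input
next_obs, reward, done, info = env.step(action)
print(reward, done)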

HalfCheetahVelEnv()

class HalfCheetahVelEnv(HalfCheetahEnv):
    
    #### This class is the half-cheetah environment with a target velocity, inheriting from HalfCheetahEnv(). The reward consists of a control cost and a penalty on the gap between the current velocity and the target velocity. Target velocities are sampled from the uniform distribution on [0, 2].
	"""
	Half-cheetah environment with target velocity, as described in [1]. The
	code is adapted from
	https://github.com/cbfinn/maml_rl/blob/9c8e2ebd741cb0c7b8bf2d040c4caeeb8e06cc95/rllab/envs/mujoco/half_cheetah_env_rand.py

	The half-cheetah follows the dynamics from MuJoCo [2], and receives at each 
	time step a reward composed of a control cost and a penalty equal to the 
	difference between its current velocity and the target velocity. The tasks 
	are generated by sampling the target velocities from the uniform 
	distribution on [0, 2].

	[1] Chelsea Finn, Pieter Abbeel, Sergey Levine, "Model-Agnostic 
		Meta-Learning for Fast Adaptation of Deep Networks", 2017 
		(https://arxiv.org/abs/1703.03400)
	[2] Emanuel Todorov, Tom Erez, Yuval Tassa, "MuJoCo: A physics engine for 
		model-based control", 2012 
		(https://homes.cs.washington.edu/~todorov/papers/TodorovIROS12.pdf)
	"""

    #### Take the task dict with its target-velocity entry, read the 'velocity' key (defaulting to 0.0), and call the parent constructor.
	def __init__(self, task={}):
		self._task = task
		self._goal_vel = task.get('velocity', 0.0)
		super(HalfCheetahVelEnv, self).__init__()

    #### Read the half-cheetah's x position before the action (xposbefore) from the simulator; run the simulation with the given action for self.frame_skip frames; then read the x position after the action (xposafter).
	def step(self, action):
		xposbefore = self.sim.data.qpos[0]
		self.do_simulation(action, self.frame_skip)
		xposafter = self.sim.data.qpos[0]

        #### The forward velocity is the displacement divided by the time step; the velocity reward is the negative absolute gap between the forward velocity and the target velocity; the control cost is proportional to the sum of squared action components, so the larger the action magnitude, the larger the control cost.
		forward_vel = (xposafter - xposbefore) / self.dt
		forward_reward = -1.0 * abs(forward_vel - self._goal_vel)
		ctrl_cost = 0.5 * 1e-1 * np.sum(np.square(action))

        #### Get the observation via _get_obs() from the HalfCheetahEnv class above. Compute the reward. done is hard-coded to False, i.e. the episode never terminates on its own. infos records the current task and the individual reward terms. Finally, return the (observation, reward, done, infos) tuple.
		observation = self._get_obs()
		reward = forward_reward - ctrl_cost
		done = False
		infos = dict(reward_forward=forward_reward,
		             reward_ctrl=-ctrl_cost, task=self._task)
		return (observation, reward, done, infos)

	def sample_tasks(self, num_tasks):
        #### Sample num_tasks target velocities from the uniform distribution on [0.0, 2.0], and wrap each one in a task dict under the 'velocity' key.
		velocities = self.np_random.uniform(0.0, 2.0, size=(num_tasks,))
		tasks = [{'velocity': velocity} for velocity in velocities]
		return tasks

	def reset_task(self, task):
        #### Reset the task: store the task dict and update the target velocity.
		self._task = task
		self._goal_vel = task['velocity']
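
Putting the pieces of HalfCheetahVelEnv together, each step yields reward = -|forward_vel - goal_vel| - 0.05 * sum(action**2), the episode never terminates on its own, and switching tasks only changes self._goal_vel. Below is a small usage sketch of the sample_tasks / reset_task / step interface; the seed, number of tasks, and rollout length are arbitrary illustrative choices, and the import path is an assumption:

from maml_rl.envs.mujoco.half_cheetah import HalfCheetahVelEnv  # assumed path

env = HalfCheetahVelEnv()
env.seed(0)                               # old gym API; also seeds self.np_random

# Each task is a dict like {'velocity': v} with v drawn from U[0, 2].
tasks = env.sample_tasks(num_tasks=3)

for task in tasks:
    env.reset_task(task)                  # switch the target velocity in place
    obs = env.reset()
    ret = 0.0
    for _ in range(10):                   # short random rollout, just for illustration
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        ret += reward
    print(task, ret, info['reward_forward'], info['reward_ctrl'])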

HalfCheetahDirEnv()

class HalfCheetahDirEnv(HalfCheetahEnv):
    
    #### This class is the half-cheetah environment with a target direction, inheriting from HalfCheetahEnv(). The reward consists of a control cost and a reward for velocity along the target direction. The direction is sampled from a Bernoulli distribution, with the forward and backward directions each having probability 0.5.
	"""
	Half-cheetah environment with target direction, as described in [1]. The
	code is adapted from
	https://github.com/cbfinn/maml_rl/blob/9c8e2ebd741cb0c7b8bf2d040c4caeeb8e06cc95/rllab/envs/mujoco/half_cheetah_env_rand_direc.py

	The half-cheetah follows the dynamics from MuJoCo [2], and receives at each 
	time step a reward composed of a control cost and a reward equal to its 
	velocity in the target direction. The tasks are generated by sampling the 
	target directions from a Bernoulli distribution on {-1, 1} with parameter 
	0.5 (-1: backward, +1: forward).

	[1] Chelsea Finn, Pieter Abbeel, Sergey Levine, "Model-Agnostic 
		Meta-Learning for Fast Adaptation of Deep Networks", 2017 
		(https://arxiv.org/abs/1703.03400)
	[2] Emanuel Todorov, Tom Erez, Yuval Tassa, "MuJoCo: A physics engine for 
		model-based control", 2012 
		(https://homes.cs.washington.edu/~todorov/papers/TodorovIROS12.pdf)
	"""

    #### Take the task dict with its target-direction entry, read the 'direction' key (defaulting to +1), and call the parent constructor.
	def __init__(self, task={}):
		self._task = task
		self._goal_dir = task.get('direction', 1)
		super(HalfCheetahDirEnv, self).__init__()

    #### Read the half-cheetah's x position before the action (xposbefore) from the simulator; run the simulation with the given action for self.frame_skip frames; then read the x position after the action (xposafter).
	def step(self, action):
		xposbefore = self.sim.data.qpos[0]
		self.do_simulation(action, self.frame_skip)
		xposafter = self.sim.data.qpos[0]

        #### The forward velocity is the displacement divided by the time step; multiplying it by the +1/-1 target direction gives the velocity reward; the control cost is proportional to the sum of squared action components, so the larger the action magnitude, the larger the control cost.
		forward_vel = (xposafter - xposbefore) / self.dt
		forward_reward = self._goal_dir * forward_vel
		ctrl_cost = 0.5 * 1e-1 * np.sum(np.square(action))

        #### Get the observation via _get_obs() inherited from HalfCheetahEnv above. Compute the reward. done is hard-coded to False, i.e. the episode never terminates on its own. infos records the current task and the individual reward terms. Return the (observation, reward, done, infos) tuple.
		observation = self._get_obs()
		reward = forward_reward - ctrl_cost
		done = False
		infos = dict(reward_forward=forward_reward,
		             reward_ctrl=-ctrl_cost, task=self._task)
		return (observation, reward, done, infos)

    #### Sample num_tasks directions from a Bernoulli(0.5) distribution over {0, 1}, map them to {-1, +1} via 2*b - 1, and wrap each one in a task dict under the 'direction' key.
	def sample_tasks(self, num_tasks):
		directions = 2 * self.np_random.binomial(1, p=0.5, size=(num_tasks,)) - 1
		tasks = [{'direction': direction} for direction in directions]
		return tasks

    #### Reset the task: store the task dict and update the target direction.
	def reset_task(self, task):
		self._task = task
		self._goal_dir = task['direction']
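
Compared with HalfCheetahVelEnv, only the task key and the velocity term change: sample_tasks() draws b from a Bernoulli(0.5) over {0, 1} and maps it to a direction 2*b - 1 in {-1, +1}, and the velocity reward is goal_dir * forward_vel, so moving forward is rewarded only when the sampled direction is +1. Here is a standalone numpy check of that mapping (no MuJoCo needed; the fixed seed and the sample velocity are just illustrative):

import numpy as np

rng = np.random.RandomState(0)

# Same construction as sample_tasks(): Bernoulli over {0, 1}, mapped to {-1, +1}.
directions = 2 * rng.binomial(1, p=0.5, size=(8,)) - 1
print(directions)                       # every entry is -1 (backward) or +1 (forward)

# Sign of the velocity reward for a hypothetical forward velocity.
forward_vel = 1.3
for goal_dir in (-1, +1):
    print(goal_dir, goal_dir * forward_vel)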
