MAML-RL PyTorch Code Walkthrough (7) -- maml_rl/envs/mdp.py

Table of Contents

  • MAML-RL PyTorch Code Walkthrough (7) -- maml_rl/envs/mdp.py
      • Overview
        • Source Link
        • File Path
      • `import` Packages
      • `TabularMDPEnv()` Class

Overview

Most of the meta-learning (MAML) code found online is image-related; there is comparatively little code on the reinforcement learning side.

Since my own work is related to MAML-RL, I decided to read through some of the source code.

The original MAML code is based on TensorFlow; I found a PyTorch implementation on GitHub and will study that package.

Source Link

https://github.com/dragen1860/MAML-Pytorch-RL

File Path

./maml_rl/envs/mdp.py

`import` Packages

import numpy as np

import gym
from gym import spaces
from gym.utils import seeding

`TabularMDPEnv()` Class

class TabularMDPEnv(gym.Env):
    
    #### At each time step, the agent chooses one of `num_actions` actions, say `i`, receives a reward sampled from a Normal distribution with mean `m_i` and variance 1 (the variance is fixed across all tasks), and moves to the next state according to the Markov dynamics. Tabular MDP tasks are generated by sampling the mean rewards from a Normal distribution with mean 1 and variance 1, and sampling the transition probabilities from a uniform Dirichlet distribution.
	"""Tabular MDP problems, as described in [1].

	At each time step, the agent chooses one of `num_actions` actions, say `i`, 
	receives a reward sampled from a Normal distribution with mean `m_i` and 
	variance 1 (fixed across all tasks), and reaches a new state following the 
	dynamics of the Markov Decision Process (MDP). The tabular MDP tasks are 
	generated by sampling the mean rewards from a Normal distribution with mean 
	1 and variance 1, and sampling the transition probabilities from a uniform 
	Dirichlet distribution (ie. with parameter 1).

	[1] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever,
		Pieter Abbeel, "RL2: Fast Reinforcement Learning via Slow Reinforcement
		Learning", 2016 (https://arxiv.org/abs/1611.02779)
	"""

    #### The tabular form is easy to picture: `num_states` states and `num_actions` actions form a 2-D table. The constructor builds `action_space` as a Discrete space with `num_actions` actions and `observation_space` as a Box bounded in [0, 1] with shape `(num_states,)` (observations are one-hot state encodings). `self._task` stores the task description. `self._transitions` defaults to a [num_states x num_actions x num_states] array whose entries are all 1/num_states, i.e. uniform transition probabilities. `self._rewards_mean` is the table of mean rewards, a 2-D [num_states x num_actions] array initialized to zeros. `self._state = 0` puts the agent in the first state. `self.seed()` initializes the random number generator.
	def __init__(self, num_states, num_actions, task={}):
		super(TabularMDPEnv, self).__init__()
		self.num_states = num_states
		self.num_actions = num_actions

		self.action_space = spaces.Discrete(num_actions)
		self.observation_space = spaces.Box(low=0.0,
		                                    high=1.0, shape=(num_states,), dtype=np.float32)

		self._task = task
		self._transitions = task.get('transitions', np.full((num_states,
		                                                     num_actions, num_states), 1.0 / num_states,
		                                                    dtype=np.float32))
		self._rewards_mean = task.get('rewards_mean', np.zeros((num_states,
		                                                         num_actions), dtype=np.float32))
		self._state = 0
		self.seed()

    #### Jumping to the source shows this delegates to numpy's seeding utility: it creates a random number generator. `self.np_random` is the random-state instance used for all sampling in this environment, and `seed` is the random seed (returned in a list).
	def seed(self, seed=None):
		self.np_random, seed = seeding.np_random(seed)
		return [seed]

    #### Meta-learning works with many tasks. Transition probabilities are drawn from a Dirichlet distribution whose alpha parameter is a vector of ones of length `num_states`; with size=(num_tasks, self.num_states, self.num_actions), the sampled array has shape [num_tasks x num_states x num_actions x num_states], so each task's transition table is [num_states x num_actions x num_states]. The reward tables are built analogously from a Normal(1, 1) distribution, with shape [num_tasks x num_states x num_actions]. Finally, each task is packed into a dict holding its own transition table and mean-reward table (see the usage sketch after the class for a quick shape check).
	def sample_tasks(self, num_tasks):
		transitions = self.np_random.dirichlet(np.ones(self.num_states),
		                                       size=(num_tasks, self.num_states, self.num_actions))
		rewards_mean = self.np_random.normal(1.0, 1.0,
		                                     size=(num_tasks, self.num_states, self.num_actions))
		tasks = [{'transitions': transition, 'rewards_mean': reward_mean}
		         for (transition, reward_mean) in zip(transitions, rewards_mean)]
		return tasks

    #### Since meta-learning involves many tasks, the environment needs a way to switch between them. Assigning the `task` argument to `self._task` means the task used at run time may differ from the one the environment was initialized with. `self._transitions` and `self._rewards_mean` are overwritten with the 'transitions' and 'rewards_mean' entries of the task dict.
	def reset_task(self, task):
		self._task = task
		self._transitions = task['transitions']
		self._rewards_mean = task['rewards_mean']

    #### Resets the environment to the initial state. The returned observation is a one-hot vector: it is first initialized to all zeros, and then the entry corresponding to the initial state (state 0) is set to 1.
	def reset(self):
		# From [1]: "an episode always starts on the first state"
		self._state = 0
		observation = np.zeros(self.num_states, dtype=np.float32)
		observation[self._state] = 1.0

		return observation

    #### The assert checks that the given action lies inside the action space. `mean` looks up the mean reward of this tabular MDP for the current state-action pair, and `reward` is sampled from a Normal distribution around that mean (variance 1). The next `_state` is drawn according to the transition probabilities for the current state and action. The observation is again built as a one-hot vector: all entries are zeroed, then the entry of the new state is set to 1.0. The done flag is always False, and the task dict is returned as the info field.
	def step(self, action):
		assert self.action_space.contains(action)
		mean = self._rewards_mean[self._state, action]
		reward = self.np_random.normal(mean, 1.0)

		self._state = self.np_random.choice(self.num_states,
		                                    p=self._transitions[self._state, action])
		observation = np.zeros(self.num_states, dtype=np.float32)
		observation[self._state] = 1.0

		return observation, reward, False, self._task
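
To make the walkthrough concrete, here is a minimal usage sketch (not part of the repository) that builds the environment, samples a few tasks, checks the shapes discussed above, and rolls out a short episode. It assumes the package is importable as `maml_rl.envs.mdp`; the sizes `num_states=10`, `num_actions=5`, and `num_tasks=3` are arbitrary illustration values.

from maml_rl.envs.mdp import TabularMDPEnv

# Illustrative sizes only; the repository configures these elsewhere.
num_states, num_actions, num_tasks = 10, 5, 3

env = TabularMDPEnv(num_states, num_actions)
env.seed(0)

# Each task is a dict with a transition tensor and a mean-reward table.
tasks = env.sample_tasks(num_tasks)
print(tasks[0]['transitions'].shape)   # (10, 5, 10): [num_states x num_actions x num_states]
print(tasks[0]['rewards_mean'].shape)  # (10, 5):     [num_states x num_actions]

# Switch to the first task and roll out a short episode with random actions.
env.reset_task(tasks[0])
observation = env.reset()              # one-hot vector, state 0 is active
total_reward = 0.0
for _ in range(10):
	action = env.action_space.sample()
	observation, reward, done, info = env.step(action)
	total_reward += reward
print(total_reward)

Because `step()` always returns `done=False`, the episode length has to be controlled outside the environment, e.g. by a fixed horizon in the training loop.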
