Most of the MAML meta-learning code you can find online is image-related; there is much less code on the reinforcement-learning side.
Since my own work is related to MAML-RL, I want to read through some source code.
The original MAML code is based on TensorFlow. I found a PyTorch-based implementation on GitHub, so this series works through that package.
https://github.com/dragen1860/MAML-Pytorch-RL
./maml_rl/envs/mdp.py
import
import numpy as np
import gym
from gym import spaces
from gym.utils import seeding
TabularMDPEnv()
class TabularMDPEnv(gym.Env):
#### At each time step the agent picks one of `num_actions` actions, say `i`, receives a reward drawn from a fixed Normal distribution with mean `m_i` and variance 1, and moves to the next state under the MDP dynamics. A tabular MDP task is generated by sampling the mean rewards from a Normal distribution with mean 1 and variance 1, and sampling the transition probabilities from a flat Dirichlet distribution.
"""Tabular MDP problems, as described in [1].
At each time step, the agent chooses one of `num_actions` actions, say `i`,
receives a reward sampled from a Normal distribution with mean `m_i` and
variance 1 (fixed across all tasks), and reaches a new state following the
dynamics of the Markov Decision Process (MDP). The tabular MDP tasks are
generated by sampling the mean rewards from a Normal distribution with mean
1 and variance 1, and sampling the transition probabilities from a uniform
Dirichlet distribution (ie. with parameter 1).
[1] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever,
Pieter Abbeel, "RL2: Fast Reinforcement Learning via Slow Reinforcement
Learning", 2016 (https://arxiv.org/abs/1611.02779)
"""
#### The tabular form is straightforward: num_states states and num_actions actions together form a 2-D table. action_space is a Discrete space over the actions, and observation_space is a Box bounded in [0, 1] with shape (num_states,), since the state will be returned as a one-hot vector. self._task holds the task information. self._transitions defaults to an array of shape [num_states x num_actions x num_states] in which every entry is 1/num_states, i.e. uniform transition probabilities. self._rewards_mean is the table of mean rewards, a 2-D [num_states x num_actions] array defaulting to zeros. self._state = 0 sets the initial state to state 0. self.seed() initializes the random number generator.
def __init__(self, num_states, num_actions, task={}):
super(TabularMDPEnv, self).__init__()
self.num_states = num_states
self.num_actions = num_actions
self.action_space = spaces.Discrete(num_actions)
self.observation_space = spaces.Box(low=0.0,
high=1.0, shape=(num_states,), dtype=np.float32)
self._task = task
self._transitions = task.get('transitions', np.full((num_states,
num_actions, num_states), 1.0 / num_states,
dtype=np.float32))
self._rewards_mean = task.get('rewards_mean', np.zeros((num_states,
num_actions), dtype=np.float32))
self._state = 0
self.seed()
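#### To make the rest of the walkthrough concrete, here is a quick usage sketch. It is not part of mdp.py, and the choice of 10 states / 5 actions is my own. With an empty task dict, the constructor falls back to uniform transitions and all-zero mean rewards.
env = TabularMDPEnv(num_states=10, num_actions=5)
print(env._transitions.shape)    # (10, 5, 10); each [state, action] row sums to 1
print(env._transitions[0, 0])    # uniform: every entry is 1/10
print(env._rewards_mean.shape)   # (10, 5); all zeros until a task is assigned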
#### Jumping into gym's source, seeding.np_random is numpy-based: it builds a seeded random number generator. self.np_random is the generator instance used for all sampling in this environment, and seed is the seed that was actually used, returned as a list so it can be logged or reused.
def seed(self, seed=None):
self.np_random, seed = seeding.np_random(seed)
return [seed]
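#### A small check of what seeding buys us (my own sketch, continuing from the env constructed above): re-seeding with the same value makes the sampled tasks reproducible, because sample_tasks (defined next) and step both draw from self.np_random.
env.seed(0)
tasks_a = env.sample_tasks(3)
env.seed(0)
tasks_b = env.sample_tasks(3)
assert np.allclose(tasks_a[0]['transitions'], tasks_b[0]['transitions'])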
#### Meta-learning involves many tasks. The transition probabilities are drawn from a Dirichlet distribution whose alpha is an all-ones vector of length num_states; with size=(num_tasks, self.num_states, self.num_actions) the output has shape [num_tasks x num_states x num_actions x num_states], so each task's transition tensor is [num_states x num_actions x num_states]. The mean-reward tables are built the same way, but from a Normal distribution with mean 1 and variance 1, giving each task a [num_states x num_actions] table. Finally, each task is packed into a dict holding its own transitions and mean rewards.
def sample_tasks(self, num_tasks):
transitions = self.np_random.dirichlet(np.ones(self.num_states),
size=(num_tasks, self.num_states, self.num_actions))
rewards_mean = self.np_random.normal(1.0, 1.0,
size=(num_tasks, self.num_states, self.num_actions))
tasks = [{'transitions': transition, 'rewards_mean': reward_mean}
for (transition, reward_mean) in zip(transitions, rewards_mean)]
return tasks
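#### Checking the shapes produced by sample_tasks (continuing the sketch above; num_tasks=40 is an arbitrary choice of mine):
tasks = env.sample_tasks(num_tasks=40)
print(tasks[0]['transitions'].shape)        # (10, 5, 10): per-task [num_states, num_actions, num_states]
print(tasks[0]['transitions'][2, 3].sum())  # ~1.0: a probability distribution over next states
print(tasks[0]['rewards_mean'].shape)       # (10, 5): per-task [num_states, num_actions]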
#### Because meta-learning works over many tasks, the environment needs reset_task to switch between them. Passing task as an argument means the task used at run time can differ from the one given at construction. self._transitions and self._rewards_mean are replaced with the task dict's 'transitions' and 'rewards_mean' entries.
def reset_task(self, task):
self._task = task
self._transitions = task['transitions']
self._rewards_mean = task['rewards_mean']
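#### How a meta-learning outer loop would use reset_task, as a sketch of the calling pattern only (the actual sampler and trainer live elsewhere in the repo):
for task in env.sample_tasks(num_tasks=20):
    env.reset_task(task)       # swap in this task's transition tensor and mean-reward table
    observation = env.reset()
    # ... collect trajectories under this task for the inner-loop adaptation ...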
#### reset puts the current task back to its initial state. The returned observation is built by first creating an all-zeros vector of length num_states and then setting the entry of the initial state (state 0) to 1, i.e. a one-hot encoding of the state.
def reset(self):
# From [1]: "an episode always starts on the first state"
self._state = 0
observation = np.zeros(self.num_states, dtype=np.float32)
observation[self._state] = 1.0
return observation
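#### A quick look at that one-hot observation (sketch, continuing the env above):
observation = env.reset()
print(observation)           # [1., 0., 0., ...]: only index 0 (the start state) is 1
print(observation.argmax())  # 0, i.e. the current state index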
#### The assert checks that the chosen action lies inside the action space. mean looks up the mean reward of the current (state, action) pair in the reward table, and reward is sampled from a Normal distribution with that mean and variance 1. The next state is then drawn according to the transition probabilities for (state, action). The observation is again built as a one-hot vector: all zeros, with the entry of the new state set to 1.0. Note that done is always False and the task dict is returned as the info.
def step(self, action):
assert self.action_space.contains(action)
mean = self._rewards_mean[self._state, action]
reward = self.np_random.normal(mean, 1.0)
self._state = self.np_random.choice(self.num_states,
p=self._transitions[self._state, action])
observation = np.zeros(self.num_states, dtype=np.float32)
observation[self._state] = 1.0
return observation, reward, False, self._task
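#### Putting it together, a minimal random rollout (my own sketch). The environment itself never returns done=True, so the episode length has to be capped by the caller; the horizon of 10 here is an arbitrary choice.
observation = env.reset()
for _ in range(10):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    print(observation.argmax(), reward)  # next state index and the Gaussian-sampled reward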