Table of Contents
Preface
Homework
Homework 1: Behavioral Cloning
First, set up the environment
Environment setup on Windows
Environment setup on Ubuntu:
Then, after the packages below are installed:
hw1 README
python run_expert.py ./experts/Ant-v2.pkl Ant-v2
The second homework is:
Since the video links and links to other people's notes were already given earlier,
Lecture 1 is an overview, concepts only. Lectures 2 and 3 (only two lectures are covered here, but the videos actually run through lecture 4) introduce some basic concepts and are quite approachable; it's best to sort the ideas out yourself using those two sets of notes.
Detailed notes are omitted here for now.
Homework 1 doubles as a check of whether you meet the course prerequisites; if it's confusing, catch up with an ML/DL course and a TensorFlow primer first.
Once more:
The homework can be downloaded from http://rail.eecs.berkeley.edu/deeprlcourse/, which hosts both the assignments and the slides.
Reference solutions are here: https://github.com/berkeleydeeprlcourse/homework
A rough translation of the handout:
The submission deadline is 11:59 pm on September 16, 2019 (this course is quite recent).
The purpose of this assignment is to get a feel for imitation learning, including direct behavioral cloning and the DAgger algorithm.
The goal is to build a behavioral cloning agent and a DAgger agent and compare their performance on a few different continuous control tasks from the OpenAI Gym benchmark suite. You turn in the code and a report, as described in Section 4.
You can find the starter code on GitHub; follow the hints in its README to get started on the programming.
The GitHub repo comes with two key files: README.md and requirements.txt.
The starter code provides an expert policy for each MuJoCo task in OpenAI Gym. Fill in the blanks in the parts of the code marked TODO to implement behavioral cloning. The command to run behavioral cloning is given in the README.
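To make the task concrete before setting anything up, here is a minimal behavioral cloning sketch of my own (not the official solution; the file path, network size, and hyperparameters are arbitrary placeholders). It fits a small MLP with an MSE loss to the (observation, action) pairs that run_expert.py saves under expert_data/:

import pickle
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Load the (obs, act) pairs dumped by run_expert.py (path is an assumption)
with open('expert_data/Ant-v2.pkl', 'rb') as f:
    data = pickle.load(f)
obs = data['observations'].astype(np.float32)                     # (N, obs_dim)
acts = data['actions'].reshape(len(obs), -1).astype(np.float32)   # (N, act_dim)

obs_ph = tf.placeholder(tf.float32, [None, obs.shape[1]])
act_ph = tf.placeholder(tf.float32, [None, acts.shape[1]])

# Two hidden layers of 64 tanh units: an arbitrary "network size" choice
h = tf.layers.dense(obs_ph, 64, activation=tf.tanh)
h = tf.layers.dense(h, 64, activation=tf.tanh)
pred = tf.layers.dense(h, acts.shape[1])

loss = tf.reduce_mean(tf.square(pred - act_ph))    # MSE against expert actions
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):                      # "number of training iterations"
        idx = np.random.randint(0, len(obs), size=64)   # minibatch of 64
        sess.run(train_op, {obs_ph: obs[idx], act_ph: acts[idx]})

The "network size, amount of training data, and number of training iterations" the handout asks you to control correspond exactly to the layer widths, the size of obs/acts, and the loop count here.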
Environment setup on Windows
Reference tutorial:
OK, environment first: create a Python 3.5 environment in conda.
After installing all the packages, run:
(DRL) C:\Users\14020\Desktop\DRL\homework\hw1>python run_expert.py ./experts/Ant-v2.pkl Ant-v2
It was said earlier that you need to add environment variables. After some digging: apparently you first install Microsoft Visual C++ 14.0, then install MuJoCo mjpro150 win64. I had already pip-installed mujoco_py, so once MuJoCo was downloaded, running simulate.exe succeeded.
Then, supposedly, you should:
- cd into the mujoco-py directory: D:\ANACONDA\envs\DRL\Lib\site-packages\mujoco_py>
- pip install -r requirements.txt
If that requirements file isn't there, never mind.
Honestly, it's best to do the homework on Ubuntu:
Environment setup on Ubuntu:
https://blog.csdn.net/will_ye/article/details/81087463
Similar to Windows. The required package versions:
gym==0.10.5
mujoco-py==1.50.1.56
tensorflow
numpy
seaborn
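Once those packages are in place, a quick sanity check of my own (not part of the homework) is to build one of the six environments and take a random step; mujoco-py and license problems surface right here:

import gym
import mujoco_py  # fails here if MuJoCo or the activation key is misconfigured

env = gym.make('Ant-v2')  # any of the six expert environments works
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(obs.shape, reward, done)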
# CS294-112 HW 1: Imitation Learning
Dependencies:
* Python **3.5**
* Numpy version **1.14.5**
* TensorFlow version **1.10.5**
* MuJoCo version **1.50** and mujoco-py **1.50.1.56**
* OpenAI Gym version **0.10.5**
Once Python **3.5** is installed, you can install the remaining dependencies using `pip install -r requirements.txt`.
**Note**: MuJoCo versions until 1.5 do not support NVMe disks therefore won't be compatible with recent Mac machines.
There is a request for OpenAI to support it that can be followed [here](https://github.com/openai/gym/issues/638).
**Note**: Students enrolled in the course will receive an email with their MuJoCo activation key. Please do **not** share this key.
The only file that you need to look at is `run_expert.py`, which is code to load up an expert policy, run a specified number of roll-outs, and save out data.
In `experts/`, the provided expert policies are:
* Ant-v2.pkl
* HalfCheetah-v2.pkl
* Hopper-v2.pkl
* Humanoid-v2.pkl
* Reacher-v2.pkl
* Walker2d-v2.pkl
The name of the pickle file corresponds to the name of the gym environment.
Running it produces output like the following:
(DRL) C:\Users\14020\Desktop\DRL\homework\hw1>python run_expert.py ./experts/Ant-v2.pkl Ant-v2
WARNING:tensorflow:From D:\ANACONDA\envs\DRL\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
loading and building expert policy
obs (1, 111) (1, 111)
loaded and built
2019-10-24 19:15:09.135584: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From C:\Users\14020\Desktop\DRL\homework\hw1\tf_util.py:92: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
WARNING:tensorflow:From D:\ANACONDA\envs\DRL\lib\site-packages\tensorflow_core\python\util\tf_should_use.py:198: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.
D:\ANACONDA\envs\DRL\lib\site-packages\gym\envs\registration.py:14: PkgResourcesDeprecationWarning: Parameters to load are deprecated. Call .resolve and .require separately.
result = entry_point.load(False)
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as . Please provide explicit dtype.
iter 0
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
……
iter 19
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
returns [4826.537688869258, 4905.60911747871, 4843.0923281947125, 4983.677960547672, 4584.065942975533, 4669.895084909249, 4811.072431409634, 4778.371028168479, 4765.399718158855, 4852.6678103811855, 4921.288770246154, 4791.330282193481, 5071.702381917921, 4493.876447153158, 4504.676123632411, 4709.790413876621, 4628.911190315782, 4977.376827405835, 4697.280528014795, 4857.660492883054]
mean return 4783.714128436624
std of return 152.07689277612764
Also, the handout asks us to fill in files containing blank (TODO) code and lists which files have blanks, but those files don't seem to be in this homework folder on GitHub.
We now need to run the BC algorithm and record results on two tasks:
1. One where the BC agent achieves at least 30% of the expert's performance.
2. One where the BC agent does not reach 30%.
Present the results as a table of the mean and standard deviation of the return over several runs (and state which task each result is from).
When comparing the two tasks, make sure they use roughly the same network size, the same amount of training data, and the same number of training iterations (these can also be recorded in the table).
Note: for the reported mean and standard deviation to be meaningful, your eval_batch_size must be larger than ep_len (a parameter that doesn't seem to appear in run_expert.py). ep_len is the maximum length of a single episode, and eval_batch_size is the total number of steps collected for evaluation. Think of it as deciding to run 5 km in total while one lap is at most 1 km: you end up running at least five laps, and if an episode ends early (you trip and restart), possibly more than five. A sketch of this evaluation logic follows below.
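Here is my reading of that note as a sketch (the names eval_batch_size and ep_len follow the homework's wording; run_one_episode is a hypothetical helper): collect whole episodes of at most ep_len steps until at least eval_batch_size steps are gathered, then report the mean and std over the per-episode returns. Only when eval_batch_size > ep_len are you guaranteed more than one episode, so the std is meaningful.

import numpy as np

def evaluate(run_one_episode, eval_batch_size, ep_len):
    # run_one_episode(max_steps) -> (episode_return, steps) is a hypothetical helper
    returns, total_steps = [], 0
    while total_steps < eval_batch_size:          # keep sampling whole episodes
        ep_return, steps = run_one_episode(max_steps=ep_len)
        returns.append(ep_return)
        total_steps += steps
    # eval_batch_size > ep_len guarantees at least two episodes here
    return np.mean(returns), np.std(returns)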
So how do we set the network size, the amount of training data, the number of training iterations, and eval_batch_size / ep_len?
#!/bin/bash
set -eux
for e in Hopper-v2 Ant-v2 HalfCheetah-v2 Humanoid-v2 Reacher-v2 Walker2d-v2
do
    python run_expert.py experts/$e.pkl $e --render --num_rollouts=1
done
This simply runs the expert once in each environment.
Note: TF 2.0 was a major rewrite, and this homework uses TF 1.x, so running it produces the error below.
+ for e in Hopper-v2 Ant-v2 HalfCheetah-v2 Humanoid-v2 Reacher-v2 Walker2d-v2
+ python run_expert.py experts/Hopper-v2.pkl Hopper-v2 --render --num_rollouts=1
loading and building expert policy
Traceback (most recent call last):
  File "run_expert.py", line 76, in <module>
    main()
  File "run_expert.py", line 32, in main
    policy_fn = load_policy.load_policy(args.expert_policy_file)
  File "/home/asber/DRL/homework/hw1/load_policy.py", line 55, in load_policy
    obs_bo = tf.placeholder(tf.float32, [None, None])
AttributeError: module 'tensorflow' has no attribute 'placeholder'
The fix is to add the following to every file that uses TensorFlow:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
With the GPU build of TensorFlow, the following problem can also appear:
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
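A common workaround for that OOM (a general TF 1.x remedy, not something the homework prescribes) is to stop TensorFlow from reserving all GPU memory at session creation:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Allocate GPU memory on demand instead of grabbing it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

In run_expert.py this means passing the config into the with tf.Session(): line.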
Let's go through the .py files to see how the hyperparameters are actually set.
run_expert.py is as follows:
#!/usr/bin/env python
"""
Code to load an expert policy and generate roll-out data for behavioral cloning.
Example usage:
python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --render \
--num_rollouts 20
Author of this script and included expert policies: Jonathan Ho ([email protected])
"""
import os
import pickle
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import tf_util
import gym
import load_policy
def main():
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('expert_policy_file', type=str)
parser.add_argument('envname', type=str)
parser.add_argument('--render', action='store_true')
parser.add_argument("--max_timesteps", type=int)
parser.add_argument('--num_rollouts', type=int, default=20,
help='Number of expert roll outs')
args = parser.parse_args()
print('loading and building expert policy')
policy_fn = load_policy.load_policy(args.expert_policy_file)
print('loaded and built')
with tf.Session():
tf_util.initialize()
import gym
env = gym.make(args.envname)
max_steps = args.max_timesteps or env.spec.timestep_limit
returns = []
observations = []
actions = []
for i in range(args.num_rollouts):
print('iter', i)
            obs = env.reset()  # reset the environment
done = False
totalr = 0.
steps = 0
while not done:
                action = policy_fn(obs[None,:])  # query the policy for an action
observations.append(obs)
actions.append(action)
                obs, r, done, _ = env.step(action)  # step() advances the physics engine: next observation and this step's reward
totalr += r
steps += 1
if args.render:
                    env.render()  # render() updates the on-screen visualization
if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
if steps >= max_steps:
break
returns.append(totalr)
print('returns', returns)
print('mean return', np.mean(returns))
print('std of return', np.std(returns))
expert_data = {'observations': np.array(observations),
'actions': np.array(actions)}
with open(os.path.join('expert_data', args.envname + '.pkl'), 'wb') as f:
pickle.dump(expert_data, f, pickle.HIGHEST_PROTOCOL)
if __name__ == '__main__':
main()
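Note that the pickle run_expert.py writes is just a dict of two arrays; a quick inspection (my own snippet) shows what behavioral cloning will train on. The shapes follow from how the loop appends obs and policy_fn(obs[None,:]):

import pickle

with open('expert_data/Ant-v2.pkl', 'rb') as f:
    expert_data = pickle.load(f)

print(expert_data['observations'].shape)  # (num_steps, obs_dim), e.g. (20000, 111) for Ant-v2
print(expert_data['actions'].shape)       # (num_steps, 1, act_dim): note the extra middle axis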
load_policy.py is as follows:
import pickle, tf_util, numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
def load_policy(filename):  # load the pickled policy model
with open(filename, 'rb') as f:
        data = pickle.loads(f.read())  # read and unpickle the file
# assert len(data.keys()) == 2
    nonlin_type = data['nonlin_type']  # the pickle is a dict holding nonlin_type and the policy type
policy_type = [k for k in data.keys() if k != 'nonlin_type'][0]
assert policy_type == 'GaussianPolicy', 'Policy type {} not supported'.format(policy_type)
policy_params = data[policy_type]
assert set(policy_params.keys()) == {'logstdevs_1_Da', 'hidden', 'obsnorm', 'out'}
# Keep track of input and output dims (i.e. observation and action dims) for the user
def build_policy(obs_bo):
def read_layer(l):
assert list(l.keys()) == ['AffineLayer']
assert sorted(l['AffineLayer'].keys()) == ['W', 'b']
return l['AffineLayer']['W'].astype(np.float32), l['AffineLayer']['b'].astype(np.float32)
def apply_nonlin(x):
if nonlin_type == 'lrelu':
return tf_util.lrelu(x, leak=.01) # openai/imitation nn.py:233
elif nonlin_type == 'tanh':
return tf.tanh(x)
else:
raise NotImplementedError(nonlin_type)
        # Build the policy. First, observation (input) normalization.
assert list(policy_params['obsnorm'].keys()) == ['Standardizer']
obsnorm_mean = policy_params['obsnorm']['Standardizer']['mean_1_D']
obsnorm_meansq = policy_params['obsnorm']['Standardizer']['meansq_1_D']
obsnorm_stdev = np.sqrt(np.maximum(0, obsnorm_meansq - np.square(obsnorm_mean)))
print('obs', obsnorm_mean.shape, obsnorm_stdev.shape)
normedobs_bo = (obs_bo - obsnorm_mean) / (obsnorm_stdev + 1e-6) # 1e-6 constant from Standardizer class in nn.py:409 in openai/imitation
curr_activations_bd = normedobs_bo
        # Hidden layers next
assert list(policy_params['hidden'].keys()) == ['FeedforwardNet']
layer_params = policy_params['hidden']['FeedforwardNet']
for layer_name in sorted(layer_params.keys()):
l = layer_params[layer_name]
W, b = read_layer(l)
curr_activations_bd = apply_nonlin(tf.matmul(curr_activations_bd, W) + b)
# Output layer
W, b = read_layer(policy_params['out'])
output_bo = tf.matmul(curr_activations_bd, W) + b
return output_bo
obs_bo = tf.placeholder(tf.float32, [None, None])
a_ba = build_policy(obs_bo)
policy_fn = tf_util.function([obs_bo], a_ba)
return policy_fn
tf_util.py is as follows:
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()# pylint: ignore-module
#import builtins
import functools
import copy
import os
import collections
# ================================================================
# Import all names into common namespace
# ================================================================
clip = tf.clip_by_value
# Make consistent with numpy
# ----------------------------------------
def sum(x, axis=None, keepdims=False):
return tf.reduce_sum(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def mean(x, axis=None, keepdims=False):
return tf.reduce_mean(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def var(x, axis=None, keepdims=False):
meanx = mean(x, axis=axis, keepdims=keepdims)
return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
def std(x, axis=None, keepdims=False):
return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
def max(x, axis=None, keepdims=False):
return tf.reduce_max(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def min(x, axis=None, keepdims=False):
return tf.reduce_min(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def concatenate(arrs, axis=0):
return tf.concat(axis, arrs)
def argmax(x, axis=None):
return tf.argmax(x, dimension=axis)
def switch(condition, then_expression, else_expression):
'''Switches between two operations depending on a scalar value (int or bool).
Note that both `then_expression` and `else_expression`
should be symbolic tensors of the *same shape*.
# Arguments
condition: scalar tensor.
then_expression: TensorFlow operation.
else_expression: TensorFlow operation.
'''
x_shape = copy.copy(then_expression.get_shape())
x = tf.cond(tf.cast(condition, 'bool'),
lambda: then_expression,
lambda: else_expression)
x.set_shape(x_shape)
return x
# Extras
# ----------------------------------------
def l2loss(params):
if len(params) == 0:
return tf.constant(0.0)
else:
return tf.add_n([sum(tf.square(p)) for p in params])
def lrelu(x, leak=0.2):
f1 = 0.5 * (1 + leak)
f2 = 0.5 * (1 - leak)
return f1 * x + f2 * abs(x)
def categorical_sample_logits(X):
# https://github.com/tensorflow/tensorflow/issues/456
U = tf.random_uniform(tf.shape(X))
return argmax(X - tf.log(-tf.log(U)), axis=1)
# ================================================================
# Global session
# ================================================================
def get_session():
return tf.get_default_session()
def single_threaded_session():
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=1,
intra_op_parallelism_threads=1)
return tf.Session(config=tf_config)
def make_session(num_cpu):
tf_config = tf.ConfigProto(
inter_op_parallelism_threads=num_cpu,
intra_op_parallelism_threads=num_cpu)
return tf.Session(config=tf_config)
ALREADY_INITIALIZED = set()
def initialize():
new_variables = set(tf.all_variables()) - ALREADY_INITIALIZED
get_session().run(tf.initialize_variables(new_variables))
ALREADY_INITIALIZED.update(new_variables)
def eval(expr, feed_dict=None):
if feed_dict is None: feed_dict = {}
return get_session().run(expr, feed_dict=feed_dict)
def set_value(v, val):
get_session().run(v.assign(val))
def load_state(fname):
saver = tf.train.Saver()
saver.restore(get_session(), fname)
def save_state(fname):
os.makedirs(os.path.dirname(fname), exist_ok=True)
saver = tf.train.Saver()
saver.save(get_session(), fname)
# ================================================================
# Model components
# ================================================================
def normc_initializer(std=1.0):
def _initializer(shape, dtype=None, partition_info=None): #pylint: disable=W0613
out = np.random.randn(*shape).astype(np.float32)
out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
return tf.constant(out)
return _initializer
def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME", dtype=tf.float32, collections=None,
summary_tag=None):
with tf.variable_scope(name):
stride_shape = [1, stride[0], stride[1], 1]
filter_shape = [filter_size[0], filter_size[1], int(x.get_shape()[3]), num_filters]
# there are "num input feature maps * filter height * filter width"
# inputs to each hidden unit
fan_in = intprod(filter_shape[:3])
# each unit in the lower layer receives a gradient from:
# "num output feature maps * filter height * filter width" /
# pooling size
fan_out = intprod(filter_shape[:2]) * num_filters
# initialize weights with random weights
w_bound = np.sqrt(6. / (fan_in + fan_out))
w = tf.get_variable("W", filter_shape, dtype, tf.random_uniform_initializer(-w_bound, w_bound),
collections=collections)
b = tf.get_variable("b", [1, 1, 1, num_filters], initializer=tf.zeros_initializer,
collections=collections)
if summary_tag is not None:
tf.image_summary(summary_tag,
tf.transpose(tf.reshape(w, [filter_size[0], filter_size[1], -1, 1]),
[2, 0, 1, 3]),
max_images=10)
return tf.nn.conv2d(x, w, stride_shape, pad) + b
def dense(x, size, name, weight_init=None, bias=True):
w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
ret = tf.matmul(x, w)
if bias:
b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer)
return ret + b
else:
return ret
def wndense(x, size, name, init_scale=1.0):
v = tf.get_variable(name + "/V", [int(x.get_shape()[1]), size],
initializer=tf.random_normal_initializer(0, 0.05))
g = tf.get_variable(name + "/g", [size], initializer=tf.constant_initializer(init_scale))
b = tf.get_variable(name + "/b", [size], initializer=tf.constant_initializer(0.0))
# use weight normalization (Salimans & Kingma, 2016)
x = tf.matmul(x, v)
scaler = g / tf.sqrt(sum(tf.square(v), axis=0, keepdims=True))
return tf.reshape(scaler, [1, size]) * x + tf.reshape(b, [1, size])
def densenobias(x, size, name, weight_init=None):
return dense(x, size, name, weight_init=weight_init, bias=False)
def dropout(x, pkeep, phase=None, mask=None):
mask = tf.floor(pkeep + tf.random_uniform(tf.shape(x))) if mask is None else mask
if phase is None:
return mask * x
else:
return switch(phase, mask*x, pkeep*x)
def batchnorm(x, name, phase, updates, gamma=0.96):
k = x.get_shape()[1]
runningmean = tf.get_variable(name+"/mean", shape=[1, k], initializer=tf.constant_initializer(0.0), trainable=False)
runningvar = tf.get_variable(name+"/var", shape=[1, k], initializer=tf.constant_initializer(1e-4), trainable=False)
testy = (x - runningmean) / tf.sqrt(runningvar)
mean_ = mean(x, axis=0, keepdims=True)
var_ = mean(tf.square(x), axis=0, keepdims=True)
std = tf.sqrt(var_)
trainy = (x - mean_) / std
updates.extend([
tf.assign(runningmean, runningmean * gamma + mean_ * (1 - gamma)),
tf.assign(runningvar, runningvar * gamma + var_ * (1 - gamma))
])
y = switch(phase, trainy, testy)
out = y * tf.get_variable(name+"/scaling", shape=[1, k], initializer=tf.constant_initializer(1.0), trainable=True)\
+ tf.get_variable(name+"/translation", shape=[1,k], initializer=tf.constant_initializer(0.0), trainable=True)
return out
# ================================================================
# Basic Stuff
# ================================================================
def function(inputs, outputs, updates=None, givens=None):
if isinstance(outputs, list):
return _Function(inputs, outputs, updates, givens=givens)
elif isinstance(outputs, (dict, collections.OrderedDict)):
f = _Function(inputs, outputs.values(), updates, givens=givens)
return lambda *inputs : type(outputs)(zip(outputs.keys(), f(*inputs)))
else:
f = _Function(inputs, [outputs], updates, givens=givens)
return lambda *inputs : f(*inputs)[0]
class _Function(object):
def __init__(self, inputs, outputs, updates, givens, check_nan=False):
assert all(len(i.op.inputs)==0 for i in inputs), "inputs should all be placeholders"
self.inputs = inputs
updates = updates or []
self.update_group = tf.group(*updates)
self.outputs_update = list(outputs) + [self.update_group]
self.givens = {} if givens is None else givens
self.check_nan = check_nan
def __call__(self, *inputvals):
assert len(inputvals) == len(self.inputs)
feed_dict = dict(zip(self.inputs, inputvals))
feed_dict.update(self.givens)
results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
if self.check_nan:
if any(np.isnan(r).any() for r in results):
raise RuntimeError("Nan detected")
return results
def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_size):
if isinstance(outputs, list):
return _MemFriendlyFunction(nondata_inputs, data_inputs, outputs, batch_size)
else:
f = _MemFriendlyFunction(nondata_inputs, data_inputs, [outputs], batch_size)
return lambda *inputs : f(*inputs)[0]
class _MemFriendlyFunction(object):
def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
self.nondata_inputs = nondata_inputs
self.data_inputs = data_inputs
self.outputs = list(outputs)
self.batch_size = batch_size
def __call__(self, *inputvals):
assert len(inputvals) == len(self.nondata_inputs) + len(self.data_inputs)
nondata_vals = inputvals[0:len(self.nondata_inputs)]
data_vals = inputvals[len(self.nondata_inputs):]
feed_dict = dict(zip(self.nondata_inputs, nondata_vals))
n = data_vals[0].shape[0]
for v in data_vals[1:]:
assert v.shape[0] == n
for i_start in range(0, n, self.batch_size):
slice_vals = [v[i_start:min(i_start+self.batch_size, n)] for v in data_vals]
for (var,val) in zip(self.data_inputs, slice_vals):
feed_dict[var]=val
results = tf.get_default_session().run(self.outputs, feed_dict=feed_dict)
if i_start==0:
sum_results = results
else:
for i in range(len(results)):
sum_results[i] = sum_results[i] + results[i]
for i in range(len(results)):
sum_results[i] = sum_results[i] / n
return sum_results
# ================================================================
# Modules
# ================================================================
class Module(object):
def __init__(self, name):
self.name = name
self.first_time = True
self.scope = None
self.cache = {}
def __call__(self, *args):
if args in self.cache:
print("(%s) retrieving value from cache"%self.name)
return self.cache[args]
with tf.variable_scope(self.name, reuse=not self.first_time):
scope = tf.get_variable_scope().name
if self.first_time:
self.scope = scope
print("(%s) running function for the first time"%self.name)
else:
assert self.scope == scope, "Tried calling function with a different scope"
print("(%s) running function on new inputs"%self.name)
self.first_time = False
out = self._call(*args)
self.cache[args] = out
return out
def _call(self, *args):
raise NotImplementedError
@property
def trainable_variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)
@property
def variables(self):
assert self.scope is not None, "need to call module once before getting variables"
return tf.get_collection(tf.GraphKeys.VARIABLES, self.scope)
def module(name):
@functools.wraps
def wrapper(f):
class WrapperModule(Module):
def _call(self, *args):
return f(*args)
return WrapperModule(name)
return wrapper
# ================================================================
# Graph traversal
# ================================================================
VARIABLES = {}
def get_parents(node):
return node.op.inputs
def topsorted(outputs):
"""
Topological sort via non-recursive depth-first search
"""
assert isinstance(outputs, (list,tuple))
marks = {}
out = []
stack = [] #pylint: disable=W0621
# i: node
# jidx = number of children visited so far from that node
# marks: state of each node, which is one of
# 0: haven't visited
# 1: have visited, but not done visiting children
# 2: done visiting children
for x in outputs:
stack.append((x,0))
while stack:
(i,jidx) = stack.pop()
if jidx == 0:
m = marks.get(i,0)
if m == 0:
marks[i] = 1
elif m == 1:
raise ValueError("not a dag")
else:
continue
ps = get_parents(i)
if jidx == len(ps):
marks[i] = 2
out.append(i)
else:
stack.append((i,jidx+1))
j = ps[jidx]
stack.append((j,0))
return out
# ================================================================
# Flat vectors
# ================================================================
def var_shape(x):
out = [k.value for k in x.get_shape()]
assert all(isinstance(a, int) for a in out), \
"shape function assumes that shape is fully known"
return out
def numel(x):
return intprod(var_shape(x))
def intprod(x):
return int(np.prod(x))
def flatgrad(loss, var_list):
grads = tf.gradients(loss, var_list)
return tf.concat(0, [tf.reshape(grad, [numel(v)])
for (v, grad) in zip(var_list, grads)])
class SetFromFlat(object):
def __init__(self, var_list, dtype=tf.float32):
assigns = []
shapes = list(map(var_shape, var_list))
total_size = np.sum([intprod(shape) for shape in shapes])
self.theta = theta = tf.placeholder(dtype,[total_size])
start=0
assigns = []
for (shape,v) in zip(shapes,var_list):
size = intprod(shape)
assigns.append(tf.assign(v, tf.reshape(theta[start:start+size],shape)))
start+=size
self.op = tf.group(*assigns)
def __call__(self, theta):
get_session().run(self.op, feed_dict={self.theta:theta})
class GetFlat(object):
def __init__(self, var_list):
self.op = tf.concat(0, [tf.reshape(v, [numel(v)]) for v in var_list])
def __call__(self):
return get_session().run(self.op)
# ================================================================
# Misc
# ================================================================
def fancy_slice_2d(X, inds0, inds1):
"""
like numpy X[inds0, inds1]
XXX this implementation is bad
"""
inds0 = tf.cast(inds0, tf.int64)
inds1 = tf.cast(inds1, tf.int64)
shape = tf.cast(tf.shape(X), tf.int64)
ncols = shape[1]
Xflat = tf.reshape(X, [-1])
return tf.gather(Xflat, inds0 * ncols + inds1)
def scope_vars(scope, trainable_only):
"""
Get variables inside a scope
The scope can be specified as a string
"""
return tf.get_collection(
tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.VARIABLES,
scope=scope if isinstance(scope, str) else scope.name
)
def lengths_to_mask(lengths_b, max_length):
"""
Turns a vector of lengths into a boolean mask
Args:
lengths_b: an integer vector of lengths
max_length: maximum length to fill the mask
Returns:
a boolean array of shape (batch_size, max_length)
row[i] consists of True repeated lengths_b[i] times, followed by False
"""
lengths_b = tf.convert_to_tensor(lengths_b)
assert lengths_b.get_shape().ndims == 1
mask_bt = tf.expand_dims(tf.range(max_length), 0) < tf.expand_dims(lengths_b, 1)
return mask_bt
def in_session(f):
@functools.wraps(f)
def newfunc(*args, **kwargs):
with tf.Session():
f(*args, **kwargs)
return newfunc
_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
print("calling get_placeholder", name)
if name in _PLACEHOLDER_CACHE:
out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
assert dtype1==dtype and shape1==shape
return out
else:
out = tf.placeholder(dtype=dtype, shape=shape, name=name)
_PLACEHOLDER_CACHE[name] = (out,dtype,shape)
return out
def get_placeholder_cached(name):
return _PLACEHOLDER_CACHE[name][0]
def flattenallbut0(x):
return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])
def reset():
global _PLACEHOLDER_CACHE
global VARIABLES
_PLACEHOLDER_CACHE = {}
VARIABLES = {}
tf.reset_default_graph()
So it's enough to understand that load_policy builds a simple MLP that maps observations to actions: normalized input, a few hidden affine layers with nonlinearities, and a linear output layer.
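In numpy terms, the whole policy is just this forward pass (my paraphrase of build_policy, keeping its parameter names; not code from the repo):

import numpy as np

def forward(obs, obsnorm_mean, obsnorm_stdev, hidden_layers, W_out, b_out, nonlin):
    # hidden_layers is a list of (W, b) pairs; nonlin is tanh or leaky relu
    x = (obs - obsnorm_mean) / (obsnorm_stdev + 1e-6)  # observation normalization
    for W, b in hidden_layers:                         # hidden affine layers + nonlinearity
        x = nonlin(x @ W + b)
    return x @ W_out + b_out                           # linear output layer = action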
As for running DAgger, I couldn't find fill-in-the-blank code for it either. In any case, DAgger requires that after every run you relabel the observations generated by the previous policy with correct expert actions, which is tedious and inefficient, so I won't dig into it further here.
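For completeness, the loop described above looks roughly like this (a sketch only; train_policy and run_policy are hypothetical helpers, and expert_policy_fn is the function returned by load_policy.load_policy):

def dagger(expert_policy_fn, train_policy, run_policy, expert_demos, n_iters=10):
    # expert_demos: initial list of (obs, action) pairs from the expert
    dataset = list(expert_demos)
    policy = None
    for _ in range(n_iters):
        policy = train_policy(dataset)        # 1. fit the policy to the aggregated dataset
        new_obs = run_policy(policy)          # 2. roll out the *learned* policy
        for o in new_obs:                     # 3. relabel its observations with expert actions
            dataset.append((o, expert_policy_fn(o[None, :])))
    return policy                             # 4. repeat: the dataset keeps growing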