Learning DRL: homework 1

Table of Contents

Preface

Homework

Assignment 1: behavioral cloning

First, set up the environment

Environment setup on Windows

Environment setup on Ubuntu

Then install the following packages

hw1 README

python run_expert.py ./experts/Ant-v2.pkl Ant-v2

The second assignment


Preface

The lecture video links and other people's notes have already been given earlier:

  1. Lecture 1 notes: https://zhuanlan.zhihu.com/p/32530166
  2. Lecture 2 notes: https://zhuanlan.zhihu.com/p/32575824
  3. Lecture 3 notes: https://zhuanlan.zhihu.com/p/32598322

Lecture 1 is an overview, just concepts. Lectures 2 and 3 (only two lectures are linked here, but the videos actually go up to lecture 4) cover some basic ideas and are quite approachable; it's best to work through those two write-ups yourself to get the concepts straight.

I'll skip my own notes here for now.

Homework 1 doubles as a check on whether you meet the course prerequisites; if it feels unfamiliar, you should first brush up with an ML/DL course and a TensorFlow primer.

Homework

Once again:

The assignments can be downloaded from http://rail.eecs.berkeley.edu/deeprlcourse/, which provides both the homework handouts and the lecture slides.

The standard solutions for the assignments are here: https://github.com/berkeleydeeprlcourse/homework

[Screenshot: the hw1 assignment handout]

A rough translation:

The submission deadline is 11:59 pm on September 16, 2019 (so this course is actually quite recent).

The goal of this assignment is to get hands-on experience with imitation learning, including direct behavioral cloning and the DAgger algorithm.

The objective is to implement behavioral cloning and DAgger and compare their performance on a few different continuous-control tasks from the OpenAI Gym benchmark suite. You turn in your code and a report as described in Section 4.

The starter code is provided on GitHub; follow the hints in the README to get started on the programming.

Two of the files in the GitHub repo are README.md and requirements.txt:

[Screenshot: hw1 repository contents]

Assignment 1: behavioral cloning

The starter code provides an expert policy for each of the MuJoCo tasks in OpenAI Gym. Fill in the blanks in the code sections marked TODO to implement behavioral cloning. The command for running behavioral cloning is given in the README file.
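To get a feel for what those TODO blanks have to do, here is a minimal behavioral-cloning sketch of my own (an illustration, not the official starter code): plain supervised regression from the expert's observations to its actions. It assumes an expert_data/Ant-v2.pkl file in the format produced by run_expert.py, which is shown later in this post.

# Minimal behavioral-cloning sketch (illustrative, not the official solution).
import pickle
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Expert (observation, action) pairs saved by run_expert.py (see below).
with open('expert_data/Ant-v2.pkl', 'rb') as f:
    data = pickle.load(f)
obs = data['observations'].astype(np.float32)                    # (N, obs_dim)
acts = data['actions'].reshape(len(obs), -1).astype(np.float32)  # (N, act_dim)

# A small MLP trained with mean-squared error against the expert actions.
obs_ph = tf.placeholder(tf.float32, [None, obs.shape[1]])
act_ph = tf.placeholder(tf.float32, [None, acts.shape[1]])
hidden = tf.layers.dense(obs_ph, 64, activation=tf.tanh)
pred_action = tf.layers.dense(hidden, acts.shape[1])
loss = tf.reduce_mean(tf.square(pred_action - act_ph))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(2000):
        idx = np.random.randint(0, len(obs), 64)  # random minibatch
        sess.run(train_op, feed_dict={obs_ph: obs[idx], act_ph: acts[idx]})

The actual TODOs may structure this differently; this is just the shape of the computation, and the trained pred_action would then be run in the environment the same way policy_fn is used in run_expert.py.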

First, set up the environment

Environment setup on Windows

Reference tutorial:

OK, environment first: create a Python 3.5 environment in conda.

Once all the packages are installed, run the following:

(DRL) C:\Users\14020\Desktop\DRL\homework\hw1>python run_expert.py ./experts/Ant-v2.pkl Ant-v2

Earlier it said environment variables need to be added. After some searching, it seems you first need to download Microsoft Visual C++ 14.0, and after installing that, install the Windows mjpro150 win64 build. I had previously pip-installed mujoco_py, so once the download finished, running simulate.exe succeeded.

[Screenshot: simulate.exe running] (It's still best to do this on Ubuntu.)

Then, supposedly, you were also meant to:

  1. cd into the mujoco-py directory: D:\ANACONDA\envs\DRL\Lib\site-packages\mujoco_py>
  2. pip install -r requirements.txt

If that doesn't exist, never mind.

Honestly, it's still best to do the homework on Ubuntu:

Environment setup on Ubuntu

https://blog.csdn.net/will_ye/article/details/81087463

Similar to Windows.

Then install the following packages:

gym==0.10.5
mujoco-py==1.50.1.56
tensorflow
numpy
seaborn

hw1 README

# CS294-112 HW 1: Imitation Learning

Dependencies:
 * Python **3.5**
 * Numpy version **1.14.5**
 * TensorFlow version **1.10.5**
 * MuJoCo version **1.50** and mujoco-py **1.50.1.56**
 * OpenAI Gym version **0.10.5**

Once Python **3.5** is installed, you can install the remaining dependencies using `pip install -r requirements.txt`. 

**Note**: MuJoCo versions until 1.5 do not support NVMe disks therefore won't be compatible with recent Mac machines.
There is a request for OpenAI to support it that can be followed [here](https://github.com/openai/gym/issues/638).

**Note**: Students enrolled in the course will receive an email with their MuJoCo activation key. Please do **not** share this key.

The only file that you need to look at is `run_expert.py`, which is code to load up an expert policy, run a specified number of roll-outs, and save out data.

In `experts/`, the provided expert policies are:
* Ant-v2.pkl
* HalfCheetah-v2.pkl
* Hopper-v2.pkl
* Humanoid-v2.pkl
* Reacher-v2.pkl
* Walker2d-v2.pkl

The name of the pickle file corresponds to the name of the gym environment.

python run_expert.py ./experts/Ant-v2.pkl Ant-v2

Running it produces output like this:

(DRL) C:\Users\14020\Desktop\DRL\homework\hw1>python run_expert.py ./experts/Ant-v2.pkl Ant-v2
WARNING:tensorflow:From D:\ANACONDA\envs\DRL\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
loading and building expert policy
obs (1, 111) (1, 111)
loaded and built
2019-10-24 19:15:09.135584: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
WARNING:tensorflow:From C:\Users\14020\Desktop\DRL\homework\hw1\tf_util.py:92: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
WARNING:tensorflow:From D:\ANACONDA\envs\DRL\lib\site-packages\tensorflow_core\python\util\tf_should_use.py:198: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.variables_initializer` instead.
D:\ANACONDA\envs\DRL\lib\site-packages\gym\envs\registration.py:14: PkgResourcesDeprecationWarning: Parameters to load are deprecated.  Call .resolve and .require separately.
  result = entry_point.load(False)
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
iter 0
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
... (iterations 1 through 18 omitted)
iter 19
100/1000
200/1000
300/1000
400/1000
500/1000
600/1000
700/1000
800/1000
900/1000
1000/1000
returns [4826.537688869258, 4905.60911747871, 4843.0923281947125, 4983.677960547672, 4584.065942975533, 4669.895084909249, 4811.072431409634, 4778.371028168479, 4765.399718158855, 4852.6678103811855, 4921.288770246154, 4791.330282193481, 5071.702381917921, 4493.876447153158, 4504.676123632411, 4709.790413876621, 4628.911190315782, 4977.376827405835, 4697.280528014795, 4857.660492883054]
mean return 4783.714128436624
std of return 152.07689277612764


 

The assignment requires us to fill in blanked-out files; the handout lists the files containing the blank code:

[Screenshot from the handout: the file containing the blanked-out code]

But this file doesn't seem to be in the homework folder on GitHub.

We now need to run the BC algorithm and report results on two tasks:

1. one task where the BC agent achieves at least 30% of the expert's performance

2. one task where the BC agent does not reach 30%

Present the results as a table of the mean and standard deviation of the return over several runs of the program (and state which task was run).

When comparing the two tasks, make sure they use roughly the same network size, the same amount of training data, and the same number of training iterations (these can also be recorded in the table).

Note: reporting a mean and standard deviation requires your eval_batch_size to be larger than ep_len (a parameter I don't seem to see in run_expert.py). ep_len is the maximum length of a single rollout, and eval_batch_size is the total number of steps you collect for evaluation. It's like deciding to run 5 km today while one lap is at most 1 km: you end up running five laps (and if you trip partway through a lap and have to restart it, you may run more than five laps in total).
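To make that concrete, here is a small sketch of such an evaluation loop (the function and parameter names are mine, not from the starter code): keep sampling rollouts of at most ep_len steps until at least eval_batch_size total steps have been collected, then report the mean and standard deviation of the per-rollout returns.

import numpy as np

def evaluate(env, policy_fn, eval_batch_size, ep_len):
    """Illustrative eval loop: sample rollouts of at most ep_len steps
    until at least eval_batch_size total steps have been collected."""
    returns, total_steps = [], 0
    while total_steps < eval_batch_size:
        obs, done, ret, steps = env.reset(), False, 0.0, 0
        while not done and steps < ep_len:
            obs, r, done, _ = env.step(policy_fn(obs[None, :]))
            ret += r
            steps += 1
        returns.append(ret)
        total_steps += steps
    return np.mean(returns), np.std(returns)

Because each rollout contributes at most ep_len steps, eval_batch_size > ep_len guarantees more than one rollout, so the standard deviation is meaningful.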

So how do we set the network size, the amount of training data, the number of training iterations, and eval_batch_size / ep_len?

Also, the shell script shown below can only be run under Linux:

#!/bin/bash
set -eux
for e in Hopper-v2 Ant-v2 HalfCheetah-v2 Humanoid-v2 Reacher-v2 Walker2d-v2
do
    python run_expert.py experts/$e.pkl $e --render --num_rollouts=1
done

It simply runs the expert once on each environment.

Note: TensorFlow 2.0 was a major rewrite, and this homework targets TF 1.x, so running it under TF 2 produces the following error:

+ for e in Hopper-v2 Ant-v2 HalfCheetah-v2 Humanoid-v2 Reacher-v2 Walker2d-v2
+ python run_expert.py experts/Hopper-v2.pkl Hopper-v2 --render --num_rollouts=1
loading and building expert policy
Traceback (most recent call last):
  File "run_expert.py", line 76, in 
    main()
  File "run_expert.py", line 32, in main
    policy_fn = load_policy.load_policy(args.expert_policy_file)
  File "/home/asber/DRL/homework/hw1/load_policy.py", line 55, in load_policy
    obs_bo = tf.placeholder(tf.float32, [None, None])
AttributeError: module 'tensorflow' has no attribute 'placeholder'

In every file that uses TensorFlow, the import needs to be replaced with the following:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

With the GPU build of TensorFlow, the following problem shows up:

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
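One common workaround (my suggestion, not from the handout) is to stop TF 1.x from reserving all GPU memory at session creation by enabling allow_growth:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Allocate GPU memory on demand instead of grabbing it all at session start.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

Alternatively, force CPU-only execution by setting the environment variable CUDA_VISIBLE_DEVICES=-1 before launching the script.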

Let's go through the Python files to see how the hyperparameters are actually set.

run_expert.py is as follows:

#!/usr/bin/env python

"""
Code to load an expert policy and generate roll-out data for behavioral cloning.
Example usage:
    python run_expert.py experts/Humanoid-v1.pkl Humanoid-v1 --render \
            --num_rollouts 20

Author of this script and included expert policies: Jonathan Ho ([email protected])
"""

import os
import pickle
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import tf_util
import gym
import load_policy

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('expert_policy_file', type=str)
    parser.add_argument('envname', type=str)
    parser.add_argument('--render', action='store_true')
    parser.add_argument("--max_timesteps", type=int)
    parser.add_argument('--num_rollouts', type=int, default=20,
                        help='Number of expert roll outs')
    args = parser.parse_args()

    print('loading and building expert policy')
    policy_fn = load_policy.load_policy(args.expert_policy_file)
    print('loaded and built')

    with tf.Session():
        tf_util.initialize()

        import gym
        env = gym.make(args.envname)
        max_steps = args.max_timesteps or env.spec.timestep_limit

        returns = []
        observations = []
        actions = []
        for i in range(args.num_rollouts):
            print('iter', i)
            obs = env.reset()  # reset the environment
            done = False
            totalr = 0.
            steps = 0
            while not done:
                action = policy_fn(obs[None,:])  # query the expert policy
                observations.append(obs)
                actions.append(action)
                obs, r, done, _ = env.step(action)  # step the simulator: next observation and this step's reward
                totalr += r
                steps += 1
                if args.render:
                    env.render()  # update the on-screen rendering
                if steps % 100 == 0: print("%i/%i"%(steps, max_steps))
                if steps >= max_steps:
                    break
            returns.append(totalr)

        print('returns', returns)
        print('mean return', np.mean(returns))
        print('std of return', np.std(returns))

        expert_data = {'observations': np.array(observations),
                       'actions': np.array(actions)}

        with open(os.path.join('expert_data', args.envname + '.pkl'), 'wb') as f:
            pickle.dump(expert_data, f, pickle.HIGHEST_PROTOCOL)

if __name__ == '__main__':
    main()
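After a successful run, expert_data/<envname>.pkl holds the collected rollouts. A quick sanity check of what gets saved (the shapes are what I'd expect for 20 rollouts of Ant-v2, whose observations are 111-dimensional per the log above; the extra middle axis on actions comes from policy_fn returning a (1, act_dim) array):

import pickle

with open('expert_data/Ant-v2.pkl', 'rb') as f:
    data = pickle.load(f)

print(data['observations'].shape)  # e.g. (20000, 111): 20 rollouts x 1000 steps
print(data['actions'].shape)       # e.g. (20000, 1, 8) for Ant-v2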

load_policy.py

import pickle, tf_util, numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

def load_policy(filename):  # load the pickled expert policy
    with open(filename, 'rb') as f:
        data = pickle.loads(f.read())  # read the pickled file

    # assert len(data.keys()) == 2
    nonlin_type = data['nonlin_type']  # the file is a dict storing nonlin_type and the policy type
    policy_type = [k for k in data.keys() if k != 'nonlin_type'][0]

    assert policy_type == 'GaussianPolicy', 'Policy type {} not supported'.format(policy_type)
    policy_params = data[policy_type]

    assert set(policy_params.keys()) == {'logstdevs_1_Da', 'hidden', 'obsnorm', 'out'}

    # Keep track of input and output dims (i.e. observation and action dims) for the user

    def build_policy(obs_bo):
        def read_layer(l):
            assert list(l.keys()) == ['AffineLayer']
            assert sorted(l['AffineLayer'].keys()) == ['W', 'b']
            return l['AffineLayer']['W'].astype(np.float32), l['AffineLayer']['b'].astype(np.float32)

        def apply_nonlin(x):
            if nonlin_type == 'lrelu':
                return tf_util.lrelu(x, leak=.01) # openai/imitation nn.py:233
            elif nonlin_type == 'tanh':
                return tf.tanh(x)
            else:
                raise NotImplementedError(nonlin_type)

        # Build the policy. First, observation normalization.
        assert list(policy_params['obsnorm'].keys()) == ['Standardizer']
        obsnorm_mean = policy_params['obsnorm']['Standardizer']['mean_1_D']
        obsnorm_meansq = policy_params['obsnorm']['Standardizer']['meansq_1_D']
        obsnorm_stdev = np.sqrt(np.maximum(0, obsnorm_meansq - np.square(obsnorm_mean)))
        print('obs', obsnorm_mean.shape, obsnorm_stdev.shape)
        normedobs_bo = (obs_bo - obsnorm_mean) / (obsnorm_stdev + 1e-6) # 1e-6 constant from Standardizer class in nn.py:409 in openai/imitation

        curr_activations_bd = normedobs_bo

        # Hidden layers next
        assert list(policy_params['hidden'].keys()) == ['FeedforwardNet']
        layer_params = policy_params['hidden']['FeedforwardNet']
        for layer_name in sorted(layer_params.keys()):
            l = layer_params[layer_name]
            W, b = read_layer(l)
            curr_activations_bd = apply_nonlin(tf.matmul(curr_activations_bd, W) + b)

        # Output layer
        W, b = read_layer(policy_params['out'])
        output_bo = tf.matmul(curr_activations_bd, W) + b
        return output_bo

    obs_bo = tf.placeholder(tf.float32, [None, None])
    a_ba = build_policy(obs_bo)
    policy_fn = tf_util.function([obs_bo], a_ba)
    return policy_fn

tf_util.py

import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()# pylint: ignore-module
#import builtins
import functools
import copy
import os
import collections

# ================================================================
# Import all names into common namespace
# ================================================================

clip = tf.clip_by_value

# Make consistent with numpy
# ----------------------------------------

def sum(x, axis=None, keepdims=False):
    return tf.reduce_sum(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def mean(x, axis=None, keepdims=False):
    return tf.reduce_mean(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def var(x, axis=None, keepdims=False):
    meanx = mean(x, axis=axis, keepdims=keepdims)
    return mean(tf.square(x - meanx), axis=axis, keepdims=keepdims)
def std(x, axis=None, keepdims=False):
    return tf.sqrt(var(x, axis=axis, keepdims=keepdims))
def max(x, axis=None, keepdims=False):
    return tf.reduce_max(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def min(x, axis=None, keepdims=False):
    return tf.reduce_min(x, reduction_indices=None if axis is None else [axis], keep_dims = keepdims)
def concatenate(arrs, axis=0):
    return tf.concat(axis, arrs)
def argmax(x, axis=None):
    return tf.argmax(x, dimension=axis)

def switch(condition, then_expression, else_expression):
    '''Switches between two operations depending on a scalar value (int or bool).
    Note that both `then_expression` and `else_expression`
    should be symbolic tensors of the *same shape*.

    # Arguments
        condition: scalar tensor.
        then_expression: TensorFlow operation.
        else_expression: TensorFlow operation.
    '''
    x_shape = copy.copy(then_expression.get_shape())
    x = tf.cond(tf.cast(condition, 'bool'),
                lambda: then_expression,
                lambda: else_expression)
    x.set_shape(x_shape)
    return x

# Extras
# ----------------------------------------
def l2loss(params):
    if len(params) == 0:
        return tf.constant(0.0)
    else:
        return tf.add_n([sum(tf.square(p)) for p in params])
def lrelu(x, leak=0.2):
    f1 = 0.5 * (1 + leak)
    f2 = 0.5 * (1 - leak)
    return f1 * x + f2 * abs(x)
def categorical_sample_logits(X):
    # https://github.com/tensorflow/tensorflow/issues/456
    U = tf.random_uniform(tf.shape(X))
    return argmax(X - tf.log(-tf.log(U)), axis=1)

# ================================================================
# Global session
# ================================================================

def get_session():
    return tf.get_default_session()

def single_threaded_session():
    tf_config = tf.ConfigProto(
        inter_op_parallelism_threads=1,
        intra_op_parallelism_threads=1)
    return tf.Session(config=tf_config)

def make_session(num_cpu):
    tf_config = tf.ConfigProto(
        inter_op_parallelism_threads=num_cpu,
        intra_op_parallelism_threads=num_cpu)
    return tf.Session(config=tf_config)


ALREADY_INITIALIZED = set()
def initialize():
    new_variables = set(tf.all_variables()) - ALREADY_INITIALIZED
    get_session().run(tf.initialize_variables(new_variables))
    ALREADY_INITIALIZED.update(new_variables)


def eval(expr, feed_dict=None):
    if feed_dict is None: feed_dict = {}
    return get_session().run(expr, feed_dict=feed_dict)

def set_value(v, val):
    get_session().run(v.assign(val))

def load_state(fname):
    saver = tf.train.Saver()
    saver.restore(get_session(), fname)

def save_state(fname):
    os.makedirs(os.path.dirname(fname), exist_ok=True)
    saver = tf.train.Saver()
    saver.save(get_session(), fname)

# ================================================================
# Model components
# ================================================================


def normc_initializer(std=1.0):
    def _initializer(shape, dtype=None, partition_info=None): #pylint: disable=W0613
        out = np.random.randn(*shape).astype(np.float32)
        out *= std / np.sqrt(np.square(out).sum(axis=0, keepdims=True))
        return tf.constant(out)
    return _initializer


def conv2d(x, num_filters, name, filter_size=(3, 3), stride=(1, 1), pad="SAME", dtype=tf.float32, collections=None,
           summary_tag=None):
    with tf.variable_scope(name):
        stride_shape = [1, stride[0], stride[1], 1]
        filter_shape = [filter_size[0], filter_size[1], int(x.get_shape()[3]), num_filters]

        # there are "num input feature maps * filter height * filter width"
        # inputs to each hidden unit
        fan_in = intprod(filter_shape[:3])
        # each unit in the lower layer receives a gradient from:
        # "num output feature maps * filter height * filter width" /
        #   pooling size
        fan_out = intprod(filter_shape[:2]) * num_filters
        # initialize weights with random weights
        w_bound = np.sqrt(6. / (fan_in + fan_out))

        w = tf.get_variable("W", filter_shape, dtype, tf.random_uniform_initializer(-w_bound, w_bound),
                            collections=collections)
        b = tf.get_variable("b", [1, 1, 1, num_filters], initializer=tf.zeros_initializer,
                            collections=collections)

        if summary_tag is not None:
            tf.image_summary(summary_tag,
                             tf.transpose(tf.reshape(w, [filter_size[0], filter_size[1], -1, 1]),
                                          [2, 0, 1, 3]),
                             max_images=10)

        return tf.nn.conv2d(x, w, stride_shape, pad) + b


def dense(x, size, name, weight_init=None, bias=True):
    w = tf.get_variable(name + "/w", [x.get_shape()[1], size], initializer=weight_init)
    ret = tf.matmul(x, w)
    if bias:
        b = tf.get_variable(name + "/b", [size], initializer=tf.zeros_initializer)
        return ret + b
    else:
        return ret

def wndense(x, size, name, init_scale=1.0):
    v = tf.get_variable(name + "/V", [int(x.get_shape()[1]), size],
                        initializer=tf.random_normal_initializer(0, 0.05))
    g = tf.get_variable(name + "/g", [size], initializer=tf.constant_initializer(init_scale))
    b = tf.get_variable(name + "/b", [size], initializer=tf.constant_initializer(0.0))

    # use weight normalization (Salimans & Kingma, 2016)
    x = tf.matmul(x, v)
    scaler = g / tf.sqrt(sum(tf.square(v), axis=0, keepdims=True))
    return tf.reshape(scaler, [1, size]) * x + tf.reshape(b, [1, size])

def densenobias(x, size, name, weight_init=None):
    return dense(x, size, name, weight_init=weight_init, bias=False)

def dropout(x, pkeep, phase=None, mask=None):
    mask = tf.floor(pkeep + tf.random_uniform(tf.shape(x))) if mask is None else mask
    if phase is None:
        return mask * x
    else:
        return switch(phase, mask*x, pkeep*x)

def batchnorm(x, name, phase, updates, gamma=0.96):
    k = x.get_shape()[1]
    runningmean = tf.get_variable(name+"/mean", shape=[1, k], initializer=tf.constant_initializer(0.0), trainable=False)
    runningvar = tf.get_variable(name+"/var", shape=[1, k], initializer=tf.constant_initializer(1e-4), trainable=False)
    testy = (x - runningmean) / tf.sqrt(runningvar)

    mean_ = mean(x, axis=0, keepdims=True)
    var_ = mean(tf.square(x), axis=0, keepdims=True)
    std = tf.sqrt(var_)
    trainy = (x - mean_) / std

    updates.extend([
        tf.assign(runningmean, runningmean * gamma + mean_ * (1 - gamma)),
        tf.assign(runningvar, runningvar * gamma + var_ * (1 - gamma))
    ])

    y = switch(phase, trainy, testy)

    out = y * tf.get_variable(name+"/scaling", shape=[1, k], initializer=tf.constant_initializer(1.0), trainable=True)\
            + tf.get_variable(name+"/translation", shape=[1,k], initializer=tf.constant_initializer(0.0), trainable=True)
    return out



# ================================================================
# Basic Stuff
# ================================================================

def function(inputs, outputs, updates=None, givens=None):
    if isinstance(outputs, list):
        return _Function(inputs, outputs, updates, givens=givens)
    elif isinstance(outputs, (dict, collections.OrderedDict)):
        f = _Function(inputs, outputs.values(), updates, givens=givens)
        return lambda *inputs : type(outputs)(zip(outputs.keys(), f(*inputs)))
    else:
        f = _Function(inputs, [outputs], updates, givens=givens)
        return lambda *inputs : f(*inputs)[0]

class _Function(object):
    def __init__(self, inputs, outputs, updates, givens, check_nan=False):
        assert all(len(i.op.inputs)==0 for i in inputs), "inputs should all be placeholders"
        self.inputs = inputs
        updates = updates or []
        self.update_group = tf.group(*updates)
        self.outputs_update = list(outputs) + [self.update_group]
        self.givens = {} if givens is None else givens
        self.check_nan = check_nan
    def __call__(self, *inputvals):
        assert len(inputvals) == len(self.inputs)
        feed_dict = dict(zip(self.inputs, inputvals))
        feed_dict.update(self.givens)
        results = get_session().run(self.outputs_update, feed_dict=feed_dict)[:-1]
        if self.check_nan:
            if any(np.isnan(r).any() for r in results):
                raise RuntimeError("Nan detected")
        return results

def mem_friendly_function(nondata_inputs, data_inputs, outputs, batch_size):
    if isinstance(outputs, list):
        return _MemFriendlyFunction(nondata_inputs, data_inputs, outputs, batch_size)
    else:
        f = _MemFriendlyFunction(nondata_inputs, data_inputs, [outputs], batch_size)
        return lambda *inputs : f(*inputs)[0]

class _MemFriendlyFunction(object):
    def __init__(self, nondata_inputs, data_inputs, outputs, batch_size):
        self.nondata_inputs = nondata_inputs
        self.data_inputs = data_inputs
        self.outputs = list(outputs)
        self.batch_size = batch_size
    def __call__(self, *inputvals):
        assert len(inputvals) == len(self.nondata_inputs) + len(self.data_inputs)
        nondata_vals = inputvals[0:len(self.nondata_inputs)]
        data_vals = inputvals[len(self.nondata_inputs):]
        feed_dict = dict(zip(self.nondata_inputs, nondata_vals))
        n = data_vals[0].shape[0]
        for v in data_vals[1:]:
            assert v.shape[0] == n
        for i_start in range(0, n, self.batch_size):
            slice_vals = [v[i_start:min(i_start+self.batch_size, n)] for v in data_vals]
            for (var,val) in zip(self.data_inputs, slice_vals):
                feed_dict[var]=val
            results = tf.get_default_session().run(self.outputs, feed_dict=feed_dict)
            if i_start==0:
                sum_results = results
            else:
                for i in range(len(results)):
                    sum_results[i] = sum_results[i] + results[i]
        for i in range(len(results)):
            sum_results[i] = sum_results[i] / n
        return sum_results

# ================================================================
# Modules
# ================================================================

class Module(object):
    def __init__(self, name):
        self.name = name
        self.first_time = True
        self.scope = None
        self.cache = {}
    def __call__(self, *args):
        if args in self.cache:
            print("(%s) retrieving value from cache"%self.name)
            return self.cache[args]
        with tf.variable_scope(self.name, reuse=not self.first_time):
            scope = tf.get_variable_scope().name
            if self.first_time:
                self.scope = scope
                print("(%s) running function for the first time"%self.name)
            else:
                assert self.scope == scope, "Tried calling function with a different scope"
                print("(%s) running function on new inputs"%self.name)
            self.first_time = False
            out = self._call(*args)
        self.cache[args] = out
        return out
    def _call(self, *args):
        raise NotImplementedError

    @property
    def trainable_variables(self):
        assert self.scope is not None, "need to call module once before getting variables"
        return tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, self.scope)

    @property
    def variables(self):
        assert self.scope is not None, "need to call module once before getting variables"
        return tf.get_collection(tf.GraphKeys.VARIABLES, self.scope)


def module(name):
    @functools.wraps
    def wrapper(f):
        class WrapperModule(Module):
            def _call(self, *args):
                return f(*args)
        return WrapperModule(name)
    return wrapper

# ================================================================
# Graph traversal
# ================================================================

VARIABLES = {}


def get_parents(node):
    return node.op.inputs

def topsorted(outputs):
    """
    Topological sort via non-recursive depth-first search
    """
    assert isinstance(outputs, (list,tuple))
    marks = {}
    out = []
    stack = [] #pylint: disable=W0621
    # i: node
    # jidx = number of children visited so far from that node
    # marks: state of each node, which is one of
    #   0: haven't visited
    #   1: have visited, but not done visiting children
    #   2: done visiting children
    for x in outputs:
        stack.append((x,0))
        while stack:
            (i,jidx) = stack.pop()
            if jidx == 0:
                m = marks.get(i,0)
                if m == 0:
                    marks[i] = 1
                elif m == 1:
                    raise ValueError("not a dag")
                else:
                    continue
            ps = get_parents(i)
            if jidx == len(ps):
                marks[i] = 2
                out.append(i)
            else:
                stack.append((i,jidx+1))
                j = ps[jidx]
                stack.append((j,0))
    return out


# ================================================================
# Flat vectors
# ================================================================

def var_shape(x):
    out = [k.value for k in x.get_shape()]
    assert all(isinstance(a, int) for a in out), \
        "shape function assumes that shape is fully known"
    return out

def numel(x):
    return intprod(var_shape(x))

def intprod(x):
    return int(np.prod(x))

def flatgrad(loss, var_list):
    grads = tf.gradients(loss, var_list)
    return tf.concat(0, [tf.reshape(grad, [numel(v)])
        for (v, grad) in zip(var_list, grads)])

class SetFromFlat(object):
    def __init__(self, var_list, dtype=tf.float32):
        assigns = []
        shapes = list(map(var_shape, var_list))
        total_size = np.sum([intprod(shape) for shape in shapes])

        self.theta = theta = tf.placeholder(dtype,[total_size])
        start=0
        assigns = []
        for (shape,v) in zip(shapes,var_list):
            size = intprod(shape)
            assigns.append(tf.assign(v, tf.reshape(theta[start:start+size],shape)))
            start+=size
        self.op = tf.group(*assigns)
    def __call__(self, theta):
        get_session().run(self.op, feed_dict={self.theta:theta})

class GetFlat(object):
    def __init__(self, var_list):
        self.op = tf.concat(0, [tf.reshape(v, [numel(v)]) for v in var_list])
    def __call__(self):
        return get_session().run(self.op)

# ================================================================
# Misc
# ================================================================


def fancy_slice_2d(X, inds0, inds1):
    """
    like numpy X[inds0, inds1]
    XXX this implementation is bad
    """
    inds0 = tf.cast(inds0, tf.int64)
    inds1 = tf.cast(inds1, tf.int64)
    shape = tf.cast(tf.shape(X), tf.int64)
    ncols = shape[1]
    Xflat = tf.reshape(X, [-1])
    return tf.gather(Xflat, inds0 * ncols + inds1)


def scope_vars(scope, trainable_only):
    """
    Get variables inside a scope
    The scope can be specified as a string
    """
    return tf.get_collection(
        tf.GraphKeys.TRAINABLE_VARIABLES if trainable_only else tf.GraphKeys.VARIABLES,
        scope=scope if isinstance(scope, str) else scope.name
    )

def lengths_to_mask(lengths_b, max_length):
    """
    Turns a vector of lengths into a boolean mask

    Args:
        lengths_b: an integer vector of lengths
        max_length: maximum length to fill the mask

    Returns:
        a boolean array of shape (batch_size, max_length)
        row[i] consists of True repeated lengths_b[i] times, followed by False
    """
    lengths_b = tf.convert_to_tensor(lengths_b)
    assert lengths_b.get_shape().ndims == 1
    mask_bt = tf.expand_dims(tf.range(max_length), 0) < tf.expand_dims(lengths_b, 1)
    return mask_bt


def in_session(f):
    @functools.wraps(f)
    def newfunc(*args, **kwargs):
        with tf.Session():
            f(*args, **kwargs)
    return newfunc


_PLACEHOLDER_CACHE = {} # name -> (placeholder, dtype, shape)
def get_placeholder(name, dtype, shape):
    print("calling get_placeholder", name)
    if name in _PLACEHOLDER_CACHE:
        out, dtype1, shape1 = _PLACEHOLDER_CACHE[name]
        assert dtype1==dtype and shape1==shape
        return out
    else:
        out = tf.placeholder(dtype=dtype, shape=shape, name=name)
        _PLACEHOLDER_CACHE[name] = (out,dtype,shape)
        return out
def get_placeholder_cached(name):
    return _PLACEHOLDER_CACHE[name][0]

def flattenallbut0(x):
    return tf.reshape(x, [-1, intprod(x.get_shape().as_list()[1:])])

def reset():
    global _PLACEHOLDER_CACHE
    global VARIABLES
    _PLACEHOLDER_CACHE = {}
    VARIABLES = {}
    tf.reset_default_graph()

So the thing to take away is just that load_policy builds a simple MLP, with hidden layers and nonlinearities, mapping observations to actions.
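In other words, the loaded policy computes nothing more than the following (a plain-numpy paraphrase of build_policy above, for illustration only):

import numpy as np

def expert_forward(obs_bo, policy_params, nonlin=np.tanh):
    """Numpy paraphrase of build_policy (illustrative only). policy_params
    is the dict loaded from the expert pickle; nonlin would be a leaky ReLU
    instead of tanh when nonlin_type == 'lrelu'."""
    # 1. Observation normalization.
    std_params = policy_params['obsnorm']['Standardizer']
    mean = std_params['mean_1_D']
    stdev = np.sqrt(np.maximum(0, std_params['meansq_1_D'] - np.square(mean)))
    x = (obs_bo - mean) / (stdev + 1e-6)
    # 2. Hidden affine layers, each followed by the nonlinearity.
    for name in sorted(policy_params['hidden']['FeedforwardNet']):
        layer = policy_params['hidden']['FeedforwardNet'][name]['AffineLayer']
        x = nonlin(x @ layer['W'] + layer['b'])
    # 3. Linear output layer produces the action.
    out = policy_params['out']['AffineLayer']
    return x @ out['W'] + out['b']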

The second assignment

Run DAgger. I couldn't find the code that needs filling in, though. In any case, DAgger requires the expert to relabel, after every run, the observations generated by the current policy with the correct actions, which is tedious and inefficient, so I won't dig further here.
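For completeness, the DAgger loop described above looks roughly like this (a sketch; train_fn and rollout_fn are assumed helpers that fit a BC policy and collect one batch of rollouts, not functions from the starter code):

def dagger(env, expert_fn, train_fn, rollout_fn, n_iters=10):
    """Sketch of the DAgger loop (illustrative names, not the starter code)."""
    obs_data, act_data = rollout_fn(env, expert_fn)          # seed with expert rollouts
    for _ in range(n_iters):
        policy_fn = train_fn(obs_data, act_data)             # 1. fit policy to dataset
        new_obs, _ = rollout_fn(env, policy_fn)              # 2. run the current policy
        new_acts = [expert_fn(o[None, :]) for o in new_obs]  # 3. expert relabels observations
        obs_data = list(obs_data) + list(new_obs)            # 4. aggregate datasets
        act_data = list(act_data) + list(new_acts)
    return policy_fn

Step 3 is the expensive part mentioned above: every observation visited by the learner must be labeled with the expert's action before retraining.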
