强化学习 — mujoco、mujoco_py、gym 和 baselines的环境配置

和其它的机器学习方向一样,强化学习(Reinforcement Learning)也有一些经典的实验场景,像Mountain-Car,Cart-Pole等。由于近年来深度强化学习(Deep Reinforcement Learning)的兴起,各种新的更复杂的实验场景也在不断涌现。于是出现了OpenAI Gym,MuJoCo,rllab, DeepMind Lab, TORCS, PySC2等一系列优秀的平台。

博主环境
Ubuntu16.04
Anaconda2
python 3.6(建议重新在anaconda中创建新的环境,以下操作均在conda创建环境下配置)
tensorflow-gpu 1.4.1 (baseline 最低要求1.4.1)
CUDA 8.0 (CUDA的安装可参考https://blog.csdn.net/Hansry/article/details/81008210)
Cudnn 6.0

1.安装mujoco

MuJoCo(Multi-Joint dynamics with Contact)是一个物理模拟器,可以用于机器人控制优化等研究。
1.准备工作
在官网上下载 mjpro150 linux ,同时点击Licence下载许可证,需要full name email address computer id 等信息,其中根据使用平台下载 getid_linux(可执行文件) 获取 computer id, 步骤如下:

$ chmod a+x getid_linux (给予执行权限)
$ ./getid_linux  
  • 1
  • 2

输出结果类似于 LINUX_A1EHAO_Q8BPHTIM10F05D0S3TB3293
这里写图片描述
点击submint 后,从输入的邮箱中下载证书mjkey.txt

2.环境配置
2.1 创建隐藏文件夹并将 mjpro150_linux 拷贝到 mujoco 文件夹中

mkdir ~/.mujoco    
cp mjpro150_linux.zip ~/.mujoco
cd ~/.mujoco
unzip mjpro150_linux.zip
  • 1
  • 2
  • 3
  • 4

2.2 将证书mjkey.txt拷贝到创建的隐藏文件夹中

cp mjkey.txt ~/.mujoco  
cp mjkey.txt ~/.mujoco/mjpro150/bin
  • 1
  • 2

2.3.添加环境变量, 打开~/.bashrc 文件,将以下命令添加进去

export LD_LIBRARY_PATH=~/.mujoco/mjpro150/bin${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export MUJOCO_KEY_PATH=~/.mujoco${MUJOCO_KEY_PATH}
  • 1
  • 2

3.运行结果

cd ~/.mujoco/mjpro150/bin
./simulate ../model/humanoid.xml
  • 1
  • 2

2.安装mujoco_py

首先现在官网上下载安装 mujoco_py源码, 注意的是在这里安装的时候可能会缺很多包,但是提示什么装什么就行了。

 pip3 install -U 'mujoco-py<1.50.2,>=1.50.1'
  • 1

如下顺利执行则说明安装成功

>>> import mujoco_py
>>> from os.path import dirname
>>> model = mujoco_py.load_model_from_path(dirname(dirname(mujoco_py.__file__))  +"/xmls/claw.xml")
>>> sim = mujoco_py.MjSim(model)
>>> print(sim.data.qpos)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> sim.step()
>>> print(sim.data.qpos)
[ 2.09217903e-06 -1.82329050e-12 -1.16711384e-07 -4.69613872e-11
 -1.43931860e-05  4.73350204e-10 -3.23749942e-05 -1.19854057e-13
 -2.39251380e-08 -4.46750545e-07  1.78771599e-09 -1.04232280e-08]
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11

一定会出现的问题为

Creating window glfw
ERROR: GLEW initialization error: Missing GL version

Press Enter to exit ...
  • 1
  • 2
  • 3
  • 4

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia-390 加入到~/.bashrc 中解决问题

运行python body_interaction.py

3.安装gym

OpenAI Gym是OpenAI出的研究强化学习算法的toolkit,它里边cover的场景非常多,从经典的Cart-Pole, Mountain-Car到Atar,Go,MuJoCo都有。官方网站为https://gym.openai.com/,源码位于https://github.com/openai/gym,它的readme提供了安装和运行示例,按其中的安装方法:
最小安装

git clone https://github.com/openai/gym.git
cd gym
pip install -e .
  • 1
  • 2
  • 3

完全安装

apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig Pillow  libglfw3-dev
pip install -e '.[all]'
  • 1
  • 2

进入~/gym/examples/scripts中,通过./list_envs可发现gym中拥有很多环境。
通过调用gym中的环境,如下代码所示,可以运行

import gym
env = gym.make('Hero-ram-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12

<>

import gym
env = gym.make('Hero-ram-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12

4.安装baseline

OpenAI Baseline是一系列高质量的强化学习控制算法,需要python>=3.5, 且需要OpenMPIzlib,有些baseline example 是基于Mujoco 物理仿真环境的,可以根据上面的教程进行安装。

4.1 下载baselines包并且配置相关工具

git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
  • 1
  • 2
  • 3

4.2 安装 tensorflow-gpu,由于我们的CUDA 和Cudnn 分别是8.0 和6.0, 所以我们需要安装对应的tensorflow-gpu

conda install tensorflow-gpu=1.4.1
  • 1

4.3 为了在baseline 中对算法进行测试,需要安装pytest

pip install pytest
pytest
  • 1
  • 2

输出以下内容即配置成功

================================================================================== test session starts ==================================================================================
platform linux -- Python 3.6.6, pytest-3.6.3, py-1.5.4, pluggy-0.6.0
rootdir: /home/hansry/append/RL/baselines, inifile:
collected 12 items                                                                                                                                                                      

baselines/common/test_identity.py 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

其他包都是缺什么补什么

补充:github 上有大神写的gazebogym的包:https://github.com/erlerobot/gym-gazebo

5.baselines 中HER(Hindsight experience replay)的使用

进入到/baselines/baselines/her/experiment文件夹下,发现含有config.py play.py train.py等文件

其中config.py一般是设置其参数,诸如下面

DEFAULT_PARAMS = {
    # env
    'max_u': 1.,  # max absolute value of actions on different coordinates
    # ddpg
    'layers': 3,  # number of layers in the critic/actor networks
    'hidden': 256,  # number of neurons in each hidden layers
    'network_class': 'baselines.her.actor_critic:ActorCritic',
    'Q_lr': 0.001,  # critic learning rate
    'pi_lr': 0.001,  # actor learning rate
    'buffer_size': int(1E6),  # for experience replay
    'polyak': 0.95,  # polyak averaging coefficient
    'action_l2': 1.0,  # quadratic penalty on actions (before rescaling by max_u)
    'clip_obs': 200.,
    'scope': 'ddpg',  # can be tweaked for testing
    'relative_goals': False,
    # training
    'n_cycles': 50,  # per epoch
    'rollout_batch_size': 2,  # per mpi thread
    'n_batches': 40,  # training batches per cycle
    'batch_size': 256,  # per mpi thread, measured in transitions and reduced to even multiple of chunk_length.
    'n_test_rollouts': 10,  # number of test rollouts per epoch, each consists of rollout_batch_size rollouts
    'test_with_polyak': False,  # run test episodes with the target network
    # exploration
    'random_eps': 0.3,  # percentage of time a random action is taken
    'noise_eps': 0.2,  # std of gaussian noise added to not-completely-random actions as a percentage of max_u
    # HER
    'replay_strategy': 'future',  # supported modes: future, none
    'replay_k': 4,  # number of additional goals used for replay, only used if off_policy_data=future
    # normalization
    'norm_eps': 0.01,  # epsilon used for observation normalization
    'norm_clip': 5,  # normalized observations are cropped to this values
}
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32

其中train.py为训练DDPG+HER中的参数,诸如神经网络中的参数等,在训练参数过程中,还有许多可选选项

@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on') (选择HER执行的任务)
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/') (选择训练完policy的参数)
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run') (迭代的次数)
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)') (使用多少个CPU)
@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

训练的命令是python train.py --num_cpu=2 (参数可选)

其中play.py为调用训练好的参数进行执行,执行命令如下:

python play.py policy_best.pkl(后面需要跟着训练好的参数文件)
  • 1


你可能感兴趣的:(reinforment,learning)