Like other areas of machine learning, reinforcement learning (RL) has a set of classic experimental scenarios, such as Mountain-Car and Cart-Pole. With the rise of deep reinforcement learning in recent years, new and more complex experimental scenarios keep appearing, and a series of excellent platforms has emerged: OpenAI Gym, MuJoCo, rllab, DeepMind Lab, TORCS, PySC2, and so on.
My environment:
Ubuntu 16.04
Anaconda2
Python 3.6 (it is recommended to create a new environment in Anaconda; all of the following steps are done inside that conda environment)
tensorflow-gpu 1.4.1 (baselines requires at least 1.4.1)
CUDA 8.0 (for CUDA installation see https://blog.csdn.net/Hansry/article/details/81008210)
cuDNN 6.0
MuJoCo (Multi-Joint dynamics with Contact) is a physics simulator that can be used for research on robot control, optimization, and related topics.
1. Preparation
Download mjpro150 linux from the official website, and click Licence to request a license. You will need to provide your full name, email address, and computer id. To obtain the computer id, download the getid_linux executable for your platform and run it as follows:
$ chmod a+x getid_linux   (grant execute permission)
$ ./getid_linux
The output looks something like LINUX_A1EHAO_Q8BPHTIM10F05D0S3TB3293.
After clicking submit, download the license file mjkey.txt from the email sent to the address you entered.
2. Environment setup
2.1 Create the hidden folder and copy mjpro150_linux into the .mujoco folder
mkdir ~/.mujoco
cp mjpro150_linux.zip ~/.mujoco
cd ~/.mujoco
unzip mjpro150_linux.zip
2.2 Copy the license file mjkey.txt into the hidden folder created above
cp mjkey.txt ~/.mujoco
cp mjkey.txt ~/.mujoco/mjpro150/bin
2.3 Add the environment variables: open the ~/.bashrc file and append the following lines (then run source ~/.bashrc or open a new terminal so they take effect)
export LD_LIBRARY_PATH=~/.mujoco/mjpro150/bin${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export MUJOCO_KEY_PATH=~/.mujoco${MUJOCO_KEY_PATH}
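As a quick sanity check (not part of the original steps), the short Python snippet below verifies that the unpacked binaries and the license key sit where mujoco-py will later look for them; the paths simply mirror the steps above.

import os

# Paths created in the steps above; adjust if you unpacked MuJoCo elsewhere.
mujoco_dir = os.path.expanduser("~/.mujoco")
expected = [
    "mjpro150/bin/simulate",   # the bundled viewer binary
    "mjkey.txt",               # license key in ~/.mujoco
    "mjpro150/bin/mjkey.txt",  # license key next to the binaries
]
for rel in expected:
    path = os.path.join(mujoco_dir, rel)
    print(path, "->", "ok" if os.path.exists(path) else "MISSING")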
3. Running the simulator
cd ~/.mujoco/mjpro150/bin
./simulate ../model/humanoid.xml
Next, install mujoco_py (the source is on the official GitHub page). Note that many packages may turn out to be missing during installation; just install whatever the error messages ask for.
pip3 install -U 'mujoco-py<1.50.2,>=1.50.1'
If the following runs without errors, the installation succeeded:
>>> import mujoco_py
>>> from os.path import dirname
>>> model = mujoco_py.load_model_from_path(dirname(dirname(mujoco_py.__file__)) +"/xmls/claw.xml")
>>> sim = mujoco_py.MjSim(model)
>>> print(sim.data.qpos)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
>>> sim.step()
>>> print(sim.data.qpos)
[ 2.09217903e-06 -1.82329050e-12 -1.16711384e-07 -4.69613872e-11
-1.43931860e-05 4.73350204e-10 -3.23749942e-05 -1.19854057e-13
-2.39251380e-08 -4.46750545e-07 1.78771599e-09 -1.04232280e-08]
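The check above runs headless. To also open an on-screen viewer, the sketch below uses mujoco_py's MjViewer with the same bundled claw.xml model (assuming mujoco-py 1.50 as installed above; the window behaviour depends on your OpenGL setup).

import os
import mujoco_py

# Load the example model shipped with mujoco_py, as in the session above.
model_path = os.path.join(os.path.dirname(os.path.dirname(mujoco_py.__file__)),
                          "xmls", "claw.xml")
model = mujoco_py.load_model_from_path(model_path)
sim = mujoco_py.MjSim(model)
viewer = mujoco_py.MjViewer(sim)  # opens a GLFW window

for _ in range(1000):
    sim.step()       # advance the physics
    viewer.render()  # draw the current state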
An error that is almost certain to appear when the render window is created is:
Creating window glfw
ERROR: GLEW initialization error: Missing GL version
Press Enter to exit ...
Adding export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia-390 to ~/.bashrc fixes the problem (replace 390 with the version of the NVIDIA driver installed on your machine).
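The 390 in that path is the NVIDIA driver version and differs between machines; a quick way to see which driver directory exists on your system (a convenience check, not from the original post):

import glob

# Use whichever /usr/lib/nvidia-* directory is listed here in the export above.
print(glob.glob("/usr/lib/nvidia-*"))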
OpenAI Gym is OpenAI's toolkit for research on reinforcement learning algorithms. It covers a large number of scenarios, from the classic Cart-Pole and Mountain-Car to Atari, Go, and MuJoCo. The official website is https://gym.openai.com/ and the source code is at https://github.com/openai/gym; its README describes installation and running examples. Following the installation instructions there:
Minimal installation
git clone https://github.com/openai/gym.git
cd gym
pip install -e .
Full installation
apt-get install -y python-numpy python-dev cmake zlib1g-dev libjpeg-dev xvfb libav-tools xorg-dev python-opengl libboost-all-dev libsdl2-dev swig Pillow libglfw3-dev
pip install -e '.[all]'
Go into ~/gym/examples/scripts and run ./list_envs to see that gym ships with a large number of environments.
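The same list can also be obtained from Python; a minimal sketch using gym's environment registry (matching the gym version installed above):

from gym import envs

# Print every registered environment id, roughly what ./list_envs does.
env_ids = sorted(spec.id for spec in envs.registry.all())
print(len(env_ids), "environments registered")
for env_id in env_ids[:20]:  # show only the first few
    print(env_id)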
An environment from gym can be created and run as in the following code:
import gym
env = gym.make('Hero-ram-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
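Before plugging a learning algorithm into an environment, it helps to know what the environment expects and returns; a short check of the observation and action spaces (for Atari RAM environments such as Hero-ram-v0 the observation is the 128-byte RAM state and the actions are discrete):

import gym

env = gym.make('Hero-ram-v0')
print("observation space:", env.observation_space)   # e.g. Box(128,)
print("action space:", env.action_space)             # e.g. Discrete(n)
print("a random action:", env.action_space.sample())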
4. OpenAI Baselines
OpenAI Baselines is a set of high-quality implementations of reinforcement learning algorithms. It requires Python >= 3.5 plus OpenMPI and zlib. Some of the baseline examples are based on the MuJoCo physics simulator, which can be installed following the instructions above.
4.1 Download the baselines package and install it together with its dependencies
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
4.2 Install tensorflow-gpu. Since our CUDA and cuDNN versions are 8.0 and 6.0 respectively, we need to install the matching tensorflow-gpu:
conda install tensorflow-gpu=1.4.1
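To confirm that the GPU build is actually picked up (a quick check, not from the original post), listing the local devices should show a GPU entry next to the CPU:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)  # should print 1.4.1
# A '/device:GPU:0' entry means tensorflow-gpu can see the CUDA device.
print([d.name for d in device_lib.list_local_devices()])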
4.3 To run the algorithm tests shipped with baselines, install pytest:
pip install pytest
pytest
If the output looks like the following, the setup is working:
================================================================================== test session starts ==================================================================================
platform linux -- Python 3.6.6, pytest-3.6.3, py-1.5.4, pluggy-0.6.0
rootdir: /home/hansry/append/RL/baselines, inifile:
collected 12 items
baselines/common/test_identity.py
For any other missing packages, install whatever is reported as missing.
As an aside, there is a community-written package on GitHub that combines gazebo and gym: https://github.com/erlerobot/gym-gazebo
Go into the /baselines/baselines/her/experiment folder; it contains config.py, play.py, train.py, and other files.
config.py is where the parameters are set, for example:
DEFAULT_PARAMS = {
    # env
    'max_u': 1.,  # max absolute value of actions on different coordinates
    # ddpg
    'layers': 3,  # number of layers in the critic/actor networks
    'hidden': 256,  # number of neurons in each hidden layers
    'network_class': 'baselines.her.actor_critic:ActorCritic',
    'Q_lr': 0.001,  # critic learning rate
    'pi_lr': 0.001,  # actor learning rate
    'buffer_size': int(1E6),  # for experience replay
    'polyak': 0.95,  # polyak averaging coefficient
    'action_l2': 1.0,  # quadratic penalty on actions (before rescaling by max_u)
    'clip_obs': 200.,
    'scope': 'ddpg',  # can be tweaked for testing
    'relative_goals': False,
    # training
    'n_cycles': 50,  # per epoch
    'rollout_batch_size': 2,  # per mpi thread
    'n_batches': 40,  # training batches per cycle
    'batch_size': 256,  # per mpi thread, measured in transitions and reduced to even multiple of chunk_length.
    'n_test_rollouts': 10,  # number of test rollouts per epoch, each consists of rollout_batch_size rollouts
    'test_with_polyak': False,  # run test episodes with the target network
    # exploration
    'random_eps': 0.3,  # percentage of time a random action is taken
    'noise_eps': 0.2,  # std of gaussian noise added to not-completely-random actions as a percentage of max_u
    # HER
    'replay_strategy': 'future',  # supported modes: future, none
    'replay_k': 4,  # number of additional goals used for replay, only used if off_policy_data=future
    # normalization
    'norm_eps': 0.01,  # epsilon used for observation normalization
    'norm_clip': 5,  # normalized observations are cropped to this values
}
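These defaults can be changed by editing config.py, or, for quick experiments, by overriding a copy of the dictionary before training; a hypothetical sketch (DEFAULT_PARAMS comes from config.py, the override values are only examples):

from baselines.her.experiment.config import DEFAULT_PARAMS

# Work on a copy so the module-level defaults stay untouched.
params = dict(DEFAULT_PARAMS)
params.update({
    'n_cycles': 10,     # shorter epochs for a quick smoke test
    'batch_size': 128,  # smaller batches per MPI thread
})
print(params['n_cycles'], params['batch_size'])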
train.py trains DDPG+HER; it sets the training parameters (such as the neural network hyperparameters) and exposes a number of command-line options:
@click.option('--env', type=str, default='FetchReach-v1', help='the name of the OpenAI Gym environment that you want to train on') (selects the task HER is trained on)
@click.option('--logdir', type=str, default=None, help='the path to where logs and policy pickles should go. If not specified, creates a folder in /tmp/') (where the logs and the trained policy are saved)
@click.option('--n_epochs', type=int, default=50, help='the number of training epochs to run') (number of training epochs)
@click.option('--num_cpu', type=int, default=1, help='the number of CPU cores to use (using MPI)') (how many CPU cores to use)
@click.option('--seed', type=int, default=0, help='the random seed used to seed both the environment and the training code')
@click.option('--policy_save_interval', type=int, default=5, help='the interval with which policy pickles are saved. If set to 0, only the best and latest policy will be pickled.')
@click.option('--replay_strategy', type=click.Choice(['future', 'none']), default='future', help='the HER replay strategy to be used. "future" uses HER, "none" disables HER.')
@click.option('--clip_return', type=int, default=1, help='whether or not returns should be clipped')
The training command is python train.py --num_cpu=2 (the options above are optional).
play.py runs a trained policy; the command is:
python play.py policy_best.pkl   (followed by the trained policy file)
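Under the hood, play.py essentially unpickles the saved policy and rolls it out in the environment; a minimal hedged sketch of just the loading step (the real play.py also rebuilds the environment and runs evaluation rollouts):

import pickle

# Load a policy pickled during training, e.g. policy_best.pkl from the log dir.
with open('policy_best.pkl', 'rb') as f:
    policy = pickle.load(f)
print(type(policy))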