Official documentation
Mac
release_19 version
git clone --branch release_19 https://github.com/Unity-Technologies/ml-agents.git
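I tried twice, but the network was too slow and the clone failed both times. Instead I downloaded the zip file from the GitHub website and extracted it: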
unzip ml-agents-release_19.zip
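The project contains:
com.unity.ml-agents: the Unity package.
com.unity.ml-agents.extensions: a Unity package, experimental and optional, that depends on com.unity.ml-agents.
mlagents: the Python library for training agents; depends on mlagents_envs.
mlagents_envs: a low-level Python library.
gym-unity: a Python library that adds OpenAI Gym support.
Project: a collection of demos.
Create a conda environment and install the Python package:
conda create -n ml-agents python=3.6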
conda activate ml-agents
python -m pip install mlagents==0.28.0
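Official documentation
Open the project ml-agents-release_19/Project directly from Unity Hub. As the figure below shows, it contains many examples.
Open the 3DBall/Scenes/3DBall scene; you can run it right away and watch the Agents in action.
Change into the Project folder and run: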
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun
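This starts the Python training process. The config folder ships with training configuration files for the projects under Examples, and run-id is the name of this training run. Then press Play in Unity to provide the training environment, and training begins. To stop training, press Ctrl+C. To resume training, run: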
mlagents-learn config/ppo/3DBall.yaml --run-id=first3DBallRun --resume
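If you launch runs from a script, the two commands above differ only in the --resume flag. Here is a minimal Python sketch of that idea. The helper name build_learn_command is my own and is not part of ML-Agents; it only assembles the argument list using the flags shown above.

```python
from typing import List


def build_learn_command(config_path: str, run_id: str, resume: bool = False) -> List[str]:
    """Assemble an mlagents-learn command line as an argument list.

    Hypothetical helper: it only uses the flags that appear in this post.
    """
    command = ["mlagents-learn", config_path, f"--run-id={run_id}"]
    if resume:
        # --resume continues a previously stopped run with the same run-id.
        command.append("--resume")
    return command


if __name__ == "__main__":
    first_run = build_learn_command("config/ppo/3DBall.yaml", "first3DBallRun")
    resumed_run = build_learn_command("config/ppo/3DBall.yaml", "first3DBallRun", resume=True)
    print(" ".join(first_run))
    print(" ".join(resumed_run))
```

To actually start training, the list could be passed to subprocess.run(command, check=True) from the folder that contains the config directory.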
During training, a results folder is created in the current directory. It stores the logs and model files produced during training, much like TF/PyTorch:
results
└── first3DBallRun
    ├── 3DBall
    │   ├── 3DBall-151000.onnx
    │   ├── 3DBall-151000.pt
    │   ├── checkpoint.pt
    │   └── events.out.tfevents.1653732428.bogon.89658.0
    ├── 3DBall.onnx
    ├── configuration.yaml
    └── run_logs
        ├── timers.json
        └── training_status.json
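The intermediate checkpoints in the listing above follow a <behavior>-<step>.onnx naming pattern. As a rough sketch, the latest one can be picked out with a few lines of Python. The function name latest_onnx_checkpoint and the extra file 3DBall-100000.onnx in the demo are my own inventions for illustration.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Matches names such as "3DBall-151000.onnx".
CHECKPOINT_PATTERN = re.compile(r"^(?P<behavior>.+)-(?P<step>\d+)\.onnx$")


def latest_onnx_checkpoint(run_dir: Path, behavior: str) -> Optional[Path]:
    """Return the <behavior>-<step>.onnx file with the highest step, or None."""
    best_step = -1
    best_path = None
    for path in (run_dir / behavior).glob("*.onnx"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and match.group("behavior") == behavior:
            step = int(match.group("step"))  # compare numerically, not as text
            if step > best_step:
                best_step, best_path = step, path
    return best_path


if __name__ == "__main__":
    # Build a throwaway directory that mimics the layout shown above.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "results" / "first3DBallRun"
        (run_dir / "3DBall").mkdir(parents=True)
        for name in ("3DBall-100000.onnx", "3DBall-151000.onnx", "checkpoint.pt"):
            (run_dir / "3DBall" / name).touch()
        print(latest_onnx_checkpoint(run_dir, "3DBall").name)
```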
You can use TensorBoard to inspect the metrics recorded during training:
tensorboard --logdir results
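Open localhost:6006 in a browser to see the reward, loss, and other curves.
The scene contains 12 Agents. They act independently but share one model, and during training each of them contributes experience to the same parameter updates. This is similar to running 12 collection threads, or to gathering 12 samples per environment step, so experience is collected roughly 12 times faster than with a single Agent.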
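To make the effect of the 12 Agents concrete, here is a toy calculation in Python. It assumes every Agent adds exactly one experience to a shared buffer per environment step and ignores all overhead, which is a simplification. The function name and the buffer size of 12000 are illustrative choices of mine.

```python
import math


def env_steps_to_fill_buffer(buffer_size: int, num_agents: int) -> int:
    """Environment steps needed before a shared buffer holds buffer_size experiences.

    Assumes each agent contributes one experience per environment step.
    """
    if buffer_size <= 0 or num_agents <= 0:
        raise ValueError("buffer_size and num_agents must be positive")
    return math.ceil(buffer_size / num_agents)


if __name__ == "__main__":
    buffer_size = 12000  # illustrative value
    for num_agents in (1, 12):
        steps = env_steps_to_fill_buffer(buffer_size, num_agents)
        print(f"{num_agents} agent(s): {steps} environment steps per buffer")
```

With these numbers a single Agent needs 12000 environment steps to fill the buffer, while 12 Agents need 1000.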
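At this point I had walked through the ML-Agents demo and got it running, but the details were still unclear to me, so the next step is to build a complete ML-Agents setup from scratch.
Official documentation
Window -> Package Manager -> Add package from disk, then select ml-agents-release_19/com.unity.ml-agents/package.json. The import succeeds.
As shown in the figure, the Agent is a small ball (RollerAgent) whose goal is to reach the cube (Target). Add a Rigidbody component to RollerAgent; if it leaves the plane, gravity makes it fall, and the episode fails.
Create a new script, RollerAgent.cs. This is the core: the state, the reward, and the actions are all defined here.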
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;

public class RollerAgent : Agent
{
    Rigidbody rBody;

    void Start()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public Transform Target;

    public override void OnEpisodeBegin()
    {
        // If the Agent fell, zero its momentum
        if (this.transform.localPosition.y < 0)
        {
            this.rBody.angularVelocity = Vector3.zero;
            this.rBody.velocity = Vector3.zero;
            this.transform.localPosition = new Vector3(0, 0.5f, 0);
        }

        // Move the target to a new spot
        Target.localPosition = new Vector3(Random.value * 8 - 4, 0.5f, Random.value * 8 - 4);
    }

    // the data will be fed into a neural network as a feature vector
    public override void CollectObservations(VectorSensor sensor)
    {
        // Target and Agent positions
        sensor.AddObservation(Target.localPosition);
        sensor.AddObservation(this.transform.localPosition);

        // Agent velocity
        sensor.AddObservation(rBody.velocity.x);
        sensor.AddObservation(rBody.velocity.z);
    }

    public float forceMultiplier = 10;

    public override void OnActionReceived(ActionBuffers actionBuffers)
    {
        // Actions, size = 2
        Vector3 controlSignal = Vector3.zero;
        controlSignal.x = actionBuffers.ContinuousActions[0];
        controlSignal.z = actionBuffers.ContinuousActions[1];
        rBody.AddForce(controlSignal * forceMultiplier);

        // Rewards
        float distanceToTarget = Vector3.Distance(this.transform.localPosition, Target.localPosition);

        // Reached target
        if (distanceToTarget < 1.42f)
        {
            SetReward(1.0f);
            EndEpisode();
        }
        // Fell off platform
        else if (this.transform.localPosition.y < 0)
        {
            EndEpisode();
        }
    }

    public override void Heuristic(in ActionBuffers actionsOut)
    {
        var continuousActionsOut = actionsOut.ContinuousActions;
        continuousActionsOut[0] = Input.GetAxis("Horizontal");
        continuousActionsOut[1] = Input.GetAxis("Vertical");
    }
}
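Reinforcement learning consists mainly of an agent, an environment, states, actions, and rewards. An episode runs from its start until the task succeeds, the task fails, or a timeout is reached, and the reward is optimized within each episode.
OnEpisodeBegin
Initializes the state at the start of an episode.
CollectObservations
Defines the state. The state data is passed to the model, and the model outputs an action based on the current state.
OnActionReceived
Applies the action to the environment here; how the environment then changes the state is computed by Unity. The reward is also assigned here.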
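To see how the three callbacks fit together outside Unity, here is a toy Python stand-in for the RollerAgent loop. It is only a sketch: the class name RollerEnvSketch, the unit mass, the Euler integration with a 0.02 s timestep, and the platform half-width of 5 are all my own assumptions, and none of this is the ML-Agents API. The observation layout (3 + 3 + 2 = 8 floats) and the reward rule are taken from the script above.

```python
import math
import random
from typing import List, Tuple


class RollerEnvSketch:
    """Toy stand-in for the RollerAgent callbacks (no Unity physics)."""

    def __init__(self, force_multiplier: float = 10.0, dt: float = 0.02, seed: int = 0):
        self.force_multiplier = force_multiplier
        self.dt = dt  # assumed fixed timestep
        self.rng = random.Random(seed)
        self.position = [0.0, 0.5, 0.0]
        self.velocity = [0.0, 0.0]  # x and z components only
        self.target = [0.0, 0.5, 0.0]

    def on_episode_begin(self) -> None:
        # Mirrors OnEpisodeBegin: reset a fallen agent, then move the target.
        if self.position[1] < 0:
            self.position = [0.0, 0.5, 0.0]
            self.velocity = [0.0, 0.0]
        self.target = [self.rng.random() * 8 - 4, 0.5, self.rng.random() * 8 - 4]

    def collect_observations(self) -> List[float]:
        # Mirrors CollectObservations: target (3) + position (3) + velocity (2).
        return self.target + self.position + self.velocity

    def on_action_received(self, action: Tuple[float, float]) -> Tuple[float, bool]:
        # Mirrors OnActionReceived: apply the action, then return (reward, done).
        self.velocity[0] += action[0] * self.force_multiplier * self.dt
        self.velocity[1] += action[1] * self.force_multiplier * self.dt
        self.position[0] += self.velocity[0] * self.dt
        self.position[2] += self.velocity[1] * self.dt
        # Assumed platform half-width of 5 units; leaving it counts as falling.
        if abs(self.position[0]) > 5 or abs(self.position[2]) > 5:
            self.position[1] = -1.0
        distance = math.sqrt(sum((p - t) ** 2 for p, t in zip(self.position, self.target)))
        if distance < 1.42:
            return 1.0, True  # reached the target
        if self.position[1] < 0:
            return 0.0, True  # fell off the platform
        return 0.0, False


def clamp(value: float) -> float:
    """Keep an action component inside [-1, 1]."""
    return max(-1.0, min(1.0, value))


if __name__ == "__main__":
    env = RollerEnvSketch()
    env.on_episode_begin()
    reward, done, steps = 0.0, False, 0
    while not done and steps < 2000:
        obs = env.collect_observations()
        # Hand-written policy standing in for the model: push toward the target.
        action = (clamp(obs[0] - obs[3]), clamp(obs[2] - obs[5]))
        reward, done = env.on_action_received(action)
        steps += 1
    print(f"steps={steps} reward={reward} done={done}")
```

The hand-written policy in the demo plays the same role as the Heuristic function in the C# script: it supplies actions while no trained model exists.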
Add the following components to the Agent and adjust a few parameters:
RollerAgent
The script written above.
DecisionRequester
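Makes the Agent "request decisions on its own at regular intervals". I did not fully understand this at first. Without this component the Agent receives no actions unless the script calls RequestDecision() itself.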
BehaviorParameters
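Model configuration, including the dimension of the state vector, the dimension of the action, the model file, and so on.
At this point the environment and the Agent are set up, but there is no model yet. Before training one, test manually by adding the Heuristic function (included in the script above), which lets you move the ball with the arrow keys. In effect, the person at the keyboard is the model behind the Agent. This also verifies that the environment is set up correctly.
Create a training configuration file, Config/rollerball_config.yaml, under the Assets directory: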
behaviors:
  RollerBall:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 2000
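A couple of these numbers are easier to read after a little arithmetic. The Python sketch below uses the values from the config above to count the mini-batch updates per filled buffer and to show what a linear learning-rate schedule looks like. Both functions are my own illustration: the check that buffer_size is a multiple of batch_size is an assumption of the sketch, and the real ML-Agents schedule may differ in detail (for example, it may not decay all the way to zero).

```python
# Values copied from rollerball_config.yaml above.
BATCH_SIZE = 10
BUFFER_SIZE = 100
NUM_EPOCH = 3
LEARNING_RATE = 3.0e-4
MAX_STEPS = 500000


def gradient_updates_per_buffer(batch_size: int, buffer_size: int, num_epoch: int) -> int:
    """Mini-batch updates performed each time the buffer fills (sketch)."""
    if buffer_size % batch_size != 0:
        # Assumption of this sketch, kept to make the arithmetic exact.
        raise ValueError("buffer_size should be a multiple of batch_size")
    return (buffer_size // batch_size) * num_epoch


def linear_schedule(initial: float, step: int, max_steps: int, floor: float = 0.0) -> float:
    """Linearly decay `initial` toward `floor` as step approaches max_steps."""
    progress = min(max(step / max_steps, 0.0), 1.0)
    return floor + (initial - floor) * (1.0 - progress)


if __name__ == "__main__":
    updates = gradient_updates_per_buffer(BATCH_SIZE, BUFFER_SIZE, NUM_EPOCH)
    print(f"updates per buffer: {updates}")
    for step in (0, 250000, 500000):
        rate = linear_schedule(LEARNING_RATE, step, MAX_STEPS)
        print(f"step {step}: learning rate {rate:.2e}")
```

With batch_size 10, buffer_size 100, and num_epoch 3, each filled buffer is split into 10 mini-batches and passed over 3 times, giving 30 updates.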
See the Config parameter reference for a description of each parameter.
From the Assets directory, run:
mlagents-learn Config/rollerball_config.yaml --run-id=RollerBall
Then press Play in the Unity project to start training. Once the reward looks good enough, press Ctrl+C to stop training:
…
[INFO] RollerBall. Step: 46000. Time Elapsed: 126.200 s. Mean Reward: 0.908. Std of Reward: 0.289. Training.
[INFO] RollerBall. Step: 48000. Time Elapsed: 131.253 s. Mean Reward: 0.862. Std of Reward: 0.345. Training.
[INFO] RollerBall. Step: 50000. Time Elapsed: 136.253 s. Mean Reward: 0.878. Std of Reward: 0.328. Training.
[INFO] RollerBall. Step: 52000. Time Elapsed: 141.376 s. Mean Reward: 0.915. Std of Reward: 0.279. Training.
[INFO] RollerBall. Step: 54000. Time Elapsed: 146.467 s. Mean Reward: 0.879. Std of Reward: 0.327. Training.
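If you want these numbers outside TensorBoard, the summary lines above are regular enough to parse. The Python sketch below is written against the exact line format shown in this post; the names LogEntry and parse_training_log are mine, and other ML-Agents versions may print the lines differently.

```python
import re
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    behavior: str
    step: int
    elapsed_s: float
    mean_reward: float
    std_reward: float


# Written against the summary line format shown above.
LOG_PATTERN = re.compile(
    r"\[INFO\] (?P<behavior>\S+)\. Step: (?P<step>\d+)\. "
    r"Time Elapsed: (?P<elapsed>[\d.]+) s\. "
    r"Mean Reward: (?P<mean>-?[\d.]+)\. "
    r"Std of Reward: (?P<std>[\d.]+)\."
)


def parse_training_log(lines: List[str]) -> List[LogEntry]:
    """Extract one LogEntry per training summary line; other lines are skipped."""
    entries = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue
        entries.append(
            LogEntry(
                behavior=match["behavior"],
                step=int(match["step"]),
                elapsed_s=float(match["elapsed"]),
                mean_reward=float(match["mean"]),
                std_reward=float(match["std"]),
            )
        )
    return entries


if __name__ == "__main__":
    sample = [
        "[INFO] RollerBall. Step: 46000. Time Elapsed: 126.200 s. Mean Reward: 0.908. Std of Reward: 0.289. Training.",
        "[INFO] RollerBall. Step: 48000. Time Elapsed: 131.253 s. Mean Reward: 0.862. Std of Reward: 0.345. Training.",
    ]
    for entry in parse_training_log(sample):
        print(entry.step, entry.mean_reward)
```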
The training logs and results are in the results directory under Assets:
results
├── RollerBall
│   ├── RollerBall
│   │   ├── RollerBall-55804.onnx
│   │   ├── RollerBall-55804.onnx.meta
│   │   ├── RollerBall-55804.pt
│   │   ├── RollerBall-55804.pt.meta
│   │   ├── checkpoint.pt
│   │   ├── checkpoint.pt.meta
│   │   ├── events.out.tfevents.1653813243.bogon.91531.0
│   │   └── events.out.tfevents.1653813243.bogon.91531.0.meta
│   ├── RollerBall.meta
│   ├── RollerBall.onnx
│   ├── RollerBall.onnx.meta
│   ├── configuration.yaml
│   └── configuration.yaml.meta
└── RollerBall.meta
Use TensorBoard to view the training metric curves:
tensorboard --logdir results
Assign the trained model results/RollerBall/RollerBall.onnx to the Model field of the Behavior Parameters component.
Run the scene in Unity to see the result.
There are two approaches:
mlagents-learn Config/rollerball_config.yaml --run-id=RollerBall --num-envs=2
I used the first approach when testing; I have not tried the second yet.