A Summary of Reward Functions for Some MuJoCo Environments in Gym

Profiles of a Few MuJoCo Locomotion Tasks

Peng Zhenghao

2020.04.18

Notations:

FR = forward reward = forward_reward_weight * x_velocity

HR = healthy reward = healthy_reward_weight * is_healthy [bool]

CC = control cost = ctrl_cost_weight * sum(square(action))

TC = contact cost = contact_cost_weight * sum(square(contact_force))

The general formulation of these tasks is as follows (though some environments do not compute all of the terms):

reward = (
  forward_reward_weight * x_velocity
  + healthy_reward_weight * is_healthy
  - ctrl_cost_weight * sum(square(action))
  - contact_cost_weight * sum(square(contact_force))
)
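As a sanity check, the general formulation can be written as one standalone Python function. The function name and the default weights here are placeholders for illustration; the per-environment values actually used by Gym are listed in the table below. Environments that lack a term (e.g. HalfCheetah has no healthy reward) can simply pass a zero weight.

```python
def compute_reward(x_velocity, is_healthy, action, contact_forces,
                   forward_reward_weight=1.0,
                   healthy_reward_weight=1.0,
                   ctrl_cost_weight=0.5,
                   contact_cost_weight=5e-4):
    """Generic MuJoCo locomotion reward: FR + HR - CC - TC."""
    forward_reward = forward_reward_weight * x_velocity
    healthy_reward = healthy_reward_weight * float(is_healthy)
    ctrl_cost = ctrl_cost_weight * sum(a * a for a in action)
    contact_cost = contact_cost_weight * sum(f * f for f in contact_forces)
    return forward_reward + healthy_reward - ctrl_cost - contact_cost
```

For instance, a healthy agent moving forward at 1 m/s with zero action and no contact forces would receive a reward of 2.0 under the default weights above.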
|                        | Ant-v3               | HalfCheetah-v3 | Hopper-v3                               | Humanoid-v3       | Walker2d-v3                                       |
|------------------------|----------------------|----------------|-----------------------------------------|-------------------|---------------------------------------------------|
| Obs Dim (low, high)    | 111 (-inf, inf)      | 17 (-inf, inf) | 11 (-inf, inf)                          | 376 (-inf, inf)   | 17 (-inf, inf)                                    |
| Act Dim (low, high)    | 8 (-1, 1)            | 6 (-1, 1)      | 3 (-1, 1)                               | 17 (-0.4, 0.4)    | 6 (-1, 1)                                         |
| Reward Range           | (-inf, inf)          | (-inf, inf)    | (-inf, inf)                             | (-inf, inf)       | (-inf, inf)                                       |
| Reward Formulation     | FR + HR - CC - TC    | FR - CC        | FR + HR - CC                            | FR + HR - CC - TC | FR + HR - CC                                      |
| Forward Reward Weight  | 1.0                  | 1.0            | 1.0                                     | 0.25              | 1.0                                               |
| Healthy Reward Weight  | 1.0                  | -              | 1.0                                     | 5.0               | 1.0                                               |
| Control Cost Weight    | 0.5                  | 0.1            | 0.001                                   | 0.1               | 0.001                                             |
| Contact Cost Weight    | 0.0005               | -              | -                                       | 5e-7              | -                                                 |
| Done Criterion         | Z out of (0.2, 1.0)  | Always False   | Angle out of (-0.2, 0.2) or Z < 0.7     | Z out of (1.0, 2.0) | Z out of (0.8, 2.0) or angle out of (-1.0, 1.0) |
| Max Episode Steps (enforced by the TimeLimit wrapper) | 1000 | 1000 | 1000                             | 1000              | 1000                                              |
| Reset Noise Scale      | 0.1                  | 0.1            | 0.005                                   | 0.01              | 0.005                                             |

Note (1): the contact cost is clipped to the range (-inf, 10.0).
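The contact-cost clipping from note (1) can be sketched as follows. This is a hedged reading of the behavior, not Gym's actual API; `clipped_contact_cost` is a hypothetical helper, and the default weight is Humanoid-v3's value from the table.

```python
def clipped_contact_cost(contact_forces, contact_cost_weight=5e-7):
    # Quadratic penalty on external contact forces, clipped to
    # (-inf, 10.0) so that huge forces cannot dominate the reward.
    cost = contact_cost_weight * sum(f * f for f in contact_forces)
    return min(cost, 10.0)
```

The clipping matters in practice: a single extreme contact force would otherwise produce a penalty orders of magnitude larger than the forward reward.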

Notes:

  1. All values above are based on gym==0.12.1 and mujoco_py==1.50.1.0.
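The done criteria in the table reduce to simple range checks on the torso state. As an example, Walker2d-v3's criterion (values taken from the table; the function names are hypothetical, not Gym's API) could be sketched as:

```python
def walker2d_is_healthy(z, angle):
    # Healthy while torso height and torso angle stay inside the
    # ranges listed in the table for Walker2d-v3.
    return 0.8 < z < 2.0 and -1.0 < angle < 1.0

def walker2d_done(z, angle):
    # The episode terminates once the walker leaves the healthy region
    # (unless the environment is run with terminate_when_unhealthy=False).
    return not walker2d_is_healthy(z, angle)
```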
