1. Project overview: the project loads a dual-arm robot in PyBullet, but only one arm is trained, to grasp the block on the table and move it to a randomly specified target block position (a red block serves as the marker).
The environment setup uses a handful of PyBullet functions.
PyBullet documentation: https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA/edit#
loadURDF('URDF_model/bmirobot_description/urdf/robotarm_description.urdf', flags=9)  # load the robot arm; flags=9 is claimed to cancel self-collision; see the official pybullet documentation for details
From the argument list in the docs: apart from the file path, everything else can be left at its default (basePosition defaults to the origin), but what flags=9 means is unclear (question 1), since the documentation gives no numeric values for the flags (see the sketch after the parameter table below). Also, orn_target is not undefined; it is defined as a random rotation about the Z axis (a random yaw):
ang_target = 3.14 * 0.5 + 3.1415925438 * random.random()  # random yaw, roughly in [pi/2, 3*pi/2)
orn_target = p.getQuaternionFromEuler([0, 0, ang_target])  # quaternion for a rotation about Z only
targetUid = p.loadURDF("URDF_model/cube_small_target_push.urdf",
                       [xpos_target, ypos_target, zpos_target],
                       orn_target, useFixedBase=1)  # the target cube is fixed in place
For reference, the loadURDF argument table from the PyBullet docs:

| required/optional | parameter | type | description |
| --- | --- | --- | --- |
| required | fileName | string | a relative or absolute path to the URDF file on the file system of the physics server. |
| optional | basePosition | vec3 | create the base of the object at the specified position in world space coordinates [X,Y,Z]. Note that this position is of the URDF link position. If the inertial frame is non-zero, this is different from the center of mass position. Use resetBasePositionAndOrientation to set the center of mass location/orientation. |
| optional | baseOrientation | vec4 | create the base of the object at the specified orientation as world space quaternion [X,Y,Z,W]. See note in basePosition. |
| optional | useMaximalCoordinates | int | By default, the joints in the URDF file are created using the reduced coordinate method: the joints are simulated using the Featherstone Articulated Body Algorithm (ABA, btMultiBody in Bullet 2.x). The useMaximalCoordinates option will create a 6 degree of freedom rigid body for each link, and constraints between those rigid bodies are used to model joints. Enabling useMaximalCoordinates for bodies that don't have links/joints can improve performance a lot, such as objects picked up by a robot. |
| optional | useFixedBase | int | force the base of the loaded object to be static |
| optional | flags | int | the flags listed below can be combined using a bitwise OR. |
| optional | globalScaling | float | globalScaling will apply a scale factor to the URDF model. |
| optional | physicsClientId | int | if you are connected to multiple servers, you can pick one. |

Flags (combined with bitwise OR):

- URDF_MERGE_FIXED_LINKS: this will remove fixed links from the URDF file and merge the resulting links. This is good for performance, since various algorithms (articulated body algorithm, forward kinematics etc) have linear complexity in the number of joints, including fixed joints.
- URDF_USE_INERTIA_FROM_FILE: by default, Bullet recomputes the inertia tensor based on mass and volume of the collision shape. If you can provide a more accurate inertia tensor, use this flag.
- URDF_USE_SELF_COLLISION: by default, Bullet disables self-collision. This flag lets you enable it. You can customize the self-collision behavior using the following flags:
- URDF_USE_SELF_COLLISION_INCLUDE_PARENT: will enable collision between child and parent; it is disabled by default. Needs to be used together with the URDF_USE_SELF_COLLISION flag.
- URDF_USE_SELF_COLLISION_EXCLUDE_ALL_PARENTS: will discard self-collisions between a child link and any of its ancestors (parents, parents of parents, up to the base). Needs to be used together with URDF_USE_SELF_COLLISION.
- URDF_USE_IMPLICIT_CYLINDER: will use a smooth implicit cylinder. By default, Bullet will tesselate the cylinder into a convex hull.
- URDF_ENABLE_SLEEPING: will allow to disable simulation after a body hasn't moved for a while. Interaction with active bodies will re-enable simulation.
- URDF_INITIALIZE_SAT_FEATURES: will create triangle meshes for convex shapes. This will improve visualization and also allow usage of the separating axis test (SAT) instead of GJK/EPA. Requires enableSAT using setPhysicsEngineParameter.
- URDF_USE_MATERIAL_COLORS_FROM_MTL: will use the RGB color from the Wavefront OBJ file, instead of from the URDF file.
- URDF_ENABLE_CACHED_GRAPHICS_SHAPES: will cache and re-use graphics shapes. It will improve loading performance for files with similar graphics assets.
- URDF_MAINTAIN_LINK_ORDER: will try to maintain the link order from the URDF file. Say in the URDF file, the order is: ParentLink0, ChildLink1 (attached to ParentLink0), ChildLink2 (attached to ParentLink0). Without this flag, the order could be P0, C2, C1.
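On question 1: the quickstart guide only names these flags; their numeric values are exposed as integer constants on the pybullet module, so one way to decode a magic number such as flags=9 is to print the constants and then compose the flags explicitly by name. A minimal sketch (the constant values can differ between pybullet versions, so check them locally before concluding what 9 stands for):

import pybullet as p

# Print the numeric values of a few URDF flag constants to see how an
# integer such as 9 decomposes on this particular pybullet version.
for name in ("URDF_MERGE_FIXED_LINKS", "URDF_USE_INERTIA_FROM_FILE", "URDF_USE_SELF_COLLISION"):
    print(name, getattr(p, name))

# Composing flags from named constants is clearer than a magic number;
# the combination below is only illustrative, pick the behaviors you actually want.
p.connect(p.DIRECT)
robot = p.loadURDF('URDF_model/bmirobot_description/urdf/robotarm_description.urdf',
                   flags=p.URDF_USE_INERTIA_FROM_FILE | p.URDF_USE_SELF_COLLISION)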
2. The author wrote the environment code by imitating the gym interface (it is a bit messy, but roughly understandable):
Each time an action is fed in (defined in the code as an xyz displacement plus the gripper opening), env.step returns a 27-dimensional observation obs: gripper position (3), gripper orientation (3), gripper linear velocity (3), gripper angular velocity (3), block position (3), block orientation (3), block position relative to the gripper (3), block linear velocity (3), and block angular velocity (3).
obs = [
    end_pos.flatten(),  # flatten() collapses to 1-D, e.g. a = np.mat([[1,2],[2,3],[3,4]]); a.flatten() -> matrix([[1, 2, 2, 3, 3, 4]])
    # gripperPos.flatten(),
    gripperOrn.flatten(),
    gripper_linear_Velocity.flatten(),
    gripper_angular_Velocity.flatten(),
    blockPos.flatten(),
    blockOrn.flatten(),
    relative_pos.flatten(),
    # relative_orn.flatten(),
    # target_pos.flatten(),
    # target_relative_pos.flatten()
    block_linear_velocity.flatten(),
    block_angular_velocity.flatten(),
    # block_relative_linear_velocity.flatten()
]
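A quick sanity check that these nine 3-vectors really give 27 dimensions, using dummy values in place of the actual sensor readings (a sketch, not the project's code):

import numpy as np

# nine placeholder 3-vectors standing in for end_pos, gripperOrn, ..., block_angular_velocity
parts = [np.zeros(3) for _ in range(9)]
obs = np.concatenate([part.flatten() for part in parts])
print(obs.shape)  # -> (27,)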
One thing is quite puzzling (question 2):
blockPos, blockOrn_temp = p.getBasePositionAndOrientation(self.blockUid)
blockPos = np.array(blockPos)  # block position
blockOrn = p.getEulerFromQuaternion(gripperOrn_temp)
blockOrn = np.array(blockOrn)  # puzzling: why substitute the gripper's orientation for the block's orientation?
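If the block's own orientation was the intent, the line would presumably read as below (my guess, not the author's code):

blockOrn = p.getEulerFromQuaternion(blockOrn_temp)  # use the block's quaternion rather than the gripper's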
It also returns achieved_goal (the end-effector position or the object position, the same values as in the obs list) and target_pos, the position of the red target region (which stays fixed within an episode).
Finally the author packs these three arrays into a dictionary, again named observation/obs (re-using the same variable name for different things, which is not great).
# in the train function, the initial observation is obtained via env.reset()
def reset(self):
    .....
    .....
    obs = self._get_obs()
    self._observation = obs
    return self._observation

def _get_obs(self):
    ....
    ....
    return {
        'observation': obs.copy(),
        'achieved_goal': achieved_goal.copy(),
        'desired_goal': target_pos.flatten(),
    }
3. The reward function, defined in the environment class's step method
It is very simple: if the distance between the achieved goal (gripper/block position) and the target position is within a set threshold, the reward is 0; otherwise it is -1 (then how does this make the agent learn to actually grasp the object? question 3):
reward = self.compute_reward(obs['achieved_goal'], self.goal, info)

def compute_reward(self, achieved_goal, goal, info):
    # Compute distance between goal and the achieved goal.
    d = goal_distance(achieved_goal, goal)
    if self.reward_type == 'sparse':
        return -(d > self.distance_threshold).astype(np.float32)  # https://zhuanlan.zhihu.com/p/159397435
    else:
        return -d
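To make the sparse branch concrete, here is a small worked sketch assuming goal_distance is a plain Euclidean norm and distance_threshold is 0.05 (both assumptions mirroring the common gym robotics setup, not necessarily the author's exact values):

import numpy as np

def goal_distance(goal_a, goal_b):
    # assumed: Euclidean distance between achieved and desired goal
    return np.linalg.norm(np.asarray(goal_a) - np.asarray(goal_b), axis=-1)

distance_threshold = 0.05  # assumed threshold
achieved = np.array([0.50, 0.10, 0.03])
desired = np.array([0.52, 0.10, 0.03])
d = goal_distance(achieved, desired)                   # 0.02
reward = -(d > distance_threshold).astype(np.float32)  # 0.0 (within threshold); farther away gives -1.0
print(d, reward)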
4. Network updates: the author follows the DDPG algorithm.
The critic network is updated first:
critic_loss = (target_q_value - real_q_value).pow(2).mean()
# where:
target_q_value = r_tensor + self.args.gamma * q_next_value
and within that, q_next_value comes from the target critic network:
q_next_value = self.critic_target_network(inputs_next_norm_tensor, actions_next)
# note: inputs_next_norm_tensor is the normalized gripper/block observation concatenated with the goal state, [obs_next_norm, g_next_norm]
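For completeness, the DDPG critic step usually wraps the target computation in torch.no_grad() (so the bootstrapped value does not backpropagate into the target networks) and then takes an optimizer step on the MSE above. A standalone sketch with toy networks and random data, just to show that structure; the dimensions, architecture and hyperparameters are placeholders, not the author's:

import torch
import torch.nn as nn

obs_goal_dim, act_dim, batch = 30, 4, 8   # placeholder sizes (e.g. 27-dim obs + 3-dim goal)
critic = nn.Sequential(nn.Linear(obs_goal_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_target = nn.Sequential(nn.Linear(obs_goal_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_target.load_state_dict(critic.state_dict())
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.98

inputs_norm = torch.randn(batch, obs_goal_dim)       # [obs_norm, g_norm]
inputs_next_norm = torch.randn(batch, obs_goal_dim)  # [obs_next_norm, g_next_norm]
actions = torch.randn(batch, act_dim)
actions_next = torch.randn(batch, act_dim)
r = torch.randn(batch, 1)

with torch.no_grad():  # the bootstrapped target must not receive gradients
    q_next_value = critic_target(torch.cat([inputs_next_norm, actions_next], dim=1))
    target_q_value = r + gamma * q_next_value

real_q_value = critic(torch.cat([inputs_norm, actions], dim=1))
critic_loss = (target_q_value - real_q_value).pow(2).mean()

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()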
Then the actor network is updated:
# the actor loss
actions_real = self.actor_network(inputs_norm_tensor)
actor_loss = -self.critic_network(inputs_norm_tensor, actions_real).mean()
actor_loss += self.args.action_l2 * (actions_real / self.env_params['action_max']).pow(2).mean()
The first (Q-value) term could equivalently be written in one line (the action_l2 regularization penalty would still need to be added on top):
actor_loss = -torch.mean(self.critic_network(inputs_norm_tensor, self.actor_network(inputs_norm_tensor)))
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
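DDPG also keeps the target networks (critic_target_network, and usually an actor target as well) close to the online ones with a soft/polyak update each iteration; that step is not shown in this note, so here is a generic sketch, with the averaging coefficient (often called polyak or tau) as an assumed hyperparameter:

def soft_update(target_net, source_net, polyak=0.95):
    # target <- polyak * target + (1 - polyak) * source, applied parameter-wise
    for target_param, param in zip(target_net.parameters(), source_net.parameters()):
        target_param.data.copy_(polyak * target_param.data + (1.0 - polyak) * param.data)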
So there are still 3 questions I have not figured out; recording them here in the hope of understanding them later.
What flags=9 means is unclear; I could not find it in the docs (question 1).
Why is the gripper's orientation used in place of the block's orientation? (question 2) (Author's reply: it's probably just a plain typo... I didn't use it anyway... hahaha)
The reward design: the reward is 0 when the distance between the gripper (achieved goal) and the target position is within the threshold and -1 otherwise; how does that drive the agent to actually grasp the object? (question 3)