SUMMARY |
---|
Estimated time to completion: 2 hours
In this unit, you will learn how to define the learning task of your robot by creating a Task Environment for a Moving Cube with a single disk in the roll axis. You will also use the Qlearn algorithm to train the RoboCube.
END OF SUMMARY |
---|
The Task Environment is the class in charge of defining the Reinforcement Learning task. It is responsible for converting the actions selected by the learning algorithm into robot commands, building the observation vector from the sensor data, computing the reward, deciding when an episode is done, and setting the initial conditions of each episode.
Let's define the class that will specify the task the robot must learn. We are going to set up everything needed to induce the robot to learn how to walk forwards through reinforcement learning.
First of all, let’s add the new script to our package:
Execute in WebShell #1
roscd my_moving_cube_pkg/scripts/
touch my_one_disk_walk.py; chmod +x my_one_disk_walk.py
Use the IDE to put the Task Environment template code inside your my_one_disk_walk.py. You can find the template in the openai_ros/templates directory.
Here you have the template for your convenience:
import numpy
from gym import spaces
import my_robot_env
from gym.envs.registration import register
import rospy
# The path is __init__.py of openai_ros, where we import the MovingCubeOneDiskWalkEnv directly
timestep_limit_per_episode = 1000 # Can be any Value
register(
id='MyTrainingEnv-v0',
entry_point='template_my_training_env:MovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
class MyTrainingEnv(my_robot_env.MyRobotEnv):
def __init__(self):
# Only variable needed to be set here
number_actions = rospy.get_param('/my_robot_namespace/n_actions')
self.action_space = spaces.Discrete(number_actions)
# This is the most common case of Box observation type
high = numpy.array([
obs1_max_value,
obs2_max_value,
...
obsN_max_value
])
self.observation_space = spaces.Box(-high, high)
# Variables that we retrieve through the param server, loaded when the training launch file is launched.
# Here we will add any init functions prior to starting the MyRobotEnv
super(MyTrainingEnv, self).__init__()
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
# TODO
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
# TODO
def _set_action(self, action):
"""
Move the robot based on the action variable given
"""
# TODO: Move robot
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations.
To know which variables we have access to, we need to read the
MyRobotEnv API DOCS
:return: observations
"""
# TODO
return observations
def _is_done(self, observations):
"""
Decide if episode is done based on the observations
"""
# TODO
return done
def _compute_reward(self, observations, done):
"""
Return the reward based on the observations given
"""
# TODO
return reward
# Internal TaskEnv Methods
Throughout this unit, we are going to fill in all the parts of the template according to the robot and the task we want it to learn.
In this case, we want it to inherit from the RobotEnv we created in the previous unit, so we have to import the MyCubeSingleDiskEnv from the python module my_cube_single_disk_env.py.
So, in the template code, change the following:
import my_robot_env
by the following:
import my_cube_single_disk_env
In order to be able to use a gym environment, we need to register it. This is done in the following sentence:
register(
id='MyTrainingEnv-v0',
entry_point='my_robot_env:MovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
Let's fill that call with the proper values for the Cubli:
The environment id will be MyMovingCubeOneDiskWalkEnv-v0, and our Task Environment class will also be called MyMovingCubeOneDiskWalkEnv.
The timestep limit per episode will be set to 1000.
The python module from which the class is imported is exactly the one we are defining it in, so my_one_disk_walk. Change the template code to the following:
timestep_limit_per_episode = 1000
register(
id='MyMovingCubeOneDiskWalkEnv-v0',
entry_point='my_one_disk_walk:MyMovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
We decided to call our Task Environment MyMovingCubeOneDiskWalkEnv, so let's create it. Change the template code to the following:
class MyMovingCubeOneDiskWalkEnv(my_cube_single_disk_env.MyCubeSingleDiskEnv):
def __init__(self):
Your Task Environment class must always inherit from the Robot Environment class that you created in the previous unit. That is, it must inherit from the environment of the robot that you are going to use in the RL task.
Here we are telling the Task Environment to inherit from the class MyCubeSingleDiskEnv, from the python module that we created in the previous unit called my_cube_single_disk_env.py.
Setting the action_space
The action_space indicates the number of actions that the robot can take in this learning task. Change the corresponding code in the __init__ function to the following:
# Only variable needed to be set here
number_actions = rospy.get_param('/moving_cube/n_actions')
self.action_space = spaces.Discrete(number_actions)
As you can see, we are retrieving the number of actions from the param server of ROS. We’ll see later how to upload those parameters to the param server.
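As a minimal sketch of that round trip (a hypothetical stand-alone check; in the real setup, the yaml file created later in this unit is what loads the value onto the param server):
import rospy

rospy.init_node('n_actions_check')
rospy.set_param('/moving_cube/n_actions', 5)               # this is what loading the yaml will do
number_actions = rospy.get_param('/moving_cube/n_actions')
print(number_actions)                                       # -> 5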
Setting the observation_space
We now have to create the observation_space. This space defines the range of valid values for each element of the observation vector.
We use a Box type because each observation is a float with its own range. For that, we use the numpy library and the spaces module from OpenAI Gym.
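As a quick generic sketch of how a Box space behaves (toy bounds, not the Cubli ones):
import numpy
from gym import spaces

# A 3-element observation whose values live in [-bound, +bound]
high = numpy.array([10.0, 2.0, 3.14])
observation_space = spaces.Box(-high, high)

print(observation_space.low)       # [-10. -2. -3.14]
print(observation_space.high)      # [ 10.  2.  3.14]
print(observation_space.sample())  # a random valid observation vector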
For the Cubli, it looks like this:
# This is the most common case of Box observation type
high = numpy.array([
self.roll_speed_fixed_value,
self.max_distance,
max_roll,
self.max_pitch_angle,
self.max_y_linear_speed,
self.max_y_linear_speed,
])
self.observation_space = spaces.Box(-high, high)
Most of the values of that array are retrieved from variables loaded through the param server. If you add the parameter loading, the code for the observation_space ends up like this (change it for this in the template):
# Actions and Observations
self.roll_speed_fixed_value = rospy.get_param('/moving_cube/roll_speed_fixed_value')
self.roll_speed_increment_value = rospy.get_param('/moving_cube/roll_speed_increment_value')
self.max_distance = rospy.get_param('/moving_cube/max_distance')
max_roll = 2 * math.pi
self.max_pitch_angle = rospy.get_param('/moving_cube/max_pitch_angle')
self.max_y_linear_speed = rospy.get_param('/moving_cube/max_y_linear_speed')
self.max_yaw_angle = rospy.get_param('/moving_cube/max_yaw_angle')
high = numpy.array([
self.roll_speed_fixed_value,
self.max_distance,
max_roll,
self.max_pitch_angle,
self.max_y_linear_speed,
self.max_y_linear_speed,
])
self.observation_space = spaces.Box(-high, high)
The observation_space indicates that the _get_obs function will return an array of six different values that define the robot's state. More on that later.
Initializing the Robot Environment
And finally, we need to call the init method of the parent class in order to initialize the Robot Environment class. Change the template to the following line:
super(MyMovingCubeOneDiskWalkEnv, self).__init__()
We now have to fill in the following functions. They are called by the RobotGazeboEnv to execute different aspects of the simulation, and by the learning algorithm to get the observations and reward and to apply the actions.
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
# TODO
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
# TODO
def _set_action(self, action):
"""
Move the robot based on the action variable given
"""
# TODO: Move robot
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations
To know which Variables we have access to, we need to read the
MyRobotEnv API DOCS
:return: observations
"""
# TODO
return observations
def _is_done(self, observations):
"""
Decide if episode is done based on the observations
"""
# TODO
return done
def _compute_reward(self, observations, done):
"""
Return the reward based on the observations given
"""
# TODO
return reward
We first declare the function that sets the initial state in which the robot will start every episode. For the Cubli, what matters is the speed of the internal joint. Replace the code in the template with the following:
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
self.move_joints(self.init_roll_vel)
return True
We allow that initial speed to be set as a parameter. Hence, add the following line to the __init__ function to read the init_roll_vel value from the ROS param server:
self.init_roll_vel = rospy.get_param("/moving_cube/init_roll_vel")
This method is called each time we reset an episode, in order to reset the TaskEnv variables. You must include here every variable of the task that needs to be initialized at the start of an episode. Replace the template code with the following:
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
self.total_distance_moved = 0.0
self.current_y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
self.roll_turn_speed = self.init_roll_vel
Here you decide how the action selected by the RL algorithm for the next step (a number between zero and the number of actions minus one) is transformed into actual robot movements. Once the robot commands needed to execute the action have been computed, the action is executed at the end of this function.
In the case of the Cubli, the robot can take 5 actions. The RL algorithm will select one of them by picking a number from 0 to 4 and will then call this function, which converts that number into one of the following movements: spin the roll disk forwards at a fixed speed, spin it backwards at a fixed speed, stop it, increment its current speed, or decrement its current speed.
With those actions, the Cubli will be able to ramp the speed up and down gradually as well as change it in bursts. This is vital for creating the momentum differences that make the cube move.
The speed value is then clamped so that it never exceeds the maximum, and sent to the cube.
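The clamping relies on numpy.clip, which simply saturates a value between a lower and an upper bound:
import numpy

# numpy.clip(value, lower, upper) returns the value saturated at the given bounds
print(numpy.clip(120.0, -100.0, 100.0))   # 100.0
print(numpy.clip(-35.0, -100.0, 100.0))   # -35.0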
Change the code of the template with the following code:
def _set_action(self, action):
# We convert the actions to speed movements to send to the parent class CubeSingleDiskEnv
if action == 0:# Move Speed Wheel Forwards
self.roll_turn_speed = self.roll_speed_fixed_value
elif action == 1:# Move Speed Wheel Backwards
self.roll_turn_speed = -self.roll_speed_fixed_value
elif action == 2:# Stop Speed Wheel
self.roll_turn_speed = 0.0
elif action == 3:# Increment Speed
self.roll_turn_speed += self.roll_speed_increment_value
elif action == 4:# Decrement Speed
self.roll_turn_speed -= self.roll_speed_increment_value
# We clamp Values to maximum
rospy.logdebug("roll_turn_speed before clamp=="+str(self.roll_turn_speed))
self.roll_turn_speed = numpy.clip(self.roll_turn_speed,
-self.roll_speed_fixed_value,
self.roll_speed_fixed_value)
rospy.logdebug("roll_turn_speed after clamp==" + str(self.roll_turn_speed))
# We tell the OneDiskCube to spin the RollDisk at the selected speed
self.move_joints(self.roll_turn_speed)
We also need to add the following two lines to the __init__ method in order to load the params:
# Variables that we retrieve through the param server, loaded when launch training launch.
self.roll_speed_fixed_value = rospy.get_param('/moving_cube/roll_speed_fixed_value')
self.roll_speed_increment_value = rospy.get_param('/moving_cube/roll_speed_increment_value')
Here we retrieve the sensor data through our parent class CubeSingleDiskEnv, and then process it to return the observations that we see fit.
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations
To know which Variables we have access to, we need to read the
MyCubeSingleDiskEnv API DOCS
:return:
"""
# We get the orientation of the cube in RPY
roll, pitch, yaw = self.get_orientation_euler()
# We get the distance from the origin
#distance = self.get_distance_from_start_point(self.start_point)
y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
# We get the current speed of the Roll Disk
current_disk_roll_vel = self.get_roll_velocity()
# We get the linear speed in the y axis
y_linear_speed = self.get_y_linear_speed()
cube_observations = [
round(current_disk_roll_vel, 0),
#round(distance, 1),
round(y_distance, 1),
round(roll, 1),
round(pitch, 1),
round(y_linear_speed,1),
round(yaw, 1),
]
return cube_observations
We use internal functions, defined later, that return the robot's sensory data already processed for us. In this case, we keep only one decimal of the sensor data because that is enough for our purposes, and it makes the state definition for the Qlearn algorithm (or whichever algorithm we use) simpler and faster.
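As a tiny hypothetical illustration of why the rounding matters: nearby sensor readings collapse into the same discrete state, which keeps the number of states (and, later, the size of the Q-table) small:
# Hypothetical raw readings from two consecutive steps
obs_step_1 = [12.3456, 0.1234, 0.0567]
obs_step_2 = [12.3011, 0.1190, 0.0601]

def simplify(obs):
    # same rounding policy as in _get_obs(): 0 decimals for the disk speed, 1 for the rest
    return [round(obs[0], 0), round(obs[1], 1), round(obs[2], 1)]

print(simplify(obs_step_1))  # [12.0, 0.1, 0.1]
print(simplify(obs_step_2))  # [12.0, 0.1, 0.1] -> both steps map to the same state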
We also have to load some parameters in the __init__ function:
self.start_point = Point()
self.start_point.x = rospy.get_param("/moving_cube/init_cube_pose/x")
self.start_point.y = rospy.get_param("/moving_cube/init_cube_pose/y")
self.start_point.z = rospy.get_param("/moving_cube/init_cube_pose/z")
And an import:
from geometry_msgs.msg import Point
Based on the status of the robot after a step, we must decide if the learning episode is over.
In this case, we consider the episode done when the pitch angle surpasses a certain threshold (defined as a parameter). We also consider the episode done when the yaw exceeds the maximum angle (also a parameter).
Change the template with the following code:
def _is_done(self, observations):
pitch_angle = observations[3]
yaw_angle = observations[5]
if abs(pitch_angle) > self.max_pitch_angle:
rospy.logerr("WRONG Cube Pitch Orientation==>" + str(pitch_angle))
done = True
else:
rospy.logdebug("Cube Pitch Orientation Ok==>" + str(pitch_angle))
if abs(yaw_angle) > self.max_yaw_angle:
rospy.logerr("WRONG Cube Yaw Orientation==>" + str(yaw_angle))
done = True
else:
rospy.logdebug("Cube Yaw Orientation Ok==>" + str(yaw_angle))
done = False
return done
Also, modify the __init__ function to load these parameters:
self.max_pitch_angle = rospy.get_param('/moving_cube/max_pitch_angle')
self.max_yaw_angle = rospy.get_param('/moving_cube/max_yaw_angle')
This is probably one of the most important methods. Here you condition how the robot will learn, by rewarding actions with good results and punishing bad practices. This has an enormous effect on the behaviours that emerge while the robot tries to solve the task at hand.
For the Cubli walking problem, we compute the total reward as the sum of three subrewards: a reward proportional to the distance advanced along the y axis since the last step, a reward proportional to the current linear speed along the y axis, and a negative reward proportional to the yaw deviation from moving in a straight line.
Modify the template to include the following code:
def _compute_reward(self, observations, done):
if not done:
y_distance_now = observations[1]
delta_distance = y_distance_now - self.current_y_distance
rospy.logdebug("y_distance_now=" + str(y_distance_now)+", current_y_distance=" + str(self.current_y_distance))
rospy.logdebug("delta_distance=" + str(delta_distance))
reward_distance = delta_distance * self.move_distance_reward_weight
self.current_y_distance = y_distance_now
y_linear_speed = observations[4]
rospy.logdebug("y_linear_speed=" + str(y_linear_speed))
reward_y_axis_speed = y_linear_speed * self.y_linear_speed_reward_weight
# Negative Reward for yaw different from zero.
yaw_angle = observations[5]
rospy.logdebug("yaw_angle=" + str(yaw_angle))
# Worst yaw is 90 and 270 degrees, best 0 and 180. We use sin function for giving reward.
sin_yaw_angle = math.sin(yaw_angle)
rospy.logdebug("sin_yaw_angle=" + str(sin_yaw_angle))
reward_y_axis_angle = -1 * abs(sin_yaw_angle) * self.y_axis_angle_reward_weight
# We are not interested in decimals of the reward, they don't give any advantage.
reward = round(reward_distance, 0) + round(reward_y_axis_speed, 0) + round(reward_y_axis_angle, 0)
rospy.logdebug("reward_distance=" + str(reward_distance))
rospy.logdebug("reward_y_axis_speed=" + str(reward_y_axis_speed))
rospy.logdebug("reward_y_axis_angle=" + str(reward_y_axis_angle))
rospy.logdebug("reward=" + str(reward))
else:
reward = -self.end_episode_points
return reward
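As a rough numeric illustration of this reward (hypothetical values, using the weights we will put into the yaml file later in this unit):
import math

# Hypothetical step: the cube advanced 0.1 m along y since the last step,
# moves at 0.3 m/s along y, and has drifted 0.1 rad in yaw.
move_distance_reward_weight = 1000.0
y_linear_speed_reward_weight = 1000.0
y_axis_angle_reward_weight = 1000.0

reward_distance = 0.1 * move_distance_reward_weight                          # 100.0
reward_y_axis_speed = 0.3 * y_linear_speed_reward_weight                     # 300.0
reward_y_axis_angle = -1 * abs(math.sin(0.1)) * y_axis_angle_reward_weight   # about -99.8

reward = round(reward_distance, 0) + round(reward_y_axis_speed, 0) + round(reward_y_axis_angle, 0)
print(reward)  # 300.0 (100 + 300 - 100)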
Also modify the __init__ method to load some parameters from the ROS param server:
self.move_distance_reward_weight = rospy.get_param("/moving_cube/move_distance_reward_weight")
self.y_linear_speed_reward_weight = rospy.get_param("/moving_cube/y_linear_speed_reward_weight")
self.y_axis_angle_reward_weight = rospy.get_param("/moving_cube/y_axis_angle_reward_weight")
self.end_episode_points = rospy.get_param("/moving_cube/end_episode_points")
Finally, add the following imports at the beginning of the Python file:
import numpy
import math
The private functions are methods of the Task Environment that are needed for the internal workings of the class. They are called by the virtual methods and are usually created to simplify the code. They depend heavily on the task at hand, so there is no general rule for them.
As you can see in the code below, these are the methods that we called from the _get_obs() function and the other virtual methods.
Add the following code at the end of the class:
# Internal TaskEnv Methods
def get_y_dir_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry. In this case the increase or decrease in y.
:param start_point:
:return:
"""
y_dist_dir = self.odom.pose.pose.position.y - start_point.y
return y_dist_dir
def get_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry
:param start_point:
:return:
"""
distance = self.get_distance_from_point(start_point,
self.odom.pose.pose.position)
return distance
def get_distance_from_point(self, pstart, p_end):
"""
Given two Points, get the distance between them
:param p_end:
:return:
"""
a = numpy.array((pstart.x, pstart.y, pstart.z))
b = numpy.array((p_end.x, p_end.y, p_end.z))
distance = numpy.linalg.norm(a - b)
return distance
def get_orientation_euler(self):
# We convert from quaternions to euler
orientation_list = [self.odom.pose.pose.orientation.x,
self.odom.pose.pose.orientation.y,
self.odom.pose.pose.orientation.z,
self.odom.pose.pose.orientation.w]
roll, pitch, yaw = euler_from_quaternion(orientation_list)
return roll, pitch, yaw
def get_roll_velocity(self):
# We get the current joint roll velocity
roll_vel = self.joints.velocity[0]
return roll_vel
def get_y_linear_speed(self):
# We get the current linear speed in the y axis from the odometry
y_linear_speed = self.odom.twist.twist.linear.y
return y_linear_speed
As you can see, we don't even need a get_odom method, for example, because we read the data directly from the odom variable provided by the parent class.
Finally, add the following import at the beginning of the Python file:
from tf.transformations import euler_from_quaternion
You should have something similar to this as the final TaskEnv for the Cubli robot:
my_one_disk_walk.py |
---|
#! /usr/bin/env python
import rospy
import numpy
import math
from gym import spaces
import my_cube_single_disk_env
from gym.envs.registration import register
from geometry_msgs.msg import Point
from tf.transformations import euler_from_quaternion
timestep_limit_per_episode = 1000 # Can be any Value
register(
id='MyMovingCubeOneDiskWalkEnv-v0',
entry_point='my_one_disk_walk:MyMovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
class MyMovingCubeOneDiskWalkEnv(my_cube_single_disk_env.MyCubeSingleDiskEnv):
def __init__(self):
# Only variable needed to be set here
number_actions = rospy.get_param('/moving_cube/n_actions')
self.action_space = spaces.Discrete(number_actions)
#number_observations = rospy.get_param('/moving_cube/n_observations')
"""
We set the Observation space for the 6 observations
cube_observations = [
round(current_disk_roll_vel, 0),
round(y_distance, 1),
round(roll, 1),
round(pitch, 1),
round(y_linear_speed,1),
round(yaw, 1),
]
"""
# Actions and Observations
self.roll_speed_fixed_value = rospy.get_param('/moving_cube/roll_speed_fixed_value')
self.roll_speed_increment_value = rospy.get_param('/moving_cube/roll_speed_increment_value')
self.max_distance = rospy.get_param('/moving_cube/max_distance')
max_roll = 2 * math.pi
self.max_pitch_angle = rospy.get_param('/moving_cube/max_pitch_angle')
self.max_y_linear_speed = rospy.get_param('/moving_cube/max_y_linear_speed')
self.max_yaw_angle = rospy.get_param('/moving_cube/max_yaw_angle')
high = numpy.array([
self.roll_speed_fixed_value,
self.max_distance,
max_roll,
self.max_pitch_angle,
self.max_y_linear_speed,
self.max_y_linear_speed,
])
self.observation_space = spaces.Box(-high, high)
rospy.logwarn("ACTION SPACES TYPE===>"+str(self.action_space))
rospy.logwarn("OBSERVATION SPACES TYPE===>"+str(self.observation_space))
# Variables that we retrieve through the param server, loaded when the training launch file is launched.
self.init_roll_vel = rospy.get_param("/moving_cube/init_roll_vel")
# Get Observations
self.start_point = Point()
self.start_point.x = rospy.get_param("/moving_cube/init_cube_pose/x")
self.start_point.y = rospy.get_param("/moving_cube/init_cube_pose/y")
self.start_point.z = rospy.get_param("/moving_cube/init_cube_pose/z")
# Rewards
self.move_distance_reward_weight = rospy.get_param("/moving_cube/move_distance_reward_weight")
self.y_linear_speed_reward_weight = rospy.get_param("/moving_cube/y_linear_speed_reward_weight")
self.y_axis_angle_reward_weight = rospy.get_param("/moving_cube/y_axis_angle_reward_weight")
self.end_episode_points = rospy.get_param("/moving_cube/end_episode_points")
self.cumulated_steps = 0.0
# Here we will add any init functions prior to starting the MyRobotEnv
super(MyMovingCubeOneDiskWalkEnv, self).__init__()
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
self.move_joints(self.init_roll_vel)
return True
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
self.total_distance_moved = 0.0
self.current_y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
self.roll_turn_speed = rospy.get_param('/moving_cube/init_roll_vel')
# For Info Purposes
self.cumulated_reward = 0.0
#self.cumulated_steps = 0.0
def _set_action(self, action):
# We convert the actions to speed movements to send to the parent class CubeSingleDiskEnv
if action == 0:# Move Speed Wheel Forwards
self.roll_turn_speed = self.roll_speed_fixed_value
elif action == 1:# Move Speed Wheel Backwards
self.roll_turn_speed = -1*self.roll_speed_fixed_value
elif action == 2:# Stop Speed Wheel
self.roll_turn_speed = 0.0
elif action == 3:# Increment Speed
self.roll_turn_speed += self.roll_speed_increment_value
elif action == 4:# Decrement Speed
self.roll_turn_speed -= self.roll_speed_increment_value
# We clamp Values to maximum
rospy.logdebug("roll_turn_speed before clamp=="+str(self.roll_turn_speed))
self.roll_turn_speed = numpy.clip(self.roll_turn_speed,
-1*self.roll_speed_fixed_value,
self.roll_speed_fixed_value)
rospy.logdebug("roll_turn_speed after clamp==" + str(self.roll_turn_speed))
# We tell the OneDiskCube to spin the RollDisk at the selected speed
self.move_joints(self.roll_turn_speed)
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations
To know which variables we have access to, we need to read the
MyCubeSingleDiskEnv API DOCS
:return:
"""
# We get the orientation of the cube in RPY
roll, pitch, yaw = self.get_orientation_euler()
# We get the distance from the origin
#distance = self.get_distance_from_start_point(self.start_point)
y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
# We get the current speed of the Roll Disk
current_disk_roll_vel = self.get_roll_velocity()
# We get the linear speed in the y axis
y_linear_speed = self.get_y_linear_speed()
cube_observations = [
round(current_disk_roll_vel, 0),
round(y_distance, 1),
round(roll, 1),
round(pitch, 1),
round(y_linear_speed,1),
round(yaw, 1)
]
rospy.logdebug("Observations==>"+str(cube_observations))
return cube_observations
def _is_done(self, observations):
pitch_angle = observations[3]
yaw_angle = observations[5]
if abs(pitch_angle) > self.max_pitch_angle:
rospy.logerr("WRONG Cube Pitch Orientation==>" + str(pitch_angle))
done = True
else:
rospy.logdebug("Cube Pitch Orientation Ok==>" + str(pitch_angle))
if abs(yaw_angle) > self.max_yaw_angle:
rospy.logerr("WRONG Cube Yaw Orientation==>" + str(yaw_angle))
done = True
else:
rospy.logdebug("Cube Yaw Orientation Ok==>" + str(yaw_angle))
done = False
return done
def _compute_reward(self, observations, done):
if not done:
y_distance_now = observations[1]
delta_distance = y_distance_now - self.current_y_distance
rospy.logdebug("y_distance_now=" + str(y_distance_now)+", current_y_distance=" + str(self.current_y_distance))
rospy.logdebug("delta_distance=" + str(delta_distance))
reward_distance = delta_distance * self.move_distance_reward_weight
self.current_y_distance = y_distance_now
y_linear_speed = observations[4]
rospy.logdebug("y_linear_speed=" + str(y_linear_speed))
reward_y_axis_speed = y_linear_speed * self.y_linear_speed_reward_weight
# Negative Reward for yaw different from zero.
yaw_angle = observations[5]
rospy.logdebug("yaw_angle=" + str(yaw_angle))
# Worst yaw is 90 and 270 degrees, best 0 and 180. We use sin function for giving reward.
sin_yaw_angle = math.sin(yaw_angle)
rospy.logdebug("sin_yaw_angle=" + str(sin_yaw_angle))
reward_y_axis_angle = -1 * abs(sin_yaw_angle) * self.y_axis_angle_reward_weight
# We are not interested in decimals of the reward, they don't give any advantage.
reward = round(reward_distance, 0) + round(reward_y_axis_speed, 0) + round(reward_y_axis_angle, 0)
rospy.logdebug("reward_distance=" + str(reward_distance))
rospy.logdebug("reward_y_axis_speed=" + str(reward_y_axis_speed))
rospy.logdebug("reward_y_axis_angle=" + str(reward_y_axis_angle))
rospy.logdebug("reward=" + str(reward))
else:
reward = -1*self.end_episode_points
self.cumulated_reward += reward
rospy.logdebug("Cumulated_reward=" + str(self.cumulated_reward))
self.cumulated_steps += 1
rospy.logdebug("Cumulated_steps=" + str(self.cumulated_steps))
return reward
# Internal TaskEnv Methods
def get_y_dir_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry. In this case the increase or decrease in y.
:param start_point:
:return:
"""
y_dist_dir = self.odom.pose.pose.position.y - start_point.y
return y_dist_dir
def get_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry
:param start_point:
:return:
"""
distance = self.get_distance_from_point(start_point,
self.odom.pose.pose.position)
return distance
def get_distance_from_point(self, pstart, p_end):
"""
Given two Points, get the distance between them
:param p_end:
:return:
"""
a = numpy.array((pstart.x, pstart.y, pstart.z))
b = numpy.array((p_end.x, p_end.y, p_end.z))
distance = numpy.linalg.norm(a - b)
return distance
def get_orientation_euler(self):
# We convert from quaternions to euler
orientation_list = [self.odom.pose.pose.orientation.x,
self.odom.pose.pose.orientation.y,
self.odom.pose.pose.orientation.z,
self.odom.pose.pose.orientation.w]
roll, pitch, yaw = euler_from_quaternion(orientation_list)
return roll, pitch, yaw
def get_roll_velocity(self):
# We get the current joint roll velocity
roll_vel = self.joints.velocity[0]
return roll_vel
def get_y_linear_speed(self):
# We get the current linear speed in the y axis from the odometry
y_linear_speed = self.odom.twist.twist.linear.y
return y_linear_speed
END my_one_disk_walk.py |
---|
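Before moving on, here is a small sanity-check sketch (it assumes the Gazebo simulation is already running and that the parameters from the yaml file created later in this unit have been loaded onto the param server):
#!/usr/bin/env python
import gym
import rospy
import my_one_disk_walk   # importing the module triggers the register() call

rospy.init_node('check_cube_env', log_level=rospy.WARN)

env = gym.make('MyMovingCubeOneDiskWalkEnv-v0')
print(env.action_space)        # Discrete(5)
print(env.observation_space)   # Box with 6 dimensions
obs = env.reset()              # resets the simulation and returns the first observation
print(obs)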
Now, we are going to use the same scripts that you used for the CartPole example to train with Qlearn, adapting them to use the Cube environments we have just created. We only need to change the environment we use, the configuration yaml file, and a few minor elements.
Let's create the training script for the Cubli. We are going to make a copy of the one created for the CartPole.
Go to the shell and type the following in order to copy the start_training.py file from Cartpole to Cubli:
roscd cartpole_v0_training/scripts
cp ./start_training.py ~/catkin_ws/src/my_moving_cube_pkg/scripts
Now let’s modify that training script to be useful for Cubli.
First, we change the import of the training environment. Importing it only has the effect of triggering the register() call at the top of the imported module. Locate the following lines:
# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up
Change it to:
# import our training environment
import my_one_disk_walk
Modify the line that instantiates the CartPole environment so that it instantiates the Cubli one. Change the following line:
# Create the Gym environment
env = gym.make('CartPoleStayUp-v0')
By this one:
env = gym.make('MyMovingCubeOneDiskWalkEnv-v0')
Next, locate the logging and parameter-loading section, which still refers to the CartPole package and namespace:
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('cartpole_v0_training')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/cartpole_v0/alpha")
Epsilon = rospy.get_param("/cartpole_v0/epsilon")
Gamma = rospy.get_param("/cartpole_v0/gamma")
epsilon_discount = rospy.get_param("/cartpole_v0/epsilon_discount")
nepisodes = rospy.get_param("/cartpole_v0/nepisodes")
nsteps = rospy.get_param("/cartpole_v0/nsteps")
running_step = rospy.get_param("/cartpole_v0/running_step")
We have to change the package cartpole_v0_training to our own package; in this case, we will pick my_moving_cube_pkg.
Then, change the namespace of the parameters from cartpole_v0 to moving_cube. The result looks like this:
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('my_moving_cube_pkg')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/moving_cube/alpha")
Epsilon = rospy.get_param("/moving_cube/epsilon")
Gamma = rospy.get_param("/moving_cube/gamma")
epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
nepisodes = rospy.get_param("/moving_cube/nepisodes")
nsteps = rospy.get_param("/moving_cube/nsteps")
running_step = rospy.get_param("/moving_cube/running_step")
We sometimes want to filter or simplify the observations so that the reinforcement learning algorithm converges faster. To do so, we have to change the following lines:
# Initialize the environment and get first state of the robot
observation = env.reset()
state = ''.join(map(str, observation))
Turn it into this:
# Initialize the environment and get first state of the robot
observation = env.reset()
simplified_observations = convert_obs_to_state(observation)
state = ''.join(map(str, simplified_observations))
And also this:
nextState = ''.join(map(str, observation))
Turn it into this:
simplified_observations = convert_obs_to_state(observation)
nextState = ''.join(map(str, simplified_observations))
Finally, define the convert_obs_to_state function:
def convert_obs_to_state(observations):
"""
Converts the observations used for reward and so on to the essentials for the robot state
In this case, we only need the orientation of the cube and the speed of the disc.
The distance doesn't condition the actions at all
"""
disk_roll_vel = observations[0]
# roll_angle = observations[2]
y_linear_speed = observations[4]
yaw_angle = observations[5]
state_converted = [disk_roll_vel, y_linear_speed, yaw_angle]
return state_converted
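For example, with a hypothetical observation vector, the function keeps only three of the six values, and the training loop then joins them into the state string:
# Hypothetical observation, in the order returned by _get_obs():
# [disk_roll_vel, y_distance, roll, pitch, y_linear_speed, yaw]
observation = [30.0, 0.5, 1.2, 0.0, 0.3, 0.1]

simplified = convert_obs_to_state(observation)
print(simplified)                      # [30.0, 0.3, 0.1]
print(''.join(map(str, simplified)))   # "30.00.30.1" -> the key used by Qlearn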
As with the CartPole, we need to create a parameters file that contains the configuration of the RL task, including parameters for the RL algorithm, for the task, and for the robot.
Create a config file in the config directory of your package:
roscd my_moving_cube_pkg
mkdir config
cd config
touch my_one_disk_walk_openai_params.yaml
Then copy the following content inside it:
my_one_disk_walk_openai_params.yaml |
---|
moving_cube: #namespace
pos_step: 0.016 # increment in position for each command
#qlearn parameters
alpha: 0.1
gamma: 0.7
epsilon: 0.9
epsilon_discount: 0.999
nepisodes: 500
nsteps: 1000
number_splits: 10 #set to change the number of state splits for the continuous problem and also the number of env_variable splits
running_step: 0.06 # Time for each step
wait_time: 0.1 # Time to wait in the reset phases
n_actions: 5 # We have 5 actions
speed_step: 1.0 # Time to wait in the reset phases
init_roll_vel: 0.0 # Initial speed of the Roll Disk
roll_speed_fixed_value: 100.0 # Speed at which it will move forwards or backwards
roll_speed_increment_value: 10.0 # Increment that could be done in each step
max_distance: 2.0 # Maximum distance allowed for the RobotCube
max_pitch_angle: 0.2 # Maximum Angle radians in Pitch that we allow before terminating episode
max_yaw_angle: 0.1 # Maximum yaw angle deviation allowed before terminating the episode
max_y_linear_speed: 100 # Maximum speed in y axis
init_cube_pose:
x: 0.0
y: 0.0
z: 0.0
end_episode_points: 1000 # Points given when ending an episode
move_distance_reward_weight: 1000.0 # Multiplier for the moved distance reward, Ex: inc_d = 0.1 --> 100points
y_linear_speed_reward_weight: 1000.0 # Multiplier for moving fast on the y Axis
y_axis_angle_reward_weight: 1000.0 # Multiplier of angle of yaw, to keep it straight
END my_one_disk_walk_openai_params.yaml |
---|
That is the file that provides all the parameters we have been loading throughout the Task and Robot Environments.
The RL algorithm is the same for the CartPole and the Cubli problems, so the algorithm parameter names are the same. However, the ROS parameter namespace from which we read the RL params is different for each robot.
That is why we need to modify the parameter loading. Based on the my_one_disk_walk_openai_params.yaml file, you need to change the following code of the training script:
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/cartpole_v0/alpha")
Epsilon = rospy.get_param("/cartpole_v0/epsilon")
Gamma = rospy.get_param("/cartpole_v0/gamma")
epsilon_discount = rospy.get_param("/cartpole_v0/epsilon_discount")
nepisodes = rospy.get_param("/cartpole_v0/nepisodes")
nsteps = rospy.get_param("/cartpole_v0/nsteps")
running_step = rospy.get_param("/cartpole_v0/running_step")
By the following code:
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/moving_cube/alpha")
Epsilon = rospy.get_param("/moving_cube/epsilon")
Gamma = rospy.get_param("/moving_cube/gamma")
epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
nepisodes = rospy.get_param("/moving_cube/nepisodes")
nsteps = rospy.get_param("/moving_cube/nsteps")
running_step = rospy.get_param("/moving_cube/running_step")
Having done that, the final version of the training script should look like this:
start_training.py |
---|
#!/usr/bin/env python
import gym
import time
import numpy
import random
import time
import qlearn
from gym import wrappers
# ROS packages required
import rospy
import rospkg
# import our training environment
#from openai_ros.task_envs.cartpole_stay_up import stay_up
# import our training environment
import my_one_disk_walk
def convert_obs_to_state(observations):
"""
Converts the observations used for reward and so on to the essentials for the robot state
In this case, we only need the orientation of the cube and the speed of the disc.
The distance doesn't condition the actions at all
"""
disk_roll_vel = observations[0]
# roll_angle = observations[2]
y_linear_speed = observations[4]
yaw_angle = observations[5]
state_converted = [disk_roll_vel, y_linear_speed, yaw_angle]
return state_converted
if __name__ == '__main__':
rospy.init_node('cartpole_gym', anonymous=True, log_level=rospy.WARN)
# Create the Gym environment
#env = gym.make('CartPoleStayUp-v0')
env = gym.make('MyMovingCubeOneDiskWalkEnv-v0')
rospy.loginfo ( "Gym environment done")
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('my_moving_cube_pkg')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/moving_cube/alpha")
Epsilon = rospy.get_param("/moving_cube/epsilon")
Gamma = rospy.get_param("/moving_cube/gamma")
epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
nepisodes = rospy.get_param("/moving_cube/nepisodes")
nsteps = rospy.get_param("/moving_cube/nsteps")
running_step = rospy.get_param("/moving_cube/running_step")
# Initialises the algorithm that we are going to use for learning
qlearn = qlearn.QLearn(actions=range(env.action_space.n),
alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
initial_epsilon = qlearn.epsilon
start_time = time.time()
highest_reward = 0
# Starts the main training loop: the one about the episodes to do
for x in range(nepisodes):
rospy.logdebug("############### START EPISODE => " + str(x))
cumulated_reward = 0
done = False
if qlearn.epsilon > 0.05:
qlearn.epsilon *= epsilon_discount
# Initialize the environment and get first state of the robot
observation = env.reset()
simplified_observations = convert_obs_to_state(observation)
state = ''.join(map(str, simplified_observations))
# Show on screen the actual situation of the robot
# for each episode, we test the robot for nsteps
for i in range(nsteps):
rospy.loginfo("############### Start Step => "+str(i))
# Pick an action based on the current state
action = qlearn.chooseAction(state)
rospy.loginfo ("Next action is: %d", action)
# Execute the action in the environment and get feedback
observation, reward, done, info = env.step(action)
rospy.loginfo(str(observation) + " " + str(reward))
cumulated_reward += reward
if highest_reward < cumulated_reward:
highest_reward = cumulated_reward
simplified_observations = convert_obs_to_state(observation)
nextState = ''.join(map(str, simplified_observations))
# Make the algorithm learn based on the results
#rospy.logwarn("############### State we were => " + str(state))
#rospy.logwarn("############### Action that we took => " + str(action))
#rospy.logwarn("############### Reward that action gave => " + str(reward))
#rospy.logwarn("############### State in which we will start next step => " + str(nextState))
qlearn.learn(state, action, reward, nextState)
if not(done):
state = nextState
else:
rospy.loginfo ("DONE")
last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
break
rospy.loginfo("############### END Step =>" + str(i))
#raw_input("Next Step...PRESS KEY")
#rospy.sleep(2.0)
m, s = divmod(int(time.time() - start_time), 60)
h, m = divmod(m, 60)
rospy.logwarn ( ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+" Time: %d:%02d:%02d" % (h, m, s)))
rospy.loginfo ( ("\n|"+str(nepisodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |"))
l = last_time_steps.tolist()
l.sort()
#print("Parameters: a="+str)
rospy.loginfo("Overall score: {:0.2f}".format(last_time_steps.mean()))
rospy.loginfo("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))
env.close()
END start_training.py |
---|
Now we need to create a launch file that loads the parameters yaml file onto the ROS param server and then starts the training script node.
Go to the launch directory and create the launch file:
roscd my_moving_cube_pkg
mkdir launch
cd launch
touch start_training_cube.launch
Then copy the following content into the launch file:
start_training_cube.launch |
---|
<launch>
<rosparam command="load" file="$(find my_moving_cube_pkg)/config/my_one_disk_walk_openai_params.yaml" />
<!-- Launch the training system -->
<node pkg="my_moving_cube_pkg" name="movingcube_gym" type="start_training.py" output="screen"/>
</launch>
END start_training_cube.launch |
---|
Finally, add the actual learning algorithm (remember that we are still not using the OpenAI Baselines, so we need our own implementation of the RL algorithm).
Go to the scripts directory and create the QLearning file:
roscd my_moving_cube_pkg/scripts
touch qlearn.py
Then copy the following content into the file:
qlearn.py |
---|
'''
Q-learning approach for different RL problems
as part of the basic series on reinforcement learning @
https://github.com/vmayoral/basic_reinforcement_learning
Inspired by https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
@author: Victor Mayoral Vilches
'''
import random
class QLearn:
def __init__(self, actions, epsilon, alpha, gamma):
self.q = {}
self.epsilon = epsilon # exploration constant
self.alpha = alpha # learning rate
self.gamma = gamma # discount factor
self.actions = actions
def getQ(self, state, action):
return self.q.get((state, action), 0.0)
def learnQ(self, state, action, reward, value):
'''
Q-learning:
Q(s, a) += alpha * (reward(s,a) + gamma * max(Q(s',a')) - Q(s,a))
'''
oldv = self.q.get((state, action), None)
if oldv is None:
self.q[(state, action)] = reward
else:
self.q[(state, action)] = oldv + self.alpha * (value - oldv)
def chooseAction(self, state, return_q=False):
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)
if random.random() < self.epsilon:
minQ = min(q); mag = max(abs(minQ), abs(maxQ))
# add random values to all the actions, recalculate maxQ
q = [q[i] + random.random() * mag - .5 * mag for i in range(len(self.actions))]
maxQ = max(q)
count = q.count(maxQ)
# In case there're several state-action max values
# we select a random one among them
if count > 1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)
action = self.actions[i]
if return_q: # if they want it, give it!
return action, q
return action
def learn(self, state1, action1, reward, state2):
maxqnew = max([self.getQ(state2, a) for a in self.actions])
self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)
END qlearn.py |
---|
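As a minimal sketch of how this class is used (toy numbers; the real calls happen inside start_training.py above, and it assumes qlearn.py is importable from the current directory):
from qlearn import QLearn   # the file we just created

agent = QLearn(actions=range(5), epsilon=0.9, alpha=0.1, gamma=0.7)

state, next_state = "0.00.00.0", "100.00.30.1"   # hypothetical state strings
action = agent.chooseAction(state)               # epsilon-greedy choice among the 5 actions
agent.learn(state, action, 100, next_state)      # first visit: the Q-value is set to the raw reward
print(agent.getQ(state, action))                 # -> 100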
Now it is time to launch the training and see the robot learn!
But before running this launch file, check that all the required topics are available, just in case the simulation died. For example, check that the topic /moving_cube/odom is publishing. If it is not, unpause the simulation as explained in Unit 3.
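For instance, a small check you could run first (it assumes the simulation publishes nav_msgs/Odometry on /moving_cube/odom, which is the topic used by the Robot Environment):
#!/usr/bin/env python
import rospy
from nav_msgs.msg import Odometry

rospy.init_node('odom_check')
try:
    # Blocks until one odometry message arrives, or raises after 5 seconds
    msg = rospy.wait_for_message('/moving_cube/odom', Odometry, timeout=5.0)
    rospy.loginfo("Odometry OK, y position = %f", msg.pose.pose.position.y)
except rospy.ROSException:
    rospy.logerr("No odometry received: the simulation is probably paused or dead")
Once everything is up, launch the training: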
roslaunch my_moving_cube_pkg start_training_cube.launch
You should see something like this:
EXTRA exercise U2.1 |
---|
As a challenge, how would we change the direction in which the robot learns to move?
END EXTRA exercise U2.1 |
---|
Solution Exercise U2.1 |
---|
Please try to do it by yourself unless you get stuck or need some inspiration. You will learn much more if you work through each exercise on your own.
Well, the answer is simple: We go to the MyMovingCubeOneDiskWalkEnv and change the way in which we assign the rewards:
y_linear_speed = observations[4]
rospy.logdebug("y_linear_speed=" + str(y_linear_speed))
reward_y_axis_speed = y_linear_speed * self.y_linear_speed_reward_weight
We have to invert the sign of that reward. However, we don't even need to touch the code: we just have to change the value of y_linear_speed_reward_weight, which is loaded from the yaml file that defines the parameters. Change this line:
y_linear_speed_reward_weight: 1000.0 # Multiplier for moving fast in the y Axis
to this one:
y_linear_speed_reward_weight: -1000.0 # Multiplier for moving fast in the y Axis
END Solution Exercise U2.1 |
---|
NEXT STEP: How to change the learning algorithm