SUMMARY |
---|
Estimated time to completion: 2 hours
In this unit, you will learn how to define the learning task of your robot by creating a Task Environment for a Moving Cube with a single disk in the roll axis. You will also use the Qlearn algorithm to train the RoboCube.
END OF SUMMARY |
---|
The Task Environment is the class in charge of defining the Reinforcement Learning task. It is responsible for converting the actions selected by the learning algorithm into robot commands, building the observation vector from the sensor data, computing the reward, deciding when an episode is done, and setting the initial conditions of each episode.
Let's define the class that will specify the task the robot must learn. We are going to set up everything needed to induce the robot to learn how to walk forwards through reinforcement learning.
First of all, let’s add the new script to our package:
Execute in WebShell #1
roscd my_moving_cube_pkg/scripts/
touch my_one_disk_walk.py; chmod +x my_one_disk_walk.py
Use the IDE to put the Task Environment template code inside your my_one_disk_walk.py. You can find the template in the openai_ros/templates directory.
Here you have the template for your convenience:
import numpy
from gym import spaces
import my_robot_env
from gym.envs.registration import register
import rospy
# The path is __init__.py of openai_ros, where we import the MovingCubeOneDiskWalkEnv directly
timestep_limit_per_episode = 1000 # Can be any Value
register(
id='MyTrainingEnv-v0',
entry_point='template_my_training_env:MovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
class MyTrainingEnv(my_robot_env.MyRobotEnv):
def __init__(self):
# Only variable needed to be set here
number_actions = rospy.get_param('/my_robot_namespace/n_actions')
self.action_space = spaces.Discrete(number_actions)
# This is the most common case of Box observation type
high = numpy.array([
obs1_max_value,
obs2_max_value,
...
obsN_max_value
])
self.observation_space = spaces.Box(-high, high)
# Variables that we retrieve through the param server, loaded when the training launch file is launched.
# Here we will add any init functions prior to starting the MyRobotEnv
super(MyTrainingEnv, self).__init__()
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
# TODO
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
# TODO
def _set_action(self, action):
"""
Move the robot based on the action variable given
"""
# TODO: Move robot
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations.
To know which variables we have access to, we need to read the
MyRobotEnv API DOCS
:return: observations
"""
# TODO
return observations
def _is_done(self, observations):
"""
Decide if episode is done based on the observations
"""
# TODO
return done
def _compute_reward(self, observations, done):
"""
Return the reward based on the observations given
"""
# TODO
return reward
# Internal TaskEnv Methods
Throughout this unit, we are going to fill in all the parts of the template according to the robot and the task we want it to learn.
In this case, we want it to inherit from the RobotEnv we created in the previous unit, so we have to import the MyCubeSingleDiskEnv from the python module my_cube_single_disk_env.py.
So, in the template code, change the following:
import my_robot_env
by the following:
import my_cube_single_disk_env
In order to be able to use a gym environment, we need to register it. This is done in the following sentence:
register(
id='MyTrainingEnv-v0',
entry_point='my_robot_env:MovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
Let's fill that call with the proper values for the Cubli:
The environment id will be MyMovingCubeOneDiskWalkEnv-v0, and our Task Environment class will also be called MyMovingCubeOneDiskWalkEnv.
The timestep limit per episode will be set to 1000.
The python module from which the class is imported is exactly the one we are defining it in, so my_one_disk_walk. Change the template code to the following:
timestep_limit_per_episode = 1000
register(
id='MyMovingCubeOneDiskWalkEnv-v0',
entry_point='my_one_disk_walk:MyMovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
We decided to call our Task Environment MyMovingCubeOneDiskWalkEnv, so let's create it. Change the template code to the following:
class MyMovingCubeOneDiskWalkEnv(my_cube_single_disk_env.MyCubeSingleDiskEnv):
def __init__(self):
Your Task Environment class must always inherit from the Robot Environment class that you created in the previous unit. That is, it must inherit from the environment of the robot that you are going to use in the RL task.
Here we are telling the Task Environment to inherit from the class MyCubeSingleDiskEnv, from the python module that we created in the previous unit called my_cube_single_disk_env.py.
Setting the action_space
The action_space indicates the number of actions that the robot can take in this learning task. Change the corresponding code in the __init__ function to the following:
# Only variable needed to be set here
number_actions = rospy.get_param('/moving_cube/n_actions')
self.action_space = spaces.Discrete(number_actions)
As you can see, we are retrieving the number of actions from the param server of ROS. We’ll see later how to upload those parameters to the param server.
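As a minimal sketch of that round trip (a hypothetical stand-alone check; in the real setup, the yaml file created later in this unit is what loads the value onto the param server):
import rospy

rospy.init_node('n_actions_check')
rospy.set_param('/moving_cube/n_actions', 5)               # this is what loading the yaml will do
number_actions = rospy.get_param('/moving_cube/n_actions')
print(number_actions)                                       # -> 5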
Setting the observation_space
We now have to create the observation_space. This space defines the range of valid values for each element of the observation vector.
We use a Box type because each observation is a float with its own range. For that, we use the numpy library and the spaces module from OpenAI Gym.
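As a quick generic sketch of how a Box space behaves (toy bounds, not the Cubli ones):
import numpy
from gym import spaces

# A 3-element observation whose values live in [-bound, +bound]
high = numpy.array([10.0, 2.0, 3.14])
observation_space = spaces.Box(-high, high)

print(observation_space.low)       # [-10. -2. -3.14]
print(observation_space.high)      # [ 10.  2.  3.14]
print(observation_space.sample())  # a random valid observation vector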
For the Cubli, it looks like this:
# This is the most common case of Box observation type
high = numpy.array([
self.roll_speed_fixed_value,
self.max_distance,
max_roll,
self.max_pitch_angle,
self.max_y_linear_speed,
self.max_y_linear_speed,
])
self.observation_space = spaces.Box(-high, high)
Most of the values of that array are retrieved from variables loaded through the param server. If you add the parameter loading, the code for the observation_space ends up like this (change it for this in the template):
# Actions and Observations
self.roll_speed_fixed_value = rospy.get_param('/moving_cube/roll_speed_fixed_value')
self.roll_speed_increment_value = rospy.get_param('/moving_cube/roll_speed_increment_value')
self.max_distance = rospy.get_param('/moving_cube/max_distance')
max_roll = 2 * math.pi
self.max_pitch_angle = rospy.get_param('/moving_cube/max_pitch_angle')
self.max_y_linear_speed = rospy.get_param('/moving_cube/max_y_linear_speed')
self.max_yaw_angle = rospy.get_param('/moving_cube/max_yaw_angle')
high = numpy.array([
self.roll_speed_fixed_value,
self.max_distance,
max_roll,
self.max_pitch_angle,
self.max_y_linear_speed,
self.max_y_linear_speed,
])
self.observation_space = spaces.Box(-high, high)
The observation_space indicates that the _get_obs function will return an array of six different values that define the robot's state. More on that later.
Initializing the Robot Environment
And finally, we need to call the init method of the parent class in order to initialize the Robot Environment class. Change the template to the following line:
super(MyMovingCubeOneDiskWalkEnv, self).__init__()
We now have to fill in the following functions. They are called by the RobotGazeboEnv to execute different aspects of the simulation, and by the learning algorithm to get the observations and reward and to apply the actions.
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
# TODO
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
# TODO
def _set_action(self, action):
"""
Move the robot based on the action variable given
"""
# TODO: Move robot
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations
To know which Variables we have access to, we need to read the
MyRobotEnv API DOCS
:return: observations
"""
# TODO
return observations
def _is_done(self, observations):
"""
Decide if episode is done based on the observations
"""
# TODO
return done
def _compute_reward(self, observations, done):
"""
Return the reward based on the observations given
"""
# TODO
return reward
We first declare the function that sets the initial state in which the robot will start every episode. For the Cubli, what matters is the speed of the internal joint. Replace the code in the template with the following:
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
self.move_joints(self.init_roll_vel)
return True
We allow that initial speed to be set as a parameter. Hence, add the following line to the __init__ function to read the init_roll_vel value from the ROS param server:
self.init_roll_vel = rospy.get_param("/moving_cube/init_roll_vel")
This method is called each time we reset an episode, in order to reset the TaskEnv variables. You must include here every variable of the task that needs to be initialized at the start of an episode. Replace the template code with the following:
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
self.total_distance_moved = 0.0
self.current_y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
self.roll_turn_speed = self.init_roll_vel
Here you decide how the action selected by the RL algorithm for the next step (a number between zero and the number of actions minus one) is transformed into actual robot movements. Once the robot commands needed to execute the action have been computed, the action is executed at the end of this function.
In the case of the Cubli, the robot can take 5 actions. The RL algorithm will select one of them by picking a number from 0 to 4 and will then call this function, which converts that number into one of the following movements: spin the roll disk forwards at a fixed speed, spin it backwards at a fixed speed, stop it, increment its current speed, or decrement its current speed.
With those actions, the Cubli will be able to ramp the speed up and down gradually as well as change it in bursts. This is vital for creating the momentum differences that make the cube move.
The speed value is then clamped so that it never exceeds the maximum, and sent to the cube.
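The clamping relies on numpy.clip, which simply saturates a value between a lower and an upper bound:
import numpy

# numpy.clip(value, lower, upper) returns the value saturated at the given bounds
print(numpy.clip(120.0, -100.0, 100.0))   # 100.0
print(numpy.clip(-35.0, -100.0, 100.0))   # -35.0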
Change the code of the template with the following code:
def _set_action(self, action):
# We convert the actions to speed movements to send to the parent class CubeSingleDiskEnv
if action == 0:# Move Speed Wheel Forwards
self.roll_turn_speed = self.roll_speed_fixed_value
elif action == 1:# Move Speed Wheel Backwards
self.roll_turn_speed = -self.roll_speed_fixed_value
elif action == 2:# Stop Speed Wheel
self.roll_turn_speed = 0.0
elif action == 3:# Increment Speed
self.roll_turn_speed += self.roll_speed_increment_value
elif action == 4:# Decrement Speed
self.roll_turn_speed -= self.roll_speed_increment_value
# We clamp Values to maximum
rospy.logdebug("roll_turn_speed before clamp=="+str(self.roll_turn_speed))
self.roll_turn_speed = numpy.clip(self.roll_turn_speed,
-self.roll_speed_fixed_value,
self.roll_speed_fixed_value)
rospy.logdebug("roll_turn_speed after clamp==" + str(self.roll_turn_speed))
# We tell the OneDiskCube to spin the RollDisk at the selected speed
self.move_joints(self.roll_turn_speed)
We also need to add the following two lines to the __init__ method in order to load the params:
# Variables that we retrieve through the param server, loaded when launch training launch.
self.roll_speed_fixed_value = rospy.get_param('/moving_cube/roll_speed_fixed_value')
self.roll_speed_increment_value = rospy.get_param('/moving_cube/roll_speed_increment_value')
Here we retrieve the sensor data through our parent class CubeSingleDiskEnv, and then process it to return the observations that we see fit.
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations
To know which Variables we have access to, we need to read the
MyCubeSingleDiskEnv API DOCS
:return:
"""
# We get the orientation of the cube in RPY
roll, pitch, yaw = self.get_orientation_euler()
# We get the distance from the origin
#distance = self.get_distance_from_start_point(self.start_point)
y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
# We get the current speed of the Roll Disk
current_disk_roll_vel = self.get_roll_velocity()
# We get the linear speed in the y axis
y_linear_speed = self.get_y_linear_speed()
cube_observations = [
round(current_disk_roll_vel, 0),
#round(distance, 1),
round(y_distance, 1),
round(roll, 1),
round(pitch, 1),
round(y_linear_speed,1),
round(yaw, 1),
]
return cube_observations
We use internal functions, defined later, that return the robot's sensory data already processed for us. In this case, we keep only one decimal of the sensor data because that is enough for our purposes, and it makes the state definition for the Qlearn algorithm (or whichever algorithm we use) simpler and faster.
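As a tiny hypothetical illustration of why the rounding matters: nearby sensor readings collapse into the same discrete state, which keeps the number of states (and, later, the size of the Q-table) small:
# Hypothetical raw readings from two consecutive steps
obs_step_1 = [12.3456, 0.1234, 0.0567]
obs_step_2 = [12.3011, 0.1190, 0.0601]

def simplify(obs):
    # same rounding policy as in _get_obs(): 0 decimals for the disk speed, 1 for the rest
    return [round(obs[0], 0), round(obs[1], 1), round(obs[2], 1)]

print(simplify(obs_step_1))  # [12.0, 0.1, 0.1]
print(simplify(obs_step_2))  # [12.0, 0.1, 0.1] -> both steps map to the same state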
We also have to load some parameters in the __init__ function:
self.start_point = Point()
self.start_point.x = rospy.get_param("/moving_cube/init_cube_pose/x")
self.start_point.y = rospy.get_param("/moving_cube/init_cube_pose/y")
self.start_point.z = rospy.get_param("/moving_cube/init_cube_pose/z")
And an import:
from geometry_msgs.msg import Point
Based on the status of the robot after a step, we must decide if the learning episode is over.
In this case, we consider the episode done when the pitch angle surpasses a certain threshold (defined as a parameter). We also consider the episode done when the yaw exceeds the maximum angle (also a parameter).
Change the template with the following code:
def _is_done(self, observations):
pitch_angle = observations[3]
yaw_angle = observations[5]
if abs(pitch_angle) > self.max_pitch_angle:
rospy.logerr("WRONG Cube Pitch Orientation==>" + str(pitch_angle))
done = True
else:
rospy.logdebug("Cube Pitch Orientation Ok==>" + str(pitch_angle))
if abs(yaw_angle) > self.max_yaw_angle:
rospy.logerr("WRONG Cube Yaw Orientation==>" + str(yaw_angle))
done = True
else:
rospy.logdebug("Cube Yaw Orientation Ok==>" + str(yaw_angle))
done = False
return done
Also, modify the __init__ function to load these parameters:
self.max_pitch_angle = rospy.get_param('/moving_cube/max_pitch_angle')
self.max_yaw_angle = rospy.get_param('/moving_cube/max_yaw_angle')
This is probably one of the most important methods. Here you condition how the robot will learn, by rewarding actions with good results and punishing bad practices. This has an enormous effect on the behaviours that emerge while the robot tries to solve the task at hand.
For the Cubli walking problem, we compute the total reward as the sum of three subrewards: a reward proportional to the distance advanced along the y axis since the last step, a reward proportional to the current linear speed along the y axis, and a negative reward proportional to the yaw deviation from moving in a straight line.
Modify the template to include the following code:
def _compute_reward(self, observations, done):
if not done:
y_distance_now = observations[1]
delta_distance = y_distance_now - self.current_y_distance
rospy.logdebug("y_distance_now=" + str(y_distance_now)+", current_y_distance=" + str(self.current_y_distance))
rospy.logdebug("delta_distance=" + str(delta_distance))
reward_distance = delta_distance * self.move_distance_reward_weight
self.current_y_distance = y_distance_now
y_linear_speed = observations[4]
rospy.logdebug("y_linear_speed=" + str(y_linear_speed))
reward_y_axis_speed = y_linear_speed * self.y_linear_speed_reward_weight
# Negative Reward for yaw different from zero.
yaw_angle = observations[5]
rospy.logdebug("yaw_angle=" + str(yaw_angle))
# Worst yaw is 90 and 270 degrees, best 0 and 180. We use sin function for giving reward.
sin_yaw_angle = math.sin(yaw_angle)
rospy.logdebug("sin_yaw_angle=" + str(sin_yaw_angle))
reward_y_axis_angle = -1 * abs(sin_yaw_angle) * self.y_axis_angle_reward_weight
# We are not interested in decimals of the reward, they don't give any advantage.
reward = round(reward_distance, 0) + round(reward_y_axis_speed, 0) + round(reward_y_axis_angle, 0)
rospy.logdebug("reward_distance=" + str(reward_distance))
rospy.logdebug("reward_y_axis_speed=" + str(reward_y_axis_speed))
rospy.logdebug("reward_y_axis_angle=" + str(reward_y_axis_angle))
rospy.logdebug("reward=" + str(reward))
else:
reward = -self.end_episode_points
return reward
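As a rough numeric illustration of this reward (hypothetical values, using the weights we will put into the yaml file later in this unit):
import math

# Hypothetical step: the cube advanced 0.1 m along y since the last step,
# moves at 0.3 m/s along y, and has drifted 0.1 rad in yaw.
move_distance_reward_weight = 1000.0
y_linear_speed_reward_weight = 1000.0
y_axis_angle_reward_weight = 1000.0

reward_distance = 0.1 * move_distance_reward_weight                          # 100.0
reward_y_axis_speed = 0.3 * y_linear_speed_reward_weight                     # 300.0
reward_y_axis_angle = -1 * abs(math.sin(0.1)) * y_axis_angle_reward_weight   # about -99.8

reward = round(reward_distance, 0) + round(reward_y_axis_speed, 0) + round(reward_y_axis_angle, 0)
print(reward)  # 300.0 (100 + 300 - 100)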
Also modify the __init__ method to load some parameters from the ROS param server:
self.move_distance_reward_weight = rospy.get_param("/moving_cube/move_distance_reward_weight")
self.y_linear_speed_reward_weight = rospy.get_param("/moving_cube/y_linear_speed_reward_weight")
self.y_axis_angle_reward_weight = rospy.get_param("/moving_cube/y_axis_angle_reward_weight")
self.end_episode_points = rospy.get_param("/moving_cube/end_episode_points")
Finally, add the following imports at the beginning of the Python file:
import numpy
import math
The private functions are methods of the Task Environment that are needed for the internal workings of the class. They are called by the virtual methods and are usually created to simplify the code. They depend heavily on the task at hand, so there is no general rule for them.
As you can see in the code below, these are the methods that we called from the _get_obs() function and the other virtual methods.
Add the following code at the end of the class:
# Internal TaskEnv Methods
def get_y_dir_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry. In this case the increase or decrease in y.
:param start_point:
:return:
"""
y_dist_dir = self.odom.pose.pose.position.y - start_point.y
return y_dist_dir
def get_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry
:param start_point:
:return:
"""
distance = self.get_distance_from_point(start_point,
self.odom.pose.pose.position)
return distance
def get_distance_from_point(self, pstart, p_end):
"""
Given two Points, get the distance between them
:param p_end:
:return:
"""
a = numpy.array((pstart.x, pstart.y, pstart.z))
b = numpy.array((p_end.x, p_end.y, p_end.z))
distance = numpy.linalg.norm(a - b)
return distance
def get_orientation_euler(self):
# We convert from quaternions to euler
orientation_list = [self.odom.pose.pose.orientation.x,
self.odom.pose.pose.orientation.y,
self.odom.pose.pose.orientation.z,
self.odom.pose.pose.orientation.w]
roll, pitch, yaw = euler_from_quaternion(orientation_list)
return roll, pitch, yaw
def get_roll_velocity(self):
# We get the current joint roll velocity
roll_vel = self.joints.velocity[0]
return roll_vel
def get_y_linear_speed(self):
# We get the current linear speed in the y axis from the odometry
y_linear_speed = self.odom.twist.twist.linear.y
return y_linear_speed
As you can see, we don't even need a get_odom method, for example, because we read the data directly from the odom variable provided by the parent class.
Finally, add the following import at the beginning of the Python file:
from tf.transformations import euler_from_quaternion
You should have something similar to this as the final TaskEnv for the Cubli robot:
my_one_disk_walk.py |
---|
#! /usr/bin/env python
import rospy
import numpy
import math
from gym import spaces
import my_cube_single_disk_env
from gym.envs.registration import register
from geometry_msgs.msg import Point
from tf.transformations import euler_from_quaternion
timestep_limit_per_episode = 1000 # Can be any Value
register(
id='MyMovingCubeOneDiskWalkEnv-v0',
entry_point='my_one_disk_walk:MyMovingCubeOneDiskWalkEnv',
#timestep_limit=timestep_limit_per_episode,
)
class MyMovingCubeOneDiskWalkEnv(my_cube_single_disk_env.MyCubeSingleDiskEnv):
def __init__(self):
# Only variable needed to be set here
number_actions = rospy.get_param('/moving_cube/n_actions')
self.action_space = spaces.Discrete(number_actions)
#number_observations = rospy.get_param('/moving_cube/n_observations')
"""
We set the Observation space for the 6 observations
cube_observations = [
round(current_disk_roll_vel, 0),
round(y_distance, 1),
round(roll, 1),
round(pitch, 1),
round(y_linear_speed,1),
round(yaw, 1),
]
"""
# Actions and Observations
self.roll_speed_fixed_value = rospy.get_param('/moving_cube/roll_speed_fixed_value')
self.roll_speed_increment_value = rospy.get_param('/moving_cube/roll_speed_increment_value')
self.max_distance = rospy.get_param('/moving_cube/max_distance')
max_roll = 2 * math.pi
self.max_pitch_angle = rospy.get_param('/moving_cube/max_pitch_angle')
self.max_y_linear_speed = rospy.get_param('/moving_cube/max_y_linear_speed')
self.max_yaw_angle = rospy.get_param('/moving_cube/max_yaw_angle')
high = numpy.array([
self.roll_speed_fixed_value,
self.max_distance,
max_roll,
self.max_pitch_angle,
self.max_y_linear_speed,
self.max_y_linear_speed,
])
self.observation_space = spaces.Box(-high, high)
rospy.logwarn("ACTION SPACES TYPE===>"+str(self.action_space))
rospy.logwarn("OBSERVATION SPACES TYPE===>"+str(self.observation_space))
# Variables that we retrieve through the param server, loaded when the training launch file is launched.
self.init_roll_vel = rospy.get_param("/moving_cube/init_roll_vel")
# Get Observations
self.start_point = Point()
self.start_point.x = rospy.get_param("/moving_cube/init_cube_pose/x")
self.start_point.y = rospy.get_param("/moving_cube/init_cube_pose/y")
self.start_point.z = rospy.get_param("/moving_cube/init_cube_pose/z")
# Rewards
self.move_distance_reward_weight = rospy.get_param("/moving_cube/move_distance_reward_weight")
self.y_linear_speed_reward_weight = rospy.get_param("/moving_cube/y_linear_speed_reward_weight")
self.y_axis_angle_reward_weight = rospy.get_param("/moving_cube/y_axis_angle_reward_weight")
self.end_episode_points = rospy.get_param("/moving_cube/end_episode_points")
self.cumulated_steps = 0.0
# Here we will add any init functions prior to starting the MyRobotEnv
super(MyMovingCubeOneDiskWalkEnv, self).__init__()
def _set_init_pose(self):
"""Sets the Robot in its init pose
"""
self.move_joints(self.init_roll_vel)
return True
def _init_env_variables(self):
"""
Inits variables needed to be initialised each time we reset at the start
of an episode.
:return:
"""
self.total_distance_moved = 0.0
self.current_y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
self.roll_turn_speed = rospy.get_param('/moving_cube/init_roll_vel')
# For Info Purposes
self.cumulated_reward = 0.0
#self.cumulated_steps = 0.0
def _set_action(self, action):
# We convert the actions to speed movements to send to the parent class CubeSingleDiskEnv
if action == 0:# Move Speed Wheel Forwards
self.roll_turn_speed = self.roll_speed_fixed_value
elif action == 1:# Move Speed Wheel Backwards
self.roll_turn_speed = -1*self.roll_speed_fixed_value
elif action == 2:# Stop Speed Wheel
self.roll_turn_speed = 0.0
elif action == 3:# Increment Speed
self.roll_turn_speed += self.roll_speed_increment_value
elif action == 4:# Decrement Speed
self.roll_turn_speed -= self.roll_speed_increment_value
# We clamp Values to maximum
rospy.logdebug("roll_turn_speed before clamp=="+str(self.roll_turn_speed))
self.roll_turn_speed = numpy.clip(self.roll_turn_speed,
-1*self.roll_speed_fixed_value,
self.roll_speed_fixed_value)
rospy.logdebug("roll_turn_speed after clamp==" + str(self.roll_turn_speed))
# We tell the OneDiskCube to spin the RollDisk at the selected speed
self.move_joints(self.roll_turn_speed)
def _get_obs(self):
"""
Here we define what sensor data makes up our robot's observations
To know which variables we have access to, we need to read the
MyCubeSingleDiskEnv API DOCS
:return:
"""
# We get the orientation of the cube in RPY
roll, pitch, yaw = self.get_orientation_euler()
# We get the distance from the origin
#distance = self.get_distance_from_start_point(self.start_point)
y_distance = self.get_y_dir_distance_from_start_point(self.start_point)
# We get the current speed of the Roll Disk
current_disk_roll_vel = self.get_roll_velocity()
# We get the linear speed in the y axis
y_linear_speed = self.get_y_linear_speed()
cube_observations = [
round(current_disk_roll_vel, 0),
round(y_distance, 1),
round(roll, 1),
round(pitch, 1),
round(y_linear_speed,1),
round(yaw, 1)
]
rospy.logdebug("Observations==>"+str(cube_observations))
return cube_observations
def _is_done(self, observations):
pitch_angle = observations[3]
yaw_angle = observations[5]
if abs(pitch_angle) > self.max_pitch_angle:
rospy.logerr("WRONG Cube Pitch Orientation==>" + str(pitch_angle))
done = True
else:
rospy.logdebug("Cube Pitch Orientation Ok==>" + str(pitch_angle))
if abs(yaw_angle) > self.max_yaw_angle:
rospy.logerr("WRONG Cube Yaw Orientation==>" + str(yaw_angle))
done = True
else:
rospy.logdebug("Cube Yaw Orientation Ok==>" + str(yaw_angle))
done = False
return done
def _compute_reward(self, observations, done):
if not done:
y_distance_now = observations[1]
delta_distance = y_distance_now - self.current_y_distance
rospy.logdebug("y_distance_now=" + str(y_distance_now)+", current_y_distance=" + str(self.current_y_distance))
rospy.logdebug("delta_distance=" + str(delta_distance))
reward_distance = delta_distance * self.move_distance_reward_weight
self.current_y_distance = y_distance_now
y_linear_speed = observations[4]
rospy.logdebug("y_linear_speed=" + str(y_linear_speed))
reward_y_axis_speed = y_linear_speed * self.y_linear_speed_reward_weight
# Negative Reward for yaw different from zero.
yaw_angle = observations[5]
rospy.logdebug("yaw_angle=" + str(yaw_angle))
# Worst yaw is 90 and 270 degrees, best 0 and 180. We use sin function for giving reward.
sin_yaw_angle = math.sin(yaw_angle)
rospy.logdebug("sin_yaw_angle=" + str(sin_yaw_angle))
reward_y_axis_angle = -1 * abs(sin_yaw_angle) * self.y_axis_angle_reward_weight
# We are not interested in decimals of the reward, they don't give any advantage.
reward = round(reward_distance, 0) + round(reward_y_axis_speed, 0) + round(reward_y_axis_angle, 0)
rospy.logdebug("reward_distance=" + str(reward_distance))
rospy.logdebug("reward_y_axis_speed=" + str(reward_y_axis_speed))
rospy.logdebug("reward_y_axis_angle=" + str(reward_y_axis_angle))
rospy.logdebug("reward=" + str(reward))
else:
reward = -1*self.end_episode_points
self.cumulated_reward += reward
rospy.logdebug("Cumulated_reward=" + str(self.cumulated_reward))
self.cumulated_steps += 1
rospy.logdebug("Cumulated_steps=" + str(self.cumulated_steps))
return reward
# Internal TaskEnv Methods
def get_y_dir_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry. In this case the increase or decrease in y.
:param start_point:
:return:
"""
y_dist_dir = self.odom.pose.pose.position.y - start_point.y
return y_dist_dir
def get_distance_from_start_point(self, start_point):
"""
Calculates the distance from the given point and the current position
given by odometry
:param start_point:
:return:
"""
distance = self.get_distance_from_point(start_point,
self.odom.pose.pose.position)
return distance
def get_distance_from_point(self, pstart, p_end):
"""
Given two Points, get the distance between them
:param p_end:
:return:
"""
a = numpy.array((pstart.x, pstart.y, pstart.z))
b = numpy.array((p_end.x, p_end.y, p_end.z))
distance = numpy.linalg.norm(a - b)
return distance
def get_orientation_euler(self):
# We convert from quaternions to euler
orientation_list = [self.odom.pose.pose.orientation.x,
self.odom.pose.pose.orientation.y,
self.odom.pose.pose.orientation.z,
self.odom.pose.pose.orientation.w]
roll, pitch, yaw = euler_from_quaternion(orientation_list)
return roll, pitch, yaw
def get_roll_velocity(self):
# We get the current joint roll velocity
roll_vel = self.joints.velocity[0]
return roll_vel
def get_y_linear_speed(self):
# We get the current linear speed in the y axis from the odometry
y_linear_speed = self.odom.twist.twist.linear.y
return y_linear_speed
END my_one_disk_walk.py |
---|
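Before moving on, here is a small sanity-check sketch (it assumes the Gazebo simulation is already running and that the parameters from the yaml file created later in this unit have been loaded onto the param server):
#!/usr/bin/env python
import gym
import rospy
import my_one_disk_walk   # importing the module triggers the register() call

rospy.init_node('check_cube_env', log_level=rospy.WARN)

env = gym.make('MyMovingCubeOneDiskWalkEnv-v0')
print(env.action_space)        # Discrete(5)
print(env.observation_space)   # Box with 6 dimensions
obs = env.reset()              # resets the simulation and returns the first observation
print(obs)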
Now, we are going to use the same scripts that you used for the CartPole example to train with Qlearn, adapting them to use the Cube environments we have just created. We only need to change the environment we use, the configuration yaml file, and a few minor elements.
Let's create the training script for the Cubli. We are going to make a copy of the one created for the CartPole.
Go to the shell and type the following in order to copy the start_training.py file from Cartpole to Cubli:
roscd cartpole_v0_training/scripts
cp ./start_training.py ~/catkin_ws/src/my_moving_cube_pkg/scripts
Now let’s modify that training script to be useful for Cubli.
First, we change the import of the training environment. Importing it only has the effect of triggering the register() call at the top of the imported module. Locate the following lines:
# import our training environment
from openai_ros.task_envs.cartpole_stay_up import stay_up
Change it to:
# import our training environment
import my_one_disk_walk
Modify the line that instantiates the CartPole environment so that it instantiates the Cubli one. Change the following line:
# Create the Gym environment
env = gym.make('CartPoleStayUp-v0')
By this one:
env = gym.make('MyMovingCubeOneDiskWalkEnv-v0')
Next, locate the logging and parameter-loading section, which still refers to the CartPole package and namespace:
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('cartpole_v0_training')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/cartpole_v0/alpha")
Epsilon = rospy.get_param("/cartpole_v0/epsilon")
Gamma = rospy.get_param("/cartpole_v0/gamma")
epsilon_discount = rospy.get_param("/cartpole_v0/epsilon_discount")
nepisodes = rospy.get_param("/cartpole_v0/nepisodes")
nsteps = rospy.get_param("/cartpole_v0/nsteps")
running_step = rospy.get_param("/cartpole_v0/running_step")
We have to change the package cartpole_v0_training to our own package; in this case, we will pick my_moving_cube_pkg.
Then, change the namespace of the parameters from cartpole_v0 to moving_cube. The result looks like this:
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('my_moving_cube_pkg')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/moving_cube/alpha")
Epsilon = rospy.get_param("/moving_cube/epsilon")
Gamma = rospy.get_param("/moving_cube/gamma")
epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
nepisodes = rospy.get_param("/moving_cube/nepisodes")
nsteps = rospy.get_param("/moving_cube/nsteps")
running_step = rospy.get_param("/moving_cube/running_step")
We sometimes want to filter or simplify the observations so that the reinforcement learning algorithm converges faster. To do so, we have to change the following lines:
# Initialize the environment and get first state of the robot
observation = env.reset()
state = ''.join(map(str, observation))
Turn it into this:
# Initialize the environment and get first state of the robot
observation = env.reset()
simplified_observations = convert_obs_to_state(observation)
state = ''.join(map(str, simplified_observations))
And also this:
nextState = ''.join(map(str, observation))
Turn it into this:
simplified_observations = convert_obs_to_state(observation)
nextState = ''.join(map(str, simplified_observations))
Finally, define the convert_obs_to_state function:
def convert_obs_to_state(observations):
"""
Converts the observations used for reward and so on to the essentials for the robot state
In this case, we only need the orientation of the cube and the speed of the disc.
The distance doesn't condition the actions at all
"""
disk_roll_vel = observations[0]
# roll_angle = observations[2]
y_linear_speed = observations[4]
yaw_angle = observations[5]
state_converted = [disk_roll_vel, y_linear_speed, yaw_angle]
return state_converted
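For example, with a hypothetical observation vector, the function keeps only three of the six values, and the training loop then joins them into the state string:
# Hypothetical observation, in the order returned by _get_obs():
# [disk_roll_vel, y_distance, roll, pitch, y_linear_speed, yaw]
observation = [30.0, 0.5, 1.2, 0.0, 0.3, 0.1]

simplified = convert_obs_to_state(observation)
print(simplified)                      # [30.0, 0.3, 0.1]
print(''.join(map(str, simplified)))   # "30.00.30.1" -> the key used by Qlearn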
As with the CartPole, we need to create a parameters file that contains the configuration of the RL task, including parameters for the RL algorithm, for the task, and for the robot.
Create a config file in the config directory of your package:
roscd my_moving_cube_pkg
mkdir config
cd config
touch my_one_disk_walk_openai_params.yaml
Then copy the following content inside it:
my_one_disk_walk_openai_params.yaml |
---|
moving_cube: #namespace
pos_step: 0.016 # increment in position for each command
#qlearn parameters
alpha: 0.1
gamma: 0.7
epsilon: 0.9
epsilon_discount: 0.999
nepisodes: 500
nsteps: 1000
number_splits: 10 #set to change the number of state splits for the continuous problem and also the number of env_variable splits
running_step: 0.06 # Time for each step
wait_time: 0.1 # Time to wait in the reset phases
n_actions: 5 # We have 5 actions
speed_step: 1.0 # Time to wait in the reset phases
init_roll_vel: 0.0 # Initial speed of the Roll Disk
roll_speed_fixed_value: 100.0 # Speed at which it will move forwards or backwards
roll_speed_increment_value: 10.0 # Increment that could be done in each step
max_distance: 2.0 # Maximum distance allowed for the RobotCube
max_pitch_angle: 0.2 # Maximum Angle radians in Pitch that we allow before terminating episode
max_yaw_angle: 0.1 # Maximum yaw angle deviation allowed before terminating the episode
max_y_linear_speed: 100 # Maximum speed in y axis
init_cube_pose:
x: 0.0
y: 0.0
z: 0.0
end_episode_points: 1000 # Points given when ending an episode
move_distance_reward_weight: 1000.0 # Multiplier for the moved distance reward, Ex: inc_d = 0.1 --> 100points
y_linear_speed_reward_weight: 1000.0 # Multiplier for moving fast on the y Axis
y_axis_angle_reward_weight: 1000.0 # Multiplier of angle of yaw, to keep it straight
END my_one_disk_walk_openai_params.yaml |
---|
That is the file that provides all the parameters we have been loading throughout the Task and Robot Environments.
The RL algorithm is the same for the CartPole and the Cubli problems, so the algorithm parameter names are the same. However, the ROS parameter namespace from which we read the RL params is different for each robot.
That is why we need to modify the parameter loading. Based on the my_one_disk_walk_openai_params.yaml file, you need to change the following code of the training script:
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/cartpole_v0/alpha")
Epsilon = rospy.get_param("/cartpole_v0/epsilon")
Gamma = rospy.get_param("/cartpole_v0/gamma")
epsilon_discount = rospy.get_param("/cartpole_v0/epsilon_discount")
nepisodes = rospy.get_param("/cartpole_v0/nepisodes")
nsteps = rospy.get_param("/cartpole_v0/nsteps")
running_step = rospy.get_param("/cartpole_v0/running_step")
By the following code:
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/moving_cube/alpha")
Epsilon = rospy.get_param("/moving_cube/epsilon")
Gamma = rospy.get_param("/moving_cube/gamma")
epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
nepisodes = rospy.get_param("/moving_cube/nepisodes")
nsteps = rospy.get_param("/moving_cube/nsteps")
running_step = rospy.get_param("/moving_cube/running_step")
Having done that, the final version of the training script should look like this:
start_training.py |
---|
#!/usr/bin/env python
import gym
import time
import numpy
import random
import time
import qlearn
from gym import wrappers
# ROS packages required
import rospy
import rospkg
# import our training environment
#from openai_ros.task_envs.cartpole_stay_up import stay_up
# import our training environment
import my_one_disk_walk
def convert_obs_to_state(observations):
"""
Converts the observations used for reward and so on to the essentials for the robot state
In this case, we only need the orientation of the cube and the speed of the disc.
The distance doesn't condition the actions at all
"""
disk_roll_vel = observations[0]
# roll_angle = observations[2]
y_linear_speed = observations[4]
yaw_angle = observations[5]
state_converted = [disk_roll_vel, y_linear_speed, yaw_angle]
return state_converted
if __name__ == '__main__':
rospy.init_node('cartpole_gym', anonymous=True, log_level=rospy.WARN)
# Create the Gym environment
#env = gym.make('CartPoleStayUp-v0')
env = gym.make('MyMovingCubeOneDiskWalkEnv-v0')
rospy.loginfo ( "Gym environment done")
# Set the logging system
rospack = rospkg.RosPack()
pkg_path = rospack.get_path('my_moving_cube_pkg')
outdir = pkg_path + '/training_results'
env = wrappers.Monitor(env, outdir, force=True)
rospy.loginfo ( "Monitor Wrapper started")
last_time_steps = numpy.ndarray(0)
# Loads parameters from the ROS param server
# Parameters are stored in a yaml file inside the config directory
# They are loaded at runtime by the launch file
Alpha = rospy.get_param("/moving_cube/alpha")
Epsilon = rospy.get_param("/moving_cube/epsilon")
Gamma = rospy.get_param("/moving_cube/gamma")
epsilon_discount = rospy.get_param("/moving_cube/epsilon_discount")
nepisodes = rospy.get_param("/moving_cube/nepisodes")
nsteps = rospy.get_param("/moving_cube/nsteps")
running_step = rospy.get_param("/moving_cube/running_step")
# Initialises the algorithm that we are going to use for learning
qlearn = qlearn.QLearn(actions=range(env.action_space.n),
alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
initial_epsilon = qlearn.epsilon
start_time = time.time()
highest_reward = 0
# Starts the main training loop: the one about the episodes to do
for x in range(nepisodes):
rospy.logdebug("############### START EPISODE => " + str(x))
cumulated_reward = 0
done = False
if qlearn.epsilon > 0.05:
qlearn.epsilon *= epsilon_discount
# Initialize the environment and get first state of the robot
observation = env.reset()
simplified_observations = convert_obs_to_state(observation)
state = ''.join(map(str, simplified_observations))
# Show on screen the actual situation of the robot
# for each episode, we test the robot for nsteps
for i in range(nsteps):
rospy.loginfo("############### Start Step => "+str(i))
# Pick an action based on the current state
action = qlearn.chooseAction(state)
rospy.loginfo ("Next action is: %d", action)
# Execute the action in the environment and get feedback
observation, reward, done, info = env.step(action)
rospy.loginfo(str(observation) + " " + str(reward))
cumulated_reward += reward
if highest_reward < cumulated_reward:
highest_reward = cumulated_reward
simplified_observations = convert_obs_to_state(observation)
nextState = ''.join(map(str, simplified_observations))
# Make the algorithm learn based on the results
#rospy.logwarn("############### State we were => " + str(state))
#rospy.logwarn("############### Action that we took => " + str(action))
#rospy.logwarn("############### Reward that action gave => " + str(reward))
#rospy.logwarn("############### State in which we will start next step => " + str(nextState))
qlearn.learn(state, action, reward, nextState)
if not(done):
state = nextState
else:
rospy.loginfo ("DONE")
last_time_steps = numpy.append(last_time_steps, [int(i + 1)])
break
rospy.loginfo("############### END Step =>" + str(i))
#raw_input("Next Step...PRESS KEY")
#rospy.sleep(2.0)
m, s = divmod(int(time.time() - start_time), 60)
h, m = divmod(m, 60)
rospy.logwarn ( ("EP: "+str(x+1)+" - [alpha: "+str(round(qlearn.alpha,2))+" - gamma: "+str(round(qlearn.gamma,2))+" - epsilon: "+str(round(qlearn.epsilon,2))+"] - Reward: "+str(cumulated_reward)+" Time: %d:%02d:%02d" % (h, m, s)))
rospy.loginfo ( ("\n|"+str(nepisodes)+"|"+str(qlearn.alpha)+"|"+str(qlearn.gamma)+"|"+str(initial_epsilon)+"*"+str(epsilon_discount)+"|"+str(highest_reward)+"| PICTURE |"))
l = last_time_steps.tolist()
l.sort()
#print("Parameters: a="+str)
rospy.loginfo("Overall score: {:0.2f}".format(last_time_steps.mean()))
rospy.loginfo("Best 100 score: {:0.2f}".format(reduce(lambda x, y: x + y, l[-100:]) / len(l[-100:])))
env.close()
END start_training.py |
---|
Now we need to create a launch file that loads the parameters yaml file onto the ROS param server and then starts the training script node.
Go to the launch directory and create the launch file:
roscd my_moving_cube_pkg
mkdir launch
cd launch
touch start_training_cube.launch
Then copy the following content into the launch file:
start_training_cube.launch |
---|
<launch>
<rosparam command="load" file="$(find my_moving_cube_pkg)/config/my_one_disk_walk_openai_params.yaml" />
<!-- Launch the training system -->
<node pkg="my_moving_cube_pkg" name="movingcube_gym" type="start_training.py" output="screen"/>
</launch>
END start_training_cube.launch |
---|
Finally, add the actual learning algorithm (remember that we are still not using the OpenAI Baselines, so we need our own implementation of the RL algorithm).
Go to the scripts directory and create the QLearning file:
roscd my_moving_cube_pkg/scripts
touch qlearn.py
Then copy the following content into the file:
qlearn.py |
---|
'''
Q-learning approach for different RL problems
as part of the basic series on reinforcement learning @
https://github.com/vmayoral/basic_reinforcement_learning
Inspired by https://gym.openai.com/evaluations/eval_kWknKOkPQ7izrixdhriurA
@author: Victor Mayoral Vilches
'''
import random
class QLearn:
def __init__(self, actions, epsilon, alpha, gamma):
self.q = {}
self.epsilon = epsilon # exploration constant
self.alpha = alpha # learning rate
self.gamma = gamma # discount factor
self.actions = actions
def getQ(self, state, action):
return self.q.get((state, action), 0.0)
def learnQ(self, state, action, reward, value):
'''
Q-learning:
Q(s, a) += alpha * (reward(s,a) + gamma * max(Q(s',a')) - Q(s,a))
'''
oldv = self.q.get((state, action), None)
if oldv is None:
self.q[(state, action)] = reward
else:
self.q[(state, action)] = oldv + self.alpha * (value - oldv)
def chooseAction(self, state, return_q=False):
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)
if random.random() < self.epsilon:
minQ = min(q); mag = max(abs(minQ), abs(maxQ))
# add random values to all the actions, recalculate maxQ
q = [q[i] + random.random() * mag - .5 * mag for i in range(len(self.actions))]
maxQ = max(q)
count = q.count(maxQ)
# In case there're several state-action max values
# we select a random one among them
if count > 1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)
action = self.actions[i]
if return_q: # if they want it, give it!
return action, q
return action
def learn(self, state1, action1, reward, state2):
maxqnew = max([self.getQ(state2, a) for a in self.actions])
self.learnQ(state1, action1, reward, reward + self.gamma*maxqnew)
END qlearn.py |
---|
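As a minimal sketch of how this class is used (toy numbers; the real calls happen inside start_training.py above, and it assumes qlearn.py is importable from the current directory):
from qlearn import QLearn   # the file we just created

agent = QLearn(actions=range(5), epsilon=0.9, alpha=0.1, gamma=0.7)

state, next_state = "0.00.00.0", "100.00.30.1"   # hypothetical state strings
action = agent.chooseAction(state)               # epsilon-greedy choice among the 5 actions
agent.learn(state, action, 100, next_state)      # first visit: the Q-value is set to the raw reward
print(agent.getQ(state, action))                 # -> 100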
Now it is time to launch the training and see the robot learn!
But before running this launch file, check that all the required topics are available, just in case the simulation died. For example, check that the topic /moving_cube/odom is publishing. If it is not, unpause the simulation as explained in Unit 3.
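For instance, a small check you could run first (it assumes the simulation publishes nav_msgs/Odometry on /moving_cube/odom, which is the topic used by the Robot Environment):
#!/usr/bin/env python
import rospy
from nav_msgs.msg import Odometry

rospy.init_node('odom_check')
try:
    # Blocks until one odometry message arrives, or raises after 5 seconds
    msg = rospy.wait_for_message('/moving_cube/odom', Odometry, timeout=5.0)
    rospy.loginfo("Odometry OK, y position = %f", msg.pose.pose.position.y)
except rospy.ROSException:
    rospy.logerr("No odometry received: the simulation is probably paused or dead")
Once everything is up, launch the training: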
roslaunch my_moving_cube_pkg start_training_cube.launch
You should see something like this:
EXTRA exercise U2.1 |
---|
As a challenge, how would we change the direction in which the robot learns to move?
END EXTRA exercise U2.1 |
---|
Solution Exercise U2.1 |
---|
Please try to do it by yourself unless you get stuck or need some inspiration. You will learn much more if you work through each exercise on your own.
Well, the answer is simple: We go to the MyMovingCubeOneDiskWalkEnv and change the way in which we assign the rewards:
y_linear_speed = observations[4]
rospy.logdebug("y_linear_speed=" + str(y_linear_speed))
reward_y_axis_speed = y_linear_speed * self.y_linear_speed_reward_weight
We have to invert the sign of that reward. However, we don't even need to touch the code: we just have to change the value of y_linear_speed_reward_weight, which is loaded from the yaml file that defines the parameters. Change this line:
y_linear_speed_reward_weight: 1000.0 # Multiplier for moving fast in the y Axis
to this one:
y_linear_speed_reward_weight: -1000.0 # Multiplier for moving fast in the y Axis
END Solution Exercise U2.1 |
---|
NEXT STEP: How to change the learning algorithm