Assignment 2
ECS7002P - Artificial Intelligence in Games
November 9, 2020
In this assignment, you will implement a variety of reinforcement learning algorithms to find policies for the
frozen lake environment. Please read this entire document before you start working on the assignment.
1 Environment
The frozen lake environment has two main variants: the small frozen lake (Fig. 1) and the big frozen lake (Fig.
2). In both cases, each tile in a square grid corresponds to a state. There is also an additional absorbing state,
which will be introduced soon. There are four types of tiles: start (grey), frozen lake (light blue), hole (dark
blue), and goal (white). The agent has four actions, which correspond to moving one tile up, left, down, or
right. However, with probability 0.1, the environment ignores the desired direction and the agent slips (moves
one tile in a random direction). An action that would cause the agent to move outside the grid leaves the state
unchanged.
Figure 1: Small frozen lake
Figure 2: Big frozen lake
The agent receives reward 1 upon taking an action at the goal. In every other case, the agent receives zero
reward. Note that the agent does not receive a reward upon moving into the goal (nor a negative reward upon
moving into a hole). Upon taking an action at the goal or in a hole, the agent moves into the absorbing state.
Every action taken at the absorbing state leads to the absorbing state, which also does not provide rewards.
Assume a discount factor of γ = 0.9.
1
For the purposes of model-free reinforcement learning (or interactive testing), the agent is able to interact
with the frozen lake for a number of time steps that is equal to the number of tiles.
Your first task is to implement the frozen lake environment. Using either Python or Java, try to mimic the
Python interface presented in Listing 1.
Listing 1: Frozen lake environment.
The class EnvironmentModel represents a model of an environment. The constructor of this class receives
a number of states, a number of actions, and a seed that controls the pseudorandom number generator. Its
subclasses must implement two methods: p and r. The method p returns the probability of transitioning from
state to next state given action. The method r returns the expected reward in having transitioned from state to
next state given action. The method draw receives a pair of state and action and returns a state drawn according
to p together with the corresponding expected reward. Note that states and actions are represented by integers
starting at zero. We highly recommend that you follow the same convention, since this will facilitate immensely
the implementation of reinforcement learning algorithms. You can use a Python dictionary (or equivalent data
structure) to map (from and to) integers to a more convenient representation when necessary. Note that, in
general, agents may receive rewards drawn probabilistically by an environment, which is not supported in this
simplified implementation.
The class Environment represents an interactive environment and inherits from EnvironmentModel. The
constructor of this class receives a number of states, a number of actions, a maximum number of steps for
interaction, a probability distribution over initial states, and a seed that controls the pseudorandom number
generator. Its subclasses must implement two methods: p and r, which were already explained above. This
class has two new methods: reset and step. The method reset restarts the interaction between the agent and
the environment by setting the number of time steps to zero and drawing a state according to the probability
distribution over initial states. This state is stored by the class. The method step receives an action and returns
a next state drawn according to p, the corresponding expected reward, and a flag variable. The new state is
stored by the class. This method also keeps track of how many steps have been taken. Once the number of steps
matches or exceeds the pre-defined maximum number of steps, the flag variable indicates that the interaction
should end.
The class FrozenLake represents the frozen lake environment. Your task is to implement the methods p and
r for this class. The constructor of this class receives a matrix that represents a lake, a probability that the
agent will slip at any given time step, a maximum number of steps for interaction, and a seed that controls
the pseudorandom number generator. This class overrides the method step to indicate that the interaction
should also end when the absorbing state is reached. The method render is capable of rendering the state of
the environment or a pair of policy and value function.
The function play can be used to test your implementation of the environment before you try the next tasks.
4
2 Tabular model-based reinforcement learning
Your next task is to implement policy evaluation, policy improvement, policy iteration, and value iteration.
You may follow the interface suggested in Listing 2.
Listing 2: Tabular model-based algorithms.
The function policy evaluation receives an environment model, a deterministic policy, a discount factor, a
tolerance parameter, and a maximum number of iterations. A deterministic policy may be represented by an
array that contains the action prescribed for each state.
The function policy improvement receives an environment model, the value function for a policy to be
improved, and a discount factor.
The function policy iteration receives an environment model, a discount factor, a tolerance parameter, a
maximum number of iterations, and (optionally) the initial policy.
The function value iteration receives an environment model, a discount factor, a tolerance parameter, a
maximum number of iterations, and (optionally) the initial value function.
3 Tabular model-free reinforcement learning
Your next task is to implement Sarsa control and Q-learning control. You may follow the interface suggested in
Listing 3. We recommend that you use the small frozen lake to test your implementation, since these algorithms
may need many episodes to find an optimal policy for the big frozen lake.
Listing 3: Tabular model-free algorithms.
The function sarsa receives an environment, a maximum number of episodes, an initial learning rate, a
discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number
generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes
increases (for instance, eta[i] contains the learning rate for episode i).
The function q learning receives an environment, a maximum number of episodes, an initial learning rate,
a discount factor, an initial exploration factor, and an (optional) seed that controls the pseudorandom number
generator. Note that the learning rate and exploration factor decrease linearly as the number of episodes
increases (for instance, eta[i] contains the learning rate for episode i).
Important: The -greedy policy based on Q should break ties randomly between actions that maximize Q
for a given state. This plays a large role in encouraging exploration.
4 Non-tabular model-free reinforcement learning
In this task, you will treat the frozen lake environment as if it required linear action-value function approximation.
Your task is to implement Sarsa control and Q-learning control using linear function approximation.
In the process, you will learn that tabular model-free reinforcement learning is a special case of non-tabular
model-free reinforcement learning. You may follow the interface suggested in Listing 4.
Listing 4: Non-tabular model-free algorithms.
The class LinearWrapper implements a wrapper that behaves similarly to an environment that is given to
its constructor. However, the methods reset and step return a feature matrix when they would typically return
a state s. Each row a of this feature matrix contains the feature vector φ(s, a) that represents the pair of action
and state (s, a). The method encode state is responsible for representing a state by such a feature matrix.
More concretely, each possible pair of state and action is represented by a different vector where all elements
except one are zero. Therefore, the feature matrix has |S||A| columns. The method decode policy receives a
parameter vector θ obtained by a non-tabular reinforcement learning algorithm and returns the corresponding
greedy policy together with its value function estimate.
The function linear sarsa receives an environment (wrapped by LinearWrapper ), a maximum number of
episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that
controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly
as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).
The function linear q learning receives an environment (wrapped by LinearWrapper ), a maximum number
of episodes, an initial learning rate, a discount factor, an initial exploration factor, and an (optional) seed that
controls the pseudorandom number generator. Note that the learning rate and exploration factor decay linearly
as the number of episodes grows (for instance, eta[i] contains the learning rate for episode i).
The Q-learning control algorithm for linear function approximation is presented in Algorithm 1. Note that
this algorithm uses a slightly different convention for naming variables and omits some details for the sake of
simplicity (such as learning rate/exploration factor decay).
Algorithm 1 Q-learning control algorithm for linear function approximation
Input: feature vector φ(s, a) for all state-action pairs (s, a), number of episodes N, learning rate α, probability
of choosing random action ,
Important: The -greedy policy based on Q should break ties randomly between actions that maximize Q
(Algorithm 1, Line 9). This plays a large role in encouraging exploration.
5 Main function
Your final implementation task is to write a program that uses all the algorithms that you have implemented for
this assignment. Your main function should behave analogously to the function presented in Listing 5. Using
the small frozen lake as a benchmark, find and render an optimal policy using policy iteration, value iteration,
Sarsa control, Q-learning control, linear Sarsa control, and linear Q-learning. For grading purposes, if your
main function does not call one of these algorithms, we will assume that it is not implemented correctly.
Listing 5: Main function.
6 Submission instructions
This assignment corresponds to 40% of the final grade for this module. You will work in groups of 3 students. The
deadline for submitting this assignment is December 11th, 2020. Penalties for late submissions will be applied
in accordance with the School policy. The submission cut-off date is 7 days after the deadline. Submissions
should be made through QM+. Submissions by e-mail will be ignored. Please always check whether the files
were uploaded correctly to QM+. Cases of extenuating circumstances have to go through the proper procedure
in accordance with the School policy. Only cases approved by the School in due time will be considered.
You will find the group selection page in QM+. You must be part of a group in QM+ before submitting
your assignment. Plagiarism leads to irreversible non-negotiable failure in the module. If you are unsure about
what constitutes plagiarism, please ask.
This assignment requires a group submission and an individual submission, which are detailed in the next
sections.
6.1 Group submission
The group submission must be a single zip file. This file must contain a single folder named group[group id].
This folder must contain a report and a folder named code.
The code folder must contain a file named README.txt, which explains how to run your main function (see
Section 5). Based solely on the correctness and clarity of your code, you will receive the following number of
points for accomplishing each of the following tasks:
- Implementing the frozen lake environment [10/100]
- Implementing policy iteration [10/100]
- Implementing value iteration [10/100]
- Implementing Sarsa control [10/100]
- Implementing Q-learning [10/100]
- Implementing Sarsa control using linear function approximation [10/100]
- Implementing Q-learning control using linear function approximation [10/100]
The report must be a single pdf file. Other formats are not acceptable. The report must be excellently
organized and identified with your names, student numbers, and module identifier. You will receive the following
number of points for answering each of the following questions: - Explain how your code for this assignment is organized. Did you make implementation decisions that
deviate significantly from what we suggested? [10/100] - How many iterations did policy iteration require to find an optimal policy for the big frozen lake? How
many iterations did value iteration require? Which algorithm was faster? [10/100] - How many episodes did Sarsa control require to find an optimal policy for the small frozen lake? How
many episodes did Q-learning control require? Hint: you may use policy evaluation to compare the value
of each policy obtained by these algorithms to the value of an optimal policy [10/100] - In linear action-value function approximation, how can each element of the parameter vector θ be interpreted
when each possible pair of state s and action a is represented by a different feature vector φ(s, a)
where all elements except one are zero? Explain why the tabular model-free reinforcement learning algorithms
that you implemented are a special case of the non-tabular model-free reinforcement learning
algorithms that you implemented. [Bonus 5/100] - Try to find an optimal policy for the big frozen lake by tweaking the parameters for Sarsa control and
Q-learning control (maximum number of episodes, learning rate, and exploration factor). You must use
policy evaluation to confirm that the resulting policy is optimal. Even if you fail, describe your experience.
[Bonus 5/100]
10
6.2 Individual submission
Each student must submit a text that describes in no more than 300 words the role that each member of their
group had in the assignment. Your grade may be penalized if the individual submissions reveal that you have
not contributed enough to the assignment. The individual submission must be a single pdf file. Other formats
are not acceptable.