by Thomas Simonini
This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.
Today we’ll learn about Q-Learning. Q-Learning is a value-based Reinforcement Learning algorithm.
This article is the second part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus of the course. See the first article here.
In this article you’ll learn what Q-Learning is, how to build and update a Q-table with the Bellman equation, and how to implement a Q-learning agent that learns to play Frozen Lake.
Let’s say you’re a knight and you need to save the princess trapped in the castle shown on the map above.
You can move one tile at a time. The enemy can’t move, but if you land on the same tile as the enemy, you will die. Your goal is to go to the castle by the fastest route possible. This can be evaluated using a “points scoring” system.
The question is: how do you create an agent that will be able to do that?
Here’s a first strategy. Let’s say our agent tries to go to each tile, and then colors each tile: green for “safe,” and red if not.
The same map, but colored in to show which tiles are safe to visit.
Then, we can tell our agent to take only green tiles.
But the problem is that this isn’t really helpful. We don’t know the best tile to take when green tiles are adjacent to each other. So our agent can fall into an infinite loop while trying to find the castle!
Here’s a second strategy: create a table where we’ll calculate the maximum expected future reward, for each action at each state.
Thanks to that, we’ll know what’s the best action to take for each state.
Each state (tile) allows four possible actions. These are moving left, right, up, or down.
A value of 0 marks an impossible move (if you’re in the top left-hand corner, you can’t move left or up!).
In terms of computation, we can transform this grid into a table.
This is called a Q-table (“Q” for “quality” of the action). The columns will be the four actions (left, right, up, down). The rows will be the states. The value of each cell will be the maximum expected future reward for that given state and action.
Each Q-table score will be the maximum expected future reward that I’ll get if I take that action at that state, given the best policy.
Why do we say “given the policy”? It’s because we don’t implement a policy explicitly. Instead, we just improve our Q-table to always choose the best action.
Think of this Q-table as a game “cheat sheet.” Thanks to that, we know for each state (each line in the Q-table) what’s the best action to take, by finding the highest score in that line.
Yeah! We solved the castle problem! But wait… How do we calculate the values for each element of the Q table?
To learn each value of this Q-table, we’ll use the Q learning algorithm.
The Action Value Function (or “Q-function”) takes two inputs: “state” and “action.” It returns the expected future reward of that action at that state.
We can see this Q function as a reader that scrolls through the Q-table to find the line associated with our state, and the column associated with our action. It returns the Q value from the matching cell. This is the “expected future reward.”
But before we explore the environment, the Q-table gives the same arbitrary fixed value (most of the time 0). As we explore the environment, the Q-table will give us a better and better approximation by iteratively updating Q(s,a) using the Bellman Equation (see below!).
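Concretely, this “reader” is just an index into an array. Here’s a minimal sketch, assuming the Q-table is stored as a 2-D NumPy array with one row per state and one column per action (the state and action indices below are hypothetical):

```python
import numpy as np

q_table = np.zeros((16, 4))        # hypothetical 16 states x 4 actions, all initialized to 0
state, action = 5, 2               # hypothetical indices for "our state" and "our action"
q_value = q_table[state, action]   # the expected future reward for taking this action in this state
```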
The Q-learning algorithm process
The Q learning algorithm’s pseudo-code
Step 1: Initialize Q-values
We build a Q-table, with m cols (m= number of actions), and n rows (n = number of states). We initialize the values at 0.
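As a minimal sketch of this step, assuming the classic OpenAI Gym API and the FrozenLake-v0 environment used in the course notebook (the environment id may differ in newer Gym versions):

```python
import gym
import numpy as np

env = gym.make("FrozenLake-v0")             # the environment from the notebook (assumed id)

n_states = env.observation_space.n          # n rows, one per state
n_actions = env.action_space.n              # m columns, one per action
q_table = np.zeros((n_states, n_actions))   # every Q-value starts at 0
```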
Step 2: For life (or until learning is stopped)
Steps 3 to 5 will be repeated until we reach a maximum number of episodes (specified by the user) or until we manually stop the training.
Step 3: Choose an action
Choose an action a in the current state s based on the current Q-value estimates.
But…what action can we take in the beginning, if every Q-value equals zero?
That’s where the exploration/exploitation trade-off that we spoke about in the last article will be important.
The idea is that in the beginning, we’ll use the epsilon greedy strategy: we set a high exploration rate epsilon, so the agent mostly picks actions at random, and we gradually reduce epsilon as the agent becomes more confident in its Q-value estimates, as sketched in the snippet below.
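Here is a minimal sketch of that action-choice step, assuming the Q-table is a NumPy array; the function name and the epsilon value you pass in are hypothetical:

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table."""
    if random.uniform(0, 1) < epsilon:
        return random.randrange(n_actions)     # explore: pick a random action
    return int(np.argmax(q_table[state]))      # exploit: best known action for this state
```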
Steps 4–5: Evaluate!
Take the action a and observe the outcome state s’ and reward r. Now update the function Q(s,a).
We take the action a that we chose in step 3; performing this action returns a new state s’ and a reward r (as we saw in the Reinforcement Learning process in the first article).
Then, to update Q(s,a) we use the Bellman equation:
The idea here is to update our Q(state, action) like this:
New Q(s, a) = Current Q(s, a) + lr * [Reward + discount_rate * (highest Q value among the possible actions from the new state s’) - Current Q(s, a)]
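In code, the update is one line. A minimal sketch, continuing the NumPy Q-table from the earlier snippets (the function name and hyperparameter names are hypothetical):

```python
import numpy as np

def update_q(q_table, state, action, reward, new_state, learning_rate, discount_rate):
    """Bellman update: move Q(s, a) toward reward + discount_rate * max_a' Q(s', a')."""
    best_next = np.max(q_table[new_state])      # highest Q-value reachable from the new state s'
    q_table[state, action] += learning_rate * (
        reward + discount_rate * best_next - q_table[state, action]
    )
```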
Let’s take an example:
Step 1: We init our Q-table
The initialized Q-table
Step 2: Choose an action
From the starting position, you can choose between going right or down. Because we have a big epsilon rate (since we don’t know anything about the environment yet), we choose randomly. For example… move right.
We move at random (for instance, right)
We found a piece of cheese (+1), and we can now update the Q-value of being at start and going right. We do this by using the Bellman equation.
Steps 4–5: Update the Q-function
Think of the learning rate as a measure of how quickly the agent abandons its former Q-value estimate for the new one. If the learning rate is 1, the new estimate simply becomes the new Q-value.
The updated Q-table
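Plugging hypothetical numbers into the update (all Q-values still 0, reward +1 for the cheese, and an assumed learning rate of 0.1 with a discount rate of 0.9):

```python
current_q = 0.0        # Q(start, right) before the update (table was all zeros)
best_next = 0.0        # highest Q-value in the new state (still zero too)
reward = 1.0           # the piece of cheese
lr, gamma = 0.1, 0.9   # hypothetical learning rate and discount rate

new_q = current_q + lr * (reward + gamma * best_next - current_q)
print(new_q)           # 0.1 -> Q(start, right) goes from 0 to 0.1
```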
Good! We’ve just updated our first Q value. Now we need to do that again and again until the learning is stopped.
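Put together, the whole process looks roughly like this. This is a sketch, not the course notebook: it reuses the env, q_table, choose_action and update_q pieces from the snippets above, and all hyperparameters are hypothetical.

```python
total_episodes = 10000
max_steps = 100
learning_rate, discount_rate = 0.1, 0.9
epsilon, min_epsilon, decay = 1.0, 0.01, 0.001    # explore a lot at first, less over time

for episode in range(total_episodes):
    state = env.reset()                                        # start a new episode
    for _ in range(max_steps):
        action = choose_action(q_table, state, epsilon,        # step 3: choose an action
                               env.action_space.n)
        new_state, reward, done, _ = env.step(action)          # step 4: act, observe s' and r
        update_q(q_table, state, action, reward, new_state,    # step 5: Bellman update
                 learning_rate, discount_rate)
        state = new_state
        if done:                                               # fell in a hole or reached the goal
            break
    epsilon = max(min_epsilon, epsilon - decay)                # shift from exploration to exploitation
```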
We made a video where we implement a Q-learning agent that learns to play Taxi-v2 with Numpy.
Now that we know how it works, we’ll implement the Q-learning algorithm step by step. Each part of the code is explained directly in the Jupyter notebook below.
You can access it in the Deep Reinforcement Learning Course repo.
Or you can access it directly on Google Colaboratory:
Q* Learning with Frozen Lake
colab.research.google.com
That’s all! Don’t forget to implement each part of the code by yourself — it’s really important to try to modify the code I gave you.
Try to add epochs, change the learning rate, and use a harder environment (such as Frozen-lake with 8x8 tiles). Have fun!
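For the harder environment, Gym ships an 8x8 version of Frozen Lake; the exact id can vary across Gym versions, but the classic one is:

```python
import gym

env = gym.make("FrozenLake8x8-v0")   # 64 states instead of 16, so the Q-table grows accordingly
```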
Next time we’ll work on Deep Q-learning, one of the biggest breakthroughs in Deep Reinforcement Learning in 2015. And we’ll train an agent that plays Doom and kills enemies!
Doom!
If you liked my article, please click the 👏 below as many times as you liked the article so other people will see it here on Medium. And don’t forget to follow me!
If you have any thoughts, comments, questions, feel free to comment below or send me an email: [email protected], or tweet me @ThomasSimonini.
Keep learning, stay awesome!
Deep Reinforcement Learning Course with Tensorflow
Syllabus
Video version
Part 1: An introduction to Reinforcement Learning
Part 2: Diving deeper into Reinforcement Learning with Q-Learning
Part 3: An introduction to Deep Q-Learning: let’s play Doom
Part 3+: Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets
Part 4: An introduction to Policy Gradients with Doom and Cartpole
Part 5: An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!
Part 6: Proximal Policy Optimization (PPO) with Sonic the Hedgehog 2 and 3
Part 7: Curiosity-Driven Learning made easy Part I