Tutorial: An Introduction to Reinforcement Learning Using OpenAI Gym
May 05, 2021 • Joy Zhang • Tutorial • 8 minutes
Edit 5 Oct 2021: I've added a Colab notebook version of this tutorial here.
In this introductory tutorial, we'll apply reinforcement learning (RL) to train an agent to solve the 'Taxi' environment from OpenAI Gym. We'll cover the basics of reinforcement learning, build a random baseline agent, and then use Q-learning to train an agent that solves the environment.
Taxi is one of many environments available on OpenAI Gym. These environments are used to develop and benchmark reinforcement learning algorithms.
The goal of Taxi is to pick up passengers and drop them off at their destination in as few moves as possible. In this tutorial, you'll start with a taxi agent that takes actions randomly:
…and train the agent to be a better taxi driver using reinforcement learning:
Think about how you might teach a dog a new trick, like telling it to sit:
By continuing to do things that lead to positive outcomes, the dog will learn to sit when it hears the command in order to get its treat. Reinforcement learning is a subdomain of machine learning that involves training an 'agent' (the dog) to learn the correct sequences of actions to take (sitting) in its environment (in response to the command 'sit') in order to maximise its reward (getting a treat). This can be illustrated more formally as:
Source: Sutton & Barto
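In code, that loop of states, actions and rewards maps directly onto the OpenAI Gym interface we'll use below. Here's a minimal sketch in which a placeholder choose_action function stands in for the agent's decision-making (later in this tutorial it will become a Q-table lookup):

import gym

env = gym.make('Taxi-v3')

def choose_action(state):
    # placeholder agent: act randomly for now
    return env.action_space.sample()

state = env.reset()  # the agent observes the environment's initial state
for t in range(100):
    action = choose_action(state)                       # the agent takes an action...
    next_state, reward, done, info = env.step(action)   # ...and receives a reward and the next state
    state = next_state
    if done:
        break

env.close()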
We'll be using the 'Taxi-v3' environment for this tutorial.
You'll need to install:
pip install gym
pip install numpy
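Note: this tutorial uses the classic Gym API, where env.reset() returns a state and env.step() returns four values. If the latest gym release gives you errors with the snippets below (the API changed in gym 0.26), you can pin an older version instead:

pip install "gym<0.26"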
The following snippet will import the necessary packages, and create the Taxi environment:
import numpy as np
import gym
import random
# create Taxi environment
env = gym.make('Taxi-v3')
We'll start by implementing an agent that doesn't learn at all. Instead, it will sample actions at random. This will be our baseline.
The first step is to give our agent an initial state of its environment. A state is how our agent will observe its environment. In Taxi, a state defines the current positions of the taxi, passenger, and pick-up and drop-off locations. Below are examples of three different states for Taxi:
Note: Yellow = taxi, Blue letter = pickup location, Purple letter = drop-off destination
To get the initial state:
# create a new instance of taxi, and get the initial state
state = env.reset()
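If you're curious what that integer state actually encodes, the underlying Taxi environment provides a decode method. Here's a quick sketch (accessing it via env.unwrapped, which assumes the usual wrapper that gym.make puts around Taxi-v3):

# decode the integer state into its components:
# (taxi_row, taxi_col, passenger_location, destination)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(taxi_row, taxi_col, passenger_loc, destination)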
Next, we'll run a for-loop to cycle through the game. At each iteration, our agent will: sample a random action from the available actions, perform that action on the environment, and render the new state.
Here's our random agent script:
import gym
import numpy as np
import random

# create Taxi environment
env = gym.make('Taxi-v3')

# create a new instance of taxi, and get the initial state
state = env.reset()

num_steps = 99
for s in range(num_steps + 1):

    print(f"step: {s} out of {num_steps}")

    # sample a random action from the list of available actions
    action = env.action_space.sample()

    # perform this action on the environment
    env.step(action)

    # print the new state
    env.render()

# end this instance of the taxi environment
env.close()
You can run this and watch your agent make random moves. Not super exciting, but hopefully this helped you get familiar with the OpenAI Gym toolkit.
Next, we'll implement the Q-learning algorithm that will enable our agent to learn from rewards.
Q-learning is a reinforcement learning algorithm that seeks to find the best possible next action given its current state, in order to maximise the reward it receives (the 'Q' in Q-learning stands for quality - i.e. how valuable an action is).
Let's take the following starting state:
Which action (up, down, left, right, pick-up or drop-off) should it take in order to maximise its reward? (Note: blue = pick-up location and purple = drop-off destination)
First, let's take a look at how our agent is 'rewarded' for its actions. Remember in reinforcement learning, we want our agent to take actions that will maximise the possible rewards it receives from its environment.
According to the Taxi documentation:
"…you receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."
Looking back at our original state, the possible actions it can take and the corresponding rewards it will receive are shown below:
In the image above, the agent loses 1 point per timestep it takes. It will also lose 10 points if it uses the pick-up or drop-off action here.
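You can check these rewards yourself by stepping the environment manually. Here's a quick sketch, assuming the standard Taxi-v3 action encoding (0-3 for the four moves, 4 for pick-up, 5 for drop-off):

state = env.reset()

# every move costs 1 point
new_state, reward, done, info = env.step(1)  # action 1 = move north
print(reward)  # -1

# an illegal drop-off (no passenger in the taxi) costs 10 points
new_state, reward, done, info = env.step(5)  # action 5 = drop-off
print(reward)  # -10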
We want our agent to go North towards the pick-up location denoted by a blue R - but how will it know which action to take if they are all equally punishing?
Our agent currently has no way of knowing which action will lead it closest to the blue R. This is where trial-and-error comes in - we'll have our agent take random actions, and observe what rewards it gets (i.e. our agent will explore).
Over many iterations, our agent will have observed that certain sequences of actions will be more rewarding than others. Along the way, our agent will need to keep track of which actions led to what rewards.
A Q-table is simply a look-up table storing values representing the maximum expected future rewards our agent can expect for a certain action in a certain state (known as Q-values). It will tell our agent that when it encounters a certain state, some actions are more likely than others to lead to higher rewards. It becomes a 'cheatsheet' telling our agent what the best action to take is.
The image below illustrates what our 'Q-table' will look like:
Before we begin training our agent, we'll need to initialize our Q-table like so:
state_size = env.observation_space.n # total number of states (S)
action_size = env.action_space.n # total number of actions (A)
# initialize a qtable with 0's for all Q-values
qtable = np.zeros((state_size, action_size))
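For Taxi-v3 this creates a 500 × 6 table: 500 discrete states (25 taxi positions × 5 passenger locations × 4 destinations) and 6 possible actions. You can confirm the shape with a quick check:

print(state_size)    # 500
print(action_size)   # 6
print(qtable.shape)  # (500, 6)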
As our agent explores, it will update the Q-table with the Q-values it finds. To calculate our Q-values, we'll introduce the Q-learning algorithm.
The Q-learning update rule is:

Q(s, a) := Q(s, a) + learning_rate * (reward + discount_rate * max Q(s', a') - Q(s, a))

We won't go into the details here, but you can read more about it in Ch 6 of Sutton & Barto (2018).

The Q-learning algorithm helps our agent update the current Q-value, Q(s, a), with its observations after taking an action - i.e. increase Q if it encountered a positive reward, or decrease Q if it encountered a negative one.

Note that in Taxi, our agent doesn't receive a positive reward until it successfully drops off a passenger (+20 points). Hence even if our agent is heading in the correct direction, there will be a delay in the positive reward it should receive. The discount_rate * max Q(s', a') term in the Q-learning equation addresses this: it adjusts our current Q-value to include a portion of the rewards our agent may expect to receive from the next state s' onwards. Here a' refers to all the possible actions available in that next state. The equation also contains two hyperparameters which we can specify: the learning rate (how strongly each new observation overrides the current Q-value) and the discount rate (how much we value future rewards relative to immediate ones).
Here's our implementation of the Q-learning algorithm:
# hyperparameters to tune
learning_rate = 0.9
discount_rate = 0.8
# Qlearning algorithm: Q(s,a) := Q(s,a) + learning_rate * (reward + discount_rate * max Q(s',a') - Q(s,a))
qtable[state, action] += learning_rate * (reward + discount_rate * np.max(qtable[new_state,:]) - qtable[state,action])
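To make this concrete, suppose Q(s, a) is currently 0, our agent takes a move and receives a reward of -1, and the highest Q-value in the new state is also 0. The update becomes Q(s, a) = 0 + 0.9 * (-1 + 0.8 * 0 - 0) = -0.9, so the Q-value for that state-action pair drops below zero, reflecting the timestep penalty.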
We can let our agent explore to update our Q-table using the Q-learning algorithm. As our agent learns more about the environment, we can let it use this knowledge to take more optimal actions and converge faster - known as exploitation.
During exploitation, our agent will look at its Q-table and select the action with the highest Q-value (instead of a random action). Over time, our agent will need to explore less, and start exploiting what it knows instead.
Here's our implementation of an exploration-exploitation strategy:
# exploration-exploitation tradeoff
epsilon = 1.0      # probability that our agent will explore
decay_rate = 0.01  # of epsilon

if random.uniform(0, 1) < epsilon:
    # explore
    action = env.action_space.sample()
else:
    # exploit
    action = np.argmax(qtable[state, :])

# epsilon decreases exponentially --> our agent will explore less and less
epsilon = np.exp(-decay_rate * episode)
In the example above, we set some value epsilon between 0 and 1. If epsilon is 0.7, there is a 70% chance that on this step our agent will explore instead of exploit. epsilon decays exponentially with each episode, so that our agent explores less and less over time.
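To get a feel for this decay schedule, here's what epsilon looks like after a few different episode counts, using the decay_rate of 0.01 from the snippet above:

import numpy as np

decay_rate = 0.01
for episode in [0, 50, 100, 300]:
    # value of epsilon after this many episodes of exponential decay
    print(episode, round(np.exp(-decay_rate * episode), 3))

# 0 1.0
# 50 0.607
# 100 0.368
# 300 0.05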
We're done with all the building blocks needed for our reinforcement learning agent. The process for training our agent will look like this: for each episode, reset the environment; then at each step, choose an action (explore or exploit), take it, update the Q-table with the observed reward, and move to the new state; finally, decay epsilon at the end of the episode.
Here's the full implementation:
import numpy as np
import gym
import random

def main():

    # create Taxi environment
    env = gym.make('Taxi-v3')

    # initialize q-table
    state_size = env.observation_space.n
    action_size = env.action_space.n
    qtable = np.zeros((state_size, action_size))

    # hyperparameters
    learning_rate = 0.9
    discount_rate = 0.8
    epsilon = 1.0
    decay_rate = 0.005

    # training variables
    num_episodes = 1000
    max_steps = 99  # per episode

    # training
    for episode in range(num_episodes):

        # reset the environment
        state = env.reset()
        done = False

        for s in range(max_steps):

            # exploration-exploitation tradeoff
            if random.uniform(0, 1) < epsilon:
                # explore
                action = env.action_space.sample()
            else:
                # exploit
                action = np.argmax(qtable[state, :])

            # take action and observe reward
            new_state, reward, done, info = env.step(action)

            # Q-learning algorithm
            qtable[state, action] = qtable[state, action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state, :]) - qtable[state, action])

            # update to our new state
            state = new_state

            # if done, finish episode
            if done:
                break

        # decrease epsilon
        epsilon = np.exp(-decay_rate * episode)

    print(f"Training completed over {num_episodes} episodes")
    input("Press Enter to watch trained agent...")

    # watch trained agent
    state = env.reset()
    done = False
    rewards = 0

    for s in range(max_steps):

        print("TRAINED AGENT")
        print(f"Step {s + 1}")

        action = np.argmax(qtable[state, :])
        new_state, reward, done, info = env.step(action)
        rewards += reward
        env.render()
        print(f"score: {rewards}")
        state = new_state

        if done:
            break

    env.close()

if __name__ == "__main__":
    main()
There are many other environments available on OpenAI Gym for you to try (e.g. Frozen Lake). You can also try optimising the implementation above to solve Taxi in fewer episodes.
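If you want a quick way to tell whether a tweak actually helped, one option is to run the trained agent greedily over a batch of episodes and average the total reward. Here's a rough sketch, assuming the qtable, env and max_steps from the script above are still in scope:

def evaluate(env, qtable, num_eval_episodes=100, max_steps=99):
    """Run the greedy policy and return the average total reward per episode."""
    total_reward = 0
    for _ in range(num_eval_episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = np.argmax(qtable[state, :])
            state, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
    return total_reward / num_eval_episodes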
Below are some other useful resources to check out.
Lectures and further reading
Tutorials
Reinforcement learning project ideas