Tutorial: An Introduction to Reinforcement Learning Using OpenAI Gym
May 05, 2021 • Joy Zhang • Tutorial • 8 minutes
Edit 5 Oct 2021: I've added a Colab notebook version of this tutorial here.
In this introductory tutorial, we'll apply reinforcement learning (RL) to train an agent to solve the 'Taxi' environment from OpenAI Gym. We'll cover the basics of reinforcement learning, build a random baseline agent, and then use Q-learning to train an agent that solves the environment.
Taxi is one of many environments available on OpenAI Gym. These environments are used to develop and benchmark reinforcement learning algorithms.
The goal of Taxi is to pick up passengers and drop them off at their destination in as few moves as possible. In this tutorial, you'll start with a taxi agent that takes actions randomly:
…and train the agent to be a better taxi driver using reinforcement learning:
Think about how you might teach a dog a new trick, like telling it to sit:
By continuing to do things that lead to positive outcomes, the dog will learn to sit when it hears the command in order to get its treat. Reinforcement learning is a subdomain of machine learning that involves training an 'agent' (the dog) to learn the correct sequences of actions to take (sitting) in its environment (in response to the command 'sit') in order to maximise its reward (getting a treat). This can be illustrated more formally as:
Source: Sutton & Barto
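In code, that loop of states, actions and rewards maps directly onto the OpenAI Gym interface we'll use below. Here's a minimal sketch in which a placeholder choose_action function stands in for the agent's decision-making (later in this tutorial it will become a Q-table lookup):

import gym

env = gym.make('Taxi-v3')

def choose_action(state):
    # placeholder agent: act randomly for now
    return env.action_space.sample()

state = env.reset()  # the agent observes the environment's initial state
for t in range(100):
    action = choose_action(state)                       # the agent takes an action...
    next_state, reward, done, info = env.step(action)   # ...and receives a reward and the next state
    state = next_state
    if done:
        break

env.close()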
We'll be using the 'Taxi-v3' environment for this tutorial.
You'll need to install:
pip install gym
pip install numpy
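Note: this tutorial uses the classic Gym API, where env.reset() returns a state and env.step() returns four values. If the latest gym release gives you errors with the snippets below (the API changed in gym 0.26), you can pin an older version instead:

pip install "gym<0.26"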
The following snippet will import the necessary packages, and create the Taxi environment:
import numpy as np
import gym
import random
# create Taxi environment
env = gym.make('Taxi-v3')
We'll start by implementing an agent that doesn't learn at all. Instead, it will sample actions at random. This will be our baseline.
The first step is to give our agent an initial state of its environment. A state is how our agent will observe its environment. In Taxi, a state defines the current positions of the taxi, passenger, and pick-up and drop-off locations. Below are examples of three different states for Taxi:
Note: Yellow = taxi, Blue letter = pickup location, Purple letter = drop-off destination
To get the initial state:
# create a new instance of taxi, and get the initial state
state = env.reset()
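If you're curious what that integer state actually encodes, the underlying Taxi environment provides a decode method. Here's a quick sketch (accessing it via env.unwrapped, which assumes the usual wrapper that gym.make puts around Taxi-v3):

# decode the integer state into its components:
# (taxi_row, taxi_col, passenger_location, destination)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(taxi_row, taxi_col, passenger_loc, destination)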
Next, we'll run a for-loop to cycle through the game. At each iteration, our agent will: sample a random action from the available actions, perform that action on the environment, and render the new state.
Here's our random agent script:
import gym
import numpy as np
import random

# create Taxi environment
env = gym.make('Taxi-v3')

# create a new instance of taxi, and get the initial state
state = env.reset()

num_steps = 99
for s in range(num_steps + 1):

    print(f"step: {s} out of {num_steps}")

    # sample a random action from the list of available actions
    action = env.action_space.sample()

    # perform this action on the environment
    env.step(action)

    # print the new state
    env.render()

# end this instance of the taxi environment
env.close()
You can run this and watch your agent make random moves. Not super exciting, but hopefully this helped you get familiar with the OpenAI Gym toolkit.
Next, we'll implement the Q-learning algorithm that will enable our agent to learn from rewards.
Q-learning is a reinforcement learning algorithm that seeks to find the best possible next action given its current state, in order to maximise the reward it receives (the 'Q' in Q-learning stands for quality - i.e. how valuable an action is).
Let's take the following starting state:
Which action (up, down, left, right, pick-up or drop-off) should it take in order to maximise its reward? (Note: blue = pick-up location and purple = drop-off destination)
First, let's take a look at how our agent is 'rewarded' for its actions. Remember in reinforcement learning, we want our agent to take actions that will maximise the possible rewards it receives from its environment.
According to the Taxi documentation:
"…you receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."
Looking back at our original state, the possible actions it can take and the corresponding rewards it will receive are shown below:
In the image above, the agent loses 1 point per timestep it takes. It will also lose 10 points if it uses the pick-up or drop-off action here.
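You can check these rewards yourself by stepping the environment manually. Here's a quick sketch, assuming the standard Taxi-v3 action encoding (0-3 for the four moves, 4 for pick-up, 5 for drop-off):

state = env.reset()

# every move costs 1 point
new_state, reward, done, info = env.step(1)  # action 1 = move north
print(reward)  # -1

# an illegal drop-off (no passenger in the taxi) costs 10 points
new_state, reward, done, info = env.step(5)  # action 5 = drop-off
print(reward)  # -10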
We want our agent to go North towards the pick-up location denoted by a blue R - but how will it know which action to take if they are all equally punishing?
Our agent currently has no way of knowing which action will lead it closest to the blue R. This is where trial-and-error comes in - we'll have our agent take random actions, and observe what rewards it gets (i.e. our agent will explore).
Over many iterations, our agent will have observed that certain sequences of actions will be more rewarding than others. Along the way, our agent will need to keep track of which actions led to what rewards.
A Q-table is simply a look-up table storing values representing the maximum expected future rewards our agent can expect for a certain action in a certain state (known as Q-values). It will tell our agent that when it encounters a certain state, some actions are more likely than others to lead to higher rewards. It becomes a 'cheatsheet' telling our agent what the best action to take is.
The image below illustrates what our 'Q-table' will look like:
Before we begin training our agent, we'll need to initialize our Q-table like so:
state_size = env.observation_space.n # total number of states (S)
action_size = env.action_space.n # total number of actions (A)
# initialize a qtable with 0's for all Q-values
qtable = np.zeros((state_size, action_size))
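For Taxi-v3 this creates a 500 × 6 table: 500 discrete states (25 taxi positions × 5 passenger locations × 4 destinations) and 6 possible actions. You can confirm the shape with a quick check:

print(state_size)    # 500
print(action_size)   # 6
print(qtable.shape)  # (500, 6)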
As our agent explores, it will update the Q-table with the Q-values it finds. To calculate our Q-values, we'll introduce the Q-learning algorithm.
The Q-learning update rule is:

Q(s, a) := Q(s, a) + learning_rate * (reward + discount_rate * max Q(s', a') - Q(s, a))

We won't go into the details here, but you can read more about it in Ch 6 of Sutton & Barto (2018).

The Q-learning algorithm helps our agent update the current Q-value, Q(s, a), with its observations after taking an action - i.e. increase Q if it encountered a positive reward, or decrease Q if it encountered a negative one.

Note that in Taxi, our agent doesn't receive a positive reward until it successfully drops off a passenger (+20 points). Hence even if our agent is heading in the correct direction, there will be a delay in the positive reward it should receive. The discount_rate * max Q(s', a') term in the Q-learning equation addresses this: it adjusts our current Q-value to include a portion of the rewards our agent may expect to receive from the next state s' onwards. Here a' refers to all the possible actions available in that next state. The equation also contains two hyperparameters which we can specify: the learning rate (how strongly each new observation overrides the current Q-value) and the discount rate (how much we value future rewards relative to immediate ones).
Here's our implementation of the Q-learning algorithm:
# hyperparameters to tune
learning_rate = 0.9
discount_rate = 0.8
# Qlearning algorithm: Q(s,a) := Q(s,a) + learning_rate * (reward + discount_rate * max Q(s',a') - Q(s,a))
qtable[state, action] += learning_rate * (reward + discount_rate * np.max(qtable[new_state,:]) - qtable[state,action])
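To make this concrete, suppose Q(s, a) is currently 0, our agent takes a move and receives a reward of -1, and the highest Q-value in the new state is also 0. The update becomes Q(s, a) = 0 + 0.9 * (-1 + 0.8 * 0 - 0) = -0.9, so the Q-value for that state-action pair drops below zero, reflecting the timestep penalty.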
We can let our agent explore to update our Q-table using the Q-learning algorithm. As our agent learns more about the environment, we can let it use this knowledge to take more optimal actions and converge faster - known as exploitation.
During exploitation, our agent will look at its Q-table and select the action with the highest Q-value (instead of a random action). Over time, our agent will need to explore less, and start exploiting what it knows instead.
Here's our implementation of an exploration-exploitation strategy:
# exploration-exploitation tradeoff
epsilon = 1.0      # probability that our agent will explore
decay_rate = 0.01  # of epsilon

if random.uniform(0, 1) < epsilon:
    # explore
    action = env.action_space.sample()
else:
    # exploit
    action = np.argmax(qtable[state, :])

# epsilon decreases exponentially --> our agent will explore less and less
epsilon = np.exp(-decay_rate * episode)
In the example above, we set some value epsilon between 0 and 1. If epsilon is 0.7, there is a 70% chance that on this step our agent will explore instead of exploit. epsilon decays exponentially with each episode, so that our agent explores less and less over time.
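To get a feel for this decay schedule, here's what epsilon looks like after a few different episode counts, using the decay_rate of 0.01 from the snippet above:

import numpy as np

decay_rate = 0.01
for episode in [0, 50, 100, 300]:
    # value of epsilon after this many episodes of exponential decay
    print(episode, round(np.exp(-decay_rate * episode), 3))

# 0 1.0
# 50 0.607
# 100 0.368
# 300 0.05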
We're done with all the building blocks needed for our reinforcement learning agent. The process for training our agent will look like this: for each episode, reset the environment; then at each step, choose an action (explore or exploit), take it, update the Q-table with the observed reward, and move to the new state; finally, decay epsilon at the end of the episode.
Here's the full implementation:
import numpy as np
import gym
import random

def main():

    # create Taxi environment
    env = gym.make('Taxi-v3')

    # initialize q-table
    state_size = env.observation_space.n
    action_size = env.action_space.n
    qtable = np.zeros((state_size, action_size))

    # hyperparameters
    learning_rate = 0.9
    discount_rate = 0.8
    epsilon = 1.0
    decay_rate = 0.005

    # training variables
    num_episodes = 1000
    max_steps = 99  # per episode

    # training
    for episode in range(num_episodes):

        # reset the environment
        state = env.reset()
        done = False

        for s in range(max_steps):

            # exploration-exploitation tradeoff
            if random.uniform(0, 1) < epsilon:
                # explore
                action = env.action_space.sample()
            else:
                # exploit
                action = np.argmax(qtable[state, :])

            # take action and observe reward
            new_state, reward, done, info = env.step(action)

            # Q-learning algorithm
            qtable[state, action] = qtable[state, action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state, :]) - qtable[state, action])

            # update to our new state
            state = new_state

            # if done, finish episode
            if done:
                break

        # decrease epsilon
        epsilon = np.exp(-decay_rate * episode)

    print(f"Training completed over {num_episodes} episodes")
    input("Press Enter to watch trained agent...")

    # watch trained agent
    state = env.reset()
    done = False
    rewards = 0

    for s in range(max_steps):

        print("TRAINED AGENT")
        print(f"Step {s + 1}")

        action = np.argmax(qtable[state, :])
        new_state, reward, done, info = env.step(action)
        rewards += reward
        env.render()
        print(f"score: {rewards}")
        state = new_state

        if done:
            break

    env.close()

if __name__ == "__main__":
    main()
There are many other environments available on OpenAI Gym for you to try (e.g. Frozen Lake). You can also try optimising the implementation above to solve Taxi in fewer episodes.
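If you want a quick way to tell whether a tweak actually helped, one option is to run the trained agent greedily over a batch of episodes and average the total reward. Here's a rough sketch, assuming the qtable, env and max_steps from the script above are still in scope:

def evaluate(env, qtable, num_eval_episodes=100, max_steps=99):
    """Run the greedy policy and return the average total reward per episode."""
    total_reward = 0
    for _ in range(num_eval_episodes):
        state = env.reset()
        for _ in range(max_steps):
            action = np.argmax(qtable[state, :])
            state, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
    return total_reward / num_eval_episodes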
Below are some other useful resources to check out.
Lectures and further reading
Tutorials
Reinforcement learning project ideas