Solve the Taxi-v3 environment using Q-learning, ensuring efficient AI-driven transportation.

In the quest for efficiency and effectiveness in urban transportation, finding the optimal routes to take passengers from their initial locations to their desired destinations is paramount. This challenge is not just about reducing travel time; it's about enhancing the overall experience for both drivers and passengers, ensuring safety, and minimizing environmental impact.

You have been asked to revolutionize the way taxis navigate the urban landscape, ensuring passengers reach their destinations swiftly, safely, and satisfactorily. As an initial step, your goal is to build a reinforcement learning agent that solves this problem within a simulated environment.

The Taxi-v3 environment

The Taxi-v3 environment is a strategic simulation, offering a grid-based arena where a taxi navigates to address daily challenges akin to those faced by a taxi driver. This environment is defined by a 5x5 grid where the taxi's mission involves picking up a passenger from one of four specific locations (marked as Red, Green, Yellow, and Blue) and dropping them off at another designated spot. The goal is to accomplish this with minimal time on the road to maximize rewards, emphasizing the need for route optimization and efficient decision-making for passenger pickup and dropoff.

Key Components:

  • Action Space: Comprises six actions where 0 moves the taxi south, 1 north, 2 east, 3 west, 4 picks up a passenger, and 5 drops off a passenger.
  • Observation Space: Comprises 500 discrete states, accounting for 25 taxi positions, 5 potential passenger locations, and 4 destinations.
  • Rewards System: Includes a penalty of -1 for each step taken without other rewards, +20 for successful passenger delivery, and -10 for illegal pickup or dropoff actions. Actions resulting in no operation, like hitting a wall, also incur a time step penalty.
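As a quick sanity check, a short sketch along the following lines (assuming a standard Gymnasium installation) prints the two spaces described above and confirms the per-step penalty:

# Inspect the Taxi-v3 action and observation spaces
import gymnasium as gym

env = gym.make("Taxi-v3")
print(env.action_space)       # Discrete(6): 0-3 move the taxi, 4 picks up, 5 drops off
print(env.observation_space)  # Discrete(500): 25 positions x 5 passenger locations x 4 destinations

state, info = env.reset(seed=42)
next_state, reward, terminated, truncated, info = env.step(0)  # move south
print(reward)                 # -1: the standard time-step penalty
env.close()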

Project Description:

Navigate the bustling streets of a virtual city as a taxi driver in this engaging reinforcement learning project. Utilize Q-learning to optimize your routes, ensuring passengers are efficiently picked up and dropped off. Train a reinforcement learning (RL) agent to solve the Taxi-v3 Gymnasium environment, ensuring optimal AI-driven transportation!

Project Instructions:

  • Train an agent over 2,000 episodes, allowing for a maximum of 100 actions per episode (max_actions), utilizing Q-learning. Record the total rewards achieved in each episode and save these in a list named episode_returns
  • What are the learned Q-values? Save these in a numpy array named q_table
  • What is the learned policy? Save it in a dictionary named policy
  • Test the agent's learned policy for one episode, starting with a seed of 42. Save the encountered states from env.render() as frames in a list named frames, and the sum of collected rewards in a variable named episode_total_reward. Make sure your agent does not execute more than 16 actions to solve the episode. If your learning process is efficient, the episode_total_reward should be at least 4
  • Execute the last provided cell to visualize your agent's performance in navigating the environment effectively. Please note that it might take up to one minute to render

1. Training the agent with Q-learning

Train the agent for 2,000 episodes with Q-learning, limiting to 100 actions per episode and recording the rewards per episode in episode_returns.

# Re-run this cell to install and import the necessary libraries and load the required variables
import numpy as np
import gymnasium as gym
import imageio
from IPython.display import Image
from gymnasium.utils import seeding

# Initialize the Taxi-v3 environment
env = gym.make("Taxi-v3", render_mode='rgb_array')

# Seed the environment for reproducibility
env.np_random, _ = seeding.np_random(42)
env.action_space.seed(42)
np.random.seed(42)

# Maximum number of actions per training episode
max_actions = 100 

Q-table initialization

  • The Q-table is initialized with zeros, with rows equal to the number of states in the environment and columns equal to the number of possible actions
  • env.observation_space.n gives the number of states in the environment
  • env.action_space.n returns the number of possible actions
# Parameters for training
epsilon = 1.0
min_epsilon = 0.01
epsilon_decay = 0.001
alpha = 0.1  # Learning rate
gamma = 1 # Discount factor

# Determine the environment's number of states and actions
num_states = env.observation_space.n
num_actions = env.action_space.n

# Initialize the Q-table with zeros
q_table = np.zeros((num_states, num_actions))
# Epsilon-greedy strategy function
def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        return env.action_space.sample()  # Explore
    else:
        return np.argmax(q_table[state, :])  # Exploit

Q-learning update rule

  • The Q-learning update rule is represented by the equation below:

    Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_a' Q(s', a'))

  • Access the Q-value of a state s and action a using q_table[s, a]

# Q-learning update function
def q_learning_update(state, action, reward, next_state):
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])  # Best Q-value achievable from the next state
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

Episode loop

  • Construct a loop to iterate through 2,000 episodes. Inside this loop, manage each episode's actions with another loop that runs until 100 actions are reached or the episode ends
  • To initialize the environment, you can use env.reset() which returns an initial state along with some auxiliary info
  • The env.action_space.sample() helps in selecting a random action from the action space.
  • To execute a specific action, you can use env.step(action)

Action selection strategies

  • When selecting actions during training, think of action selection strategies that balance exploration and exploitation (epsilon-greedy, decayed epsilon-greedy) for an effective learning process
  • To determine the balance between exploration and exploitation, define an exploration probability epsilon and generate a random number; if this number falls below epsilon, the agent explores; otherwise, it exploits
  • To generate a random number between 0 and 1, you can use np.random.rand()
# List to store the total reward per episode
episode_returns = []

# Training loop
for episode in range(2000):
    state, info = env.reset()
    terminated = False
    total_reward = 0

    for i in range(max_actions):
        action = epsilon_greedy(state)
        next_state, reward, terminated, truncated, info = env.step(action)
        q_learning_update(state, action, reward, next_state)
        state = next_state
        total_reward += reward
        if terminated:
            break
          
    episode_returns.append(total_reward)
    
    # Decay epsilon linearly, keeping it above the minimum exploration rate
    epsilon = max(min_epsilon, epsilon - epsilon_decay)

2. Analyzing learned Q-values and policy

Post-training, inspect q_table for learned Q-values. Derive and save the optimal action per state in policy.

Defining the policy

  • In this project, the policy is a dictionary that maps each state in the environment to the action that yields the highest Q-value.

Extracting the policy

  • Use the np.argmax() function on the q_table for each state to find the action with the highest Q-value.
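For example, a minimal sketch (assuming the q_table and num_states variables defined in the cells above) that builds the policy dictionary:

# Greedy policy: map every state to the action with the highest learned Q-value
policy = {state: int(np.argmax(q_table[state])) for state in range(num_states)}

Each key is then one of the 500 encoded states, and each value is the greedy action (0-5) the agent should take in that state.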