
Reinforcement Learning with Gymnasium: A Practical Guide

Understand the basics of Reinforcement Learning (RL) and explore the Gymnasium software package to build and test RL algorithms using Python.
Dec 25, 2024  · 30 min read

Reinforcement Learning (RL) is one of the three main machine learning paradigms, the other two being supervised and unsupervised learning. In RL, an agent learns to interact with its environment to maximize its cumulative reward. Through trial and error, it learns the optimal action to take under different environmental conditions. Reinforcement Learning from Human Feedback (RLHF) additionally incorporates human feedback into the training process, allowing the agent to adjust its behavior based on human preferences.

RL is applied to problems like self-driving cars, automated trading, computer players in video games, robot training, and more. When deep neural networks are used to implement RL algorithms, the approach is called Deep Reinforcement Learning.

In this tutorial, I’ll show you how to get started with Gymnasium, an open-source Python library for developing and comparing reinforcement learning algorithms. I'll demonstrate how to set it up, explore various RL environments, and use Python to build a simple agent to implement an RL algorithm. 

What is Gymnasium?

Gymnasium is an open-source Python library designed to support the development of RL algorithms. To facilitate research and development in RL, Gymnasium provides: 

  • A wide variety of environments, from simple games to problems mimicking real-life scenarios.
  • Streamlined APIs and wrappers to interface with the environments.
  • The ability to create custom environments and take advantage of the API framework.

Developers can build RL algorithms and use API calls for tasks like:

  • Passing the agent’s chosen action to the environment.
  • Knowing the environment’s state and reward following each action. 
  • Training the model.
  • Testing the model’s performance.

OpenAI’s Gym versus Farama’s Gymnasium

OpenAI hasn’t committed significant resources to developing Gym because it was not a business priority for the company. The Farama Foundation was created to standardize and maintain RL libraries over the long term. Gymnasium is the Farama Foundation’s fork of OpenAI’s Gym. Gymnasium 0.26.2 is a drop-in replacement for Gym 0.26.2. With the fork, Farama aims to add functional (in addition to class-based) methods for all API calls, support vector environments, and improve the wrappers. The overall goal is to make the framework cleaner and more efficient.
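
In practice, this means that code written against recent versions of Gym can usually be migrated by changing only the import, as in the minimal illustration below (much older Gym code may need further changes, discussed in the troubleshooting section later in this tutorial):

# Before: import gym
# After: the Farama fork, with the rest of the code unchanged
import gymnasium as gym

env = gym.make("CartPole-v1")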


Setting Up Gymnasium

Gymnasium needs specific versions (not necessarily the latest releases) of various dependencies, like NumPy and PyTorch. Thus, we recommend creating a fresh Conda or venv environment (or a fresh notebook) in which to install Gymnasium and run RL programs.

You can use this DataLab workbook to follow along with the tutorial.

Installing Gymnasium

To install Gymnasium on a server or local machine, run: 

$ pip install gymnasium 

To install using a Notebook like Google’s Colab or DataCamp’s DataLab, use:

!pip install gymnasium

The above command installs Gymnasium and the correct versions of dependencies.

Exploring Gymnasium environments

As of November 2024, Gymnasium includes over 60 inbuilt environments. To browse the available inbuilt environments, iterate over the keys of the gym.envs.registry dictionary, as illustrated in the example below:

import gymnasium as gym
for i in gym.envs.registry.keys():
	print(i)

You can also visit the Gymnasium homepage. The left-hand column has links to all the environments. The webpage of each environment includes details about it, such as actions, states, etc. 

Environments are organized into categories like Classic Control, Box2D, and more. Below, I list some of the common environments in each group:

  • Classic Control: These are canonical environments used in RL development; they form the basis of many textbook examples. They give the right mix of complexity and simplicity to test and benchmark new RL algorithms. Classic control environments in Gymnasium include: 
    • Acrobot
    • Cart Pole
    • Mountain Car Discrete
    • Mountain Car Continuous
    • Pendulum
  • Box2D: Box2D is a 2D Physics Engine for Games. Environments based on this engine include simple games like:
    • Lunar Lander
    • Car Racing
  • ToyText: These are small and simple environments often used to debug RL algorithms. Many of these environments are based on the small grid world model and simple card games. Examples include: 
    • Blackjack
    • Taxi
    • Frozen Lake
  • MuJoCo: Multi-Joint dynamics with Contact (MuJoCo) is an open-source physics engine that simulates environments for applications like robotics, biomechanics, ML, etc. MuJoCo environments in Gymnasium include:
    • Ant
    • Hopper
    • Humanoid
    • Swimmer
    • And more

In addition to the built-in environments, Gymnasium can be used with many external environments using the same API. 

We’ll use one of the canonical Classic Control environments in this tutorial. To import a specific environment, use the .make() command and pass the name of the environment as an argument. For example, to create a new environment based on CartPole (version 1), use the command below: 

import gymnasium as gym
env = gym.make("CartPole-v1")

Understanding Reinforcement Learning Concepts in Gymnasium

In a nutshell, Reinforcement Learning consists of an agent (like a robot) that interacts with its environment. A policy decides the agent’s actions. Depending on the agent’s actions, the environment gives a reward (or penalty) at each timestep. The agent uses RL to figure out the optimal policy that maximizes the total rewards the agent earns. 

Components of an RL environment

The following are the key components of an RL environment: 

  • Environment: The external system, world, or context. The agent interacts with the environment in a series of timesteps. In each timestep, based on the agent’s action, the environment:
    • Gives a reward (or penalty) 
    • Decides the next state 
  • State: A mathematical representation of the current configuration of the environment. 
    • For example, the state of a pendulum environment can include the pendulum's position and angular velocity at each timestep. 
    • Terminal state: A state that does not lead to new/other states. 
  • Agent: The algorithm that observes the environment and takes various actions based on this observation. The agent’s goal is to maximize its rewards. 
    • For example, the agent decides how hard and in what direction to push the pendulum.  
  • Observation: A mathematical representation of the agent’s view of the environment, acquired, for example, using sensors. 
  • Action: The decision made by the agent before proceeding to the next step. The action affects the next state of the environment and earns the agent a reward. 
  • Reward: The feedback from the environment to the agent. It can be positive or negative, depending on the action and the state of the environment. 
  • Return: The cumulative reward accumulated over future timesteps. Rewards from future timesteps can be discounted using a discount factor.
  • Policy: The agent’s strategy for choosing an action in each state. It is typically represented as a probability matrix, P, which maps states to actions.
    • Given a finite set of m possible states and n possible actions, the element P_mn of the matrix denotes the probability of taking action a_n in state s_m.
  • Episode: The series of timesteps from the (randomized) initial state until the agent reaches a terminal state.
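
To make these components concrete, the minimal loop below runs one full episode of CartPole, with a random action standing in for a learned policy. It is a sketch of the agent-environment interaction, not the training method used later in this tutorial:

import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset()  # start a new episode in a (randomized) initial state

episode_over = False
episode_return = 0.0
while not episode_over:
    action = env.action_space.sample()  # a random action in place of a policy
    observation, reward, terminated, truncated, info = env.step(action)
    episode_return += reward                # accumulate the stepwise rewards
    episode_over = terminated or truncated  # terminal state reached or episode truncated

print("episode return:", episode_return)
env.close()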

Observation space and action space

The observation is the information that the agent gathers about the environment. An agent, for example, a robot, could collect environmental information using sensors. Ideally, the agent should be able to observe the complete state, which describes all the aspects of the environment. In practice, the agent uses its observations as a proxy for the state. Thus, the observations decide the agent’s actions. 

A space is analogous to a mathematical set. The space of items X includes all possible instances of X. The space of X also defines the structure (syntax and format) of all items of type X. Each Gymnasium environment has two spaces, the action space, action_space, and the observation space, observation_space. Both the action and observation spaces derive from the parent gymnasium.spaces.Space superclass. 

Observation space 

The observation space is the space that includes all possible observations. It also defines the format in which observations are stored. The observation space is typically represented as an object of type Box, which describes an n-dimensional array of observations and specifies the lower and upper bounds of each dimension. You can view the observation space of an environment using its observation_space attribute:

print("observation space: ", env.observation_space)

In the case of the CartPole-v1 environment, the output looks like the example below: 

observation space:  Box([-4.8 -inf -0.41887903 -inf], [4.8 inf 0.41887903 inf], (4,), float32)

In this example, the CartPole-v1 observation space has 4 dimensions. The 4 elements of the observation array are:

  • Cart position - varies between -4.8 and +4.8
  • Cart velocity - ranges from -∞ to +∞
  • Pole angle - varies between -0.4189 and +0.4189 radians
  • Pole angular velocity - ranges from -∞ to +∞

To see an example of an individual observation array, use the .reset() command. 

observation, info = env.reset()
print("observation: ", observation)

In the case of the CartPole-v1 environment, the output looks like the example below: 

observation:  [ 0.03481963 -0.0277232   0.01703267 -0.04870504]

The four elements of this array correspond to the four observed quantities (cart position, cart velocity, pole angle, pole angular velocity ), as explained earlier. 

Action space

The action space includes all possible actions that the agent can take. The action space also defines the format in which actions are represented. You can view the action space of an environment using its action_space attribute:

print("action space: ", env.action_space)

In the case of the CartPole-v1 environment, the output looks like the example below: 

action space:  Discrete(2)

In the case of the CartPole-v1 environment, the action space is discrete. There are a total of two actions that the agent can take:

  • 0: Push the cart to the left
  • 1: Push the cart to the right
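
You can sample valid actions directly from this space, which is useful for random baselines and for sanity-checking an agent’s outputs. A small illustration:

print(env.action_space.n)            # 2 possible actions
print(env.action_space.sample())     # a random valid action: 0 or 1
print(env.action_space.contains(1))  # True: 1 is a valid action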

Building Your First RL Agent with Gymnasium

In the previous sections, we explored the basic concepts of RL and Gymnasium. This section shows you how to use Gymnasium to build an RL agent. 
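
The code in this section and the next assumes the following imports, a minimal set covering the Gymnasium, NumPy, and PyTorch calls used below:

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch import distributions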

Creating and resetting the environment

The first step is to create an instance of the environment. To create new environments, use the .make() method. 

env = gym.make('CartPole-v1')

The agent’s interactions change the environment’s state. The .reset() method resets the environment to an initial state. By default, the environment is initialized to a random state. You can pass a seed argument to the .reset() method to initialize the environment to the same state every time the program runs. The code below shows how to do this:

SEED = 1111
env.reset(seed=SEED)

The sampling of actions also involves randomness. To control this randomness and get a fully reproducible training path, we can seed the random generators of NumPy and PyTorch:

np.random.seed(SEED)
torch.manual_seed(SEED)
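
Optionally, you can also seed the action space itself so that env.action_space.sample() (useful for random baselines) becomes reproducible. This step is not required for the training code below:

env.action_space.seed(SEED)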

Random versus intelligent actions

At each step of a Markov decision process, the agent can choose an action at random and explore the environment until it reaches a terminal state. When actions are chosen at random:

  • It can take a long time to reach the terminal state.
  • The cumulative rewards are much lower than what they could have been.

It is more efficient to train the agent to optimize its action selection based on its previous experience of interacting with the environment, so that it maximizes the long-term rewards.

The untrained agent starts with random actions based on a randomly initialized policy. This policy is typically represented as a neural network. During training, the agent learns the optimal policy that maximizes the rewards. In RL, the training process is also called policy optimization. 

There are various methods of policy optimization. The Bellman equations describe how to calculate the value of RL policies and determine the optimal policy. In this tutorial, we’ll use a simple technique called policy gradients. Other methods exist, such as Proximal Policy Optimization (PPO).
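
For reference, the policy gradient approach used below (often called REINFORCE) estimates the gradient of the expected return J(θ) of a policy π_θ as:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]

Here, G_t is the (discounted) return from timestep t onward, and π_θ(a_t | s_t) is the probability the policy assigns to the action taken at that step. This corresponds to the loss we compute later: the negative sum of the stepwise returns multiplied by the log probabilities of the actions.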

Implementing a Simple Policy Gradient Agent

To build an RL agent that uses policy gradients, we create a neural network to implement the policy, write functions to calculate the returns and loss from the stepwise rewards and the action probabilities, and iteratively update the policy using standard backpropagation techniques. 

Setting up the policy network

We use a neural network to implement the policy. Because CartPole-v1 is a simple environment, we use a neural network with:

  • Input dimensions equal to the dimensionality of the environment’s observation space. 
  • A single hidden layer (with 128 neurons in the code below). 
  • Output dimensions equal to the dimensionality of the environment’s action space.

Thus, the function of the policy network is to map observed states to actions. Given an input observation, it predicts the right action. The code below implements the policy network:

class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)   # observations -> hidden features
        self.layer2 = nn.Linear(hidden_dim, output_dim)  # hidden features -> action scores (logits)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.layer1(x)
        x = self.dropout(x)
        x = F.relu(x)
        x = self.layer2(x)
        return x
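
As a quick sanity check (not part of the training loop), you can pass a single observation through the network and inspect the output scores for the two CartPole actions:

policy = PolicyNetwork(input_dim=4, hidden_dim=128, output_dim=2, dropout=0.5)
observation, info = env.reset(seed=SEED)
logits = policy(torch.FloatTensor(observation).unsqueeze(0))
print(logits.shape)  # torch.Size([1, 2]): one score per action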

Reward collection and the forward pass

As mentioned, in each step of the Markov process, the environment gives a reward based on the agent’s action and state. The goal in RL is to maximize the total return. 

  • The stepwise return at a given timestep is the cumulative (discounted) sum of the rewards from that timestep until the end of the episode. This is what the calculate_stepwise_returns() function below computes. 
  • The total return of an episode is obtained by accumulating all the stepwise rewards from that episode; it equals the undiscounted return measured from the first timestep. 

In practice, while accumulating rewards, it is common to: 

  • Adjust future rewards using a discount factor. 
  • Normalize the array of stepwise returns to ensure smooth and stable training. 

The code below shows how to do this: 

def calculate_stepwise_returns(rewards, discount_factor):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + R * discount_factor
        returns.insert(0, R)
    returns = torch.tensor(returns)
    normalized_returns = (returns - returns.mean()) / returns.std()
    return normalized_returns
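
For intuition, calling this function on a short, made-up reward sequence shows the effect of discounting and normalization (the rewards here are illustrative only):

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
print(calculate_stepwise_returns(rewards, discount_factor=0.99))
# Earlier timesteps accumulate more future reward, so they have larger raw returns;
# after normalization, the values have zero mean and unit standard deviation.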

The forward pass consists of running the agent based on the current policy until it reaches a terminal state and collecting the stepwise rewards and action probabilities. The steps below explain how to implement the forward pass: 

  • Reset the environment to an initial state. 
  • Initialize buffers to store the log probabilities of the actions, the stepwise rewards, and the cumulative episode return. 
  • Use the .step() function to iteratively run the agent in the environment until it terminates:
    • Get the observation of the environment’s state.
    • Pass the observation through the policy network to obtain a score (logit) for each action.
    • Apply the softmax function to convert these scores into action probabilities.
    • Construct a categorical probability distribution from these probabilities.
    • Sample this distribution to get the agent’s action.
    • Compute the log probability of the sampled action under the distribution. 
  • Append the log probability of the actions and the rewards from each step to their respective buffers. 
  • Estimate the normalized and discounted values of the returns at each step based on the rewards. 

The code below implements the forward pass:

def forward_pass(env, policy, discount_factor):
    log_prob_actions = []
    rewards = []
    done = False
    episode_return = 0
    policy.train()
    observation, info = env.reset()
    while not done:
        observation = torch.FloatTensor(observation).unsqueeze(0)
        action_pred = policy(observation)
        action_prob = F.softmax(action_pred, dim = -1)
        dist = distributions.Categorical(action_prob)
        action = dist.sample()
        log_prob_action = dist.log_prob(action)
        observation, reward, terminated, truncated, info = env.step(action.item())
        done = terminated or truncated
        log_prob_actions.append(log_prob_action)
        rewards.append(reward)
        episode_return += reward
    log_prob_actions = torch.cat(log_prob_actions)
    stepwise_returns = calculate_stepwise_returns(rewards, discount_factor)
    return episode_return, stepwise_returns, log_prob_actions

Updating policy based on rewards

The loss is the quantity on which we apply gradient descent. The goal in RL is to maximize the return, so we use the negative of the expected return as the loss. The expected return is estimated as the sum, over all timesteps, of the product of the stepwise return and the log probability of the action taken at that step. The code below calculates the loss:

def calculate_loss(stepwise_returns, log_prob_actions):
    loss = -(stepwise_returns * log_prob_actions).sum()
    return loss

To update the policy, you run backpropagation with respect to the loss function. The update_policy() function below invokes the calculate_loss() function. It then runs backpropagation on this loss to update the policy parameters, i.e., the weights of the policy network.

def update_policy(stepwise_returns, log_prob_actions, optimizer):
    stepwise_returns = stepwise_returns.detach()
    loss = calculate_loss(stepwise_returns, log_prob_actions)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Updating the policy parameters using the gradient of the expected return is called the policy gradient method.

Training the policy

We now have all the components needed to train and evaluate the policy. We implement the training loop as explained in the following steps:  

Before starting, we declare the hyperparameters, instantiate a policy, and create an optimizer:

  • Declare the hyperparameters as Python constants:
    • MAX_EPOCHS is the maximum number of iterations we are prepared to run to train the policy. 
    • DISCOUNT_FACTOR decides the relative significance of rewards from future time steps. A discount factor of 1 means all rewards are equally important, while a value of 0 means only the reward from the current time step is important. 
    • N_TRIALS is the number of episodes over which we average the returns to evaluate the agent’s performance. We decide the training is successful if the average return over N_TRIALS episodes is above the threshold. 
    • REWARD_THRESHOLD: If the policy can achieve a return greater than the threshold, it is considered successful. 
    • DROPOUT decides the fraction of the weights that should be randomly zeroed. The dropout function randomly sets a fraction of the model weights to zero. This reduces reliance on specific neurons and prevents overfitting, making the network more robust.
    • LEARNING_RATE decides how much the policy parameters can be modified in each step. The update to the parameters in each iteration is the product of the gradient and the learning rate. 
  • Define the policy as an instance of the PolicyNetwork class (implemented earlier). 
  • Create an optimizer using the Adam algorithm and the learning rate. 

To train the policy, we iteratively run the training steps till the average return (over N_TRIALS) is greater than the reward threshold:

  • For each episode, run the forward pass once. Collect the log probability of actions, the stepwise returns, and the total return from that episode. Accumulate the episodic returns in an array. 
  • Calculate the loss using the log probabilities and the stepwise returns. Run the backpropagation on the loss. Use the optimizer to update the policy parameters. 
  • Check if the average return over N_TRIALS exceeds the reward threshold. 

The code below implements these steps:

def main(): 
    MAX_EPOCHS = 500
    DISCOUNT_FACTOR = 0.99
    N_TRIALS = 25
    REWARD_THRESHOLD = 475
    PRINT_INTERVAL = 10
    INPUT_DIM = env.observation_space.shape[0]
    HIDDEN_DIM = 128
    OUTPUT_DIM = env.action_space.n
    DROPOUT = 0.5
    episode_returns = []
    policy = PolicyNetwork(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT)
    LEARNING_RATE = 0.01
    optimizer = optim.Adam(policy.parameters(), lr = LEARNING_RATE)
    for episode in range(1, MAX_EPOCHS+1):
        episode_return, stepwise_returns, log_prob_actions = forward_pass(env, policy, DISCOUNT_FACTOR)
        _ = update_policy(stepwise_returns, log_prob_actions, optimizer)
        episode_returns.append(episode_return)
        mean_episode_return = np.mean(episode_returns[-N_TRIALS:])
        if episode % PRINT_INTERVAL == 0:
            print(f'| Episode: {episode:3} | Mean Rewards: {mean_episode_return:5.1f} |')
        if mean_episode_return >= REWARD_THRESHOLD:
            print(f'Reached reward threshold in {episode} episodes')
            break

Lastly, invoke the main() function to train the policy:

main()

Use this DataLab workbook to run the above algorithm directly and solve the CartPole environment using RL. 

Advanced Techniques in Gymnasium

Having demonstrated how to implement an RL algorithm, we now discuss some advanced techniques commonly used in practice. 

Using pre-built architectures

Implementing RL algorithms from scratch is a long and difficult process, especially for complex environments and state-of-the-art policies. 

A more practical alternative is to use software like Stable Baselines3. It comes with tried and tested implementations of RL algorithms. It includes pre-trained agents, training scripts, evaluation tools, and modules to plot graphs and record videos. 
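
For example, training a PPO agent on CartPole takes only a few lines with Stable Baselines3. The sketch below assumes the stable-baselines3 package is installed, and the timestep budget is arbitrary:

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)  # a standard multi-layer-perceptron policy
model.learn(total_timesteps=100_000)      # train for a fixed budget of environment steps
model.save("ppo_cartpole")                # save the trained agent for later use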

Ray RLlib is another popular tool for RL. RLlib is designed as a scalable solution, making it easy to run RL algorithms on multi-GPU systems. It also supports multi-agent RL, which opens up new possibilities like:

  • Independent multi-agent learning: Each agent treats other agents as part of the environment.
  • Collaborative multi-agent training: A group of agents share the same policy and value functions and learn from each other’s experiences in parallel. 
  • Adversarial training: Agents (or groups of agents) compete against each other in competitive game-like environments. 

With both RLlib and Stable Baselines3, you can import and use Gymnasium environments. 

Custom environments

Environments packaged with Gymnasium are the right choice for testing new RL strategies and training policies. However, for most practical applications, you need to create and use an environment that accurately reflects the problem you want to solve. You can use Gymnasium to create a custom environment. The advantage of using Gymnasium custom environments is that many external tools like RLlib and Stable Baselines3 are already configured to work with the Gymnasium API structure. 

To create a custom environment in Gymnasium, you need to define: 

  • The observation space.
  • The terminal conditions.
  • The set of actions the agent can choose from.
  • How to initialize the environment (when the reset() function is called). 
  • How the environment decides the next state given the agent’s actions (when the step() function is called). 
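
As an illustration, here is a minimal sketch of a custom environment with that structure. The corridor world, its rewards, and its terminal condition are made up for this example:

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class CorridorEnv(gym.Env):
    """Hypothetical 1-D corridor: the agent starts at position 0 and must reach position 10."""
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=0.0, high=10.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0: move left, 1: move right

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds the environment's random number generator
        self.position = 0
        return np.array([self.position], dtype=np.float32), {}

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.position = max(0, min(10, self.position))
        terminated = self.position == 10      # terminal condition
        reward = 1.0 if terminated else -0.1  # small step penalty encourages short episodes
        return np.array([self.position], dtype=np.float32), reward, terminated, False, {}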

To learn more, follow the Gymnasium guide on creating custom environments.

Best Practices for Using Gymnasium

Experiment with different environments

The code in this tutorial showed how to implement the policy gradient algorithm in the CartPole environment. This is a simple environment with a discrete action space. To understand RL better, we advise you to apply the same policy gradient algorithm (and other algorithms, like PPO) in other environments. 

For example, the Pendulum environment has a continuous action space. Its action is a single continuous variable: the (magnitude and direction of the) torque applied to the pendulum in any given state. This torque can take any value between -2 and +2.
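
You can confirm this by inspecting the environment’s action space:

env = gym.make("Pendulum-v1")
print(env.action_space)
# Box(-2.0, 2.0, (1,), float32): a single continuous torque value between -2 and +2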

Experimenting with different algorithms in various environments helps you better understand different kinds of RL solutions and their challenges. 

Monitor training progress

RL environments often consist of robots, pendulums, mountain cars, video games, etc. Visualizing the agent’s actions within the environment gives a better intuitive understanding of the policy’s performance. 

In Gymnasium, the env.render() method visualizes the agent’s interactions with the environment. It graphically displays the current state of the environment—game screens, the position of the pendulum or cart pole, etc. Visual feedback of the agent’s actions and the environment’s responses helps monitor the agent's performance and progress through the training process. 

There are four render modes: “human”, “rgb_array”, “ansi”, and “rgb_array_list”. To visualize the agent’s performance, use the “human” render mode. The render mode is specified when the environment is initialized. For example:

env = gym.make('CartPole-v1', render_mode='human')

To perform the rendering, invoke the .render() method after each action performed by the agent (i.e., after each call to the .step() method). The pseudo-code below illustrates how to do this:

while not done:
    …
    observation, reward, terminated, truncated, info = env.step(action.item())
    env.render()
    …

Troubleshooting common errors

Gymnasium makes it easy to interface with complex RL environments. However, it is continuously updated software with many dependencies, so it is essential to watch out for a few common types of errors.

Version mismatches

  • Gymnasium version mismatch: Farama’s Gymnasium software package was forked from OpenAI’s Gym from version 0.26.2. There have been a few breaking changes between older Gym versions and new versions of Gymnasium. Many publicly available implementations are based on the older Gym releases and may not work directly with the latest release. In such cases, it is necessary to either roll back the installation to an older version or to adapt the code to work with the newer release. 
  • Environment version mismatch: Many Gymnasium environments have different versions. For example, there are two CartPole environments - CartPole-v1 and CartPole-v0. Although the behavior of the environment is the same across both versions, some of the parameters, like the episode length, reward threshold, etc., can be different. A policy trained on one version might not perform as well on another version of the same environment. You need to update the training parameters and retrain the policy for each environment version. 
  • Dependencies version mismatch: Gymnasium depends on packages like NumPy and PyTorch. As of December 2024, the latest versions of these dependencies are numpy 2.1.3 and torch 2.5.1. However, Gymnasium works best with torch 1.13.0 and numpy 1.23.3. You might encounter issues if you install Gymnasium into an environment where these packages are already installed at different versions. We recommend installing and working with Gymnasium in a fresh Conda environment. 

Convergence problems

  • Hyperparameters: Like other machine learning algorithms, RL policies are sensitive to hyperparameters like learning rate, discount factor, etc. We recommend experimenting and tuning the hyperparameters manually or using automated techniques like grid search and random search. 
  • Exploration versus exploitation: For some policy classes (such as PPO), the agent adopts a two-pronged strategy: explore the environment to discover new paths and adopt a greedy approach to maximize the rewards based on the paths known so far. If it explores too much, the policy does not converge. Conversely, it never tries the optimal path if it doesn't explore enough. So, finding the right balance between exploration and exploitation is essential. It is also common to prioritize exploration in earlier episodes and exploitation in later episodes during the training. 
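
For value-based methods, a common way to encode this trade-off is epsilon-greedy action selection, sketched below. It is not used by the policy gradient code in this tutorial, and the decay schedule is illustrative:

import numpy as np

def epsilon_greedy_action(env, action_values, epsilon, rng=np.random.default_rng()):
    """Explore with probability epsilon, otherwise exploit the best-valued action."""
    if rng.random() < epsilon:
        return env.action_space.sample()  # explore: try a random action
    return int(np.argmax(action_values))  # exploit: pick the action with the highest estimated value

# A typical schedule: explore heavily in early episodes, then shift toward exploitation
epsilons = [max(0.05, 0.995 ** episode) for episode in range(500)]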

Training instability

  • Large learning rates: If the learning rate is too high, the policy parameters undergo large updates in each step. This can overshoot the optimal set of values. A common solution is to gradually decay the learning rate, ensuring smaller and more stable updates as the training converges (see the scheduler sketch after this list). 
  • Excessive exploration: Too much randomness (entropy) in action selection prevents convergence and leads to large variations in the loss function between subsequent steps. To have a stable and convergent training process, balance exploration with exploitation. 
  • Wrong choice of algorithm: Simple algorithms like policy gradient might lead to unstable training in complex environments with large action and state spaces. In such cases, we recommend using more robust algorithms like PPO and Trust Region Policy Optimization (TRPO). These algorithms avoid large policy updates in each step and can be more stable.
  • Randomness: RL algorithms are notoriously sensitive to initial states and the randomness inherent to action selection. When a training run is unstable, it can sometimes be stabilized using a different random seed or by reinitializing the policy.
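
Regarding learning rate decay, a standard PyTorch scheduler can be attached to the optimizer used in the training loop. The sketch below assumes the policy network and training functions defined earlier in this tutorial:

import torch.optim as optim

# `policy` is the PolicyNetwork instance from the training code above
optimizer = optim.Adam(policy.parameters(), lr=0.01)

# Multiply the learning rate by 0.99 after each episode so later updates are smaller
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for episode in range(1, 500 + 1):
    # ... run forward_pass() and update_policy() here, as in main() above ...
    scheduler.step()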

Conclusion

In this tutorial, we explored the basic principles of RL, discussed Gymnasium as a software package with a clean API to interface with various RL environments, and showed how to write a Python program to implement a simple RL algorithm and apply it in a Gymnasium environment.

After understanding the basics in this tutorial, I recommend using Gymnasium environments to apply the concepts of RL to solve practical problems such as taxi route optimization and stock trading simulations.


FAQs

What is the role of Gymnasium in reinforcement learning?

Gymnasium comes with many pre-built environments for testing and developing RL algorithms. It standardizes the interface for these environments and new custom environments. This makes it easier to implement and evaluate RL algorithms.

Are you limited to the inbuilt environments when using Gymnasium?

No. In addition to over 60 pre-built environments, Gymnasium allows you to create new environments by declaring their state and action spaces and by defining how the environment responds to various actions.

What is the difference between Gym and Gymnasium?

In a nutshell, Gymnasium, maintained by the Farama Foundation, is the new and upgraded version of Gym, which OpenAI no longer maintains. Gymnasium is expected to work as a drop-in replacement for Gym 0.26.2.

What is the difference between policy-based and value-based RL methods?

In policy-based methods, like policy gradients, the agent directly modifies the policy parameters to find which policy leads to the maximum expected return. 

In value-based methods, the agent is trained on the value function. Learning the optimal value function helps to derive the optimal policy.

Is hand-coding algorithms the best and only way to solve RL problems?

Writing code from scratch is the best way to understand different RL algorithms. For production use, there are tools like Stable Baselines3, RLlib, and CleanRL. These tools include pre-built implementations of common RL algorithms, which you can configure and fine-tune for your environment.


Author: Arun Nanda

Arun is a former startup founder who enjoys building new things. He is currently exploring the technical and mathematical foundations of Artificial Intelligence. He loves sharing what he has learned, so he writes about it.

In addition to DataCamp, you can read his publications on Medium, Airbyte, and Vultr.
