
Getting Started with TorchRL for Deep Reinforcement Learning

A beginner-friendly guide to TorchRL for deep reinforcement learning—learn to build RL agents with PyTorch through practical examples.
Jan 29, 2025  · 30 min read

Reinforcement Learning (RL) is used to solve many complex problems, from training self-driving cars to fine-tuning large language models (LLMs) to give human-like responses.

LLMs, for example, are trained to align with human preferences using Reinforcement Learning from Human Feedback (RLHF). While Python-based frameworks like Keras and TensorFlow are still common in enterprise deep learning applications, most new projects are now built on PyTorch and PyTorch Lightning.

TorchRL is an open-source library for building RL solutions using PyTorch. In this tutorial, I will show you how to set up TorchRL, understand its underlying components, and use it to build a simple RL agent. We will also discuss using TorchRL to implement pre-built versions of RL algorithms like Proximal Policy Optimization (PPO). Finally, we will cover the basic principles of logging and monitoring RL agents.

What is TorchRL and Why Use It?

RL algorithms are often complex. You need to create the agent(s), calculate the returns and losses, define the forward and the backward passes, and evaluate the agent’s performance. 

TorchRL packages many commonly used RL functions into modules you can access directly. This makes implementing and experimenting with various algorithms to solve practical problems easier. It also makes it simpler to build new algorithms because researchers have access to the rest of the RL ecosystem without having to build it themselves. 

TorchRL comes with many prebuilt modules that make RL development more efficient. For example: 

  • Environments: TorchRL provides a standardized API to import and use RL environments from various sources, including Gymnasium, Jumanji, RoboHive, and more. It is often necessary to customize an environment's output to suit specific training needs, so TorchRL includes modules for common environment transformations that you can apply with a single function call. 
  • Data collectors and replay buffers: Many RL training algorithms involve collecting data about the agent's interactions with the environment: the action it took, the reward it received, and the next state it ended up in. TorchRL includes packages with the data structures to collect, store, and sample this information. 
  • Objective functions: The core methodology of an RL algorithm is implemented in its objective function. TorchRL packages common RL algorithms like DQN (Deep Q-Networks), A2C (Advantage Actor-Critic), PPO, and many more into prebuilt modules that you can directly invoke to train the agent. 
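To give a sense of how little glue code this requires, here is a minimal sketch that touches each of these module types. Every piece is covered hands-on later in this tutorial; the policy placeholder referenced in the comments is defined there.

from torchrl.envs import GymEnv
from torchrl.collectors import SyncDataCollector
from torchrl.objectives import DQNLoss

env = GymEnv("CartPole-v1")   # standardized environment wrapper
# Once a policy is defined (see the DQN example below), a collector and an
# objective come together in a couple of lines:
# collector = SyncDataCollector(env, policy, frames_per_batch=100, total_frames=-1)
# loss = DQNLoss(value_network=policy, action_space=env.action_spec, delay_value=True)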

Given these building blocks, TorchRL streamlines and simplifies building RL solutions for various use cases, such as:

  • Robotics: The robot has to be trained to navigate a complex environment successfully. It is penalized for tripping or otherwise failing to navigate. Using RL training methods, it learns the right actions to perform to successfully navigate an environment like an uneven natural surface. 
  • Game AI: Computer-based players in video and console games need to devise and improvise their moves in response to the human player’s actions. RL, with rewards and penalties, is used to train such players to choose the right moves to play against the human. 
  • Autonomous systems: Self-driving cars need to independently navigate a complex environment (like road traffic) and complete the goal (reach the destination) without any accidents. RL is used to train autonomous systems to mimic the behavior of a human driver under a variety of real-world conditions. 


Setting Up TorchRL

In this section, I will show you how to install and get started with TorchRL. 

Prerequisites

You need to install a few dependencies before installing and using TorchRL:

  • PyTorch: TorchRL is based on PyTorch, so you need it as a prerequisite. 
  • Gymnasium: You need the Gymnasium package to import RL environments. As of January 2025, the latest version of Gymnasium is not compatible with TorchRL, as explained on this GitHub Discussions page. Thus, install the older version 0.29.1. 
  • PyGame: A Python package for building video games. TorchRL needs it to simulate and render game-like RL environments, such as CartPole. 
  • TensorDict: Using a dictionary-like data structure to store the inputs and outputs of neural networks makes it convenient to work with tensors in TorchRL. TensorDict provides a tensor container that stores tensors as key-value pairs. 

Install the prerequisite packages:

!pip install torch tensordict gymnasium==0.29.1 pygame
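Once the packages are installed, here is a quick sketch of what TensorDict looks like in practice: tensors stored under string keys that share a batch dimension.

import torch
from tensordict import TensorDict

# Two tensors stored under string keys, sharing a batch dimension of size 4
td = TensorDict(
    {"observation": torch.randn(4, 3), "reward": torch.zeros(4, 1)},
    batch_size=[4],
)
print(td["observation"].shape)  # torch.Size([4, 3])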

Installing TorchRL

Install TorchRL using pip. I strongly recommend installing these packages in a Conda environment if you are working on a personal computer or server. 

!pip install torchrl

Verifying the installation

After installing, test that TorchRL has been installed successfully. Try importing torchrl in a Python shell (or notebook), and use the check_env_specs() function to check that a standard environment (such as CartPole) matches TorchRL's specifications:

import torchrl 
from torchrl.envs import GymEnv 
from torchrl.envs.utils import check_env_specs

check_env_specs(GymEnv("CartPole-v1"))

The output should indicate that it has successfully created the environment and that it works with TorchRL. 

[torchrl][INFO] check_env_specs succeeded!

Key Components of TorchRL

Before building your first RL agent, let’s look at the core building blocks of TorchRL.

Environments

TorchRL provides a uniform API to interface with different environments. It does this by wrapping individual environment-specific functionalities to a standard set of wrapper classes and functions. You pass the appropriate parameters to the wrapper, and it internally maps your command to the corresponding function call to the specific environment. In particular:

  • TorchRL converts environment states (observations), actions, and rewards into PyTorch tensor objects, which can be directly used by the modules that implement RL algorithms.
  • It makes it possible to apply preprocessing and postprocessing steps, for example, to normalize and scale input tensors or to present output tensors in a specific format.

For example, to create an environment from Gymnasium, use the GymEnv module:

env = GymEnv("CartPole-v1")
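To see this tensor-based interface in action, you can reset the environment and collect a short random rollout; both calls return TensorDict objects keyed by name. A quick check you can run after creating the environment above:

reset_td = env.reset()
print(reset_td["observation"])       # CartPole's 4-dimensional state as a torch.Tensor

rollout_td = env.rollout(3)          # three steps using random actions
print(rollout_td["next", "reward"])  # rewards returned as a tensor of shape [3, 1]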

Transforms

It is common to extend a stock environment with add-on features and transformations that you would otherwise have to implement yourself. For example, instead of coding it manually, you can attach a step counter that keeps track of the number of steps in each episode. The TransformedEnv class helps with this:

from torchrl.envs import GymEnv, StepCounter, TransformedEnv
env = TransformedEnv(GymEnv("CartPole-v1"), StepCounter())

Similarly, normalizing tensors before operating on them is a common preprocessing step; the ObservationNorm module applies this normalization. The TransformedEnv documentation discusses the various kinds of transformations available. 

You can also combine multiple transformations using the Compose class:

from torchrl.envs import Compose, GymEnv, ObservationNorm, StepCounter, TransformedEnv

device = "cpu"
base_env = GymEnv('CartPole-v1', device=device) 

env = TransformedEnv( 
    base_env, 
    Compose(
        ObservationNorm(in_keys=["observation"]), 
        StepCounter()
    )
)

Agents and policies

In RL, the agent decides its actions based on its policy and the observed state of the environment. The agent's goal is to maximize the cumulative rewards from the environment. It receives rewards when it chooses the right actions (such as stopping at a red light). 

For example, in a self-driving car, the agent drives the car. Its decisions (such as steering the car in a particular direction) are based on its observations of the state of the environment (such as traffic, the position of the car, etc.) and its policy (such as avoiding pedestrians and other obstacles, stopping at red lights, etc.). 

The simplest policy is to choose an action at random. The actor randomly chooses one of the actions from the space of possible actions. A random policy is sometimes used to generate an initial dataset of interactions before starting to train the model. 

Use the RandomPolicy module to create a random policy. It accepts an action_spec parameter that specifies the action space. In the example below, the action space consists of continuous values between -1 and +1, so the random policy chooses an action represented by a random number in that range. 

import torchrl 
import torch
from tensordict import TensorDict
from torchrl.data.tensor_specs import Bounded

action_spec = Bounded(-torch.ones(1), torch.ones(1))
actor = torchrl.envs.utils.RandomPolicy(action_spec=action_spec) 
td = actor(TensorDict({}, batch_size=[])) 
print(td.get("action"))

The output should be a tensor, as shown below:

tensor([0.9258])

Check that the policy is indeed random by running the actor again:

td = actor(TensorDict({}, batch_size=[])) 
print(td.get("action"))

It should output a different random tensor.

If you need a primer on RL concepts, check out the Reinforcement Learning in Python skill track on DataCamp!

Building Your First RL Agent with TorchRL

In this section, I show you how to implement a simple Reinforcement Learning agent using TorchRL. 

Before starting, import the prerequisite software packages in Python:

  • time to measure the time taken to train the agent.
  • GymEnv, StepCounter, and TransformedEnv to work with Gymnasium environments.
  • MLP to create a simple multi-layer perceptron (MLP) neural network. 
  • EGreedyModule to balance exploring the environment and exploiting the best-known policy.
  • QValueModule and DQNLoss to implement the Deep Q-Learning algorithm.
  • SoftUpdate to update the neural network.
  • SyncDataCollector to collect data from the agent’s interactions. 
  • ReplayBuffer to store the data from the agent’s interactions.
  • Adam, the optimizer used to update the network weights via gradient descent.
  • matplotlib to visually display the training progress. 
  • torchrl_logger to log the training session.
import time
import torch
import matplotlib.pyplot as plt
from torchrl.envs import GymEnv, StepCounter, TransformedEnv
from tensordict.nn import TensorDictModule as TensorDict, TensorDictSequential as Seq
from torchrl.modules import EGreedyModule, MLP, QValueModule
from torchrl.objectives import DQNLoss, SoftUpdate
from torchrl.collectors import SyncDataCollector
from torchrl.data import LazyTensorStorage, ReplayBuffer
from torch.optim import Adam
from torchrl._utils import logger as torchrl_logger

Step 1: Define the environment

In this example, we solve the CartPole environment. Import this environment from Gymnasium along with a step counter to keep track of the number of training steps:

env = TransformedEnv(GymEnv("CartPole-v1"), StepCounter())

Seed PyTorch and the RL environment to reproduce similar results across training sessions. 

torch.manual_seed(0)
env = TransformedEnv(GymEnv("CartPole-v1"), StepCounter())
env.set_seed(0)

Define the parameters and hyperparameters for the training: 

  • INIT_RAND_STEPS: The number of steps for which the agent acts randomly before using the policy. These initial steps are used to collect initial data to start training the policy.
  • FRAMES_PER_BATCH: The number of data points (one for each interaction or time-step) in a training batch. 
  • OPTIM_STEPS: The number of optimization steps (sample, loss, backward pass, and update) to run for each batch of collected data. 
  • EPS_0: The initial value of epsilon, the exploration coefficient. 
  • BUFFER_LEN: The size of the replay buffer.
  • ALPHA: The learning rate.
  • TARGET_UPDATE_EPS: The decay factor for updating the target network using the soft-update module.
  • REPLAY_BUFFER_SAMPLE: The size of the random sample to pick from the replay buffer in each training iteration. The replay buffer stores the results of the agent’s interactions with the environment in each timestep. 
  • LOG_EVERY: The number of steps after which to print the training progress.
  • MLP_SIZE: The number of units in each hidden layer of the MLP.
INIT_RAND_STEPS = 5000 
FRAMES_PER_BATCH = 100
OPTIM_STEPS = 10
EPS_0 = 0.5
BUFFER_LEN = 100_000
ALPHA = 0.05
TARGET_UPDATE_EPS = 0.95
REPLAY_BUFFER_SAMPLE = 128
LOG_EVERY = 1000
MLP_SIZE = 64

Step 2: Create the policy

Define a simple neural network to implement the policy: 

  • Define the MLP (neural network). Given an observation, the MLP outputs one value per action, effectively doing the job of the Q-function. 
  • Wrap the MLP in a TensorDictModule (imported above as TensorDict). This module reads the environment's state observation from the "observation" key and writes the predicted action values to the "action_value" key. 
  • Use the QValueModule to implement the greedy (exploitation) strategy. Given a tensor of action values, it returns the action corresponding to the highest action value. 
  • Combine (using the TensorDictSequential module) the MLP’s tensor dictionary and the QValueModule to define the policy. 
value_mlp = MLP(out_features=env.action_spec.shape[-1], num_cells=[MLP_SIZE, MLP_SIZE])
value_net = TensorDict(value_mlp, in_keys=["observation"], out_keys=["action_value"])
policy = Seq(value_net, QValueModule(spec=env.action_spec))
  • Define the exploration module using the EGreedyModule. It takes the environment's action spec, the number of steps over which to anneal epsilon (set here to BUFFER_LEN), and the initial value of the exploration coefficient epsilon. Chain together (using the TensorDictSequential module) this exploration module with the policy defined above to get the final policy:
exploration_module = EGreedyModule(
    env.action_spec, annealing_num_steps=BUFFER_LEN, eps_init=EPS_0
)
policy_explore = Seq(policy, exploration_module)

Step 3: Train the agent

The first step in training the agent is to collect the data from the agent’s interactions with the environment. Use the SyncDataCollector to build a collector to execute the policy and collect the results of the agent’s interactions: 

collector = SyncDataCollector(
    env,
    policy_explore,
    frames_per_batch=FRAMES_PER_BATCH,
    total_frames=-1,
    init_random_frames=INIT_RAND_STEPS,
)

Create a replay buffer to store the results of the interactions:

rb = ReplayBuffer(storage=LazyTensorStorage(BUFFER_LEN))

You also need to declare training-specific modules: the loss function (to use the DQN-based loss calculation), the optimizer (the classic Adam algorithm), and the updater (to softly update the target network's parameters). These are all based on predefined TorchRL modules:

loss = DQNLoss(value_network=policy, action_space=env.action_spec, delay_value=True)
optim = Adam(loss.parameters(), lr=ALPHA)
updater = SoftUpdate(loss, eps=TARGET_UPDATE_EPS)

Initialize the counters for keeping track of the total number of steps and episodes, successful steps per episode, and the execution time:

total_count = 0
total_episodes = 0
t0 = time.time()
success_steps = []

The training loop consists of two for loops: 

  • The outer loop iterates over the batches of data produced by the collector (which runs the exploration policy) and appends each batch to the replay buffer.
  • The inner loop runs OPTIM_STEPS optimization steps per collected batch. Each step does the following: 
    • Pick a random sample from the replay buffer.
    • Calculate the loss using the loss function.
    • Run the optimizer and updater.
    • Update the counters.

Each step involves calling prebuilt TorchRL modules without coding anything from scratch. The code below shows how to implement these steps:

for i, data in enumerate(collector):
    rb.extend(data)
    max_length = rb[:]["next", "step_count"].max()
    if len(rb) > INIT_RAND_STEPS:
        for _ in range(OPTIM_STEPS):
            sample = rb.sample(REPLAY_BUFFER_SAMPLE)
            loss_vals = loss(sample)
            loss_vals["loss"].backward()
            optim.step()

            optim.zero_grad()
            # Update exploration factor
            exploration_module.step(data.numel())

            # Update target params
            updater.step()
            total_count += data.numel()
            total_episodes += data["next", "done"].sum()
    success_steps.append(max_length)

Step 4: Evaluate the agent

The loop in the previous section continuously trains the policy. We need to set the criteria for evaluating performance and decide when to consider the training successful. We also want to output the progress of the training. 

We print the progress of the training at periodic intervals using the TorchRL logger:

if total_count > 0 and total_count % LOG_EVERY == 0:
    torchrl_logger.info(f"Successful steps in the last episode: {max_length}, rb length {len(rb)}, Number of episodes: {total_episodes}")

We use the maximum number of steps the agent achieved in the last episode to determine whether the training is successful. The CartPole-v1 environment caps the maximum number of steps and the total reward per episode at 500. It is conventional to consider a policy to be successful if it achieves more than 475 steps:

if max_length > 475:
    print("TRAINING COMPLETE")
    break

The above two code snippets go inside the training loop (shown earlier), at the end of each iteration of the outer loop. The following snippet shows the complete training loop with the code to evaluate the agent: 

for i, data in enumerate(collector):
    # Write data in replay buffer
    rb.extend(data)
    max_length = rb[:]["next", "step_count"].max()
    if len(rb) > INIT_RAND_STEPS:
        for _ in range(OPTIM_STEPS):
            sample = rb.sample(REPLAY_BUFFER_SAMPLE)
            loss_vals = loss(sample)
            loss_vals["loss"].backward()
            optim.step()

            optim.zero_grad()
            # Update exploration factor
            exploration_module.step(data.numel())

            # Update target params
            updater.step()
            total_count += data.numel()
            total_episodes += data["next", "done"].sum()
    success_steps.append(max_length)

    if total_count > 0 and total_count % LOG_EVERY == 0:
        torchrl_logger.info(f"Successful steps in the last episode: {max_length}, rb length {len(rb)}, Number of episodes: {total_episodes}")

    if max_length > 475:
        print("TRAINING COMPLETE")
        break

Finally, after the training and evaluation, print the total training time and plot the training progress:

t1 = time.time()

torchrl_logger.info(
    f"solved after {total_count} steps, {total_episodes} episodes and in {t1-t0}s."
)

def plot_steps():
    plt.plot(success_steps)
    plt.title('Successful steps over training episodes')
    plt.xlabel('Training episodes')
    plt.ylabel('Steps')
    plt.show()

plot_steps()

You can find and execute the complete code to implement DQN using TorchRL in this DataLab workbook. 

In this section, we built and trained an RL agent using the simple DQN algorithm. In the next section, we explore how to use prebuilt TorchRL modules to implement a more complex algorithm like PPO. 

Exploring Prebuilt Algorithms in TorchRL

As mentioned in previous sections, TorchRL comes with a few pre-built algorithms. Let’s take a look at them and how they work.

Supported algorithms

TorchRL includes prebuilt modules for many common Deep Reinforcement Learning algorithms, such as: 

  • Deep Q networks (DQN)
  • Deep deterministic policy gradient (DDPG)
  • Soft actor-critic (SAC)
  • Randomized ensembled double Q-learning (REDQ)
  • CrossQ
  • Implicit Q-learning (IQL)
  • Conservative Q-learning (CQL)
  • Generative adversarial imitation learning (GAIL)
  • Decision transformer (DT)
  • Twin-delayed DDPG (TD3) 
  • Advantage actor-critic (A2C)
  • Proximal policy optimization (PPO)
  • REINFORCE
  • and more 

This makes it efficient to experiment with different types of algorithms and study the performance of each for solving a given problem. 
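Most of these algorithms correspond to loss modules under torchrl.objectives that plug into the same training-loop structure shown above. A few examples are sketched below; the exact set of classes available depends on your TorchRL version.

# A few of the prebuilt objective modules; availability may vary by TorchRL version.
from torchrl.objectives import (
    DQNLoss,      # Deep Q-networks
    DDPGLoss,     # Deep deterministic policy gradient
    SACLoss,      # Soft actor-critic
    TD3Loss,      # Twin-delayed DDPG
    A2CLoss,      # Advantage actor-critic
    ClipPPOLoss,  # PPO with the clipped surrogate objective
)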

Example: PPO in TorchRL

Proximal Policy Optimization (PPO) uses a clipped surrogate objective function that keeps each policy update close to the previous policy, which makes training smoother and more stable. 
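For reference, the clipped surrogate objective that PPO maximizes can be written as:

L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

where \hat{A}_t is the advantage estimate (computed with GAE below) and \epsilon is the clipping range, set by the CLIP_EPSILON hyperparameter in the code that follows.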

In this section, I will show how to use the prebuilt PPO modules in TorchRL. As with the earlier implementation, start by importing the prerequisite modules. In addition to the modules needed previously (such as ReplayBuffer, SyncDataCollector, etc.), you need a few additional packages for PPO:

  • ProbabilisticActor, to choose an action stochastically. Choosing the action stochastically (instead of always following the policy and maximizing the reward) facilitates exploration of the environment and discovering more optimal paths during training.
  • OneHotCategorical, a distribution over one-hot encoded actions, parameterized by the logits output by the actor network. 
  • ValueOperator, to build a TorchRL module based on the neural network that implements the value function.
  • GAE, to implement the generalized advantage estimate. PPO uses an advantage function as a proxy for the value function. The value module (above) is an input to the GAE. 
  • ClipPPOLoss, to build the clipped objective function for implementing PPO. 
import torch
from torch import nn

from torchrl.envs import Compose, ObservationNorm, DoubleToFloat, StepCounter, TransformedEnv
from torchrl.envs.libs.gym import GymEnv
from torchrl.envs.utils import check_env_specs
from torchrl.modules import ProbabilisticActor, OneHotCategorical, ValueOperator
from torchrl.collectors import SyncDataCollector
from torchrl.data.replay_buffers import ReplayBuffer
from torchrl.data.replay_buffers.storages import LazyTensorStorage
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement
from torchrl.objectives.value import GAE
from torchrl.objectives import ClipPPOLoss
from tensordict.nn import TensorDictModule

Declare the parameters and hyperparameters for the training:

  • FRAMES_PER_BATCH: The number of frames (timesteps) in each batch of data resulting from the agent’s interactions with the environment. 
  • SUB_BATCH_SIZE: The number of timesteps in each sub-batch. In PPO training, each batch is divided into sub-batches. 
  • TOTAL_FRAMES: The total number of timesteps for which to run the training. 
  • GAMMA: The discount factor for discounting the value of rewards from future timesteps.
  • LAMBDA: The decay parameter used by Generalized Advantage Estimation (GAE) to trade off bias and variance in the advantage estimates.
  • CLIP_EPSILON: This defines the trust region within which the policy can change in each training iteration. It is implemented by clipping the policy update so that the ratio between the new and old action probabilities falls within a certain limit. 
  • ALPHA: The learning rate.
  • ENTROPY_EPS: The coefficient of the entropy bonus, which encourages exploration. 
  • OPTIM_STEPS: The number of optimization epochs to run over each collected batch of data. 
  • LOG_EVERY: The number of batches after which to evaluate the policy and print the rewards. 
FRAMES_PER_BATCH = 1024
TOTAL_FRAMES = 1048576
GAMMA = 0.99
LAMBDA = 0.95
CLIP_EPSILON = 0.2
ALPHA = 1e-4
ENTROPY_EPS = 5e-4 
SUB_BATCH_SIZE = 64
OPTIM_STEPS = 8
LOG_EVERY = 16

Import the base CartPole environment from Gymnasium:

device="cpu"
base_env = GymEnv('CartPole-v1', device=device) 

Declare the TorchRL environment by importing the base environment and adding modules for normalizing the observation tensors and counting the number of steps:

env = TransformedEnv( 
    base_env, 
    Compose(
        ObservationNorm(in_keys=["observation"]), 
        DoubleToFloat(), 
        StepCounter()
    )
)

Initialize the environment and seed the random number generators:

env.transform[0].init_stats(1024)  # estimate ObservationNorm's mean and std from 1024 environment steps
torch.manual_seed(0)
env.set_seed(0)
check_env_specs(env) 

Declare the actor as a neural network with two hidden layers of 32 units each. It takes the observation from the environment as input and outputs a logit for each action in the action space. 

actor_net = nn.Sequential(
    nn.Linear(env.observation_spec["observation"].shape[-1], 32, device=device),
    nn.ReLU(),
    nn.Linear(32, 32, device=device),
    nn.ReLU(),
    nn.Linear(32, env.action_spec.shape[-1], device=device),
    nn.ReLU()
)

Wrap the above neural network in a TensorDictModule that reads the environment's observation from the "observation" key and writes the network's outputs to the "logits" key. 

actor_module = TensorDictModule(actor_net, in_keys=["observation"], out_keys=["logits"])

For training PPO (and many other RL algorithms), you don’t always choose the action that leads to the highest reward in the next step. Having some degree of randomness in choosing the action allows the agent to explore the environment and discover better paths that can lead to higher long-term returns. Create a probabilistic actor to do this:

actor = ProbabilisticActor(
    module = actor_module,
    spec = env.action_spec,
    in_keys = ["logits"],
    distribution_class = OneHotCategorical, 
    return_log_prob = True
)

Create the network that implements the value function. This network has one hidden layer of 16 units. It takes the environment's state (observation) as input and outputs the expected value of that state. 

value_net = nn.Sequential(
    nn.Linear(env.observation_spec["observation"].shape[-1], 16, device=device),
    nn.ReLU(),
    nn.Linear(16, 1, device=device),
    nn.ReLU()
)

Create a TorchRL module wrapping the value function (implemented above) using the ValueOperator(). This module interfaces with other TorchRL components. 

value_module = ValueOperator(
    module = value_net,
    in_keys = ["observation"]
)

Create the replay buffer to store the results of the agent’s interactions with the environment: 

replay_buffer = ReplayBuffer(
    storage = LazyTensorStorage(max_size=FRAMES_PER_BATCH),
    sampler = SamplerWithoutReplacement()
)

Create a data collector to run the policy on the environment and collect the results of the agent’s interactions with the environment in each timestep (frame):

collector = SyncDataCollector(
    env,
    actor,
    frames_per_batch = FRAMES_PER_BATCH,
    total_frames = TOTAL_FRAMES,
    split_trajs = True,
    reset_at_each_iter = True,
    device=device
)

Use the GAE module to implement the advantage function for PPO. The advantage function is based on the value function. 

advantage_module = GAE(
    gamma = GAMMA, 
    lmbda = LAMBDA, 
    value_network = value_module,
    average_gae = True
) 
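For reference, GAE computes the advantage as an exponentially weighted sum of temporal-difference errors, with GAMMA (\gamma) and LAMBDA (\lambda) controlling the weighting:

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\, \delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)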

Use the built-in ClipPPOLoss module to implement the loss function according to the PPO algorithm:

loss_module = ClipPPOLoss(
    actor_network = actor,
    critic_network = value_module,
    clip_epsilon = CLIP_EPSILON,
    entropy_bonus = bool(ENTROPY_EPS),
    entropy_coef = ENTROPY_EPS
)

Declare the Adam optimizer: 

optim = torch.optim.Adam(loss_module.parameters(), lr=ALPHA)

Create a scheduler to gradually reduce the learning rate as the training progresses: 

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, TOTAL_FRAMES // FRAMES_PER_BATCH)

Build the training loop using the components created above. The training function consists of three for loops:

  • The outermost loop runs through the batches of data collected from the agent's interactions with the environment. The TOTAL_FRAMES parameter decides how many timesteps to run the agent in total. After the optimization epochs, it steps the learning-rate scheduler and periodically evaluates the policy and prints the rewards. 
  • The middle loop runs a fixed number of optimization epochs (typically between 5 and 10) over each collected batch, as defined by the OPTIM_STEPS hyperparameter. In each epoch, it: 
    • Recomputes the advantage estimates for the batch. 
    • Writes the batch into the replay buffer. 
    • Runs the innermost loop over sub-batches. 
  • The innermost loop runs the training step:
    • Sample a sub-batch of training data from the replay buffer.
    • Calculate the loss using the loss module. 
    • Run backpropagation on the loss.
    • Use the optimizer to take a gradient descent step. 

The code below implements the training loop:

rewards = []  # evaluation returns, collected every LOG_EVERY batches
for i, tensordict_data in enumerate(collector): 
    for _ in range(OPTIM_STEPS): 
        advantage_module(tensordict_data)
        replay_buffer.extend(tensordict_data.reshape(-1).cpu())
        for _ in range(FRAMES_PER_BATCH // SUB_BATCH_SIZE): 
            data = replay_buffer.sample(SUB_BATCH_SIZE)
            loss = loss_module(data.to(device))
            loss_value = loss["loss_objective"] + loss["loss_critic"] + loss["loss_entropy"]
            loss_value.backward()
            optim.step()
            optim.zero_grad()
    scheduler.step()

    if i % LOG_EVERY == 0:
        with torch.no_grad():
            # Evaluate the current policy for up to FRAMES_PER_BATCH steps
            rollout = env.rollout(FRAMES_PER_BATCH, actor)
            reward_eval = rollout["next", "reward"].sum()
            print(reward_eval)
            rewards.append(reward_eval)
            del rollout

This DataLab workbook has the code (as shown above) for training an RL agent using the PPO algorithm from TorchRL. Use it as a starting point to finetune the parameters and improve the agent’s performance. 

Customizing algorithms

TorchRL has a modular design that makes it possible to customize RL solutions. The framework takes care of the interfacing between different components, so you can mix and match environments, policies, replay buffers, and RL algorithms to test and compare their performance. Because of this modularity:

  • You can swap out different components without having to rewrite large parts of the program. 
  • Individual components can also be modified independently to create custom solutions. 
  • You can experiment with different modules and build an optimized solution. 

Some examples of customization in TorchRL are:

  • Training loops can be customized with learning rate schedulers, loggers, and custom metrics. 
  • Different lengths of replay buffers can be used to adjust the training duration.
  • The environment can be customized to include scaling and normalizing functions, step counters, etc. 
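As a small illustration of this plug-and-play design, here is a sketch (using only components already imported in this tutorial) that swaps the sampling strategy of the DQN replay buffer without touching the rest of the training loop:

from torchrl.data import LazyTensorStorage, ReplayBuffer
from torchrl.data.replay_buffers.samplers import SamplerWithoutReplacement

# Default behavior: uniform random sampling with replacement
rb_uniform = ReplayBuffer(storage=LazyTensorStorage(100_000))

# Same buffer, different sampler: each stored transition is drawn at most
# once per pass over the buffer
rb_no_replacement = ReplayBuffer(
    storage=LazyTensorStorage(100_000),
    sampler=SamplerWithoutReplacement(),
)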

Visualizing and Debugging RL Training

The previous sections showed how to use TorchRL modules to train RL agents. In this section, we discuss ways to monitor and visualize the training progress. 

Monitoring training progress

It is helpful to log various metrics during training to monitor its progress. Packages like TensorBoard make it easy to log results directly from the training loop. For example, you can call TensorBoard's SummaryWriter inside the loop to record metrics. The following pseudo-code shows how:

from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir="training_logs")

for step, batch in enumerate(training_data):  # training loop
    loss = … 
    loss.backward() 
    writer.add_scalar("Loss", loss.item(), step)
    … 

writer.close() 

After finishing the training and logging, you can plot and visualize the results. For example, use the tensorboard command to visualize the training progress. 

tensorboard --logdir="training_logs"

Debugging with TorchRL

Debugging is essential to verify the agent’s interaction with the environment. As a first step, you want to check the action space and the observation space. The environment’s documentation should include the specifications for the environment. For example, CartPole’s documentation shows that: 

  • The action space includes two discrete values: 0 (push the cart to the left) and 1 (push the cart to the right). 
  • The observation space has 4 values: 
    • Cart position, with a value between -4.8 and +4.8
    • Cart velocity, with any real-number value
    • Pole angle, with a value between -24° and +24°
    • Pole angular velocity, with any real-number value

Inspect the observation and action spaces to check that they correspond to the expected values. For example:

print("Observation space: ", env.observation_spec)
print("Action space: ", env.action_spec)

Additionally, you can draw random samples from the observation and action spaces and check their values: 

print("Sample Observation:", base_env.observation_spec.sample().get("observation"))
print("Sample Action:", base_env.action_spec.sample())

The output resembles the sample below:

Sample Observation: tensor([-4.5960e+00,  3.4028e+38,  2.2261e-02,  3.4028e+38])
Sample Action: tensor([1, 0])

Notice that the bounded values (cart position and pole angle) fall within the ranges specified earlier, while the unbounded velocities are sampled from the full float32 range (hence the 3.4028e+38 entries). Repeat the above commands a few times and notice that the different (randomly sampled) values respect these specifications. 

The above two commands are based on the base environment, directly imported from Gymnasium. In practice, we use the transformed environment, which applies transformations, like normalization, on the original tensor values. Thus, when you use the transformed TorchRL environment to draw a sample of observations, their values may no longer fall within the same range as in the original Gymnasium environment. 

For example, draw a sample from the transformed environment:  

print("Sample Observation:", env.observation_spec.sample().get("observation"))

The output resembles: 

Sample Observation: tensor([-57.5869,      nan,   4.5276,      nan])

Notice that the cart position and pole angle values are outside the ranges. This is because the tensor values have been normalized. Furthermore, the cart velocity and pole angular velocity have nan values because the agent has not yet started interacting with the environment. 

Visualizing agent performance

In addition to plotting the progress of the training, it can be helpful to render the environment and visually observe the agent’s interactions. This can give insights into how the agent is performing. 

The most pragmatic way to visualize the environment is by rendering a video. You need a few additional packages: 

  • torchvision, to work with multimedia files
  • av, the Python bindings for FFmpeg, which are needed to write the video files. 
pip install torchvision
pip install av==12.0.0

After finishing the training, follow these preparatory steps to render the video:

  • Declare the (relative) path to store the video.
  • Initialize the logger to store the results of the agent’s interactions in CSV format.
  • Initialize the video recorder to use the CSV logs to generate a video.
  • Transform the TorchRL environment to include the video recorder.
…
# training loop
# for _ in …
…
…

from torchrl.record import CSVLogger, VideoRecorder 

path = "./training_loop"
logger = CSVLogger(exp_name="dqn", log_dir=path, video_format="mp4")
video_recorder = VideoRecorder(logger, tag="video")
record_env = TransformedEnv(
    GymEnv("CartPole-v1", from_pixels=True, pixels_only=False), video_recorder
)

Run the (trained) policy on the environment and dump the renderings into the video file:

record_env.rollout(max_steps=1000, policy=policy)
video_recorder.dump()

After running the program, you will find the video in the directory path (./training_loop in the above snippet) you had set earlier. Note that the DataLab workbook for implementing DQN using TorchRL does not include the code to create the video because exporting video files with an online notebook is difficult. Run the program (and add the above steps) on a local computer or server to record videos.

Best Practices for Using TorchRL

Finally, let’s touch on some best practices. Here are my recommendations.

Start with simple environments

Notwithstanding the ease of development that TorchRL provides, training RL agents to perform well in complex environments remains challenging. Thus, before trying to solve hard problems, it is best to first solve simple environments like CartPole. 

TorchRL’s modularity allows experimentation with different algorithms and parameters. Before using alternative algorithms and customization options in practice, it is necessary to gain some insights into their workings. This is best done in a simple environment, where the effects of individual changes are easy to observe visually. 

Experiment with hyperparameters

The training performance of RL models is sensitive to various hyperparameters, such as learning rate, exploration rate, discount rate, and other algorithm-specific hyperparameters, like the clipping ratio for PPO. For example, a learning rate that’s too high makes the training unstable, while a learning rate that is too low takes too long to converge. 

Similarly, a very high exploration rate prevents the training from converging, whereas a too-low exploration rate can prevent the agent from discovering the optimal path. 

There is no formula for determining the right hyperparameter values. Guidelines and recommended values for standard environments exist, but they may not apply to all problems. Thus, it is necessary to experiment with various methods, such as grid search or random search, to determine the best values for the hyperparameters. 
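For example, a simple grid search over two of the DQN hyperparameters from earlier might look like the following sketch; train_dqn is a hypothetical helper that would wrap the training loop shown above and return a score for one hyperparameter setting:

import itertools

def train_dqn(alpha, eps_0):
    # Hypothetical placeholder: re-run the DQN training loop from earlier with
    # ALPHA=alpha and EPS_0=eps_0, and return e.g. the number of episodes
    # needed to reach 475 steps (lower is better).
    return 0

grid = itertools.product([1e-3, 5e-2], [0.3, 0.5, 0.7])
results = {(alpha, eps_0): train_dqn(alpha, eps_0) for alpha, eps_0 in grid}
best_alpha, best_eps = min(results, key=results.get)
print(f"Best hyperparameters: ALPHA={best_alpha}, EPS_0={best_eps}")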

It is also possible to use automated libraries like Weights & Biases Sweeps or Optuna to test the training performance across various hyperparameter combinations. 

Leverage prebuilt algorithms

As you saw in the previous examples, using the prebuilt modules from TorchRL saves considerable development effort. The alternative is to build all the functionality from scratch, which would consume significantly more time and cost to develop and test. 

Because many developers use TorchRL's modules, they have been tested across a wide range of scenarios, so you can expect them to be more reliable than a custom-built module. For most standard use cases, it is strongly advisable to use the prebuilt modules. 

Conclusion

In this article, we covered the basic concepts of TorchRL, a PyTorch-based framework to implement RL algorithms. We also saw hands-on examples of using TorchRL to implement simpler solutions like Deep Q-Learning and more complex algorithms like PPO. 

As the next step, you can use these programs as a basis to experiment with other environments and algorithms. 

If you're looking to deepen your understanding of RL fundamentals and practical applications, check out Reinforcement Learning with Gymnasium in Python to get more hands-on experience with Gymnasium-based environments.


Author
Arun Nanda

Arun is a former startup founder who enjoys building new things. He is currently exploring the technical and mathematical foundations of Artificial Intelligence. He loves sharing what he has learned, so he writes about it.

In addition to DataCamp, you can read his publications on Medium, Airbyte, and Vultr.
