How many times have you seen a NaN loss value while training a deep neural network?
After hours of training, the loss curve looks healthy, and then it spikes to infinity out of nowhere. The usual cause is exploding gradients: gradient values that grow so large during backpropagation that parameter updates become unstable and the model breaks. This problem hits recurrent networks the hardest, but it also shows up in transformers and deep feedforward networks.
Gradient clipping fixes this by restricting the size of gradients before they reach the optimizer. It's a one-line addition to your training loop that keeps updates bounded without making any changes to the model.
In this article, I'll cover the intuition behind gradient clipping, the two main methods, how to pick a threshold, and how to implement it in PyTorch and TensorFlow.
But what exactly is loss in data science? Read our Loss Function in Machine Learning blog post to find out.
What Is Gradient Clipping?
Gradient clipping is a technique that limits the magnitude of gradients during training to prevent unstable parameter updates.
When a gradient gets too large, the optimizer takes a huge step in parameter space and pushes weights into a region where the loss explodes. Clipping helps you by capping that step size before it can do any damage.
It’s important to note that gradient clipping doesn't affect the model architecture. You don't add layers or change activation functions. It only modifies the training process by intercepting gradients between backpropagation and the optimizer step.
This makes it cheap to try and easy to remove. As you’ll see later, it only takes one line of code.
How Gradient Clipping Works
The mechanics are simple. The clipping operation sits between your backward pass and your optimizer step, and it follows the same four steps every iteration.
- Compute gradients: Run the forward pass, calculate the loss, and run backpropagation. Nothing changes here, as gradients flow through the network the same way they always do.
- Check gradient magnitude: Measure how large the gradients are. Depending on the method, this means looking at individual values or computing the overall norm across all parameters.
- Reduce gradients if they exceed a threshold: If the magnitude crosses the limit you've set, scale the gradients down. If it doesn't, leave them as they are.
- Update model parameters: Pass the clipped gradients to the optimizer and apply the weight update.
Most of the time, your gradients stay below the threshold and training proceeds as it would without gradient clipping. When a spike happens, clipping catches it before the optimizer can react.
That’s it.
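Here's a minimal sketch of those four steps in plain Python, with no framework involved. The helper names (`clip_by_global_norm`, `train_step`), the steep toy loss, and the learning rate are my own assumptions, picked so that the unclipped run diverges.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

def loss_gradient(weights):
    # Hypothetical steep loss: L(w) = 100 * sum(w_i^2), so dL/dw_i = 200 * w_i.
    return [200.0 * w for w in weights]

def train_step(weights, lr, max_norm=None):
    grads = loss_gradient(weights)                        # 1. compute gradients
    if max_norm is not None:
        grads, _ = clip_by_global_norm(grads, max_norm)   # 2-3. measure, rescale if too large
    return [w - lr * g for w, g in zip(weights, grads)]   # 4. update parameters

def run(steps, lr, max_norm=None):
    weights = [3.0, 4.0]
    for _ in range(steps):
        weights = train_step(weights, lr, max_norm)
    return weights

unclipped = run(steps=20, lr=0.1)
clipped = run(steps=20, lr=0.1, max_norm=1.0)
print("weight norm without clipping:", math.hypot(*unclipped))  # grows astronomically
print("weight norm with clipping:   ", math.hypot(*clipped))    # shrinks steadily
```

With clipping, no single step can move the weights by more than `lr * max_norm`, which is what keeps this toy run from blowing up.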
Common Gradient Clipping Methods
There are two ways to clip gradients, and the difference comes down to what you measure and what you scale.
Clip by value
Clip by value individually caps each gradient element.
You pick a range, say [-1.0, 1.0], and any gradient value outside that range gets clamped to the nearest boundary. A gradient of 2.5 becomes 1.0. A gradient of -2.5 becomes -1.0. Values already inside the range remain unchanged.

Figure: Clip by value example
The appeal is how simple it is. There's no math beyond a min/max operation, and it’s fast to run.
But this approach has a downside. Clipping individual values changes the direction of the gradient vector. If one component gets clipped and the others don't, the updated vector no longer points where backpropagation said it should. Your optimizer ends up taking a step in a slightly wrong direction.
That's why clip by value is less common in practice.
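To make the direction change concrete, here's a small sketch in plain Python. The function names and the third gradient value (0.3) are assumptions for illustration. The first two values come from the example above.

```python
import math

def clip_by_value(grads, clip_value):
    """Clamp every gradient element to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

def cosine_similarity(a, b):
    """1.0 means the two vectors point in exactly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

grads = [2.5, -2.5, 0.3]
clipped = clip_by_value(grads, 1.0)

print(clipped)                            # [1.0, -1.0, 0.3]
print(cosine_similarity(grads, clipped))  # below 1.0, so the direction has shifted
```

The two large components were cut down while the small one was left alone, so the small component now carries more weight in the update than backpropagation intended.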
Clip by norm
Clip by norm scales the entire gradient vector when its overall magnitude exceeds a threshold.
Instead of looking at individual values, it computes the norm of all gradients together (usually the L2 norm) and compares it to a maximum value. If the norm is below the threshold, nothing happens. If it's above, every gradient gets multiplied by the same scaling factor to bring the norm back down to the limit.

Figure: Clip by norm example
The advantage is direction preservation. Since every component shrinks by the same factor, the gradient vector still points in the original direction. You're just shortening the step, not redirecting it.
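This is why clip by norm became the standard. PyTorch's clip_grad_norm_ implements this method across all parameters, and TensorFlow offers it through global_clipnorm (its clipnorm argument applies the same rule to each gradient tensor separately). Most modern training pipelines use norm clipping by default.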
Exploding Gradients vs Vanishing Gradients
Exploding and vanishing gradients are both common problems in deep learning, but gradient clipping solves only one of them.
Exploding gradients
Exploding gradients happen when gradient values grow too large during backpropagation.
This usually shows up in deep networks or recurrent architectures, where gradients get multiplied across many layers or time steps. If those multiplications compound in the wrong direction, the gradient magnitude blows up. The optimizer then makes a huge parameter update, weights jump to extreme values, and the loss often turns into NaN or Inf.
You'll see it as sudden loss spikes or a model that diverges out of nowhere.
Vanishing gradients
Vanishing gradients are the opposite problem. Gradient values shrink toward zero as they propagate backward through the network.
When gradients get too small, weight updates become tiny. Early layers stop learning, deeper layers learn slowly, and training practically stops. The loss curve flattens out and doesn’t improve, even after many epochs.
This was the main reason RNNs struggled with long sequences before LSTMs and GRUs came along.
Where gradient clipping fits
Gradient clipping addresses exploding gradients, not vanishing gradients.
Clipping shrinks gradients that are too large, but it does nothing when gradients are too small. For vanishing gradients, you need better weight initialization, residual connections, batch normalization, or architectures designed to preserve gradient flow.
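A toy calculation shows why clipping helps on one side only. This sketch assumes the gradient is multiplied by the same factor at every layer, which is a simplification. The factors 1.5 and 0.5 and both function names are hypothetical.

```python
def backprop_magnitude(per_layer_factor, n_layers, start=1.0):
    """Gradient magnitude after being multiplied by the same factor at every layer."""
    return start * per_layer_factor ** n_layers

def clip_scalar(grad, max_norm):
    """Norm clipping for a single value: shrink if too large, otherwise leave it alone."""
    if abs(grad) > max_norm:
        return grad * max_norm / abs(grad)
    return grad

exploding = backprop_magnitude(1.5, 50)  # factor above 1 compounds upward
vanishing = backprop_magnitude(0.5, 50)  # factor below 1 compounds toward zero

print(exploding, "->", clip_scalar(exploding, 1.0))  # pulled back to about 1.0
print(vanishing, "->", clip_scalar(vanishing, 1.0))  # unchanged, still tiny
```

Clipping only ever multiplies by a factor of at most 1, so it has no way to rescue a gradient that is already close to zero.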
Gradient Clipping by Norm Explained
Clipping by norm is the method most readers actually want when they search for gradient clipping.
The process has three steps. First, compute the norm of all gradients combined. Second, compare that norm against your chosen threshold. Third, rescale the gradients if the norm is too large.
The norm is usually the L2 norm, which means you square every gradient value, sum them up, and take the square root. If you have gradients g_1, g_2, ..., g_n across all your model parameters, the L2 norm is:

||g|| = sqrt(g_1^2 + g_2^2 + ... + g_n^2)
Once you have the norm, you compare it to your threshold c. If ||g|| <= c, the gradients pass through unchanged. If ||g|| > c, every gradient gets multiplied by the scaling factor c / ||g||. This brings the new norm down to exactly c.
This matters because every component shrinks by the same factor. The relative proportions between gradient values stay unchanged, which means the vector still points in the original direction. You're shortening the step the optimizer takes, not changing where it goes.
That direction-preserving property is what makes norm clipping the default choice. Clip by value can twist the gradient vector into a new direction. Clip by norm only changes its length.
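PyTorch's clip_grad_norm_ does exactly this across all parameters, and so does TensorFlow's global_clipnorm. TensorFlow's clipnorm applies the same rule to each gradient tensor separately. When someone says "I'm using gradient clipping," they almost always mean clipping by norm.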
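The rule above can be worked through with concrete numbers. This is a plain-Python sketch, not the PyTorch or TensorFlow source. The names `l2_norm` and `clip_by_norm` and the example vector are assumptions.

```python
import math

def l2_norm(grads):
    """Square every value, sum them, take the square root."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_norm(grads, c):
    """Return grads unchanged if ||g|| <= c, else scale every element by c / ||g||."""
    norm = l2_norm(grads)
    if norm <= c:
        return list(grads)
    return [g * (c / norm) for g in grads]

g = [3.0, 4.0]                   # ||g|| = 5.0, above the threshold c = 1.0
clipped = clip_by_norm(g, 1.0)   # every element is scaled by 1.0 / 5.0

print(clipped)                   # roughly [0.6, 0.8]
print(l2_norm(clipped))          # roughly 1.0
print(g[0] / g[1], clipped[0] / clipped[1])  # same ratio before and after
```

The ratio between the two components stays at 0.75, which is the direction-preserving property in action.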
Choosing a Gradient Clipping Threshold
The threshold is a hyperparameter, which means there's no universal value that works for every model.
If you set it too high, clipping almost never activates. Even the spikes stay below the limit, so the safety net never catches anything. Training proceeds as if clipping weren't there, and you'll still see loss spikes when gradients explode.
If you set it too low, you clip too aggressively. Every batch gets its gradients shrunk, which makes weight updates smaller than they should be. Learning slows down and your model takes longer to converge, sometimes much longer.
A common starting point is 1.0, which works well for many architectures. Values between 0.5 and 5.0 cover most practical use cases.
The better approach is to monitor your gradient norms during training. Log the unclipped norm at each step and look at the distribution. If most norms sit around 0.3 with occasional spikes to 50, set the threshold somewhere above the typical range but well below the spikes - 2.0 or 3.0 would be reasonable here.
Treat it like any other hyperparameter. Start with 1.0, watch what happens, and adjust based on training behavior.
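If you log unclipped norms, a few summary numbers make the choice easier. Here's a sketch that assumes you've collected the norms into a Python list. The logged values are made up to mirror the example above, and `summarize_grad_norms` is a hypothetical helper, not a library function.

```python
import statistics

def summarize_grad_norms(norms, threshold):
    """Summarize logged, unclipped gradient norms against a candidate threshold."""
    clipped_fraction = sum(n > threshold for n in norms) / len(norms)
    return {
        "median": statistics.median(norms),
        "max": max(norms),
        "clipped_fraction": clipped_fraction,
    }

# Hypothetical log: typical norms near 0.3 with two spikes around 50.
logged_norms = [0.25, 0.3, 0.28, 0.35, 0.31, 50.0, 0.29, 0.33, 0.27, 0.3,
                0.32, 0.26, 0.3, 0.34, 48.0, 0.31, 0.29, 0.3, 0.28, 0.33]

for threshold in (0.1, 2.0, 100.0):
    print(threshold, summarize_grad_norms(logged_norms, threshold))
```

A threshold of 0.1 clips every step and 100.0 clips none, while 2.0 touches only the two spikes. The middle case is the behavior you're aiming for.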
Gradient Clipping in Recurrent Neural Networks
RNNs are where gradient clipping first became a standard technique.
The reason is how RNNs propagate gradients through time. Backpropagation through time multiplies the same weight matrices across many time steps, and those repeated multiplications can compound into massive values. Long sequences make the problem worse.
LSTMs and GRUs reduced the issue with their gating mechanisms, but they didn't get rid of it. Both architectures still benefit from clipping, especially when training on long sequences or with high learning rates.
For RNN training, clip by norm with a threshold between 1.0 and 5.0 is the typical default. If you're using PyTorch's nn.LSTM or nn.GRU and your loss explodes during training, adding clip_grad_norm_ is usually the first thing to try.
Gradient Clipping in Modern Deep Learning
Gradient clipping didn't go away when transformers replaced RNNs.
Large language models like GPT and BERT use clipping during pretraining and fine-tuning. The same applies to vision transformers, diffusion models, and most deep architectures with hundreds of layers. The Adam and AdamW optimizers, which dominate modern training, are often paired with norm clipping at thresholds around 1.0.
The reason is the same as it was for RNNs. Deep networks multiply gradients across many layers, and large batch sizes combined with high learning rates can produce occasional gradient spikes. Clipping handles those spikes without affecting normal training steps.
Most reference implementations include clipping by default. Hugging Face's Trainer, PyTorch Lightning, and DeepSpeed all expose clipping as a standard config option. If you're training anything bigger than a small toy model, clipping is almost certainly part of the pipeline.
It's a one-line addition that costs almost nothing and prevents training runs from crashing after hours of compute. That's why it stuck around.
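For reference, these are the option names I'd expect in the three libraries mentioned above, written as plain Python dictionaries so the snippet runs without installing any of them. Treat the names as assumptions to verify against the version you have installed, since the dictionaries here aren't passed to any real API.

```python
# Keyword argument for Hugging Face's transformers.TrainingArguments(...)
hf_training_args = {"max_grad_norm": 1.0}

# Keyword arguments for PyTorch Lightning's Trainer(...)
lightning_trainer_args = {
    "gradient_clip_val": 1.0,
    "gradient_clip_algorithm": "norm",  # "value" would select clip by value
}

# Entry in a DeepSpeed JSON config file
deepspeed_config = {"gradient_clipping": 1.0}

for name, config in [("Hugging Face", hf_training_args),
                     ("Lightning", lightning_trainer_args),
                     ("DeepSpeed", deepspeed_config)]:
    print(name, config)
```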
Gradient Clipping in PyTorch
PyTorch handles gradient clipping with a single utility function: torch.nn.utils.clip_grad_norm_.
The clipping call goes between loss.backward() and optimizer.step(). Backpropagation needs to fill in the gradients first, then clipping shrinks them if needed, then the optimizer applies the update. Call it before loss.backward() and there are no fresh gradients to clip. Call it after optimizer.step() and the oversized update has already been applied.
Here's a complete, runnable training script that trains a small MLP on synthetic regression data with gradient clipping enabled:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)

# Synthetic regression data
n_samples = 1000
n_features = 20
inputs = torch.randn(n_samples, n_features)
targets = (inputs.sum(dim=1, keepdim=True) * 2.0 + torch.randn(n_samples, 1) * 0.1)
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# Small feedforward network
model = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Training loop with gradient clipping
n_epochs = 5
max_grad_norm = 1.0
for epoch in range(n_epochs):
    epoch_loss = 0.0
    for batch_inputs, batch_targets in dataloader:
        optimizer.zero_grad()
        predictions = model(batch_inputs)
        loss = loss_fn(predictions, batch_targets)
        loss.backward()
        # Gradient clipping: after backward(), before step()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch + 1}: loss = {epoch_loss / len(dataloader):.4f}")

Figure: PyTorch output
The clip_grad_norm_ function takes two main arguments:
- parameters: the model parameters whose gradients you want to clip. Pass model.parameters() to cover the whole model.
- max_norm: the threshold for the gradient norm. A value of 1.0 is a common starting point.
There's an optional norm_type argument that defaults to 2.0 for L2 norm. You'll rarely need to change it.
The trailing underscore in clip_grad_norm_ signals an in-place operation. The function modifies the gradients directly inside the .grad attribute of each parameter, so you don't need to keep track of the return value. It does return the total norm of the gradients before clipping, which is handy if you want to log it.
For clip-by-value instead of clip-by-norm, PyTorch has torch.nn.utils.clip_grad_value_:
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
But as discussed earlier, you’ll rarely (if ever) use this implementation.
That's the entire setup: one line added to your training loop.
Gradient Clipping in TensorFlow
TensorFlow handles clipping at the optimizer level instead of as a separate function call.
When you create an optimizer, you pass clipnorm or clipvalue as an argument. The optimizer applies clipping internally during each step, so you don't need to modify your training loop at all.
Here's a full working example using the Keras API on synthetic regression data:
import numpy as np
import tensorflow as tf

tf.random.set_seed(42)
np.random.seed(42)

# Synthetic regression data
n_samples = 1000
n_features = 20
x_train = np.random.randn(n_samples, n_features).astype(np.float32)
y_train = (x_train.sum(axis=1, keepdims=True) * 2.0
           + np.random.randn(n_samples, 1).astype(np.float32) * 0.1)

# Small feedforward network (an explicit Input layer declares the input shape)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Optimizer with gradient clipping by norm
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="mse")
model.fit(x_train, y_train, epochs=5, batch_size=32)

Figure: TensorFlow output
The two arguments do different things:
- clipnorm clips by the L2 norm of each gradient tensor. If the norm exceeds the threshold, the tensor gets proportionally scaled down.
- clipvalue individually clips each gradient element. Any value above the threshold gets clamped to the threshold, and any value below the negative threshold gets clamped to the negative threshold.
To switch from norm clipping to value clipping, just swap the argument:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipvalue=0.5)
Both arguments work with every Keras optimizer: Adam, SGD, RMSprop, AdamW, and the rest. There's also a global_clipnorm argument that clips based on the norm computed across all gradients combined, rather than per-tensor. This matches PyTorch's default behavior more closely.
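The difference between per-tensor and global norm clipping is easier to see with numbers. This is a plain-Python emulation of the two behaviors as described above, not Keras code. The function names and the two example gradient tensors are assumptions.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def clip_per_tensor(grad_tensors, max_norm):
    """Emulates per-tensor norm clipping: each tensor is judged on its own norm."""
    clipped = []
    for tensor in grad_tensors:
        norm = l2_norm(tensor)
        scale = max_norm / norm if norm > max_norm else 1.0
        clipped.append([v * scale for v in tensor])
    return clipped

def clip_global(grad_tensors, max_norm):
    """Emulates global norm clipping: one norm over all tensors, one shared scale."""
    norm = l2_norm([v for tensor in grad_tensors for v in tensor])
    scale = max_norm / norm if norm > max_norm else 1.0
    return [[v * scale for v in tensor] for tensor in grad_tensors]

# Two hypothetical (flattened) gradient tensors: one large, one small.
grads = [[6.0, 8.0], [0.3, 0.4]]

print(clip_per_tensor(grads, 1.0))  # only the first tensor shrinks
print(clip_global(grads, 1.0))      # both shrink by the same factor
```

Per-tensor clipping changes the balance between layers, because the large tensor shrinks and the small one doesn't. Global clipping keeps the whole update pointing in its original direction.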
If you're writing a custom training loop with tf.GradientTape, the optimizer still handles clipping when you call apply_gradients:
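batch_size = 32
# np.array_split takes a number of chunks, not a batch size
n_batches = int(np.ceil(len(x_train) / batch_size))
for epoch in range(5):
    for batch_x, batch_y in zip(np.array_split(x_train, n_batches),
                                np.array_split(y_train, n_batches)):
        with tf.GradientTape() as tape:
            predictions = model(batch_x, training=True)
            loss = tf.reduce_mean(tf.square(predictions - batch_y))
        gradients = tape.gradient(loss, model.trainable_variables)
        # clipnorm / clipvalue set on the optimizer are applied inside apply_gradients
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))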
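That's the difference between the two frameworks. PyTorch puts clipping in your hands inside the loop. TensorFlow pushes it into the optimizer itself. The underlying logic is the same as long as you compare like with like: clip_grad_norm_ corresponds to global_clipnorm, while clipnorm clips each tensor separately.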
Gradient Clipping vs Other Stabilization Techniques
Gradient clipping isn't the only way to stabilize training, and it's not always the right tool for the job.
Other techniques handle related but different problems. Some prevent gradients from growing too large in the first place, others keep them from vanishing, and some just make the loss surface easier to optimize. Let me show you a couple of different techniques.
Batch normalization
Batch normalization normalizes activations within each mini-batch during training.
It keeps layer outputs in a stable range, which makes gradient magnitudes more predictable. Networks trained with batch norm tolerate higher learning rates and converge faster, and they're less sensitive to weight initialization choices.
But batch norm doesn't directly stop gradient explosions. It reduces how often they happen, but it does nothing about a spike once one occurs. Many models still pair batch norm with gradient clipping for that reason.
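Here's a minimal sketch of the normalization step itself, for one feature across one mini-batch. It leaves out the learnable scale and shift parameters and the running statistics that real batch norm layers keep. The function name and input values are assumptions.

```python
import math

def batch_norm(activations, eps=1e-5):
    """Normalize one feature across a mini-batch to roughly zero mean and unit variance."""
    mean = sum(activations) / len(activations)
    variance = sum((a - mean) ** 2 for a in activations) / len(activations)
    return [(a - mean) / math.sqrt(variance + eps) for a in activations]

batch = [120.0, 95.0, 110.0, 130.0]  # hypothetical large pre-normalization activations
normalized = batch_norm(batch)

print(normalized)  # values now centered on 0 with a spread of about 1
```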
Residual connections
Residual connections add shortcut paths that skip over one or more layers, letting gradients flow directly from later layers to earlier ones.
This solves the vanishing gradient problem in deep networks. Without residual connections, training networks with more than 20-30 layers becomes hard because gradients shrink toward zero as they propagate backward. With them, networks with hundreds of layers train without issue.
Residual connections target the opposite end of the gradient problem from clipping. Clipping handles gradients that are too large. Residuals handle gradients that get too small.
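A scalar toy model gives a feel for why the shortcut helps. This sketch assumes every layer has the same small local derivative (0.1), which is a hypothetical simplification. Real networks deal with matrices and varying values.

```python
def gradient_through_plain_stack(local_grads):
    """Chain rule through y = f_n(...f_1(x)): the product of the local derivatives."""
    total = 1.0
    for g in local_grads:
        total *= g
    return total

def gradient_through_residual_stack(local_grads):
    """Chain rule through blocks y = x + f(x): each block contributes (1 + f'(x))."""
    total = 1.0
    for g in local_grads:
        total *= 1.0 + g
    return total

local_grads = [0.1] * 30  # 30 layers, each with local derivative 0.1

print(gradient_through_plain_stack(local_grads))     # effectively zero
print(gradient_through_residual_stack(local_grads))  # stays well away from zero
```

The identity path contributes the constant 1 in each factor, so the product can't collapse to zero the way it does in the plain stack. In this toy the residual product actually grows past 1, which is one reason residual networks are still often trained with clipping.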
Careful weight initialization
The initial values of your weights set the starting magnitude of activations and gradients. Bad initialization can cause gradients to explode or vanish from the very first step.
Methods like Xavier and He initialization scale initial weights based on layer size. This keeps activation variances stable across layers at the start of training, which prevents many gradient problems before they happen.
Good initialization reduces the chance you'll need clipping, but it doesn't eliminate it. Gradient spikes can still appear later in training, especially with high learning rates or unusual batches.
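The scaling rules are short enough to write out. This sketch uses the commonly cited normal-distribution variants of Xavier and He initialization. The helper names and the layer size of 64 are assumptions.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He (Kaiming) normal initialization, aimed at ReLU layers."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, seed=0):
    """Sample a fan_out x fan_in weight matrix from a zero-mean normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

print(xavier_std(64, 64))  # 0.125
print(he_std(64))          # about 0.177

weights = init_weights(64, 64, he_std(64))
print(len(weights), len(weights[0]))  # 64 64
```

Wider layers get smaller initial weights, which is how these schemes keep activation variance from growing or shrinking with layer size.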
How they fit together
The techniques listed above aren't alternatives. They're complementary tools that solve different parts of the same overall problem.
A typical modern training setup uses careful initialization at the start, residual connections in the architecture, batch normalization (or layer normalization) inside the network, and gradient clipping as a safety net during optimization. Each one handles a specific failure mode, and together they make deep networks trainable.
Conclusion
Gradient clipping is one of the simplest fixes in deep learning, and it solves a problem that can ruin hours of training in a single step.
The good news is that you don't need to change your model architecture or rewrite your training code. One line in PyTorch or one argument in TensorFlow is enough to implement gradient clipping.
It works best as part of a larger setup. Pair it with careful weight initialization, residual connections, and batch or layer normalization, and you'll have a training pipeline that handles instability from multiple angles.
If your loss is exploding, start with clipping. If it's vanishing, look elsewhere. And if you're training anything bigger than a small model, add clipping to your pipeline by default and forget about it.
Gradient clipping is just one of many terms every machine learning engineer must know. If you want to learn the others and get job-ready in 2026, enroll in our Machine Learning Engineer track today.
Gradient Clipping FAQs
What is gradient clipping in deep learning?
Gradient clipping is a technique that limits the size of gradients during neural network training to prevent unstable parameter updates. When gradients grow too large during backpropagation, the optimizer takes huge steps that push weights into bad regions and cause the loss to explode. Clipping caps the gradient magnitude before the optimizer step, so updates stay bounded even when the raw gradients spike.
When should I use gradient clipping?
Use gradient clipping whenever your training is unstable, especially if you see sudden loss spikes, NaN values, or divergence after hours of training. It's standard practice for recurrent networks like LSTMs and GRUs, and any deep architecture trained with high learning rates. If your loss curve looks healthy, you can skip it, but adding it as a safety net doesn’t hurt.
What's the difference between clip by value and clip by norm?
Clip by value individually caps each gradient element, which can change the direction of the gradient vector. Clip by norm scales the entire vector when its overall magnitude exceeds a threshold, preserving the original direction while shortening the step. Clip by norm is the standard choice in modern deep learning because it doesn't distort the update direction.
How do I choose a good gradient clipping threshold?
Start with 1.0 and adjust based on what you see during training. Log your unclipped gradient norms across batches and look at the distribution. Set the threshold above the typical norm range but well below the spikes you want to catch. When set too high, the clipping never activates, and when set too low, you'll unnecessarily slow down learning.
Does gradient clipping fix vanishing gradients too?
No. Gradient clipping only addresses exploding gradients, where values grow too large. Vanishing gradients are the opposite problem, where values shrink toward zero and learning practically stops. For vanishing gradients, use better weight initialization, residual connections, batch normalization, or architectures like LSTMs and GRUs.


