Activation functions decide which signals pass through a neural network and which don't. When you choose the wrong one, your model either learns too slowly or fails to generalize. ReLU was the reasonable default choice for years because it was fast and good enough for most tasks.
GELU (Gaussian Error Linear Unit) changed that. It's now the activation function behind some of the most capable models ever built, including BERT and GPT.
In this article, I'll cover the intuition behind GELU, its formula, how it compares to other activation functions, and where you'd actually use it in practice.
If you’re completely new to activation functions in machine learning, read our Beginner’s Guide to the Rectified Linear Unit (ReLU) blog post.
What Is the GELU Activation Function?
GELU, or Gaussian Error Linear Unit, is an activation function that weights inputs based on their magnitude using a smooth and probabilistic approach.
Most activation functions make the decision to either pass the signal through or block it. ReLU, for example, zeroes out anything negative and passes everything else unchanged. GELU works differently. Instead of a hard cutoff, it scales inputs smoothly based on how large or small they are, which means even small negative values can still contribute to the output.
The difference from ReLU is that GELU is smooth and continuous everywhere. There's no sharp corner at zero and no abrupt transitions. That smoothness can matter during training because it gives the optimizer cleaner gradient information to work with.
Intuition Behind GELU
Think of GELU as a filter that doesn't treat all inputs the same way.
ReLU is blunt - anything negative gets zeroed out, every time. On the other hand, GELU asks “how likely is this input value to be useful?” Values that are clearly large and positive pass through almost unchanged. Values that are small or negative get scaled down, not entirely cut.
As a result, you end up with a smooth curve that suppresses less relevant signals without completely discarding them.
Imagine you're reviewing a stack of job applications. A strict filter would remove anyone without a degree, without exception. A smarter filter would still consider candidates who come close, since they may have relevant experience that compensates. GELU works like the smarter filter: it doesn't make strict cuts, but instead weighs each input based on its magnitude and decides how much of it to let through.
This gradual and probabilistic scaling is what makes GELU different. There are no sharp transitions and no dead neurons - just a smooth pass-or-suppress decision made for every input value.
GELU Formula
The exact GELU formula is built on the Gaussian cumulative distribution function (CDF), written as:

GELU(x) = x · Φ(x) = x · ½ (1 + erf(x / √2))
where x is the input value and Φ(x) is the probability that a random variable drawn from a standard normal distribution is less than or equal to x. In plain English, Φ(x) tells you how "normal" or expected the input value is - and that probability is what GELU uses to scale the input.
The higher the input, the closer Φ(x) gets to 1, which means the input passes through almost unchanged. The lower the input, the closer Φ(x) gets to 0, which means the input gets suppressed.
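To make this concrete, here's a minimal pure-Python sketch of the exact formula using the standard library's error function, compared against ReLU (the helper names here are mine, not a framework API):

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # computed via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x = {x:+.1f}   relu = {relu(x):+.4f}   gelu = {gelu(x):+.4f}")
```

Notice that at x = -0.5, ReLU outputs exactly 0 while GELU outputs roughly -0.15: the negative input is suppressed, not discarded.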
The approximation used in practice
The problem with the exact formula is that computing Φ(x) is expensive. It involves the error function, which doesn't have a simple closed form and is slow to compute at scale.
Deep learning frameworks use this approximation instead:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
This approximation uses tanh, which is fast and well-supported on modern hardware. The result is nearly identical to the exact formula across the input range that matters in practice, which is why frameworks like PyTorch and TensorFlow use it by default.
Now, of course, you don't need to memorize either formula. But knowing that the approximation exists - and why - helps you understand what's actually happening when you call GELU in your code.
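If you're curious just how close the two are, here's a quick check (a pure-Python sketch, not framework code) that measures the largest gap between the exact formula and the tanh approximation over a practical input range:

```python
import math

def gelu_exact(x):
    # Exact GELU via the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh-based approximation most frameworks implement
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

# Compare over a dense grid covering [-5, 5]
xs = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - approx| on [-5, 5]: {max_err:.2e}")
```

The maximum deviation stays well below 10⁻³, which is negligible next to the noise inherent in training.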
GELU vs Other Activation Functions
Each activation function handles inputs differently, and those differences show up in how well your model trains.
Here’s how the differences look visually before we walk through them:

GELU versus other activation functions graph
Sigmoid
Sigmoid squashes all inputs into a range between 0 and 1. It's smooth, but it has a well-known problem: vanishing gradients. For inputs that are very large or very small, the gradient gets close to zero, which means deeper layers stop learning. GELU doesn't have this problem because its gradient stays meaningful across a wider input range.
Tanh
Tanh is similar to Sigmoid but centered at zero, with outputs between -1 and 1. It handles negative inputs better than Sigmoid, but it still suffers from vanishing gradients at the extremes. GELU produces a smoother output curve with better gradient flow through deep networks.
ReLU
ReLU is fast and simple: positive inputs pass through unchanged, negative inputs get zeroed out. The sharp cutoff at zero is what causes the dying neuron problem - neurons that consistently receive negative inputs eventually stop updating entirely. GELU avoids this by scaling negative inputs instead of cutting them off.
Leaky ReLU
Leaky ReLU fixes the dying neuron problem by letting a small fraction of negative inputs through. It's a step up from ReLU, but the transition at zero is still sharp. GELU produces a smoother curve overall, which tends to work better in deep architectures where gradient quality matters more.
So, to summarize, here are the differences between these five activation functions:

| Function | Output range | Smooth at zero | Negative inputs | Main drawback |
| --- | --- | --- | --- | --- |
| Sigmoid | (0, 1) | Yes | Squashed toward 0 | Vanishing gradients |
| Tanh | (-1, 1) | Yes | Passed, symmetric | Vanishing gradients at extremes |
| ReLU | [0, ∞) | No | Zeroed out | Dying neurons |
| Leaky ReLU | (-∞, ∞) | No | Small fixed slope | Sharp transition at zero |
| GELU | (≈ -0.17, ∞) | Yes | Smoothly scaled | Higher compute cost |
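These differences are easy to verify numerically. Here's a small sketch (pure Python, with helper names of my own choosing) evaluating each function at a few sample inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the conventional small negative slope
    return x if x > 0 else alpha * x

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(f"x = {x:+.1f}  sigmoid = {sigmoid(x):+.4f}  tanh = {math.tanh(x):+.4f}  "
          f"relu = {relu(x):+.4f}  leaky = {leaky_relu(x):+.4f}  gelu = {gelu(x):+.4f}")
```

Look at the negative inputs: ReLU flattens them all to zero, Leaky ReLU passes a fixed small fraction, and GELU scales each one by how improbable it is under a standard normal.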
Why GELU Is Used in Transformers
Transformers are just deep neural networks. And the deeper your network, the more gradient quality matters.
Models like BERT and GPT stack dozens of layers on top of each other. At that depth, small problems with gradient flow get compounded. If your activation function produces unstable or near-zero gradients in certain regions, the earlier layers in the network barely update during training, which means they don't learn much.
GELU avoids this by keeping gradients smooth and non-zero across a wider input range. There's no cutoff like ReLU's zero boundary, so the optimizer gets cleaner signal at every layer, not just the ones near the output.
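You can see this with a quick finite-difference check (a sketch; in a real model the framework's autograd computes these gradients):

```python
import math

def gelu(x):
    # Exact GELU via the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def grad(f, x, h=1e-5):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -1.0, -0.1):
    print(f"x = {x:+.1f}   relu' = {grad(relu, x):+.4f}   gelu' = {grad(gelu, x):+.4f}")
```

ReLU's gradient is exactly zero for every negative input, while GELU's stays nonzero, so neurons sitting in the negative region still receive an update signal.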
There's an additional reason GELU fits well in transformer architectures.
Transformers process inputs through attention mechanisms that produce a wide range of activation values - both positive and negative. A smooth activation function handles that range better than one with sharp transitions.
When the original BERT paper was published, the authors chose GELU over ReLU and reported better results on their benchmarks. GPT followed the same choice. Since then, GELU has become the default activation in most transformer-based architectures, not because it's new, but because it works better at the scale these models operate at.
GELU in Practice
Using GELU in your models is as easy as using any other activation function. Both PyTorch and TensorFlow have built-in support.
PyTorch
In PyTorch, you can apply GELU as a standalone module or inline inside a model definition. Here's a simple feedforward block using GELU:
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

block = FeedForwardBlock(input_dim=512, hidden_dim=2048)
x = torch.randn(8, 512)
output = block(x)
nn.GELU() is between the two linear layers, which is exactly where you'd find it in a transformer's feedforward sublayer. The activation runs after the first projection and before the second one.
TensorFlow
In TensorFlow, GELU is available through the Keras API:
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(2048, input_shape=(512,)),
    keras.layers.Activation("gelu"),
    keras.layers.Dense(512)
])

x = tf.random.normal((8, 512))
output = model(x)
You can also pass it directly as a string argument to a Dense layer:
keras.layers.Dense(2048, activation="gelu")
Both approaches produce the same result.
Where GELU fits in a network
GELU belongs in the same place as any other activation function - right after a linear transformation and before the next layer. In transformer architectures, that means inside the feedforward sublayer, between the two dense projections. In other deep networks, you place it after your linear or convolutional layer, and let it scale the output before passing it forward.
Advantages of GELU
If you’re still reading at this point, you know the biggest selling points of GELU when compared to other activation functions. Here’s a short recap:
- Smooth activation: GELU produces a continuous, differentiable curve with no sharp transitions, which gives the optimizer cleaner information to work with at every step.
- Better gradient flow: GELU doesn't zero out negative inputs, so gradients can still propagate through neurons that receive negative values. This reduces the risk of neurons going dead during training.
- Better performance in deep models: In deep architectures like transformers, the cumulative effect of smoother gradients tends to translate into better training results compared to simpler activation functions.
Limitations of GELU
GELU isn't the right choice for every situation. Here are a few limitations you should be aware of:
- More expensive to compute than ReLU: GELU involves either an error function or a tanh-based approximation, both of which cost more than ReLU's simple threshold operation. In large models with many layers, this can add up.
- Less intuitive: Functions like ReLU are easy to reason about - positive values pass, negative values don't. GELU's probabilistic scaling is harder to interpret.
- Not always necessary: For shallow networks or simpler tasks, GELU doesn't offer meaningful advantages. ReLU or Leaky ReLU will often perform just as well at a lower computational cost.
To conclude, if you're building a transformer or another deep architecture, GELU is a solid default. For everything else, benchmark before committing to it.
Conclusion
GELU isn't a universal upgrade, nor is it a one-size-fits-all solution that replaces ReLU. It's a deliberate design choice that’s worth it in specific contexts - think deep networks and transformer models.
If you're working with BERT, GPT, or any transformer-based model, you're already using GELU whether you realized it or not. Now you know why it's there.
For everything else, the choice of activation function comes down to trade-offs. No single function wins every time, and understanding what each one does is how you make that call with confidence rather than habit.
If you still find the differences between activation functions confusing, enroll in our Machine Learning Engineer Track to get career-ready in machine learning and MLOps.
FAQs
What is the GELU activation function?
GELU, or Gaussian Error Linear Unit, is an activation function used in neural networks. Unlike ReLU, GELU smoothly scales inputs based on their magnitude using a probabilistic approach. This makes it a better fit for deep architectures where gradient quality matters.
How is GELU different from ReLU?
ReLU zeroes out all negative inputs with a hard cutoff at zero, which can cause neurons to stop learning - a problem known as dying neurons. GELU avoids this by gradually scaling negative inputs down instead of cutting them off. You end up with a smoother gradient flow and better performance in deep networks.
When should I use GELU over other activation functions?
GELU works best in deep architectures, especially transformer-based models like BERT and GPT. For shallow networks or simpler tasks, the computational overhead rarely justifies the switch from ReLU. Start with ReLU as a baseline, and benchmark alternatives against it from there.
What is the difference between the exact GELU formula and the approximation?
The exact GELU formula uses the Gaussian cumulative distribution function, which requires computing the error function - an operation that's slow at scale. The approximation replaces it with a tanh-based expression that's faster and well-supported on modern hardware. In practice, the two produce nearly identical results, which is why most frameworks use the approximation by default.
Does GELU work in both PyTorch and TensorFlow?
Yes, both frameworks support GELU. In PyTorch, you can use nn.GELU() as a module inside your model definition. In TensorFlow, you can pass "gelu" as a string to any layer's activation argument, or use keras.layers.Activation("gelu").


