
ResNet Architecture: Residual Networks and Skip Connections

A deep dive into ResNet architecture, covering how residual learning and skip connections solve the vanishing gradient and degradation problems that make training deep neural networks difficult.
11 Apr 2026 · 12 min read

Deeper neural networks should perform better. But in practice, that’s not always the case.

After a certain depth, accuracy can actually start to drop. Not because the model is overfitting, but because training itself breaks down. Gradients tend to vanish before they reach the early layers, and those layers stop learning. You might assume that adding more layers would fix it, but it often just makes things worse.

ResNet fixed this with one core idea: skip connections. Instead of forcing every layer to learn a full transformation from scratch, it lets the network skip over layers and add the input directly to the output.

In this article, I'll cover how ResNet works, what its architecture looks like, and why it's still a go-to choice in modern deep learning.

Want to see ResNet in practice? Solve our Image classification with ResNet exercise, part of the Deep Learning for Images with PyTorch course.

What Is ResNet Architecture?

ResNet - short for Residual Network - is a neural network architecture designed to make training deep networks practical.

The idea was introduced by Microsoft Research back in 2015. The architecture uses residual connections to work around the training problems that limited deep networks at the time. The idea was simple, but its impact was not: for the first time, you could reliably train networks with 50, 101, or even 152 layers without watching performance degrade.

Before ResNet, going that deep wasn't really an option.

Why Deep Networks Are Hard to Train

More layers should mean more chance for a network to learn. In practice, past a certain depth, things start breaking down.

There are two problems at play here.

The first is the vanishing gradient problem. Neural networks learn by sending error signals backward through the network - a process called backpropagation. Each layer adjusts its weights based on that signal. But as the signal travels back through many layers, it gets multiplied by small numbers over and over, and shrinks. By the time it reaches the early layers, there's almost nothing left. Those layers stop updating, which means they stop learning.
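To get a feel for the scale of the problem, here's a toy back-of-the-envelope sketch (plain Python, not a real network): if each layer scales the backward signal by a factor below 1, the signal shrinks geometrically with depth.

```python
# Toy illustration: a gradient that is scaled by a factor below 1 at every
# layer shrinks geometrically with depth.
def backprop_signal(depth, layer_scale=0.5):
    """Return the gradient magnitude left after `depth` layers."""
    signal = 1.0
    for _ in range(depth):
        signal *= layer_scale
    return signal

print(backprop_signal(5))   # 0.03125 - still usable
print(backprop_signal(50))  # ~8.9e-16 - effectively zero
```

Real networks don't multiply by a fixed constant, of course, but the compounding effect is the same: the deeper the network, the weaker the signal reaching its first layers.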

The second is the degradation problem. This one is counterintuitive. You'd expect a 56-layer network to perform at least as well as a 20-layer one - it has more capacity, after all. But researchers found the opposite to be true. The deeper network performed worse, even on training data. That rules out overfitting as the cause. The model isn't memorizing too much. Instead, it's having a hard time optimizing.

This is the key distinction. These aren't generalization problems you can fix with dropout or regularization. They're optimization problems - the network can't find good weights in the first place.

ResNets were designed to solve these two problems. Let me show you how.

The Core Idea: Residual Learning

Traditional neural networks try to learn a direct mapping from input to output. Each layer looks at what came in and tries to figure out what should come out. That works fine for shallow networks. But as you go deeper, you run into the two problems discussed earlier.

ResNet takes a different approach. Instead of asking each block to learn the full mapping, it asks a simpler question: what do I need to add to the input to get the right output?

That difference is called the residual.

So instead of learning the full mapping directly:

output = H(input)

The network learns:

output = F(input) + input

Where F(input) is the residual - the small correction the network needs to make. If the layer doesn't need to change anything, it can just push F(input) toward zero and pass the input through unchanged.

This might sound like a small tweak. But it changes what the network has to learn. Learning a small correction is a much easier optimization problem than learning a full transformation from scratch, and that's what makes deeper networks trainable.
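A minimal sketch of the idea in plain Python - `residual_block` and the toy corrections here are illustrative, not part of any real framework:

```python
def residual_block(x, f):
    """Output of a residual block: the input plus a learned correction f(x)."""
    return x + f(x)

# If the correction is non-zero, the block refines the input:
print(residual_block(2.0, lambda x: 0.1 * x))  # 2.2

# If the best thing to do is nothing, the block only has to drive f toward
# zero - the input then passes through unchanged:
print(residual_block(2.0, lambda x: 0.0))      # 2.0
```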

What Are Skip Connections in ResNet?

A skip connection is exactly what it sounds like - a direct path that bypasses one or more layers and feeds the input to a later point in the network.

In a traditional network, data flows through each layer in sequence. Every layer transforms the input and passes the result to the next one. Skip connections take the original input and add it directly to the output of a layer further down the block.

Here's a simple way to picture it:

Skip connection graph example

The input x travels two paths at once. One path goes through the convolutional layers, which learn the residual F(x). The other path skips those layers and connects to the addition step. The final output is F(x) + x.

This shortcut does something important for training. During backpropagation, gradients can travel back through the skip connection without passing through the intermediate layers. That gives the early layers a cleaner, stronger signal to learn from - which is exactly what was missing in deep networks before ResNet.

Structure of a ResNet Block

A residual block is the repeating unit that makes up a ResNet. If you understand one block, you understand the whole network.

Here's what’s going on inside a single block:

  1. The input x enters the block and splits into two paths

  2. One path goes through two convolutional layers, each followed by batch normalization and a ReLU activation

  3. The other path skips those layers - this is the skip connection

  4. Both paths meet at an addition step, where the original input is added to the output of the convolutional layers

  5. A final ReLU activation is applied to the result

Or in diagram form:

ResNet block diagram

The skip connection here is called an identity mapping - the input passes through unchanged and gets added directly to the learned output. It's the simplest possible shortcut with no transformation and no extra parameters.

But for the addition to work, both paths need to produce tensors of the same shape. If the convolutional layers change the spatial dimensions or the number of channels, the input x can't be added. In those cases, ResNet applies a projection shortcut - a 1×1 convolution on the skip path that reshapes x to match.

Most blocks in a ResNet use identity shortcuts. Projection shortcuts only show up when dimensions change, typically when the network moves between stages.
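The block described above can be sketched in PyTorch. This is a simplified illustration, not torchvision's exact implementation - the channel sizes in the usage lines at the bottom are placeholders:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus a skip connection."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut when shapes match; 1x1 projection when they don't.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride,
                          bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # the addition step
        return self.relu(out)         # final ReLU after the addition

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 64)(x).shape)             # identity shortcut
print(ResidualBlock(64, 128, stride=2)(x).shape)  # projection shortcut
```

Note how the projection branch only appears when the stride or channel count changes - exactly the rule described above.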

Types of ResNet Architectures

ResNet comes in a few standard variants, each named after its total number of layers. The right one depends on what you're optimizing for - speed, accuracy, or somewhere in between.

ResNet architecture comparison

ResNet-18 and ResNet-34 use the standard basic block - two 3×3 convolutional layers with a skip connection. They're fast and cheap to run, making them a good starting point when you're prototyping or working with limited hardware.

ResNet-50 and above switch to a different design called the bottleneck block, which uses three layers instead of two. That change makes deeper networks easier to train without a proportional jump in compute cost. You'll read more about how that works in the next section.

ResNet-101 and ResNet-152 go one step further at the cost of longer training times and higher memory use. They're common in research and in production systems where accuracy matters more than speed.

For most practical work, ResNet-50 is the default starting point. It has a good balance between depth and cost, and it's well-supported across every major deep learning framework.

ResNet Bottleneck Architecture

Deeper ResNets don't use the same block design as shallower ones. Starting from ResNet-50, the architecture switches to a bottleneck block, which is a three-layer design that keeps computation manageable as depth increases.

The block uses three convolutions in sequence:

  • 1×1 convolution - reduces the number of channels, making the input smaller
  • 3×3 convolution - does the actual feature learning on that smaller representation
  • 1×1 convolution - expands the channels back to the original size

The first and last 1×1 convolutions act as a bottleneck - hence the name. They compress the data before the more expensive 3×3 convolution runs, then restore it afterward.

A 3×3 convolution on a high-channel input is computationally heavy. By reducing the channels first, the bottleneck block lets the 3×3 layer do its job on a much smaller input. The result is a block that goes deeper without a proportional jump in compute cost.
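A rough multiply count per spatial position makes the saving concrete. The channel sizes below (256 in and out, 64 inside the bottleneck) are the standard ones from ResNet-50's first stage; bias, batch norm, and activations are ignored:

```python
def conv_mults(k, c_in, c_out):
    """Multiplications per output position for a k x k convolution."""
    return k * k * c_in * c_out

# Two 3x3 convs at full width (a basic block scaled up to 256 channels):
basic = 2 * conv_mults(3, 256, 256)

# Bottleneck: 1x1 reduce, 3x3 at reduced width, 1x1 expand.
bottleneck = (conv_mults(1, 256, 64)
              + conv_mults(3, 64, 64)
              + conv_mults(1, 64, 256))

print(basic)                      # 1179648
print(bottleneck)                 # 69632
print(round(basic / bottleneck))  # roughly 17x cheaper
```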

The skip connection works the same way as in a basic block - the input is added to the output before the final activation. The only difference is that a projection shortcut is almost always needed here, since the channel dimensions change inside the block.

How ResNet Solves the Vanishing Gradient Problem

The vanishing gradient problem comes down to distance. The further a gradient has to travel through a network, the more it shrinks - and by the time it reaches the early layers, there's not much left to learn from.

Skip connections get around this problem by giving gradients a shorter path to travel.

During backpropagation, gradients don't have to pass through every layer in sequence. They can travel back through the skip connection directly, completely bypassing the convolutional layers. That shortcut keeps the gradient large enough to actually update the early layers.

This also changes what each block has to learn. Instead of finding a full transformation from scratch, the network only needs to learn a small correction on top of the input. That's a much easier optimization problem, and it means the network can go deeper without training becoming unstable.
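A tiny autograd demo makes this visible. The `weak_layer` below is a deliberately contrived stand-in for a deep stack whose gradients shrink - the point is the extra `+ 1` the identity path contributes to the derivative:

```python
import torch

def weak_layer(x):
    # Stand-in for layers that heavily attenuate the backward signal.
    return 0.01 * x

x = torch.tensor(1.0, requires_grad=True)

# Without a skip connection, the gradient is just the layer's tiny slope:
y = weak_layer(x)
y.backward()
print(x.grad)  # tensor(0.0100)
x.grad = None

# With a skip connection, d(F(x) + x)/dx = F'(x) + 1 - the "+ 1" from the
# identity path keeps the gradient alive no matter how small F'(x) is:
y = weak_layer(x) + x
y.backward()
print(x.grad)  # tensor(1.0100)
```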

The result: networks that were previously too deep to train reliably become trainable.

ResNet vs Traditional CNN Architectures

Traditional CNNs and ResNets both learn features from images, but they go about it in different ways.

In a traditional CNN, data flows through layers in a straight line. Each layer takes the output of the previous one, applies a transformation, and passes the result forward. That works well up to a point. Past a certain depth, the sequential structure becomes unreliable during backpropagation - gradients shrink, early layers stop learning, and accuracy starts to drop.

ResNet doesn't go in a straight line. Skip connections let the input bypass one or more layers and get added directly to the output further down the block. The network still learns transformations, but it also has a direct path for both data and gradients to travel through.

Here's how the two approaches compare:

ResNet versus traditional CNN

Skip connections both preserve the gradient signal and smooth the optimization landscape, which means the network finds good weights faster and more reliably.

Applications of ResNet Architecture

ResNet architecture shows up across a wide range of real-world tasks.

Image classification is where ResNet started. It won the ImageNet Large Scale Visual Recognition Challenge in 2015, and it's still a go-to choice for classifying images into categories, whether that's medical scans, satellite imagery, or product photos.

Object detection workflows often use ResNets. Frameworks like Faster R-CNN and Mask R-CNN combine ResNet with a detection head that identifies and localizes objects within an image. ResNet does the feature extraction and the detection head does the rest.

Transfer learning is where ResNet gets genuinely useful for most data scientists. Instead of training from scratch - which takes days and a lot of data - you load a ResNet pretrained on ImageNet and fine-tune it on your own dataset. The pretrained weights already encode useful low-level features like edges, textures, and shapes, so you're starting from a much better place.

Feature extraction takes a similar approach. You run your images through a pretrained ResNet and pull the output from one of the later layers. Those outputs are dense, meaningful representations of your images that you can feed into a simpler classifier or clustering algorithm.

In all of these use cases, ResNet works as a pretrained starting point. Most deep learning frameworks come with pretrained ResNet weights out of the box, which makes it one of the easiest architectures to get started with.

Advantages and Limitations of ResNet

ResNet was a real step forward in deep learning - but like any architecture, it comes with tradeoffs. Let me go over the main advantages and disadvantages.

Advantages

The most obvious one is depth. Skip connections allow data scientists to train networks with 50, 100, or even 150+ layers without running into the degradation problem. That wasn't reliably possible before ResNet.

Training is also more stable. The shortcut paths give gradients a clean route back through the network, which means less tuning, fewer collapses, and more predictable results across different tasks and datasets.

And the performance is an advantage too. ResNet variants consistently rank well on image benchmarks, and pretrained ResNet models transfer well to new domains, which is why they're still a default starting point for so many computer vision projects.

Limitations

ResNet is computationally heavy. Deeper variants like ResNet-101 and ResNet-152 need a lot of memory and processing power, which can be a constraint when you're working with limited hardware or need fast inference.

It's also not the best fit for every task. For smaller datasets or simpler problems, a lighter architecture often does just as well at a fraction of the cost. Going with ResNet-50 by default isn't always the right choice.

And in some areas, ResNet has been replaced. Architectures like EfficientNet get better accuracy per parameter on image tasks, and transformers have taken over in others. ResNet is still widely used, but it's no longer the only serious option.

ResNet in Modern Deep Learning

Eleven years after its introduction, ResNet architecture is still standing strong. That's not common in deep learning.

Most practitioners still reach for ResNet when they need a reliable baseline for a computer vision task. It's well-understood, well-supported across every major framework, and pretrained weights are available in every major library. So, when you need something that works without a lot of experimentation, ResNet is usually the first option you try.

But its influence goes beyond its own variants.

ResNet's core idea - that you can add a shortcut around layers to help information and gradients flow - turned out to be broadly useful. DenseNet pushed the idea further by connecting each layer to every subsequent layer within a block, not just skipping one or two. And while transformers have a different architecture, the residual connections inside each transformer block follow the same principle ResNet introduced.

Newer architectures like EfficientNet, ConvNeXt, and vision transformers have pushed performance further in specific areas. But they didn't replace ResNet so much as build on top of what it established.

Conclusion

ResNet architecture is all about one thing: skip connections. That one idea solved two problems that had been holding deep networks back - vanishing gradients and the degradation problem - and made it practical to train networks at a depth that wasn't possible before.

The idea of adding shortcuts between layers is now a standard building block in modern deep learning, showing up in DenseNet, transformers, and most architectures built after 2015.

If you're working on a computer vision problem today, ResNet is still a solid starting point. It's not the newest option, but it's one of the most reliable ones. Treat it as a baseline - you'd be surprised how often it can still outperform the competition in 2026.

If you're new to deep learning but know the fundamentals of Python, explore our Introduction to TensorFlow in Python course - it'll get you started with architectures like ResNet in a weekend.


Author
Dario Radečić
Senior Data Scientist based in Croatia. Top Tech Writer with over 700 articles published, generating more than 10M views. Book Author of Machine Learning Automation with TPOT.

FAQs

What is ResNet and why was it important?

ResNet, short for Residual Network, is a deep learning architecture introduced by Microsoft Research in 2015. It solved two problems that made training deep networks difficult: vanishing gradients and the degradation problem. The skip connection mechanism made it possible to reliably train networks with 50, 100, or even 150+ layers for the first time.

What are skip connections in a neural network?

A skip connection is a direct path that bypasses one or more layers and adds the input straight to the output of a later layer. This gives both data and gradients a shortcut through the network, keeping the gradient signal strong enough to update early layers during training.

What is the vanishing gradient problem?

The vanishing gradient problem happens when gradients shrink as they travel backward through a deep network. By the time the signal reaches the early layers, it's too small to update them - which means those layers stop learning. ResNet addresses this by letting gradients flow back through skip connections, bypassing intermediate layers.

What's the difference between ResNet's basic block and bottleneck block?

The basic block uses two 3×3 convolutional layers and is found in shallower variants like ResNet-18 and ResNet-34. The bottleneck block, used in ResNet-50 and deeper, uses a 1×1 - 3×3 - 1×1 convolution sequence that reduces computation by compressing channels before the expensive 3×3 convolution runs.

How do I choose the right ResNet variant for my project?

For most practical work, ResNet-50 is a good default - it balances depth, accuracy, and compute cost well. ResNet-18 and ResNet-34 are faster options when hardware is limited, while ResNet-101 and ResNet-152 make sense when accuracy is the priority and compute isn't a constraint.
