YOLO Object Detection Explained

Understand YOLO object detection, its benefits, how it has evolved over the years, and some real-life applications.
Updated Sep 28, 2024  · 22 min read

Object detection is a computer vision technique for identifying and localizing objects within an image or a video. 

Object localization is the process of identifying the location of one or more objects using bounding boxes, the rectangles drawn around the objects. It is sometimes confused with image classification or image recognition, which aims to assign an image, or an object within an image, to one of a set of predefined classes. 

The illustration below corresponds to the visual representation of the previous explanation. The object detected within the image is a “Person.” 

Object detection illustrated from image recognition and localization

Image by Author

In this conceptual blog, you will first understand the benefits of object detection before being introduced to YOLO, a state-of-the-art object detection algorithm. 

In the second part, we will focus more on the YOLO algorithm and how it works. After that, we will provide some real-life applications using YOLO. 

The last section will explain how YOLO evolved from 2015 to 2024 before concluding on the next steps. 

What is YOLO?

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their famous research paper, You Only Look Once: Unified, Real-Time Object Detection.

The authors frame the object detection problem as a regression rather than a classification task: a single convolutional neural network (CNN) predicts spatially separated bounding boxes and the class probabilities associated with each of them.

If you're interested in image classification, consider taking the Image Processing with Keras in Python course, where you'll build Keras-based deep neural networks for image classification tasks. If you are more interested in PyTorch, Deep Learning with PyTorch will teach you about convolutional neural networks and how to use them to build much more powerful models.

Some of the reasons why YOLO is leading the competition include:

  • Speed 
  • High detection accuracy 
  • Better generalization 
  • Open-source availability

Let's see these features in more detail.

1. Speed

YOLO is extremely fast because it does not deal with complex pipelines. It can process images at 45 Frames Per Second (FPS). In addition, YOLO reaches more than twice the mean Average Precision (mAP) compared to other real-time systems, which makes it a great candidate for real-time processing. 

From the graphic below, we observe that YOLO is far beyond the other object detectors with 91 FPS.

YOLO speed compared to other state-of-the-art object detectors (source)

2. High detection accuracy

YOLO delivers high accuracy with very few background errors, that is, false detections in regions of the image that contain no object. 

3. Better generalization

Better generalization is especially true of the newer versions of YOLO, which will be discussed later in the article. With those advancements, YOLO generalizes better to new domains, which makes it great for applications relying on fast and robust object detection. 

For instance, the Automatic Detection of Melanoma with Yolo Deep Convolutional Neural Networks paper shows that the first version, YOLOv1, has the lowest mAP for the automatic detection of melanoma disease, compared to YOLOv2 and YOLOv3.

4. Open-source 

Making YOLO open-source has led the community to improve the model constantly. This is one of the reasons why YOLO has made so many improvements in such a limited time. 

YOLO Architecture

The YOLO architecture is similar to GoogLeNet. As illustrated below, it has 24 convolutional layers, four max-pooling layers, and two fully connected layers.

YOLO Architecture from the original paper (Modified by Author)

The architecture works as follows:

  • The input image is resized to 448x448 before going through the convolutional network.
  • A 1x1 convolution is first applied to reduce the number of channels, followed by a 3x3 convolution to generate a cuboidal output.
  • The activation function under the hood is leaky ReLU, except for the final layer, which uses a linear activation function.
  • Additional techniques, such as dropout and data augmentation, regularize the model and prevent it from overfitting.

By completing the Deep Learning in Python course, you will be ready to use Keras to train and test complex, multi-output networks and dive deeper into deep learning.

How Does YOLO Object Detection Work?

Now that you understand the architecture, let’s take a high-level look at how the YOLO algorithm performs object detection using a simple use case.

“Imagine you built a YOLO application that detects players and soccer balls from a given image. 

But how can you explain this process to someone, especially a newcomer to the field?

 → That is the whole point of this section. You will understand the whole process of how YOLO performs object detection and how to get image (B) from image (A).”

YOLO object detection: image (A) by Jeffrey F Lin on Unsplash, image (B) by Author

The algorithm works based on the following four approaches: 

  • Residual blocks
  • Bounding box regression
  • Intersection over Union, or IOU for short
  • Non-Maximum Suppression

Let’s have a closer look at each one of them. 

1. Residual blocks

This first step starts by dividing the original image (A) into NxN grid cells of equal shape, where N, in our case, is 4, as shown in the image on the right. Each cell in the grid is responsible for localizing and predicting the class of the object that it covers, along with the probability/confidence value. 

Application of grid cells to the original image

Image by Author

2. Bounding box regression

The next step is to determine the bounding boxes, the rectangles that highlight each object in the image. We can have as many bounding boxes as there are objects within a given image. 

YOLO determines the attributes of these bounding boxes using a single regression module in the following format, where Y is the final vector representation for each bounding box. 

Y = [pc, bx, by, bh, bw, c1, c2]

This is especially important during the training phase of the model.

  • pc corresponds to the probability score of the grid containing an object. For instance, all the grids in red will have a probability score higher than zero. The image on the right is the simplified version since the probability of each yellow cell is zero (insignificant). 

Identification of significant and insignificant grids

Image by Author

  • bx, by are the x and y coordinates of the center of the bounding box with respect to the enveloping grid cell. 
  • bh, bw correspond to the height and the width of the bounding box with respect to the enveloping grid cell. 
  • c1 and c2 correspond to the two classes, Player and Ball. We can have as many classes as your use case requires. 
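To make the vector concrete, here is a minimal sketch of how one labeled box could be turned into Y for the grid cell that contains its center. The function name `encode_box`, the pixel-coordinate input format, and the 400x400 example image are assumptions made for illustration; this is not the reference YOLO implementation.

```python
def encode_box(box, class_id, image_size, grid_size, num_classes=2):
    """Encode one box as (row, col, Y) with Y = [pc, bx, by, bh, bw, c1, ..., cn].

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    """
    img_w, img_h = image_size
    cell_w, cell_h = img_w / grid_size, img_h / grid_size

    # The grid cell that contains the box center is responsible for the object.
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = min(int(cx // cell_w), grid_size - 1)
    row = min(int(cy // cell_h), grid_size - 1)

    # Center offsets and box size, all expressed in units of one grid cell.
    bx = cx / cell_w - col
    by = cy / cell_h - row
    bh = (box[3] - box[1]) / cell_h
    bw = (box[2] - box[0]) / cell_w

    one_hot = [1.0 if i == class_id else 0.0 for i in range(num_classes)]
    return row, col, [1.0, bx, by, bh, bw] + one_hot


# Hypothetical "Player" (class 0) near the bottom right of a 400x400 image, 4x4 grid.
row, col, y = encode_box((260, 260, 360, 380), class_id=0,
                         image_size=(400, 400), grid_size=4)
print(row, col, y)
```

Because bh and bw are measured relative to the cell, a value above 1 simply means the object is larger than one grid cell.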

To understand, let’s pay closer attention to the player on the bottom right. 

Bounding box regression identification

Image by Author

3. Intersection over Union or IOU

Most of the time, a single object in an image can have multiple grid box candidates for prediction, even though not all are relevant. The goal of the IOU (a value between 0 and 1) is to discard such grid boxes to only keep those that are relevant. Here is the logic behind it: 

  • The user defines an IOU selection threshold, which can be, for instance, 0.5. 
  • Then, YOLO computes the IOU of each grid cell, which is the Intersection area divided by the Union Area. 
  • Finally, it ignores the prediction of the grid cells having an IOU ≤ threshold and considers those with an IOU > threshold. 
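The three steps above can be sketched in a few lines of Python. The `iou` helper, the corner-based box format `(x_min, y_min, x_max, y_max)`, and the example boxes are assumptions for illustration; the reference box stands in for whatever box the candidates are compared against.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (if there is any overlap).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


reference = (0, 0, 2, 2)  # hypothetical reference box
candidates = {"Grid 1": (1, 1, 3, 3), "Grid 2": (0, 0, 2, 3)}
threshold = 0.5

# Keep only the candidates whose IOU is strictly above the threshold.
kept = [name for name, box in candidates.items() if iou(reference, box) > threshold]
print(kept)
```

With these numbers, "Grid 1" overlaps the reference by only 1/7 and is discarded, while "Grid 2" overlaps by 2/3 and is kept.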

Below is an illustration of applying the grid selection process to the bottom left object. We can observe that the object originally had two grid candidates, and only “Grid 2” was selected at the end. 

Process of selecting the best grids for prediction

Image by Author

4. Non-Max Suppression or NMS

Setting a threshold for the IOU is not always enough because an object can have multiple boxes with IOU beyond the threshold, and leaving all those boxes might include noise. Here is where we can use NMS to keep only the boxes with the highest probability detection score. 
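A minimal sketch of greedy NMS follows, assuming boxes in `(x_min, y_min, x_max, y_max)` format and one confidence score per box. The function names and example values are illustrative; production code would typically rely on a library routine instead.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.75, 0.8]
kept = non_max_suppression(boxes, scores)
print(kept)  # prints [0, 2]
```

The second box overlaps the first almost entirely and has a lower score, so it is suppressed; the third box covers a different object and survives.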

YOLO Applications

YOLO object detection has many applications in our day-to-day life. In this section, we will cover some of them in the following domains: healthcare, agriculture, security surveillance, and self-driving cars. Let’s understand each one with specific examples. 

Healthcare

In surgery specifically, it can be challenging to localize organs in real time due to biological diversity from one patient to another. The paper Kidney Recognition in CT using YOLOv3 applied YOLOv3 to localize kidneys in 2D and 3D from computerized tomography (CT) scans. 

The Biomedical Image Analysis in Python course can help you learn the fundamentals of exploring, manipulating, and measuring biomedical image data using Python.  

2D Kidney detection by YOLOv3 (Image from Kidney Recognition in CT using YOLOv3)

Agriculture

Artificial Intelligence and robotics play a major role in modern agriculture. Harvesting robots are vision-based robots that were introduced to replace manual picking of fruits and vegetables. One of the best models in this field uses YOLO. In Tomato detection based on a modified YOLOv3 framework, the authors describe how they used YOLO to identify the types of fruits and vegetables for efficient harvest. 

Comparison of YOLO-tomato models

Image from Tomato detection based on modified YOLOv3 framework (source)

Security surveillance

Security surveillance is one of the most common uses of object detection, and the same setup extends to related tasks. For example, YOLOv3 was used during the COVID-19 pandemic to estimate social distance violations between people. 

You can read more on this topic in A deep-learning-based social distance monitoring framework for COVID-19.

Self-driving cars

Real-time object detection is part of the DNA of autonomous vehicle systems. This integration is vital for autonomous vehicles because they need to properly identify the correct lanes and all the surrounding objects and pedestrians to increase road safety. YOLO's real-time aspect makes it a better candidate compared to simple image segmentation approaches. 

The Evolution of YOLO: From 2015 to 2024

Since its first release in 2015, YOLO has evolved greatly, with different versions. In this section, we will understand the differences between these versions.

YOLO or YOLOv1: The starting point

This first version of YOLO was a game changer for object detection because it could quickly and efficiently recognize objects. 

However, like many other solutions, the first version of YOLO has its own limitations: 

  • It struggles to detect small objects that appear in groups, such as people in a stadium crowd. This is because each grid cell in the YOLO architecture is designed for single-object detection. 
  • It is also unable to reliably detect objects with new or unusual shapes. 
  • Finally, the loss function used to approximate the detection performance treats errors the same for both small and large bounding boxes, which leads to incorrect localizations. 

YOLOv2 or YOLO9000

YOLOv2 was created in 2016 with the idea of making the YOLO model better, faster and stronger.

The improvements include, but are not limited to, the use of Darknet-19 as the new architecture, batch normalization, higher input resolution, convolution layers with anchors, dimensionality clustering, and fine-grained features. 

Batch normalization 

Adding a batch normalization layer improved the performance by 2% mAP. This batch normalization included a regularization effect, preventing overfitting. 

Higher input resolution

YOLOv2 directly uses a higher-resolution 448×448 input instead of 224×224, which lets the model adjust its filters to perform better on higher-resolution images. This approach increased the accuracy by 4% mAP, after being trained for 10 epochs on the ImageNet data.

Convolution layers using anchor boxes

Instead of predicting the exact coordinates of the objects' bounding boxes as YOLOv1 operates, YOLOv2 simplifies the problem by replacing the fully connected layers with anchor boxes. This approach slightly decreases the accuracy but improves the model recall by 7%, which gives more room for improvement.

Dimensionality clustering

YOLOv2 automatically finds the previously mentioned anchor boxes using k-means dimensionality clustering with k=5 instead of performing a manual selection. This novel approach provides a good tradeoff between the recall and the precision of the model.
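As an illustration of the idea, the sketch below clusters box widths and heights with k-means using 1 − IOU as the distance, so that large boxes do not dominate the way they would with Euclidean distance. The function names and the randomly generated box sizes are assumptions for the example; a real run would use the (width, height) pairs from the training annotations.

```python
import random


def wh_iou(box, centroid):
    """IOU of two (width, height) pairs, assuming both boxes share the same center."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k=5, iterations=100, seed=0):
    """Cluster (width, height) pairs with distance d = 1 - IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        # Assign each box to the centroid with the highest IOU (lowest 1 - IOU).
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        # Move each centroid to the mean width and height of its members.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(m[0] for m in members) / len(members)
                mean_h = sum(m[1] for m in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical box sizes in pixels, standing in for real training annotations.
data_rng = random.Random(42)
boxes = [(data_rng.uniform(10, 200), data_rng.uniform(10, 200)) for _ in range(300)]
anchors = kmeans_anchors(boxes, k=5)
for w, h in anchors:
    print(f"anchor: {w:.1f} x {h:.1f}")
```

The returned centroids play the role of anchor boxes, sorted here from smallest to largest area.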

For a better understanding of the k-means dimensionality clustering, take a look at our K-Means Clustering in Python with scikit-learn and K-Means Clustering in R tutorials. They dive into the concept of k-means clustering using Python and R. 

Fine-grained features

YOLOv2 makes its predictions on a 13×13 feature map, which is enough for detecting large objects. To detect finer objects, the architecture turns the 26 × 26 × 512 feature map into a 13 × 13 × 2048 feature map and concatenates it with the original features. This approach improved the model performance by 1%. 
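The reshaping step can be sketched with NumPy as a space-to-depth operation: each 2×2 block of neighboring positions is stacked along the channel axis. The function name `passthrough`, the height-width-channel layout, and the 1024-channel coarse map are assumptions made for this illustration.

```python
import numpy as np


def passthrough(features, stride=2):
    """Stack each stride x stride spatial block into channels (H, W, C layout)."""
    h, w, c = features.shape
    assert h % stride == 0 and w % stride == 0
    # Split both spatial axes into (block index, position inside block).
    x = features.reshape(h // stride, stride, w // stride, stride, c)
    # Bring the two block indices first, then flatten the rest into channels.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // stride, w // stride, stride * stride * c)


fine = np.random.rand(26, 26, 512).astype(np.float32)
coarse = np.random.rand(13, 13, 1024).astype(np.float32)  # assumed channel count

stacked = passthrough(fine)
combined = np.concatenate([coarse, stacked], axis=-1)
print(stacked.shape, combined.shape)
```

No information is lost in the reshape: the higher-resolution details are simply moved into extra channels so they can be concatenated with the 13×13 map.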

YOLOv3: An incremental improvement 

YOLOv3 is an incremental improvement over YOLOv2. 

The main change is a new backbone, Darknet-53. Together with the detection layers, the full network has 106 layers, including upsampling layers and residual blocks. It is much bigger and more accurate than Darknet-19, the backbone of YOLOv2, at the cost of some speed. This new architecture has been beneficial on many levels:

Better bounding box prediction

A logistic regression model is used by YOLOv3 to predict the objectness score for each bounding box. 

More accurate class predictions 

Instead of using softmax as in YOLOv2, independent logistic classifiers have been introduced to predict the class of the bounding boxes. This is especially useful in more complex domains with overlapping labels (e.g. Person → Soccer Player). Using a softmax would constrain each box to have only one class, which is not always true.
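The difference is easy to see on a toy example. In the sketch below, the labels and raw scores (logits) are made up for illustration; the point is only how the two functions turn the same scores into probabilities.

```python
import math


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Independent per-class probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


labels = ["person", "soccer player", "ball"]
logits = [3.0, 2.5, -4.0]  # hypothetical raw scores for one box

soft = softmax(logits)
sig = sigmoid(logits)

softmax_labels = [l for l, p in zip(labels, soft) if p > 0.5]
sigmoid_labels = [l for l, p in zip(labels, sig) if p > 0.5]
print("softmax:", softmax_labels)
print("sigmoid:", sigmoid_labels)
```

With softmax, "person" and "soccer player" compete for the same probability mass, so only one of them passes the 0.5 cutoff. With independent logistic classifiers, both labels can be assigned to the same box.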

More accurate prediction at different scales

YOLOv3 makes predictions at three different scales for each location in the input image, using feature maps upsampled from earlier layers. This strategy captures fine-grained and more meaningful semantic information, which leads to better-quality detections. 

YOLOv4: Optimal speed and accuracy of object detection

This version of YOLO targets the optimal trade-off between speed and accuracy compared to all the previous versions and other state-of-the-art object detectors. 

The image below shows YOLOv4 improving on YOLOv3's AP and FPS by 10% and 12%, respectively. 

YOLOv4 Speed compared to YOLOv3 and other state-of-the-art object detectors (source)

YOLOv4 is specifically designed for production systems and optimized for parallel computations. 

The backbone of YOLOv4’s architecture is CSPDarknet53, a network containing 29 convolution layers with 3 × 3 filters and approximately 27.6 million parameters. 

This architecture, compared to YOLOv3, adds the following components for better object detection: 

  • A Spatial Pyramid Pooling (SPP) block significantly increases the receptive field and separates out the most significant context features, with almost no reduction in network speed. 
  • Instead of the Feature Pyramid Network (FPN) used in YOLOv3, YOLOv4 uses PANet for parameter aggregation from different detection levels. 
  • Data augmentation uses the mosaic technique, which combines four training images, in addition to a self-adversarial training approach. 
  • Genetic algorithms are used to select optimal hyperparameters. 
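To give a feel for the mosaic idea, here is a deliberately simplified sketch that tiles four equally sized images into a 2×2 grid and shifts their bounding boxes accordingly. The function name `simple_mosaic` and the fixed 2×2 layout are assumptions for illustration; the actual augmentation also rescales and crops the images around a random center, which is omitted here.

```python
import numpy as np


def simple_mosaic(images, boxes_per_image):
    """Tile four H x W x C images into one 2H x 2W x C image and shift the boxes.

    boxes_per_image holds, for each image, a list of
    (x_min, y_min, x_max, y_max) boxes in that image's pixel coordinates.
    """
    assert len(images) == 4 and len(boxes_per_image) == 4
    h, w = images[0].shape[:2]

    top = np.concatenate([images[0], images[1]], axis=1)
    bottom = np.concatenate([images[2], images[3]], axis=1)
    mosaic = np.concatenate([top, bottom], axis=0)

    # (dx, dy) offset of each tile: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]
    shifted = []
    for (dx, dy), boxes in zip(offsets, boxes_per_image):
        for x_min, y_min, x_max, y_max in boxes:
            shifted.append((x_min + dx, y_min + dy, x_max + dx, y_max + dy))
    return mosaic, shifted


# Four dummy 32x32 RGB images, each filled with a different constant value.
images = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(4)]
boxes = [[(1, 2, 10, 12)], [], [(0, 0, 5, 5)], [(4, 4, 8, 8)]]

mosaic, mosaic_boxes = simple_mosaic(images, boxes)
print(mosaic.shape, mosaic_boxes)
```

Each training sample then shows objects from four different contexts at once, which is the effect the mosaic technique is after.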

YOLOR: You Only Learn One Representation

Presented as a unified network for multiple tasks, YOLOR is based on a unified network that combines explicit and implicit knowledge approaches.

Unified network architecture (source)

Explicit knowledge comes from normal, conscious learning. Implicit knowledge, on the other hand, is acquired subconsciously (from experience).

Combining these two techniques, YOLOR is able to create a more robust architecture based on three processes: (1) feature alignment, (2) prediction refinement for object detection, and (3) canonical representation for multi-task learning.

Feature alignment

This approach introduces an implicit representation into the feature map of every feature pyramid network (FPN), which improves the precision by about 0.5%. 

Prediction refinement for object detection

The model predictions are refined by adding implicit representation to the output layers of the network. 

Canonical representation for multi-task learning

Performing multi-task training requires the execution of the joint optimization on the loss function shared across all the tasks. This process can decrease the overall performance of the model, and this issue can be mitigated with the integration of the canonical representation during the model training. 

From the following graphic, we can observe that YOLOR achieved state-of-the-art inference speed on the MS COCO dataset compared to other models.

YOLOR performance vs. YOLOv4 and other models (source)

YOLOX: Exceeding YOLO series in 2021

YOLOX uses a baseline that is a modified version of YOLOv3, with Darknet-53 as its backbone.

Published in the paper YOLOX: Exceeding YOLO Series in 2021, YOLOX brings the following four key characteristics to the table to create a better model than the older versions.

An efficient decoupled head

The coupled head used in the previous YOLO versions is shown to reduce the models’ performance. YOLOX uses a decoupled head instead, which separates the classification and localization tasks and thus increases the performance of the model. 

Robust data augmentation

Integration of Mosaic and MixUp into the data augmentation approach considerably increased YOLOX’s performance.  

An anchor-free system

Anchor-based algorithms perform clustering under the hood, which increases the inference time. Removing the anchor mechanism in YOLOX reduced the number of predictions per image and significantly improved inference time. 

SimOTA for label assignment

Instead of using the intersection over union (IoU) approach, the authors introduced SimOTA, a more robust label assignment strategy that achieves state-of-the-art results by not only reducing the training time but also avoiding extra hyperparameter issues. In addition to that, it improved the detection mAP by 3%.

YOLOv5

YOLOv5, unlike other versions, does not have a published research paper, and it is the first version of YOLO to be implemented in PyTorch rather than Darknet. 

Released by Glenn Jocher in June 2020, YOLOv5, similarly to YOLOv4, uses CSPDarknet53 as the backbone of its architecture. The model family comes in five sizes: YOLOv5n (smallest), YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x (largest).

One of the major improvements in the YOLOv5 architecture is the Focus layer, a single layer that replaces the first three layers of YOLOv3. This integration reduced the number of layers and parameters and increased both forward and backward speed without any major impact on the mAP. 

The following illustration compares the training time between YOLOv4 and YOLOv5s.

Training time comparison between YOLOv4 and YOLOv5 (source)

YOLOv6: A single-stage object detection framework for industrial applications

Dedicated to industrial applications with hardware-friendly efficient design and high performance, the YOLOv6 (MT-YOLOv6) framework was released by Meituan, a Chinese e-commerce company. 

Written in PyTorch, this new version was not part of the official YOLO series but still got the name YOLOv6 because its backbone was inspired by the original one-stage YOLO architecture. 

YOLOv6 introduced three significant improvements to the previous YOLOv5: a hardware-friendly backbone and neck design, an efficient decoupled head, and a more effective training strategy.

YOLOv6 provides outstanding results compared to the previous YOLO versions in terms of accuracy and speed on the COCO dataset as illustrated below.

Comparison of state-of-the-art efficient object detectors. All models are tested with TensorRT 7 except that the quantized model is with TensorRT 8 (source)

  • YOLOv6-N achieved 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU.
  • YOLOv6-S reached a new state-of-the-art 43.3% AP at 869 FPS.
  • YOLOv6-M and YOLOv6-L also achieved better accuracy, at 49.5% and 52.3% AP respectively, than other mainstream detectors with a similar inference speed. 

All these characteristics make YOLOv6 the right algorithm for industrial applications.

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

YOLOv7 was released in July 2022 in the paper Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. This version marked a significant step in the field of object detection, surpassing all the previous models in terms of accuracy and speed.

Comparison of YOLOv7 inference time with other real-time object detectors (source)

YOLOv7 made major changes at two levels: (1) the architecture and (2) the trainable bag-of-freebies.

Architectural level

YOLOv7 reformed its architecture by integrating the Extended Efficient Layer Aggregation Network (E-ELAN) which allows the model to learn more diverse features for better learning. 

In addition, YOLOv7 scales its architecture by concatenating the architecture of the models it is derived from, such as YOLOv4, Scaled YOLOv4, and YOLOR. This allows the model to meet the needs of different inference speeds.

Compound scaling up depth and width for concatenation-based model (source)

Trainable bag-of-freebies

The term bag-of-freebies refers to methods that improve the model’s accuracy by changing only the training process, without increasing the inference cost. This is the reason why YOLOv7 increased not only the inference speed but also the detection accuracy.

YOLOv8: Expanding modularity and flexibility

YOLOv8 introduces a more modular and flexible design, allowing easier customization and fine-tuning. It has built-in support for various tasks beyond object detection, such as segmentation and pose estimation.

  • Lightweight models further optimize speed and accuracy trade-offs, with smaller model sizes aimed at real-time applications on edge devices.
  • It integrates easily with custom datasets, making it versatile for specific applications.
  • YOLOv8 also adds new APIs for easier deployment and model management in production settings.

YOLO-NAS (Neural Architecture Search): Optimizing architecture

YOLO-NAS utilizes Neural Architecture Search (NAS) to automatically design an optimized architecture, maximizing performance without manual tuning.

  • It's tailored for optimal balance between performance and resource usage, perfect for both high-accuracy models and low-latency applications.
  • It automatically adjusts the resolution for different objects within an image, further optimizing the inference process.

YOLOv9: Groundbreaking techniques for real-time object detection

Launched in 2024, the v9 version of YOLO introduces several innovative techniques, such as the following:

  • Programmable Gradient Information (PGI): A new technique that optimizes gradient flow during training, improving the model's ability to learn from complex datasets more efficiently.
  • Generalized Efficient Layer Aggregation Network (GELAN): A significant architectural enhancement that further improves feature learning and aggregation, contributing to both accuracy and speed improvements.

YOLOv9 sets new benchmarks on the MS COCO dataset, demonstrating superior performance compared to previous versions, particularly in terms of precision and adaptability across various tasks. Although developed by a separate open-source team, YOLOv9 builds upon the codebase of Ultralytics YOLOv5, showing a collaborative effort within the AI community to push the boundaries of object detection.

This latest version highlights substantial advancements in both the training process and the architecture, focusing on efficiency, adaptability, and precision for real-time applications.

Conclusion

This article has covered the benefits of YOLO compared to other state-of-the-art object detection algorithms, and its evolution from 2015 to 2024. 

Given the rapid advancement of YOLO, there is no doubt that it will remain the leader in the field of object detection for a very long time. 

The next step of this article will be the application of the YOLO algorithm to real-world cases. Until then, our Introduction to Deep Learning in Python course can help you learn the fundamentals of neural networks and how to build deep learning models using Keras 2.0 in Python.

FAQs

Can we use YOLO on videos?

Yes, YOLO is a real-time detection algorithm that works on both images and videos.

Is YOLO better than Faster R-CNN?

In terms of mean Average Precision (mAP), Faster R-CNN reached 87.69%. However, YOLOv3 has incredible speed, and its frames per second (FPS) rate is 8 times that of Faster R-CNN.

Why is YOLO called You Only Look Once?

This is because YOLO predicts all the objects within an image in ONE forward pass.

How many classes can YOLO detect?

YOLO9000, the variant introduced alongside YOLOv2, is capable of detecting more than 9000 classes at the same time.


Author
Zoumana Keita
A multi-talented data scientist who enjoys sharing his knowledge and giving back to others, Zoumana is a YouTube content creator and a top tech writer on Medium. He finds joy in speaking, coding, and teaching. Zoumana holds two master’s degrees: the first in computer science with a focus on Machine Learning from Paris, France, and the second in Data Science from Texas Tech University in the US. His career path started as a Software Developer at Groupe OPEN in France, before moving on to IBM as a Machine Learning Consultant, where he developed end-to-end AI solutions for insurance companies. Zoumana then joined Axionable, the first Sustainable AI startup based in Paris and Montreal. There, he served as a Data Scientist and implemented AI products, mostly NLP use cases, for clients from France, Montreal, Singapore, and Switzerland. Additionally, 5% of his time was dedicated to Research and Development. He now works as a Senior Data Scientist at IFC, part of the World Bank Group.
