A Comprehensive Overview of Object Detection

Have you ever wondered how a self-driving car navigates a busy street, or how a security system spots a threat in real time? The answer lies in object detection, a computer vision technique that helps machines “see” and understand the things around them, such as people, buildings, and cars.

Object detection goes beyond simple image classification, which only assigns a single label to an entire image. Instead, object detection identifies and marks the locations of different objects using bounding boxes, providing a more detailed understanding of the visual scene.

This article will discuss object detection, its types, and challenges.

Object Detection Overview

Object detection uses neural networks, such as convolutional neural networks (CNNs), to classify and locate objects like humans, buildings, or cars within images or videos. Alongside these neural network (NN) approaches, traditional object detection methods are also used.

In object detection models, the confidence level comes from the classification scores generated by the neural network. Every detected object receives a probability score reflecting the chance that it belongs to a specific class. For instance, if the model detects a car in an image, it might assign a confidence score of 0.95, indicating a 95% chance that the object is a car.

The model also returns bounding box coordinates that specify the exact location of the detected object within the image. Object detection is used in various applications, from identifying product defects to improving transportation safety and enabling autonomous vehicles.

To understand object detection even better, let’s compare it with similar tasks:

  • Image Classification: This task assigns a single label to an image, classifying the most likely object present. For example, if an image contains a car, it will classify it as “car.”
  • Object Detection: This task uses a bounding box to classify and locate objects in the image, such as identifying the position of a person within a scene.
  • Image Segmentation: Segmentation gives pixel-level accuracy by outlining an object’s precise shape rather than just drawing a bounding box. It assigns each pixel to a specific class. For example, it identifies which pixels belong to a tree, capturing its shape, including branches and leaves.
Figure 2. Classification vs. Localization vs. Segmentation

Technical Aspects Behind Object Detection

Object detection involves several stages, each playing an important role in accurately identifying and locating objects within images or video frames. These stages include:

1. Input Processing

Images or video frames are preprocessed to ensure uniform size, format, and quality. Preprocessing steps include resizing, normalization, and data augmentation to improve model robustness.
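As a minimal sketch, the snippet below preprocesses a single image with torchvision; the 640×640 target size and ImageNet normalization statistics are common defaults rather than requirements, and the file name is a placeholder.

```python
# Minimal preprocessing sketch using PyTorch/torchvision (sizes and stats are assumptions).
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((640, 640)),                        # resize to a uniform input size
    T.ToTensor(),                                # convert to a float tensor in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet statistics,
                std=[0.229, 0.224, 0.225]),      # a common default choice
])

image = Image.open("street.jpg").convert("RGB")  # placeholder input file
input_tensor = preprocess(image).unsqueeze(0)    # add a batch dimension
print(input_tensor.shape)                        # torch.Size([1, 3, 640, 640])
```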

2. Feature Extraction

The feature extraction step identifies important characteristics such as edges, corners, and textures. Convolutional Neural Networks (CNNs) are commonly used to learn and extract these features automatically by applying filters to the image.

Basic features like edges and corners are identified in the early layers of a CNN, while deeper layers learn more complex features, such as parts of objects. Other methods are also used for feature extraction, such as HOG (Histogram of Oriented Gradients) and SIFT (Scale-Invariant Feature Transform).
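The toy feature extractor below illustrates this idea in PyTorch; the layer widths are illustrative assumptions, and real backbones such as ResNet or Darknet are far deeper.

```python
# Toy CNN feature extractor sketch (layer sizes are illustrative only).
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    # early layers respond to simple patterns such as edges and corners
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    # deeper layers combine them into more complex patterns (object parts)
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(1, 3, 640, 640)      # dummy preprocessed image
feature_map = feature_extractor(x)
print(feature_map.shape)             # torch.Size([1, 64, 80, 80])
```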

Figure 3. Feature extraction using CNN

3. Object Localization

A bounding box is drawn around each detected object during localization. A bounding box has four values: x, y, width (w), and height (h). Depending on the model, (x, y) can denote the box’s center or top-left corner, while w and h set the dimensions. 

In deep learning models, these coordinates are adjusted through regression to fit the object better. The Intersection over Union (IoU) metric measures the overlap between predicted and actual bounding boxes.
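A minimal IoU computation might look like the sketch below, assuming both boxes are given as (x1, y1, x2, y2) corner coordinates.

```python
# Minimal IoU sketch for boxes in (x1, y1, x2, y2) corner format
# (center + width/height boxes must be converted first).
def iou(box_a, box_b):
    # coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 0.1428... (moderate overlap)
```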

Figure 4. Intersection over Union

4. Object Classification

After localization, the object detection system assigns a label, such as “car” or “person,” to each detected object. This step typically uses fully connected layers in a CNN, which take the extracted features as input and output class probabilities.
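A minimal sketch of such a classification head is shown below; the pooled feature size and the number of classes are illustrative assumptions.

```python
# Minimal classification head sketch: fully connected layers over pooled features.
import torch
import torch.nn as nn

num_classes = 3                          # e.g. car, person, background (assumed)
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 256),          # assumes a 64x7x7 pooled feature map per region
    nn.ReLU(),
    nn.Linear(256, num_classes),         # raw class scores (logits)
)

region_features = torch.randn(1, 64, 7, 7)               # features for one candidate region
class_probs = torch.softmax(classifier(region_features), dim=1)
print(class_probs)                       # per-class probabilities that sum to 1
```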

Figure 5. Object Classification Layers 

Types of Object Detection

Object detection models can be broadly classified into two types:

  • One-stage detection: These models combine object localization and classification into a single step, prioritizing speed over precision. Examples of single-stage detectors include YOLO and SSD (Single Shot MultiBox Detector).
  • Two-stage detection: These models separate object localization and classification into two stages, first generating candidate regions (for example, with selective search or a Region Proposal Network) and then classifying and refining them. They offer higher localization accuracy but are slower than single-stage detectors. R-CNN, Fast R-CNN, and Faster R-CNN are examples of two-stage detectors.

Object Detection Architectures

Object detection models generally consist of three main components: 

  1. Backbone
  2. Neck
  3. Head 

1. Backbone

The backbone extracts features from the input image. It consists of multiple convolutional layers that process the image to identify important features like edges, textures, and patterns.

2. Neck

The neck acts as a bridge between the backbone and the head. It combines feature maps from different levels of the backbone to enrich spatial and semantic information. This component typically uses structures like an FPN (Feature Pyramid Network) or PAN (Path Aggregation Network).

3. Head

The head is the final component that predicts bounding boxes and classifies objects. It consists of fully connected layers or convolutional layers that take the processed feature maps from the neck and predict:

  • Bounding boxes: The coordinates and dimensions (x, y, width, and height), predicted using regression layers.
  • Class probabilities: A class label for each detected object, produced using a softmax or sigmoid activation (a minimal structural sketch of these three components follows).
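To tie the pieces together, here is a structural sketch in PyTorch of how a detector typically composes these three components; the class and its constructor arguments are hypothetical, not taken from any particular library.

```python
# Structural sketch of the backbone-neck-head composition (names are hypothetical).
import torch.nn as nn

class Detector(nn.Module):
    def __init__(self, backbone, neck, head):
        super().__init__()
        self.backbone = backbone      # extracts multi-level feature maps
        self.neck = neck              # fuses them (e.g. an FPN or PAN)
        self.head = head              # predicts boxes and class probabilities

    def forward(self, images):
        features = self.backbone(images)
        fused = self.neck(features)
        boxes, class_probs = self.head(fused)
        return boxes, class_probs
```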

Here are two example model architectures:

  • YOLO (One-stage detection)
  • Faster R-CNN (Two-stage detection)

YOLO (You Only Look Once)

Figure 6. YOLO Architecture

YOLO is a single-stage detection architecture that processes the entire image in a single pass.

  • Input Layer: Divides the image into an S×S grid, where each grid cell predicts bounding boxes and class probabilities.
  • Feature Extraction Layers: The backbone, such as Darknet, extracts features from the image.
  • Prediction Layers: Predict both bounding box coordinates and class probabilities for each grid cell.
Figure 7. YOLO Grid

However, YOLO also has limitations. Its accuracy can suffer, particularly when detecting smaller objects, because of the grid structure: each cell can predict only one object, so overlapping objects or objects smaller than a grid cell may be missed or misclassified.
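For a concrete sense of the grid encoding behind this limitation, the small calculation below follows the original YOLO paper’s output layout (S = 7 grid, B = 2 boxes per cell, C = 20 classes); later YOLO versions use different, anchor-based encodings.

```python
# Output layout of the original YOLO detector (simplified illustration).
S, B, C = 7, 2, 20                   # grid size, boxes per cell, classes (YOLOv1 values)
values_per_cell = B * 5 + C          # each box predicts x, y, w, h and a confidence score
print((S, S, values_per_cell))       # (7, 7, 30): one prediction vector per grid cell

# Each cell carries a single set of class probabilities, so two objects whose
# centers fall in the same cell compete for one prediction slot -- one reason
# small or crowded objects are harder for grid-based detectors.
```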

Faster R-CNN (Regions with Convolutional Neural Networks)

Figure 8. R-CNN Architecture

Faster R-CNN is an accurate object detection model that improves on the original R-CNN architecture. It is a two-stage detector that performs region proposal and object classification in two distinct phases (a short inference sketch using a pretrained model follows the component list below).

  • Feature Extraction Layers: The backbone processes the image to generate feature maps.
  • Region Proposal Network (RPN): A network slides over the feature maps to propose candidate regions where objects might be located.
  • RoI Pooling Layer: Refines the proposals by resizing each region to a uniform size, ensuring consistent input for the next stage.
  • Classification and Regression Layers:
    • Classification: Fully connected layers predict the object class for each region.
    • Regression: Predicts precise bounding box coordinates for the detected objects.
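As a usage illustration, the sketch below runs a pretrained Faster R-CNN from the torchvision library on a dummy image; the weights argument and output format may vary slightly between torchvision versions, and the random tensor merely stands in for a real photo.

```python
# Hedged inference sketch with torchvision's pretrained Faster R-CNN.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()                                  # inference mode

image = torch.rand(3, 480, 640)               # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    predictions = model([image])              # the model accepts a list of image tensors

print(predictions[0]["boxes"].shape)          # (N, 4) predicted bounding boxes
print(predictions[0]["labels"])               # predicted class indices
print(predictions[0]["scores"])               # confidence scores for each box
```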

Performance Metrics and Evaluation

Object detection systems’ effectiveness relies on key performance metrics, including:

  • Mean Average Precision (mAP)
  • Intersection over Union (IoU)

Mean Average Precision 

Mean Average Precision (mAP) is the primary metric for evaluating object detection models, combining classification accuracy with bounding box precision. Precision and recall values are determined at different confidence thresholds for each object class to compute mAP.

  • Precision measures the ratio of correctly detected objects (true positives) out of all predicted objects (true positives + false positives).
  • Recall measures the ratio of correctly detected objects (true positives) out of all actual objects (true positives + false negatives).

The average precision for each class is computed using the precision and recall values at various thresholds. The final mean Average Precision (mAP) is calculated by averaging the Average Precision (AP) scores of all object classes.

mAP = (1/N) × Σ APᵢ, where N is the number of object classes and APᵢ is the Average Precision for class i.
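The simplified sketch below illustrates this calculation, assuming precision and recall values have already been computed for each class; real evaluators such as the COCO toolkit additionally handle IoU matching, multiple thresholds, and interpolation details, and all numbers here are hypothetical.

```python
# Simplified mAP sketch: average precision per class, then the mean over classes.
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve for a single class."""
    r = np.concatenate(([0.0], recalls, [1.0]))          # add sentinel endpoints
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]             # make precision non-increasing
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))       # precision * change in recall

# hypothetical precision/recall points for two classes
ap_car    = average_precision([0.2, 0.5, 0.9], [1.0, 0.8, 0.6])
ap_person = average_precision([0.3, 0.6, 0.8], [0.9, 0.7, 0.5])
mean_ap = np.mean([ap_car, ap_person])
print(ap_car, ap_person, mean_ap)
```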

Intersection over Union

The Intersection over Union (IoU) metric measures how well a predicted bounding box aligns with the ground truth box by dividing the area of their intersection by the area of their union.

IoU = Area of Intersection / Area of Union

A higher IoU score, usually above 0.5, indicates better localization accuracy. Different applications might require different IoU thresholds – autonomous vehicles often need higher thresholds than retail inventory systems.

Figure 9. Object Detection at Different IoU Threshold Examples

Benchmark Datasets for Object Detection

Benchmark datasets play an important role in developing and evaluating object detection models. These datasets provide standardized data and annotations to train, validate, and compare the performance of different models effectively.

Some widely used benchmark datasets include:

  • COCO (Common Objects in Context) contains over 200,000 images with 80 object categories. It is widely used because of its diverse annotations, which include object segmentation, keypoints, and dense captioning, making it suitable for tasks beyond object detection (a loading sketch follows this list).
  • Pascal VOC provides around 20,000 images annotated across 20 object categories. This dataset has historically been a standard for object detection challenges.
  • Open Images, a large-scale dataset, includes millions of labeled images spanning 600 object classes. Its annotations cover bounding boxes, segmentation masks, and object relationships, supporting advanced model training and evaluation.
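As a small illustration, the sketch below loads COCO-style annotations with torchvision’s CocoDetection dataset class; it assumes the dataset has already been downloaded, the pycocotools package is installed, and the file paths shown are placeholders.

```python
# Hedged sketch: reading COCO-style detection annotations with torchvision.
from torchvision.datasets import CocoDetection

dataset = CocoDetection(
    root="coco/val2017",                                # image folder (placeholder path)
    annFile="coco/annotations/instances_val2017.json",  # annotation file (placeholder path)
)

image, targets = dataset[0]      # a PIL image and a list of annotation dicts
print(len(dataset))              # number of annotated images
if targets:                      # each annotation holds an [x, y, width, height] box and a class id
    print(targets[0]["bbox"], targets[0]["category_id"])
```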

Object Detection Models

Co-DETR

Co-DETR introduces a collaborative hybrid assignment scheme to enhance Detection Transformer (DETR)-based object detectors. It improves encoder and decoder training with auxiliary heads using one-to-many label assignments.

The approach boosts detection accuracy while keeping training fast and GPU-memory efficient. It achieves state-of-the-art (SOTA) performance, including 66.0% AP on COCO test-dev and 67.9% AP on LVIS val.

InternImage

InternImage is a large-scale CNN-based foundation model leveraging deformable convolution for adaptive spatial aggregation and a large, effective receptive field.

The architecture reduces the strict inductive bias of traditional CNNs and improves the model’s ability to learn robust patterns from large-scale visual data. It achieves 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K.

Focal-Stable-DINO

Focal-Stable-DINO is a robust and reproducible object detector that combines the powerful FocalNet-Huge backbone with the Stable-DINO detector, a stabilized version of DINO (DETR with Improved deNoising anchOr boxes).

DINO Architecture

The Stable-DINO detector solves the issue of multi-optimization paths by addressing the matching stability problem in several decoder layers.

With FocalNet-Huge as the backbone, the framework achieves 64.8 AP on COCO test-dev without complex testing techniques like test time augmentation. The model’s simplicity makes it ideal for further research and adaptability in object detection.

EVA

EVA is a vision-centric foundation model designed to push the limits of visual representation at scale using publicly available data. The model is pre-trained on NVIDIA A100-SXM4-40GB GPUs using PyTorch-based code.


The pretraining task is to reconstruct masked image-text aligned visual features from visible image patches. The framework excels across a broad range of vision tasks and enhances multimodal models like CLIP with efficient scaling and robust transfer learning.

YOLOv7

YOLOv7 introduces a SOTA real-time object detector that achieves a strong speed-accuracy trade-off. It uses trainable bag-of-freebies techniques, compound model scaling, and planned re-parameterized convolution.

Basic YOLO Detection System

The planned re-parameterization removes the identity connection in RepConv to preserve gradient diversity across different feature maps. YOLOv7 outperforms previous YOLO models, such as YOLOv5, and achieves 56.8% AP on COCO with efficient inference.

Model               | Box Mean Average Precision (mAP) on COCO test-dev
Co-DETR             | 66.0
InternImage         | 65.4
Focal-Stable-DINO   | 64.8
EVA                 | 64.7
YOLOv7              | 56.6

Challenges of Object Detection

Despite significant advancements, object detection still faces several challenges:

  • Dataset Limitations: Training data quality impacts performance, but collecting diverse datasets is resource-intensive. Synthetic data and domain adaptation help but pose challenges, like data quality, domain gap, and overfitting risk.
  • Adverse lighting conditions: Poor or inconsistent lighting can also reduce the accuracy of object detection models. This is mainly a problem outdoors.
  • Scale variance: Objects can appear at very different sizes depending on their distance from the camera and their actual size, which makes it difficult for a single model to detect them all reliably.
  • Occlusion: One object partially blocks another in an image. This is a challenge because object detection models work best when enough of an object is visible to identify it correctly.

Conclusion

Object detection is a transformative technology that allows computers to interpret and understand the visual world by pinpointing objects’ precise locations with bounding boxes. We’ve explored the fundamental principles and key components of object detection, such as feature extraction, bounding boxes, and classification, and we’ve also examined model types and evaluation metrics.

Even though there are problems with object detection, like dealing with different scales, lighting, and occlusions, new methods and models are always being developed to address these problems.
