Written by Brian Hulela
Updated at 25 Jun 2025, 20:31
4 min read
Edge Detection on Cat and Dog using the Sobel Filter
Object detection relies on identifying important patterns in images, a process known as feature extraction. Convolutional Neural Networks (CNNs) are the backbone of modern object detection systems because they excel at extracting these features. In this chapter, we will explore how CNNs work, their role in feature extraction, and why they are crucial for object detection.
Feature extraction is the process of identifying important characteristics in an image that help distinguish objects. These characteristics include:
Edges – The outlines of objects
Textures – Patterns in the image, such as stripes or spots
Shapes – Geometric structures like circles or rectangles
A CNN extracts these features by applying mathematical operations to the image, detecting patterns at different levels.
Edge Detection Using the Sobel Filter
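As a concrete illustration of edge detection, the short sketch below applies the two standard 3x3 Sobel kernels to a grayscale image using NumPy and SciPy. The function name sobel_edges and the assumption that the image is already loaded as a 2D array are choices made for this example, not part of the article.

```python
import numpy as np
from scipy.ndimage import convolve

# Standard Sobel kernels: sobel_x responds to vertical edges, sobel_y to horizontal ones
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def sobel_edges(gray):
    """Return the per-pixel edge strength of a 2D grayscale image."""
    gx = convolve(gray, sobel_x)   # gradient along x
    gy = convolve(gray, sobel_y)   # gradient along y
    return np.hypot(gx, gy)        # gradient magnitude = edge strength
```

Running this on a cat or dog photo (converted to grayscale) produces an image in which the animal's outline stands out, which is exactly the kind of low-level feature the first convolutional layers of a CNN learn to detect.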
A CNN is a type of deep learning model designed to process images efficiently. It consists of several layers, each with a specific purpose:
Convolutional Layers – Extract important features using filters (kernels).
Activation Function (ReLU) – Introduces non-linearity, allowing the network to learn complex patterns.
Pooling Layers – Reduce the size of feature maps, keeping only essential details.
Fully Connected Layers – Combine extracted features to make predictions.
Each layer refines the image representation, allowing the model to detect increasingly complex patterns.
A Convolutional Neural Network showing an input image passing through convolution, pooling, and fully connected layers.
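To make the layer ordering concrete, here is a minimal sketch of such a network in PyTorch. The channel counts, the 224x224 input size, and the two output classes (say, cat vs. dog) are illustrative choices, not values from the article.

```python
import torch
import torch.nn as nn

# Convolution -> ReLU -> pooling -> flatten -> fully connected, as in the diagram above
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # feature extraction
    nn.ReLU(),                                                            # non-linearity
    nn.MaxPool2d(kernel_size=2),                                          # shrink feature maps
    nn.Flatten(),                                                         # 1D vector for the classifier
    nn.Linear(16 * 112 * 112, 2),                                         # scores for 2 classes
)

scores = model(torch.randn(1, 3, 224, 224))  # one random 224x224 RGB image -> 2 class scores
```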
Convolution is a process where a small matrix (filter/kernel) slides over the input image to detect specific features.
Imagine a 3x3 filter applied to an image. It multiplies the pixel values in its region by the filter's weights and sums the results to produce a new pixel value in the feature map. This helps in detecting edges, textures, or patterns.
Key properties of convolution:
Stride – The step size of the filter as it moves.
Padding – Adds extra pixels around the image to preserve size.
This image shows a 3x3 Laplacian filter scanning over image pixels with a stride of 1 and no padding, producing a feature map.
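A plain NumPy sketch of this sliding-window operation is shown below. Like most deep learning libraries, it slides the kernel without flipping it (technically cross-correlation); the helper name convolve2d and the zero-padding choice are assumptions made for this example.

```python
import numpy as np

def convolve2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2D image and return the feature map."""
    if padding > 0:
        image = np.pad(image, padding)            # zero-pad the border to preserve size
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride + k, j*stride:j*stride + k]
            feature_map[i, j] = np.sum(region * kernel)   # multiply and sum
    return feature_map

# 3x3 Laplacian kernel, stride 1, no padding, matching the figure above
laplacian = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])
```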
After convolution, we use an activation function called ReLU (Rectified Linear Unit). This function introduces non-linearity, allowing CNNs to detect more complex features.
ReLU works by:
Keeping positive values unchanged
Replacing negative values with zero
This suppresses weak or negative responses so that later layers focus on the strongest features.
ReLU Activation Function Applied to a Feature Map
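In code, ReLU is a one-liner; the sketch below applies it element-wise to a small feature map whose values are chosen purely for illustration.

```python
import numpy as np

def relu(feature_map):
    """Keep positive values unchanged, replace negative values with zero."""
    return np.maximum(feature_map, 0)

relu(np.array([[-3, 5],
               [ 2, -1]]))   # -> [[0, 5], [2, 0]]
```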
Pooling is used to reduce the size of feature maps while keeping important information. It helps CNNs be more efficient and less sensitive to small variations in the image.
Max Pooling – Selects the highest value in a region (helps in edge detection).
Average Pooling – Averages the values in a region (smooths out noise).
Max pooling and average pooling. This image shows a 2x2 pooling operation reducing the feature map size while preserving key features.
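The sketch below implements both pooling variants with NumPy for a window size of 2 and a stride equal to the window, matching the 2x2 operation in the figure; the example feature map values are made up.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Apply non-overlapping max or average pooling with a size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                   # drop rows/cols that don't fill a window
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 9, 8],
                 [4, 4, 3, 5]])
pool2d(fmap, mode="max")       # -> [[6, 4], [7, 9]]
pool2d(fmap, mode="average")   # -> [[3.75, 1.75], [4.25, 6.25]]
```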
After extracting features, the CNN flattens the feature maps into a 1D vector and passes them through fully connected layers. These layers:
Interpret extracted features
Assign probability scores to object categories
Output the final prediction
This is the last step before the CNN decides what an object is in an image.
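As a rough sketch of this final step in NumPy: the pooled feature maps are flattened, multiplied by a weight matrix (which would normally come from training; here it is random), and passed through a softmax to get class probabilities. The shapes and names are illustrative, not taken from the article.

```python
import numpy as np

def predict(feature_maps, weights, bias):
    """Flatten feature maps, apply a fully connected layer, and return class probabilities."""
    x = feature_maps.ravel()             # flatten to a 1D feature vector
    scores = weights @ x + bias          # one score per object category
    exp = np.exp(scores - scores.max())  # softmax turns scores into probabilities
    return exp / exp.sum()

rng = np.random.default_rng(0)
fmaps = rng.random((16, 7, 7))                     # 16 pooled feature maps (illustrative)
W, b = rng.random((2, 16 * 7 * 7)), np.zeros(2)    # untrained weights for 2 classes
predict(fmaps, W, b)                               # -> two probabilities summing to 1
```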
CNNs are ideal for object detection because they:
Detect features at different levels (edges, textures, shapes)
Recognize objects regardless of position or scale
Are efficient for processing large datasets
Modern object detection models like YOLO (You Only Look Once), SSD (Single Shot Detector), and Faster R-CNN all use CNNs for feature extraction.