Convolutional Neural Network (CNN)


A Convolutional Neural Network (CNN) is a specialized type of artificial neural network (ANN), designed primarily to extract features from data with a grid-like structure. This makes it particularly effective for visual data such as images and videos, where spatial patterns carry crucial information. CNNs are widely used in computer vision applications because of their effectiveness in processing visual data.

A CNN consists of multiple layers: an input layer, convolutional layers, pooling layers, and fully connected layers.



How Do Convolutional Layers Work?

Imagine you have an image. It can be represented as a cuboid with a length and width (the spatial dimensions of the image) and a height (the channels, as images generally have red, green, and blue channels).


A convolution layer consists of a set of learnable filters (each a collection of kernels). Each filter has a small width and height but the same depth as the input volume (3 if the input is an RGB image), and each layer contains multiple such filters.

Steps:



  • Now imagine taking a small patch of this image and running a small neural network, called a filter or kernel, on it. For example, suppose we run a convolution on an image of dimensions 34 × 34 × 3. A filter can be of size a × a × 3, where 'a' can be 3, 5, or 7, but is small compared to the image dimensions.

  • During the forward pass, we slide each filter across the whole input volume (image) step by step; each step is called a stride. At each position we compute the dot product between the kernel weights and the patch of the input volume.
  • As we slide a filter, the results of these patch-wise convolutions are combined into a single 2-D array called a feature map, so each filter produces one 2-D output. Stacking the feature maps of all K filters gives an output volume whose depth equals the number of filters. The network learns the values of all the filters during training.
  • The result is another image-like volume with different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels, but smaller width and height. This operation is called convolution.
  • Coordination Across Channels: Since a filter is made of multiple kernels (one for each input channel), all kernels within that filter slide across the image in sync: the red-channel kernel slides over the red pixels, the green-channel kernel over the green pixels, and the blue-channel kernel over the blue pixels. They all move to the same (x, y) coordinates at the same time, and their individual results are summed together to produce the value at that coordinate in the final feature map.
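The steps above can be sketched as a naive NumPy loop. This is an illustrative sketch, not how frameworks actually implement convolution (they use heavily optimized kernels); the function name and the random image are assumptions for the demo.

```python
import numpy as np

def conv2d_single_filter(image, kernels):
    """Convolve one filter (a stack of per-channel kernels) over a
    multi-channel image and return a single 2-D feature map."""
    h, w, c = image.shape          # e.g. 34 x 34 x 3
    k = kernels.shape[0]           # kernel size, e.g. 3 (kernels: k x k x c)
    out = np.zeros((h - k + 1, w - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + k, x:x + k, :]   # same (x, y) for all channels
            out[y, x] = np.sum(patch * kernels)  # per-channel products, summed
    return out

image = np.random.rand(34, 34, 3)
kernels = np.random.rand(3, 3, 3)   # one 3x3 kernel per input channel
fmap = conv2d_single_filter(image, kernels)
print(fmap.shape)   # (32, 32) -- one 2-D feature map per filter
```

Running K such filters and stacking their feature maps along the depth axis yields the 32 × 32 × K output volume described above.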
Strided Convolutions
Consider a 6 × 6 image as shown in the figure. It is to be convolved with a 3 × 3 filter. The convolution is carried out by element-wise multiplication of the filter with each image patch, followed by a sum.

This basic convolution has two downsides:
  • Every time we apply the convolutional filter, the image shrinks: the output has smaller dimensions than the original input image, which can lead to information loss.
  • Pixels at the corners of the image contribute to far fewer outputs than pixels in the middle, so information near the edges is underrepresented.
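Both downsides can be verified directly on the 6 × 6 example. A minimal sketch, counting how many 3 × 3 patches each pixel participates in (the loop bounds follow from the output size (n − f + 1) × (n − f + 1) = 4 × 4, which is smaller than 6 × 6):

```python
import numpy as np

# Count how many 3x3 patches of a 6x6 image each pixel participates in.
n, f = 6, 3
usage = np.zeros((n, n), dtype=int)
for y in range(n - f + 1):          # output is (n - f + 1) x (n - f + 1) = 4 x 4
    for x in range(n - f + 1):
        usage[y:y + f, x:x + f] += 1

print(usage[0, 0])   # 1  -- a corner pixel is seen by only one patch
print(usage[3, 3])   # 9  -- a central pixel is seen by nine patches
```

Padding the image border (discussed below) mitigates both effects.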
Layers Used to Build ConvNets
A complete Convolutional Neural Network architecture is also known as a ConvNet. Let's take an example by running a ConvNet on an image of dimensions 34 × 34 × 3.
  • Input Layer: This is the layer through which we give input to our model. In a CNN, the input is generally an image or a sequence of images. This layer holds the raw image with width 34, height 34, and depth 3.
  • Convolution Layer: A convolution layer is a type of neural network layer that applies a convolution operation to the input data. The convolution operation involves a filter (or kernel) that slides over the input data, performing element-wise multiplications and summing the results to produce a feature map. This process allows the network to detect patterns such as edges, textures, and shapes in the input images.
  • Activation Layer: By applying an activation function to the output of the preceding layer, activation layers add nonlinearity to the network. Common activation functions are ReLU (max(0, x)), Tanh, and Leaky ReLU. The activation leaves the volume unchanged: if the preceding convolution layer produced a 32 × 32 × 12 volume (e.g., 12 filters of size 3 × 3 applied to the 34 × 34 × 3 input), the output volume is still 32 × 32 × 12.
  • Pooling Layer: The pooling layer is used in CNNs to reduce the spatial dimensions (width and height) of the input feature maps while retaining the most important information. It involves sliding a two-dimensional filter over each channel of a feature map and summarizing the features within the region covered by the filter. For a feature map of dimensions nh × nw × nc, pooling with an f × f filter and stride s produces an output of dimensions ((nh − f)/s + 1) × ((nw − f)/s + 1) × nc.


  • Fully Connected Layers: These take the flattened input from the previous layer and compute the final classification or regression output.
  • Output Layer: For classification tasks, the output of the fully connected layers is fed into a function such as sigmoid or softmax, which converts the score for each class into a probability.
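The layer stack above can be traced as pure shape arithmetic. A minimal sketch, assuming (as in the running example) 12 filters of size 3 × 3 and 2 × 2 max pooling with stride 2:

```python
# Track tensor shapes through the example ConvNet (3x3 filters, 12 of them,
# and 2x2 pooling with stride 2 are illustrative choices).
h, w, c = 34, 34, 3                       # input layer: raw image

h, w, c = h - 3 + 1, w - 3 + 1, 12        # convolution: 12 filters of 3x3x3
print((h, w, c))                          # (32, 32, 12)

# activation (ReLU) leaves dimensions unchanged: still (32, 32, 12)

h, w = h // 2, w // 2                     # 2x2 pooling, stride 2
print((h, w, c))                          # (16, 16, 12)

flat = h * w * c                          # flatten for the fully connected layer
print(flat)                               # 3072
```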
How Do Convolution Layers Work?

Key Components of a Convolution Layer
  • ‌Filters (Collection of Kernels): Filters are small, learnable matrices that extract specific features from the input data. 
  • ‌Stride: The stride determines how much the filter moves during the convolution operation. A stride of 1 means the filter moves one pixel at a time, while a stride of 2 means it moves two pixels at a time.
  • ‌Padding: Padding involves adding extra pixels around the input data to control the spatial dimensions of the output feature map.
  • Activation Function: After the convolution operation, an activation function, typically the Rectified Linear Unit (ReLU), is applied to introduce non-linearity into the model.
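The effect of stride and padding on the output size can be seen in a small single-channel sketch. The output size follows the standard formula ((n + 2p − f) / s) + 1; the function name and random inputs are assumptions for the demo.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """2-D convolution on a single-channel image with stride and zero padding."""
    if padding:
        image = np.pad(image, padding)    # zeros around the border
    h, w = image.shape
    k = kernel.shape[0]
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y * stride:y * stride + k, x * stride:x * stride + k]
            out[y, x] = np.sum(patch * kernel)
    return out

img = np.random.rand(6, 6)
kern = np.random.rand(3, 3)
print(conv2d(img, kern).shape)             # (4, 4)
print(conv2d(img, kern, stride=2).shape)   # (2, 2)
print(conv2d(img, kern, padding=1).shape)  # (6, 6) -- "same" padding
```

With padding of 1 the 6 × 6 input keeps its size, which addresses the shrinkage downside noted earlier.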
Steps in a Convolution Layer
  • Initialize Filters: Randomly initialize a set of filters with learnable parameters. Suppose we use a total of 12 filters for this layer.
  • Convolve Filters with Input: Slide the filters across the width and height of the input data, computing the dot product between the filter and the input sub-region.
  • Apply Activation Function: Apply a non-linear activation function to the convolved output to introduce non-linearity.
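A minimal sketch of these three steps in NumPy. The 12 filters of size 3 × 3 × 3 match the running example but are otherwise arbitrary choices; a real layer would also learn biases and update the filters by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((34, 34, 3))

# Step 1: randomly initialize 12 learnable filters of size 3x3x3
filters = rng.standard_normal((12, 3, 3, 3)) * 0.1

# Step 2: convolve each filter with the input
out = np.zeros((32, 32, 12))
for i in range(12):
    for y in range(32):
        for x in range(32):
            out[y, x, i] = np.sum(image[y:y + 3, x:x + 3, :] * filters[i])

# Step 3: apply the ReLU activation function
out = np.maximum(out, 0)
print(out.shape)   # (32, 32, 12) -- depth equals the number of filters
```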
How Do Pooling Layers Work?
  • ‌Define a Pooling Window (Filter): The size of the pooling window (e.g., 2x2) is chosen, along with a stride (the step size by which the window moves). A common choice is a 2x2 window with a stride of 2, which reduces the feature map size by half.
  • ‌Slide the Window Over the Input: The pooling operation is applied to each region of the input feature map covered by the window.
  • Apply the Pooling Operation: Depending on the type of pooling (max, average, etc.), the operation extracts the required value from each window.
  • Output the Downsampled Feature Map: The result is a smaller feature map that retains the most important information.
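The four steps above can be sketched as a small NumPy function on a single channel. The 2 × 2 window with stride 2 and the sample feature map are illustrative choices; the `op` parameter selects the pooling operation (max, average, etc.).

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, op=np.max):
    """Slide a size x size window over one channel and summarize each region."""
    h, w = fmap.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = fmap[y * stride:y * stride + size,
                          x * stride:x * stride + size]
            out[y, x] = op(window)
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)
print(pool2d(fmap))              # max pooling:     [[6. 5.] [7. 9.]]
print(pool2d(fmap, op=np.mean))  # average pooling: [[3.5 2. ] [2.5 6. ]]
```

Note how the 4 × 4 map is reduced to 2 × 2 while each output cell keeps a summary of its window, and how max and average pooling (described next) differ only in the summarizing operation.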
Types of Pooling Layers

1. Max Pooling
Max pooling selects the maximum element from the region of the feature map covered by the filter. Thus, the output of a max-pooling layer is a feature map containing the most prominent features of the previous feature map.
2. Average Pooling
Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in that patch.
Average pooling provides a more generalized representation of the input. It is useful in cases where preserving the overall context is important.

3. Global Pooling
Global pooling reduces each channel in the feature map to a single value, producing a 1 × 1 × nc output. This is equivalent to applying a filter of size nh × nw.
There are two types of global pooling:
  • Global Max Pooling: Takes the maximum value across the entire feature map.
  • ‌Global Average Pooling: Computes the average of all values in the feature map.
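Both variants reduce to a single NumPy call over the spatial axes. A minimal sketch on an assumed 16 × 16 × 12 stack of feature maps:

```python
import numpy as np

fmaps = np.random.rand(16, 16, 12)   # n_h x n_w x n_c feature maps

gmp = fmaps.max(axis=(0, 1))         # global max pooling
gap = fmaps.mean(axis=(0, 1))        # global average pooling
print(gmp.shape, gap.shape)          # (12,) (12,) -- one value per channel
```

Global average pooling is often used just before the output layer as a lightweight replacement for flattening into a large fully connected layer.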



   Article Contributor: Joymalya Dey (ML Engineer)
