Convolutional Neural Networks (CNNs) are mostly used for images and videos. They tend to perform better than a feed-forward network on such data because an image is nothing but matrices of pixel values that range from 0 to 255. For example, a black and white image of dimension 100×100 has 10,000 values when flattened. Similarly, an HD color image of resolution 1920x1080x3 has around 6 million values. These 6 million values belong to a single image, and the many images required to train a model add up to an amount of data that is computationally heavy for the machine, since a fully connected layer needs one weight per input value for every neuron. This is the major reason we opt for a CNN over a feed-forward network.
Several operations take place in a CNN. Some of the most important ones are:
- Convolutional Layer
- 1×1 Convolution Layer
- Fully Connected Layer
Before we dive deep into the functioning of a CNN, let’s understand what an image looks like in matrix form. A black and white image can be represented as a matrix whose values range over [0, 255], where 0 represents black and 255 represents white.
That is how a black and white image is represented. A color image is a combination of different shades of red, green, and blue (RGB). Because of that, the image is split into 3 channels/layers, one per color: the red channel, the green channel, and the blue channel.
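As a small illustration of the matrix view described above, here is a sketch using NumPy. The pixel values are made up for the example and do not come from any real image.

```python
import numpy as np

# A tiny 4x4 greyscale "image": 0 is black, 255 is white (values are made up).
gray = np.array([[  0,  64, 128, 255],
                 [ 64, 128, 255, 128],
                 [128, 255, 128,  64],
                 [255, 128,  64,   0]], dtype=np.uint8)

# A colour image stacks three such matrices: height x width x channel.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255      # red channel fully on
rgb[..., 1] = gray     # green channel reuses the greyscale values
                       # blue channel stays at 0

# Splitting the image back into its three channels.
red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]

print(gray.shape, gray.size)   # (4, 4) 16
print(rgb.shape, rgb.size)     # (4, 4, 3) 48
```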
Now that we understand how images are split into matrices, it’s time to understand how a CNN functions and how its operations take place. Before we look at how padding works, it’s important to understand how a convolutional layer is formed.
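Strictly speaking, what we perform in a convolutional layer is a cross-correlation: a true convolution flips the kernel before sliding it over the image, and here the kernel is not flipped. By convention the operation is still called a convolution. It depends on three important things: the filter (kernel), the stride, and the padding.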
Let’s say an image of dimension 6×6 is given. We slide a filter of size 3×3 over it, and the role of the filter is to detect the edges that exist within the image. The output formed when the filter is applied in this case is of size 4×4. The red box marks a window of the image that has the same size as the filter. The values inside the red box are multiplied element-wise with the filter values (this is not a matrix multiplication), the products are summed, and the resulting single value is stored in the first cell of the result.
In this example the stride is 1 and the padding is 0. The size of the output matrix is given by the formula below, where n is the input size, f the filter size, p the padding, and s the stride:
Final matrix dimension = [(n + 2p - f)/s + 1] x [(n + 2p - f)/s + 1]
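The formula can be wrapped in a small helper. This is a minimal sketch: the function name `conv_output_size` is made up for illustration, and floor division is assumed for results that are not whole numbers.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for input size n, filter size f,
    padding p and stride s (non-integer results are rounded down)."""
    return (n + 2 * p - f) // s + 1


print(conv_output_size(6, 3))         # 4
print(conv_output_size(6, 3, p=1))    # 6
```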
Stride is the number of units by which the window shifts after each multiplication with the filter. In the illustration, the yellow box marks the values that are multiplied with the filter. The box moves horizontally until the last column is covered, then moves down by 1 unit (the stride value) and sweeps horizontally again, and this is repeated until the whole image is covered.
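The sliding-window procedure can be sketched in a few lines of NumPy. This is an illustrative implementation without padding, not the code used later in this post. The function name `conv2d` and the test image are assumptions.

```python
import numpy as np


def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding, kernel not flipped)."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out_h = (n_h - f_h) // stride + 1
    out_w = (n_w - f_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = image[top:top + f_h, left:left + f_w]
            # Element-wise product followed by a sum gives one output value.
            out[i, j] = np.sum(window * kernel)
    return out


image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3))
print(conv2d(image, kernel).shape)             # (4, 4)
print(conv2d(image, kernel, stride=2).shape)   # (2, 2)
```

With a stride of 2 the window skips every other position, which is why the output shrinks from 4×4 to 2×2.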
In this operation we can notice that the values on the outer region (the corners and edges) are covered by far fewer windows than the values in the middle, with each corner value counted only once. This means information near the border is under-used, and to tackle this we use padding. Padding adds a number of units with value 0 around the edges, so that the values that were on the border now contribute more to the output. One more thing to notice is that convolution shrinks the image, and padding can also be used to ensure that the image size remains the same after the operation.
For a filter of odd size f and a stride of 1, the padding that keeps the size unchanged is:
p = (f-1)/2
Padding is generally of two types: valid padding and same padding.
In valid padding, the image has no padding, so the size of the matrix reduces. In same padding, a border of zeros is added around the matrix so that the size remains the same after the operation. This ensures that the corner and edge values contribute to the result and nothing is missed.
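The difference between the two can be seen with `np.pad`. This is a small sketch assuming a 4×4 input, a 3×3 filter and a stride of 1. The variable names are illustrative.

```python
import numpy as np

image = np.arange(1, 17, dtype=float).reshape(4, 4)
f = 3
p = (f - 1) // 2    # p = 1 for a 3x3 filter

# Same padding: surround the image with p rows/columns of zeros.
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

valid_size = image.shape[0] - f + 1     # valid padding: nothing is added
same_size = padded.shape[0] - f + 1     # same padding: output matches input

print(padded.shape)   # (6, 6)
print(valid_size)     # 2
print(same_size)      # 4
```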
One more thing to keep in mind is that a stride greater than 1 reduces the size of the output further, and if the formula gives a non-integer value, for example 3.7, it is rounded down (here to 3). The filter is also known as a kernel, and its role is to identify edges, which may be horizontal, vertical, or at any angle.
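To make the edge-detection role of a kernel concrete, here is a sketch with a hand-written vertical-edge kernel. The image and kernel values are chosen for illustration and are not taken from the figures in this post.

```python
import numpy as np

# 6x6 image: bright left half (10), dark right half (0).
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Illustrative vertical-edge kernel: positive left column, negative right column.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# Stride 1, no padding: a 6x6 input and a 3x3 kernel give a 4x4 output.
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(out)   # every row is 0, 30, 30, 0
```

The non-zero band in the middle of the output marks the place where the image switches from bright to dark, which is the vertical edge.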
CONVOLUTION ON VOLUME
So far, we have performed convolution on a single matrix, that is, a single-channel greyscale image. As explained before, color images consist of 3 channels: red, green, and blue (RGB). Here the input consists of 3 layers, and the filter it is multiplied with has 3 layers too. The values obtained from all three layers are added together to form a single-layer matrix. This is known as convolution on volume.
Padding and striding work the same way here. In the diagram below, red represents the red channel, and likewise for the other two layers. We can also see how striding takes place, with the mask shifting by 1 unit.
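Convolution on volume can be sketched as follows. The function name `conv_volume` and the random test data are assumptions made for the example, and padding is left out for brevity.

```python
import numpy as np


def conv_volume(image, kernel, stride=1):
    """image: (H, W, C), kernel: (f, f, C) -> single-channel (H', W') output."""
    h, w, c = image.shape
    f = kernel.shape[0]
    assert kernel.shape == (f, f, c), "filter depth must match image depth"
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = image[top:top + f, left:left + f, :]
            # The sum runs over height, width and all channels at once.
            out[i, j] = np.sum(window * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 6, 3)).astype(float)
kernel = rng.standard_normal((3, 3, 3))
out = conv_volume(image, kernel)
print(out.shape)   # (4, 4)
```

Each filter produces one output layer, so applying several filters and stacking their outputs gives a multi-channel result.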
Let’s look at the neural network in the diagram below and see how the dimensions of each layer arise as the input passes through the different operations.
Now let’s work through this with what we have learned so far. First, an image of dimension 39x39x3 is taken. On it we perform a convolution with filter size 3, padding 0, stride 1, and a total of 10 filters. Mathematically,
f = 3; s = 1; p = 0; n = 39
The obtained matrix will be of dimension (39 + 2*0 - 3)/1 + 1 = 37 on each side.
And there are a total of 10 filters.
So, the obtained matrix after the operation is: 37x37x10
Similarly, the second convolutional layer, with filter size 5, padding 0, stride 2, and 20 filters, will be of the dimension => (37 + 2*0 - 5)/2 + 1 = 17
Hence here the dimension obtained is: 17x17x20
The third convolutional layer, again with filter size 5, padding 0, and stride 2 but with 40 filters, will be of the dimension => (17 + 2*0 - 5)/2 + 1 = 7
Hence the size of the matrix after operations is: 7x7x40
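The three layer sizes above can be checked with a short loop. This is a sketch. The layer settings are read off the formulas in the text, and the helper name is illustrative.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length, rounding down when the division is not exact."""
    return (n + 2 * p - f) // s + 1


# (filter size, padding, stride, number of filters) for each layer
layers = [(3, 0, 1, 10), (5, 0, 2, 20), (5, 0, 2, 40)]

n, channels = 39, 3
shapes = [(n, n, channels)]
for f, p, s, n_filters in layers:
    n = conv_output_size(n, f, p, s)
    channels = n_filters          # one output channel per filter
    shapes.append((n, n, channels))

print(shapes)             # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]
print(n * n * channels)   # 1960 values once the last volume is flattened
```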
After that, the volume is flattened into a single layer of 7*7*40 = 1960 values. From it, one more layer is made with as many neurons as there are classes, and the SoftMax operation is applied to it to classify the object: mathematically speaking, the network returns the class that has the highest probability. This demonstrates the basic function of a convolutional neural network. There are other concepts as well, and next we deal with pooling, the types of pooling that exist, and their importance.
Pooling is another operation that is performed. The reason for carrying it out is to select the features that are important, and it also reduces the dimension of the matrix, making computation easier. Pooling is governed by three parameters, among them the size of the pooling mask and the stride.
Generally, three types of pooling operation are carried out: max pooling, min pooling, and average pooling.
Let’s discuss how pooling works and how these types of pooling are carried out. Take a matrix holding certain values. The size is the extent of the mask, the region on which pooling is performed, very much like the filter size in convolution. After the pooling operation is carried out on one region, the mask shifts by the number of units given by the stride. Consider a matrix of dimension 4×4 where the size of the pooling mask is 2 and the stride is 2. The mask shifts right by 2 units and then down by 2 units, so it covers four non-overlapping areas, which are shown in 4 different colors for representation purposes.
Max pooling picks the largest value present in the mask. It is generally used to pick the brighter pixels/values from the matrix.
Similarly, min pooling picks the lowest value in the mask, which generally corresponds to the darkest spot in that region of the image.
Average pooling takes the average of the values in the mask region.
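The three pooling types can be sketched with one helper. The function name `pool2d` and the 4×4 values are made up for illustration, with a mask size of 2 and a stride of 2 as in the example above.

```python
import numpy as np


def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max, min or average pooling to a 2-D matrix."""
    reducer = {"max": np.max, "min": np.min, "avg": np.mean}[mode]
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            out[i, j] = reducer(x[top:top + size, left:left + size])
    return out


x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [4, 6, 3, 5]], dtype=float)

print(pool2d(x, mode="max"))   # values: [[7, 8], [9, 5]]
print(pool2d(x, mode="min"))   # values: [[1, 2], [2, 0]]
print(pool2d(x, mode="avg"))   # values: [[4, 5], [5.25, 2.25]]
```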
A pooling operation is carried out in the same way on a color image, which consists of 3 layers representing the 3 RGB channels: each channel is pooled independently. Below is a diagram of a color image on which a max pool operation takes place, and we can clearly see that the size of the matrix reduces.
In pooling, the same formula applies as before. Pooling is applied to each channel separately, so the number of channels is unchanged, and the dimension of the matrix obtained after the operation is:
[(n + 2p - f)/s + 1] x [(n + 2p - f)/s + 1] x (no. of channels)
1×1 convolution, also well known as network in network, is used to decrease the number of channels in a multi-channel matrix. It does not reduce the height and width but decreases the number of channels, as represented below. For example, applying 32 filters of size 1×1 to a volume of 28x28x192 produces a volume of dimension 28x28x32.
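Because a 1×1 filter looks at one pixel at a time, the operation is a linear map across the channel axis. The sketch below assumes a 28x28x192 input and 32 filters, with random values standing in for real activations and weights.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((28, 28, 192))   # height x width x channels
filters = rng.standard_normal((192, 32))      # 32 filters, each of shape 1x1x192

# Contract the channel axis of the volume with the first axis of the filters.
out = np.tensordot(volume, filters, axes=([2], [0]))

print(out.shape)   # (28, 28, 32)
```

The height and width are untouched, and only the channel count changes from 192 to 32.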
FULLY CONNECTED LAYER
Lastly, after all these operations, the resulting matrix of values is flattened into a single layer, on which the SoftMax operation is carried out to classify the object or for any other purpose.
The SoftMax layer has as many neurons as there are classes. For the MNIST dataset, there are 10 neurons in the SoftMax layer.
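A minimal sketch of the SoftMax operation for 10 classes is given below. The logit values are made up, and subtracting the maximum is a common numerical-stability trick rather than something specific to this post.

```python
import numpy as np


def softmax(z):
    """Turn a vector of logits into probabilities that sum to 1."""
    shifted = z - np.max(z)    # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)


# One made-up logit per class (10 classes, as in MNIST).
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0, -0.5, 1.5, 0.2])
probs = softmax(logits)

print(probs.sum())              # sums to 1 up to floating-point error
print(int(np.argmax(probs)))    # 6, the class with the largest logit
```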
The image below depicts how the operations discussed before are carried out, ending in the fully connected layers. There are generally two such layers at the end: the first is an FC layer and the second is the SoftMax layer.
FORWARD AND BACKWARD PROPAGATION
Forward Propagation: To train the model, forward and backward propagation are carried out alternately. We will now look at the mathematics of how a CNN functions and how both forward and backward propagation take place. In this neural network, the trainable parameters are the weights of the filters used in the convolution and the weights of the fully connected layer. Max pooling has no trainable parameters.
The loss can be calculated using the L1 loss, the L2 loss, or the cross-entropy loss, and for a CNN classifier we use the cross-entropy loss. Here, the output ẑ calculated during the forward propagation can be represented as:
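The standard way to write this step is sketched below. It is an assumed form rather than a reproduction of the original equation: x stands for the input and ⋆ for the sliding-window operation described earlier, and both symbols are introduced here only for illustration.

```latex
\hat{z} = x \star w + b
```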
w = [w1, w2, w3, …, wn]; ‘w’ consists of the weights of the filters.
‘b’ consists of the biases, one for each filter.
This forms the first layer, and then a maxpool operation is done on it: maxpool(ẑ)
A similar process is carried on until we obtain an FC layer, whose output is sent through the SoftMax layer. In the final output layer, we obtain the predictions after the first pass as:
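For a SoftMax output layer with c classes, the predictions take the usual form below. The symbol ŷ for the predicted probabilities is an assumed notation, and z denotes the values (logits) coming out of the FC layer.

```latex
\hat{y}_i = \frac{e^{z_i}}{\sum_{k=1}^{c} e^{z_k}}, \qquad i = 1, \dots, c
```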
From this, we calculate the loss, and the weights are then adjusted so that the predicted values move closer to the actual values. For that to happen, the loss is sent to backward propagation.
Backward Propagation: In backward propagation we calculate the derivatives of the loss w.r.t. the weights and biases, and the weights and biases are adjusted using these derivatives scaled by the learning rate, which is given as:
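The usual gradient descent update is sketched below. It is the standard textbook form and is assumed to match the rule intended here.

```latex
w := w - \alpha \frac{\partial L}{\partial w}, \qquad
b := b - \alpha \frac{\partial L}{\partial b}
```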
where α = learning rate
Here we use the cross-entropy loss instead of the L1 loss or the L2 loss. As a recap, the L1 loss is also known as the least absolute deviation and the L2 loss as the least square error, given as:
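In their common form, with y the actual values, ŷ the predicted values and n the number of values (symbols assumed for illustration), the two losses read:

```latex
L_1 = \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad
L_2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```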
Cross-entropy loss for ‘c’ classes is calculated as:
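The standard definition is sketched below, assuming a one-hot target y and SoftMax predictions ŷ (both symbols are assumed notation):

```latex
L = -\sum_{i=1}^{c} y_i \log \hat{y}_i,
\qquad \hat{y}_i = \frac{e^{z_i}}{\sum_{k=1}^{c} e^{z_k}}
```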
‘z’ represents the logits. Now we will derive the gradients for the trainable parameters: ∂L/∂w, ∂L/∂b
Derivatives w.r.t weight:
Case 1: ( i = l )
We calculate the derivative of:
Here we use the quotient rule which is given as:
Case 2: ( i ≠ l )
In this case ‘i’ is not equal to ‘l’.
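Both cases can be worked out with the quotient rule. The sketch below differentiates the SoftMax output ŷ_i with respect to the logit z_l, writing S for the sum of exponentials. ŷ and S are assumed notation.

```latex
\left(\frac{g}{h}\right)' = \frac{g'h - gh'}{h^2},
\qquad S = \sum_{k=1}^{c} e^{z_k}, \qquad \hat{y}_i = \frac{e^{z_i}}{S}

% Case 1: i = l
\frac{\partial \hat{y}_i}{\partial z_l}
  = \frac{e^{z_i} S - e^{z_i} e^{z_i}}{S^2}
  = \hat{y}_i \left(1 - \hat{y}_i\right)

% Case 2: i \neq l
\frac{\partial \hat{y}_i}{\partial z_l}
  = \frac{0 \cdot S - e^{z_i} e^{z_l}}{S^2}
  = -\hat{y}_i \, \hat{y}_l
```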
Our next objective is to calculate the derivative of the loss w.r.t. ‘z’, given as: ∂L/∂zl
To do that, we differentiate the cross-entropy loss, which is given as:
Since we calculated the derivative of the SoftMax output before, we substitute it here and obtain,
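The substitution can be sketched as follows, assuming a one-hot target y (so its entries sum to 1) and SoftMax predictions ŷ:

```latex
\frac{\partial L}{\partial z_l}
  = -\sum_{i=1}^{c} \frac{y_i}{\hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_l}
  = -y_l \left(1 - \hat{y}_l\right) + \sum_{i \neq l} y_i \, \hat{y}_l
  = \hat{y}_l \sum_{i=1}^{c} y_i - y_l
  = \hat{y}_l - y_l
```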
After this derivative is calculated, it is propagated backwards and gradient descent is applied to update the weights.
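The closed form "prediction minus target" for the gradient of the cross-entropy loss w.r.t. the logits can be checked numerically. This is a sketch with made-up logits and a one-hot target. The function names are illustrative.

```python
import numpy as np


def softmax(z):
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()


def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))


z = np.array([1.0, -0.5, 2.0, 0.3])    # made-up logits
y = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot target

analytic = softmax(z) - y              # closed-form gradient dL/dz

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for l in range(len(z)):
    step = np.zeros_like(z)
    step[l] = eps
    numeric[l] = (cross_entropy(z + step, y)
                  - cross_entropy(z - step, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```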
Now we need to calculate the derivative of the loss w.r.t. the inputs: ∂L/∂x. This derivative is calculated for backpropagation through the maxpool layer. In backpropagation, the gradient is passed only to the locations that held the highest values in the forward pass, and all other locations are assigned 0. According to this, the weights are adjusted in the convolution layer.
The matrix obtained from this step of backward propagation is taken as: ∂L/∂x’
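Backpropagation through a maxpool layer can be sketched as below. The function name and the example values are assumptions, and ties inside a window are resolved by taking the first maximum, which is how `np.argmax` behaves.

```python
import numpy as np


def maxpool_backward(x, grad_out, size=2, stride=2):
    """Route each upstream gradient to the position of the max in its window."""
    grad_x = np.zeros_like(x, dtype=float)
    out_h, out_w = grad_out.shape
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = x[top:top + size, left:left + size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            grad_x[top + r, left + c] += grad_out[i, j]
    return grad_x


x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [4, 6, 3, 5]], dtype=float)
grad_out = np.array([[0.1, 0.2],
                     [0.3, 0.4]])     # made-up gradient from the layer above

grad_x = maxpool_backward(x, grad_out)
print(grad_x)   # non-zero only where 7, 8, 9 and 5 sit in x
```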
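The next objective is to calculate the partial derivative of the loss w.r.t. the weights, ∂L/∂w. When an image is passed through the layer, it produces an output by being multiplied with the filter.
We can see that ∂L/∂w is calculated from the input: each filter weight multiplies a set of input values, so its gradient is the sum of those input values weighted by the gradient arriving from the layer above.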
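The weight gradient of a single-channel convolution (stride 1, no padding) can be sketched and checked numerically as below. All names and the random data are assumptions made for the example, and `grad_out` stands in for the gradient arriving from the layer above.

```python
import numpy as np


def conv2d(x, w):
    """Stride-1, no-padding sliding-window product (kernel not flipped)."""
    f = w.shape[0]
    out = np.zeros((x.shape[0] - f + 1, x.shape[1] - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + f, j:j + f] * w)
    return out


def conv2d_weight_grad(x, grad_out, f):
    """dL/dw[a, b] = sum over (i, j) of grad_out[i, j] * x[i + a, j + b]."""
    grad_w = np.zeros((f, f))
    out_h, out_w = grad_out.shape
    for a in range(f):
        for b in range(f):
            grad_w[a, b] = np.sum(x[a:a + out_h, b:b + out_w] * grad_out)
    return grad_w


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
grad_out = rng.standard_normal((4, 4))    # stand-in for dL/d(output)

grad_w = conv2d_weight_grad(x, grad_out, f=3)

# Numerical check on one weight, using the surrogate loss L = sum(output * grad_out).
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1, 2] += eps
w_minus[1, 2] -= eps
numeric = (np.sum(conv2d(x, w_plus) * grad_out)
           - np.sum(conv2d(x, w_minus) * grad_out)) / (2 * eps)

print(np.isclose(grad_w[1, 2], numeric, atol=1e-6))   # True
```

The weight gradient is itself a sliding-window product of the input with the upstream gradient, so `conv2d(x, grad_out)` returns the same 3×3 matrix.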
Hence we have calculated the partial derivatives of the loss w.r.t. the weights and the inputs, which describes the functioning of a convolutional layer. The architecture/structure of a CNN is given in the diagram below.
This is what a convolutional neural network looks like. In general, though, CNNs contain many more layers, and deep architectures can stack up to a few hundred or even a thousand layers. Some other famous CNN architectures are:
- LeNet-5 (1998)
- AlexNet (2012)
- ResNet50 (2015)
- Inception-v4 (2016)
Some of their architectures are given below:
Now we will build a CNN right from scratch.
CNN Architecture from Scratch
So far, we have discussed in depth how a CNN functions and the mathematics behind it. Now our aim is to build a CNN from scratch without any deep learning framework.
Our main objective here is to define functions and explain in a simple manner how the CNN works, and for that purpose we will work on the famous Lena example image. First, we import all the dependencies.
As can be clearly seen, we are using only NumPy, cv2, and Matplotlib. Next, we define a convolution layer block, that is, a class that takes care of both forward propagation and backward propagation as methods, which can be seen below.
When the Lena image is passed through it, the layer responds to the edges in the image through its filters, whose weights and biases are assigned at random. This block defines the working of the convolution operation, and then we build the other blocks. The next block carries out the maxpool operation and is given below, along with the output of the previous block and the output of the current block.
After the maxpool operation, the important features in the image have been selected. Finally, the fully connected (FC) layer is formed and passed through the softmax layer, as shown in the diagram below. Here, after the maxpool is done, the image is completely flattened and passed through a neural network that ends in a softmax activation, after which the loss is calculated with the cross-entropy loss. The weights are then adjusted during backpropagation.
In the above snippet of code, you can see how the mathematics we derived earlier is used to calculate the loss and update the weights and biases. NumPy is used to carry out these operations. Deep learning frameworks define wrappers that make the coding and understanding much easier. For ease of understanding, we have left out a few operations here, such as padding.
Our next aim is to use the neural network we have developed on the MNIST dataset to see how well our code works.
Here we import the dataset from the Keras datasets, as given below:
After that, we define three blocks that take care of the forward prop, the backward prop, and the adjustment of weights, which is carried out by running over multiple epochs. After this we get an accuracy of nearly 83% on the training data:
That’s all from my side on CNNs, and I would highly appreciate your support. If you have any queries or doubts related to anything in this blog, feel free to contact me on my LinkedIn profile mentioned below. Thank you!
Sudeep Das: https://www.linkedin.com/in/sudeepdas27/