An intuitive explanation of how meaningless filters in a CNN take meaningful shapes

Arif
5 min read · Jun 20, 2021


Prerequisites: I need you to have some basic understanding of Convolutional Neural Networks. It is okay if you don’t yet understand backpropagation in CNNs, but you do need a reasonably clear understanding of how backpropagation works in a fully connected network. Have a look here if that is not clear to you yet.

The Question

You might know by now that in a 2D CNN, filters are basically matrices initialized with random values. During training, through backpropagation, these randomized matrices take on meaningful shapes.

Let’s take a deeper look.

The Stage

We will train a CNN to recognize only ONE 28 X 28 image from the MNIST dataset. We will use 5 filters, all of size 28 X 28. In typical CNN models, filters are much smaller than the image. But in this experiment, for demonstration purposes, we are keeping the filter size the same as the image.

We don’t need to build a smart machine here; we just need to check what the filters look like at the end of the training.

Let’s get our hands dirty

Run this GitHub project in your favorite IDE or mine (PyCharm).
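If you want a feel for the model before opening the project, here is a minimal PyTorch sketch of the architecture described above (5 filters, each 28 X 28, feeding a Linear layer). The names and details here are my assumptions, not necessarily what main.py contains:

```python
import torch
import torch.nn as nn

# Sketch of the described setup; an assumption of what main.py
# roughly does, not a copy of it.
model = nn.Sequential(
    nn.Conv2d(1, 5, kernel_size=28),  # 5 filters, each as big as the image
    nn.Flatten(),                     # each filter produces a single 1 x 1 value
    nn.Linear(5, 10),                 # maps the 5 values to the 10 digit classes
)

x = torch.randn(1, 1, 28, 28)         # one 28 x 28 grayscale image
print(model(x).shape)                 # torch.Size([1, 10])
```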

If you get the same output as I did, you will see 11 images, the first being the original image:

The next 5 images are the filters initialized with random values.

And the last 5 images are the modified filters at the end of the training.

So, filters took the shape of the actual image? What is happening here? 🤔

How these filters will help

I promise you I will explain the reason behind this transformation in a minute. But let us first quickly check how these generated filters will help us detect “4”.

Let’s have a look at the matrix representation of our image. It is actually a simplified version for illustration purposes: all lower-valued pixels are set to 0 and all higher-valued pixels to 255.

Now, take a look at a similar representation of one of the generated filters, say, the first one:

Remember, the convolution operation multiplies corresponding elements and then takes the sum of all the products.

So, if you run a convolution operation with this filter on the image, it will first produce a matrix, each cell holding the product of the corresponding elements of the two matrices. The sum of all the cells is the final result of the convolution.
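Here is that operation on toy 2 X 2 matrices, just to make it concrete:

```python
import numpy as np

# Toy 2 x 2 "image" and "filter" with ON (255) and OFF (0) pixels.
image = np.array([[  0, 255],
                  [255,   0]])
filt  = np.array([[  0, 255],
                  [  0, 255]])

products = image * filt    # element-wise products, one per cell
result   = products.sum()  # 0 + 255*255 + 0 + 0
print(result)              # 65025: only one pair of ON pixels lines up
```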

For a particular position, say, (5,7), if the pixels in both the image and the filter are ON (255), meaning:

image[5][7] = filter[5][7] = 255

it will produce a high value: 255 X 255 = 65,025 in this case.

If the pixels don’t match, one of the two cells will be 0, making the product 0.

So, the more pixels match, the more non-zero elements there are in the resulting product matrix, and so the bigger the sum (the final result).

In summary: the more the image looks like the filter, the higher the value produced by the convolution operation, which will eventually influence the machine to predict that this particular image (in this example) is a “4”.
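You can check this claim directly: convolving the image with a filter shaped exactly like it yields a much larger sum than convolving it with an unrelated random filter. A quick sketch with toy binary data:

```python
import numpy as np

rng      = np.random.default_rng(0)
image    = (rng.random((28, 28)) > 0.5) * 255   # toy binary "image"
matching = image.copy()                         # a filter shaped like the image
random_f = (rng.random((28, 28)) > 0.5) * 255   # an unrelated random filter

print((image * matching).sum())   # large: every ON pixel lines up
print((image * random_f).sum())   # roughly half as large: many mismatches
```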

How the filter took the shape

Before and after Training

Let’s try to find the answer in this section.

On the forward pass, the randomly generated filter is applied to the image “4”, which produces an arbitrary value as the result of the convolution. During the backward pass, the machine will update the filter so that, if the same training sample is applied again, the forward pass produces a higher value as the result of the convolution. In fact, it could be higher or lower, since any extreme value (either the highest possible or the lowest possible) will influence the machine (through the following Linear layer, line#14 in main.py in the above-mentioned GitHub project) to predict “4” better. Let us consider “highest” for now.

What would be the wisest way for the machine to maximize the result? You got it right: making the filter look more and more like the image at every backward pass! (Understood now why our filters took a similar shape to the image?) How is that achieved? Gradient descent, partial derivatives, the chain rule …

Now, I won’t make this article dirty with scary equations. There are thousands of tutorials available on the internet on “backpropagation in CNNs”. I will just try to give you an intuitive explanation of how a CNN achieves it.

The Intuition

Let’s go back to the backward-pass case illustrated above. As explained, the machine will, through some operations, try to make the filter look like the image.
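For our full-image filter there is a simple way to see it without scary math. The convolution output is just the sum of filter[i][j] × image[i][j] over all cells, so the gradient of that output with respect to each filter cell is simply the corresponding image pixel. Pushing the output up therefore adds a faint copy of the image to the filter at every step. A minimal NumPy sketch (the learning rate and step count are arbitrary assumptions):

```python
import numpy as np

rng   = np.random.default_rng(0)
image = rng.random((28, 28))      # stand-in for the "4"
filt  = rng.random((28, 28))      # randomly initialized filter
lr    = 0.1                       # assumed learning rate

for _ in range(100):
    # y = (filt * image).sum(), so d(y)/d(filt) is just the image itself.
    grad = image
    filt = filt + lr * grad       # gradient ASCENT on y: filter drifts toward the image

# After enough steps the accumulated image term dominates the random
# initialization, which is why the trained filter looks like the digit.
```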

Why not just copy the pixel values? Well, that would work perfectly if we only had to recognize this particular image as “4”. But remember, in practice there will be many variations of “4”, and nine other digits as well. Also, a 28 X 28 filter, producing only one number as the result of the convolution operation, won’t give us much information about the image in question. So, we take filters of smaller size and try to recognize common shapes present in the images, not the whole image. One example of such a shape is the circle, which is present in 0, 6, 8, and 9.

Also, remember that a filter is applied to all the regions of the image. For example, if you run a 2 X 2 filter on a 3 X 3 image, there will be 4 regions.
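A quick sketch of that sliding, assuming stride 1 and no padding:

```python
import numpy as np

image = np.arange(9).reshape(3, 3)   # a toy 3 x 3 "image"
filt  = np.ones((2, 2))              # a toy 2 x 2 filter

# Slide the filter over every 2 x 2 region: (3-2+1) x (3-2+1) = 4 positions.
out = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        region    = image[i:i+2, j:j+2]
        out[i, j] = (region * filt).sum()  # one convolution result per region

print(out)  # 4 numbers, one per region
```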

Taking this into consideration, let’s refine the objective of the backward pass. For a particular training example, during the backward pass, for every region of the image, the machine will update the filter in an attempt to make the filter look a bit more like that region. Now, how much is this “bit more”? That depends on:

  1. How close the machine’s prediction was to the right answer during the forward pass. If the prediction was close, the update will be minimal.
  2. What random values the filter was initialized with. This is actually the reason why different filters take different shapes even though similar operations are performed on them. At every backward pass, the randomly initialized matrix is updated a bit, and it gradually takes on a common shape present in the samples.

If the explanation is not intuitive enough, it is time to look at the underlying math. Here is a wonderful step-by-step guide on this.
