Dimensions and implementation of the Convolution step in CNN

Question

I am trying to write my own convolutional neural network from scratch in Python. After reading several articles and watching tutorials on CNNs, there are still a couple of issues that I don't understand, and I would appreciate it very much if someone could help clarify them.

I understand the idea of the filter/kernel (and have implemented it successfully), and I have a trainable ANN that I wrote and that works. I also have a working maxpool.

What is unclear to me is how the k kernels (filters) with c channels (e.g. RGB) are converted to the eventual reduced h * w neurons for the so-called fully connected network at the end.

If I start with an h*w image with 3 channels, and let's say for simplicity that I use 10 kernels, that would mean that I have a rank-4 tensor of weights on the convolution layer (h,w,c,k). But in all the sources I could find, after flattening, all the kernels and channels are gone and only the reduced w and h remain (reduced if there is no padding, and after the maxpooling).

So after this long (and hopefully clear) exposition: I don't understand what is being done with all the data from the different channels and kernels. I have seen different code, where some people apply the filter to all channels equally and then sum, but this seems like it would lose the color information. Is that indeed the solution? Are all the results of the filters and channels added together before being passed to the next layer? If so, is this done before ReLU, or is ReLU applied to each separately and then they are all summed?

Mark.F · Accepted Answer · 2019-02-13 14:45:18Z

When speaking about standard convolutional layers in CNNs, the kernel will be of spatial size K x K and have the same depth as the input layer, while the number of kernels used in the layer will determine the depth of the output.

For example:

Say your input is of size 9x9x3 (a 9x9 RGB image) and your kernel has a spatial size of 4x4. The full size of the kernel will then be 4x4x3, and it will have 49 trainable parameters (4x4x3 = 48 weights, plus one for the bias term).

Now to calculate the output, you slide the kernel's tensor over the input's tensor in spatial strides. At every position you multiply element-wise, sum the products, and add the bias term, so for every spatial position you will have:

$\phi = b + \sum_{i,j}^{4,4} X_{ij} \cdot w_{ij}$

And for every position you will need to apply a non-linear activation such as ReLU, so:

$y = ReLU(\phi) = ReLU(b + \sum_{i,j}^{4,4} X_{ij} \cdot w_{ij})$
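As a minimal from-scratch sketch of the two formulas above, the snippet below computes $\phi$ and $y$ for one kernel at one spatial position. It assumes that $X_{ij} \cdot w_{ij}$ is read as a dot product over the channel axis, so the loop runs over all 4x4x3 = 48 weights. The helper names (`relu`, `conv_at`), the nested-list layout `x[row][col][channel]`, and the toy values are illustrative assumptions, not part of the original answer.

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def conv_at(x, w, b, top, left):
    """Pre-activation phi for one kernel at one spatial position.

    x: input as nested lists, indexed x[row][col][channel]
    w: kernel as nested lists, indexed w[i][j][channel]
    b: scalar bias
    (top, left): upper-left corner of the kernel window in the input
    """
    phi = b
    for i in range(len(w)):
        for j in range(len(w[0])):
            for c in range(len(w[0][0])):  # the sum also runs over channels
                phi += x[top + i][left + j][c] * w[i][j][c]
    return phi


# Toy data (hypothetical values): an all-ones 3-channel input and a
# 4x4x3 kernel whose weights are all 0.5, with a bias of -20.
x = [[[1.0] * 3 for _ in range(9)] for _ in range(9)]
w = [[[0.5] * 3 for _ in range(4)] for _ in range(4)]
b = -20.0

phi = conv_at(x, w, b, top=0, left=0)  # 48 products of 1.0 * 0.5, plus the bias
y = relu(phi)
print(phi, y)  # 4.0 4.0
```

Each channel has its own 4x4 slice of weights, so the color information is weighted per channel before everything is collapsed into the single number $\phi$; ReLU is then applied once to that sum.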

Let's say you use a stride of a single pixel in each of the two spatial directions and no padding. The final output of your layer for this kernel will then be 6x6x1, meaning the same 49 parameters are used at 36 different spatial positions of the input.
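The sliding step can be sketched in plain Python as follows. Without padding, the number of valid positions along one axis is $(n - k)/s + 1$ for input size $n$, kernel size $k$, and stride $s$, so a 6x6 output with a 4x4 kernel and stride 1 corresponds to a 9x9 input. The function names and the all-ones toy data are assumptions made for illustration only.

```python
def relu(z):
    return max(0.0, z)


def output_size(n, k, stride=1):
    """Number of valid kernel positions along one axis (no padding)."""
    return (n - k) // stride + 1


def conv2d_single(x, w, b, stride=1):
    """Apply one kernel to x[row][col][channel]; returns a 2D feature map."""
    k_h, k_w, k_c = len(w), len(w[0]), len(w[0][0])
    out_h = output_size(len(x), k_h, stride)
    out_w = output_size(len(x[0]), k_w, stride)
    out = []
    for r in range(out_h):
        row = []
        for q in range(out_w):
            phi = b  # the same bias and weights are reused at every position
            for i in range(k_h):
                for j in range(k_w):
                    for c in range(k_c):
                        phi += x[r * stride + i][q * stride + j][c] * w[i][j][c]
            row.append(relu(phi))
        out.append(row)
    return out


x = [[[1.0] * 3 for _ in range(9)] for _ in range(9)]  # 9x9x3 input
w = [[[0.5] * 3 for _ in range(4)] for _ in range(4)]  # 4x4x3 kernel
feature_map = conv2d_single(x, w, b=1.0)
print(len(feature_map), len(feature_map[0]))  # 6 6
```

With the same kernel and stride, a 10x10 input would give a 7x7 map instead, since `output_size(10, 4)` is 7.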

Now you will probably want a layer with a depth of more than 1, so let's say you have 5 such kernels, each with its own 49 parameters, for a total of 245 trainable parameters. You need to repeat the above process independently for each of the 5 kernels, and the final output of the layer will be of size 6x6x5.
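The bookkeeping for a whole layer can be checked with a few lines of arithmetic. This is only a sketch: `conv_layer_summary` is a hypothetical helper, and it assumes square kernels, no padding, and a 9x9x3 input (the size that gives a 6x6 map for a 4x4 kernel with stride 1).

```python
def conv_layer_summary(in_h, in_w, in_c, k, n_kernels, stride=1):
    """Output shape and trainable-parameter count of a conv layer (no padding)."""
    out_h = (in_h - k) // stride + 1
    out_w = (in_w - k) // stride + 1
    params_per_kernel = k * k * in_c + 1  # weights plus one bias
    return (out_h, out_w, n_kernels), params_per_kernel * n_kernels


shape, n_params = conv_layer_summary(9, 9, 3, k=4, n_kernels=5)
print(shape, n_params)  # (6, 6, 5) 245
```

The output depth equals the number of kernels, while the input depth only affects how many weights each kernel carries.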

There are also models that use strides in the channel dimension, but those are considerably less common.

When you want to connect the output of a convolutional layer to a fully-connected layer, you can simply flatten it to a single vector. So in our example, you will get a flattened vector of size 180 (6x6x5).
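Flattening itself is just a reordering of the values, as in this small sketch. The `flatten` helper and the `[row][col][channel]` ordering are assumptions; any fixed ordering works as long as the forward and backward passes use the same one.

```python
def flatten(volume):
    """Flatten a nested list indexed [row][col][channel] into one flat list."""
    return [v for row in volume for cell in row for v in cell]


# Hypothetical 6x6x5 conv output, filled with a running index for illustration.
volume = [[[(r * 6 + q) * 5 + d for d in range(5)] for q in range(6)]
          for r in range(6)]
flat = flatten(volume)
print(len(flat))  # 180
```

Every channel at every spatial position becomes one input of the fully connected layer, so nothing produced by the 5 kernels is discarded at this step.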

Thank you for the very detailed answer! There are still a couple of things that I am not sure I understand correctly. You say there are 48 weights + 1 bias for each kernel, so that means that each color effectively has its own filter. Shouldn't the sum over i,j (4,4) then actually be over i,j,k (4,4,3)? Because if I end up with 6x6x1, then the channel dimension is gone by then.
– Tacratis, Commented Feb 13, 2019 at 17:41
The second question is: with a final vector of 6x6x5 it is clear that each kernel generates its own set of outputs for the neurons of the fully connected layer, but when I inspect CNNs generated by Keras using the exact same parameters, I see that the first fully connected layer has only 6x6 neurons. This is where the initial confusion started for me. Thanks again!
– Tacratis, Commented Feb 13, 2019 at 17:42
