The example below, taken from the deeplearning.ai lectures, shows that the result is the sum of the element-by-element product (or “element-wise multiplication”). The red numbers represent the weights in the filter:
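$(1 \times 1) + (1 \times 0) + (1 \times 1) + (0 \times 0) + (1 \times 1) + (1 \times 0) + (0 \times 1) + (0 \times 0) + (1 \times 1) = 4$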
HOWEVER, most resources say that it’s the dot product that’s used:
“…we can re-express the output of the neuron as $y = f(x \cdot w + b)$, where $b$ is the bias
term. In other words, we can compute the output by
performing the dot product of the input and weight vectors, adding in
the bias term to produce the logit, and then applying the…”
Buduma, Nikhil; Locascio, Nicholas. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms (p. 8). O’Reilly Media. Kindle Edition.
“We take the 5*5*3 filter and slide it over the complete image and
along the way take the dot product between the filter and chunks of
the input image. For every dot product taken, the result is a…”
“Each neuron receives some inputs, performs a dot product and
optionally follows it with a non-linearity.”
“The result of a convolution is now equivalent to performing one large
matrix multiply np.dot(W_row, X_col), which evaluates the dot product
between every filter and every receptive field location.”
However, when I research how to compute the dot product of matrices, it seems that the dot product is not the same as summing the element-by-element multiplication. Which operation is actually used (element-by-element multiplication or the dot product), and what is the primary difference?
Any given layer in a CNN typically has 3 dimensions (we’ll call them height, width, depth). The convolution will produce a new layer with a new (or the same) height, width and depth. The operation, however, is performed one way on the height/width and another way on the depth, and this is what I think causes the confusion.
Let’s first see how the convolution operates on the height and width of the input matrix. This case is performed exactly as depicted in your image and is most certainly an element-wise multiplication of the two matrices.
Two-dimensional (discrete) convolutions are calculated by the formula below:
$$C \left[ m, n \right] = \sum_u \sum_v A \left[ m + u, n + v \right] \cdot B \left[ u, v \right]$$
As you can see each element of $C$ is calculated as the sum of the products of a single element of $A$ with a single element of $B$. This means that each element of $C$ is computed from the sum of the element-wise multiplication of $A$ and $B$.
You could test the above example with any number of packages (I’ll use scipy):
    import numpy as np
    from scipy.signal import convolve2d

    A = np.array([[1,1,1,0,0],[0,1,1,1,0],[0,0,1,1,1],[0,0,1,1,0],[0,1,1,0,0]])
    B = np.array([[1,0,1],[0,1,0],[1,0,1]])
    C = convolve2d(A, B, 'valid')
    print(C)
The code above will produce:
    [[4 3 4]
     [2 4 3]
     [2 3 4]]
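To see the formula at work without a library, here is a minimal sketch that applies it directly with NumPy. The helper name `sliding_window_sum` is made up for illustration, and it only handles the 'valid' positions where B fits entirely inside A:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])
B = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

def sliding_window_sum(A, B):
    """Hypothetical helper: evaluate C[m, n] = sum_u sum_v A[m+u, n+v] * B[u, v]."""
    kh, kw = B.shape
    out_h = A.shape[0] - kh + 1
    out_w = A.shape[1] - kw + 1
    C = np.zeros((out_h, out_w), dtype=A.dtype)
    for m in range(out_h):
        for n in range(out_w):
            patch = A[m:m + kh, n:n + kw]   # window of A under the filter
            C[m, n] = np.sum(patch * B)     # element-wise product, then sum
    return C

C_manual = sliding_window_sum(A, B)
print(C_manual)
```

This prints the same 3×3 result as above. One caveat: `convolve2d` flips the kernel before sliding it, whereas the loop here (like the formula) does not; the two agree in this example because B looks the same when rotated by 180°. Each entry could equally be computed as `np.dot(patch.ravel(), B.ravel())`, i.e. as a dot product of the flattened patch and filter, which may be what some of the quoted sources mean.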
Now, the convolution operation on the depth of the input can actually be considered a dot product, as each element at the same height/width position is multiplied by its channel's weight and the products are summed together. This is most evident in the case of 1×1 convolutions (typically used to manipulate the depth of a layer without changing its other dimensions). This, however, is not part of a 2D convolution (from a mathematical viewpoint) but something convolutional layers do in CNNs.
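As a rough illustration of the depth case, the sketch below applies a single 1×1 filter to a small input. The shapes, the random input `X` and the weights `w` are all made-up values, and real CNN layers would also add a bias and use several filters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(4, 4, 3))  # hypothetical input: height 4, width 4, depth 3
w = np.array([1, -2, 3])                # hypothetical 1x1 filter: one weight per channel

# At every (row, col) position, take the dot product of the depth vector with w.
Y = np.empty(X.shape[:2], dtype=X.dtype)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Y[i, j] = np.dot(X[i, j, :], w)

# The same computation in one call: contract the depth axis of X against w.
Y_vectorized = np.tensordot(X, w, axes=([2], [0]))

print(Y.shape)                           # height and width are kept, depth collapses
print(np.array_equal(Y, Y_vectorized))
```

Height and width are untouched here; only the depth axis is reduced, which is why this part of the layer reads naturally as a dot product.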
1: That being said, I think most of the sources you provided have explanations that are misleading, to say the least, and are not correct. I wasn’t aware so many sources have this operation (which is the most essential operation in CNNs) wrong. I guess it has something to do with the fact that convolutions sum products of scalars, and the product of two scalars is also called a dot product.
2: I think that the first reference refers to a Fully Connected layer instead of a Convolutional layer. If that is the case, an FC layer does perform the dot product as stated. I don’t have the rest of the context to confirm this.
tl;dr The image you provided is 100% correct on how the operation is performed; however, this is not the full picture. CNN layers have 3 dimensions, two of which are handled as depicted. My suggestion would be to read up on how convolutional layers handle the depth of the input (the simplest case you could look at is 1×1 convolutions).
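For comparison, a single fully connected neuron of the kind that first quote seems to describe could be sketched as follows. The input, weights and bias are made-up numbers, and the sigmoid is only an assumed choice of activation, since the quote does not name one:

```python
import numpy as np

def sigmoid(z):
    """Assumed activation function; the quoted text does not specify which one is used."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # hypothetical input vector
w = np.array([0.1, 0.4, -0.2])   # hypothetical weight vector
b = 0.3                          # hypothetical bias

logit = np.dot(x, w) + b         # dot product of input and weights, plus bias
y = sigmoid(logit)               # apply the activation to the logit

print(logit, y)
```

Here `np.dot(x, w)` is the same number as `np.sum(x * w)`: for two vectors, the dot product is the sum of the element-wise products.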