The example below, taken from the deeplearning.ai lectures, shows that the result is the sum of the element-by-element product (or "element-wise multiplication"). The red numbers represent the weights in the filter:

$$(1 \cdot 1)+(1 \cdot 0)+(1 \cdot 1)+(0 \cdot 0)+(1 \cdot 1)+(1 \cdot 0)+(0 \cdot 1)+(0 \cdot 0)+(1 \cdot 1) = 1+0+1+0+1+0+0+0+1 = 4$$

**However**, most resources say that it's the **dot product** that's used:

"…we can re-express the output of the neuron as $y = f(\mathbf{x} \cdot \mathbf{w} + b)$, where $b$ is the bias term. In other words, we can compute the output by performing the dot product of the input and weight vectors, adding in the bias term to produce the logit, and then applying the transformation function."

Buduma, Nikhil; Locascio, Nicholas. *Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms* (p. 8). O'Reilly Media, Kindle Edition.

"We take the 5*5*3 filter and slide it over the complete image and along the way take the dot product between the filter and chunks of the input image. For every dot product taken, the result is a scalar."

"Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity."

http://cs231n.github.io/convolutional-networks/

"The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location."

http://cs231n.github.io/convolutional-networks/

However, when I research how to compute the dot product of matrices, it seems that the dot product is **not** the same as summing the element-by-element multiplication. Which operation is actually used (element-by-element multiplication or the dot product), and what is the primary difference?

**Answer**

Any given layer in a CNN typically has 3 dimensions (we'll call them height, width, depth). The convolution produces a new layer with a new (or the same) height, width and depth. The operation, however, is performed **differently on the height/width than on the depth**, and this is what I think causes confusion.

Let's first see how the convolution operates on the **height** and **width** of the input matrix. This case is performed **exactly** as depicted in your image and is most certainly an **element-wise multiplication of the two matrices**, followed by a sum.

**In theory**:

Two-dimensional (discrete) convolutions are calculated by the formula below:

$$C \left[ m, n \right] = \sum_u \sum_v A \left[ m + u, n + v \right] \cdot B \left[ u, v \right]$$

As you can see each element of $C$ is calculated as the sum of the products of a single element of $A$ with a single element of $B$. This means that each element of $C$ is computed from the sum of the element-wise multiplication of $A$ and $B$.

**In practice**:

You could test the above example with any number of packages (I’ll use scipy):

```
import numpy as np
from scipy.signal import convolve2d

A = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])
B = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

C = convolve2d(A, B, 'valid')  # 'valid': no padding, the kernel stays inside A
print(C)
```

The code above will produce:

```
[[4 3 4]
 [2 4 3]
 [2 3 4]]
```
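You can verify the same result by hand, without scipy: slide the 3×3 kernel over $A$ and, for each position, sum the element-wise products (a minimal sketch of the operation described above; since this particular kernel is symmetric under 180° rotation, convolution and correlation coincide):

```
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])
B = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

# Each output entry is the sum of the element-wise product
# between B and the 3x3 patch of A under it.
out = np.zeros((3, 3), dtype=int)
for m in range(3):
    for n in range(3):
        out[m, n] = (A[m:m+3, n:n+3] * B).sum()
print(out)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```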

Now, the convolution operation on the **depth** of the input can actually be considered a **dot product**, as each element at the same height/width is multiplied by the same weight and they are summed together. This is most evident in the case of 1×1 convolutions (typically used to manipulate the depth of a layer without changing its spatial dimensions). This, however, is not part of a 2D convolution from a mathematical viewpoint, but something convolutional layers do in CNNs.
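To sketch that depth behaviour (my own illustration with made-up values, not from the original answer): a single 1×1 filter has one weight per input channel, and applying it at a spatial position is exactly a dot product along the depth axis:

```
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))  # input: height 4, width 4, depth 3
w = rng.standard_normal(3)          # one 1x1 filter: one weight per channel

# Apply the 1x1 convolution at every spatial position at once:
# contract the depth axis of x against w.
out = np.tensordot(x, w, axes=([2], [0]))  # shape (4, 4)

# At any single position this is just a dot product along the depth.
assert np.isclose(out[1, 2], np.dot(x[1, 2, :], w))
```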

**Notes**:

1: That being said, I think most of the sources you provided have misleading explanations, to say the least, and are not correct. I wasn't aware so many sources get this operation (the most essential operation in CNNs) wrong. I guess it has something to do with the fact that a convolution sums the products of corresponding scalars, and a sum of products of corresponding elements is exactly what a dot product of two (flattened) vectors computes.
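This also explains why the terminology slips so easily: if you flatten the patch and the kernel into vectors, the sum of element-wise products *is* a dot product, whereas a matrix multiplication of the two 3×3 arrays is something else entirely. A quick sanity check (my own illustration, reusing the top-left patch from the example above):

```
import numpy as np

patch  = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 1]])  # top-left 3x3 chunk of A
kernel = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])

elementwise_sum  = (patch * kernel).sum()                   # what the conv does
dot_of_flattened = np.dot(patch.ravel(), kernel.ravel())    # equal, by definition
matrix_multiply  = patch @ kernel                           # NOT the same: a 3x3 matrix

print(elementwise_sum, dot_of_flattened)  # both 4
```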

2: I think that the first reference refers to a fully connected layer instead of a convolutional layer. If that is the case, an FC layer does perform the dot product as stated. I don't have the rest of the context to confirm this.

**tl;dr** The image you provided is 100% correct on how the operation is performed; however, this is not the full picture. CNN layers have 3 dimensions, two of which are handled as depicted. My suggestion would be to read up on how convolutional layers handle the depth of the input (the simplest case being 1×1 convolutions).

**Attribution**
*Source: Link, Question Author: Ryan Chase, Answer Author: Lucas*