# Are there mathematical reasons for convolution in neural networks beyond expediency?

In convolutional neural networks (CNNs), the weight matrix at each step gets its rows and columns flipped to obtain the kernel matrix before proceeding with the convolution. This is explained in a series of videos by Hugo Larochelle here:

> Computing the hidden maps would correspond to doing a discrete convolution with a channel from the previous layer, using a kernel matrix […], and that kernel is computed from the hidden weights matrix $W_{ij}$, where we flip the rows and the columns.

If we compared the reduced number of operations in a convolution to regular matrix multiplication, as in other types of NNs, expediency would be a clear explanation. However, this may not be the most pertinent comparison…

In digital image processing, the application of a convolution filter to an image (this is a great YouTube video for practical intuition) seems related to:

1. The fact that convolution is associative while (cross-)correlation is not.
2. The possibility to apply filters in the frequency domain of the image as multiplications, since convolution in the time domain is equivalent to multiplication in the frequency domain (convolution theorem).
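Point 2 can be checked numerically. The following is a minimal sketch (using NumPy, with hypothetical signal names `f` and `g`) showing that circular convolution computed from the definition equals the inverse FFT of the pointwise product of the FFTs:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(8)   # hypothetical signal
g = rng.standard_normal(8)   # hypothetical filter
n = len(f)

# Circular convolution computed directly from the definition
direct = np.array([sum(f[m] * g[(k - m) % n] for m in range(n))
                   for k in range(n)])

# Convolution theorem: convolution in the time domain is
# pointwise multiplication in the frequency domain
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

assert np.allclose(direct, via_fft)
```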

In this particular technical environment of DSP, cross-correlation is defined as:

$$C(x,y)=\sum_{i=-N}^{N}\,\sum_{j=-N}^{N} F(i,j)\, I(x+i,\, y+j)$$

which is essentially the sum of all the cells in a Hadamard product:

$$C(x,y)=\sum\left(\begin{bmatrix} F(-N,-N) & \cdots & F(-N,N)\\ \vdots & \ddots & \vdots\\ F(N,-N) & \cdots & F(N,N)\end{bmatrix}\odot\begin{bmatrix} I(x-N,\,y-N) & \cdots & I(x-N,\,y+N)\\ \vdots & \ddots & \vdots\\ I(x+N,\,y-N) & \cdots & I(x+N,\,y+N)\end{bmatrix}\right)$$

where $F(i,j)$ is the filter function (expressed as a $(2N+1)\times(2N+1)$ matrix), and $I(x,y)$ is the pixel value of the image at location $(x,y)$.

The objective of cross-correlation is to assess how similar a probe image is to a test image. The calculation of a cross-correlation map relies on the convolution theorem.
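The sum-of-Hadamard-product view of cross-correlation can be verified against a library implementation. A sketch using SciPy, with a hypothetical filter `F` and image `I`:

```python
import numpy as np
from scipy.signal import correlate2d

F = np.arange(9, dtype=float).reshape(3, 3)    # hypothetical 3x3 filter
I = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 image

def cross_corr(F, I, x, y):
    """Correlation at (x, y): sum of the cells of the Hadamard
    (elementwise) product of F with the patch anchored at (x, y)."""
    patch = I[x:x + F.shape[0], y:y + F.shape[1]]
    return np.sum(F * patch)

# Library cross-correlation over all valid positions
C = correlate2d(I, F, mode="valid")
assert np.isclose(C[1, 2], cross_corr(F, I, 1, 2))
```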

On the other hand, convolution is defined as:

$$C(x,y)=\sum_{i=-N}^{N}\,\sum_{j=-N}^{N} F(i,j)\, I(x-i,\, y-j)$$

which equals a cross-correlation with the rows and columns of the filter flipped (so the two operations coincide whenever the filter is symmetric):

$$C(x,y)=\sum\left(\begin{bmatrix} F(N,N) & \cdots & F(N,-N)\\ \vdots & \ddots & \vdots\\ F(-N,N) & \cdots & F(-N,-N)\end{bmatrix}\odot\begin{bmatrix} I(x-N,\,y-N) & \cdots & I(x-N,\,y+N)\\ \vdots & \ddots & \vdots\\ I(x+N,\,y-N) & \cdots & I(x+N,\,y+N)\end{bmatrix}\right)$$
Convolution in DSP is meant to apply filters to the image (e.g. smoothing, sharpening). As an example, after convolving Joseph Fourier’s face with the Gaussian convolution filter $\small\begin{bmatrix} 1&4&7&4&1\\ 4&16&26&16&4\\ 7&26&41&26&7\\ 4&16&26&16&4\\ 1&4&7&4&1\end{bmatrix}$, the edges of his face become fuzzier.
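The smoothing effect can be reproduced with that same kernel, here normalized by its sum (273) so overall brightness is preserved. A sketch on a random stand-in image, since the photo itself is not available here:

```python
import numpy as np
from scipy.signal import convolve2d

# The 5x5 Gaussian kernel from the text, normalized to sum to 1
G = np.array([[1,  4,  7,  4, 1],
              [4, 16, 26, 16, 4],
              [7, 26, 41, 26, 7],
              [4, 16, 26, 16, 4],
              [1,  4,  7,  4, 1]], dtype=float) / 273.0

img = np.random.default_rng(2).random((64, 64))  # stand-in for the photo
blurred = convolve2d(img, G, mode="same", boundary="symm")

# Smoothing attenuates high frequencies: local variation drops
assert blurred.std() < img.std()
```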

Computationally, both operations are a Frobenius inner product, amounting to calculating the trace of a matrix multiplication.
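The Frobenius-inner-product identity is easy to check: summing the Hadamard product of two matrices is the same number as the trace of their matrix product. A sketch with hypothetical matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))   # e.g. the filter
B = rng.standard_normal((4, 4))   # e.g. an image patch

frobenius = np.sum(A * B)        # sum of the cells of the Hadamard product
via_trace = np.trace(A @ B.T)    # trace of a matrix multiplication

assert np.isclose(frobenius, via_trace)
```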

Questions (reformulating after comments and first answer):

1. Is the use of convolutions in CNN linked to FFT?

From what I gather so far, the answer is no. FFTs have been used to speed up GPU implementations of convolutions. However, FFTs are not usually part of the structure or activation functions of CNNs, despite the use of convolutions in the pre-activation steps.
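As an illustration of that speed-up route, SciPy exposes both a direct sliding-window implementation and an FFT-based one, and they agree to numerical precision (the FFT route wins for large kernels). A sketch with hypothetical inputs:

```python
import numpy as np
from scipy.signal import convolve2d, fftconvolve

rng = np.random.default_rng(4)
k = rng.standard_normal((7, 7))       # hypothetical kernel
img = rng.standard_normal((64, 64))   # hypothetical feature map

direct = convolve2d(img, k, mode="same")    # direct sliding-window sums
via_fft = fftconvolve(img, k, mode="same")  # via the convolution theorem

assert np.allclose(direct, via_fft)
```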

2. Are convolution and cross-correlation in CNNs equivalent?

Yes, they are equivalent: because the kernel weights are learned, a network trained with cross-correlation simply learns the flipped versions of the kernels that a convolution-based network would learn.

3. If it is as simple as “there is no difference”, what is the point of flipping the weights into the kernel matrix?

Neither the associativity of convolution (useful in mathematical proofs), nor any considerations regarding Fourier transforms and the convolution theorem, are applicable. In fact, it seems as though the flipping doesn’t even take place in practice (cross-correlation simply being mislabeled as convolution) (?).