Are there mathematical reasons for convolution in neural networks beyond expediency?

In convolutional neural networks (CNNs) the matrix of weights at each step has its rows and columns flipped to obtain the kernel matrix before proceeding with the convolution. This is explained in a series of videos by Hugo Larochelle here:

Computing the hidden maps would correspond to doing a discrete convolution with a channel from the previous layer, using a kernel matrix […], and that kernel is computed from the hidden weights matrix $W_{ij}$, where we flip the rows and the columns.


If we were to compare the reduced number of steps in a convolution to regular matrix multiplication, as in other types of NNs, expediency would be a clear explanation. However, this might not be the most pertinent comparison…

In digital image processing, the application of a convolution filter to an image (this is a great YouTube video for practical intuition) seems related to:

  1. The fact that convolution is associative while (cross-)correlation is not.
  2. The possibility of applying filters in the frequency domain of the image as multiplications, since convolution in the spatial domain is equivalent to multiplication in the frequency domain (convolution theorem); a numerical sketch follows this list.
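
As a concrete check of point 2, here is a minimal NumPy sketch (the toy arrays, variable names, and the choice of circular boundary conditions are my own assumptions) showing that convolving directly and multiplying in the frequency domain give the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))    # toy "image"
kernel = rng.random((8, 8))   # toy filter, padded to the same size

# Direct circular convolution: sum of the kernel-weighted, shifted copies of the image.
direct = np.zeros_like(image)
for i in range(8):
    for j in range(8):
        direct += kernel[i, j] * np.roll(image, (i, j), axis=(0, 1))

# Convolution theorem: the same result as a pointwise product of the 2-D FFTs.
via_fft = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

print(np.allclose(direct, via_fft))  # True
```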

In this particular technical environment of DSP, correlation is defined as:

$$F\circ I(x,y)=\sum_{j=-N}^{N}\sum_{i=-N}^{N}\,F(i,j)\,I(x+i,y+j)$$

which is essentially the sum of all the cells in a Hadamard product:

$$F\circ I(x,y)=\operatorname{sum}\begin{bmatrix}
F[-N,-N]\,I[x-N,y-N] & \cdots & F[-N,0]\,I[x-N,y] & \cdots & F[-N,N]\,I[x-N,y+N]\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
F[0,-N]\,I[x,y-N] & \cdots & F[0,0]\,I[x,y] & \cdots & F[0,N]\,I[x,y+N]\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
F[N,-N]\,I[x+N,y-N] & \cdots & F[N,0]\,I[x+N,y] & \cdots & F[N,N]\,I[x+N,y+N]
\end{bmatrix}$$

where $F(i,j)$ is a filter function (expressed as a matrix), and $I(x,y)$ is the pixel value of the image at location $(x,y)$.


The objective of cross-correlation is to assess how similar a probe image is to a test image. The calculation of a cross-correlation map relies on the convolution theorem.
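
As a sketch of the formula above (the function name and toy data are illustrative), the correlation value at a single pixel is just the sum of the Hadamard product of the filter with the image patch centred there:

```python
import numpy as np

def correlate_at(F, I, x, y):
    """Cross-correlation of filter F with image I at pixel (x, y).

    F has shape (2N+1, 2N+1); F[i + N, j + N] plays the role of F(i, j)
    in the formula above, with i and j running from -N to N.
    """
    N = F.shape[0] // 2
    patch = I[x - N:x + N + 1, y - N:y + N + 1]  # neighbourhood of (x, y)
    return np.sum(F * patch)                     # sum of the Hadamard product

rng = np.random.default_rng(0)
I = rng.random((10, 10))
F = rng.random((3, 3))                           # N = 1
print(correlate_at(F, I, 5, 5))
```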


On the other hand, convolution is defined as:

$$F\ast I(x,y)=\sum_{j=-N}^{N}\sum_{i=-N}^{N}\,F(i,j)\,I(x-i,y-j)$$

which is the same as a correlation operation with the rows and columns of the filter flipped (and, if the filter is symmetric, identical to plain correlation):

$$F\ast I(x,y)=\operatorname{sum}\begin{bmatrix}
F[N,N]\,I[x-N,y-N] & \cdots & F[N,0]\,I[x-N,y] & \cdots & F[N,-N]\,I[x-N,y+N]\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
F[0,N]\,I[x,y-N] & \cdots & F[0,0]\,I[x,y] & \cdots & F[0,-N]\,I[x,y+N]\\
\vdots & \ddots & \vdots & \ddots & \vdots\\
F[-N,N]\,I[x+N,y-N] & \cdots & F[-N,0]\,I[x+N,y] & \cdots & F[-N,-N]\,I[x+N,y+N]
\end{bmatrix}$$
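
A quick numerical check of this flipping relationship, using scipy.signal (which provides both operations; the toy arrays are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
image = rng.random((6, 6))
F = rng.random((3, 3))                    # a deliberately asymmetric filter

conv = convolve2d(image, F, mode='same')
# Convolution equals correlation with the filter flipped in both axes.
corr_with_flip = correlate2d(image, F[::-1, ::-1], mode='same')

print(np.allclose(conv, corr_with_flip))  # True
```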



Convolution in DSP is meant to apply filters to the image (e.g. smoothing, sharpening). As an example, after convolving Joseph Fourier’s face with a Gaussian convolution filter

$$\begin{bmatrix}
1 & 4 & 7 & 4 & 1\\
4 & 16 & 26 & 16 & 4\\
7 & 26 & 41 & 26 & 7\\
4 & 16 & 26 & 16 & 4\\
1 & 4 & 7 & 4 & 1
\end{bmatrix}$$

the edges on his face are fuzzier.
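
A minimal sketch of that smoothing step with scipy.ndimage; normalizing the kernel by the sum of its entries (273) is an assumption of mine, and the random array merely stands in for the actual photograph:

```python
import numpy as np
from scipy.ndimage import convolve

# The 5x5 Gaussian approximation from the text, normalized to sum to 1
# (the 1/273 normalization is an assumption, not part of the original kernel).
gaussian = np.array([[1,  4,  7,  4, 1],
                     [4, 16, 26, 16, 4],
                     [7, 26, 41, 26, 7],
                     [4, 16, 26, 16, 4],
                     [1,  4,  7,  4, 1]], dtype=float)
gaussian /= gaussian.sum()               # 273

rng = np.random.default_rng(0)
image = rng.random((64, 64))             # stand-in for the grayscale photo
blurred = convolve(image, gaussian)      # high-frequency detail (edges) is smoothed away
```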



Computationally, both operations are Frobenius inner products, amounting to calculating the trace of a matrix multiplication.
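
A quick numerical check of that identity (here $A$ and $B$ are any two same-sized matrices, e.g. the filter and the image patch under it): the sum of their Hadamard product equals $\operatorname{tr}(A^\top B)$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 3))                    # e.g. the filter
B = rng.random((3, 3))                    # e.g. the image patch under it

frobenius = np.sum(A * B)                 # sum of the Hadamard product
via_trace = np.trace(A.T @ B)             # trace of a matrix multiplication

print(np.allclose(frobenius, via_trace))  # True
```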


Questions (reformulating after comments and first answer):

  1. Is the use of convolutions in CNNs linked to the FFT?

From what I gather so far, the answer is no. FFTs have been used to speed up GPU implementations of convolutions. However, FFTs are not usually part of the structure or activation functions in CNNs, despite the use of convolutions in the pre-activation steps.

  2. Are convolution and cross-correlation in CNNs equivalent?

Yes, they are equivalent.

  3. If it is as simple as “there is no difference”, what is the point of flipping the weights into the kernel matrix?

Neither the associativity of convolution (useful in mathematical proofs) nor any considerations regarding Fourier transforms and the convolution theorem appear to be applicable. In fact, it seems as though the flipping doesn’t even take place (cross-correlation simply being mislabeled as convolution) (?).

Answer

There are no differences in what neural networks can do when they use convolution or correlation. This is because the filters are learned, and if a CNN can learn to do a particular task using the convolution operation, it can also learn to do the same task using the correlation operation (it would simply learn the rotated version of each filter).

For more details on why people sometimes find it more intuitive to think about convolution than correlation, this post may be useful.

There remains the question: if there is no difference between convolution and cross-correlation, what is the point of flipping the weights into the kernel matrix? To answer it, I would like to quote a few sentences from the Deep Learning book by Ian Goodfellow et al.:

The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation… Many machine learning libraries implement cross-correlation but call it convolution.

The takeaway is that although convolution is a favorite operation in classic machine vision applications, it is replaced by correlation in many implementations of convolutional neural networks.
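
To see this in a concrete (purely illustrative) case: PyTorch’s conv2d, despite its name, computes cross-correlation, so it matches SciPy’s correlate2d directly and reproduces true convolution only after the kernel is flipped. A sketch, assuming both libraries are available:

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
image = rng.random((5, 5)).astype(np.float32)
kernel = rng.random((3, 3)).astype(np.float32)

# PyTorch expects (batch, channels, H, W); no padding means a "valid" sweep.
out = F.conv2d(torch.from_numpy(image)[None, None],
               torch.from_numpy(kernel)[None, None])[0, 0].numpy()

# The "convolution" layer actually computes cross-correlation...
print(np.allclose(out, correlate2d(image, kernel, mode='valid'), atol=1e-6))
# ...and matches true convolution only if the kernel is flipped first.
print(np.allclose(out, convolve2d(image, kernel[::-1, ::-1], mode='valid'), atol=1e-6))
```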

Attribution
Source: Link, Question Author: Antoni Parellada, Answer Author: mhdadk
