I was reading Yoshua Bengio’s book on deep learning, and on page 224 it says:
Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
However, I was not 100% sure how to “replace matrix multiplication with convolution” in a mathematically precise sense.
What really interests me is defining this for 1D input vectors (as in x ∈ Rd), so I won’t have images as input, and I can avoid 2D convolution.
So, for example, in “normal” neural networks, the operations and the feedforward pattern can be concisely expressed as in Andrew Ng’s notes:

z(l+1) = W(l) a(l) + b(l)
a(l+1) = f(z(l+1))
where z(l+1) is the vector computed before passing it through the non-linearity f. The non-linearity acts per entry on the vector z(l+1), and a(l+1) is the output/activation of the hidden units in the layer in question.
This computation is clear to me because matrix multiplication is clearly defined, but just replacing the matrix multiplication with a convolution seems unclear to me, i.e.,

z(l+1) = W(l) ∗ a(l) + b(l)
a(l+1) = f(z(l+1))
I want to make sure I understand the above equation mathematically precisely.
The first issue I have with just replacing matrix multiplication with convolution is that, usually, one identifies each row of W(l) with a dot product, so one clearly knows how the whole of a(l) relates to the weights, and how it maps to a vector z(l+1) whose dimension is determined by W(l). However, when one replaces this with a convolution, it’s not clear to me which row of weights corresponds to which entries of a(l). It’s not even clear to me that it still makes sense to represent the weights as a matrix at all (I will provide an example to explain that point later).
In the case where the inputs and outputs are all 1D, does one just compute the convolution according to its definition and then pass the result through a non-linearity?
For example if we had the following vector as input:
and we had the following weights (maybe we learned it with backprop):
then the convolution is:
would it be correct to just apply the non-linearity to that and treat the result as the hidden layer/representation (assume no pooling for the moment)? i.e., as follows:
(the Stanford UFLDL tutorial, I think, trims the edges where the convolution convolves with 0’s for some reason; do we need to trim those?)
Is this how it should work, at least for an input vector in 1D? Is W now just a vector rather than a matrix?
I even drew a neural network of how this is supposed to look, I think:
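To make my question concrete, here is a small numerical sketch of what I think the computation is (the numbers are made up purely for illustration; np.convolve with mode='valid' performs a true convolution, i.e. it flips the filter, and it trims the zero-padded edges):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up 1D input and filter weights, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25])

# "valid" convolution: keep only positions where x and w overlap
# completely, i.e. trim the edges that convolve with zero-padding
z = np.convolve(x, w, mode='valid')

# pass the result entrywise through the non-linearity to get the
# hidden layer / feature map (no pooling)
a = sigmoid(z)
```

Here z has length len(x) − len(w) + 1 = 4, which is what the edge-trimming gives.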
It sounds to me like you’re on the right track, but maybe I can help clarify.
Let’s imagine a traditional neural network layer with n input units and 1 output (let’s also assume no bias). This layer has a vector of weights w∈Rn that can be learned using various methods (backprop, genetic algorithms, etc.), but we’ll ignore the learning and just focus on the forward propagation.
The layer takes an input x ∈ Rn and maps it to an activation a ∈ R by computing the dot product of x with w and then applying a nonlinearity σ:

a = σ(x ⋅ w)
Here, the elements of w specify how much to weight the corresponding elements of x to compute the overall activation of the output unit. You could even think of this like a “convolution” where the input signal (x) is the same length as the filter (w).
In a convolutional setting, there are more values in x than in w; suppose now that our input x ∈ Rm for m > n. We can compute the activation of the output unit in this setting by computing the dot product of w with contiguous subsets of x:

a_1 = σ(x_1:n ⋅ w)
a_2 = σ(x_2:n+1 ⋅ w)
a_3 = σ(x_3:n+2 ⋅ w)
…
a_(m−n+1) = σ(x_(m−n+1):m ⋅ w)
(Here I’m repeating the same annoying confusion between cross-correlation and convolution that many neural nets authors make; if we were to make these proper convolutions, we’d flip the elements of w. I’m also assuming a “valid” convolution which only retains computed elements where the input signal and the filter overlap completely, i.e., without any padding.)
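In code, that sliding dot product might look like the following sketch (the function name is mine; it uses the cross-correlation convention, i.e. no filter flip, and a “valid” window):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer_1d(x, w):
    """'Valid' cross-correlation of an input x (length m) with a filter w
    (length n), followed by an elementwise non-linearity."""
    m, n = len(x), len(w)
    # one dot product per contiguous window of x, giving m - n + 1 outputs
    z = np.array([np.dot(x[i:i + n], w) for i in range(m - n + 1)])
    return sigmoid(z)

x = np.arange(1.0, 8.0)         # m = 7 inputs
w = np.array([1.0, 0.0, -1.0])  # n = 3 filter weights
a = conv_layer_1d(x, w)         # feature map of length m - n + 1 = 5
```

The same result (before the non-linearity) can be obtained with np.correlate(x, w, mode='valid').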
You already put this in your question, basically, but I’m trying to walk through the connection with vanilla neural network layers using the dot product to make a point. The main difference from a vanilla network layer is that if the input vector is longer than the weight vector, the convolution turns the output of the network layer into a vector: in convolutional networks, it’s vectors all the way down! This output vector is called a “feature map” for the output unit in this layer.
Ok, so let’s imagine that we add a new output to our network layer, so that it has n inputs and 2 outputs. There will be a vector w1∈Rn for the first output, and a vector w2∈Rn for the second output. (I’m using superscripts to denote layer outputs.)
For a vanilla layer, these are normally stacked together into a matrix

W = [w1 w2]

where the individual weight vectors are the columns of the matrix. Then, when computing the output of this layer, we compute

a1 = σ(x ⋅ w1)
a2 = σ(x ⋅ w2)

or, in shorter matrix notation,

a = [a1 a2] = σ(x ⋅ W)

where the nonlinearity is applied elementwise.
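As a quick sketch of that vanilla two-output layer in NumPy (the random numbers are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 4
rng = np.random.default_rng(0)
x = rng.normal(size=n)         # one input vector
w1 = rng.normal(size=n)        # weights for output 1
w2 = rng.normal(size=n)        # weights for output 2

W = np.column_stack([w1, w2])  # shape (n, 2): one column per output
a = sigmoid(x @ W)             # both outputs computed at once
```

The matrix form gives exactly the two dot products stacked into one length-2 vector.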
In the convolutional case, the outputs of our layer are still associated with the same parameter vectors w1 and w2. Just like in the single-output case, the convolution layer generates a vector-valued output for each layer output, so there’s

a1 = [a1_1 a1_2 … a1_(m−n+1)]
a2 = [a2_1 a2_2 … a2_(m−n+1)]

(again assuming “valid” convolutions). These feature maps, one for each layer output, are commonly stacked together into a matrix A = [a1 a2].
If you think about it, the input in the convolutional case could also be thought of as a matrix containing just one column (“one input channel”). So we could write the transformation for this layer as

A = σ(X ∗ W)

where the “convolution” is actually a cross-correlation and happens only along the columns of X and W.
These notation shortcuts are actually quite helpful, because now it’s easy to see that to add another output to the layer, we just add another column of weights to W.
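Putting the pieces together, the whole multi-filter layer might be sketched as follows (conv_layer is my name for it; np.correlate performs the no-flip “valid” cross-correlation, here applied to each column of W):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(x, W):
    """x: 1D input of length m; W: (n, k) matrix holding one length-n
    filter per column. Returns A, an (m - n + 1, k) matrix whose
    columns are the feature maps, one per filter."""
    m, (n, k) = len(x), W.shape
    Z = np.stack([np.correlate(x, W[:, j], mode='valid')
                  for j in range(k)], axis=1)
    return sigmoid(Z)  # nonlinearity applied elementwise

x = np.arange(6.0)          # m = 6 input values
W = np.array([[1.0, 0.5],
              [0.0, 0.5]])  # k = 2 filters of length n = 2, as columns
A = conv_layer(x, W)        # shape (m - n + 1, k) = (5, 2)
```

Adding another output to the layer is then literally just adding another column to W; A grows by one feature-map column.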
Hopefully that’s helpful!