# Backpropagation on Variational Autoencoders

Once again, online tutorials describe the statistical interpretation of Variational Autoencoders (VAEs) in depth; however, I find that the implementation of this algorithm is quite different from that description, and in fact similar to that of regular NNs.

The typical VAE diagram found online looks like this:

As an enthusiast, I find this kind of explanation very confusing, especially in introductory online posts.

Anyway, first let me try to explain how I understand backpropagation on a regular feed-forward neural network.

For example, the chain rule for the derivative of $$E$$ (total error) with respect to weight $$w_1$$ is the following:

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial HA_1} ... \frac{\partial HA_1}{\partial H_1} \frac{\partial H_1}{\partial w_1}$$

Now let’s see the VAE equivalent and calculate the chain rule for the derivative of $$E$$ (total error) with respect to weight $$w_{16}$$ (just an arbitrary weight on the encoder side – they are all treated the same).

Notice that the gradient of each weight on the encoder side, including $$w_{16}$$, depends on all of the connections on the decoder side; hence the highlighted connections. The chain rule looks as follows:

$$\frac{\partial E}{\partial w_{16}} = \frac{\partial E}{\partial OA_1} \frac{\partial OA_1}{\partial O_1} \frac{\partial O_1}{\partial HA_4} \frac{\partial HA_4}{\partial H_4} \color{red}{\frac{\partial H_4}{\partial Z} \frac{\partial Z}{\partial \mu} \frac{\partial \mu}{\partial w_{16}}} \\ + \frac{\partial E}{\partial OA_2}... \\ + \frac{\partial E}{\partial OA_3}... \\ + \frac{\partial E}{\partial OA_4}...$$

Note that the part in red is the reparameterization trick, which I am not going to cover in detail here.
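For reference, the trick itself is small in code. A minimal NumPy sketch (names are illustrative; it assumes the encoder outputs a mean and a log-variance for a 1-D latent):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness is isolated in eps, so z is a deterministic
    (differentiable) function of mu and log_var."""
    eps = rng.standard_normal(size=np.shape(mu))
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

z = reparameterize(mu=0.0, log_var=0.0)  # one draw from N(0, 1)
```

Because `z` depends on `mu` and `sigma` through ordinary arithmetic, the red partials $$\frac{\partial Z}{\partial \mu}$$ (and the corresponding $$\sigma$$ term) are well-defined.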

But wait, that’s not all. Assuming the batch size is one for the regular neural network, the algorithm goes like this:

1. Pass the inputs and perform the feed-forward pass.
2. Calculate the total error and take the derivative with respect to each weight in the network.
3. Update the network's weights and repeat…
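The three steps above can be sketched for a toy one-hidden-layer network. This is a hypothetical example (the names `w1`, `w2`, `H`, `HA`, `O` mirror the diagram's notation; squared error and a sigmoid hidden layer are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy network: 3 inputs -> 4 hidden units -> 2 outputs.
w1 = rng.standard_normal((3, 4)) * 0.1   # input -> hidden weights
w2 = rng.standard_normal((4, 2)) * 0.1   # hidden -> output weights

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_step(x, y, eta=0.1):
    global w1, w2
    # 1. Feed-forward pass.
    h = x @ w1                 # H  (pre-activation)
    ha = sigmoid(h)            # HA (activation)
    o = ha @ w2                # O  (linear output, for simplicity)
    # 2. Total error and its derivative w.r.t. every weight (chain rule).
    e = 0.5 * np.sum((o - y) ** 2)
    d_o = o - y                          # dE/dO
    d_w2 = np.outer(ha, d_o)             # dE/dw2
    d_ha = w2 @ d_o                      # dE/dHA
    d_h = d_ha * ha * (1 - ha)           # dE/dH (sigmoid derivative)
    d_w1 = np.outer(x, d_h)              # dE/dw1
    # 3. Update the network's weights.
    w1 -= eta * d_w1
    w2 -= eta * d_w2
    return e

x, y = rng.standard_normal(3), rng.standard_normal(2)
errors = [train_step(x, y) for _ in range(200)]
```

Repeating the step on a fixed input drives the error down, which is all "update and repeat" means here.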

However, in VAEs the algorithm is a little different:

1. Pass the inputs and perform the feed-forward pass for the encoder, then stop.
2. Sample the latent space ($$Z$$), say, $$n$$ times and perform the feed-forward step with the sampled random variates $$n$$ times.
3. Calculate the total error, over all outputs and samples, and take the derivative with respect to each weight in the network.
4. Update the network's weights and repeat…
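The control flow of those four steps can be sketched as follows. This is only a structural illustration with a made-up linear encoder and decoder and a 1-D latent; none of these function names come from a real library:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w_enc):
    """Toy linear encoder producing a 1-D latent mean and log-variance."""
    mu = w_enc[0] * x.mean()
    log_var = w_enc[1] * x.mean()
    return mu, log_var

def decoder(z, w_dec, m):
    """Toy linear decoder mapping the 1-D latent back to m outputs."""
    return np.full(m, w_dec * z)

def vae_forward(x, w_enc, w_dec, n):
    m = x.size
    # 1. Feed-forward through the encoder, then stop.
    mu, log_var = encoder(x, w_enc)
    sigma = np.exp(0.5 * log_var)
    total = 0.0
    # 2. Sample the latent space n times and decode each sample.
    for _ in range(n):
        z = mu + sigma * rng.standard_normal()   # reparameterization
        x_bar = decoder(z, w_dec, m)
        # 3. Accumulate the error over all outputs and samples.
        total += np.sum((x - x_bar) ** 2)
    return total / (n * m)

x = rng.standard_normal(5)
err = vae_forward(x, w_enc=np.array([0.5, -1.0]), w_dec=0.8, n=10)
```

Step 4 (the weight update) would then use the gradient of `err`, exactly as in the regular network.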

Okay, okay, so what is my question?

## Question 1

Is my description of the VAE correct?

## Question 2

I will try to walk step by step through the sampling of the latent space ($$Z$$) and the backprop symbolically.

Let us assume that the VAE input is a one-dimensional array (so even if it's an image, it has been flattened). Also, the latent space ($$Z$$) is one-dimensional; hence, it contains a single value for the mean ($$\mu$$) and the standard deviation ($$\sigma$$), assuming a normal distribution.

- For simplicity, let the error for a single input $$x_i$$ be $$e_i = (x_i - \bar{x_i})$$, where $$\bar{x_i}$$ is the corresponding VAE output.
- Also, let us assume that there are $$m$$ inputs and outputs in this VAE example.
- Lastly, let us assume that the mini-batch size is one, so we update the weights after each backprop; therefore, we will not see a mini-batch index $$b$$ in the gradient formulas.

In a regular feed-forward neural net, given the above setup, the total error would look as follows:

$$E = \frac{1}{m} \sum_{i=1}^{m} e_i$$

Therefore from the example above,

$$\frac{\partial E}{\partial w_1} = \frac{\partial \left( \frac{1}{m} \sum_{i=1}^{m} e_i \right)}{\partial w_1}$$

and easily update the weight with gradient descent. Very straightforward. Note that we have a single value for each partial derivative, e.g. $$\frac{\partial HA_1}{\partial H_1}$$ – this is an important distinction.

## Option 1

Now for the VAE, as explained in the online posts, we have to sample $$n$$ times from the latent space in order to get a good estimate of the expectation.

So, given the example and assumptions above, the total error for $$n$$ samples and $$m$$ outputs is:

$$E = \frac{1}{n} \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{m} e_{ij}$$

If I understand correctly, we must have all $$n$$ samples in order to take the derivative $$\frac{\partial E}{\partial w_{16}}$$. Taking the derivative (backprop) after just one sample does not make sense.

So, in the VAE the derivative would look like this:

$$\frac{\partial E}{\partial w_{16}} = \frac{\partial \left( \frac{1}{n} \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{m} e_{ij} \right)}{\partial w_{16}}$$

This means that in the derivative chain we would have to calculate and add the derivatives of a given variable or function $$n$$ times, i.e.:

$$...\frac{\partial Z_1}{\partial \mu} + ... + \frac{\partial Z_2}{\partial \mu} + ... + \frac{\partial Z_n}{\partial \mu}$$

And finally, we update the weight with gradient descent:

$$w_{16}^{k+1} = w_{16}^{k} - \eta \frac{\partial E}{\partial w_{16}}$$

## Option 2

We keep the total-error formula the same as in the regular neural network, except that now we have to index it because we are going to end up with $$n$$ of them:

$$E_i = \frac{1}{m} \sum_{j=1}^{m} e_j$$

and do backprop after each sample of the latent space $$Z$$, but do not update the weights yet:

$$\frac{\partial E_i}{\partial w_{16}} = \frac{\partial \left( \frac{1}{m} \sum_{j=1}^{m} e_j \right)}{\partial w_{16}}$$

where now we only have one $$Z$$-derivative in the chain, unlike the $$n$$ of them in Option 1:

$$...\frac{\partial Z}{\partial \mu} + ...$$

and finally update the weights by averaging the gradients:

$$w_{16}^{k+1} = w_{16}^{k} - \frac{\eta}{n} \sum_{i=1}^{n} \frac{\partial E_i}{\partial w_{16}}$$

So, for Question 2: is Option 1 or Option 2 correct? Am I missing anything?

Thank you so much!

---

Q1: Your description seems to be pretty much correct.

Q2: The two options are equivalent:

$$\frac{\partial E}{\partial w} = \frac{\partial \frac{1}{n} \sum_{i=1}^{n} E_i}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial E_i}{\partial w}$$
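You can check this equivalence numerically on a toy scalar model (all names here are illustrative: one weight `w`, fixed noise draws `eps`, identity decoder, per-sample error $$E_i = (z_i - x)^2$$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar model: mu = w * x, z_i = mu + eps_i, output = z_i.
x, w = 1.5, 0.3
eps = rng.standard_normal(10)   # n fixed noise draws
n = eps.size

def grad_Ei(w, eps_i):
    """Analytic dE_i/dw for one sample, E_i = (w*x + eps_i - x)**2."""
    return 2.0 * (w * x + eps_i - x) * x

# Option 1: differentiate the averaged error (central finite difference).
def E(w):
    return np.mean((w * x + eps - x) ** 2)

h = 1e-6
grad_option1 = (E(w + h) - E(w - h)) / (2 * h)

# Option 2: average the per-sample gradients.
grad_option2 = np.mean([grad_Ei(w, e) for e in eps])
```

For the same fixed noise draws, the two numbers agree up to finite-difference error, by linearity of differentiation.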

Also, note that $$n=1$$ is a valid choice:

> In our experiments we found that the number of samples $$L$$ per datapoint can be set to 1 as long as the minibatch size $$M$$ was large enough, e.g. $$M = 100$$.

Kingma, Diederik P., and Max Welling. “Auto-Encoding Variational Bayes.” arXiv preprint arXiv:1312.6114 (2013).