When generating samples using a variational autoencoder, we decode samples from N(0, 1) instead of μ + σ N(0, 1)

Context: I’m trying to understand the use of variational autoencoders as generators. My understanding:

  • During training, for an input point x_i we want to learn latent parameters μ_i and σ_i, then sample z_i ∼ N(μ_i, σ_i) and feed it to the decoder to get a reconstruction x̂_i = decode(z_i).
  • But we can’t backpropagate through the sampling operator, so instead we reparametrize and use z_i = μ_i + σ_i ε, where ε ∼ N(0, 1). The reconstruction becomes x̂_i = decode(μ_i + σ_i ε).
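The reparameterization step in the second bullet can be sketched in a few lines of numpy. The encoder outputs here are hypothetical placeholders (in a real VAE, μ_i and σ_i come from the encoder network); the point is only that the randomness is isolated in ε, so gradients can flow through μ_i and σ_i:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one training example x_i;
# in a real VAE these would be produced by the encoder network.
mu_i = np.array([0.5, -1.2])
sigma_i = np.array([0.8, 0.3])

# Reparameterization trick: sample eps ~ N(0, 1), then shift and scale.
# The sampling itself involves no learned parameters, so backprop
# can pass through mu_i and sigma_i undisturbed.
eps = rng.standard_normal(mu_i.shape)
z_i = mu_i + sigma_i * eps  # distributed as N(mu_i, sigma_i^2)
```

Drawing many such z_i values recovers the intended per-example distribution: the empirical mean approaches μ_i and the empirical standard deviation approaches σ_i.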

However, when training is done and we are ready to use the model as a generator, we sample z ∼ N(0, 1) and feed it to the decoder: x_sample = decode(z).
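Generation time is then just a prior draw followed by a decoder pass. A minimal sketch, where `decode` is a hypothetical stand-in for the trained decoder network:

```python
import numpy as np

rng = np.random.default_rng(1)

def decode(z):
    # Stand-in for the trained decoder network (hypothetical);
    # a real decoder would be a learned neural network.
    return np.tanh(z)

# At generation time there is no input x and hence no encoder:
# sample z directly from the standard-normal prior and decode it.
z = rng.standard_normal(2)
x_sample = decode(z)
```

Note that the encoder is not used at all here; it only exists to shape the latent space during training.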

The part that confuses me: during training, the decode operation was applied to μ_i + σ_i ε, which as I understand it means sampling from N(μ_i, σ_i) with a different μ_i and σ_i for each training example. At generation time, however, the decode operation is applied (effectively) to ε alone, drawn from N(0, 1). Why do we set μ = 0 and σ = 1 during generation (i.e. use z = 0 + 1·ε)?


During training, we are drawing z ∼ P(z|x), and then decoding with x̂ = g(z).

During generation, we are drawing z ∼ P(z), and then decoding x = g(z).

So this answers your question: during generation, we want to generate samples from the prior distribution of latent codes, whereas during training, we are drawing samples from the posterior distribution, because we are trying to reconstruct a specific datapoint.
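One step worth making explicit: sampling from the prior only works because the VAE training objective contains a KL term, KL(q(z|x) ‖ p(z)), that pushes each per-example posterior N(μ_i, σ_i) toward the prior N(0, 1). For Gaussian posterior and standard-normal prior this term has a closed form, sketched below in numpy:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over
    # latent dimensions: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# The penalty is zero exactly when the posterior already equals the
# prior (mu = 0, sigma = 1), and grows as they drift apart.
kl_at_prior = kl_to_standard_normal(np.zeros(2), np.ones(2))
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.ones(2))
```

Because this penalty keeps the posteriors clustered around N(0, 1), a code z drawn from the prior at generation time lands in a region the decoder has actually seen during training.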

Source: Link, Question Author: Edward B., Answer Author: shimao
