Context: I’m trying to understand the use of variational autoencoders as generators. My understanding:
- During training, for an input point $x_i$ we want to learn latent parameters $\mu_i$ and $\sigma_i$, then sample $z_i \sim \mathcal{N}(\mu_i, \sigma_i)$ and feed it to the decoder to get a reconstruction $\hat{x}_i = \text{decode}(z_i)$.
- But we can’t backpropagate through the sampling operation, so instead we reparameterize and use $z_i = \mu_i + \sigma_i \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$. The reconstruction becomes $\hat{x}_i = \text{decode}(\mu_i + \sigma_i \epsilon)$.
However, when we’re done with training and ready to use the model as a generator, we sample $z \sim \mathcal{N}(0, 1)$ and feed it to the decoder: $x_{\text{sample}} = \text{decode}(z)$.
The part that confuses me: during training, the decode operation was applied to $\mu_i + \sigma_i \epsilon$, which, as I understand it, means sampling from $\mathcal{N}(\mu_i, \sigma_i)$ with a different $\mu_i$ and $\sigma_i$ for each training example. At generation time, however, the decoder is (effectively) applied to $\epsilon$ alone, drawn from $\mathcal{N}(0, 1)$. Why do we set $\mu = 0$ and $\sigma = 1$ during generation (i.e. use $z = 0 + 1 \cdot \epsilon$)?
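As background for the question, the reparameterization trick can be sketched in a few lines. This is a minimal sketch with hypothetical values for $\mu_i$ and $\sigma_i$ (in a real VAE these come from the encoder network); it just checks that $\mu_i + \sigma_i \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ has the same distribution as a direct draw from $\mathcal{N}(\mu_i, \sigma_i)$.

```python
import random

random.seed(0)

# Hypothetical per-example encoder outputs for one datapoint x_i.
mu_i, sigma_i = 0.5, 0.1

# Reparameterization: z_i = mu_i + sigma_i * eps with eps ~ N(0, 1).
# The randomness lives entirely in eps, so gradients can flow to mu_i, sigma_i.
samples = [mu_i + sigma_i * random.gauss(0.0, 1.0) for _ in range(100_000)]

# Empirical mean and variance should match N(mu_i, sigma_i^2).
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The key point is that `eps` carries all the randomness while `mu_i` and `sigma_i` enter through a deterministic, differentiable expression.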
During training, we are drawing $z \sim p(z \mid x)$, and then decoding with $\hat{x} = g(z)$.
During generation, we are drawing $z \sim p(z)$, and then decoding $x = g(z)$.
So this answers your question: during generation we want samples from the prior distribution over latent codes, $p(z) = \mathcal{N}(0, 1)$, whereas during training we draw from the posterior $p(z \mid x)$, because we are trying to reconstruct a specific datapoint. This works because the KL-divergence term in the VAE training objective pushes each posterior $\mathcal{N}(\mu_i, \sigma_i)$ toward the prior $\mathcal{N}(0, 1)$, so the decoder learns to produce sensible outputs for latent codes drawn from the prior.
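The two sampling regimes above can be shown side by side. This is a sketch, not a real VAE: `decode` is a hypothetical stand-in for a trained decoder network, and `mu_i`, `sigma_i` are assumed encoder outputs for one datapoint.

```python
import random

random.seed(1)

def decode(z):
    # Stand-in for a trained decoder network (hypothetical linear map).
    return 2.0 * z + 1.0

# Training time: sample from the posterior p(z | x_i) for a specific x_i,
# via the reparameterization trick.
mu_i, sigma_i = 0.5, 0.1              # assumed encoder outputs for x_i
eps = random.gauss(0.0, 1.0)          # eps ~ N(0, 1)
x_hat = decode(mu_i + sigma_i * eps)  # reconstruction of x_i

# Generation time: sample from the prior p(z) = N(0, 1),
# i.e. effectively mu = 0 and sigma = 1.
z = random.gauss(0.0, 1.0)
x_sample = decode(z)                  # a newly generated datapoint
```

The only difference between the two branches is which Gaussian the latent code is drawn from; the decoder itself is identical.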