# The connection between Bayesian statistics and generative modeling

Can someone refer me to a good reference that explains the connection between Bayesian statistics and generative modeling techniques? Why do we usually use generative models with Bayesian techniques?

Why is it especially appealing to use Bayesian statistics in the absence of complete data, if at all?

Note that I come from a more machine learning oriented view, and I am interested in reading more about it from the statistics community.

Any good reference that discusses these points would be greatly appreciated.
Thanks.

In machine learning, a full probability model p(x,y) is called generative because it can be used to generate the data, whereas a conditional model p(y|x) is called discriminative because it does not specify a probability model for p(x) and can only generate y given x. Both can be estimated in a Bayesian fashion.
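To make the distinction concrete, here is a toy sketch (the Gaussian class-conditional model and the logistic weights are my own hypothetical choices, not anything from the question): the joint model can produce complete (x, y) pairs, while the conditional model can only produce y once someone hands it an x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generative model p(x, y) = p(y) p(x | y): a toy two-class Gaussian model.
# Because it specifies the joint, it can sample complete (x, y) pairs.
def sample_generative(n):
    y = rng.binomial(1, 0.5, size=n)         # p(y): class prior
    x = rng.normal(loc=2.0 * y, scale=1.0)   # p(x | y): class-conditional density
    return x, y

# Discriminative model p(y | x): logistic regression. It says nothing about
# p(x), so it can only generate y after x is supplied from outside.
def sample_discriminative(x, w=2.0, b=-1.0):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    return rng.binomial(1, p)

x, y = sample_generative(1000)        # joint model generates both coordinates
y_given_x = sample_discriminative(x)  # conditional model needs x as an input
```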

Bayesian estimation is inherently about specifying a full probability model and performing inference conditional on the model and data. That gives many Bayesian models a generative feel. However, to a Bayesian the important distinction is not so much how to generate the data as what is needed to obtain the posterior distribution of the unknown parameters of interest.

The discriminative model p(y|x) is part of a bigger model where p(y, x) = p(y|x)p(x). In many instances, p(x) is irrelevant to the posterior distribution of the parameters of the model p(y|x). Specifically, if the parameters of p(x) are distinct from those of p(y|x) and the priors are independent, then p(x) contains no information about the unknown parameters of the conditional model, so a Bayesian does not need to model it.
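Writing this out in symbols (θ and φ are my own notation for the parameters of p(y|x) and p(x) respectively), the cut works because the joint posterior factorizes under independent priors:

```latex
p(\theta, \phi \mid y, x)
  \;\propto\; p(y \mid x, \theta)\, p(\theta)
  \;\times\; p(x \mid \phi)\, p(\phi)
```

Integrating out φ only removes the second factor, leaving p(θ | y, x) ∝ p(y | x, θ) p(θ): the marginal posterior for θ is exactly what you would get by ignoring the model for x altogether.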

At a more intuitive level, there is a clear link between “generating data” and “computing the posterior distribution.” Rubin (1984) gives an excellent description of this link.

Bayesian statistics is useful given missing data primarily because it provides a unified way to eliminate nuisance parameters — integration. Missing data can be thought of as (many) nuisance parameters. Alternative proposals such as plugging in the expected value typically will perform poorly because we can rarely estimate missing data cells with high levels of accuracy. Here, integration is better than maximization.
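The “integration beats maximization” point can be seen in a toy calculation (the posterior and the function g below are hypothetical, chosen only for illustration): for a nonlinear quantity g(z) of a missing value z, plugging in E[z] and then applying g gives a different, biased answer compared with averaging g(z) over the posterior of z, by Jensen’s inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a missing value z has posterior N(0, 1), and we care about
# the nonlinear quantity g(z) = exp(z).
z_draws = rng.normal(0.0, 1.0, size=200_000)

plug_in = np.exp(z_draws.mean())     # plug in E[z] first: roughly exp(0) = 1
integrated = np.exp(z_draws).mean()  # integrate over z: E[exp(z)] = exp(0.5)

# The plug-in answer understates the truth because exp is convex
# (Jensen's inequality); averaging over the missing value does not.
```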

Discriminative models like p(y|x) also become problematic if x includes missing data, because we only have data to estimate p(y|x_obs), whereas most sensible models are written with respect to the complete data p(y|x). If you have a full probability model p(y,x) and are Bayesian, then you’re fine, because you can just integrate over the missing data as you would any other unknown quantity.
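A minimal sketch of that last step, under assumed forms p(x) = N(0, 1) and p(y = 1 | x) = sigmoid(2x) (both hypothetical): when x is missing entirely, the full model still yields a prediction for y by Monte Carlo integration over p(x), whereas a purely conditional model has no p(x) to integrate against.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Full model: p(x) = N(0, 1) and p(y = 1 | x) = sigmoid(2 x).
# With x missing, predict y by integrating x out:
#   p(y = 1) = integral of p(y = 1 | x) p(x) dx,
# approximated here by averaging over draws from p(x).
x_draws = rng.normal(0.0, 1.0, size=100_000)
p_y_marginal = sigmoid(2.0 * x_draws).mean()

# A discriminative model p(y | x) alone cannot produce this number:
# it has no distribution over the missing x to average against.
```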