Simple example of how “Bayesian Model Averaging” actually works

I’m trying to follow this tutorial on Bayesian Model Averaging by putting it in the context of machine learning and the notation it generally uses, i.e.:

X_train: Training Array; dims = $(n, m)$;

y_train: Target Vector; dims = $(n,)$; the correct values that you fit with the Training Array;

x: input vector of attributes for a sample; dims = $(m,)$; and

y: output prediction value; dims = $(1,)$ (a scalar, for simplicity).

These are all described below in the context of Bayesian…

The source describes this as a class of models indexed by $m$:
$$P(y| x,\theta, m)$$
$\theta$ : Set of model parameters;

$m$ : The model index in a set of models

Bayesian Model Selection:

$$P(y|x,D) = \int P(y|x,D,m)\,P(m|x,D)\,dm$$

$x$ : Input Data : $(n_{test}, m)$ shaped input array (rows = samples, cols = attributes);

$y$ : Output Prediction : $(n_{test},)$ length output vector of predictions based on $x$;

$D$ : Training Data : A tuple containing (i) $(n_{train}, m)$ array of (rows = samples, cols = attributes); and (ii) $(n_{train},)$ length vector containing the actual value/category described by training array

(please let me know if this is confusing and I will elaborate)

$$P(y|x,D,m) = \int P(y|x,\theta,m)\,P(\theta|D,m)\,d\theta$$
Here $y$ and $x$ are independent of $D$ given $\theta$.

The video says that this averages over the probabilities predicted by each of the models. The weights used in the average are $P(m|x,D)$, the posterior distribution over $m$ given $D$ (and $x$).

My confusion:

Can someone please describe how this is averaging over models? Do you end up with a posterior that is created with all of the models? Where does the prior go in this context?

How does integrating over all the models average them? From what I remember, integrating gives you area under the curve but in statistics I often hear the term “summing/integrating out” parameters/variables. What does that mean exactly?

Please provide a simple example so I can understand how this works 🙂 It will definitely be useful for people trying to understand how Bayesian Model Averaging works exactly. I will put a link to this on that video because I know other people were confused as well.

Answer

I think it might help to think of this as a two-level “meta-model”. You have some collection of individual models (indexed by $m$), and then you have a meta-model, which is a distribution over the individual models (or equivalently, a distribution over values of $m$).

You can think about the model averaging as working in two steps:

  • First, you get the posterior predictive distribution for each model $m$ by integrating out its model-specific parameters $\theta$:

$$ P(y|x, D, m) = \int P(y|x, D, \theta, m)P(\theta| D, m)d\theta $$

  • Then you get the posterior predictive distribution for the meta-model, now integrating out the distribution over the models:

$$ P(y|x,D) = \int P(y|x, D, m)P(m|x, D)dm $$

Then in the machine learning context you would make predictions about $y$ based on its posterior predictive distribution given the observed covariates $x$.

To answer your question, the second step is where the model averaging happens. When you “integrate out” or “sum out” a parameter (incidentally, you can think of these as the same operation for continuous and discrete distributions respectively), that’s equivalent to taking the expected value of some quantity (i.e. averaging) over that parameter. In this case, you’re taking the expected value of each model’s predictive density for $y$, weighted by the posterior over $m$, which is the definition of the posterior predictive distribution.
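To make the "weighted average" concrete, here is a minimal sketch with a discrete set of models, so the integral over $m$ becomes a sum. It is my own toy setup, not the one from the video: there are no covariates $x$, the data are $k$ successes in $n$ Bernoulli trials, and each model is a different Beta$(a, b)$ prior on the success probability $\theta$. The function name `bma_predict` and all the numbers are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bma_predict(k, n, models, model_prior):
    """Bayesian model averaging for Beta-Bernoulli models.

    k, n        : observed successes and number of trials (the data D)
    models      : list of (a, b) Beta prior hyperparameters, one per model m
    model_prior : list of prior probabilities P(m), one per model
    """
    # Step 1: per-model posterior predictive P(y=1 | D, m),
    # i.e. theta integrated out against the Beta(a + k, b + n - k) posterior.
    preds = [(a + k) / (a + b + n) for a, b in models]

    # Log marginal likelihood log P(D | m) of the observed sequence.
    log_evidence = [log_beta(a + k, b + n - k) - log_beta(a, b)
                    for a, b in models]

    # Posterior over models: P(m | D) is proportional to P(D | m) * P(m).
    log_post = [le + math.log(p) for le, p in zip(log_evidence, model_prior)]
    shift = max(log_post)  # subtract the max before exponentiating
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]

    # Step 2: sum out m, a weighted average of the per-model predictions.
    bma = sum(w * p for w, p in zip(weights, preds))
    return preds, weights, bma


# Toy data: 7 successes in 10 trials; a flat prior vs. a prior concentrated at 0.5.
preds, weights, bma = bma_predict(7, 10, [(1, 1), (50, 50)], [0.5, 0.5])
print("per-model predictions:", preds)
print("posterior model weights:", weights)
print("model-averaged prediction:", bma)
```

The averaged prediction always lies between the individual models' predictions, and it moves toward whichever model explains the observed data better.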

As for priors, you’re going to have two sets of them in this model: a prior over the parameters $\theta$ within each model $m$, and a prior for the meta-model over the different values of $m$. They enter through the posterior distributions that we integrate against (i.e. $P(\theta|D,m)$ and $P(m|x,D)$).
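To spell out where the priors enter, here is Bayes' rule at both levels. This is a sketch under two simplifying assumptions of mine: the set of models is discrete, and the model weights do not depend on the test input $x$, so I write $P(m|D)$ rather than $P(m|x,D)$.

$$P(\theta|D,m) = \frac{P(D|\theta,m)\,P(\theta|m)}{P(D|m)}, \qquad P(D|m) = \int P(D|\theta,m)\,P(\theta|m)\,d\theta$$

$$P(m|D) = \frac{P(D|m)\,P(m)}{\sum_{m'} P(D|m')\,P(m')}$$

Here $P(\theta|m)$ is the within-model prior and $P(m)$ is the prior over models. The marginal likelihood $P(D|m)$ links the two levels: it is the normalizer of the first posterior and the likelihood term in the second.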

I will point out that in this model the authors have apparently specified that the posterior over $m$ might depend on the test predictors $x$, but the posterior over $\theta$ does not. That is, $x$ might influence how you weight the different models, but not how you weight the parameters of each individual model. I don’t think that’s a crazy choice, but it’s not the only way to do this.

Okay, an example. I can’t think of a machine learning example that’s simple, but here’s an easier textbook statistics example. In this model the individual models are normal distributions with a fixed variance $\sigma^2$ and an unknown mean $\mu$. The collection of distributions (the meta-model) ranges over different values of $\sigma^2$. So here $\theta = \mu$ and $m = \sigma^2$. The standard prior for $\mu|\sigma^2$ is a normal distribution, and the standard prior for $\sigma^2$ is an inverse-gamma distribution. You can show that, for a fixed value of $\sigma^2$, the posterior predictive distribution of $y$ (with $\mu$ integrated out) is another normal distribution with its mean pulled in the direction of the sample mean. Then you integrate out (model average over) $\sigma^2$, and the posterior predictive distribution of $y$ becomes a Student-t distribution. Essentially, you get something that looks kind of like a normal distribution, but it has fat tails because you’ve averaged over different possibilities for the variance.
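Here is a rough Monte Carlo check of that claim, assuming the usual normal-inverse-gamma parameterization: $\mu|\sigma^2 \sim N(\mu_0, \sigma^2/\kappa_0)$ and $\sigma^2 \sim \text{InvGamma}(\alpha_0, \beta_0)$. The data and hyperparameter values are arbitrary numbers picked for illustration, and the function names are hypothetical. The sketch draws $\sigma^2$ from its posterior, then draws $y$ from the normal predictive for that $\sigma^2$, which is the model average done by simulation.

```python
import math
import random
import statistics


def nig_posterior(data, mu0, kappa0, alpha0, beta0):
    """Conjugate normal-inverse-gamma update; returns posterior hyperparameters."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n  # pulled toward the sample mean
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n


def sample_predictive(post, n_draws, rng):
    """Draw y by first drawing a 'model' (sigma^2), then y given that model."""
    mu_n, kappa_n, alpha_n, beta_n = post
    draws = []
    for _ in range(n_draws):
        # sigma^2 ~ InvGamma(alpha_n, beta_n): reciprocal of a
        # Gamma(shape=alpha_n, scale=1/beta_n) draw.
        sigma2 = 1.0 / rng.gammavariate(alpha_n, 1.0 / beta_n)
        # For fixed sigma^2 the predictive (mu integrated out) is normal.
        draws.append(rng.gauss(mu_n, math.sqrt(sigma2 * (1 + 1 / kappa_n))))
    return draws


rng = random.Random(0)
data = [4.9, 5.6, 5.1, 4.4, 6.2, 5.0, 5.3, 4.7, 5.8, 5.2]  # made-up sample
post = nig_posterior(data, mu0=0.0, kappa0=1.0, alpha0=2.0, beta0=2.0)
mu_n, kappa_n, alpha_n, beta_n = post

draws = sample_predictive(post, 100_000, rng)
sample_mean = statistics.fmean(draws)
sample_var = statistics.pvariance(draws)

# Variance of the Student-t predictive (2 * alpha_n degrees of freedom).
theory_var = beta_n * (kappa_n + 1) / (kappa_n * (alpha_n - 1))

# Fraction of draws more than 3 standard deviations from the mean;
# a normal distribution would give about 0.0027.
sd = math.sqrt(sample_var)
tail_frac = sum(abs(d - sample_mean) > 3 * sd for d in draws) / len(draws)

print("simulated variance:", sample_var, "Student-t variance:", theory_var)
print("fraction beyond 3 sd:", tail_frac)
```

The simulated variance should land close to the closed-form Student-t variance, and the tail fraction should come out above the normal benchmark, which is the fat-tail effect of averaging over $\sigma^2$.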

Attribution
Source : Link , Question Author : O.rka , Answer Author : jeizenga
