# What does Bayesian Hypothesis Testing mean in the framework of inference and decision theory?

My background is mainly in machine learning, and I am trying to learn what Bayesian hypothesis testing means. I am comfortable with the Bayesian interpretation of probability and familiar with it in the context of probabilistic graphical models. What confuses me is what the word “hypothesis” means in the context of statistical inference.

I think I am mostly confused by the difference between the vocabulary I am used to in machine learning and the vocabulary normally used in statistics and inference.

In the context of supervised learning, I normally think of the hypothesis as the predictive function that maps examples to their labels, i.e. $h:\mathcal{X} \rightarrow \mathcal{Y}$. However, it seems to me that the term hypothesis does not have the same meaning in the readings that I am doing. Let me paste an extract of the readings:

If you read carefully it also says:

there is a different model for the observed data …

where they use the word “model”. To me, the word model suggests a set of functions from which we select a specific predictive function, i.e. a hypothesis class of functions. For example, $\mathcal{H}_{d2}$ could be the hypothesis class of quadratic functions (polynomials of degree 2). However, it seems to me that they use the words model and hypothesis as synonyms in this extract (whereas for me they are completely different words).

Then it goes on to mention that we can put priors on the hypotheses (a completely reasonable thing to do in a Bayesian setting):

$$p_H(H_m), \ \ \ \ \ m \in \{0, 1, \dots, M-1 \}$$

We can also characterize the data under each hypothesis:

$$p_{y|H}( \cdot |H_m), \ \ \ \ \ m \in \{0, 1, \dots, M-1 \}$$

and update our current beliefs given some data (using Bayes’ rule):

$$p_{H|y}(H_m|y), \ \ \ \ \ m \in \{0, 1, \dots, M-1 \}$$
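
Written out explicitly, and assuming the $M$ hypotheses are mutually exclusive and exhaustive (my assumption, the extract does not state it in these words), I understand this update to be:

$$p_{H|y}(H_m|y)=\dfrac{p_{y|H}(y|H_m)\,p_H(H_m)}{\sum_{m'=0}^{M-1}p_{y|H}(y|H_{m'})\,p_H(H_{m'})}$$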

However, I am more used to placing a Bayesian estimate on a particular parameter (say $\theta$) within a hypothesis class rather than on the whole hypothesis class. Since these “hypotheses” do not seem to be the hypotheses of the machine learning context that I am used to, they look to me more like a specific parameter value $\theta$ than like a hypothesis class.

At this point I was convinced that “hypothesis” meant the same thing as the predictive function (parametrized by a parameter $\theta$, for example), but I think I was wrong…

To make my confusion even worse, these same readings later went on to assign a particular “hypothesis” to each training example that they observed. Let me paste an extract of what I mean:

The reason this confuses me is that, if I interpret a hypothesis as a parameter, then it makes no sense to me to specify a separate parameter for each sample value that we see. At this point I concluded that I really didn’t know what they meant by hypothesis, so I posted this question.

However, I didn’t fully give up. I researched what hypothesis means in frequentist statistics and found the following Khan Academy video. That video actually makes a lot of sense to me (maybe you are a frequentist! 🙂). There, it seems that they collect a bunch of data (some “sample set”) and, based on the properties of that sample set, decide whether to accept or reject the null hypothesis about the data. In the Bayesian context that I am reading, however, it seems to me that each observed data vector is “labeled” with a hypothesis by means of the “likelihood ratio test”:

The way they assign a hypothesis to each data sample even looks like a supervised learning setting where we attach a label to each training example. However, I don’t think that’s what they are doing in this context. What are they doing? What does it mean to assign a hypothesis to each data sample? What is the meaning of a hypothesis? What does the word model mean?

Basically, after this long explanation of my confusion, does someone know what Bayesian hypothesis testing means in this context?

If you need any clarification or anything to improve my question or so that the question makes sense, I am more than happy to help 🙂

In my search for an answer I found some useful things related to statistical hypothesis testing:

This one is a good introduction to the topic if you come from a CS background (like me):

What is a good introduction to statistical hypothesis testing for computer scientists?

At some point I asked about “default parameters” without defining what I meant. I thought it was a standard term, but it isn’t, so I will address it here. What I truly meant is: how do you specify the parameters for each hypothesis that you have? For example, how do you decide what your null hypothesis is and what its parameters are? There is a question related to that:

How to specify the null hypothesis in hypothesis testing

A statistical model is given by a family of probability distributions. When the model is parametric, this family is indexed by an unknown parameter $\theta$:
$$\mathcal{F}=\left\{ f(\cdot|\theta);\ \theta\in\Theta \right\}$$
If one wants to test a hypothesis on $\theta$ like $H_0:\,\theta\in\Theta_0$, one can consider that two models are in opposition: $\mathcal{F}$ versus
$$\mathcal{F}_0=\left\{ f(\cdot|\theta);\ \theta\in\Theta_0 \right\}$$
From my Bayesian perspective, I am drawing inference on the index of the model behind the data, $\mathcal{M}$. Hence I put a prior on this index, $\rho_0$ and $\rho_a=1-\rho_0$, as well as on the parameters of both models, $\pi_0(\theta)$ over $\Theta_0$ and $\pi_a(\theta)$ over $\Theta$. And I then deduce the posterior distribution of this index:
$$\pi(m=0|x)=\dfrac{\rho_0\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\text{d}\theta}{\rho_0\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\text{d}\theta +(1-\rho_0)\int_{\Theta} f(x|\theta)\pi_a(\theta)\text{d}\theta}$$
The document you linked to goes into much more detail on this perspective and should be your entry of choice into statistical testing of hypotheses, unless you can afford to go through a whole Bayesian book, or even a machine learning book like Kevin Murphy’s.
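
As a side remark, the posterior probability of the null model can be rewritten in terms of the ratio of the two marginal likelihoods, usually called the Bayes factor. The symbol $B_{01}(x)$ is notation I introduce here for illustration; it does not come from the linked document:

$$\pi(m=0|x)=\left[1+\dfrac{1-\rho_0}{\rho_0}\,\dfrac{1}{B_{01}(x)}\right]^{-1},\qquad B_{01}(x)=\dfrac{\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\text{d}\theta}{\int_{\Theta} f(x|\theta)\pi_a(\theta)\text{d}\theta}$$

This follows from dividing the numerator and denominator of the posterior formula by $\rho_0\int_{\Theta_0} f(x|\theta)\pi_0(\theta)\text{d}\theta$.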

For instance, in the setting where $X\sim\mathcal{N}(\theta,1)$ is observed, if the hypothesis to be tested is $H_0:\theta=0$, the posterior probability that $\theta=0$ is the posterior probability that the model producing the data is $\mathcal{N}(0,1)$. According to the above formula, if the prior distribution on $\theta$ under the alternative is $\theta\sim\mathcal{N}(0,10)$ (that is, with variance 10), and if we put equal weights on both hypotheses, i.e., $\rho_0=1/2$, this posterior probability is
\begin{align*}\pi(m=0|x)&=\dfrac{\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}}{\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}
+\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\exp\{-(x-\theta)^2/2\}\frac{1}{\sqrt{2\pi\times10}}\exp\{-\theta^2/20\}\text{d}\theta}\\
&=\dfrac{\exp\{-x^2/2\}}{\exp\{-x^2/2\}
+\frac{1}{\sqrt{11}}\exp\{-x^2/22\}}
\end{align*}
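
The closed form above can be checked numerically. The following is a minimal sketch rather than a definitive implementation: the function names (`normal_pdf`, `posterior_null_closed_form`, `posterior_null_numeric`) and the integration settings (`half_width`, `steps`) are my own choices for illustration. It evaluates the posterior probability of $H_0$ once from the closed form, using the fact that the marginal of $x$ under the alternative is $\mathcal{N}(0,11)$, and once by integrating $f(x|\theta)\pi_a(\theta)$ over $\theta$ with a trapezoid rule.

```python
import math


def normal_pdf(x, mean=0.0, var=1.0):
    """Density of N(mean, var) at x (var is a variance, not a std dev)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_null_closed_form(x, prior_var=10.0, rho0=0.5):
    """P(m = 0 | x) using the analytic marginal N(0, 1 + prior_var)."""
    m0 = normal_pdf(x, 0.0, 1.0)              # likelihood under H0: theta = 0
    ma = normal_pdf(x, 0.0, 1.0 + prior_var)  # marginal under the alternative
    return rho0 * m0 / (rho0 * m0 + (1.0 - rho0) * ma)


def posterior_null_numeric(x, prior_var=10.0, rho0=0.5,
                           half_width=40.0, steps=20000):
    """Same quantity, with the marginal under the alternative obtained by
    a trapezoid rule over theta in [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        theta = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(x, theta, 1.0) * normal_pdf(theta, 0.0, prior_var)
    ma = total * h
    m0 = normal_pdf(x, 0.0, 1.0)
    return rho0 * m0 / (rho0 * m0 + (1.0 - rho0) * ma)


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.0, 3.0):
        print(x, posterior_null_closed_form(x), posterior_null_numeric(x))
```

At $x=0$ the closed form reduces to $\sqrt{11}/(1+\sqrt{11})\approx 0.768$, so an observation sitting exactly at the null value raises the probability of $H_0$ above its prior weight of $1/2$, and the probability decreases as $|x|$ grows.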