I will try to describe the problem at hand as generally as possible. I am modeling observations with a categorical distribution whose parameter is a probability vector theta.

Then, I assume the parameter vector theta follows a Dirichlet prior distribution with parameters $\alpha_1,\alpha_2,\ldots,\alpha_k$.

Is it then possible to also impose a hyperprior distribution over the parameters $\alpha_1,\alpha_2,\ldots,\alpha_k$? Will it have to be a multivariate distribution, like the categorical and Dirichlet distributions? It seems to me that the alphas are always positive, so a gamma hyperprior should work.

I am not sure if anyone has tried fitting such (possibly) overparametrized models, but it seems reasonable to me to think that the alphas should not be fixed but should rather come from a gamma distribution.

Please try to provide me with some references or insights on how I could try such an approach in practice.

**Answer**

I don’t think this is an “overparameterized” model at all. I would argue that by placing a prior over the Dirichlet parameters, you’re being less committal about any particular outcome. In particular, as you probably know, for symmetric Dirichlet distributions (i.e. $\alpha_1 = \alpha_2 = \ldots = \alpha_K = \alpha$), setting $\alpha<1$ gives more prior probability to sparse multinomial distributions, while $\alpha>1$ gives more prior probability to smooth multinomial distributions.

In cases where one has no strong expectation for either sparse or dense multinomial distributions, placing a hyperprior over your Dirichlet distribution gives your model some added flexibility to choose between them.

I originally got the idea of doing this from this paper. The hyperprior they use is slightly different from what you suggest. They sample a probability vector from a Dirichlet and then scale it by a draw from an exponential (or gamma). So the model is
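To make the sparse versus smooth distinction concrete, here is a minimal sketch that draws from symmetric Dirichlet distributions and reports how large the biggest component of $\theta$ tends to be. It uses only the Python standard library and the standard construction of a Dirichlet draw as normalized independent Gamma draws. The helper names, the choice of $K=10$, and the two values of $\alpha$ are illustrative assumptions, not part of the original discussion.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(a, k=10, n_draws=2000, seed=0):
    """Average of max(theta) over many draws from a symmetric Dirichlet(a, ..., a)."""
    rng = random.Random(seed)
    alpha = [a] * k
    return sum(max(sample_dirichlet(alpha, rng)) for _ in range(n_draws)) / n_draws

# Illustrative values only: one alpha below 1, one above 1.
sparse = mean_largest_component(0.1)   # mass tends to pile onto a few categories
smooth = mean_largest_component(10.0)  # mass tends to spread out near 1/k each

print(f"alpha=0.1  -> average largest component: {sparse:.3f}")
print(f"alpha=10.0 -> average largest component: {smooth:.3f}")
```

With $\alpha<1$ the largest component should come out much closer to 1 than with $\alpha>1$, which is the behaviour a hyperprior lets the data choose between.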

\begin{eqnarray}
\beta &\sim& \mathrm{Dirichlet}(1)\\
\lambda &\sim& \mathrm{Exponential}(\cdot)\\
\theta &\sim& \mathrm{Dirichlet}(\beta\lambda)
\end{eqnarray}

The extra Dirichlet is simply to avoid imposing symmetry.
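As a rough illustration of the generative process above, the following sketch draws $\beta$, $\lambda$ and $\theta$ in order using only the Python standard library. The exponential's parameter is left unspecified in the model, so the mean used here (`lam_mean=10.0`) is purely an assumption for the example, as are the function names. The fallback for numerical underflow is also my own addition, not something taken from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:
        # Assumption: if every Gamma draw underflows (all alpha_i tiny), fall back
        # to the small-alpha limit, a one-hot vector chosen proportionally to alpha.
        hot = rng.choices(range(len(alpha)), weights=alpha)[0]
        return [1.0 if i == hot else 0.0 for i in range(len(alpha))]
    return [d / total for d in draws]

def sample_theta(k, lam_mean=10.0, rng=None):
    """One draw from: beta ~ Dirichlet(1), lambda ~ Exponential, theta ~ Dirichlet(beta * lambda)."""
    rng = rng or random.Random()
    beta = sample_dirichlet([1.0] * k, rng)   # asymmetric base proportions
    lam = rng.expovariate(1.0 / lam_mean)     # overall concentration (assumed mean)
    theta = sample_dirichlet([b * lam for b in beta], rng)
    return beta, lam, theta

beta, lam, theta = sample_theta(k=5, rng=random.Random(0))
print("beta  =", [round(b, 3) for b in beta])
print("lambda =", round(lam, 3))
print("theta =", [round(t, 3) for t in theta])
```

Here $\beta$ controls which categories are favoured a priori and $\lambda$ controls how tightly $\theta$ concentrates around $\beta$, so small draws of $\lambda$ yield sparse $\theta$ and large draws yield $\theta$ close to $\beta$.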

I’ve also seen people use just the Gamma hyperprior for a Dirichlet in the context of hidden Markov models with multinomial emission distributions, but I can’t seem to find a reference. Also, it seems like I’ve encountered similar hyperpriors used in topic models.
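For the plain Gamma hyperprior, one practical route is to integrate $\theta$ out analytically (the Dirichlet-multinomial marginal likelihood) and work with the unnormalized log posterior of the $\alpha$'s, which could then be handed to an optimizer or an MCMC sampler. The sketch below assumes independent $\mathrm{Gamma}(\text{shape}, \text{rate})$ priors on each $\alpha_i$ and several groups of counts that each have their own $\theta$ but share the same $\alpha$. The function name, the hyperparameter values and the count data are all made up for illustration.

```python
import math

def log_posterior_alpha(alpha, counts_per_group, shape=1.0, rate=1.0):
    """Unnormalized log p(alpha | counts) with theta integrated out.

    alpha_i ~ Gamma(shape, rate) independently (assumed hyperprior);
    each group's theta ~ Dirichlet(alpha); counts come from categorical draws.
    """
    if any(a <= 0 for a in alpha):
        return -math.inf  # alpha must be strictly positive

    # Gamma log density up to an additive constant.
    log_prior = sum((shape - 1.0) * math.log(a) - rate * a for a in alpha)

    # Dirichlet-multinomial marginal likelihood, summed over groups.
    a_sum = sum(alpha)
    log_lik = 0.0
    for counts in counts_per_group:
        n = sum(counts)
        log_lik += math.lgamma(a_sum) - math.lgamma(a_sum + n)
        log_lik += sum(math.lgamma(a + c) - math.lgamma(a)
                       for a, c in zip(alpha, counts))
    return log_prior + log_lik

# Made-up data: three groups, each dominated by a single category.
counts = [[9, 1, 0], [0, 10, 0], [1, 0, 9]]
sparse_score = log_posterior_alpha([0.2, 0.2, 0.2], counts)
smooth_score = log_posterior_alpha([5.0, 5.0, 5.0], counts)
print("log posterior, alpha=0.2:", round(sparse_score, 2))
print("log posterior, alpha=5.0:", round(smooth_score, 2))
```

Because each made-up group is dominated by one category, the small-$\alpha$ setting should score higher here, which is the kind of adaptation a hyperprior is meant to allow.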

**Attribution** *Source: Link, Question Author: Dnaiel, Answer Author: jerad*