Suppose I will be getting some samples from a binomial distribution. One way to model my prior knowledge is with a Beta distribution with parameters α and β. As I understand it, this is equivalent to having seen “heads” α times in α+β trials. As such, a nice shortcut to doing the full-blown Bayesian inference is to use $\frac{h+\alpha}{n+\alpha+\beta}$ as my new mean for the probability of “heads” after having seen h heads in n trials.
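To make the shortcut concrete, here is a minimal Python sketch of the conjugate update. The function name `beta_posterior_mean` and the example numbers are illustrative assumptions, not anything from a particular library.

```python
def beta_posterior_mean(alpha, beta, heads, trials):
    """Posterior mean of P(heads) under a Beta(alpha, beta) prior
    after observing `heads` successes in `trials` draws."""
    if not 0 <= heads <= trials:
        raise ValueError("heads must be between 0 and trials")
    # Conjugacy: the posterior is Beta(alpha + heads, beta + trials - heads),
    # whose mean is (heads + alpha) / (trials + alpha + beta).
    return (heads + alpha) / (trials + alpha + beta)


# Hypothetical numbers: a prior worth 2 heads in 4 pseudo-trials,
# followed by 7 heads in 10 real trials.
print(beta_posterior_mean(2, 2, 7, 10))
```

With h = 0 and n = 0 the function returns the prior mean α/(α+β), which is a quick sanity check on the pseudo-count reading.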

Now suppose I have more than two states, so I will be getting some samples from a multinomial distribution. Suppose I want to use a Dirichlet distribution with parameter α as a prior. Again as a shortcut I can treat this as prior knowledge of event i's probability as being equivalent to $\frac{\alpha_i}{\sum_j \alpha_j}$, and if I witness event i h times in n trials my posterior for i becomes $\frac{h+\alpha_i}{n+\sum_j \alpha_j}$.
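The same shortcut for the Dirichlet case can be sketched in a few lines of Python. The name `dirichlet_posterior_mean` and the die counts are made up for illustration.

```python
def dirichlet_posterior_mean(alpha, counts):
    """Posterior mean of each outcome's probability under a
    Dirichlet(alpha) prior after observing multinomial `counts`."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    # Every outcome shares the same denominator: n + sum_j alpha_j.
    total = sum(alpha) + sum(counts)
    return [(a + c) / total for a, c in zip(alpha, counts)]


# Hypothetical 6-sided die: one pseudo-count per side, then 10 real rolls.
alpha = [1, 1, 1, 1, 1, 1]
counts = [3, 1, 0, 2, 4, 0]
means = dirichlet_posterior_mean(alpha, counts)
print(means, sum(means))
```

Because all outcomes share one denominator, the posterior means always sum to 1.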

Now in the binomial case, it works out that prior knowledge of “heads” occurring α times in α+β trials is equivalent to “tails” occurring β times in α+β trials. Logically I don’t believe I can have stronger knowledge of “heads” likelihood than of “tails.” This gets more interesting with more than two outcomes though. If I have say a 6-sided die, I can imagine my prior knowledge of side “1” being equivalent to 10 ones in 50 trials and my prior knowledge of side “2” as being equivalent to 15 twos in 100 trials.
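A small Python sketch shows what goes wrong if each outcome is given its own number of pseudo-trials and updated separately. The three-outcome prior and the observed counts below are hypothetical, chosen so that the prior means sum to 1 while the pseudo-trial totals differ.

```python
def naive_update(prior_count, prior_trials, h, n):
    """Per-outcome shortcut with its own pseudo-trial total
    (NOT a valid Dirichlet update when the totals differ)."""
    return (h + prior_count) / (n + prior_trials)


# Hypothetical priors as (pseudo-count, pseudo-trials) per outcome.
# Prior means are 0.20, 0.15 and 0.65, which sum to 1.
priors = [(10, 50), (15, 100), (13, 20)]
observed = [5, 5, 10]  # made-up counts from n = 20 trials
n = sum(observed)

prior_means = [c / t for c, t in priors]
posterior = [naive_update(c, t, h, n) for (c, t), h in zip(priors, observed)]

print(sum(prior_means))  # coherent before seeing data
print(sum(posterior))    # no longer sums to 1 after the naive update
```

A single Dirichlet has only one total, $\sum_j \alpha_j$, so it cannot represent different pseudo-trial counts for different outcomes.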

So after all of that introduction, my question is how I can properly model such asymmetric prior knowledge in the multinomial case? It seems as though if I’m not careful I can easily get illogical results due to total probability/likelihood not summing to 1. Is there some way I can still use the Dirichlet shortcut, or do I need to sacrifice this altogether and use some other prior distribution entirely?

Please forgive any confusion caused by potential abuses in notation or terminology above.

**Answer**

You have framed your question very well.

I think what you are looking for here is a case of hierarchical modeling, possibly with multiple layers of hierarchy (at the moment you only talk about priors). Adding a layer of hyper-priors on the hyper-parameters lets you model the additional variability in those hyper-parameters, which is the issue you are concerned about. It also makes the model more flexible and robust, though it may be slower to fit.

Specifically, in your case you may benefit from placing priors on the parameters of the Dirichlet distribution (the Beta is a special case). This post by Gelman talks about how to impose priors on the parameters of a Dirichlet distribution. He also cites one of his papers in a journal of toxicology.
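As a rough illustration of the idea (not Gelman's construction), here is one simple way to put a hyper-prior on a Dirichlet's parameters. It assumes α = s·m, where m is a fixed base mean vector and the overall strength s has a discrete hyper-prior on a small grid. The function names, the grid, and the data are all hypothetical.

```python
import math


def log_marginal(alpha, counts):
    """Log Dirichlet-multinomial likelihood of `counts`, dropping the
    multinomial coefficient (it does not depend on alpha)."""
    a_total, n = sum(alpha), sum(counts)
    out = math.lgamma(a_total) - math.lgamma(a_total + n)
    for a, c in zip(alpha, counts):
        out += math.lgamma(a + c) - math.lgamma(a)
    return out


def hierarchical_posterior_mean(base_mean, counts, strengths, hyperprior):
    """Posterior mean of the outcome probabilities when alpha = s * base_mean
    and s has a discrete hyper-prior over `strengths`."""
    n = sum(counts)
    # Unnormalised log posterior weight of each candidate strength s.
    log_w = [math.log(w) + log_marginal([s * m for m in base_mean], counts)
             for s, w in zip(strengths, hyperprior)]
    top = max(log_w)  # subtract the max for numerical stability
    weights = [math.exp(lw - top) for lw in log_w]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Average the conditional Dirichlet posterior means over s.
    mean = [0.0] * len(base_mean)
    for s, w in zip(strengths, weights):
        for i, m in enumerate(base_mean):
            mean[i] += w * (counts[i] + s * m) / (n + s)
    return mean, weights


# Hypothetical fair-die base mean, made-up rolls, and a three-point grid for s.
mean, weights = hierarchical_posterior_mean(
    [1 / 6] * 6, [3, 1, 0, 2, 4, 0], [1.0, 10.0, 100.0], [1 / 3] * 3)
print(mean, weights)
```

Each conditional mean sums to 1, so their weighted average does too. A fuller hierarchical model would also place a prior on m and would typically be fit by sampling rather than on a grid.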

**Attribution**

*Source: Link, Question Author: Michael McGowan, Answer Author: suncoolsu*