Robby McKilliam says in a comment to this post:

It should be pointed out that, from the frequentist’s point of view, there is no reason that you can’t incorporate the prior knowledge into the model. In this sense, the frequentist view is simpler: you only have a model and some data. There is no need to separate the prior information from the model.

Also, here, @jbowman says that frequentists regularize with a cost/penalty function, while Bayesians can express the same thing as a prior:

Frequentists realized that regularization was good, and use it quite commonly these days – and Bayesian priors can be easily interpreted as regularization.

So, my question is, can frequentists in general incorporate into their models what Bayesians specify as priors? Taking the regularization as an example, is the cost/penalty function really integrated into the model, or is this a purely artificial means of adjusting the solution (as well as making it unique)?

**Answer**

With respect to Robby McKilliam’s comment: I think the difficulty a frequentist would have with this lies in the definition of “prior knowledge”, not so much with the ability to incorporate prior knowledge in a model. For example, consider estimating the probability that a given coin will come up heads. Let us assume my prior knowledge was, essentially, an experiment in which that coin had been flipped 10 times and came up heads 5 times, or perhaps of the form “the factory made 1 million coins, and the distribution of $p$, as determined by huge experiments, is $\mathrm{Beta}(a,b)$”. Everyone uses Bayes’ Rule when they really do have prior information of this type (Bayes’ Rule follows directly from the definition of conditional probability; it’s not a Bayesian-only thing), so in real life the frequentist and the Bayesian would use the same approach and incorporate the information into the model via Bayes’ Rule. (Caveat: unless your sample size is large enough that you are pretty sure the prior information is not going to have an effect on the results.) However, the interpretation of the results is, of course, different.
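As a minimal sketch of the coin example, assume the earlier experiment (5 heads in 10 flips) is encoded as Beta pseudo-counts and that some new flips are then observed. The new counts (14 heads, 6 tails) and the function names are made up for illustration. Under that encoding, the posterior mean from Bayes’ Rule coincides with the estimate a frequentist would get by simply pooling both experiments:

```python
def beta_binomial_update(a, b, heads, tails):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails)."""
    return a + heads, b + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Assumption: the earlier experiment (5 heads, 5 tails) is treated as
# pseudo-counts, i.e. a Beta(5, 5) prior on p.
prior_heads, prior_tails = 5, 5

# Hypothetical new data, invented for this example.
new_heads, new_tails = 14, 6

a_post, b_post = beta_binomial_update(prior_heads, prior_tails, new_heads, new_tails)
posterior_mean = beta_mean(a_post, b_post)

# Frequentist alternative: pool both experiments and take the sample proportion.
pooled_estimate = (prior_heads + new_heads) / (
    prior_heads + prior_tails + new_heads + new_tails
)

print(posterior_mean, pooled_estimate)
```

The two numbers agree here only because the prior is itself a record of earlier flips; with a prior that does not correspond to data, the pooling interpretation is no longer available.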

Difficulty arises, especially from a philosophical point of view, as the knowledge becomes less objective / experimental and more subjective. As this happens, the frequentist will likely become less inclined to incorporate this information into the model at all, whereas the Bayesian still has some more-or-less formal mechanisms for doing so, difficulties of eliciting a subjective prior notwithstanding.

With respect to regularization: Consider a likelihood $l(\theta;x)$ and a prior $p(\theta)$. There is nothing to prevent, at least not technically, a frequentist from using maximum likelihood estimation “regularized” by $\log p(\theta)$, as in:

$\tilde{\theta} = \arg\max_{\theta} \{\log l(\theta;x) + \log p(\theta) \}$

For $p(\theta)$ Gaussian, this amounts to a quadratic penalty shrinking $\theta$ towards the mean of the Gaussian, and so forth for other distributions. $\tilde{\theta}$ is equal to the maximum a posteriori (MAP) point estimate of a Bayesian using the same likelihood function and prior. Of course, again, the interpretation of the frequentist and Bayesian estimates will differ. The Bayesian is also not constrained to use a MAP point estimate, having access to a full posterior distribution – but then, the frequentist doesn’t have to maximize a regularized log likelihood either, being able to use various robust estimates, or method-of-moments, etc., if available.
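To illustrate the Gaussian case, here is a small sketch that maximizes a log likelihood plus $\log p(\theta)$ for the mean of normal data with known standard deviation. The data, the values of `sigma`, `mu0` and `tau`, and the helper names are all assumptions chosen for the example. The numerically maximized penalized estimate matches the closed-form MAP estimate and sits between the prior mean and the unpenalized MLE:

```python
def penalized_log_lik(theta, xs, sigma, mu0, tau):
    """Gaussian log likelihood plus Gaussian log prior, additive constants dropped."""
    log_lik = -sum((x - theta) ** 2 for x in xs) / (2 * sigma**2)
    log_prior = -((theta - mu0) ** 2) / (2 * tau**2)  # the quadratic penalty
    return log_lik + log_prior


def argmax_ternary(f, lo, hi, iters=200):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def map_closed_form(xs, sigma, mu0, tau):
    """Precision-weighted average of the sample mean and the prior mean."""
    n = len(xs)
    prec_data = n / sigma**2
    prec_prior = 1 / tau**2
    return (prec_data * (sum(xs) / n) + prec_prior * mu0) / (prec_data + prec_prior)


# Hypothetical data and hyperparameters, invented for this example.
xs = [2.1, 1.7, 2.4, 1.9, 2.3]
sigma, mu0, tau = 1.0, 0.0, 0.5

theta_mle = sum(xs) / len(xs)  # unpenalized maximum likelihood estimate
theta_penalized = argmax_ternary(
    lambda t: penalized_log_lik(t, xs, sigma, mu0, tau), -10.0, 10.0
)
theta_map = map_closed_form(xs, sigma, mu0, tau)

print(theta_mle, theta_penalized, theta_map)
```

Making `tau` smaller strengthens the penalty and pulls the estimate closer to `mu0`; letting `tau` grow recovers the plain MLE.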

Again, difficulty arises from a philosophical point of view. Why choose one regularization function over another? A Bayesian can do so, shifting to a prior-based view, by assessing the prior information. A frequentist would have a harder time justifying (or be unable to justify?) a choice on those grounds, and would instead likely choose largely based on the properties of the regularization function as applied to his/her type of problem, as learned from the joint work and experience of many statisticians. On the other hand, (pragmatic) Bayesians do that with priors too – if I had $100 for every paper on priors for variances I’ve read…

Other “thoughts”: I’ve skipped the entire issue of selecting a likelihood function by assuming that the choice is unaffected by the frequentist / Bayesian viewpoint. I’m sure in most cases it is unaffected, but I can imagine that in unusual situations the viewpoint would influence the choice, e.g., for computational reasons.

Summary: I suspect frequentists can, except perhaps for some corner cases, incorporate pretty much any prior information into their models that a Bayesian can, from a strictly mathematical and computational viewpoint. Interpretation of results will of course be different. I don’t, however, believe the frequentist would regard it as philosophically correct to do so in all cases, e.g., when the regularization function above is chosen because the person down the hall, who actually knows something about $\theta$, says “I think $\theta$ should be around 1.5”. And incorporating close-to-ignorance via, say, a Jeffreys prior, is right out.

**Attribution**

*Source: Link, Question Author: Patrick, Answer Author: jbowman*