From any generic sampling algorithm, one can derive an optimization algorithm.
Indeed, to maximize an arbitrary function f: x → f(x), it suffices to draw samples from g ∝ exp(f/T). As T becomes small enough, these samples concentrate near the global maximum of f (or near local maxima in practice).
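As a minimal sketch of this idea, here is Metropolis sampling from g ∝ exp(f/T) on a toy 1-D target. The function f, the step size, and the temperature values are illustrative choices, not anything from the discussion above:

```python
import math
import random

def f(x):
    # Toy objective with its global maximum at x = 3.
    return -(x - 3.0) ** 2

def metropolis(f, T, steps=20000, step_size=0.5, x0=0.0, seed=0):
    """Metropolis sampler targeting g(x) ∝ exp(f(x) / T)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        delta = f(proposal) - f(x)
        # Accept with probability min(1, exp(delta / T)).
        if rng.random() < math.exp(min(0.0, delta / T)):
            x = proposal
    return x

x_hot = metropolis(f, T=10.0)   # high T: diffuse samples over the domain
x_cold = metropolis(f, T=0.01)  # low T: concentrates near the maximum of f
```

At high T this is an ordinary sampler; driving T toward zero turns the same loop into a stochastic hill climber, which is exactly the intuition behind simulated annealing.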
By “sampling” I mean drawing a pseudo-random sample from a distribution whose log-likelihood function is known up to a constant: for instance, MCMC sampling, Gibbs sampling, Beam Sampling, etc. By “optimization” I mean the attempt to find parameters that maximize the value of a given function.
Is the reverse possible?
Given a heuristic to find the maximum of a function or a combinatorial expression, can we extract an efficient sampling procedure?
HMC, for instance, takes advantage of gradient information. Can we construct a sampling procedure that exploits a BFGS-like approximation of the Hessian?
(edit: apparently yes: http://papers.nips.cc/paper/4464-quasi-newton-methods-for-markov-chain-monte-carlo.pdf)
We can use MCTS on combinatorial problems; can we translate that into a sampling procedure?
Context: a common difficulty in sampling is that most of the mass of the probability distribution lies within a very small region. There are interesting techniques for finding such regions, but they do not directly translate into unbiased sampling procedures.
Edit: I now have a lingering feeling that a positive answer to that question would be somewhat equivalent to the equality of the complexity classes #P and NP, making the answer a likely “no”. This would explain why every sampling technique yields an optimization technique but not vice versa.
One connection has been brought up by Max Welling and friends in these two papers:
- Bayesian Learning via Stochastic Gradient Langevin Dynamics
- Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring.
The gist is that “learning”, i.e. optimization of a model, smoothly transitions into sampling from the posterior.