I’m trying to understand some papers by Mark van der Laan. He’s a theoretical statistician at Berkeley working on problems that overlap significantly with machine learning. One problem for me (besides the deep math) is that he often ends up describing familiar machine learning approaches using a completely different terminology. One of his main concepts is “Targeted Maximum Likelihood Estimation” (TMLE).

TMLE is used to analyze censored observational data from a non-controlled experiment in a way that allows effect estimation even in the presence of confounding factors. I strongly suspect that many of the same concepts exist under other names in other fields, but I don’t yet understand it well enough to match it directly to anything.

An attempt to bridge the gap to “Computational Data Analysis” is here:

And an introduction for statisticians is here:

Targeted Maximum Likelihood Based Causal Inference: Part I

From the second:

> In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified.

In his terminology, “super learning” is ensemble learning with a theoretically sound non-negative weighting scheme. But what does he mean by “applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor”?

Or, breaking it into three distinct questions: does TMLE have a parallel in machine learning? What is a “least favorable parametric submodel”? And what is a “fluctuation function” called in other fields?

**Answer**

I agree that van der Laan has a tendency to invent new names for already existing ideas (e.g. the super-learner), but TMLE is not one of them as far as I know. It is actually a very clever idea, and I have seen nothing from the Machine Learning community which looks similar (although I might just be ignorant). The ideas come from the theory of semiparametric-efficient estimating equations, which is something that I think statisticians think much more about than ML people.
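To make the super-learner remark concrete, the question's gloss of it ("ensemble learning with a theoretically sound non-negative weighting scheme") is essentially stacking: each base learner produces out-of-fold predictions, and the ensemble weights are fit by non-negative least squares and normalized. A minimal sketch, with two deliberately simple hypothetical base learners of my own choosing (not van der Laan's setup):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=300)
y = np.sin(x) + rng.normal(scale=0.2, size=300)

# Two toy base learners (hypothetical choices): a global mean and a linear fit.
def fit_mean(xtr, ytr):
    m = ytr.mean()
    return lambda xe: np.full_like(xe, m)

def fit_linear(xtr, ytr):
    b, a = np.polyfit(xtr, ytr, 1)   # slope, intercept
    return lambda xe: b * xe + a

learners = [fit_mean, fit_linear]

# Out-of-fold (cross-validated) predictions for each learner
K = 5
folds = np.arange(len(x)) % K
Z = np.zeros((len(x), len(learners)))
for k in range(K):
    tr, te = folds != k, folds == k
    for j, fit in enumerate(learners):
        Z[te, j] = fit(x[tr], y[tr])(x[te])

# Non-negative least squares for the ensemble weights, then normalize to sum 1
w, _ = nnls(Z, y)
w = w / w.sum()
print(w)  # non-negative weights summing to 1
```

The final ensemble predictor is then the w-weighted combination of the base learners refit on all the data.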

The idea essentially is this. Suppose $P_0$ is a true data generating mechanism, and interest is in *a particular functional* $\Psi(P_0)$. Associated with such a functional there is often an estimating equation

$$\sum_i \varphi(Y_i \mid \theta) = 0,$$

where $\theta = \theta(P)$ is determined in some way by $P$ and contains enough information to identify $\Psi$. $\varphi$ will be such that $E_P\, \varphi(Y \mid \theta) = 0$. Solving this equation in $\theta$ may, for example, be much easier than estimating all of $P_0$. This estimating equation is *efficient* in the sense that any efficient estimator of $\Psi(P_0)$ is asymptotically equivalent to one which solves this equation. *(Note: I’m being a little loose with the term “efficient”, since I’m just describing the heuristic.)* The theory behind such estimating equations is quite elegant, with this book being the canonical reference. This is where one might find standard definitions of “least favorable submodels”; these aren’t terms van der Laan invented.
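As a toy instance of such an estimating equation (my own illustration, not from the paper): for the functional $\Psi(P) = E_P[Y]$, one can take $\varphi(Y \mid \theta) = Y - \theta$, and solving $\sum_i \varphi(Y_i \mid \theta) = 0$ numerically recovers the sample mean without modeling the rest of $P_0$:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
y = rng.exponential(scale=3.0, size=1000)  # toy data; Psi(P0) = E[Y] = 3

def phi(y, theta):
    # Estimating-function term for the mean: phi(Y | theta) = Y - theta
    return y - theta

# Solve the estimating equation sum_i phi(Y_i | theta) = 0 in theta.
# The sum is positive at theta = min(y) and negative at theta = max(y),
# so a bracketing root finder applies.
theta_hat = brentq(lambda t: phi(y, t).sum(), a=y.min(), b=y.max())

print(theta_hat)  # equals the sample mean y.mean()
```

Note that the equation pins down only $\theta$, not the full distribution, which is the point of the approach.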

However, estimating $P_0$ using machine learning techniques will not, in general, satisfy this estimating equation. Estimating, say, the density of $P_0$ is an intrinsically difficult problem, perhaps much harder than estimating $\Psi(P_0)$, but machine learning techniques will typically go ahead and estimate $P_0$ with some $\hat P$, and then use a plug-in estimate $\Psi(\hat P)$. van der Laan would criticize this estimator as *not being targeted* and hence possibly inefficient – perhaps not even $\sqrt{n}$-consistent at all! Nevertheless, van der Laan recognizes the power of machine learning, and knows that estimating the effects he is interested in will ultimately require some density estimation. But he doesn’t care about estimating $P_0$ itself; the density estimation is only done for the purpose of getting at $\Psi$.

The idea of TMLE is to start with the initial density estimate $\hat p$ and then consider a new model like this:

$$\hat p_{1,\epsilon} = \frac{\hat p \, \exp(\epsilon\, \varphi(Y \mid \theta))}{\int \hat p \, \exp(\epsilon\, \varphi(y \mid \theta))\, dy},$$

where $\epsilon$ is called a fluctuation parameter. Now we do maximum likelihood on $\epsilon$. If it happens to be the case that $\epsilon = 0$ is the MLE, then one can easily verify by taking the derivative that **$\hat p$ solves the efficient estimating equation, and hence is efficient for estimating $\Psi$!** On the other hand, if $\epsilon \ne 0$ at the MLE, we have a new density estimator $\hat p_1$ which fits the data better than $\hat p$ (after all, we did MLE, so it has a higher likelihood). Then we iterate this procedure and look at

$$\hat p_{2,\epsilon} \propto \hat p_{1,\hat\epsilon}\, \exp(\epsilon\, \varphi(Y \mid \theta)),$$

and so on until we get something, in the limit, which satisfies the efficient estimating equation.
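To make the iteration concrete, here is a toy numerical sketch (my own construction, not any real TMLE implementation): the target $\Psi$ is the mean, $\varphi(y \mid \theta) = y - \psi$, densities live on a grid, the initial density estimate is deliberately misspecified, and the MLE over $\epsilon$ is done by grid search. Each fluctuation step tilts the density toward the data until $\epsilon = 0$ maximizes the likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)  # observed data; E[Y] = 2

# Densities represented numerically on a grid (toy setup only)
grid = np.linspace(-4.0, 8.0, 2001)
dg = grid[1] - grid[0]

# Deliberately misspecified initial density estimate: standard normal
p_hat = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)
p_hat /= p_hat.sum() * dg

def psi(p):
    # Plug-in value of the target functional: the mean under density p
    return float(np.sum(grid * p) * dg)

def log_lik(p, data):
    # Log-likelihood of the data under the grid density p
    return float(np.sum(np.log(np.interp(data, grid, p))))

for _ in range(100):
    phi = grid - psi(p_hat)  # phi(y) = y - psi for the mean functional
    # Fluctuation submodel p_eps ∝ p_hat * exp(eps * phi); MLE over eps
    eps_grid = np.linspace(-1.0, 1.0, 401)
    lls = []
    for eps in eps_grid:
        p_eps = p_hat * np.exp(eps * phi)
        p_eps /= p_eps.sum() * dg
        lls.append(log_lik(p_eps, y))
    eps_best = float(eps_grid[int(np.argmax(lls))])
    if abs(eps_best) < 1e-9:  # eps = 0 is the MLE: estimating eqn solved
        break
    p_hat = p_hat * np.exp(eps_best * phi)
    p_hat /= p_hat.sum() * dg

print(abs(psi(p_hat) - y.mean()))  # small: the update has targeted the mean
```

The plug-in estimate from the initial (wrong) density would be about 0; after the iteration, the plug-in value of $\Psi$ matches the empirical mean, even though the full density was never estimated well.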

**Attribution**
*Source: Link, Question Author: Nathan Kurz, Answer Author: guy*