I am training a classification model with scikit-learn's `MLPClassifier` (a multilayer perceptron). I noticed that the

`lbfgs`

solver (which I take to mean Limited-memory BFGS in scikit-learn) outperforms Adam when the dataset is relatively small (fewer than 100K samples). Can someone provide a concrete justification for that? I couldn't find a good resource that explains the reason behind it. Any input is appreciated. Thank you.
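For context, here is a minimal sketch of the kind of comparison I mean. The dataset size, network size, and hyperparameters are illustrative choices, not the ones from my actual problem:

```python
# Illustrative comparison of MLPClassifier solvers on a small synthetic
# dataset. Dataset and hyperparameters are arbitrary example choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for solver in ("lbfgs", "adam"):
    clf = MLPClassifier(hidden_layer_sizes=(32,), solver=solver,
                        max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[solver] = clf.score(X_te, y_te)
    print(solver, scores[solver])
```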

**Answer**

There are a lot of reasons this could be the case. Off the top of my head I can think of one plausible cause, but without knowing more about the problem it is difficult to say that it is *the one*.

An L-BFGS solver is a true quasi-Newton method in that it estimates the curvature of the parameter space via an approximation of the Hessian. So if your parameter space has plenty of long, nearly-flat valleys then L-BFGS would likely perform well. The downside is the additional cost of performing a rank-two update to the (inverse) Hessian approximation at every step. While this is reasonably fast, it does begin to add up, particularly as the input space grows. This may account for the fact that Adam outperforms L-BFGS for you as you get more data.
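To make the rank-two update concrete, here is a minimal NumPy sketch of the BFGS inverse-Hessian update; L-BFGS applies the same update implicitly from a short history of `(s, y)` pairs rather than storing the full matrix. The function name and the test problem are illustrative, not from scikit-learn:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Rank-two BFGS update of the inverse-Hessian approximation H.

    s = x_new - x_old (the step), y = grad_new - grad_old.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# On a quadratic with Hessian A, the gradient change along a step s is
# exactly y = A s; the update enforces the secant condition H y = s,
# so curvature information accumulates in H.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
H = np.eye(2)
rng = np.random.default_rng(0)
for _ in range(10):
    s = rng.standard_normal(2)
    y = A @ s
    H = bfgs_inverse_update(H, s, y)
print(np.allclose(H @ y, s))  # secant condition holds: True
```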

Adam is a first-order method that attempts to compensate for the fact that it doesn't estimate the curvature by adapting the step size in every dimension. In some sense this is similar to constructing a diagonal Hessian at every step, but it does so cleverly by simply using past gradients. In this way it is still a first-order method, though it has the benefit of acting as though it is second order. The estimate is cruder than that of L-BFGS in that it is only along each dimension and doesn't account for what would be the off-diagonals in the Hessian. If your Hessian is nearly singular then these off-diagonals may play an important role in the curvature, and Adam is likely to underperform relative to L-BFGS.
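As a concrete illustration of the per-dimension step-size adaptation, here is a minimal NumPy sketch of the Adam update rule (the symbols follow the Adam paper; the ill-conditioned quadratic is my illustrative example, not from the question). Note how both coordinates make similar progress even though their curvatures differ by a factor of 100, because each is rescaled by its own second-moment estimate:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # first moment: smoothed gradient
    v = b2 * v + (1 - b2) * grad ** 2  # second moment: per-dimension scale
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    # effective step size is lr / (sqrt(v_hat) + eps) in EACH dimension:
    # a crude diagonal curvature estimate, with no off-diagonal terms.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Ill-conditioned quadratic: curvature 100 along x, 1 along y.
theta = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 501):
    grad = np.array([100.0, 1.0]) * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)  # both coordinates end up near zero
```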

*Attribution — Source: Link, Question Author: Steven, Answer Author: David Kozak*