I’ve been toying around with logistic regression using various batch optimization algorithms (conjugate gradient, Newton-Raphson, and various quasi-Newton methods). One thing I’ve noticed is that sometimes, adding more data to a model can actually make training the model take much less time. Each iteration requires looking at more data points, but the total number of iterations required can drop significantly when adding more data. Of course, this only happens on certain data sets, and at some point adding more data will cause the optimization to slow back down.
Is this a well studied phenomenon? Where can I find more information about why/when this might happen?
With small amounts of data, spurious correlation between the regression inputs is often high. When the inputs are correlated, the likelihood surface is relatively flat along certain directions, and it becomes harder for an optimizer, especially one that does not use the full Hessian (e.g., conjugate gradient or a quasi-Newton method), to find the minimum.
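A minimal sketch of this effect, assuming a simple synthetic setup: two features whose correlation we control, fit by minimizing the logistic negative log-likelihood with scipy's BFGS (a quasi-Newton method that does not form the full Hessian). The data-generating choices (sample sizes, correlation levels, true weights) are illustrative assumptions, not from the original post.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def logistic_nll(w, X, y):
    # Negative log-likelihood: sum(log(1 + exp(Xw))) - y.(Xw), stable form.
    z = X @ w
    return np.sum(np.logaddexp(0.0, z)) - y @ z

def logistic_grad(w, X, y):
    # Gradient: X^T (sigmoid(Xw) - y).
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)

def iterations_to_fit(n, corr):
    # Two features with correlation `corr`, plus an intercept column.
    x1 = rng.standard_normal(n)
    x2 = corr * x1 + np.sqrt(1.0 - corr**2) * rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x1, x2])
    true_w = np.array([0.5, 1.0, -1.0])  # arbitrary "true" coefficients
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)
    res = minimize(logistic_nll, np.zeros(3), args=(X, y),
                   jac=logistic_grad, method="BFGS")
    return res.nit  # number of BFGS iterations

for corr in (0.0, 0.99):
    for n in (100, 10_000):
        print(f"corr={corr:4.2f}  n={n:6d}  iterations={iterations_to_fit(n, corr)}")
```

On data like this, the highly correlated, small-n runs tend to need the most iterations, since the small sample inflates the (already high) collinearity and flattens the likelihood surface; exact counts will vary with the random seed.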
There are some nice graphs and further explanation of how various algorithms perform on data with different amounts of correlation here: http://fa.bianp.net/blog/2013/numerical-optimizers-for-logistic-regression/