Automated ML vs the entire replicability/reproducibility crisis

There is a trend in machine learning implementations to make things easier and easier for implementers, a very natural engineering concern: easy APIs to create any kind of model you want, easy infrastructure to manage versions of data and models, easy deployment of models as APIs. One of these trends is AutoML, an end-to-end process that creates a model (selected from many candidates) from only a few general hyperparameters, hiding more and more of the usual statistical process, to the point of reducing the need to understand the many hard-to-learn nuances of the statistical practices involved.

At the other end of the spectrum are the methods for addressing the replicability crisis occurring in many scientific areas, a crisis driven largely by poor use of statistics: confusing statistical significance with practical (effect-size) significance, p-hacking, HARKing, and other superficial uses of statistics. All of this asks people who use these tools to understand the nuances of statistical thinking more deeply.

Details are missing about the innards of AutoML: is it running an SVM and an LR and an RF with multiple kernels, hyperparameters, etc.? Is it following basic defensive statistics like a Bonferroni correction? Or is it just jumping straight in and picking the best p-value out of them all? A naive sketch of that kind of uncorrected search is below.
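
To make that last scenario concrete, here is a hypothetical sketch (the candidate models and dataset are my own placeholders, not the internals of any actual AutoML system) of a search that tries several model families and keeps whichever scores best on a single validation set, with no correction for the number of comparisons:

```python
# Hypothetical sketch of an uncorrected model search. This is NOT the
# internals of any real AutoML system; the dataset and candidate models
# are placeholder assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

candidates = [
    ("svm-rbf", SVC(kernel="rbf")),
    ("svm-linear", SVC(kernel="linear")),
    ("logistic-regression", LogisticRegression(max_iter=1000)),
    ("random-forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]

# Keep whichever candidate scores best on one validation set, with no
# adjustment (Bonferroni or otherwise) for making four comparisons.
best_name, best_score = max(
    ((name, model.fit(X_train, y_train).score(X_val, y_val))
     for name, model in candidates),
    key=lambda pair: pair[1],
)
print(best_name, best_score)  # the winning score is an optimistic estimate
```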

I’ve set this up as a dichotomy between engineering ease of use and correctness of statistical procedure. AutoML seems like a great thing for creating successful models. But then I wonder whether its designers are not only ignoring the entire history of statistical thinking but actively running away from it.

Are AutoML researchers successfully taking these statistical nuances into account, or are they enabling even more problems by ignoring them (e.g. choosing between too many models for the amount of data)? And conversely, are the statisticians making it harder than necessary to build reputable models? As a side question: is this characterization of AutoML as a statistically problematic procedure even accurate?

I suppose a TL;DR to all this is: is AutoML just p-hacking across all the models?

Answer

I agree with Alex R’s comments, and I’m expanding them into a full answer.

I’ll be talking about “black box” models in this answer, by which I mean machine learning (ML) models whose internal implementations are either not known or not understood. Using some sort of “AutoML” framework would produce a black box model. More generally, many people would consider hard-to-interpret methods such as deep learning and large ensembles to be black boxes.

It’s certainly possible that people could use black boxes in a statistically unrigorous way, but I think the question somewhat misunderstands the typical use case.

Are your model’s components important, or just its outputs?

In many fields, we use regression techniques as a way to try to understand the world. Having a super accurate prediction is not the main goal; usually the goal is more explanatory, e.g. trying to see the effect dosage has on survival rates. Here, getting rigorous, un-hacked measures of significance (e.g. p-values) for the components of your model (e.g. your coefficients/biases) is extremely important. Since the components are what’s important, you should not use a black box!

But there are also many other areas where the main goal is simply the most “accurate” prediction (substitute your favorite performance metric for accuracy). In this case, we don’t really care about the p-values of specific components of our model. What we should care about is the p-value of our model’s performance metric compared to a baseline. This is why you will see people split the data into a training set, a validation set, and a held-out test set. That held-out test set should be looked at only a very small number of times, to avoid p-hacking and/or overfitting.
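
Here is a minimal sketch of that discipline, with a placeholder dataset and model: the test set is carved off first, model selection happens against the validation set, and the test set is scored exactly once at the end.

```python
# Minimal sketch of the train/validation/test discipline described above.
# The dataset and model are placeholder assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Carve off the held-out test set first (here 20%), then split the rest
# into 60% train / 20% validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Consult the validation set as often as model selection requires...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# ...but look at the test set only once, for the final chosen model.
print("test accuracy:", model.score(X_test, y_test))
```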

In short, if you care about using the internal components of your model to make statements about the world, then obviously you should know what those internals are, and you probably should not be using hard-to-interpret or unknown-to-you techniques. But if all you care about is the output of your model, then make sure you have a robust test set (no overlap with the training/validation sets, i.i.d. sampling, etc.) that you don’t look at too often, and you are likely good to go even if your model is a black box. A simple overlap check is sketched below.
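
As a small illustration of the “no overlap” condition, here is a sketch that flags exact duplicate rows between splits; real leakage can be subtler (shared users, time ordering, near-duplicates), so treat it as a baseline check, not a sufficient one:

```python
# Quick sanity check for exact-duplicate leakage between splits.
import numpy as np

def has_overlap(a: np.ndarray, b: np.ndarray) -> bool:
    """Return True if any row of `b` also appears as a row of `a`."""
    seen = {row.tobytes() for row in a}
    return any(row.tobytes() in seen for row in b)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(30, 5))

print(has_overlap(X_train, X_test))       # False: independently drawn rows
print(has_overlap(X_train, X_train[:3]))  # True: rows shared across "splits"
```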

So there are no reproducibility problems in performance-oriented machine learning?

I wanted to be clear about this: there are definitely reproducibility problems in performance-oriented machine learning. If you train thousands of models and compare their performance on the same test set, you are likely getting non-reproducible results. If you take a biased sample for your test set, you are likely getting non-reproducible results. If you have “data leakage”, i.e. overlap between your train/validation sets and your test set, you are likely getting non-reproducible results.
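
To illustrate the first of these failure modes, here is a toy simulation in which every candidate “model” is a pure coin flipper: picking the best of 1,000 such models on one reused test set manufactures apparent skill that does not survive fresh data.

```python
# Toy simulation: 1,000 useless "models" (random guessers) evaluated on
# one fixed test set. Every model's true accuracy is exactly 50% by
# construction, yet the selected best looks noticeably better than chance.
import numpy as np

rng = np.random.default_rng(0)
n_test, n_models = 200, 1000

y_test = rng.integers(0, 2, size=n_test)             # coin-flip labels
preds = rng.integers(0, 2, size=(n_models, n_test))  # coin-flip predictions

accuracies = (preds == y_test).mean(axis=1)
print("best of 1000, reused test set:", accuracies.max())  # well above 0.5

# The selected "model" is still a coin flipper, so on fresh, independent
# data its apparent edge disappears.
y_fresh = rng.integers(0, 2, size=n_test)
fresh_preds = rng.integers(0, 2, size=n_test)
print("same model, fresh data:", (fresh_preds == y_fresh).mean())  # near 0.5
```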

But none of these problems are inherent to the use of black box models. They are the problems of the craftsman, not the tools!

Attribution
Source: Link, Question Author: Mitch, Answer Author: kbrose
