I am taking some statistics and machine learning courses, and I noticed that for model comparison, statistics uses hypothesis tests while machine learning uses metrics. So I was wondering: why is that?
As a matter of principle, there is no necessary tension between hypothesis testing and machine learning. For example, if you train two models, it’s perfectly reasonable to ask whether they have the same or different accuracy (or another statistic of interest) and to perform a hypothesis test.
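To make this concrete, here is a minimal sketch (assuming SciPy is available) of comparing two classifiers evaluated on the same test set with McNemar's test; the discordant counts below are hypothetical:

```python
from scipy.stats import binomtest

# Discordant pairs on a shared test set (hypothetical counts):
b = 30  # examples model A got right and model B got wrong
c = 12  # examples model B got right and model A got wrong

# Under H0 (equal accuracy), discordant outcomes split 50/50,
# so an exact binomial test on b out of (b + c) trials applies.
result = binomtest(b, b + c, p=0.5)
print(f"two-sided p-value: {result.pvalue:.4f}")
```

Only the examples the two models disagree on carry information about which model is more accurate, which is why the test conditions on the discordant pairs.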
But as a matter of practice, researchers do not always do this. I can only speculate about why, but I imagine there are several non-exclusive reasons:
- The scale of data collection is so large that the variance of the statistic is very small. Two models with near-identical scores would be flagged as “statistically different,” even though the magnitude of that difference is unimportant in practice. In a slightly different scenario, knowing with statistical certainty that Model A is 0.001% more accurate than Model B is mere trivia if the cost of deploying Model A exceeds the marginal return implied by the improved accuracy.
- The models are expensive to train. Depending on what quantity is to be tested and how, the test might require retraining a model several times, which could be prohibitive. For instance, cross-validation retrains the same model, typically 3 to 10 times; doing that for a model that costs millions of dollars to train once is likely infeasible.
- The more relevant questions about the generalization of machine learning models are not really about repeating the modeling process in the controlled setting of a laboratory, where data collection and model interpretation are carried out by experts. Many of the more concerning failures of ML arise from deploying models in uncontrolled environments, where the data might be collected differently, the model is applied outside its intended scope, or users craft malicious inputs to obtain specific results.
- The researchers simply don’t know how to do statistical hypothesis testing for their models or statistics of interest.
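The first bullet can be illustrated with a back-of-the-envelope calculation (all numbers hypothetical): at a large enough evaluation-set size, even a 0.001-percentage-point accuracy gap clears the conventional 5% significance threshold.

```python
import math

# Hypothetical: two models, each evaluated on 10 billion examples.
n = 10_000_000_000
acc_a = 0.90001  # model A accuracy
acc_b = 0.90000  # model B accuracy

# Two-proportion z-test under the normal approximation.
p_pool = (acc_a + acc_b) / 2
se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
z = (acc_a - acc_b) / se
print(f"z = {z:.2f}")  # exceeds 1.96, i.e. "significant" at the 5% level
```

The result is statistically significant, yet no one would deploy a new model over a gap this small.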
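As for the retraining cost in the second bullet: k-fold cross-validation fits the model once per fold, so the training bill is multiplied by k. A minimal sketch, assuming scikit-learn and a toy synthetic dataset standing in for real training data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real training set.
X, y = make_classification(n_samples=500, random_state=0)

# cv=5 means five separate fits of the model from scratch; for a
# model costing millions per training run, that cost is also 5x.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(len(scores))  # one accuracy score per fold
```

For a cheap model this is routine; for a frontier-scale model, the same procedure multiplies an already enormous training cost.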