I am currently enrolled in a master's program focused on statistics/econometrics. As part of the program, all students had to do 3 months of research, and last week all groups had to present their research to the rest of the master's students.
Almost every group did both some statistical modelling and some machine learning modelling for their research topic, and every single time, when it came to out-of-sample predictions, the simple machine learning models beat the very sophisticated statistical models that everyone had worked so hard on for the last 3 months. No matter how good anyone's statistical model got, a simple random forest achieved lower out-of-sample errors pretty much every time.
I was wondering whether this is a generally accepted observation: that when it comes to out-of-sample forecasting, there is simply no way to beat a simple random forest or extreme gradient boosting model? These two methods are super simple to implement using R packages, whereas all the statistical models that everyone came up with require quite a lot of skill, knowledge, and effort to estimate.
What are your thoughts on this? Is the only benefit of statistical/econometric models that you gain interpretability? Or were our models just not good enough, so that they failed to significantly outperform simple random forest predictions? Are there any papers that address this issue?
Statistical modeling is different from machine learning. For example, a linear regression is both a statistical model and a machine learning model. So if you compare a linear regression to a random forest, you’re just comparing a simpler machine learning model to a more complicated one. You’re not comparing a statistical model to a machine learning model.
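To make that concrete, here is a minimal, stdlib-only Python sketch on hypothetical simulated data, with k-nearest-neighbour regression standing in for the random forest as the "more complicated" learner. When the true relationship is nonlinear, the flexible learner beats the misspecified linear model on holdout error, even though both are being used purely as prediction machines:

```python
import random

# Hypothetical illustration: a misspecified linear model vs. a flexible
# learner (k-NN standing in for a random forest), judged only by
# out-of-sample error. Pure stdlib, no external packages.
random.seed(1)

xs = [random.uniform(-2.0, 2.0) for _ in range(300)]
ys = [x * x + random.gauss(0.0, 0.2) for x in xs]  # nonlinear truth
x_tr, y_tr = xs[:200], ys[:200]
x_te, y_te = xs[200:], ys[200:]

# Ordinary least squares fit of y on x (the "simple" linear model).
n = len(x_tr)
xb = sum(x_tr) / n
yb = sum(y_tr) / n
sxx = sum((x - xb) ** 2 for x in x_tr)
slope = sum((x - xb) * (y - yb) for x, y in zip(x_tr, y_tr)) / sxx
intercept = yb - slope * xb

def predict_linear(x):
    return intercept + slope * x

def predict_knn(x, k=5):
    # Average the targets of the k nearest training points.
    nearest = sorted(zip(x_tr, y_tr), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(pred):
    return sum((pred(x) - y) ** 2 for x, y in zip(x_te, y_te)) / len(x_te)

mse_lin, mse_knn = mse(predict_linear), mse(predict_knn)
print(mse_lin, mse_knn)  # the flexible learner wins out of sample
```

Nothing here says "statistics vs. machine learning"; it says a low-capacity model loses to a high-capacity one when the truth is outside the low-capacity model's class.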
Statistical modeling provides more than interpretation; it actually gives a model of some population parameter. It rests on a large framework of mathematics and theory, which supplies formulas for things like the variance of coefficients, the variance of predictions, and hypothesis tests. The potential yield of statistical modeling is much greater than that of machine learning, because you can make strong statements about population parameters instead of just measuring error on a holdout set, but it's considerably more difficult to approach a problem with a statistical model.
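As a minimal sketch of what that framework buys you, here is a hand-fit simple linear regression in pure Python on hypothetical simulated data. Beyond a point prediction, the standard statistical formulas yield a standard error for the slope and a t-statistic for testing H0: slope = 0, quantities a pure prediction machine does not hand you:

```python
import math
import random

# Hypothetical simulated data with a known true slope of 0.5.
random.seed(0)
n = 200
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# The residual variance and the slope's standard error come straight from
# the model's distributional assumptions -- this is the extra yield of a
# statistical model over a bare prediction algorithm.
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r ** 2 for r in resid) / (n - 2)
se_slope = math.sqrt(sigma2 / sxx)
t_stat = slope / se_slope  # compare to a t distribution with n - 2 df

print(slope, se_slope, t_stat)
```

With these quantities you can build confidence intervals and run hypothesis tests on the population parameter itself, not just report a holdout error.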