While the results on the private test set cannot be used to refine the model further, isn't model selection over a huge number of models still being performed based on the private test set results? Wouldn't that process alone cause overfitting to the private test set?
According to “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance” by Bailey et al., it is relatively easy to “overfit” when selecting the best of a large number of models evaluated on the same dataset. Is that not exactly what happens with Kaggle’s private leaderboard?
- What is the statistical justification for assuming that the best-performing models on the private leaderboard are also the models that generalize best to out-of-sample data?
- Do companies actually end up using the winning models, or is the private leaderboard just there to define the “rules of the game”, with companies more interested in the insight that arises from the discussion of the problem?
Well, the points you raise are fair, but I think there is a far more concrete issue: people overfitting to the public leaderboard.
After 100 or so submissions, information about the public test set inevitably bleeds into your hyperparameter selection, and you end up overfitting to it. The private leaderboard is necessary precisely to guard against that.
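The selection effect both posts describe is easy to see in a toy simulation (this is my own illustrative sketch, not code from either competition or the Bailey et al. paper): if you score many skill-free "models" on a fixed public test set and pick the best one, its public score looks good purely by chance, while its score on a held-out private set falls back to chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a binary labelling task where every "model" is pure noise.
n_public, n_private, n_models = 1_000, 1_000, 100
y_public = rng.integers(0, 2, n_public)
y_private = rng.integers(0, 2, n_private)

# Each "submission" is a random guesser -- no model has any real skill.
public_scores, private_scores = [], []
for _ in range(n_models):
    preds_public = rng.integers(0, 2, n_public)
    preds_private = rng.integers(0, 2, n_private)
    public_scores.append((preds_public == y_public).mean())
    private_scores.append((preds_private == y_private).mean())

# Model selection based on the public leaderboard alone.
best = int(np.argmax(public_scores))
print(f"best public score: {public_scores[best]:.3f}")  # inflated above 0.5
print(f"its private score: {private_scores[best]:.3f}")  # back near chance
```

The "winning" model's public accuracy sits well above 50% even though nothing was learned, which is exactly why a private leaderboard (scored only once, on unseen data) is the safeguard here.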