We have a dataset with 10,000 manually labeled instances, and a classifier that was trained on all of this data. The classifier was then evaluated on ALL of this data to obtain a 95% success rate.
What exactly is wrong with this approach? Is it just that the 95% figure is not very informative in this setup? Can there still be some value in this number? While I understand that, in theory, it is not a good idea, I don’t have enough experience in this area to be sure on my own. Also note that I have neither built nor evaluated the classifier in question.
Common sense aside, could someone give me a very solid, authoritative reference, saying that this setup is somehow wrong?
All I find on the Internet are toy examples supposed to convey some intuition. Here I have a project by professionals with an established track record, so I can’t just say “this is wrong”, especially since I don’t know for sure.
For example, this page does say:
“Evaluating model performance with the data used for training is not acceptable in data mining because it can easily generate overoptimistic and overfitted models.”
However, this is hardly an authoritative reference. In fact, the quote is plainly wrong: evaluation does not generate models, so it cannot by itself produce an overfitted one. It can produce overoptimistic data scientists who then choose the wrong model, but a particular evaluation strategy has nothing to do with overfitting per se.
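To see how far an accuracy measured on the training data can drift from reality, here is a minimal sketch rather than anything from the project in question. It assumes a hypothetical 1-nearest-neighbour classifier and purely random data (features and labels carry no signal), so the names `nearest_label` and `accuracy` and the 300/100 split are illustrative choices only.

```python
import random

def nearest_label(x, train):
    # 1-nearest-neighbour: return the label of the closest training point.
    closest = min(train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    return closest[1]

def accuracy(data, train):
    # Fraction of points in `data` whose predicted label matches the true label.
    return sum(nearest_label(x, train) == y for x, y in data) / len(data)

rng = random.Random(0)
# Hypothetical data: random features, coin-flip labels, so nothing can be learned.
points = [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(400)]
train, held_out = points[:300], points[300:]

train_acc = accuracy(train, train)        # evaluated on the data it memorised
held_out_acc = accuracy(held_out, train)  # evaluated on data it has never seen

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {held_out_acc:.2f}")
```

Each training point is its own nearest neighbour, so the training accuracy is perfect even though the labels are noise, while the held-out accuracy should sit near chance. A 95% figure obtained on the training data is compatible with anything between these two extremes, which is why it says little on its own.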
@jpl has provided a good explanation of the ideas here. If what you want is just a reference, I would use a solid, basic textbook. Some well-regarded books that cover cross-validation and why it is important are:
Harrell, F. (2010). Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
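For readers who have not met cross-validation before, the idea these books describe can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (two overlapping Gaussian classes), the classifier is a simple nearest-centroid rule, and `fit`, `predict`, and `k_fold_accuracy` are hypothetical helper names, not functions from any library.

```python
import random

def fit(train):
    # Nearest-centroid classifier: store the mean feature value of each class.
    means = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict(means, x):
    # Predict the class whose mean is closest to x.
    return min(means, key=lambda label: abs(x - means[label]))

def k_fold_accuracy(data, k, seed=0):
    # Shuffle, split into k folds, and score each fold with a model
    # trained only on the other k - 1 folds.
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = fit(train)
        scores.append(sum(predict(model, x) == y for x, y in test) / len(test))
    return sum(scores) / k

rng = random.Random(1)
# Hypothetical data: class 0 centred at 0.0, class 1 centred at 1.0, unit noise.
data = [(rng.gauss(label, 1.0), label) for label in (0, 1) for _ in range(500)]

cv_acc = k_fold_accuracy(data, k=5)
print(f"5-fold cross-validated accuracy: {cv_acc:.2f}")
```

Every instance is scored exactly once, and always by a model that never saw it during training, so the averaged accuracy estimates performance on new data rather than the model's ability to reproduce its own training labels.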