Say I commit the following sins while building a predictive model:
I take my dataset and split it into four subsets: three for training (Train_A, Train_B, and Train_C) and one for validation.
I train an initial model (Model_A) on Train_A. Because the goal is to maximize out-of-sample prediction accuracy, I use bias-variance-balancing techniques like cross-validation.
I generate predictions from Model_A on Train_B and record the prediction errors.
Next, I train a second model (Model_B) on Train_B, but I weight the observations by the magnitude of the prediction errors from Model_A. In other words, Model_B is told to focus most on learning to predict the observations that Model_A was really bad at predicting. Again, the goal is out-of-sample accuracy, so a technique like cross-validation is used.
I generate predictions from Model_A and Model_B on Train_C. These are used to explore the best way to combine the predictions (e.g., weighted average) from both models to (hopefully) increase out-of-sample prediction accuracy.
After determining the best way to weight the predictions from Model_A and Model_B, I estimate the out-of-sample accuracy using the validation set.
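For concreteness, the steps above could be sketched roughly as follows. This is only an illustration under several assumptions that are not part of the procedure itself: scikit-learn is available, the data are synthetic (`make_regression`), both models are ridge regressions tuned with `RidgeCV`, the observation weights are the absolute errors rescaled to mean 1, and the combination is a simple weighted average chosen on a grid by mean squared error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real dataset (assumption for illustration).
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# Four disjoint subsets: Train_A, Train_B, Train_C, and a validation set.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
X_a, X_bc, y_a, y_bc = train_test_split(X_rest, y_rest, test_size=2 / 3, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_bc, y_bc, test_size=0.5, random_state=0)

alphas = np.logspace(-3, 3, 13)  # candidate ridge penalties (hypothetical grid)

# Model_A: fit on Train_A, with the penalty chosen by cross-validation.
model_a = RidgeCV(alphas=alphas).fit(X_a, y_a)

# Model_A's absolute errors on Train_B become observation weights for Model_B.
err_b = np.abs(y_b - model_a.predict(X_b))
weights_b = err_b / err_b.mean()  # rescaled so the average weight is 1

# Model_B: fit on Train_B, emphasising the points Model_A predicted badly.
model_b = RidgeCV(alphas=alphas).fit(X_b, y_b, sample_weight=weights_b)

# Train_C: pick the blend weight w in w * A + (1 - w) * B with the lowest MSE.
pred_a_c = model_a.predict(X_c)
pred_b_c = model_b.predict(X_c)
grid = np.linspace(0.0, 1.0, 21)
mse_c = [mean_squared_error(y_c, w * pred_a_c + (1 - w) * pred_b_c) for w in grid]
best_w = float(grid[int(np.argmin(mse_c))])

# Validation set: touched once, only to estimate out-of-sample error.
blend_val = best_w * model_a.predict(X_val) + (1 - best_w) * model_b.predict(X_val)
val_mse = mean_squared_error(y_val, blend_val)
print(f"best_w = {best_w:.2f}, validation MSE = {val_mse:.1f}")
```

Because each stage only sees data that the earlier stages were not fitted on, the validation estimate in this sketch is not contaminated by the weighting or blending decisions.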
Main Question: Am I damned? Is this approach inherently and irrecoverably prone to overfitting? Or, is there a way to use the errors from Model_A to inform how Model_B is trained in such a way that the strengths of Model_B address the weaknesses of Model_A?
Secondary Questions: Are there particular techniques or algorithms that are better/worse at extracting value from this kind of approach? For example, I wouldn’t be surprised if there are some NN techniques that inherently do this kind of thing and, therefore, wouldn’t benefit at all from this approach whereas something less flexible (like regularized regression) could potentially benefit greatly in comparison. What other thoughts or advice would you provide to someone who wishes to take this approach?
[Edit: I feel like I walked into a McMenamins and pitched the idea of a Microbrewery to the bartender, haha! Thanks everyone for your very kind and helpful comments!]
As noted in the comments, you’ve rediscovered boosting. There is nothing wrong with this approach, but it’s usually easier and safer to use a method that someone else has already implemented and battle-tested than to start from scratch. If you really want to use your own approach, I’d encourage you to first run an out-of-the-box implementation of boosting (AdaBoost, XGBoost, CatBoost, etc.) and use it as a benchmark.
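A benchmark of that kind might look like the sketch below. It assumes scikit-learn and synthetic regression data, and it uses scikit-learn's `AdaBoostRegressor` and `GradientBoostingRegressor` with mostly default settings as stand-ins; XGBoost or CatBoost could be dropped into the same loop if they are installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for the real dataset (assumption for illustration).
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)

# Off-the-shelf boosting models to compare a hand-rolled scheme against.
benchmarks = {
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

cv_mse = {}
for name, model in benchmarks.items():
    # scikit-learn reports negated MSE so that larger is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: 5-fold CV MSE = {cv_mse[name]:.1f}")
```

If the custom two-model scheme cannot beat these numbers on the same data and the same metric, that is a good sign the existing implementations are the better starting point.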