I’m dealing with a forecasting problem that has a time component, similar to the following: I have 2 years of data on daily performance for thousands of agents, and want to predict the future performance per agent using an ensemble of decision trees (e.g. RandomForestRegressor, GBR or XGBoost). My question is about the best performance-evaluation methodology when the underlying ground-truth function changes significantly over time (standard k-fold cross-validation and OOB error are both inappropriate here, because the lag features would leak future information into training).
I’m using TimeSeriesSplit as my methodology for both model performance estimation and hyperparameter selection. Optimal hyperparameters for earlier splits differ from later splits. Therefore, I want to emphasize performance of more recent predictions when tuning hyperparameters. (To exaggerate the problem, consider the use of stock market data over 20 years.) To do this, I believe it makes sense to weight the scores of recent splits higher; e.g. in a 15-fold split, performance of the first split (using the oldest data and smallest sample) would have the least weight, and the last split (which uses the most recent data and has the largest sample) would have the greatest weight.
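To make the weighting scheme concrete, here is a minimal sketch of what I mean (synthetic data; the linearly increasing weights are just one assumed choice, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # placeholder features
y = X @ rng.normal(size=5) + rng.normal(size=1000)

n_splits = 15
tscv = TimeSeriesSplit(n_splits=n_splits)

# Assumed scheme: weight grows linearly with split index,
# so the oldest split gets weight 1 and the newest gets 15.
weights = np.arange(1, n_splits + 1, dtype=float)
weights /= weights.sum()

scores = []
for train_idx, test_idx in tscv.split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# Recent splits dominate the aggregate score.
weighted_mse = float(np.dot(weights, scores))
```

The shape of the weight curve (linear, exponential decay, etc.) is exactly the open question, so treat `weights` above as a placeholder.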
A classic, naïve approach to finding optimal hyperparameters using `TimeSeriesSplit` (sklearn) optimizes the model’s hyperparameters for average performance weighted equally over all splits, from the first split to the last. This is clearly not as good as sacrificing some accuracy in predictions from long ago if, as a tradeoff, we can sufficiently improve predictions on more-recent periods. Furthermore, the naïve approach to hyperparameter tuning with `TimeSeriesSplit` will use the oldest data for training in all 15 splits, while the most recent data contributes to model evaluation only once (in the final split).
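For reference, the naïve, equally weighted setup I’m describing is the stock pattern below (synthetic data, illustrative parameter grid):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                # placeholder features
y = X.sum(axis=1) + rng.normal(size=500)

# GridSearchCV averages the score over all splits with EQUAL weight,
# which is exactly the behaviour I want to change.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [3, 6]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```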
How should one choose split weights to implement such a custom time-series cross-validation methodology in scikit-learn? I’d like to use this “weighted TimeSeriesSplit” in my pipeline.
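One way I can imagine implementing it (I’m not sure it’s the best): since, as far as I know, `GridSearchCV` has no built-in per-split weighting, run the grid search manually and combine the per-split scores with my own weights. A sketch under those assumptions:

```python
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))                # placeholder features
y = X.sum(axis=1) + rng.normal(size=600)

n_splits = 5
splits = list(TimeSeriesSplit(n_splits=n_splits).split(X))

# Assumed weight scheme (linear); the right choice is the question.
weights = np.arange(1, n_splits + 1, dtype=float)
weights /= weights.sum()

param_grid = {"max_depth": [3, 6], "n_estimators": [30, 60]}
best_params, best_score = None, np.inf
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    fold_mse = []
    for train_idx, test_idx in splits:
        model = RandomForestRegressor(random_state=0, **params)
        model.fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    # Weighted mean: recent folds count more toward model selection.
    score = float(np.dot(weights, fold_mse))
    if score < best_score:
        best_params, best_score = params, score
```

This works but loses `GridSearchCV` conveniences (parallelism, `cv_results_`, pipeline integration), which is why I’m asking how people normally choose the weights and wire this into scikit-learn.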