I am trying to use Random Forest Regression in scikit-learn. The problem is that I am getting a really high test error:
train MSE, 4.64, test MSE: 252.25.
This is what my data looks like (blue: real data, green: predicted):
I am using 90% for training and 10% for test. This is the code I am using after trying several parameter combinations:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, max_features=2, max_depth=1000,
                           min_samples_leaf=1, min_samples_split=2, n_jobs=-1)
rf.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, rf.predict(X_test))
train_mse = mean_squared_error(y_train, rf.predict(X_train))
print("train MSE, %.4f, test MSE: %.4f" % (train_mse, test_mse))
plot(rf.predict(X))
plot(y)
What are possible strategies to improve my fitting? Is there something else I can do to extract the underlying model? It seems incredible to me that after so many repetitions of the same pattern the model behaves so badly with new data. Do I have any hope at all trying to fit this data?
I think you are using the wrong tool; if your whole X is equivalent to the index, you basically have a sampled function f: R → R and are trying to extrapolate it. Machine learning of this kind is all about interpolating the training history, so it is not surprising that it fails spectacularly in this case.
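To make the extrapolation problem concrete, here is a minimal sketch (with synthetic data, not the asker's) showing that a random forest cannot predict values outside the range of targets it saw during training, because each tree can only average the leaf values it has stored:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic example: a simple linear trend on indices 0..99.
X_train = np.arange(0, 100).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Inside the training range the fit is fine...
print(rf.predict(np.array([[50]])))   # close to 100

# ...but outside it the prediction is clamped near the largest training
# target (about 198), nowhere near the true continuation (400).
print(rf.predict(np.array([[200]])))
```

This is exactly the failure mode in the plot above: the green curve flattens out as soon as the test indices leave the training range.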
What you need is time series analysis (i.e. extracting the trend, analysing the spectrum, and autoregressing or HMM-modelling the remainder) or physics (i.e. asking whether there is an ODE that could produce such output, and trying to fit its parameters via conserved quantities).
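A hedged sketch of the "extract the trend, then autoregress the rest" idea, using only NumPy least squares (the series y here is synthetic; for real work a library such as statsmodels would be the better tool):

```python
import numpy as np

# Synthetic series: linear trend + periodic component + noise.
rng = np.random.RandomState(1)
t = np.arange(200)
y = 0.05 * t + np.sin(2 * np.pi * t / 20) + 0.1 * rng.randn(200)

# 1. Extract the linear trend with least squares and remove it.
coeffs = np.polyfit(t, y, deg=1)
resid = y - np.polyval(coeffs, t)

# 2. Fit an AR(p) model to the residual: predict resid[j+p]
#    from the p preceding values, again via least squares.
p = 25
X = np.column_stack([resid[i:len(resid) - p + i] for i in range(p)])
target = resid[p:]
ar_coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)

# 3. Forecast: iterate the AR recursion forward, then add the trend back.
history = list(resid[-p:])
future = []
for _ in range(50):
    nxt = float(np.dot(ar_coeffs, history[-p:]))
    future.append(nxt)
    history.append(nxt)

t_future = np.arange(len(y), len(y) + 50)
forecast = np.polyval(coeffs, t_future) + np.array(future)
```

Unlike the forest, this forecast keeps following the trend and the learned oscillation beyond the observed indices, which is the behaviour the question is actually after.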