# Random forest is overfitting

I am trying to use Random Forest Regression in scikit-learn. The problem is that I am getting a really high test error:

train MSE: 4.64, test MSE: 252.25

This is what my data looks like (blue: real data, green: predicted):

I am using 90% for training and 10% for test. This is the code I am using after trying several parameter combinations:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

rf = RandomForestRegressor(n_estimators=10, max_features=2, max_depth=1000,
                           min_samples_leaf=1, min_samples_split=2, n_jobs=-1)
rf.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, rf.predict(X_test))
train_mse = mean_squared_error(y_train, rf.predict(X_train))

print("train MSE: %.4f, test MSE: %.4f" % (train_mse, test_mse))
plt.plot(rf.predict(X))
plt.plot(y)


What are possible strategies to improve my fitting? Is there something else I can do to extract the underlying model? It seems incredible to me that after so many repetitions of the same pattern the model behaves so badly with new data. Do I have any hope at all trying to fit this data?

I think you are using the wrong tool. If your whole X is equivalent to the index, then you basically have a sampled function $f:\mathbb{R}\rightarrow\mathbb{R}$ and are trying to extrapolate it. Machine learning is mostly about interpolating within the range of the training data, so it is not surprising that it fails spectacularly in this case.
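You can see this directly with a toy example (a minimal sketch, not your data): a tree can only predict values stored in its leaves, so a forest trained on $f(x)=x$ over $[0,10]$ saturates near the largest training target when asked about points outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on f(x) = x sampled over [0, 10]
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = X_train.ravel()

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Inside the training range the fit is near-perfect
pred_inside = rf.predict([[5.0]])[0]

# Outside it, every tree falls into its rightmost leaf, so the
# prediction stays near y_train.max() instead of following the trend
pred_outside = rf.predict([[20.0]])[0]

print(pred_inside, pred_outside)
```

Here `pred_inside` lands close to 5, while `pred_outside` stays around 10 rather than 20 — the forest interpolates but cannot extrapolate. A linear model (or modelling the trend explicitly and using the forest only on the residuals) would be the usual remedy for data like this.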