For predictive modeling, do we need to concern ourselves with statistical concepts such as random effects and non-independence of observations (repeated measures)? For example:
I have data from 5 direct mail campaigns (run over the course of a year) with various attributes and a flag for purchase. Ideally, I would use all this data combined to build a model for purchase given customer attributes at the time of the campaign. The reason is that the event of purchase is rare and I would like to use as much information as possible. A given customer could appear in anywhere from 1 to 5 of the campaigns, meaning the records are not independent.
Does this matter when using:
1) A machine learning approach (e.g. tree, MLP, SVM)
2) A statistical approach (logistic regression)?
My thought about predictive modeling has been: if the model works, use it. So I have never really considered the importance of assumptions, but thinking about the case I describe above got me wondering.
Take machine learning algorithms such as an MLP and SVM. These are used successfully to model binary events like my example above, but also time-series data that are clearly correlated. However, many use loss functions that are likelihoods, derived assuming the errors are iid. For example, gradient boosted trees in R's gbm use a deviance loss function derived from the binomial (page 10).
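To make the iid point concrete: the binomial deviance is just twice the negative Bernoulli log-likelihood, summed over observations as if each were an independent draw. A minimal NumPy sketch (illustrative values, not from any real campaign):

```python
import numpy as np

def binomial_deviance(y, p):
    # Binomial (Bernoulli) deviance: -2 * log-likelihood.
    # The sum over observations implicitly assumes each record is
    # an independent Bernoulli trial -- the iid assumption at issue.
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 0, 1])           # purchase flags
p = np.array([0.8, 0.2, 0.3, 0.6])   # predicted purchase probabilities
print(binomial_deviance(y, p))
```

If the same customer contributes several of these rows, the likelihood still multiplies their contributions together as if they were independent, which is exactly the concern raised above.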
I have been wondering this myself, and here are my tentative conclusions. I would be happy if anyone could supplement or correct them, or point me to references on this topic.
If you want to test hypotheses about logistic regression coefficients by checking statistical significance, you need to model the correlation across observations (or otherwise correct for non-independence) because otherwise your standard errors will be too small, at least when you are considering within-cluster effects. But regression coefficients are unbiased even with correlated observations, so it should be fine to use such a model for prediction.
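A toy simulation illustrates why naive standard errors understate uncertainty under clustering. Here each customer has a random intercept shared across their 5 campaign records (all numbers are hypothetical), and we compare the iid standard error of the mean with a cluster-aware one that collapses to one value per customer:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 customers, each appearing in 5 campaigns; a customer-level
# random effect induces within-customer correlation.
n_cust, n_rep = 200, 5
cust_effect = rng.normal(0, 1, n_cust)            # random intercepts
noise = rng.normal(0, 1, (n_cust, n_rep))
y = cust_effect[:, None] + noise                  # rows are correlated

flat = y.ravel()
# Naive SE treats all 1000 records as independent observations.
naive_se = flat.std(ddof=1) / np.sqrt(flat.size)
# Cluster-aware SE uses one mean per customer (200 independent units).
cluster_se = y.mean(axis=1).std(ddof=1) / np.sqrt(n_cust)

print(naive_se, cluster_se)  # the naive SE is noticeably smaller
```

The same inflation carries over to within-cluster effects in a logistic regression; cluster-robust (sandwich) standard errors or a GEE-style model are the usual corrections there.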
In predictive modeling, you should not need to explicitly account for the correlation when training your model, whether you are using logistic regression or some other approach. However, if you want to use a holdout set for validation or computation of out-of-sample error, you would want to ensure that observations for each individual appeared only in one set, either training or validation but not both. Otherwise your model will be predicting for individuals it already has some information about and you’re not getting a true read on the out-of-sample classification ability.
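The grouped-holdout idea above can be sketched by splitting on customer IDs rather than rows, so no individual straddles the training and validation sets (the IDs and features here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: each row is one (customer, campaign) record.
customer_id = np.repeat(np.arange(100), 3)        # 100 customers x 3 campaigns
X = rng.normal(size=(customer_id.size, 4))
y = rng.integers(0, 2, customer_id.size)

# Split by customer, not by row: permute the unique IDs,
# then send 80 customers to training and 20 to validation.
ids = rng.permutation(np.unique(customer_id))
train_ids = set(ids[:80])
train_mask = np.array([c in train_ids for c in customer_id])

X_train, y_train = X[train_mask], y[train_mask]
X_valid, y_valid = X[~train_mask], y[~train_mask]

# Every customer lands in exactly one of the two sets.
assert not (set(customer_id[train_mask]) & set(customer_id[~train_mask]))
```

scikit-learn's GroupShuffleSplit and GroupKFold do the same thing when you pass the customer ID as the `groups` argument.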