Traditionally, we use mixed models to model longitudinal data, i.e. data like:
id  obs  age   treatment_lvl  yield
1   0    11    M              0.2
1   1    11.5  M              0.5
1   2    12    L              0.6
2   0    17    H              1.2
2   1    18    M              0.9
We can assume a random intercept or slope for each person. However, the problem I'm trying to solve involves huge datasets (millions of persons with one month of daily observations, i.e. about 30 observations per person), and I'm not aware of any packages that can handle data at this scale.
I have access to Spark/Mahout, but they do not offer mixed models. My question is: is there any way I can modify my data so that I can use RandomForest or SVM to model this dataset?
Are there any feature engineering techniques I can leverage to help RF/SVM account for the autocorrelation?
I have some potential methods in mind, but I cannot afford the time to implement them in Spark.
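For concreteness, the kind of feature engineering being asked about might look like the following sketch (synthetic data; the column names echo the toy example, and the choice of three lags plus a running per-person mean is illustrative, not prescriptive):

```python
# Sketch: flattening panel data into rows RF/SVM can consume.
# Lag features capture autocorrelation; a running per-person mean
# stands in for a random intercept. The lag window of 3 is arbitrary.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_persons, n_obs = 100, 30
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_persons), n_obs),
    "obs": np.tile(np.arange(n_obs), n_persons),
})
# Simulate a random intercept per person plus a common time trend.
person_effect = rng.normal(0, 1, n_persons)
df["y"] = person_effect[df["id"]] + 0.1 * df["obs"] + rng.normal(0, 0.3, len(df))

g = df.groupby("id")["y"]
for k in (1, 2, 3):
    df[f"y_lag{k}"] = g.shift(k)                 # within-person lags
df["y_person_mean"] = g.transform(lambda s: s.expanding().mean().shift(1))

train = df.dropna()                              # drops each person's first 3 days
X = train[["obs", "y_lag1", "y_lag2", "y_lag3", "y_person_mean"]]
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, train["y"])
```

The `groupby`/`shift` steps translate directly to Spark window functions, which is one reason this lag-based encoding is popular at scale.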
If you only have a few variables, like in the example, then you should have no problem with some variant of a standard mixed model.
Where machine learning techniques really shine is when you have a lot of variables and want to model nonlinearities and interactions among them. Few ML approaches have been developed that can do this with longitudinal data. RNNs are one option, though they are generally optimized for time-series problems rather than panel data.
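Feeding panel data to an RNN is mostly a reshaping exercise: each person becomes one sequence. A minimal sketch (the sizes are illustrative; rows must already be sorted by person and day):

```python
# Sketch: long-format panel data -> (persons, time, features) tensor,
# the shape sequence models expect. Sizes here are illustrative.
import numpy as np

n_persons, n_obs, n_features = 1000, 30, 4
long = np.arange(n_persons * n_obs * n_features, dtype=float)
long = long.reshape(n_persons * n_obs, n_features)  # one row per (id, day)

# Assumes rows are sorted by (id, day); each person is one sequence.
seq = long.reshape(n_persons, n_obs, n_features)
```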
In principle, a feed-forward neural network is a (generalized) linear model, with regressors that are nonlinear functions of the input data. If the derived regressors — the top layer of the model before the output — are considered the nonparametric part, then there is nothing stopping you from adding parametric structure along with it — perhaps in the form of random effects.
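To make the idea concrete, here is a toy numpy sketch of that decomposition: an (untrained, random-feature) hidden layer supplies the nonparametric derived regressors, and per-person intercept dummies supply the parametric part, with everything fit jointly by least squares. All sizes, the tanh activation, and the simulated data are assumptions for illustration:

```python
# Sketch: derived regressors (random tanh features) + parametric
# per-person intercepts, fit jointly by least squares.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_obs, n_hidden = 50, 30, 20

ids = np.repeat(np.arange(n_persons), n_obs)
t = np.tile(np.linspace(0, 1, n_obs), n_persons)
intercepts = rng.normal(0, 1, n_persons)          # true person effects
y = intercepts[ids] + np.sin(3 * t) + rng.normal(0, 0.1, ids.size)

X = np.column_stack([t, t**2])                    # raw inputs
W = rng.normal(size=(X.shape[1], n_hidden))       # untrained hidden layer
H = np.tanh(X @ W + rng.normal(size=n_hidden))    # derived regressors
Z = np.eye(n_persons)[ids]                        # per-person intercept dummies

design = np.hstack([H, Z])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
```

In a real implementation the hidden layer would be trained and the person effects shrunk toward zero (a proper random effect) rather than estimated as fixed dummies, but the structure of the design matrix is the same.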
This hasn't been implemented for classification problems, however, which I assume is your setting, since you are considering SVM as a candidate.