I am currently working on some time series data, and I know I can use a LOESS model to smooth it.
The data is written to a fixed-length vector of 1000 values that acts as a queue,
updated every 15 minutes:
each new value is pushed into the vector while the oldest value is popped out.
I can rerun the whole model on a scheduler, i.e. retrain the LOESS model every 15
minutes on all 1000 values. However, that is very inefficient: each time, only one
value is new while the other 999 values are the same as last time.
So how can I achieve better performance?
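Roughly, the current approach looks like this (a minimal numpy-only sketch with a toy tricube-weighted LOESS; the `WINDOW` size, `frac`, and the synthetic data are illustrative stand-ins, not from my actual setup):

```python
# Toy sketch of the "refit everything every 15 minutes" baseline.
from collections import deque

import numpy as np

WINDOW = 200  # stand-in for the 1000-value queue

def loess_fit(x, y, frac=0.3):
    """Refit the whole window: one local weighted linear fit per point."""
    k = max(2, int(frac * len(x)))
    out = np.empty(len(y), dtype=float)
    for i, x0 in enumerate(x):
        d = np.abs(x - x0)
        idx = np.argsort(d)[:k]                      # k nearest neighbors
        w = (1.0 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights
        coef = np.polyfit(x[idx], y[idx], 1, w=np.sqrt(w))
        out[i] = np.polyval(coef, x0)
    return out

rng = np.random.default_rng(0)
buf = deque(rng.normal(size=WINDOW).cumsum(), maxlen=WINDOW)

def on_new_value(value):
    buf.append(value)                       # oldest value pops automatically
    y = np.asarray(buf)
    return loess_fit(np.arange(len(y)), y)  # full refit on every tick

fit = on_new_value(0.5)
print(fit.shape)  # (200,)
```

The full refit redoes all the neighbor searches and local fits even though only one point changed, which is exactly the waste I would like to avoid.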
Let me re-formulate this into something more familiar to me. ARIMA is an analog of a PID controller: the I is the integral term, the MA is the P term, and the AR can be expressed as difference equations, which are the D term. LOESS is an analog of least-squares fitting (its high-tech big brother, really).
So if I wanted to improve a second-order model (PID), what could be done?
- First, I could use a Kalman Filter to update the model with a single
piece of new information.
- I could also look at something called “gradient boosted trees”.
  Using an analog of them, I would build a second ARIMA model whose
  inputs are the raw inputs fed to the first, augmented with the
  errors of the first.
- I would consider looking at the PDF of the errors for multiple modes.
  If I could cluster the errors, then I might want to split models, or
  use a mixture model to separate the inputs into sub-models. The
  sub-models might handle the local phenomenology better
  than a single large-scale model.
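The first bullet can be sketched as a local-level (random-walk) Kalman filter, which absorbs each new reading in O(1) instead of refitting a whole window. This is a minimal sketch, assuming scalar data; the noise variances `q` and `r` are made-up tuning values, not something from your problem:

```python
# One-state Kalman filter: level follows a random walk, observed with noise.
class LocalLevelKF:
    def __init__(self, x0=0.0, p0=1.0, q=1e-3, r=1e-1):
        self.x, self.p, self.q, self.r = x0, p0, q, r

    def update(self, z):
        # Predict: a random-walk level does not move, only uncertainty grows.
        p_pred = self.p + self.q
        # Correct: blend prediction and measurement via the Kalman gain.
        k = p_pred / (p_pred + self.r)
        self.x = self.x + k * (z - self.x)
        self.p = (1.0 - k) * p_pred
        return self.x

kf = LocalLevelKF()
estimates = [kf.update(z) for z in [1.0, 1.1, 0.9, 1.05, 1.0]]
```

Each new sample costs a handful of arithmetic operations, which is the efficiency win over re-running the fit on all 1000 values.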
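The second bullet, in miniature: a second model trained on the first model's inputs augmented with the first model's errors. Plain least squares stands in for ARIMA here, and all the names and synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(300.0)
y = 0.05 * t + np.sin(t / 10.0) + rng.normal(0, 0.1, t.size)

# Stage 1: a deliberately weak model (straight line in time).
A1 = np.column_stack([np.ones_like(t), t])
c1, *_ = np.linalg.lstsq(A1, y, rcond=None)
err1 = y - A1 @ c1

# Stage 2: the same raw input, augmented with the lagged stage-1 errors.
lagged_err = np.concatenate([[0.0], err1[:-1]])
A2 = np.column_stack([A1, lagged_err])
c2, *_ = np.linalg.lstsq(A2, y, rcond=None)
pred2 = A2 @ c2
```

Because stage 1 leaves strongly autocorrelated residuals (the sine wave), feeding the lagged error back in lets stage 2 soak up structure the first model missed, in the same spirit as boosting.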
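For the third bullet, checking the error PDF for multiple modes can be done with a small two-component Gaussian mixture fitted by EM. A numpy-only sketch on synthetic bimodal errors (the component count and the fake residuals are assumptions for illustration):

```python
import numpy as np

def fit_gmm2(e, iters=200):
    """EM for a 2-component 1-D Gaussian mixture over residuals e."""
    mu = np.array([e.min(), e.max()], dtype=float)  # spread-out init
    var = np.full(2, e.var())
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each residual.
        dens = pi / np.sqrt(2 * np.pi * var) * \
            np.exp(-0.5 * (e[:, None] - mu) ** 2 / var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        n_k = resp.sum(axis=0)
        pi = n_k / len(e)
        mu = (resp * e[:, None]).sum(axis=0) / n_k
        var = (resp * (e[:, None] - mu) ** 2).sum(axis=0) / n_k
    return pi, mu, var

rng = np.random.default_rng(1)
errors = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(3, 0.5, 500)])
pi, mu, var = fit_gmm2(errors)
```

If the fitted means are clearly separated, that is evidence for splitting into sub-models and routing each incoming point to the component that claims it.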
One of the questions that I have failed to ask is “what does performance mean?” If we do not have a clearly stated measure of goodness, then there is no way to tell whether a candidate method “improves” anything. It sounds like you want better modeling, shorter compute time, and more efficient use of information. Having an ephemeris for the actual data can also inform this: if you are modeling wind, then you know where to look for augmenting models, or can find transformations of your data that are useful.