# Data Augmentation strategies for Time Series Forecasting

I’m considering two strategies for “data augmentation” in time-series forecasting.

First, some background. A predictor $$P$$ that forecasts the next step of a time series $$\lbrace A_i\rbrace$$ is a function that typically depends on two things: the past states of the time series and the past states of the predictor itself:

$$P(\lbrace A_{i\leq t-1}\rbrace,P_{S_{t-1}})$$

If we want to adjust/train our system to obtain a good $$P$$, we need enough data. Sometimes the available data is not enough, so we consider data augmentation.

First approach

Suppose we have the time series $$\lbrace A_i \rbrace$$, with $$1 \leq i \leq n$$. Suppose also that we have an $$\epsilon$$ that meets the following condition: $$0<\epsilon < |A_{i+1} - A_i| \; \forall i \in \lbrace 1, \ldots,n-1\rbrace$$.

We can construct a new time series $$\lbrace B_i = A_i+r_i\rbrace$$, where $$r_i$$ is a realization of the distribution $$N(0,\frac{\epsilon}{2})$$.
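A minimal sketch of this construction, in case it helps to make it concrete. The function name `jitter` is made up for illustration. It assumes that $$\frac{\epsilon}{2}$$ is meant as the standard deviation of the noise (the text does not say whether it is the standard deviation or the variance), and that, when no $$\epsilon$$ is given, half of the smallest consecutive gap is an acceptable choice.

```python
import numpy as np

def jitter(a, rng, eps=None):
    """Return (B, eps) where B_i = A_i + r_i and r_i ~ N(0, eps/2)."""
    a = np.asarray(a, dtype=float)
    gaps = np.abs(np.diff(a))          # |A_{i+1} - A_i| for i = 1..n-1
    if eps is None:
        # Assumption: any value strictly below the smallest gap would do;
        # half of the smallest gap is an arbitrary choice.
        eps = 0.5 * gaps.min()
    if not 0 < eps < gaps.min():
        raise ValueError("eps must satisfy 0 < eps < min |A_{i+1} - A_i|")
    # Assumption: eps / 2 is used as the standard deviation of the noise.
    noise = rng.normal(0.0, eps / 2, size=a.shape)
    return a + noise, eps

rng = np.random.default_rng(0)
A = np.array([1.0, 2.5, 2.0, 4.0, 3.2])
B, eps = jitter(A, rng)
```

Note that the condition on $$\epsilon$$ cannot be met if two consecutive values are equal, in which case the sketch raises an error.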

Then, instead of minimizing the loss function only over $$\lbrace A_i \rbrace$$, we also minimize it over $$\lbrace B_i \rbrace$$. So, if the optimization process takes $$m$$ steps, we have to “initialize” the predictor $$2m$$ times, and we’ll compute approximately $$2m(n-1)$$ predictor internal states.

Second approach

We compute $$\lbrace B_i \rbrace$$ as before, but we update the predictor’s internal state using only $$\lbrace A_i \rbrace$$, not $$\lbrace B_i \rbrace$$. The two series are used together only when computing the loss function, so we’ll compute approximately $$m(n-1)$$ predictor internal states.

Of course, there is less computational work here (although the algorithm is a little bit uglier), but it does not matter for now.
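To make the difference in state computations concrete, here is a rough sketch of one loss evaluation under each approach. Everything in it is an assumption made for illustration: the toy predictor (an exponentially smoothed state `s` with an output weight `w`), the squared-error loss, and in particular the reading of the second approach as "predict from states driven by $$\lbrace A_i \rbrace$$, then score against the targets of both series".

```python
import numpy as np

def run_states(x, alpha):
    """Toy predictor state: s_t = alpha * s_{t-1} + (1 - alpha) * x_t.

    Returns the n-1 states used to predict x_2 .. x_n.
    """
    s, states = 0.0, []
    for x_t in x[:-1]:
        s = alpha * s + (1 - alpha) * x_t
        states.append(s)
    return np.array(states)

def loss_first(A, B, alpha, w):
    """First approach: the predictor is run separately on A and on B."""
    total, n_states = 0.0, 0
    for series in (A, B):
        states = run_states(series, alpha)
        total += np.mean((w * states - series[1:]) ** 2)
        n_states += len(states)
    return total, n_states

def loss_second(A, B, alpha, w):
    """Second approach (one possible reading): states come from A only,
    and B enters only through the loss."""
    states = run_states(A, alpha)
    pred = w * states
    total = np.mean((pred - A[1:]) ** 2) + np.mean((pred - B[1:]) ** 2)
    return total, len(states)

rng = np.random.default_rng(0)
A = np.array([1.0, 2.5, 2.0, 4.0, 3.2])
B = A + rng.normal(0.0, 0.125, size=A.shape)

loss1, states1 = loss_first(A, B, alpha=0.5, w=1.0)   # 2(n-1) states
loss2, states2 = loss_second(A, B, alpha=0.5, w=1.0)  # (n-1) states
```

Per loss evaluation, the first version computes $$2(n-1)$$ internal states and the second $$n-1$$, which gives the $$2m(n-1)$$ and $$m(n-1)$$ totals over $$m$$ optimization steps.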

The doubt

The problem is: from a statistical point of view, which is the “best” option, and why?

My intuition tells me that the first one is better, because it helps to “regularize” the weights associated with the internal state, while the second one only helps to regularize the weights associated with the observed time series’ past.

Extra:

• Any other ideas to do data augmentation for time series forecasting?
• How to weight the synthetic data in the training set?

Any other ideas to do data augmentation for time series forecasting?

I’m currently thinking about the same problem. I’ve found the paper “Data Augmentation for Time Series Classification using Convolutional Neural Networks” by Le Guennec et al., which doesn’t cover forecasting, however. Still, the augmentation methods mentioned there look promising. The authors describe 2 methods:

Window Slicing (WS)

A first method that is inspired from the computer vision community [8,10] consists in extracting slices from time series and performing classification at the slice level. This method has been introduced for time series in [6]. At training, each slice extracted from a time series of class y is assigned the same class and a classifier is learned using the slices. The size of the slice is a parameter of this method. At test time, each slice from a test time series is classified using the learned classifier and a majority vote is performed to decide a predicted label. This method is referred to as window slicing (WS) in the following.
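As a rough illustration of the slicing step only (not the authors’ code), the slices could be extracted like this. The function name `window_slices` and the `step` parameter are made up for this sketch; the paper only names the slice size as a parameter.

```python
import numpy as np

def window_slices(x, slice_len, step=1):
    """Return all slices of length slice_len, taken every `step` positions."""
    x = np.asarray(x)
    starts = range(0, len(x) - slice_len + 1, step)
    return np.stack([x[i:i + slice_len] for i in starts])

series = np.arange(10)
slices = window_slices(series, slice_len=9)   # 90% slices of a length-10 series
```

For classification, each slice would inherit the label of the series it came from. For forecasting, each slice would presumably be treated as its own shorter training series.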

Window Warping (WW)

The last data augmentation technique we use is more time-series specific. It consists in warping a randomly selected slice of a time series by speeding it up or down, as shown in Fig. 2. The size of the original slice is a parameter of this method. Fig. 2 shows a time series from the “ECG200” dataset and corresponding transformed data. Note that this method generates input time series of different lengths. To deal with this issue, we perform window slicing on transformed time series for all to have equal length. In this paper, we only consider warping ratios equal to 0.5 or 2, but other ratios could be used and the optimal ratio could even be fine tuned through cross-validation on the training set. In the following, this method will be referred to as window warping (WW).
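A possible sketch of the warping step, again not the authors’ implementation. It assumes that the warping ratio multiplies the length of the selected slice (2 stretches it to twice as many points, 0.5 compresses it to half) and that linear interpolation is good enough for resampling. The name `window_warp` and its parameters are hypothetical.

```python
import numpy as np

def window_warp(x, start, length, ratio):
    """Resample x[start:start+length] to round(length * ratio) points."""
    x = np.asarray(x, dtype=float)
    segment = x[start:start + length]
    new_len = max(2, int(round(length * ratio)))
    # Positions in the original slice at which the warped slice is sampled.
    old_pos = np.arange(length)
    new_pos = np.linspace(0, length - 1, new_len)
    warped = np.interp(new_pos, old_pos, segment)
    # The rest of the series is left untouched.
    return np.concatenate([x[:start], warped, x[start + length:]])

series = np.sin(np.linspace(0, 6, 20))
stretched = window_warp(series, start=5, length=4, ratio=2)     # 24 points
compressed = window_warp(series, start=5, length=4, ratio=0.5)  # 18 points
```

The output length changes with the ratio, which is why the quoted passage applies window slicing afterwards to bring all series back to equal length.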

The authors kept 90% of the series unchanged (i.e. WS was set to a 90% slice and for WW 10% of the series were warped). The methods are reported to reduce classification error on several types of (time) series data, except on 1D representations of image outlines. The authors took their data from here: http://timeseriesclassification.com

How to weight the synthetic data in the training set?

In image augmentation, since the augmentation isn’t expected to change the class of an image, it’s afaik common to weight it as any real data. Time series forecasting (and even time series classification) might be different:

1. A time series is not easily perceived as a contiguous object by humans, so depending on how much you tamper with it, it is unclear whether it still belongs to the same class. If you only slice and warp a little and the classes are visually distinct, this might not pose a problem for classification tasks.
2. For forecasting, I would argue that:

2.1 WS is still a nice method. No matter at which 90%-part of the series you look, you would still expect a forecast based on the same rules => full weight.

2.2 WW: The closer it happens to the end of the series, the more cautious I would be. Intuitively, I would come up with a weight factor sliding between 0 (warping at the end) and 1 (warping at the beginning), assuming that the most recent features of the curve are the most relevant.
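One way the sliding weight for WW could look, purely as an illustration. It assumes a linear ramp based on where the warped slice starts; the function name `ww_weight` and its arguments are made up, and any other decreasing function of the position would fit the same intuition.

```python
def ww_weight(start, length, n):
    """Weight for a series of length n whose slice [start, start+length) was warped.

    Returns 1.0 when the warp is at the very beginning and 0.0 when it is
    at the very end, decreasing linearly in between (an assumption).
    """
    if n <= length:
        raise ValueError("the warped slice must be shorter than the series")
    return 1.0 - start / (n - length)

w_begin = ww_weight(start=0, length=10, n=100)
w_mid = ww_weight(start=45, length=10, n=100)
w_end = ww_weight(start=90, length=10, n=100)
```

Such a weight could then be passed as a per-sample weight to the loss, with unwarped and WS samples keeping a weight of 1.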