# Time series dimensionality reduction for classification input

I am looking to construct a predictive model where the outcome variable is binary and the input is a time series. To make it more concrete, the model will predict whether a customer churns (left the company; coded as 1 or 0) based on the amount they spent with the company in the prior 60 days. So, the data is one customer per row, and the columns are an outcome factor (1 or 0) plus 60 additional columns for the amount spent at times t-1, t-2, …, t-60.

Here is some example data:

#create the example data: 600 series of length 60, plus a class ID
#(sc stands in for the spend series; random values here purely for illustration)
sc <- matrix(rnorm(600 * 60), nrow = 600, ncol = 60)

#binary class label
classId <- as.factor(c(rep(0, 300), rep(1, 300)))
newSc <- data.frame(classId, sc)   #avoid cbind() here: it would coerce the factor to numeric
newSc$ID <- seq(1, 600, 1)


The actual model may have many of these series for each customer, so I need to reduce the dimensionality of the data for each series: instead of using 60 values, I want to reduce a series down to a handful. Of course, I can use the mean, min, max, etc. of the series, but I have been reading about using the Discrete Fourier Transform.

Questions:

1. Is the DFT (computed in R by fft()) a proper method to use for my purpose? Any information on how it works would be appreciated.

2. Assuming this R function is correct, how do you extract just the most meaningful coefficients to achieve dimensionality reduction?

There seems to be a consensus that using the DFT for dimension reduction is not a wise choice, but in data mining it seems that this transform, the DWT, and SVD are all commonly used: see Time Series Mining, starting on page 20.

I’m not sure that I’d classify a Fourier transform as a dimensionality reduction technique per se, though you can certainly use it that way.

As you probably know, a Fourier transform converts a time-domain function $f(t)$ into a frequency-domain representation $F(\omega)$. In the original function, the argument $t$ usually denotes time: for example, $f(1)$ might denote someone’s account balance on the first day, or the volume of the first sample of a song’s recording, while $f(2)$ indicates the following day’s balance/sample value. The argument $\omega$ in $F(\omega)$, however, denotes frequency: $F(10)$ indicates the extent to which the signal fluctuates at 10 cycles/second (or whatever your temporal units are), while $F(20)$ indicates the extent to which it fluctuates twice as fast. The Fourier transform “works” by reconstructing your original signal as a weighted sum of sinusoids; for each frequency component you actually get a “weight”, usually called the amplitude, and a “shift”, usually called the phase. The Wikipedia article is a bit complex, but there are a bunch of decent tutorials floating around the web.
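To make this concrete, here is a minimal base-R sketch (the signal and all its parameters are made up for illustration) showing that fft() recovers the amplitude and phase of a known sinusoid:

```r
# A length-64 cosine with amplitude 3 and phase pi/4, at 5 cycles per window
n <- 64
t <- seq(0, n - 1)
f <- 5
x <- 3 * cos(2 * pi * f * t / n + pi / 4)

X <- fft(x)
amp   <- Mod(X) / (n / 2)   # scaled so amp[k + 1] is the amplitude at k cycles
phase <- Arg(X)

amp[f + 1]     # ~3     (R vectors are 1-indexed, so frequency f sits at index f + 1)
phase[f + 1]   # ~pi/4
```

Note the n/2 scaling: a real sinusoid’s energy is split between a positive- and a negative-frequency coefficient, so each raw magnitude is amplitude times n/2.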

The Fourier transform, by itself, doesn’t get you any dimensionality reduction. If your signal is of length $N$, you’ll get about $N/2$ amplitudes and $N/2$ phases back, which is clearly not a huge savings. However, for some signals, most of those amplitudes are close to zero or are a priori known to be irrelevant. You could then throw out the coefficients for these frequencies, since you don’t need them to reconstruct the signal, which can lead to a considerable savings in space (again, depending on the signal). This is what the linked book is describing as “dimensionality reduction.”
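As a sketch of that idea (the test signal and the number of kept coefficients are made up), you can zero out all but the largest-magnitude coefficients and check how faithfully the inverse FFT reconstructs the series:

```r
# Two sinusoids plus a little noise: most energy lives in a handful of frequencies
set.seed(1)
n <- 60
t <- 1:n
x <- 10 + 4 * sin(2 * pi * t / 30) + 2 * sin(2 * pi * t / 10) + rnorm(n, sd = 0.3)

X <- fft(x)
K <- 5                                                     # rough budget of components
keep <- order(Mod(X), decreasing = TRUE)[1:(2 * K + 1)]    # conjugate pairs + DC term
Xr <- replace(rep(0 + 0i, n), keep, X[keep])

x_hat <- Re(fft(Xr, inverse = TRUE)) / n    # R's inverse fft is unnormalized
sqrt(sum((x - x_hat)^2)) / sqrt(sum(x^2))   # small relative error
```

Here 11 complex coefficients stand in for 60 numbers; for a signal this periodic, the relative reconstruction error stays tiny because the discarded coefficients carried only noise.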

A Fourier representation could be useful if:

1. Your signal is periodic, and
2. Useful information is encoded in the periodicity of the signal.

For example, suppose you’re recording a patient’s vital signs. The electrical signal from the EKG (or the sound from a stethoscope) is a high-dimensional signal (say, 200+ samples/second). However, for some applications, you might be more interested in the subject’s heart rate, which is likely to be the location of the peak in the FFT, and is thus representable by a single number.
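A rough illustration of that idea, using a synthetic sine wave in place of a real EKG (the 200 Hz sampling rate and 72 bpm heart rate are assumptions for the example):

```r
# Estimate the dominant rate by locating the FFT peak, ignoring the DC component
set.seed(42)
fs  <- 200                      # samples per second
t   <- seq(0, 10, by = 1 / fs)  # 10 seconds of signal
bpm <- 72
ecg <- sin(2 * pi * (bpm / 60) * t) + 0.2 * rnorm(length(t))

X     <- Mod(fft(ecg))
half  <- 2:floor(length(t) / 2)        # positive frequencies, skipping DC at index 1
freqs <- (half - 1) * fs / length(t)   # frequency axis in Hz
peak_hz <- freqs[which.max(X[half])]
peak_hz * 60                            # ~72 beats per minute
```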

A major limitation of the FFT is that it considers the whole signal at once–it cannot localize changes within it. For example, suppose you look at the coefficient associated with 10 cycles/second. You’ll get similar amplitude values if:

1. There is a consistent, but moderate-sized, 10 Hz oscillation in the signal,
2. That oscillation is twice as large in the first half of the signal, but totally absent in the 2nd half, and
3. The oscillation is totally absent in the first half, but twice as large as #1 in the 2nd half.
4. (and so on)

I obviously don’t know much about your business, but I’d imagine these could be very relevant features. Another major limitation of the FFT is that it operates on a single time scale. For example, suppose one customer religiously visits your business every other day: he has a “frequency” of 0.5 visits/day (or a period of 2 days). Another customer might also consistently come for two days in a row, take two off, and then visit again for the next two. Mathematically, the second customer is “oscillating” twice as slowly as the first, but I’d bet that these two are equally likely to churn.

A time-frequency approach helps get around these issues by localizing changes in both frequency and time. One simple approach is the short-time FFT, which divides your signal into little windows and then computes the Fourier transform of each window. This assumes that the signal is stationary within a window, but changes across them. Wavelet analysis is a more powerful (and mathematically rigorous) approach.

There are lots of wavelet tutorials around–the charmingly named Wavelets for Kids is a good place to start, even if it is a bit much for all but the smartest actual children. There are several wavelet packages for R, and their syntax is pretty straightforward (see page 3 of the wavelet package documentation for one). You need to choose an appropriate wavelet for your application–ideally, it looks something like the fluctuation of interest in your signal, but a Morlet wavelet might be a reasonable starting point. Like the FFT, the wavelet transform itself won’t give you much dimensionality reduction. Instead, it represents your original signal as a function of two parameters (“scale”, which is analogous to frequency, and “translation”, which is akin to position in time). Like the FFT coefficients, you can safely discard coefficients whose amplitude is close to zero, which gives you some effective dimensionality reduction.
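Here is a bare-bones short-time FFT in base R (the signal is contrived for illustration): a 10 Hz oscillation present only in the second half shows up only in the later windows, which is exactly the distinction a single whole-signal FFT would blur:

```r
# 4 seconds at 100 Hz; the 10 Hz oscillation starts halfway through
fs <- 100
t  <- seq(0, 4, by = 1 / fs)
x  <- sin(2 * pi * 10 * t)
x[1:200] <- 0                  # first two seconds are flat

win <- fs                      # 1-second windows, no overlap (a crude choice)
starts <- seq(1, length(x) - win + 1, by = win)
amp10 <- sapply(starts, function(s) {
  X <- fft(x[s:(s + win - 1)])
  Mod(X[10 + 1]) / (win / 2)   # amplitude at 10 cycles/window = 10 Hz
})
round(amp10, 2)                # ~0 in the early windows, ~1 once the oscillation starts
```

A real implementation would taper each window (e.g., with a Hann window) and overlap them, but even this crude version localizes the change in time.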

Finally, I want to conclude by asking whether dimensionality reduction is really what you want here. The techniques you’ve been asking about are all essentially ways to reduce the size of the data while preserving it as faithfully as possible. However, to get the best classification performance, we typically want to collect and transform the data so as to make relevant features as explicit as possible, while discarding everything else.

Sometimes, Fourier or wavelet analysis is exactly what is needed (e.g., turning a high-dimensional EKG signal into a single heart-rate value); other times, you’d be better off with completely different approaches (moving averages, derivatives, etc.). I’d encourage you to have a good think about your actual problem (and maybe even brainstorm with sales/customer-retention folks to see if they have any intuitions) and use those ideas to generate features, instead of blindly trying a bunch of transforms.
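For instance, a hand-rolled feature sketch along those lines (every column name, window, and the random spend matrix here are illustrative, not from your data):

```r
# spend: customers x 60 matrix of daily spend (columns t-1 ... t-60), faked here
set.seed(7)
spend <- matrix(rpois(600 * 60, lambda = 5), nrow = 600)

features <- data.frame(
  total     = rowSums(spend),                                       # overall volume
  recent    = rowMeans(spend[, 1:14]),                              # last two weeks
  older     = rowMeans(spend[, 15:60]),                             # everything before that
  trend     = rowMeans(spend[, 1:14]) - rowMeans(spend[, 47:60]),   # recent vs. oldest weeks
  zero_days = rowSums(spend == 0)                                   # days with no activity
)
head(features, 3)
```

Features like a declining trend or a growing count of zero-spend days may carry far more churn signal per column than generic transform coefficients would.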