# Classification with tall fat data

I need to train a linear classifier on my laptop with hundreds of thousands of data points and about ten thousand features. What are my options? What is the state of the art for this type of problem?

It seems like stochastic gradient descent is a promising direction, and my sense is that this is the state of the art:

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter, "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM", Mathematical Programming, Series B, 127(1):3–30, 2011.
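To make that concrete, the core Pegasos update fits in a few lines. Here is a minimal sketch in NumPy (the function name, parameters, and defaults are mine, and I've left out the paper's optional projection onto the $1/\sqrt{\lambda}$ ball):

```python
import numpy as np

def pegasos(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos: SGD on the primal SVM objective.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    lam (regularization strength) and epochs are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)       # step size 1/(lambda * t)
            margin = y[i] * (w @ X[i])
            w *= (1.0 - eta * lam)      # shrink: gradient of the regularizer
            if margin < 1:              # hinge-loss subgradient step
                w += eta * y[i] * X[i]
    return w
```

Each update touches a single example, so the working memory is just the weight vector plus one row of data, which is what makes this family of methods attractive at your scale.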

Is this the consensus? Should I be looking in some other direction?

I think you should look at online learning methods. The perceptron and the kernel perceptron are extremely easy to implement and work very well in practice, and there is a whole host of other online methods. Note that any online learning method can be converted into a batch learning algorithm, in which case it closely resembles a stochastic gradient descent method.
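As an illustration of how little code this takes, here is a plain perceptron sketch in NumPy (my own minimal version, not any particular library's API):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Plain perceptron: on each mistake, add y_i * x_i to the weights.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:  # mistake: update
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
        if mistakes == 0:                   # converged on separable data
            break
    return w, b
```

The usual online-to-batch conversion is exactly what this loop does: make a few passes over a shuffled training set (optionally averaging the weight vectors along the way for a more stable final classifier).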

If you’re using Matlab, there’s a really nice toolbox called DOGMA by Francesco Orabona that contains a range of online learning algorithms, so you can compare a few different methods with it. I’ve used it in some of my research and found it very useful (note that, as far as I remember, it expects the data as [features x examples], so you might have to transpose yours).

As others have mentioned, you might want to try dimensionality reduction. PCA might not be such a good option here: with ten thousand features, the covariance matrix is $10{,}000 \times 10{,}000$, which is very costly to compute. You could look at random projections instead. The theory is tough, but the principle is very simple. It’s based on the Johnson-Lindenstrauss lemma if you’re interested, but the basic idea is that if you randomly project to a lower-dimensional space, then $\ell_2$ distances between points are preserved up to some $\epsilon$. If you’re using an RBF kernel, then $\ell_2$ distances are all you are interested in!
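A sketch of the idea, assuming a dense Gaussian projection matrix (sparse variants exist and are cheaper, but this is the simplest version):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the d-dimensional rows of X down to k dimensions.

    Entries of R are drawn i.i.d. N(0, 1/k), so squared l2 distances
    are preserved in expectation (Johnson-Lindenstrauss).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
    return X @ R
```

Projecting ten thousand features down to, say, a thousand dimensions costs a single matrix multiply, and pairwise distances (and hence RBF kernel values) are approximately preserved.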