With the limited knowledge I have on SVM, it is good for a short and fat data matrix X, (lots of features, and not too many instances), but not for big data.
I understand one reason is the Kernel Matrix K is a n×n matrix where, n is number of instances in the data. If we have say, 100K data, the kernel matrix K will have 1010 elements, and may take ~80G of memory.
Is there any modification of SVM that can be used in large data? (Say on the scale of 100K to 1M data points?)
As you mention, storing the kernel matrix requires memory that scales quadratically with the number of data points. Training time for traditional SVM algorithms also scales superlinearly with the number of data points. So, these algorithms aren’t feasible for large data sets.
One possible trick is to reformulate a kernelized SVM as a linear SVM. Each element Kij of the kernel matrix represents the dot product between data points xi and xj after mapping them (possibly nonlinearly) into a feature space: Kij=Φ(xi)⋅Φ(xj). The feature space mapping Φ is defined implicitly by the kernel function, and kernelized SVMs don’t explicitly compute feature space representations. This is computationally efficient for small to medium size datasets, as the feature space can be very high dimensional, or even infinite dimensional. But, as above, this becomes infeasible for large datasets. Instead, we can explicitly map the data nonlinearly into feature space, then efficiently train a linear SVM on the feature space representations. The feature space mapping can be constructed to approximate a given kernel function, but use fewer dimensions than the ‘full’ feature space mapping. For large datasets, this can still give us rich feature space representations, but with many fewer dimensions than data points.
One approach to kernel approximation uses the Nyström approximation (Williams and Seeger 2001). This is a way to approximate the eigenvalues/eigenvectors of a large matrix using a smaller submatrix. Another approach uses randomized features, and is somtimes called ‘random kitchen sinks’ (Rahimi and Recht 2007).
Another trick for training SVMs on large datasets is to approximate the optimization problem with a set of smaller subproblems. For example, using stochastic gradient descent on the primal problem is one approach (among many others). Much work has been done on the optimization front. Menon (2009) gives a good survey.
Williams and Seeger (2001). Using the Nystroem method to speed up kernel machines.
Rahimi and Recht (2007). Random features for large-scale kernel machines.
Menon (2009). Large-scale support vector machines: Algorithms and theory.