I've been thinking about, implementing and using the Extreme Learning Machine (ELM) paradigm for more than a year now, and the longer I do, the more I doubt that it is really a good thing. My opinion, however, seems to be at odds with the scientific community, where (judging by citations and new publications) it seems to be a hot topic.
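The ELM was introduced by Huang et al. around 2003. The underlying idea is rather simple: start with a 2-layer artificial neural network and randomly assign the coefficients in the first layer. This way, one transforms the non-linear optimization problem, which is usually handled via backpropagation, into a simple linear regression problem. In more detail, for $x \in \mathbb{R}^D$, the model is

$$f(x) = \sum_{i=1}^{N_\text{hidden}} w_i \, \sigma\!\left(\sum_{k=1}^{D} v_{ik} x_k\right),$$

where $\sigma$ is the activation function (e.g. a sigmoid) and $N_\text{hidden}$ is the number of hidden nodes.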
Now, only the $w_i$ are adjusted (in order to minimize the squared-error loss), whereas the $v_{ik}$ are all chosen randomly. To compensate for the loss in degrees of freedom, the usual suggestion is to use a rather large number of hidden nodes (i.e. free parameters $w_i$).
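From another perspective (not the one usually promoted in the literature, which comes from the neural network side), the whole procedure is simply linear regression, but one where you choose your basis functions $\phi_i$ randomly, for example

$$\phi_i(x) = \sigma\!\left(\sum_{k=1}^{D} v_{ik} x_k\right),$$

with $\sigma$ a sigmoid and the $v_{ik}$ drawn at random.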
(Many other choices besides the sigmoid are possible for the random functions. For instance, the same principle has also been applied using radial basis functions.)
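To make the "linear regression on random basis functions" view concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions, not the reference implementation from the ELM papers: the function names `elm_fit` and `elm_predict`, the standard-normal distribution of the random weights, and the added random bias term are all choices made for this example.

```python
import numpy as np

def elm_fit(X, y, n_hidden=200, seed=0):
    """Fit output weights w by least squares on randomly chosen sigmoid features."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    V = rng.normal(size=(D, n_hidden))      # random first-layer weights, never trained
    b = rng.normal(size=n_hidden)           # random biases (an assumption of this sketch)
    H = 1.0 / (1.0 + np.exp(-(X @ V + b)))  # hidden-layer outputs = basis functions
    # Only the linear output weights are estimated (squared-error loss).
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return V, b, w

def elm_predict(X, V, b, w):
    """Evaluate the fitted model on new inputs."""
    H = 1.0 / (1.0 + np.exp(-(X @ V + b)))
    return H @ w
```

Usage would look like `V, b, w = elm_fit(X_train, y_train)` followed by `elm_predict(X_test, V, b, w)`. Everything non-linear is fixed by the random draw; the only "training" is the call to `np.linalg.lstsq`.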
From this viewpoint, the whole method becomes almost too simplistic, and this is also the point where I start to doubt that the method is really a good one (… whereas its scientific marketing certainly is). So, here are my questions:
The idea of covering the input space with random basis functions is, in my opinion, good for low dimensions. In high dimensions, I think it is just not possible to find a good choice using random selection with a reasonable number of basis functions. Therefore, does the ELM degrade in high dimensions (due to the curse of dimensionality)?
Do you know of experimental results supporting/contradicting this opinion? In the linked paper there is only one 27-dimensional regression data set (PYRIM), where the method performs similarly to SVMs (whereas I would rather like to see a comparison to a backpropagation ANN).
More generally, I would like to hear your comments about the ELM method.
Your intuition about the use of ELM for high-dimensional problems is correct; I have some results on this, which I am preparing for publication. For many practical problems, the data are not very non-linear and the ELM does fairly well, but there will always be datasets where the curse of dimensionality means that the chance of finding a good basis function with curvature just where you need it becomes rather small, even with many basis vectors.
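One way to probe this yourself is a small synthetic experiment: keep the number of hidden nodes fixed and see how the test error behaves as the input dimension grows. The sketch below is purely illustrative and is not the set of results mentioned above. The radial target function, the Gaussian inputs, the $1/\sqrt{D}$ scaling of the random weights, and the names `elm_test_rmse` and `rmse_by_dimension` are all assumptions made for this example, and what you observe will depend heavily on those choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_test_rmse(D, n_train=500, n_test=500, n_hidden=100, seed=0):
    """Train an ELM on a synthetic D-dimensional problem and return the test RMSE."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_train + n_test, D))
    y = np.cos(np.linalg.norm(X, axis=1))            # arbitrary radial target (assumption)
    V = rng.normal(size=(D, n_hidden)) / np.sqrt(D)  # scaling keeps pre-activations O(1)
    b = rng.normal(size=n_hidden)
    H = sigmoid(X @ V + b)                           # random basis functions
    # Fit the output weights on the training rows only.
    w, *_ = np.linalg.lstsq(H[:n_train], y[:n_train], rcond=None)
    err = H[n_train:] @ w - y[n_train:]
    return float(np.sqrt(np.mean(err ** 2)))

def rmse_by_dimension(dims=(2, 10, 50), **kwargs):
    """Test RMSE for each input dimension, with all other settings held fixed."""
    return {D: elm_test_rmse(D, **kwargs) for D in dims}
```

Calling `rmse_by_dimension()` returns a dictionary mapping each dimension to a test RMSE. For a fair picture you would want to repeat this over several seeds and target functions, and compare against a baseline trained on the same data.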
I personally would use something like a least-squares support vector machine (or a radial basis function network) and try to choose the basis vectors from those in the training set in a greedy manner (see e.g. my paper, but there were other/better approaches that were published at around the same time, e.g. in the very good book by Schölkopf and Smola, "Learning with Kernels"). I think it is better to compute an approximate solution to the exact problem, rather than an exact solution to an approximate problem, and kernel machines have a better theoretical underpinning (for a fixed kernel ;o).
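To give a flavour of what "choosing the basis vectors from the training set in a greedy manner" can look like, here is a rough sketch. It is not the algorithm from the paper or the book mentioned above; it is a simple matching-pursuit-style forward selection that I am substituting for illustration. The function names, the Gaussian kernel with a fixed `gamma`, and the small `ridge` term are assumptions of this example.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def greedy_rbf_fit(X, y, n_centers=20, gamma=1.0, ridge=1e-8):
    """Greedily pick RBF centres from the training points, refitting weights each step."""
    K = rbf_kernel(X, X, gamma)            # one candidate basis function per training point
    norms = np.linalg.norm(K, axis=0)      # at least 1, since each column contains K[j, j] = 1
    available = np.ones(len(X), dtype=bool)
    chosen = []
    residual = np.asarray(y, dtype=float).copy()
    w = np.zeros(0)
    for _ in range(n_centers):
        # Score each unused candidate by its normalised correlation with the residual.
        scores = np.abs(K.T @ residual) / norms
        scores[~available] = -np.inf
        j = int(np.argmax(scores))
        chosen.append(j)
        available[j] = False
        # Refit all weights on the selected columns (lightly regularised least squares).
        Phi = K[:, chosen]
        A = Phi.T @ Phi + ridge * np.eye(len(chosen))
        w = np.linalg.solve(A, Phi.T @ y)
        residual = y - Phi @ w
    return X[chosen], w

def greedy_rbf_predict(X_new, centers, w, gamma=1.0):
    """Evaluate the sparse RBF model on new inputs (use the same gamma as in fitting)."""
    return rbf_kernel(X_new, centers, gamma) @ w
```

The contrast with the ELM is that each basis function is placed where the current residual says it is needed, rather than wherever a random draw happens to put it.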