During an experiment on text classification, I found that the ridge classifier consistently topped the tests among classifiers more commonly mentioned and applied to text mining tasks, such as SVM, NB, kNN, etc. That said, I haven't elaborated on optimizing each classifier for this specific text classification task beyond some simple parameter tweaks.
A similar result was also mentioned by Dikran Marsupial.
Not coming from a statistics background, and after reading through some materials online, I still cannot figure out the main reasons for this. Could anyone provide some insight into this outcome?
Text classification problems tend to be quite high dimensional (many features), and high dimensional problems are likely to be linearly separable (as you can separate any d+1 points in a d-dimensional space with a linear classifier, regardless of how the points are labelled). So linear classifiers, whether ridge regression or SVM with a linear kernel, are likely to do well. In both cases, the ridge parameter or C for the SVM (as tdc mentions +1) controls the complexity of the classifier and helps to avoid over-fitting by separating the patterns of each class by large margins (i.e. the decision surface passes down the middle of the gap between the two collections of points). However, to get good performance, the ridge/regularisation parameters need to be properly tuned (I use leave-one-out cross-validation as it is cheap).
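A minimal sketch of this tuning step, assuming scikit-learn: `RidgeClassifierCV` selects the ridge parameter by an efficient leave-one-out-style (generalized) cross-validation, which matches the cheap tuning strategy described above. The toy corpus and alpha grid below are illustrative, not from the original post.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifierCV

# A tiny illustrative two-class corpus (pets vs finance).
docs = [
    "the cat sat on the mat", "dogs and cats are pets",
    "stock prices rose sharply", "the market fell on trade news",
    "my cat chased the dog", "investors sold shares today",
]
labels = np.array([0, 0, 1, 1, 0, 1])

# TF-IDF gives the high-dimensional sparse features typical of text tasks.
X = TfidfVectorizer().fit_transform(docs)

# Tune the ridge parameter over a log-spaced grid using efficient
# leave-one-out-style cross-validation (the default cv=None behaviour).
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 13)).fit(X, labels)
print("chosen alpha:", clf.alpha_)
print("training accuracy:", clf.score(X, labels))
```

On real corpora you would of course evaluate on held-out documents rather than the training set; the point here is only the single-parameter tuning loop.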
Fundamentally, the reason that ridge regression works well is that non-linear methods are too powerful and it is difficult to avoid over-fitting with them. There may be a non-linear classifier that gives better generalisation performance than the best linear model, but it is too difficult to estimate its parameters using the finite sample of training data that we have. In practice, the simpler the model, the fewer problems we have in estimating the parameters, so there is less tendency to over-fit, and we get better results.
Another issue is feature selection: ridge regression avoids over-fitting by regularising the weights to keep them small, and model selection is straightforward as you only have to choose the value of a single regularisation parameter. If you try to avoid over-fitting by picking the optimal set of features instead, then model selection becomes difficult, as there is a degree of freedom (sort of) for each feature, which makes it possible to over-fit the feature selection criterion, and you end up with a set of features that is optimal for this particular sample of data but gives poor generalisation performance. So not performing feature selection and using regularisation can often give better predictive performance.
I often use bagging (forming a committee of models trained on bootstrapped samples from the training set) with ridge-regression models, which often gives an improvement in performance; and as all the models are linear, you can combine them to form a single linear model, so there is no performance hit in operation.
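The "combine them into a single linear model" step can be sketched as averaging the weight vectors of the committee: a mean of linear decision functions is itself linear. This is a hedged illustration assuming scikit-learn's `RidgeClassifier`; the synthetic data, ensemble size, and alpha value are my own choices, not from the answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Train a committee of ridge models on bootstrap resamples of the data.
n_models = 25
coefs, intercepts = [], []
for _ in range(n_models):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample (with replacement)
    m = RidgeClassifier(alpha=1.0).fit(X[idx], y[idx])
    coefs.append(m.coef_)
    intercepts.append(m.intercept_)

# Because each model is linear, averaging their weights yields a single
# linear model whose decision function equals the committee average.
combined = RidgeClassifier(alpha=1.0).fit(X, y)  # container for the averaged weights
combined.coef_ = np.mean(coefs, axis=0)
combined.intercept_ = np.mean(intercepts, axis=0)
print("bagged-ensemble accuracy:", combined.score(X, y))
```

Note that averaging decision functions is what collapses cleanly into one model; averaging the hard votes of the committee would not, in general, be expressible as a single linear classifier.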