I’ve seen different error metrics used in the Kaggle competitions: RMS, mean-square, AUC, amongst others. What’s the general rule of thumb on choosing an error metric, i.e. how do you know which error metric to use for a given problem? Are there any guidelines?
The pool of error metrics you can choose from differs between classification and regression. In regression you try to predict a continuous value, whereas in classification you predict discrete classes such as “healthy” or “not healthy”. Of the examples you mentioned, root mean square error applies to regression and AUC applies to classification with two classes.
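To make the regression side concrete, here is a minimal sketch of mean squared error and root mean squared error in plain Python. The function names and the toy numbers are my own illustration, not taken from any particular competition.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Made-up continuous targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

mse = mean_squared_error(y_true, y_pred)        # squared errors: 1, 0, 4, 0
rmse = root_mean_squared_error(y_true, y_pred)
print(mse, rmse)
```

Both measures rank models identically; RMSE is often preferred for reporting because it is in the units of the predicted quantity.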
Let me give you a little more detail on classification. You mentioned AUC, the area under the ROC curve, which is usually applied only to binary classification problems.
Although there are ways to construct a ROC curve for more than two classes, they lose the simplicity of the two-class ROC curve. In addition, ROC curves can only be constructed if the classifier of choice outputs some kind of score associated with each prediction. For instance, logistic regression will give you probabilities for each of the two classes. Besides their simplicity, ROC curves have the advantage that they are not affected by the ratio between positively and negatively labelled instances in your dataset, and they don’t force you to choose a threshold. Nevertheless, it is recommended to look not only at the ROC curve but at other visualizations as well. I’d recommend having a look at precision-recall curves and cost curves. There is no one true error measure; they all have their strengths and weaknesses.
Literature I found helpful in this regard:
- Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.
- Drummond, C., & Holte, R. (2006). Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65(1), 95–130.
- Parker, C. (2011). An Analysis of Performance Measures for Binary Classifiers. 2011 IEEE 11th International Conference on Data Mining (pp. 517–526).
- Davis, J., & Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (pp. 233–240). New York, NY, USA: ACM.
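As a rough illustration of how a ROC curve and its AUC come out of classifier scores, here is a minimal sketch in plain Python. It sweeps the threshold over every distinct score and integrates with the trapezoidal rule; the function names `roc_points` and `auc` and the toy data are assumptions of mine, not a reference implementation.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve.

    labels are 1 (positive) or 0 (negative); a higher score means "more positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Instances with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up scores, e.g. predicted probabilities of the positive class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))

# Tripling the negatives changes the class ratio but not the AUC.
labels_skewed = labels + [0, 0, 0, 0, 0, 0]
scores_skewed = scores + [0.7, 0.7, 0.4, 0.4, 0.2, 0.2]
area_skewed = auc(roc_points(labels_skewed, scores_skewed))
print(area, area_skewed)
```

The second call illustrates the point about class ratios: replicating the negative instances leaves the true and false positive rates, and hence the AUC, unchanged.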
If your classifier does not provide some kind of score, you have to fall back on the basic measures that can be obtained from a confusion matrix containing the number of true positives, false positives, true negatives and false negatives. The visualizations mentioned above (ROC, precision-recall, cost curve) are all based on such tables, one for each threshold applied to the classifier’s score. The most popular measure in this case is probably the F1 measure. In addition, there is a long list of measures you can derive from a confusion matrix: sensitivity, specificity, positive predictive value, negative predictive value, accuracy, Matthews correlation coefficient, …
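The measures listed above all follow from the four cells of the binary confusion matrix. The sketch below uses the standard definitions; the helper names `confusion_counts` and `binary_metrics` and the toy labels are my own, and undefined ratios (zero denominator) are returned as NaN by choice.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels coded as 1 (positive) and 0 (negative)."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Common measures derived from a 2 x 2 confusion matrix."""
    def ratio(num, den):
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # recall, true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # precision
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Made-up hard predictions from a classifier without scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(*counts)
print(counts, metrics)
```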
Similar to ROC curves, confusion matrices are very easy to understand in the binary classification problem, but get more complicated with multiple classes, because for $N$ classes you have to consider either a single $N \times N$ table or $N$ separate $2 \times 2$ tables, each of them comparing one of the classes (A) against all other classes (not A).
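The two views are related: each one-vs-rest table can be read off the single multi-class table. Here is a minimal sketch, assuming rows index the true class and columns the predicted class; the function names and the three-class toy data are illustrative only.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Build an N x N table: rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def one_vs_rest(matrix, k):
    """Collapse the N x N table into the 2 x 2 counts for class k vs. not k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # true k, predicted something else
    fp = sum(row[k] for row in matrix) - tp   # predicted k, truly something else
    tn = total - tp - fn - fp
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Made-up three-class example.
classes = ["A", "B", "C"]
y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "C", "C", "C", "A"]

matrix = confusion_matrix(y_true, y_pred, classes)
tables = {c: one_vs_rest(matrix, i) for i, c in enumerate(classes)}
print(matrix)
print(tables)
```

Any of the binary measures can then be computed per class from these tables and, if a single number is needed, averaged across classes.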