I’ve a dataset of 120 samples in a 10-fold cross validation setting.
Currently, I pick the training data of the first holdout and do a 5-fold cross-validation on it to pick the values of gamma and C by grid search. I’m using SVM with RBF kernel.
Since I’m using 10-fold cross-validation to report precision and recall, do I perform this grid search on the training data of each holdout (there are 10 holdouts, each with 10% test and 90% training data)? Wouldn’t that be too time-consuming?
If I take the gamma and C found on the first holdout and reuse them for the remaining 9 holdouts of the k-fold cross-validation, is that a violation, because I would have used the training data to get gamma and C and then used a portion of that same training data as test data in the second holdout?
Yes, this would be a violation, as the test data for folds 2-10 of the outer cross-validation would have been part of the training data for fold 1, which was used to determine the values of the kernel and regularisation parameters. This means that some information about the test data has potentially leaked into the design of the model, which potentially gives an optimistic bias to the performance evaluation. That bias is most optimistic for models that are very sensitive to the setting of the hyper-parameters (i.e. it most strongly favours models with an undesirable feature).
This bias is likely to be strongest for small datasets, such as this one, as the variance of the model selection criterion is largest for small datasets, which encourages over-fitting the model selection criterion, which means more information about the test data can leak through.
I wrote a paper on this a year or two ago, as I was rather startled by the magnitude of the bias that deviations from full nested cross-validation can introduce, which can easily swamp the difference in performance between classifier systems. The paper is:
Gavin C. Cawley, Nicola L. C. Talbot; “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation”; JMLR 11(Jul):2079-2107, 2010.
Essentially tuning the hyper-parameters should be considered an integral part of fitting the model, so each time you train the SVM on a new sample of data, independently retune the hyper-parameters for that sample. If you follow that rule, you probably can’t go too far wrong. It is well worth the computational expense to get an unbiased performance estimate, as otherwise you run the risk of drawing the wrong conclusions from your experiment.
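As a rough illustration of that rule, here is a minimal sketch of nested cross-validation, assuming Python with scikit-learn (the question does not say which toolkit is in use). The synthetic 120-sample dataset, the grid values for C and gamma, the inner-loop F1 criterion and the random seeds are all placeholders of mine, not recommendations. The point is only that the grid search is wrapped inside the estimator, so it is rerun from scratch on the training portion of every outer fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Placeholder data: 120 samples, standing in for the real dataset (assumption).
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

# Hypothetical grid; the actual ranges should suit the problem at hand.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}

# Inner loop: 5-fold grid search used only to pick C and gamma.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Outer loop: 10-fold split used only to estimate performance.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# GridSearchCV is itself an estimator: each call to fit() retunes the
# hyper-parameters on whatever training data it is handed.
tuned_svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv, scoring="f1")

# cross_validate clones and refits tuned_svm on each outer training set,
# so the outer test fold never influences the choice of C and gamma.
scores = cross_validate(tuned_svm, X, y, cv=outer_cv,
                        scoring=["precision", "recall"])

print("precision: %.3f" % scores["test_precision"].mean())
print("recall:    %.3f" % scores["test_recall"].mean())
```

With 120 samples this amounts to 10 x 16 x 5 = 800 small SVM fits plus 10 refits, which should take seconds rather than hours, so the extra cost of retuning in every fold is modest at this scale.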