I’ve searched exhaustively on this forum and elsewhere, and have come across a lot of great material. However, I’m ultimately still confused. Here’s a basic, concrete example of what I’d like to accomplish, my approach for doing so, and my questions.
I have a dataset sized 1000 x 51: 1000 observations, each with 50 numeric features and 1 binary response variable marked either "0" or "1". "0" indicates a response of "not-early," and "1" indicates a response of "early." I'd like to build a single LASSO logistic regression model to predict, on a testing set that does not include the response variable, whether each testing observation is classified as 0 or 1.
My approach uses the following steps:
Partition the training data into k = 5 folds, each containing 200 observations. Let’s label each fold data_1, data_2, data_3, data_4, data_5.
For k = 1, our train_set_k comprises data_1, data_2, data_3, and data_4, and will contain 800 observations. Our test_set_k is data_5, and will contain 200 observations.
For k = 2, our train_set_k comprises data_1, data_2, data_3, and data_5. Our test_set_k is data_4, and so on.
For k = 1 to 5, partition train_set_k into k_i = 5 inner folds, each containing 800/5 = 160 observations. Cross-validate on these k_i = 5 folds to find the optimal setting of the hyper-parameter lambda for our LASSO logistic regression model. (Note that lambda is a non-negative penalty strength; it is not restricted to lie between 0 and 1.)
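The index bookkeeping for this outer/inner split can be sketched in a few lines. Shown here in Python purely as language-neutral pseudocode that runs; the model fitting itself is left as comments, since in R it would be the cv.glmnet/glmnet calls discussed below:

```python
# Nested-CV index bookkeeping for the 1000-observation example:
# 5 outer folds of 200, and 5 inner folds of 160 within each outer training set.
# Model fitting is elided (comments only); in R it would be cv.glmnet/glmnet.

def make_folds(idx, k):
    """Split a list of indices into k equally sized contiguous folds."""
    size = len(idx) // k
    return [idx[i * size:(i + 1) * size] for i in range(k)]

n = 1000
outer_folds = make_folds(list(range(n)), 5)

for test_idx in outer_folds:                        # outer loop: honest test sets
    train_idx = [i for f in outer_folds if f is not test_idx for i in f]
    inner_folds = make_folds(train_idx, 5)          # inner loop: tune lambda here
    assert len(test_idx) == 200
    assert len(train_idx) == 800
    assert all(len(f) == 160 for f in inner_folds)
    # 1. run inner CV over a lambda grid on inner_folds -> best lambda
    # 2. refit on all 800 train_idx observations with that lambda
    # 3. predict on test_idx and record the AUC
```

The key point the sketch makes explicit: lambda is chosen using only the 800 training observations of each outer fold, so the 200 test observations never influence the tuning.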
Question #1: I’m unclear as to what “cross-validate to find the optimal setting for the hyper-parameter lambda” actually entails. In R, I’m using the following code:
model.one.early <- cv.glmnet(x.early, y.early, family = "binomial", nfolds=5, type.measure="auc")
…where nfolds = 5 corresponds to the k_i = 5 above; in other words, each of the nfolds = 5 folds will contain 160 observations.
From this code, I'm able to output two values: "lambda.min", the value of lambda that gives the minimum mean cross-validated error (which I assume means the maximum mean cross-validated AUC, since I specified type.measure = "auc" above), and "lambda.1se", the largest value of lambda such that the error is within 1 standard error of the minimum.
Question #2: What is the above line of code actually doing? How does it compute values of lambda.min and lambda.1se?
Question #3: Which value of lambda (lambda.min or lambda.1se) do I want to keep? Why?
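Mechanically, what cv.glmnet does is this: for each lambda on a grid, it averages the per-fold scores; lambda.min is the grid value with the best mean cross-validated score, and lambda.1se is the largest lambda whose mean score is within one standard error of that best score. A toy reconstruction of just that selection rule, with made-up per-fold AUC numbers (this is not glmnet's actual code or its lambda grid):

```python
import statistics

# Toy reconstruction of the lambda.min / lambda.1se rule with invented numbers.
# With type.measure = "auc", higher is better, so "minimum error" becomes
# "maximum AUC" and the 1-SE band sits below the best mean AUC.
lambdas = [0.20, 0.10, 0.05, 0.01]            # a made-up descending grid
fold_aucs = {                                  # one AUC per inner fold (invented)
    0.20: [0.70, 0.72, 0.69, 0.71, 0.70],
    0.10: [0.78, 0.80, 0.79, 0.78, 0.81],
    0.05: [0.79, 0.81, 0.80, 0.78, 0.80],
    0.01: [0.76, 0.78, 0.77, 0.75, 0.77],
}

mean_auc = {lam: statistics.mean(a) for lam, a in fold_aucs.items()}
se_auc = {lam: statistics.stdev(a) / len(a) ** 0.5 for lam, a in fold_aucs.items()}

# lambda.min analog: the lambda with the best mean CV score
lambda_min = max(lambdas, key=lambda lam: mean_auc[lam])

# lambda.1se analog: the largest (most penalised, hence sparsest) lambda whose
# mean score is within one standard error of the best mean score
threshold = mean_auc[lambda_min] - se_auc[lambda_min]
lambda_1se = max(lam for lam in lambdas if mean_auc[lam] >= threshold)
```

On Question #3: lambda.min optimises the raw CV score, while lambda.1se deliberately accepts a slightly worse score in exchange for a larger penalty, i.e. a sparser, more conservative model; lambda.1se is often preferred when interpretability or robustness matters more than the last fraction of AUC.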
Fit a LASSO logistic regression model to the 800 observations in train_set_k using the hyper-parameter value lambda.min (or lambda.1se) obtained above. Use this model to predict on the remaining 200 observations, using a piece of code like this:
early.preds <- data.frame(predict(model.one.early, newx=as.matrix(test.early.df), type="response", s="lambda.min"))
Compute an AUC for these predictions.
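For concreteness, AUC can be computed directly from the predicted probabilities and the binary labels; it equals the Mann-Whitney probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal stand-alone implementation (in R you would normally use a package such as pROC instead):

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen positive is scored higher than a randomly chosen negative
    (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfectly separated predictions give 1.0:
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # -> 1.0
```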
Once the above loop finishes, I should have a list of k = 5 lambda.min (or lambda.1se) values, and a list of 5 corresponding AUC values. To my understanding, by taking an average of these k = 5 AUC values, we can obtain an “estimation of the generalisation performance for our method of generating our model.” (-Dikran Marsupial, linked here)
This is where I’m confused. What do I do next? Again, I’d like to make predictions on a separate, un-labeled testing set. From what I’ve read, I must ultimately fit my LASSO logistic regression model with all available training data, using some code like this:
final_model <- glmnet(x=as.matrix(train_data_ALL), y=data_responses, family="binomial")
Question #4: Is this correct? Do I indeed fit one single model on ALL of my training data?
Then, I'd simply use this model to predict on my testing set, using some code like this:
finals_preds <- predict(final_model, newx=as.matrix(test_data_ALL), type="response", s=?)
In my nested cross-validation employed in steps 1 through 5, I’ve obtained a list of 5 values of lambda and 5 corresponding AUC values.
Question #5: Which value of lambda do I choose? Do I select the value of lambda that gave the highest AUC? Or do I average the k = 5 values of lambda and plug that average into the above line of code as s (the lambda value)?
Question #6: In the end, I just want one LASSO logistic regression model, with one unique value for each hyper-parameter … correct?
Question #7: If the answer to Question #5 is yes, how do we obtain an estimate of the AUC that this model will produce? Is this estimate equivalent to the average of the k = 5 AUC values obtained in Step 5?
To answer your initial question (What to do AFTER Nested Cross-Validation?):
Nested cross-validation gives you several scores, each computed on test data that the algorithm has not seen during tuning. Ordinary ("non-nested") CV uses the same folds both to tune lambda and to score the model, so its score tends to be optimistically biased. With nested CV you can therefore better evaluate the true performance of your modelling procedure.
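On your Question #7: the usual report is the average of the k = 5 outer-fold AUCs, and their spread gives a rough error bar on that estimate. A minimal sketch with made-up numbers:

```python
import statistics

# Hypothetical AUCs from the 5 outer test sets (invented for illustration)
outer_aucs = [0.78, 0.81, 0.76, 0.80, 0.79]

estimate = statistics.mean(outer_aucs)                    # the reported AUC
spread = statistics.stdev(outer_aucs) / len(outer_aucs) ** 0.5   # rough SE
```

Note that this estimates the performance of the model-building procedure (the recipe of tuning lambda by inner CV and then fitting), not of any one of the five fold-specific models; that is exactly the distinction in the Dikran Marsupial quote above.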
After nested CV you fit the chosen model on the whole dataset. And then you use the model to make predictions on new, unlabeled data (that are not part of your 1000 obs.).
I'm not 100% sure that what you describe is proper nested CV with an outer and an inner loop. To understand nested CV, I found this description helpful:
(Petersohn, Temporal Video Segmentation, Vogt Verlag, 2010, p. 34)
Thoughts on bootstrapping as a better alternative than (nested) CV can be found here.
P.S.: You will probably get more answers if you ask only 1 or 2 questions per post instead of 7 in one. Maybe you want to split them up so that others can find them more easily.