In machine learning and statistics, we can assess how well a model generalizes to a new dataset by splitting our data into training and test data:

- Split data into training and test data.
- Fit a model to the training data.
- Assess the performance of the model on the test data.
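As a minimal sketch, the split step might look as follows (illustrative Python; the worked examples later in this post use R, and `test_fraction` is just an assumed default):

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Randomly partition `data` into a training and a test set."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [x for i, x in enumerate(data) if i not in test_idx]
    test = [x for i, x in enumerate(data) if i in test_idx]
    return train, test

train, test = train_test_split(list(range(100)), test_fraction=0.2)
```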

The disadvantage of the above approach is that not all data is used for training. A more robust assessment of generalization can be done using cross-validation:

- Split data into k parts (*folds*).
- For i in 1…k:
    - Fit a model to all data but the i-th fold.
    - Assess the performance of the model on the i-th fold.

This approach uses all data for training, and gives a more robust estimate of generalization performance.
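The k-fold procedure above can be sketched generically, with `fit` and `score` left as placeholders for whatever model and metric you use (again illustrative Python, not the R code used below):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition the indices 0..n-1 into k disjoint folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, fit, score):
    """Generic k-fold CV: fit on all folds but one, score on the held-out fold."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [x for i, x in enumerate(data) if i not in held_out]
        test = [x for i, x in enumerate(data) if i in held_out]
        scores.append(score(fit(train), test))
    return scores
```

Every observation appears in exactly one held-out fold, so all data is eventually used for both training and validation.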

Because cross-validation can be used for assessing the performance of a model, it can also be used for comparing the performance of different models (model selection). In the next two sections, I will give examples of both cross-validation for performance estimation and cross-validation for model selection. In the last two sections, I will explain why you need to do something called *nested cross-validation* if you want to *simultaneously* do performance estimation and model selection.

#### Cross-validation for estimating generalization performance

For example, suppose we fit a linear-kernel SVM with C = 1 to the breast cancer data from the R-package mlbench. Then we could estimate how well such an SVM generalizes by doing a 5-fold cross-validation as follows:

```r
# Load some packages
library(kernlab)
library(caret)
library(mlbench)
library(PRROC)

# Set the random seed for reproducibility
set.seed(111)

# Load and prepare the breast cancer data
data(BreastCancer)
data <- BreastCancer[!is.na(BreastCancer$Bare.nuclei), -1]

# For simplicity, restrict both classes to 200 samples
data <- data[
  c(
    sample(which(data$Class == levels(data$Class)[1]), 200),
    sample(which(data$Class == levels(data$Class)[2]), 200)
  ),
]
y <- data$Class
X <- data
X$Class <- NULL

# Test the performance of a linear SVM with C = 1
folds <- createFolds(data$Class, k = 5)

# For each fold ...
auc <- sapply(folds, function(fold) {
  # Train an SVM, excluding the fold
  fit <- ksvm(
    Class ~ .,
    data = data[-fold, ],
    kernel = "vanilladot",
    kpar = list(),
    C = 1,
    prob.model = TRUE,
    class.weights = 1 / table(data$Class[-fold])
  )
  # Predict the fold
  yh <- predict(fit, newdata = data[fold, ], type = "probabilities")
  # Compare the predictions to the labels
  posneg <- split(yh[, 1], data$Class[fold])
  # Return the AUC under the ROC
  roc.curve(posneg[[1]], posneg[[2]])$auc
})
```

The results look as follows (average AUC of about 0.976):

#### Cross-validation for model selection

An example of using cross-validation for choosing between models, such as when optimizing hyperparameters, looks as follows:

```r
# Function for one round of training and validating an SVM
train_and_validate <- function(data, fold, C) {
  # Train an SVM, excluding the fold
  fit <- ksvm(
    Class ~ .,
    data = data[-fold, ],
    kernel = "vanilladot",
    kpar = list(),
    C = C,
    prob.model = TRUE,
    class.weights = 1 / table(data$Class[-fold])
  )
  # Predict the fold
  yh <- predict(fit, newdata = data[fold, ], type = "probabilities")
  # Compare the predictions to the labels
  posneg <- split(yh[, 1], data$Class[fold])
  # Return the AUC under the ROC
  roc.curve(posneg[[1]], posneg[[2]])$auc
}

# Function for doing a k-fold cross-validation for each C in CC
cv <- function(data, k, CC, seed = NULL) {
  # Set the seed, if given
  if (!is.null(seed)) {
    set.seed(seed)
  }
  # For each value of the hyperparameter C ...
  auc <- lapply(CC, function(C) {
    folds <- createFolds(data$Class, k = k)
    # For each fold ...
    sapply(folds, function(fold) {
      # Train an SVM, and validate on the fold
      train_and_validate(data, fold, C)
    })
  })
  auc
}

# Do the cross-validation for each C in CC
auc <- cv(
  data = data,
  k = 5,
  CC = 2^seq(log2(.01), log2(10), length.out = 21),
  seed = 111
)
```

In the figure above, we can see that performance remains relatively constant until C ≈ 0.5 and then seems to decline, so we should not take C too large if we want good generalization.

Technical note: in the code above, we generate a different set of folds for each C. We could also have used the same fold set for all Cs. However, in that case, the results might have depended strongly on how that specific fold set happened to be chosen. While this can also happen with different fold sets, at least with different fold sets the plot gives an indication of the variance in AUC caused by the choice of folds. Nevertheless, using the same fold set for different models can make sense in some cases, for example when directly comparing two models. In that case, the difference between the models can be assessed by comparing the folds pairwise, for example using a paired Wilcoxon signed-rank test.
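To sketch the paired idea: with two models evaluated on the *same* folds, you compare per-fold differences rather than pooled scores. Below is a toy illustration (made-up AUC values, and a simple sign test standing in for the Wilcoxon signed-rank test; illustrative Python, not part of the R analysis):

```python
from math import comb

# Hypothetical per-fold AUCs for two models evaluated on the SAME fold set
auc_model_a = [0.97, 0.95, 0.98, 0.96, 0.97]
auc_model_b = [0.94, 0.93, 0.95, 0.96, 0.94]

# Paired comparison: per-fold differences
diffs = [a - b for a, b in zip(auc_model_a, auc_model_b)]

# Simple one-sided sign test: under H0, each nonzero difference
# is positive with probability 1/2
wins = sum(d > 0 for d in diffs)
n = sum(d != 0 for d in diffs)
p_value = sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n
```

In R, the paired Wilcoxon test itself would simply be `wilcox.test(auc_a, auc_b, paired = TRUE)`.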

#### Can we simultaneously do model selection and performance estimation?

Looking at the last plot above, you might be wondering: can we conclude that the expected generalization AUC is about 0.991 (the maximum value of the red line)? More specifically, is taking the maximum AUC a valid approach for estimating the generalization performance of the best C? That is an interesting question. How could we check this? Well, let’s check whether randomized data results in a generalization AUC of about 0.5! More concretely: if we randomize the class labels in the breast cancer dataset, we would expect a classifier to do no better than random. Therefore, from the following procedure we would expect a generalization AUC of about 0.5:

- Randomize the class labels of the breast cancer dataset.
- Do a cross-validation of all values of C.
- Choose the best C by choosing the maximum value of the AUC.

Let’s repeat the above 25 times, and see what happens:

```r
set.seed(111)
auc <- replicate(25, {
  # Randomize the class labels. This should result in
  # random performance, i.e. AUC ~= 0.5
  data$Class <- sample(data$Class)
  # Cross-validate for each C in CC, and take the
  # average of the folds as the performance estimate
  auc <- sapply(cv(
    data = data,
    k = 5,
    CC = 2^seq(log2(.01), log2(10), length.out = 11)
  ), mean)
  # Take the max AUC across the different Cs
  max(auc)
})
```

In the above plot, you can see that the average observed AUC of ~0.54 is quite a bit higher than the ~0.5 you would expect from a random classifier! More specifically, 24 out of 25 AUCs are higher than 0.5, which can be verified to be statistically highly significant, for example using a one-sample Wilcoxon signed-rank test or a binomial test. Coming back to the question posed at the beginning of this section: no, taking the maximum AUC is *not* a valid approach for estimating the generalization performance of the best C. Why not? In the randomized example, what we are essentially doing (25 times) is randomly sampling 11 AUCs (probably with some covariance structure, but still random). We would expect these 11 AUCs to average out at around 0.5. Among these 11, we then take the maximum, which we would expect to be higher than 0.5. Therefore, after repeating this 25 times, we would also expect the average of the 25 maximum AUCs to be higher than 0.5. In summary: simultaneously doing model selection and performance estimation leads to a positively biased performance estimate.
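This winner's-curse effect is easy to reproduce with a toy simulation: draw 11 "AUCs" as noise around 0.5, then compare averaging over the 11 settings to taking their maximum (the noise level of 0.02 is an assumption for illustration, not the actual CV variance):

```python
import random

rng = random.Random(111)

def simulated_random_aucs(n_settings):
    # Each hyperparameter setting gets an AUC estimate that is
    # random noise around 0.5, roughly what CV on shuffled labels produces
    return [0.5 + rng.gauss(0, 0.02) for _ in range(n_settings)]

# Averaging over settings is unbiased ...
means = [sum(simulated_random_aucs(11)) / 11 for _ in range(1000)]
# ... but taking the max over settings is biased upward
maxes = [max(simulated_random_aucs(11)) for _ in range(1000)]

avg_mean = sum(means) / len(means)
avg_max = sum(maxes) / len(maxes)
```

Here `avg_mean` lands near 0.5 while `avg_max` is clearly above it, mirroring the ~0.54 observed above.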

#### Nested cross-validation

So how should we go about doing model selection and performance estimation? In principle, we could set aside some test data and use it to assess the performance once we’ve optimized C:

- Set aside some test data, not to be used in the cross-validation.
- Optimize C using cross-validation on all but the test data.
- Select the best C (the one with the highest AUC).
- Test the performance of the highest C on the test data.

This is a perfectly valid approach, except that it has the disadvantage that at no point is all data used for training. This argument should sound familiar: while it is perfectly valid to split our data into training and test data, we get a more robust performance estimate by doing a cross-validation, thereby iteratively using all data for training. Analogously, it is perfectly valid to split our data into (1) training data for training the models, (2) validation data for selecting the best model, and (3) test data for assessing the performance of the best model. However, we will get a more robust performance estimate by iteratively using all data for training. This can be achieved with something called *nested cross-validation*, which works as follows:

- Split data into k1 folds.
- For i in 1…k1:
    - For all Cs:
        - Do a k2-fold cross-validation on all data but the i-th fold.
        - Return the performance averaged across the k2 folds.
    - Select the C with the best average performance.
    - Assess the performance of this C on the i-th fold.

We can see why the above is called *nested* cross-validation: an inner cross-validation loop for model selection is nested within an outer cross-validation loop for performance estimation. As such, nested cross-validation tries to estimate the expected performance of a model where the hyperparameter C is optimized using cross-validation.

Here’s a function for doing nested cross-validation:

```r
ncv <- function(data, k, CC, seed = NULL) {
  if (!is.null(seed)) {
    set.seed(seed)
  }
  folds <- createFolds(data$Class, k = k)
  # For each fold ...
  auc <- sapply(folds, function(fold) {
    # Do a cross-validation for each C
    auc <- cv(data[-fold, ], k, CC, seed = seed)
    # Select the C with the highest AUC
    C <- CC[which.max(sapply(auc, mean))]
    # In case of ties, pick one of the best Cs at random
    C <- ifelse(length(C) > 1, sample(C, 1), C)
    # Test this C on the held-out fold
    train_and_validate(data, fold = fold, C = C)
  })
  auc
}
```

Let's first check that, using this function, we indeed get an AUC of ~0.5 on the randomized data:

```r
set.seed(111)
auc <- replicate(25, {
  cat(".")
  # Randomize the class labels. This should result in
  # random performance, i.e. AUC ~= 0.5
  data$Class <- sample(data$Class)
  # This returns k scores
  auc <- ncv(
    data = data,
    k = 5,
    CC = 2^seq(log2(.01), log2(10), length.out = 11)
  )
  # Take the average as the performance estimate
  mean(auc)
})
```

Indeed, we get random performance! Now let’s try it out on the breast cancer data:

```r
# Test on the BreastCancer data
auc <- ncv(
  data = data,
  k = 5,
  CC = 2^seq(log2(.01), log2(10), length.out = 21),
  seed = 111
)
```

We can see that nested cross-validation gives an expected performance of ~0.987, which is indeed lower than the positively biased estimate of ~0.991 we found initially. The difference may seem small, but measured as distance from the perfect classifier (AUC = 1), 0.987 is about 40% further away than 0.991 (0.013 vs. 0.009), which makes it quite a substantial difference.