Conducting Cross-Validation With k Folds in R (Example Code)

In this article, you’ll learn how to conduct cross-validation with k-folds in the R programming language.

Setting up the Example

We use three packages: data.table, class, and caret.

install.packages("data.table")                                    # Install data.table package
library("data.table")  
 
install.packages("class")                                         # Install class package
library("class")  
 
install.packages("caret")                                         # Install caret package
library("caret")

For an illustrative example of cross-validation, we use the built-in iris dataset.

data(iris)                                                             # Load iris data set
head(iris)                                                             # Print head of data
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

prop.table(table(iris$Species)) * 100                                  # Proportions of species in iris
#     setosa versicolor  virginica 
#   33.33333   33.33333   33.33333

In the iris data, the three species are balanced, each accounting for 1/3 of the total number of observations.

Example: Cross-Validation With k Folds

We want to implement a k-nearest neighbors (kNN) algorithm for classifying new observations as one of the three iris species. For general information on kNN, see this post from RPubs.

In kNN, we have to choose the number of nearest neighbors k. We can use k-fold cross-validation to estimate how well kNN performs for different choices of k and thereby decide which value to use. Note that the k in kNN (the number of neighbors) is distinct from the k in k-fold cross-validation (the number of folds).

nr_neighbors <- c(1, 5, 10)                                            # Setting the number of neighbors to consider

Since kNN is based on distances between observations, we normalize the four numeric features to the range [0, 1] via min-max normalization.

f_norm <- function (x) { ( x - min(x) ) / ( max(x) - min(x) ) }        # Function for normalizing data
iris_X <- apply(iris[, 1:4], 2, f_norm )                               # Normalizing the data
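
As a quick check, each normalized column should now range from 0 to 1:

apply(iris_X, 2, range)                                                # Check min and max of each column
#      Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,]            0           0            0           0
# [2,]            1           1            1           1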

For the cross-validation, we decide to go with 6 folds.

nr_folds <- 6                                                          # Choosing the number of folds for cross-validation

The iris data is randomly partitioned into 6 folds of equal size.

data_folds <- sample(rep(1:nr_folds, each = nrow(iris)/nr_folds), nrow(iris)) # Creating data folds
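
If you want reproducible folds, call set.seed() before the sample() line above. Since 150 / 6 = 25, each fold contains exactly 25 observations:

table(data_folds)                                                      # Check fold sizes
# data_folds
#  1  2  3  4  5  6 
# 25 25 25 25 25 25

As a side note, if you prefer folds that preserve the species proportions within each fold, the caret package offers a stratified alternative (a sketch; the name folds_strat is our own):

folds_strat <- caret::createFolds(iris$Species, k = nr_folds, list = FALSE)   # Stratified fold assignment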

Now we conduct the cross-validation. In each iteration, one fold serves as validation data and the remaining 5 folds as training data. We predict the species of the validation observations via kNN and compare the predicted classes with the actual classes in the validation data. For the evaluation, we calculate the F1 score for each of the three species classes in iris. For information on the F1 score, see e.g. Aggarwal (2015), Data Mining, Chapter 10.

result_cv <- lapply(1:nr_folds,                                        # 6-fold cross-validation
                    function (k) {
 
                      IDs_for_training   <- (1:nrow(iris))[data_folds != k]   # IDs for training data
                      IDs_for_validation <- (1:nrow(iris))[data_folds == k]   # IDs for validation data
                      
                      results_choices_neighbors <- sapply(nr_neighbors,       # Calculating kNN with "nr_neighbors" neighbors
                                                          function (i) {
 
                                                            kNN_out <- class::knn(train = iris_X[IDs_for_training, ],
                                                                                  test  = iris_X[IDs_for_validation, ],
                                                                                  cl    = iris$Species[IDs_for_training],
                                                                                  k     = i)
 
                                                            confMat <- caret::confusionMatrix(reference = iris$Species[IDs_for_validation],  # Model performance
                                                                                              data      = kNN_out)
                                                            confMat$byClass[,"F1"]   # Return per-class F1 scores
                                                            
                                                          })
                      colnames(results_choices_neighbors) <- paste0(nr_neighbors, "_neighbors")
                      results_choices_neighbors
                    })
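
In case you want to see what is behind the F1 values returned by confusionMatrix(): the F1 score of a class is the harmonic mean of precision and recall. The following helper is a minimal sketch of this calculation from a raw confusion table (the name f1_by_hand is our own, not part of caret); it assumes predictions in the rows and reference classes in the columns, as produced by table(predicted, actual):

f1_by_hand <- function (tab) {                                         # Per-class F1 from a confusion table
  TP        <- diag(tab)                                               # True positives on the diagonal
  precision <- TP / rowSums(tab)                                       # TP / (TP + FP)
  recall    <- TP / colSums(tab)                                       # TP / (TP + FN)
  2 * precision * recall / (precision + recall)                        # Harmonic mean of precision and recall
}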

We calculate the mean of the species-specific F1 scores over the folds.

round(Reduce('+', result_cv) / nr_folds, digits = 3)                   # Averaging the results over the 6 folds
#                   1_neighbors 5_neighbors 10_neighbors
# Class: setosa           1.000       1.000        1.000
# Class: versicolor       0.934       0.949        0.940
# Class: virginica        0.922       0.929        0.937
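
The Reduce('+', result_cv) call sums the six F1 matrices element-wise before the division by nr_folds. A toy example illustrates the mechanic:

Reduce('+', list(matrix(1:4, 2), matrix(5:8, 2)))                      # Element-wise sum of two matrices
#      [,1] [,2]
# [1,]    6   10
# [2,]    8   12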

You can see the estimated F1 scores for kNN with 1, 5, and 10 neighbors. There is no clear winner in this example. Try a different setting, e.g. another number of neighbors or another number of folds!
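
To make such experiments convenient, you could wrap the whole procedure in a function. The following is a minimal sketch (the function name cv_knn_f1 and its defaults are our own choices, not from the code above):

cv_knn_f1 <- function (X, y, neighbors = c(1, 5, 10), folds = 6) {     # Hypothetical CV wrapper
  fold_id <- sample(rep(1:folds, length.out = nrow(X)))                # Randomly assign observations to folds
  res <- lapply(1:folds, function (k) {
    tr <- fold_id != k                                                 # TRUE for training observations
    sapply(neighbors, function (i) {
      pred <- class::knn(train = X[tr, ], test = X[!tr, ],
                         cl = y[tr], k = i)
      caret::confusionMatrix(data = pred,
                             reference = y[!tr])$byClass[, "F1"]       # Per-class F1 scores
    })
  })
  out <- Reduce('+', res) / folds                                      # Average F1 over the folds
  colnames(out) <- paste0(neighbors, "_neighbors")
  out
}
 
round(cv_knn_f1(iris_X, iris$Species, neighbors = c(3, 7), folds = 5), 3)   # Example call with other settings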

 

Anna-Lena Wölwer, R Programming & Survey Statistics

Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.
