Applying Function to Multiple Columns of data.table in R (4 Examples)
In this R tutorial, you’ll learn how to use lapply() to apply a function to several columns of a data.table.
Setting up the Examples
Start by installing and loading the data.table package. We also have an overview post of data.table here. You can find the GitLab repository of data.table here.
install.packages("data.table")   # Install data.table
library("data.table")            # Load data.table
Take the iris data as an example dataset.
data(iris)                       # Loading iris data set
head(iris)                       # Printing head of data
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
Copy the data and convert it to the data.table format.
iris_DT <- data.table(data.table::copy(iris))   # Copying data as data.table
Example 1: Calculating the Sum Values of Multiple Variables
In this example, we use lapply to apply the sum() function to multiple data.table columns. We do this with .SD and .SDcols: .SD stands for the subset of the data that contains the columns listed in .SDcols, and lapply applies the function to each of these columns in turn.
iris_DT[ , lapply(.SD, sum),
         .SDcols = c("Sepal.Length", "Petal.Length")]   # Calculating sum values
#    Sepal.Length Petal.Length
# 1:        876.5        563.7
The previous code returns the sums of Sepal.Length and Petal.Length, one value per column.
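To see what .SD contains, it can help to inspect it directly. The following is a minimal sketch on a fresh copy of iris; the names dt, len_cols, sd_names and len_sums are chosen here for illustration only. It builds the column selection with base R's grep() and confirms that .SD holds exactly those columns.

```r
library(data.table)

dt <- as.data.table(iris)   # illustrative copy; 'dt' is a hypothetical name

# Build the column selection programmatically instead of typing it out
len_cols <- grep("Length$", names(dt), value = TRUE)

# .SD is itself a data.table restricted to the columns given in .SDcols
sd_names <- dt[ , names(.SD), .SDcols = len_cols]

# The same lapply() call as before, now driven by the computed selection
len_sums <- dt[ , lapply(.SD, sum), .SDcols = len_cols]
```

Computing the selection this way can be convenient when many columns share a naming pattern.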
Example 2: Calculating the Sum Values of Multiple Variables by Groups
Now, we go one step further by calculating the sum of both variables for each category of column Species. For that, we simply add the “, by =” argument to the previous code as follows.
iris_DT[ , lapply(.SD, sum),
         by = .(Species),
         .SDcols = c("Sepal.Length", "Petal.Length")]   # Calculating group sums
#       Species Sepal.Length Petal.Length
# 1:     setosa        250.3         73.1
# 2: versicolor        296.8        213.0
# 3:  virginica        329.4        277.6
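The grouping variable can be written in more than one way. As a small sketch (all object names below are illustrative), by = "Species" should give the same result as by = .(Species), and a single group can be cross-checked against a plain base R subset.

```r
library(data.table)

dt   <- as.data.table(iris)                 # illustrative copy of the data
cols <- c("Sepal.Length", "Petal.Length")

# Grouping column given as an unquoted name inside .()
sums_list <- dt[ , lapply(.SD, sum), by = .(Species), .SDcols = cols]

# Grouping column given as a character string
sums_chr  <- dt[ , lapply(.SD, sum), by = "Species", .SDcols = cols]

# Cross-check one group against a manual subset of the original data
setosa_manual <- sum(iris$Sepal.Length[iris$Species == "setosa"])
```

The character form is handy when the grouping column name is stored in a variable.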
Example 3: Using Self-Defined Functions
Often, we want to pass additional function arguments or apply self-defined functions. We can easily add a self-defined function as follows.
iris_DT[ , lapply(.SD, function(x) { sum(sqrt(x) / 2) }),
         .SDcols = c("Sepal.Length", "Petal.Length")]   # Applying arbitrary function
#    Sepal.Length Petal.Length
# 1:     180.8488     140.5313
For each variable x, we calculated the sum of half the square roots of its values.
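Additional arguments for the applied function can be passed to lapply() after the function name. The sketch below assumes a copy of the data with one missing value inserted artificially (this NA is not part of iris) and forwards na.rm = TRUE to sum(). The object names are illustrative.

```r
library(data.table)

dt <- as.data.table(iris)
dt[1, Sepal.Length := NA_real_]   # artificial NA, for illustration only

cols <- c("Sepal.Length", "Petal.Length")

# Without na.rm, the sum of a column containing NA is NA
sums_with_na <- dt[ , lapply(.SD, sum), .SDcols = cols]

# Arguments after the function name are forwarded to sum()
sums_na_rm <- dt[ , lapply(.SD, sum, na.rm = TRUE), .SDcols = cols]
```

The same pattern works for other functions that take extra arguments, for example mean() with na.rm.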
Example 4: Defining New Columns
Furthermore, we can use lapply together with the definition of new columns, as shown in the following example.
iris_DT <- iris_DT[ , c("Sepal.Length_new", "Petal.Length_new") :=
                      lapply(.SD, function(x) { 4*x + 2 }),
                    .SDcols = c("Sepal.Length", "Petal.Length")]   # Defining new variables
head(iris_DT)                                                      # Printing data head
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_new
# 1:          5.1         3.5          1.4         0.2  setosa             22.4
# 2:          4.9         3.0          1.4         0.2  setosa             21.6
# 3:          4.7         3.2          1.3         0.2  setosa             20.8
# 4:          4.6         3.1          1.5         0.2  setosa             20.4
# 5:          5.0         3.6          1.4         0.2  setosa             22.0
# 6:          5.4         3.9          1.7         0.4  setosa             23.6
#    Petal.Length_new
# 1:              7.6
# 2:              7.6
# 3:              7.2
# 4:              8.0
# 5:              7.6
# 6:              8.8
For the new columns, we need to specify the column names on the left-hand side. The “:=” operator indicates that the new columns are defined by the function on the right-hand side.
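Because := modifies a data.table by reference, the result does not have to be assigned back to the object. A minimal sketch, using the hypothetical names dt, cols and new_cols: the new column names are generated with paste0() and wrapped in parentheses so that data.table treats them as a vector of names rather than as a single column called new_cols.

```r
library(data.table)

dt <- as.data.table(iris)   # illustrative copy of the data

cols     <- c("Sepal.Length", "Petal.Length")
new_cols <- paste0(cols, "_new")   # "Sepal.Length_new", "Petal.Length_new"

# (new_cols) is evaluated to the names it holds; dt is updated in place
dt[ , (new_cols) := lapply(.SD, function(x) 4 * x + 2), .SDcols = cols]

head(dt)   # the two new columns are appended on the right
```

Generating the names from cols keeps the input and output columns in sync when the selection changes.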
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.