Comparing Object Types data.frame & data.table in R (2 Examples)
In this tutorial, I’ll use two simple examples to explain why you might prefer data.table over data.frame objects in the R programming language. In a nutshell: data.table is fast, intuitive, and easy to read.
If you want more information on data.table, we recommend reading the introduction vignette on CRAN.
Setting up the Examples
For the illustration, we have to install and load the following packages: data.table, dplyr, and rbenchmark.
install.packages("data.table")   # Install & load data.table
library("data.table")

install.packages("dplyr")        # Install & load dplyr
library("dplyr")

install.packages("rbenchmark")   # Install & load rbenchmark
library("rbenchmark")
To illustrate some differences between data.table and data.frame, we load the built-in iris data set.
data(iris)                       # Loading iris data set
head(iris)                       # Printing first data rows
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
We define two objects that hold the same iris data: a data.frame called iris_df and a data.table called iris_dt. We then check the class of each object.
iris_df <- iris                  # New object
iris_dt <- data.table(iris)      # New data.table object
class(iris_df)                   # Object class of the data
# [1] "data.frame"

class(iris_dt)                   # Object class of the data
# [1] "data.table" "data.frame"
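Because class(iris_dt) returns both "data.table" and "data.frame", a data.table can be handed to functions that expect a data.frame. The following is a minimal sketch that checks this inheritance. It uses as.data.table() as an alternative to data.table(iris); that choice and the helper names is_df, is_dt, and same_dim are assumptions for illustration and not part of the original example.

```r
library(data.table)

iris_df <- iris                      # plain data.frame
iris_dt <- as.data.table(iris)       # data.table copy of the same data

# A data.table also counts as a data.frame ...
is_df <- is.data.frame(iris_dt)

# ... but a plain data.frame is not a data.table
is_dt <- inherits(iris_df, "data.table")

# Both objects hold the same number of rows and columns
same_dim <- identical(dim(iris_df), dim(iris_dt))

print(c(is_df = is_df, is_dt = is_dt, same_dim = same_dim))
```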
Example 1: Comparing the Language of data.frame and data.table
We want to calculate group sums of one variable (Petal.Width) for a subset of the data rows (those with Sepal.Width <= 3.5). This can be done in various ways using the data.frame and data.table syntax.
We start with the pipe syntax of dplyr. It is quite intuitive and can be used for both a data.frame and a data.table.
iris_df %>%                      # Calculating the sum of different groups
  group_by(Species) %>%
  filter(Sepal.Width <= 3.5) %>%
  summarize(sum(Petal.Width))
# # A tibble: 3 × 2
#   Species    `sum(Petal.Width)`
#   <fct>                   <dbl>
# 1 setosa                    8.1
# 2 versicolor               66.3
# 3 virginica                94.6
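To illustrate the remark that the dplyr pipe also accepts a data.table, here is a minimal sketch that runs the same pipeline on a data.table and names the summary column. The column name Petal.Width.Sum and the object names res_pipe and expected are assumptions chosen for this illustration.

```r
library(data.table)
library(dplyr)

iris_dt <- as.data.table(iris)       # data.table version of iris

# Same pipeline as above, applied to the data.table,
# with an explicit name for the summary column
res_pipe <- iris_dt %>%
  group_by(Species) %>%
  filter(Sepal.Width <= 3.5) %>%
  summarize(Petal.Width.Sum = sum(Petal.Width))

# Base R cross-check of the group sums
expected <- with(subset(iris, Sepal.Width <= 3.5),
                 tapply(Petal.Width, Species, sum))

print(res_pipe)
```

Note that the result of summarize is a tibble, so the output is no longer a data.table.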
As an alternative, we can also use the sapply function.
sapply(unique(iris_df$Species),  # Calculating the sum of different groups
       function (x){
         sum(iris_df[iris_df$Sepal.Width <= 3.5 & iris_df$Species == x, "Petal.Width"])
       })
# [1]  8.1 66.3 94.6
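The sapply result is an unnamed vector, so the group labels are lost. As a point of comparison, base R also provides aggregate(), which keeps the labels in a data.frame. This variant is not part of the original example; it is a sketch of another data.frame approach, and the object names res_sapply and res_agg are assumptions.

```r
iris_df <- iris

# sapply version from above, stored for comparison
res_sapply <- sapply(unique(iris_df$Species),
                     function(x) {
                       sum(iris_df[iris_df$Sepal.Width <= 3.5 & iris_df$Species == x,
                                   "Petal.Width"])
                     })

# aggregate() version: filter the rows first, then sum Petal.Width by Species
res_agg <- aggregate(Petal.Width ~ Species,
                     data = subset(iris_df, Sepal.Width <= 3.5),
                     FUN = sum)

print(res_agg)
```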
Finally, we can use the data.table syntax, which is both short and intuitive. Remember, the general form is data.table[chosen rows, chosen action, by argument]: the first argument selects the rows, the second states what to compute, and the third defines the groups.
iris_dt[Sepal.Width <= 3.5, sum(Petal.Width), Species]   # Calculating the sum of different groups
#       Species   V1
# 1:     setosa  8.1
# 2: versicolor 66.3
# 3:  virginica 94.6
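In the output above, data.table assigns the default name V1 to the computed column. A minimal sketch of the same query with a named result column and an explicit by argument could look as follows; the column name Petal.Width.Sum and the object names res_dt and expected are assumptions chosen for this illustration.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# rows:   Sepal.Width <= 3.5
# action: sum of Petal.Width, wrapped in .() to name the result column
# by:     one result row per Species
res_dt <- iris_dt[Sepal.Width <= 3.5,
                  .(Petal.Width.Sum = sum(Petal.Width)),
                  by = Species]

# Base R cross-check of the group sums
expected <- with(subset(iris, Sepal.Width <= 3.5),
                 tapply(Petal.Width, Species, sum))

print(res_dt)
```

Naming the column inside .() avoids having to rename V1 afterwards, and the result is still a data.table.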
Example 2: Comparing the Computation Time
The data.table syntax is handy not only because of its readability, but also because of its efficiency. To see this, let us compare the computation time of the three code snippets above.
benchmark("Example_1" = iris_df %>%                       # Comparing computation time
            group_by(Species) %>%
            filter(Sepal.Width <= 3.5) %>%
            summarize(sum(Petal.Width)),
          "Example_2" = sapply(unique(iris_df$Species),
                               function (x){
                                 sum(iris_df[iris_df$Sepal.Width <= 3.5 & iris_df$Species == x, "Petal.Width"])
                               }),
          "Example_3" = iris_dt[Sepal.Width <= 3.5, sum(Petal.Width), Species],
          replications = 100
)[,1:6]
#        test replications elapsed relative user.self sys.self
# 1 Example_1          100    1.40   23.333      1.35     0.07
# 2 Example_2          100    0.06    1.000      0.06     0.00
# 3 Example_3          100    0.14    2.333      0.14     0.00
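As you can see, sapply and data.table are both much faster than the dplyr pipe: for 100 replications, sapply takes 0.06 seconds, data.table takes 0.14 seconds, and dplyr takes 1.40 seconds. On this small data set (iris has only 150 rows), sapply is even the fastest of the three. For larger data, the data.table syntax becomes faster than sapply.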
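The statement about larger data is not covered by the benchmark above. One way to check it on your own machine is the following sketch, which times sapply and data.table on a synthetic data set with one million rows. The data set, the column names group, x, and y, and the use of system.time instead of rbenchmark are assumptions made for this illustration. Timings depend on hardware and package versions, so no expected timings are shown.

```r
library(data.table)

set.seed(1)                               # reproducible synthetic data
n <- 1e6                                  # number of rows (assumption)

big_df <- data.frame(group = sample(letters, n, replace = TRUE),
                     x     = runif(n),
                     y     = runif(n),
                     stringsAsFactors = FALSE)
big_dt <- as.data.table(big_df)

# sapply approach: one full pass over the data per group
time_sapply <- system.time(
  res_sapply <- sapply(unique(big_df$group),
                       function(g) {
                         sum(big_df[big_df$x <= 0.5 & big_df$group == g, "y"])
                       })
)

# data.table approach: filter, sum, and group in one call
time_dt <- system.time(
  res_dt <- big_dt[x <= 0.5, .(y_sum = sum(y)), by = group]
)

# Elapsed seconds of both approaches
print(c(sapply     = time_sapply[["elapsed"]],
        data.table = time_dt[["elapsed"]]))
```

Both approaches return the same group sums, so only the elapsed times differ. Increasing n or the number of groups lets you see how the gap changes with the data size.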
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.