Comparing Object Types data.frame & data.table in R (2 Examples)

In this tutorial, I’ll use two simple examples to explain why you should use data.table instead of data.frame objects in the R programming language. In a nutshell: data.table is fast, intuitive, and easy to read.

If you want more information on data.table, we recommend reading the introduction post on CRAN.

Setting up the Examples

For the illustration, we have to install and load the following packages: data.table, dplyr, and rbenchmark.

install.packages("data.table")     # Install & load data.table
library("data.table")
 
install.packages("dplyr")          # Install & load dplyr
library("dplyr")
 
install.packages("rbenchmark")     # Install & load rbenchmark
library("rbenchmark")

To illustrate some differences between data.table and data.frame, we load the built-in iris data set.

data(iris)                         # Loading iris data set
head(iris)                         # Printing first data rows
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

We define two objects with the same iris data: a data.frame iris_df and a data.table iris_dt.

iris_df <- iris                    # New object
iris_dt <- data.table(iris)        # New data.table object
class(iris_df)                     # Object class of the data
# [1] "data.frame"
class(iris_dt)                     # Object class of the data
# [1] "data.table" "data.frame"
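
The two class labels printed for iris_dt indicate that a data.table is also a data.frame, so functions written for data.frames generally still accept it. A minimal sketch of that idea, assuming data.table is installed; the objects df_small and dt_small are made up for illustration and are not part of the tutorial data:

```r
library(data.table)

# Hypothetical toy data, only for illustration
df_small <- data.frame(id = 1:3, value = c(10, 20, 30))
dt_small <- as.data.table(df_small)   # converted copy as a data.table

is.data.frame(dt_small)               # checks the data.frame inheritance
inherits(dt_small, "data.table")      # checks the data.table class

# Base R functions that expect a data.frame can still be applied
nrow(dt_small)
colMeans(dt_small)
```

as.data.table() returns a converted copy, so df_small itself stays a plain data.frame.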

Example 1: Comparing the Language of data.frame and data.table

We want to calculate group sums of a variable for a subset of the data rows. This can be done in various ways using the data.frame and data.table syntax.

We start with the pipe-based syntax of dplyr. It is quite intuitive and can be used for both a data.frame and a data.table.

iris_df %>% # Calculating the sum of different groups
  group_by(Species) %>%
  filter(Sepal.Width <= 3.5) %>%
  summarize(sum(Petal.Width))
 
# # A tibble: 3 × 2
# Species      `sum(Petal.Width)`
# <fct>                     <dbl>
# 1 setosa                    8.1
# 2 versicolor               66.3
# 3 virginica                94.6
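
Since the same dplyr pipeline can also be applied to a data.table, the following sketch runs it on a data.table version of iris. It assumes data.table and dplyr are installed; the column name sum_petal_width and the object res_dplyr are chosen here for illustration only:

```r
library(data.table)
library(dplyr)

iris_dt <- data.table(iris)   # same construction as in the tutorial

# Same verbs as above, applied to the data.table; the summary column
# gets an explicit (illustrative) name
res_dplyr <- iris_dt %>%
  group_by(Species) %>%
  filter(Sepal.Width <= 3.5) %>%
  summarize(sum_petal_width = sum(Petal.Width))

class(res_dplyr)              # inspect which class dplyr returns
res_dplyr
```

Naming the summary column avoids the backticked `sum(Petal.Width)` header shown in the output above.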

As an alternative, we can also use the sapply function.

sapply(unique(iris_df$Species),    # Calculating the sum of different groups
       function(x) {
         sum(iris_df[iris_df$Sepal.Width <= 3.5 & iris_df$Species == x, "Petal.Width"])
       })
# [1]  8.1 66.3 94.6
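
The sapply result above carries no group labels, because sapply only uses its input as names when that input is a character vector. A small sketch of one way to get a labeled result, using base R only; the names groups and sums_named are introduced here for illustration:

```r
# Convert the factor levels to character so that sapply uses them as names
groups <- as.character(unique(iris$Species))

sums_named <- sapply(groups, function(x) {
  sum(iris[iris$Sepal.Width <= 3.5 & iris$Species == x, "Petal.Width"])
})

sums_named   # named numeric vector, one element per species
```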

Finally, we can use the data.table syntax, which is both short and intuitive. Remember, the general form of the data.table syntax is data.table[chosen rows, chosen action, by argument].

iris_dt[Sepal.Width <= 3.5, sum(Petal.Width), Species] # Calculating the sum of different groups
#       Species   V1
# 1:     setosa  8.1
# 2: versicolor 66.3
# 3:  virginica 94.6
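
The three positions of the data.table syntax can also be written out more explicitly. The sketch below is one possible variant, assuming data.table is installed: it names the by argument and wraps the action in .() so that the result column gets a descriptive name instead of the default V1. The column name sum_petal_width and the object res are illustrative choices:

```r
library(data.table)

iris_dt <- data.table(iris)   # same construction as in the tutorial

# i:  keep rows with Sepal.Width <= 3.5
# j:  compute a named summary column
# by: group the computation by Species
res <- iris_dt[Sepal.Width <= 3.5,
               .(sum_petal_width = sum(Petal.Width)),
               by = Species]
res
```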

Example 2: Comparing the Computation Time

The data.table syntax is handy not only because of its readability, but also because of its efficiency. To see this, let us compare the time that each of the three code snippets above takes.

benchmark("Example_1" = iris_df %>% # Comparing computation time
            group_by(Species) %>%
            filter(Sepal.Width <= 3.5) %>%
            summarize(sum(Petal.Width)),
          "Example_2" = sapply(unique(iris_df$Species),
                               function (x){
                                 sum(iris_df[iris_df$Sepal.Width <= 3.5 & iris_df$Species == x, "Petal.Width"])
                               }),
          "Example_3" = iris_dt[Sepal.Width <= 3.5, sum(Petal.Width), Species],
          replications = 100
)[,1:6]
 
#        test replications elapsed relative user.self sys.self
# 1 Example_1          100    1.40   23.333      1.35     0.07
# 2 Example_2          100    0.06    1.000      0.06     0.00
# 3 Example_3          100    0.14    2.333      0.14     0.00

As you can see, sapply (0.06 seconds for 100 replications) and data.table (0.14 seconds) are close to each other, and both are much faster than the dplyr pipeline (1.40 seconds). For larger data, the data.table syntax becomes faster than sapply.
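
The statement about larger data can be checked on your own machine. The following is only a rough sketch under stated assumptions: the data are simulated, the size n and the names big_df, big_dt, group, x, and y are made up for illustration, and base R's system.time is used instead of rbenchmark to keep the example short. Timings depend on hardware and package versions, so none are shown here.

```r
library(data.table)

set.seed(1)                 # reproducible simulated data
n <- 1e6                    # assumed size; adjust to your machine

big_df <- data.frame(
  group = sample(c("a", "b", "c"), n, replace = TRUE),
  x     = runif(n),
  y     = runif(n)
)
big_dt <- as.data.table(big_df)

# sapply approach on the data.frame
time_sapply <- system.time(
  res_sapply <- sapply(sort(unique(big_df$group)), function(g) {
    sum(big_df[big_df$x <= 0.5 & big_df$group == g, "y"])
  })
)

# data.table approach; keyby returns the groups in sorted order
time_dt <- system.time(
  res_dt <- big_dt[x <= 0.5, .(sum_y = sum(y)), keyby = group]
)

time_sapply["elapsed"]      # elapsed seconds for sapply
time_dt["elapsed"]          # elapsed seconds for data.table

# Both approaches should give the same group sums
all.equal(unname(res_sapply), res_dt$sum_y)
```

A single system.time call is noisy, so repeating the comparison several times (for example with benchmark as above) gives a more reliable picture.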

 

Anna-Lena Wölwer R Programming & Survey Statistics

Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.
