Getting Started With data.table in R (6 Examples)

In this article, I’ll illustrate how to use the package data.table in R programming with several examples. For more in-depth information on data.table, I recommend taking a look at the documentation on GitHub and CRAN.

Setting up the Examples

We first install and load the package.

install.packages("data.table")                                           # Install data.table package
library("data.table")                                                    # Load data.table

We use the iris data set to demonstrate the usage of the package. The original data is stored as a data.frame, which we convert into a data.table.

data(iris)                                                               # Load iris data set
iris_DT_1 <- data.table::copy(iris)                                      # Replicate the iris data set
iris_DT_1 <- setDT(iris_DT_1)                                            # Convert to data.table
head(iris_DT_1)                                                          # Printing the data head
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa
# 3:          4.7         3.2          1.3         0.2  setosa
# 4:          4.6         3.1          1.5         0.2  setosa
# 5:          5.0         3.6          1.4         0.2  setosa
# 6:          5.4         3.9          1.7         0.4  setosa

Above, you see the first data rows of the resulting data.table iris_DT_1.
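The copy step above matters because setDT() converts its input by reference, that is, it changes the object itself instead of returning a new one. The following minimal sketch contrasts this with as.data.table(), which returns a converted copy. The objects df_example and dt_example are hypothetical names used only for this illustration.

```r
library(data.table)

# Small illustrative data.frame (not part of the iris examples)
df_example <- data.frame(x = 1:3, y = c("a", "b", "c"))

# as.data.table() returns a new data.table and leaves the input untouched
dt_example <- as.data.table(df_example)
was_data_table_before <- is.data.table(df_example)   # FALSE: still a plain data.frame
is.data.table(dt_example)                            # TRUE

# setDT() converts the object itself by reference; no assignment is needed
setDT(df_example)
is.data.table(df_example)                            # TRUE
```

If the original data.frame should stay unchanged, as.data.table() or an explicit copy, as used above, is the safer choice.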

Example 1: Choosing Specific Data Columns

There are different ways to address a column of a data.table. In the following, we display four different options.

In options 1 and 2, we use the name of the column Species to address it, either with the $ sign or via the indexing brackets [ , ]. Note that in both cases, the values of the column are returned as a vector (here, a factor).

head(iris_DT_1$Species)                                   # Option 1
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

In contrast to a data.frame, we do not need quotation marks around the column names within brackets [ , ].

head(iris_DT_1[, Species])                                # Option 2
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Alternatively, we can address a column by its index or by its name stored as a character string in another object, as shown in options 3 and 4. Both options return a data.table with a single column, not a vector.

head(iris_DT_1[, 5])                                      # Option 3
#    Species
# 1:  setosa
# 2:  setosa
# 3:  setosa
# 4:  setosa
# 5:  setosa
# 6:  setosa

Option 4 uses the special symbol .SD (short for "Subset of Data"). It might be a bit confusing when starting to work with data.table, but one gets used to it quickly. Together with the argument .SDcols, we can use it to select columns whose names are stored in other objects like name_of_selected_column.

name_of_selected_column <- "Species"
head(iris_DT_1[, .SD, .SDcols = name_of_selected_column]) # Option 4
#    Species
# 1:  setosa
# 2:  setosa
# 3:  setosa
# 4:  setosa
# 5:  setosa
# 6:  setosa

Note that the same syntax also works when name_of_selected_column is a vector containing several column names.
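As a minimal sketch of selecting several columns whose names are stored in a character vector, the code below uses .SDcols and, as an alternative, the ".." prefix of data.table. The names iris_DT_cols and selected_columns are hypothetical and only serve this illustration.

```r
library(data.table)

iris_DT_cols <- as.data.table(iris)                 # Fresh data.table for this sketch
selected_columns <- c("Species", "Petal.Length")    # Several column names in one vector

# Option 4 with several column names
sel_sd <- iris_DT_cols[, .SD, .SDcols = selected_columns]

# Alternative: the ".." prefix tells data.table to look up selected_columns
# outside the data.table instead of treating it as a column name
sel_dots <- iris_DT_cols[, ..selected_columns]

names(sel_sd)
all.equal(sel_sd, sel_dots)
```

Both calls return a data.table that contains only the requested columns.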

Example 2: Sub-Setting the iris data.table

In this example, we show how to filter a data.table for all data rows in which certain conditions hold. That is, we want to get all data rows of iris_DT_1 in which column Species takes level setosa and column Sepal.Length takes values greater than 5.

iris_DT_2 <- iris_DT_1[ Species == "setosa" & Sepal.Length > 5, ]        # Data subset
head(iris_DT_2)                                                          # Print head of the data
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          5.4         3.9          1.7         0.4  setosa
# 3:          5.4         3.7          1.5         0.2  setosa
# 4:          5.8         4.0          1.2         0.2  setosa
# 5:          5.7         4.4          1.5         0.4  setosa
# 6:          5.4         3.9          1.3         0.4  setosa
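The row filter is an ordinary logical expression on the columns, so other conditions can be written in the same way. The following sketch assumes a fresh copy of the iris data called iris_DT_rows (a hypothetical name) and shows the same filter without the trailing comma as well as a filter on several factor levels using the base R operator %in%.

```r
library(data.table)

iris_DT_rows <- as.data.table(iris)                 # Fresh data.table for this sketch

# Same filter as above; the trailing comma may be omitted when only rows are selected
setosa_long <- iris_DT_rows[Species == "setosa" & Sepal.Length > 5]

# Any logical expression on the columns works, e.g. matching several levels
two_species <- iris_DT_rows[Species %in% c("setosa", "virginica")]

nrow(setosa_long)                                   # Number of rows matching the first filter
nrow(two_species)                                   # Number of rows matching the second filter
```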

Example 3: Getting the Number of Rows for Certain Data Subsets

We can not only extract data subsets within the brackets [ , ], but also perform computations and define new columns within them. One particularly useful option is to use .N to count the number of rows for which certain conditions hold. For that, we specify the conditions in the row part of the brackets [ , ] and put .N in the column part.

iris_DT_1[ Species == "virginica" & Petal.Length >= 5.7, .N]
# [1] 19

There are 19 data rows of iris_DT_1 in which column Species takes level virginica and column Petal.Length takes values greater than or equal to 5.7.

The counts can also be used to define additional data columns. In the following, we define an additional column named n_obs as the number of rows (:= .N), counted separately for each species (by = Species). That is, all rows of the same species are assigned the same value of n_obs.

iris_DT_3 <- data.table::copy(iris_DT_1)                                 # Replicate the data
iris_DT_3[ , "n_obs" := .N, by = Species]                                # Define new column n_obs

head(iris_DT_3)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_obs
# 1:          5.1         3.5          1.4         0.2  setosa    50
# 2:          4.9         3.0          1.4         0.2  setosa    50
# 3:          4.7         3.2          1.3         0.2  setosa    50
# 4:          4.6         3.1          1.5         0.2  setosa    50
# 5:          5.0         3.6          1.4         0.2  setosa    50
# 6:          5.4         3.9          1.7         0.4  setosa    50
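Two details of the code above are worth a short sketch: without :=, the combination of .N and by returns one row per group, and with :=, the column is added by reference, which is why the example works on a copy of the data. The name iris_DT_counts is hypothetical and used only here.

```r
library(data.table)

iris_DT_counts <- as.data.table(iris)               # Fresh data.table for this sketch

# .N with by, but without :=, returns one row per group
counts_by_species <- iris_DT_counts[, .N, by = Species]
print(counts_by_species)
#       Species  N
# 1:     setosa 50
# 2: versicolor 50
# 3:  virginica 50

# := adds the column in place; no assignment with <- is needed
iris_DT_counts[, n_obs := .N, by = Species]
"n_obs" %in% names(iris_DT_counts)                  # TRUE
```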

Example 4: Calculation of Statistics for Specific Subsets of the iris Data

Within the brackets [ , ], we can not only count instances, but also apply virtually any function to a chosen set of data rows and columns. In the following example, we calculate summary statistics of the ratio of columns Petal.Length and Petal.Width for those data rows in which Species is virginica.

iris_DT_1[ Species == "virginica", summary(Petal.Length / Petal.Width)]  # Calculate summary statistics
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   2.125   2.511   2.667   2.781   3.056   4.000

Example 5: Calculation of Statistics by Groups

In this example, we calculate the mean value of column Petal.Length and the sum of column Petal.Width for each level of column Species (by = Species). When we calculate several statistics, we have to wrap them in list() so that each of them becomes a column of the result.

iris_DT_1[ , list("mean_Pet.L" = mean(Petal.Length), "sum_Petal.W" = sum(Petal.Width)), by = Species]
#       Species mean_Pet.L sum_Petal.W
# 1:     setosa      1.462        12.3
# 2: versicolor      4.260        66.3
# 3:  virginica      5.552       101.3

In the same manner, we can apply any other function to data subsets within a data.table.
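As a hedged sketch of two common variations, the code below uses .(), which data.table provides as an alias for list() inside the brackets, and applies one function to several columns per group via lapply(.SD, ...) together with .SDcols. The names iris_DT_groups and petal_columns are hypothetical.

```r
library(data.table)

iris_DT_groups <- as.data.table(iris)               # Fresh data.table for this sketch

# .() is an alias for list() inside the brackets
stats_by_species <- iris_DT_groups[, .(mean_Pet.L = mean(Petal.Length),
                                       sum_Petal.W = sum(Petal.Width)),
                                   by = Species]

# Apply the same function to several columns per group via .SD and .SDcols
petal_columns <- c("Petal.Length", "Petal.Width")
means_by_species <- iris_DT_groups[, lapply(.SD, mean),
                                   by = Species, .SDcols = petal_columns]
print(means_by_species)
```

The first call reproduces the table shown above; the second returns one row per species with the group means of both petal columns.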

Example 6: Creating Plots Within data.table

The applicability of functions within brackets [ , ] is not limited to statistics. For example, we can use it to create plots of data.table information as shown in the following.

iris_DT_1[ Sepal.Length >= 5 , plot(Petal.Length, Petal.Width, pch = 20, col = "blue") ]

Figure 1: Scatterplot of Petal.Width against Petal.Length.

The scatterplot shows the values of Petal.Length and Petal.Width for all data rows in which Sepal.Length is greater than or equal to 5.


Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.
