Getting Started With data.table in R (6 Examples)
In this article, I’ll illustrate how to use the package data.table in R programming with several examples. For more in-depth information on data.tables, we recommend taking a look at the documentation on GitHub and CRAN.
Setting up the Examples
We first install and load the package.
install.packages("data.table")              # Install data.table package
library("data.table")                       # Load data.table
We use the iris data set to demonstrate the usage of the package. The original data is a data.frame, which we convert into a data.table.
data(iris)                                  # Load iris data set
iris_DT_1 <- data.table::copy(iris)         # Replicate the iris data set
iris_DT_1 <- setDT(iris_DT_1)               # Convert to data.table
head(iris_DT_1)                             # Printing the data head
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa
# 3:          4.7         3.2          1.3         0.2  setosa
# 4:          4.6         3.1          1.5         0.2  setosa
# 5:          5.0         3.6          1.4         0.2  setosa
# 6:          5.4         3.9          1.7         0.4  setosa
Above, you see the first data rows of the resulting data.table iris_DT_1.
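As a side note on the conversion step above, the following minimal sketch contrasts setDT(), which converts an object in place, with as.data.table(), which returns a new object. The names dt_new and df_copy are illustrative and not used elsewhere in this article.

```r
library(data.table)

# as.data.table() returns a new data.table and leaves its input untouched
dt_new <- as.data.table(iris)
class(iris)                        # iris itself is still a plain data.frame

# setDT() converts its argument in place (by reference),
# so we apply it to a copy rather than to the built-in data set
df_copy <- data.table::copy(iris)
setDT(df_copy)
class(df_copy)                     # now carries the data.table class as well
```

Which of the two is preferable probably depends on whether the original data.frame is still needed afterwards.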
Example 1: Choosing Specific Data Columns
There are different ways to address a column of a data.table. In the following, we display four different options.
In options 1 and 2, we use the name of the column Species to address it, either with the $ sign or via the indexing brackets [ , ]. Note that in both cases, the values of the column are returned as a vector.
head(iris_DT_1$Species)                     # Option 1
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
In contrast to a data.frame, we do not need quotation marks around the column names within brackets [ , ].
head(iris_DT_1[, Species])                  # Option 2
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica
Alternatively, we can address a column by its index or by its name stored as a character string in another object, as shown in options 3 and 4. Both options return a data.table with a single column, not a vector.
head(iris_DT_1[, 5])                        # Option 3
#    Species
# 1:  setosa
# 2:  setosa
# 3:  setosa
# 4:  setosa
# 5:  setosa
# 6:  setosa
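Option 4 relies on the special symbol .SD, which stands for the subset of the data. It might be a bit confusing when starting to work with data.table, but one gets used to it quickly; more info is available in the data.table documentation. Together with the argument .SDcols, we can simply use it to index columns when their names are stored in other objects like name_of_selected_column.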
name_of_selected_column <- "Species"
head(iris_DT_1[, .SD, .SDcols = name_of_selected_column])   # Option 4
#    Species
# 1:  setosa
# 2:  setosa
# 3:  setosa
# 4:  setosa
# 5:  setosa
# 6:  setosa
Note that the same approach works when name_of_selected_column is a vector containing several column names.
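To illustrate the case of several column names, here is a minimal sketch. The vector selected_columns is a hypothetical name chosen for this illustration; the ".." prefix shown alongside is an alternative way that data.table offers to refer to a variable defined outside the table.

```r
library(data.table)

iris_DT <- as.data.table(iris)

# Hypothetical vector holding more than one column name
selected_columns <- c("Sepal.Length", "Species")

# .SDcols accepts the whole character vector
res_sd <- iris_DT[, .SD, .SDcols = selected_columns]

# The ".." prefix looks the variable up outside the data.table
res_dots <- iris_DT[, ..selected_columns]

names(res_sd)
all.equal(res_sd, res_dots)        # both approaches select the same columns
```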
Example 2: Sub-Setting the iris data.table
In this example, we show how to filter all data rows of a data.table for which certain conditions hold. That is, we want to get all data rows of iris_DT_1 in which column Species takes level setosa and column Sepal.Length takes values greater than 5.
iris_DT_2 <- iris_DT_1[ Species == "setosa" & Sepal.Length > 5, ]   # Data subset
head(iris_DT_2)                             # Print head of the data
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          5.4         3.9          1.7         0.4  setosa
# 3:          5.4         3.7          1.5         0.2  setosa
# 4:          5.8         4.0          1.2         0.2  setosa
# 5:          5.7         4.4          1.5         0.4  setosa
# 6:          5.4         3.9          1.3         0.4  setosa
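The row filter above can also be written without the trailing comma, and it may help to compare it with the base R equivalent. The following is a minimal sketch with illustrative object names (subset_no_comma, subset_comma, subset_base) that are not part of the original example.

```r
library(data.table)

iris_DT <- as.data.table(iris)

# With only a row condition, the trailing comma can be dropped
subset_no_comma <- iris_DT[Species == "setosa" & Sepal.Length > 5]
subset_comma    <- iris_DT[Species == "setosa" & Sepal.Length > 5, ]

# Base R equivalent on the original data.frame, for comparison;
# here the data set name has to be repeated and the comma is required
subset_base <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5, ]

all.equal(subset_no_comma, subset_comma)
nrow(subset_no_comma) == nrow(subset_base)
```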
Example 3: Getting the Number of Rows for Certain Data Subsets
We can not only extract data subsets within the brackets [ , ], but also perform computations and define new columns within them. One particularly useful option is to use .N to count the number of rows for which certain conditions hold. For that, we specify the row conditions in the first argument of the brackets [ , ] and put .N in the second.
iris_DT_1[ Species == "virginica" & Petal.Length >= 5.7, .N]
# [1] 19
There are 19 data rows of iris_DT_1 in which column Species takes level virginica and column Petal.Length takes values greater than or equal to 5.7.
The counts can also be used to define additional data columns. In the following, we define an additional column named n_obs. It is defined as the number of rows (:= .N), where the rows are counted separately for each species (, by = Species). That is, the entries of the same species are all assigned the same value of n_obs.
iris_DT_3 <- data.table::copy(iris_DT_1)    # Replicate the data
iris_DT_3[ , "n_obs" := .N, by = Species]   # Define new column n_obs
head(iris_DT_3)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species n_obs
# 1:          5.1         3.5          1.4         0.2  setosa    50
# 2:          4.9         3.0          1.4         0.2  setosa    50
# 3:          4.7         3.2          1.3         0.2  setosa    50
# 4:          4.6         3.1          1.5         0.2  setosa    50
# 5:          5.0         3.6          1.4         0.2  setosa    50
# 6:          5.4         3.9          1.7         0.4  setosa    50
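The example above copies the data before adding the column because := modifies a data.table in place. The following minimal sketch shows this, along with a grouped count that does not add a column at all. The names counts_per_species and iris_with_n are illustrative.

```r
library(data.table)

iris_DT <- as.data.table(iris)

# .N with by = ... returns one row per group; no column is added
counts_per_species <- iris_DT[, .N, by = Species]
counts_per_species

# := changes the table it is called on, so we work on a copy
iris_with_n <- copy(iris_DT)
iris_with_n[, n_obs := .N, by = Species]

"n_obs" %in% names(iris_with_n)    # the copy has the new column
"n_obs" %in% names(iris_DT)        # the original table is unchanged
```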
Example 4: Calculation of Statistics for Specific Subsets of the iris Data
Within the brackets [ , ], we can not only count instances, but apply basically any desired function to a chosen set of data rows and columns. In the following example, we calculate the summary statistics of the quotient of columns Petal.Length and Petal.Width for those data rows for which Species is virginica.
iris_DT_1[ Species == "virginica", summary(Petal.Length / Petal.Width)]   # Calculate summary statistics
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   2.125   2.511   2.667   2.781   3.056   4.000
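If the statistics are needed as columns of a data.table rather than as a printed summary, one possible variant is to name them explicitly inside .(), which is shorthand for list(). This is a minimal sketch; the column names min_ratio, mean_ratio, and max_ratio are made up for this illustration.

```r
library(data.table)

iris_DT <- as.data.table(iris)

# Named statistics of the ratio, returned as a one-row data.table
ratio_stats <- iris_DT[Species == "virginica",
                       .(min_ratio  = min(Petal.Length / Petal.Width),
                         mean_ratio = mean(Petal.Length / Petal.Width),
                         max_ratio  = max(Petal.Length / Petal.Width))]
ratio_stats
```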
Example 5: Calculation of Statistics by Groups
In this example, we calculate the mean value of column Petal.Length and the sum of column Petal.Width for the different levels of column Species (by = Species). When we calculate several statistics at once, we have to wrap them in list().
iris_DT_1[ , list("mean_Pet.L"  = mean(Petal.Length),
                  "sum_Petal.W" = sum(Petal.Width)),
           by = Species]
#       Species mean_Pet.L sum_Petal.W
# 1:     setosa      1.462        12.3
# 2: versicolor      4.260        66.3
# 3:  virginica      5.552       101.3
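When the same function should be applied to several columns per group, a common pattern is to combine lapply() with .SD and .SDcols. The sketch below assumes we want group means of both petal columns; petal_columns, petal_means, and petal_stats are illustrative names. It also shows .() as shorthand for list().

```r
library(data.table)

iris_DT <- as.data.table(iris)

# Apply mean() to every column listed in .SDcols, separately per species
petal_columns <- c("Petal.Length", "Petal.Width")
petal_means <- iris_DT[, lapply(.SD, mean), by = Species, .SDcols = petal_columns]
petal_means

# .() is shorthand for list(); this mirrors the grouped call shown above
petal_stats <- iris_DT[, .(mean_Pet.L  = mean(Petal.Length),
                           sum_Petal.W = sum(Petal.Width)),
                       by = Species]
petal_stats
```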
In the same manner, we can apply any other function to data subsets within a data.table.
Example 6: Creating Plots Within data.table
The applicability of functions within brackets [ , ] is not limited to statistics. For example, we can use it to create plots of data.table information as shown in the following.
iris_DT_1[ Sepal.Length >= 5,
           plot(Petal.Length, Petal.Width, pch = 20, col = "blue") ]
The scatterplot shows the values of Petal.Length and Petal.Width for all those data rows in which Sepal.Length is greater than or equal to 5.
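Since plot() returns NULL, calling it inside the brackets at the top level also prints NULL to the console. The following minimal sketch wraps the call in invisible() to avoid that and writes the plot to a PDF file. The file name and its location in the temporary directory are assumptions made for this illustration.

```r
library(data.table)

iris_DT <- as.data.table(iris)

# Illustrative output path in the session's temporary directory
plot_file <- file.path(tempdir(), "iris_petal_scatter.pdf")

pdf(plot_file)                     # open a PDF graphics device
invisible(                         # suppress the NULL returned by plot()
  iris_DT[Sepal.Length >= 5,
          plot(Petal.Length, Petal.Width, pch = 20, col = "blue")]
)
dev.off()                          # close the device so the file is written

file.exists(plot_file)
```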
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.