Comparison of data.table Columns in R (5 Examples)
This tutorial shows how to make different comparisons of the columns of two data.tables in the R programming language. For example, we will answer the questions: Do the data.tables contain data rows with exactly the same information? Which of the IDs in one data.table are also present in the other data.table and where?
Preparing the Examples
To handle data.tables in R, we first have to load the data.table package. The documentation of the package can be found here.
install.packages("data.table") # Install data.table package library("data.table") # Load data.table package |
install.packages("data.table") # Install data.table package library("data.table") # Load data.table package
For the illustration, we use the built-in iris dataset. More information on the iris dataset can be found here.
data(iris) # Load iris data set head(iris) # Display first data rows # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1 5.1 3.5 1.4 0.2 setosa 1 # 2 4.9 3.0 1.4 0.2 setosa 2 # 3 4.7 3.2 1.3 0.2 setosa 3 # 4 4.6 3.1 1.5 0.2 setosa 4 # 5 5.0 3.6 1.4 0.2 setosa 5 # 6 5.4 3.9 1.7 0.4 setosa 6 |
data(iris) # Load iris data set head(iris) # Display first data rows # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1 5.1 3.5 1.4 0.2 setosa 1 # 2 4.9 3.0 1.4 0.2 setosa 2 # 3 4.7 3.2 1.3 0.2 setosa 3 # 4 4.6 3.1 1.5 0.2 setosa 4 # 5 5.0 3.6 1.4 0.2 setosa 5 # 6 5.4 3.9 1.7 0.4 setosa 6
To the data, we add a column called ID for indicating the observation ID of a data row.
iris$ID <- 1:nrow(iris) # Adding ID column |
iris$ID <- 1:nrow(iris) # Adding ID column
From the iris data, we create a data.table called iris_DT_1 which contains the first four rows of the data.
iris_DT_1 <- data.table(iris[1:4, ]) # Creating data.table from iris iris_DT_1 # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 5.1 3.5 1.4 0.2 setosa 1 # 2: 4.9 3.0 1.4 0.2 setosa 2 # 3: 4.7 3.2 1.3 0.2 setosa 3 # 4: 4.6 3.1 1.5 0.2 setosa 4 |
iris_DT_1 <- data.table(iris[1:4, ]) # Creating data.table from iris iris_DT_1 # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 5.1 3.5 1.4 0.2 setosa 1 # 2: 4.9 3.0 1.4 0.2 setosa 2 # 3: 4.7 3.2 1.3 0.2 setosa 3 # 4: 4.6 3.1 1.5 0.2 setosa 4
Furthermore, we create a data.table called iris_DT_2 containing data row two to six of the iris data.
iris_DT_2 <- data.table(iris[2:6, ]) # Creating data.table from iris iris_DT_2 # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 4.9 3.0 1.4 0.2 setosa 2 # 2: 4.7 3.2 1.3 0.2 setosa 3 # 3: 4.6 3.1 1.5 0.2 setosa 4 # 4: 5.0 3.6 1.4 0.2 setosa 5 # 5: 5.4 3.9 1.7 0.4 setosa 6 |
iris_DT_2 <- data.table(iris[2:6, ]) # Creating data.table from iris iris_DT_2 # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 4.9 3.0 1.4 0.2 setosa 2 # 2: 4.7 3.2 1.3 0.2 setosa 3 # 3: 4.6 3.1 1.5 0.2 setosa 4 # 4: 5.0 3.6 1.4 0.2 setosa 5 # 5: 5.4 3.9 1.7 0.4 setosa 6
In the following examples, we compare the two data.tables iris_DT_1 and iris_DT_2.
Example 1: Counting Number of Rows With Exactly Same Values
In this example, we ask ourselves: How many exact duplicate rows are in the data.tables iris_DT_1 and iris_DT_2? For that, we first row-wisely stack the two data.tables on top of each other.
iris_DT_12 <- rbindlist(list(iris_DT_1, iris_DT_2)) # Combining Two data.tables iris_DT_12 # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 5.1 3.5 1.4 0.2 setosa 1 # 2: 4.9 3.0 1.4 0.2 setosa 2 # 3: 4.7 3.2 1.3 0.2 setosa 3 # 4: 4.6 3.1 1.5 0.2 setosa 4 # 5: 4.9 3.0 1.4 0.2 setosa 2 # 6: 4.7 3.2 1.3 0.2 setosa 3 # 7: 4.6 3.1 1.5 0.2 setosa 4 # 8: 5.0 3.6 1.4 0.2 setosa 5 # 9: 5.4 3.9 1.7 0.4 setosa 6 |
iris_DT_12 <- rbindlist(list(iris_DT_1, iris_DT_2)) # Combining Two data.tables iris_DT_12 # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 5.1 3.5 1.4 0.2 setosa 1 # 2: 4.9 3.0 1.4 0.2 setosa 2 # 3: 4.7 3.2 1.3 0.2 setosa 3 # 4: 4.6 3.1 1.5 0.2 setosa 4 # 5: 4.9 3.0 1.4 0.2 setosa 2 # 6: 4.7 3.2 1.3 0.2 setosa 3 # 7: 4.6 3.1 1.5 0.2 setosa 4 # 8: 5.0 3.6 1.4 0.2 setosa 5 # 9: 5.4 3.9 1.7 0.4 setosa 6
We sum the number of duplicate rows. The duplicate rows are indicated by the logical output of function duplicated().
sum(duplicated(iris_DT_12)) # Number of duplicate rows # [1] 3 |
sum(duplicated(iris_DT_12)) # Number of duplicate rows # [1] 3
There are three duplicate rows in the data.
Example 2: Combining Information From Multiple data.tables Without Duplicates
In this example, we want to combine the information of data.tables iris_DT_1 and iris_DT_2 without having duplicated rows. This is done by use of function merge.data.table() with its argument all set to TRUE.
iris_DT_12b <- merge.data.table(iris_DT_1, iris_DT_2, all = TRUE) # Merging Two data.tables iris_DT_12b # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 4.6 3.1 1.5 0.2 setosa 4 # 2: 4.7 3.2 1.3 0.2 setosa 3 # 3: 4.9 3.0 1.4 0.2 setosa 2 # 4: 5.0 3.6 1.4 0.2 setosa 5 # 5: 5.1 3.5 1.4 0.2 setosa 1 # 6: 5.4 3.9 1.7 0.4 setosa 6 |
iris_DT_12b <- merge.data.table(iris_DT_1, iris_DT_2, all = TRUE) # Merging Two data.tables iris_DT_12b # Displaying data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species ID # 1: 4.6 3.1 1.5 0.2 setosa 4 # 2: 4.7 3.2 1.3 0.2 setosa 3 # 3: 4.9 3.0 1.4 0.2 setosa 2 # 4: 5.0 3.6 1.4 0.2 setosa 5 # 5: 5.1 3.5 1.4 0.2 setosa 1 # 6: 5.4 3.9 1.7 0.4 setosa 6
Although the data rows with ID equal to 2, 3, and 4 appeared in both datasets, they appear only once in the merged data.table.
Example 3: Counting Number of Rows With Information on Same IDs
Now we ask ourselves: How many of the IDs appearing in one data.table are also included in the other?
table(iris_DT_1$ID %in% iris_DT_2$ID) # Counting common values # FALSE TRUE # 1 3 |
table(iris_DT_1$ID %in% iris_DT_2$ID) # Counting common values # FALSE TRUE # 1 3
Three of the four ID values in iris_DT_1 are also present in iris_DT_2.
table(iris_DT_2$ID %in% iris_DT_1$ID) # Counting common values # FALSE TRUE # 2 3 |
table(iris_DT_2$ID %in% iris_DT_1$ID) # Counting common values # FALSE TRUE # 2 3
Three of the five ID values in iris_DT_2 are also present in iris_DT_1.
Example 4: Receiving the Indices of Those IDs Which Appear in Another data.table
With the following code, we get the position of those IDs in iris_DT_1 which also appear in iris_DT_2.
which(iris_DT_1$ID %in% iris_DT_2$ID) # Indices of common IDs # [1] 2 3 4 |
which(iris_DT_1$ID %in% iris_DT_2$ID) # Indices of common IDs # [1] 2 3 4
Example 5: Receiving the IDs Which Appear in Another data.table
Now we want to know the values of the IDs which appear in another dataset.
iris_DT_1$ID[iris_DT_1$ID %in% iris_DT_2$ID] # Values of common IDs # [1] 2 3 4 |
iris_DT_1$ID[iris_DT_1$ID %in% iris_DT_2$ID] # Values of common IDs # [1] 2 3 4
The IDs with values 2, 3, and 4 appear in both datasets.
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.