Using Inner, Outer, Left & Right Joins of data.tables in R (4 Examples)
In this R post you’ll learn how to connect the information of multiple data.tables: Inner, outer, left, and right join.
Preparing the Examples
We first install and load the data.table package.
install.packages("data.table") # Install data.table package library("data.table") # Load data.table package |
install.packages("data.table") # Install data.table package library("data.table") # Load data.table package
For illustrating how to merge data.tables, we use the iris dataset.
data(iris) # Load iris data set iris_DT <- data.table::copy(iris) # Replicate iris data set setDT(iris_DT) # Convert iris to a data.table head(iris_DT) # Print the head of the data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species # 1: 5.1 3.5 1.4 0.2 setosa # 2: 4.9 3.0 1.4 0.2 setosa # 3: 4.7 3.2 1.3 0.2 setosa # 4: 4.6 3.1 1.5 0.2 setosa # 5: 5.0 3.6 1.4 0.2 setosa # 6: 5.4 3.9 1.7 0.4 setosa |
data(iris) # Load iris data set iris_DT <- data.table::copy(iris) # Replicate iris data set setDT(iris_DT) # Convert iris to a data.table head(iris_DT) # Print the head of the data # Sepal.Length Sepal.Width Petal.Length Petal.Width Species # 1: 5.1 3.5 1.4 0.2 setosa # 2: 4.9 3.0 1.4 0.2 setosa # 3: 4.7 3.2 1.3 0.2 setosa # 4: 4.6 3.1 1.5 0.2 setosa # 5: 5.0 3.6 1.4 0.2 setosa # 6: 5.4 3.9 1.7 0.4 setosa
From the iris data.table, we create two data subsets, each containing different rows and columns of the data.
iris_DT_21 <- data.table::copy(iris_DT) # Replicate the iris data set iris_DT_21 <- iris_DT_21[1:5, -c(3:5)] # Create iris subset 1 iris_DT_21 <- cbind(iris_DT_21, "ID" = 11:15) head(iris_DT_21) # Print the head of the data # Sepal.Length Sepal.Width ID # 1: 5.1 3.5 11 # 2: 4.9 3.0 12 # 3: 4.7 3.2 13 # 4: 4.6 3.1 14 # 5: 5.0 3.6 15 |
iris_DT_21 <- data.table::copy(iris_DT) # Replicate the iris data set iris_DT_21 <- iris_DT_21[1:5, -c(3:5)] # Create iris subset 1 iris_DT_21 <- cbind(iris_DT_21, "ID" = 11:15) head(iris_DT_21) # Print the head of the data # Sepal.Length Sepal.Width ID # 1: 5.1 3.5 11 # 2: 4.9 3.0 12 # 3: 4.7 3.2 13 # 4: 4.6 3.1 14 # 5: 5.0 3.6 15
iris_DT_22 <- data.table::copy(iris_DT) # Replicate the iris data set iris_DT_22 <- iris_DT_22[6:8, -c(1,2)] # Create iris subset 2 iris_DT_22 <- cbind(iris_DT_22, "ID" = 14:16) head(iris_DT_22) # Print the head of the data # Petal.Length Petal.Width Species ID # 1: 1.7 0.4 setosa 14 # 2: 1.4 0.3 setosa 15 # 3: 1.5 0.2 setosa 16 |
iris_DT_22 <- data.table::copy(iris_DT) # Replicate the iris data set iris_DT_22 <- iris_DT_22[6:8, -c(1,2)] # Create iris subset 2 iris_DT_22 <- cbind(iris_DT_22, "ID" = 14:16) head(iris_DT_22) # Print the head of the data # Petal.Length Petal.Width Species ID # 1: 1.7 0.4 setosa 14 # 2: 1.4 0.3 setosa 15 # 3: 1.5 0.2 setosa 16
You can see that we also added variable ID to both data subsets. With this variable, we identify the observation units in both datasets. Both data.tables have IDs 14 and 15 in common. We show different ways of how to combine both data.tables.
Example 1: Full Outer Join
Example 1 shows how to merge the information fully about both datasets. With variable ID, the common rows of both data.tables are matches. Missing information is filled by NA.
iris_DT_23 <- merge.data.table(iris_DT_21, iris_DT_22, all = TRUE) iris_DT_23 # ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species # 1: 11 5.1 3.5 NA NA <NA> # 2: 12 4.9 3.0 NA NA <NA> # 3: 13 4.7 3.2 NA NA <NA> # 4: 14 4.6 3.1 1.7 0.4 setosa # 5: 15 5.0 3.6 1.4 0.3 setosa # 6: 16 NA NA 1.5 0.2 setosa |
iris_DT_23 <- merge.data.table(iris_DT_21, iris_DT_22, all = TRUE) iris_DT_23 # ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species # 1: 11 5.1 3.5 NA NA <NA> # 2: 12 4.9 3.0 NA NA <NA> # 3: 13 4.7 3.2 NA NA <NA> # 4: 14 4.6 3.1 1.7 0.4 setosa # 5: 15 5.0 3.6 1.4 0.3 setosa # 6: 16 NA NA 1.5 0.2 setosa
Example 2: Inner Join
In contrast to a full merge of the data.tables, we can perform an inner join. That is, we want to join the information only for the common rows of the two data.tables. Only for ID equal to 14 and 15 there is information in both data.tables. With argument nomatch = 0, we indicate that non-matching rows should be excluded.
iris_DT_24 <- iris_DT_21[iris_DT_22, on = c("ID"), nomatch = 0] iris_DT_24 # Sepal.Length Sepal.Width ID Petal.Length Petal.Width Species # 1: 4.6 3.1 14 1.7 0.4 setosa # 2: 5.0 3.6 15 1.4 0.3 setosa |
iris_DT_24 <- iris_DT_21[iris_DT_22, on = c("ID"), nomatch = 0] iris_DT_24 # Sepal.Length Sepal.Width ID Petal.Length Petal.Width Species # 1: 4.6 3.1 14 1.7 0.4 setosa # 2: 5.0 3.6 15 1.4 0.3 setosa
Example 3: Left Outer Join
We can also join one dataset to the other with a left outer or right outer join. In the left outer join, we join data iris_DT_22 (left) and data iris_DT_21 (right) for the rows of data iris_DT_21. For that, we can use the following code.
iris_DT_25 <- iris_DT_22[iris_DT_21, on = c("ID")] iris_DT_25 # Petal.Length Petal.Width Species ID Sepal.Length Sepal.Width # 1: NA NA <NA> 11 5.1 3.5 # 2: NA NA <NA> 12 4.9 3.0 # 3: NA NA <NA> 13 4.7 3.2 # 4: 1.7 0.4 setosa 14 4.6 3.1 # 5: 1.4 0.3 setosa 15 5.0 3.6 |
iris_DT_25 <- iris_DT_22[iris_DT_21, on = c("ID")] iris_DT_25 # Petal.Length Petal.Width Species ID Sepal.Length Sepal.Width # 1: NA NA <NA> 11 5.1 3.5 # 2: NA NA <NA> 12 4.9 3.0 # 3: NA NA <NA> 13 4.7 3.2 # 4: 1.7 0.4 setosa 14 4.6 3.1 # 5: 1.4 0.3 setosa 15 5.0 3.6
The values of those rows in iris_DT_21 for which there is no information in iris_DT_22 are filled by NA.
Example 4: Right Outer Join
We can also do a right outer join instead. In the right outer join, we join data iris_DT_21 (left) and data iris_DT_22 (right) for the rows of data iris_DT_22. For that, we can use the following code.
iris_DT_26 <- iris_DT_21[iris_DT_22, on = c("ID")] iris_DT_26 # Sepal.Length Sepal.Width ID Petal.Length Petal.Width Species # 1: 4.6 3.1 14 1.7 0.4 setosa # 2: 5.0 3.6 15 1.4 0.3 setosa # 3: NA NA 16 1.5 0.2 setosa |
iris_DT_26 <- iris_DT_21[iris_DT_22, on = c("ID")] iris_DT_26 # Sepal.Length Sepal.Width ID Petal.Length Petal.Width Species # 1: 4.6 3.1 14 1.7 0.4 setosa # 2: 5.0 3.6 15 1.4 0.3 setosa # 3: NA NA 16 1.5 0.2 setosa
Compare the outcome to the previous example to see the differences.
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.