Using Inner, Outer, Left & Right Joins of data.tables in R (4 Examples)

In this R post you’ll learn how to connect the information of multiple data.tables: Inner, outer, left, and right join.

Preparing the Examples

We first install and load the data.table package.

install.packages("data.table")       # Install data.table package
library("data.table")                # Load data.table package

For illustrating how to merge data.tables, we use the iris dataset.

data(iris)                              # Load iris data set
iris_DT <- data.table::copy(iris)       # Replicate iris data set
setDT(iris_DT)                          # Convert iris to a data.table
head(iris_DT)                           # Print the head of the data
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa
# 3:          4.7         3.2          1.3         0.2  setosa
# 4:          4.6         3.1          1.5         0.2  setosa
# 5:          5.0         3.6          1.4         0.2  setosa
# 6:          5.4         3.9          1.7         0.4  setosa

From the iris data.table, we create two data subsets, each containing different rows and columns of the data.

iris_DT_21 <- data.table::copy(iris_DT) # Replicate the iris data set
iris_DT_21 <- iris_DT_21[1:5, -c(3:5)]  # Create iris subset 1
iris_DT_21 <- cbind(iris_DT_21, "ID" = 11:15)
 
head(iris_DT_21)                        # Print the head of the data
#    Sepal.Length Sepal.Width ID
# 1:          5.1         3.5 11
# 2:          4.9         3.0 12
# 3:          4.7         3.2 13
# 4:          4.6         3.1 14
# 5:          5.0         3.6 15
iris_DT_22 <- data.table::copy(iris_DT) # Replicate the iris data set
iris_DT_22 <- iris_DT_22[6:8, -c(1,2)]  # Create iris subset 2
iris_DT_22 <- cbind(iris_DT_22, "ID" = 14:16)
 
head(iris_DT_22)                        # Print the head of the data
#    Petal.Length Petal.Width Species ID
# 1:          1.7         0.4  setosa 14
# 2:          1.4         0.3  setosa 15
# 3:          1.5         0.2  setosa 16

You can see that we also added variable ID to both data subsets. With this variable, we identify the observation units in both datasets. Both data.tables have IDs 14 and 15 in common. We show different ways of how to combine both data.tables.

Example 1: Full Outer Join

Example 1 shows how to merge the information fully about both datasets. With variable ID, the common rows of both data.tables are matches. Missing information is filled by NA.

iris_DT_23 <- merge.data.table(iris_DT_21, iris_DT_22, all = TRUE)
iris_DT_23
#    ID Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 11          5.1         3.5           NA          NA    <NA>
# 2: 12          4.9         3.0           NA          NA    <NA>
# 3: 13          4.7         3.2           NA          NA    <NA>
# 4: 14          4.6         3.1          1.7         0.4  setosa
# 5: 15          5.0         3.6          1.4         0.3  setosa
# 6: 16           NA          NA          1.5         0.2  setosa

Example 2: Inner Join

In contrast to a full merge of the data.tables, we can perform an inner join. That is, we want to join the information only for the common rows of the two data.tables. Only for ID equal to 14 and 15 there is information in both data.tables. With argument nomatch = 0, we indicate that non-matching rows should be excluded.

iris_DT_24 <- iris_DT_21[iris_DT_22, on = c("ID"), nomatch = 0]
iris_DT_24
#    Sepal.Length Sepal.Width ID Petal.Length Petal.Width Species
# 1:          4.6         3.1 14          1.7         0.4  setosa
# 2:          5.0         3.6 15          1.4         0.3  setosa

Example 3: Left Outer Join

We can also join one dataset to the other with a left outer or right outer join. In the left outer join, we join data iris_DT_22 (left) and data iris_DT_21 (right) for the rows of data iris_DT_21. For that, we can use the following code.

iris_DT_25 <- iris_DT_22[iris_DT_21, on = c("ID")]
iris_DT_25
#    Petal.Length Petal.Width Species ID Sepal.Length Sepal.Width
# 1:           NA          NA    <NA> 11          5.1         3.5
# 2:           NA          NA    <NA> 12          4.9         3.0
# 3:           NA          NA    <NA> 13          4.7         3.2
# 4:          1.7         0.4  setosa 14          4.6         3.1
# 5:          1.4         0.3  setosa 15          5.0         3.6

The values of those rows in iris_DT_21 for which there is no information in iris_DT_22 are filled by NA.

Example 4: Right Outer Join

We can also do a right outer join instead. In the right outer join, we join data iris_DT_21 (left) and data iris_DT_22 (right) for the rows of data iris_DT_22. For that, we can use the following code.

iris_DT_26 <- iris_DT_21[iris_DT_22, on = c("ID")]
iris_DT_26
#    Sepal.Length Sepal.Width ID Petal.Length Petal.Width Species
# 1:          4.6         3.1 14          1.7         0.4  setosa
# 2:          5.0         3.6 15          1.4         0.3  setosa
# 3:           NA          NA 16          1.5         0.2  setosa

Compare the outcome to the previous example to see the differences.

 

Anna-Lena Wölwer R Programming & Survey Statistics

Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.

Leave a Reply

Your email address will not be published.

Fill out this field
Fill out this field
Please enter a valid email address.
You need to agree with the terms to proceed

Menu
Top