Summarize data.table by Group in R Programming (Example Code)
This tutorial shows how to aggregate a data.table by group means.
Setting up the Example
Install and load the data.table package.
install.packages("data.table")    # Install data.table package
library("data.table")             # Load data.table package
Take the iris dataset as an example and transform it to a data.table, stored as iris_dt.
data(iris)                        # Loading example data
iris_dt <- data.table(iris)
iris_dt
#      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#   1:          5.1         3.5          1.4         0.2    setosa
#   2:          4.9         3.0          1.4         0.2    setosa
#   3:          4.7         3.2          1.3         0.2    setosa
#   4:          4.6         3.1          1.5         0.2    setosa
#   5:          5.0         3.6          1.4         0.2    setosa
#  ---
# 146:          6.7         3.0          5.2         2.3 virginica
# 147:          6.3         2.5          5.0         1.9 virginica
# 148:          6.5         3.0          5.2         2.0 virginica
# 149:          6.2         3.4          5.4         2.3 virginica
# 150:          5.9         3.0          5.1         1.8 virginica
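As a quick sanity check of the conversion step, the following sketch confirms that the result is a data.table (and still a data.frame) with the expected dimensions. It also shows as.data.table(), an alternative way to convert an existing data frame; the object name iris_dt2 is only an illustrative assumption and is not used elsewhere in this tutorial.

```r
library(data.table)

data(iris)
iris_dt  <- data.table(iris)       # conversion used in this tutorial
iris_dt2 <- as.data.table(iris)    # alternative conversion (illustrative name)

is.data.table(iris_dt)             # TRUE
inherits(iris_dt, "data.frame")    # TRUE: data.tables are also data.frames
dim(iris_dt)                       # 150 rows, 5 columns
```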
Example: Computing the Mean by Groups in a data.table
We aggregate the data such that it only contains the mean value of Sepal.Length for each unique value of column Species. The new column containing the group means is called Species_average.
iris_dt_new <- iris_dt[ , .(Species_average = mean(Sepal.Length)), by = Species]    # Calculating mean by group
iris_dt_new
#       Species Species_average
# 1:     setosa           5.006
# 2: versicolor           5.936
# 3:  virginica           6.588
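The code above automatically reduces our data to the desired output dimensions: mean(Sepal.Length) is evaluated once for each group defined by by = Species, so the 150 input rows are collapsed to one row per species.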
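The same pattern extends to several columns at once. The sketch below is one possible variation, not part of the original example: it uses .SD with .SDcols to average all four numeric columns per species, and .N to add the group size next to the mean. The object names num_cols, iris_means, and iris_summary are assumptions chosen for illustration.

```r
library(data.table)

data(iris)
iris_dt <- data.table(iris)

# Columns to average (illustrative helper vector)
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# .SD holds the columns named in .SDcols for each group;
# lapply() applies mean() to each of them
iris_means <- iris_dt[ , lapply(.SD, mean), by = Species, .SDcols = num_cols]

# Group mean together with the number of rows per group (.N)
iris_summary <- iris_dt[ , .(Species_average = mean(Sepal.Length), n = .N), by = Species]

iris_means      # one row per species, one column per averaged variable
iris_summary    # one row per species with the mean and the group size
```

Both results have one row per species, just like iris_dt_new; only the set of summary columns differs.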
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.