Using Summary Statistics in a data.table in R (4 Examples)

In this tutorial, I’ll illustrate how to apply summary statistics like the mean or median inside a data.table object in R.

Preparing the Examples

Install and load the package data.table.

install.packages("data.table")                     # Install data.table package
library("data.table")                              # Load data.table package

For the examples, use the iris dataset.

data(iris)                                         # Load iris data set
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

Convert it into a data.table object, called iris_dt.

iris_dt <- data.table::copy(iris)                  # Replicate iris data set
setDT(iris_dt)                                     # Convert the copy to a data.table by reference
head(iris_dt)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa
# 3:          4.7         3.2          1.3         0.2  setosa
# 4:          4.6         3.1          1.5         0.2  setosa
# 5:          5.0         3.6          1.4         0.2  setosa
# 6:          5.4         3.9          1.7         0.4  setosa
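As a side note, the copy-then-setDT() steps above can usually be collapsed into a single call to as.data.table(), which returns a new data.table and leaves the input data frame untouched. The following is a minimal sketch; the object name iris_dt_alt is only chosen for illustration and is not used elsewhere in this tutorial.

```r
library(data.table)

# as.data.table() creates a new data.table; the original iris data frame is not modified
iris_dt_alt <- as.data.table(iris)

# Check the result: it should be a data.table with the same dimensions as iris
is.data.table(iris_dt_alt)
dim(iris_dt_alt)
```

setDT() is the better choice when the data is large and you explicitly want to avoid a copy; for a small data set like iris, either approach should work equally well.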

Example 1: Absolute Frequencies

We want to create frequency tables inside a data.table. To illustrate this, we define an additional categorical column Petal.L.Class, which assigns each flower to one of four classes bounded by the quartiles of Petal.Length (quantile() returns the minimum, the three quartiles, and the maximum by default). Then, we display the two-way frequency table of the categorical variables Petal.L.Class and Species.

iris_dt_1 <- copy(iris_dt)[ , Petal.L.Class := cut(Petal.Length,            # Work on a copy so that iris_dt stays unchanged
                                                   quantile(Petal.Length),
                                                   include.lowest = TRUE)]  # Keep the minimum value in the first class
iris_dt_2 <- iris_dt_1[ , table(.SD), .SDcols = c("Petal.L.Class", "Species")]
iris_dt_2
#              Species
# Petal.L.Class setosa versicolor virginica
# [1,1.6]           44          0         0
# (1.6,4.35]         6         25         0
# (4.35,5.1]         0         25        16
# (5.1,6.9]          0          0        34

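The table shows that virginica has the longest petals of the three species: all 50 virginica flowers fall into the two highest Petal.Length classes, whereas the setosa flowers fall only into the two lowest.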
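The same frequencies can also be obtained with data.table's own grouping syntax instead of table(). The sketch below counts rows with .N for each combination of class and species and then reshapes the counts into a wide layout with dcast(). The object names freq_long and freq_wide are hypothetical, and the include.lowest = TRUE argument is an assumption of this sketch, made so that the smallest Petal.Length value is kept in the first class.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# Quartile classes of Petal.Length (include.lowest = TRUE is an assumption of this sketch)
iris_dt[, Petal.L.Class := cut(Petal.Length, quantile(Petal.Length), include.lowest = TRUE)]

# Long format: one row per observed combination of class and species
freq_long <- iris_dt[, .N, keyby = .(Petal.L.Class, Species)]
freq_long

# Wide format: one column per species, unobserved combinations filled with 0
freq_wide <- dcast(freq_long, Petal.L.Class ~ Species, value.var = "N", fill = 0L)
freq_wide
```

Unlike table(), this approach returns a data.table, so the counts can be filtered, joined, or summarized further with the usual data.table syntax.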

Example 2: Summary Statistics for Chosen Columns

In R, the summary() function returns a fixed set of summary statistics for a numeric variable: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. With the following code, we apply the function to selected columns only, here Petal.Length and Petal.Width, by passing their names to .SDcols.

iris_dt[ , summary(.SD), .SDcols = c("Petal.Length", "Petal.Width")] 
#   Petal.Length    Petal.Width   
#  Min.   :1.000   Min.   :0.100  
#  1st Qu.:1.600   1st Qu.:0.300  
#  Median :4.350   Median :1.300  
#  Mean   :3.758   Mean   :1.199  
#  3rd Qu.:5.100   3rd Qu.:1.800  
#  Max.   :6.900   Max.   :2.500
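The .SD and .SDcols combination can also be paired with by to compute a statistic for each group. As a minimal sketch (the object name means_by_species is hypothetical), the following applies mean() to both petal columns within each species.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# lapply() runs mean() on every column listed in .SDcols, separately for each Species
means_by_species <- iris_dt[, lapply(.SD, mean), by = Species,
                            .SDcols = c("Petal.Length", "Petal.Width")]
means_by_species
```

The result is a data.table with one row per species. Replacing mean with another function that returns a single value, such as median or sd, should work in the same way.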

Example 3: Display Different Statistics

We can also choose our own set of summary statistics for a variable, as demonstrated by the code below.

iris_dt_4 <- iris_dt[, c("Mean"        = mean(Petal.Length), # Calculate chosen statistics
                         "Variance"    = var(Petal.Length),
                         "Median"      = median(Petal.Length),
                         "Minimum"     = min(Petal.Length),
                         "Maximum"     = max(Petal.Length),
                         "Quantile_75" = quantile(Petal.Length, 0.75, names = FALSE))]
iris_dt_4                                                    # Returns a named numeric vector
#        Mean    Variance      Median     Minimum     Maximum Quantile_75 
#    3.758000    3.116278    4.350000    1.000000    6.900000    5.100000 
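Because the statistics above are combined with c(), the result is a named numeric vector rather than a data.table. If a data.table is preferred, for example to compute the same statistics for each species, the statistics can be wrapped in .() instead. The following is a sketch under that assumption; the object name stats_by_species is hypothetical.

```r
library(data.table)

iris_dt <- as.data.table(iris)

# .() is data.table's alias for list(); each named element becomes a column
stats_by_species <- iris_dt[, .(Mean     = mean(Petal.Length),
                                Variance = var(Petal.Length),
                                Median   = median(Petal.Length),
                                Minimum  = min(Petal.Length),
                                Maximum  = max(Petal.Length)),
                            by = Species]
stats_by_species
```

Dropping the by argument yields a one-row data.table for the whole data set.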

Example 4: Use Summary Statistics to Create a New Column with Median Values

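The following line adds a new column that contains, for each row, the median of Petal.Width within that row's Species. Note that := modifies a data.table by reference, so the new column is added to iris_dt itself, and iris_dt2 is just a second name for the same object.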

iris_dt2 <- iris_dt[ , "Petal.W.Median" := median(Petal.Width), by = Species]
head(iris_dt2)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.W.Median
# 1:          5.1         3.5          1.4         0.2  setosa            0.2
# 2:          4.9         3.0          1.4         0.2  setosa            0.2
# 3:          4.7         3.2          1.3         0.2  setosa            0.2
# 4:          4.6         3.1          1.5         0.2  setosa            0.2
# 5:          5.0         3.6          1.4         0.2  setosa            0.2
# 6:          5.4         3.9          1.7         0.4  setosa            0.2
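To see the by-reference behavior of := in practice, and how copy() avoids it, consider the following sketch. The object names iris_src, iris_new, and medians are hypothetical and only used for this illustration.

```r
library(data.table)

# Without copy(): the column is added to the original object as well
iris_dt  <- as.data.table(iris)
iris_dt2 <- iris_dt[, Petal.W.Median := median(Petal.Width), by = Species]
"Petal.W.Median" %in% names(iris_dt)     # The original was modified too

# With copy(): the original object stays unchanged
iris_src <- as.data.table(iris)
iris_new <- copy(iris_src)[, Petal.W.Median := median(Petal.Width), by = Species]
"Petal.W.Median" %in% names(iris_src)    # The original has no new column

# One row per species with its median, as a quick check of the new column
medians <- unique(iris_new[, .(Species, Petal.W.Median)])
medians
```

Working on a copy costs additional memory, so for large tables it may be preferable to accept the in-place modification and simply keep using the original object name.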


Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.
