Using Summary Statistics in a data.table in R (4 Examples)

In this tutorial, I’ll illustrate how to apply summary statistics like the mean or median inside a data.table object in R.

Preparing the Examples

Install and load the package data.table.

install.packages("data.table")                     # Install data.table package
library("data.table")                              # Load data.table package

For the examples, use the iris dataset.

data(iris)                                         # Load iris data set
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

Convert it into a data.table object, called iris_dt.

iris_dt <- data.table::copy(iris)                  # Replicate iris data set
setDT(iris_dt)                                     # Convert iris to a data.table
head(iris_dt)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa
# 3:          4.7         3.2          1.3         0.2  setosa
# 4:          4.6         3.1          1.5         0.2  setosa
# 5:          5.0         3.6          1.4         0.2  setosa
# 6:          5.4         3.9          1.7         0.4  setosa

Example 1: Absolute Frequencies

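We want to create frequency tables inside a data.table. To illustrate this, we define an additional categorical column Petal.L.Class, which assigns each flower to a quartile interval of Petal.Length. Then, we display the frequency table of the two categorical variables Petal.L.Class and Species. Note that cut() uses left-open intervals by default, so the one flower with the minimum Petal.Length of 1 receives NA and is not counted; the table therefore sums to 149 instead of 150.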

iris_dt_1 <- copy(iris_dt)[ , "Petal.L.Class" := cut(Petal.Length, quantile(Petal.Length))] # Work on a copy so iris_dt stays unchanged
iris_dt_2 <- iris_dt_1[ , table(.SD), .SDcols = c("Petal.L.Class", "Species")]               # Cross-tabulate the two columns
iris_dt_2
#              Species
# Petal.L.Class setosa versicolor virginica
# (1,1.6]           43          0         0
# (1.6,4.35]         6         25         0
# (4.35,5.1]         0         25        16
# (5.1,6.9]          0          0        34

The table shows that virginica has the longest petals: all 34 flowers in the highest class (5.1,6.9] are virginica, whereas most setosa flowers fall into the lowest class.
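The frequency table above can also be produced in long format with data.table's .N counter. The following is a minimal sketch under a few assumptions: the object names dt and freq_long are only illustrative, and include.lowest = TRUE is added to cut() so that the flower with the minimum Petal.Length is kept instead of becoming NA.

```r
library(data.table)

dt <- as.data.table(iris)                          # Fresh data.table copy of iris

dt[ , Petal.L.Class := cut(Petal.Length,           # Quartile classes of Petal.Length
                           breaks = quantile(Petal.Length),
                           include.lowest = TRUE)] # Close the first interval on the left

freq_long <- dt[ , .N, by = .(Petal.L.Class, Species)] # One row per observed combination
freq_long[order(Petal.L.Class, Species)]           # Display counts in sorted order
```

Because the first interval is now closed, the counts in freq_long should add up to all 150 rows. Combinations that never occur are simply absent rather than listed with a count of 0.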

Example 2: Summary Statistics for Chosen Columns

In R, the function summary() returns a set of summary statistics for a variable: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. With the following code, we can apply the function to chosen columns, here Petal.Length and Petal.Width.

iris_dt[ , summary(.SD), .SDcols = c("Petal.Length", "Petal.Width")] # Summary of chosen columns
#   Petal.Length    Petal.Width   
#  Min.   :1.000   Min.   :0.100  
#  1st Qu.:1.600   1st Qu.:0.300  
#  Median :4.350   Median :1.300  
#  Mean   :3.758   Mean   :1.199  
#  3rd Qu.:5.100   3rd Qu.:1.800  
#  Max.   :6.900   Max.   :2.500
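The output of summary(.SD) is a formatted table of text, which is convenient to read but awkward to compute with. As a minimal sketch, assuming the statistics are wanted as numbers per species, lapply() over .SD combined with by returns them as a regular data.table. The names dt and petal_means are only illustrative.

```r
library(data.table)

dt <- as.data.table(iris)                          # Fresh data.table copy of iris

petal_means <- dt[ , lapply(.SD, mean),            # Mean of each chosen column
                   by = Species,                   # Calculated separately per species
                   .SDcols = c("Petal.Length", "Petal.Width")]
petal_means                                        # One row per species
```

Replacing mean with another function such as median or sd gives the corresponding statistic in the same layout.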

Example 3: Display Different Statistics

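We can also choose our own set of summary statistics for a variable, as demonstrated by the code below. Because the statistics are combined with c(), the result is a named numeric vector rather than a data.table.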

iris_dt_4 <- iris_dt[, c("Mean"        = mean(Petal.Length), # Calculate chosen statistics
                         "Variance"    = var(Petal.Length),
                         "Median"      = median(Petal.Length),
                         "Minimum"     = min(Petal.Length),
                         "Maximum"     = max(Petal.Length),
                         "quantile_7"  = quantile(Petal.Length, 0.75))]
iris_dt_4
#           Mean       Variance         Median        Minimum        Maximum 
#       3.758000       3.116278       4.350000       1.000000       6.900000 
# quantile_7.75% 
#       5.100000
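If the statistics should come back as a data.table instead of a named vector, they can be wrapped in .() (an alias for list()). This is only a sketch under that assumption: the names dt, stats_all, stats_by_species, and the column name Q75 are illustrative, and names = FALSE is passed to quantile() so that no "75%" suffix is attached.

```r
library(data.table)

dt <- as.data.table(iris)                          # Fresh data.table copy of iris

stats_all <- dt[ , .(Mean     = mean(Petal.Length),   # Each statistic becomes a column
                     Variance = var(Petal.Length),
                     Median   = median(Petal.Length),
                     Minimum  = min(Petal.Length),
                     Maximum  = max(Petal.Length),
                     Q75      = quantile(Petal.Length, 0.75, names = FALSE))]
stats_all                                          # One-row data.table

stats_by_species <- dt[ , .(Mean   = mean(Petal.Length),
                            Median = median(Petal.Length)),
                        by = Species]              # Same idea, one row per species
stats_by_species
```

The list form also makes it straightforward to add by = Species, which the c() version does not handle as cleanly because every group would return a vector of stacked values.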

Example 4: Use Summary Statistics to Create a New Column with Median Values

The next line of code creates a new column that contains the median of Petal.Width within each of the three Species categories. Every row receives the median of its own species.

iris_dt2 <- iris_dt[ , "Petal.W.Median" := median(Petal.Width), by = Species] # Add group-wise median (modifies iris_dt by reference)
head(iris_dt2)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.W.Median
# 1:          5.1         3.5          1.4         0.2  setosa            0.2
# 2:          4.9         3.0          1.4         0.2  setosa            0.2
# 3:          4.7         3.2          1.3         0.2  setosa            0.2
# 4:          4.6         3.1          1.5         0.2  setosa            0.2
# 5:          5.0         3.6          1.4         0.2  setosa            0.2
# 6:          5.4         3.9          1.7         0.4  setosa            0.2
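One point worth keeping in mind with := is that it changes the data.table in place, so the object on the left of the assignment and the original refer to the same data. The sketch below illustrates this and shows copy() as one way to keep the original untouched; all object names (dt, dt2, dt_orig, dt_new) are illustrative only.

```r
library(data.table)

dt  <- as.data.table(iris)                         # Fresh data.table copy of iris
dt2 <- dt[ , Petal.W.Median := median(Petal.Width), by = Species]

"Petal.W.Median" %in% names(dt)                    # dt itself now carries the new column

dt_orig <- as.data.table(iris)                     # Second fresh copy
dt_new  <- copy(dt_orig)[ , Petal.W.Median := median(Petal.Width), by = Species]

"Petal.W.Median" %in% names(dt_orig)               # dt_orig is left unchanged
unique(dt_new[ , .(Species, Petal.W.Median)])      # One median per species
```

Whether to copy first is a design choice: updating by reference avoids duplicating the data, while copy() is safer when the original table is still needed in its initial state.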
