Using Summary Statistics in a data.table in R (4 Examples)
In this tutorial, I’ll illustrate how to apply summary statistics like the mean or median inside a data.table object in R.
Preparing the Examples
Install and load the package data.table.
install.packages("data.table")     # Install data.table package
library("data.table")              # Load data.table package
For the examples, use the iris dataset.
data(iris)                         # Load iris data set
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
Convert it into a data.table object, called iris_dt.
iris_dt <- data.table::copy(iris)  # Replicate iris data set
setDT(iris_dt)                     # Convert iris to a data.table
head(iris_dt)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa
# 3:          4.7         3.2          1.3         0.2  setosa
# 4:          4.6         3.1          1.5         0.2  setosa
# 5:          5.0         3.6          1.4         0.2  setosa
# 6:          5.4         3.9          1.7         0.4  setosa
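As a side note, setDT() converts an object in place, whereas as.data.table() returns a new object and leaves the input untouched. The following is a minimal sketch of that difference; the object names iris_copy and iris_new are made up for illustration only.

```r
library("data.table")

iris_copy <- data.table::copy(iris)   # Illustrative name: a copy of the built-in data
setDT(iris_copy)                      # Converts iris_copy itself, no assignment needed
is.data.table(iris_copy)              # Check the class after the in-place conversion

iris_new <- as.data.table(iris)       # Illustrative name: returns a new data.table
is.data.table(iris_new)               # The new object is a data.table
is.data.table(iris)                   # The original data.frame is unchanged
```

Either route works for the examples in this tutorial; the in-place variant avoids creating a second copy of the data.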
Example 1: Absolute Frequencies
We want to create frequency tables inside a data.table. To illustrate this, we define an additional categorical column Petal.L.Class, which assigns each observation to a class bounded by the quantiles of Petal.Length. Then, we display the frequency table of the two categorical variables Petal.L.Class and Species.
iris_dt_1 <- copy(iris_dt)[ , "Petal.L.Class" := cut(Petal.Length, quantile(Petal.Length))]  # copy() keeps iris_dt unchanged, as := modifies by reference
iris_dt_2 <- iris_dt_1[ , table(.SD), .SDcols = c("Petal.L.Class", "Species")]               # Frequency table of both variables
iris_dt_2
#              Species
# Petal.L.Class setosa versicolor virginica
#    (1,1.6]        43          0         0
#    (1.6,4.35]      6         25         0
#    (4.35,5.1]      0         25        16
#    (5.1,6.9]       0          0        34
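We see that in the data, the species virginica has the longest petals: all 34 observations in the highest class (5.1,6.9] belong to it. Note that the counts sum to 149 rather than 150, because cut() excludes the lowest break by default, so the single observation with the minimum Petal.Length of 1 is assigned NA.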
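The same frequencies can also be obtained with data.table's own .N counter instead of table(). The sketch below is one possible variant, not the approach used above: it builds its own object dt (a name chosen for illustration), sets include.lowest = TRUE so that the minimum value is kept, and reshapes the counts with dcast().

```r
library("data.table")

dt <- as.data.table(iris)             # Illustrative object, independent of iris_dt

# Class boundaries at the quartiles; include.lowest = TRUE keeps the minimum value
dt[ , Petal.L.Class := cut(Petal.Length, quantile(Petal.Length), include.lowest = TRUE)]

# Long format: one row per observed combination of class and species
freq_long <- dt[ , .N, by = .(Petal.L.Class, Species)]
freq_long

# Wide format, similar in layout to the output of table(); empty cells become 0
freq_wide <- dcast(freq_long, Petal.L.Class ~ Species, value.var = "N", fill = 0L)
freq_wide
```

The long format is convenient for further data.table operations, while the wide format is easier to read.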
Example 2: Summary Statistics for Chosen Columns
In R, the function summary() returns a set of summary statistics for a variable: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. With the following code, we apply the function to chosen columns, here Petal.Length and Petal.Width.
iris_dt[ , summary(.SD), .SDcols = c("Petal.Length", "Petal.Width")]
#   Petal.Length    Petal.Width
#  Min.   :1.000   Min.   :0.100
#  1st Qu.:1.600   1st Qu.:0.300
#  Median :4.350   Median :1.300
#  Mean   :3.758   Mean   :1.199
#  3rd Qu.:5.100   3rd Qu.:1.800
#  Max.   :6.900   Max.   :2.500
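The combination of .SD and .SDcols also works with a single statistic and a grouping variable. As a minimal sketch, the code below computes the mean of the same two columns separately for each species; the names dt, cols and means_by_species are chosen for illustration.

```r
library("data.table")

dt <- as.data.table(iris)                  # Illustrative object, independent of iris_dt
cols <- c("Petal.Length", "Petal.Width")   # Columns chosen for the summary

# Apply mean() to each chosen column, separately for each species
means_by_species <- dt[ , lapply(.SD, mean), by = Species, .SDcols = cols]
means_by_species
```

The result is a data.table with one row per species and one column per chosen variable.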
Example 3: Display Different Statistics
We can also choose our own set of summary statistics for a variable, as demonstrated by the code below.
iris_dt_4 <- iris_dt[, c("Mean"       = mean(Petal.Length),      # Calculate chosen statistics
                         "Variance"   = var(Petal.Length),
                         "Median"     = median(Petal.Length),
                         "Minimum"    = min(Petal.Length),
                         "Maximum"    = max(Petal.Length),
                         "quantile_7" = quantile(Petal.Length, 0.75))]
iris_dt_4
#           Mean       Variance         Median        Minimum        Maximum
#       3.758000       3.116278       4.350000       1.000000       6.900000
# quantile_7.75%
#       5.100000
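Because the statistics above are combined with c(), the result is a named numeric vector, and the name returned by quantile() is appended to the last element. If a data.table is preferred as output, one option is to wrap the expressions in .() instead. The sketch below assumes this variant; the names dt, stats_dt and Quantile_75 are chosen for illustration.

```r
library("data.table")

dt <- as.data.table(iris)             # Illustrative object, independent of iris_dt

# .() returns a data.table with one column per statistic
stats_dt <- dt[ , .(Mean        = mean(Petal.Length),
                    Variance    = var(Petal.Length),
                    Median      = median(Petal.Length),
                    Minimum     = min(Petal.Length),
                    Maximum     = max(Petal.Length),
                    Quantile_75 = quantile(Petal.Length, 0.75))]
stats_dt
```

Adding by = Species to the call would return the same statistics with one row per species.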
Example 4: Use Summary Statistics to Create New Column with Median Values
The next code line shows how to create a new column that contains, for each row, the median value of Petal.Width within the corresponding category of Species.
iris_dt2 <- iris_dt[ , "Petal.W.Median" := median(Petal.Width), by = Species]
head(iris_dt2)
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.W.Median
# 1:          5.1         3.5          1.4         0.2  setosa            0.2
# 2:          4.9         3.0          1.4         0.2  setosa            0.2
# 3:          4.7         3.2          1.3         0.2  setosa            0.2
# 4:          4.6         3.1          1.5         0.2  setosa            0.2
# 5:          5.0         3.6          1.4         0.2  setosa            0.2
# 6:          5.4         3.9          1.7         0.4  setosa            0.2
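Two points about the code above may be worth illustrating. First, := modifies the data.table by reference, so the object on the left of the bracket gains the new column as well. Second, if only the group medians are needed, they can be returned as a separate table with one row per species. A minimal sketch, using the illustrative names dt and medians:

```r
library("data.table")

dt <- as.data.table(iris)             # Illustrative object, independent of iris_dt

# Aggregated table: one row per species, the original dt is not changed
medians <- dt[ , .(Petal.W.Median = median(Petal.Width)), by = Species]
medians

# := adds the column to dt itself; no assignment with <- is required
dt[ , Petal.W.Median := median(Petal.Width), by = Species]
"Petal.W.Median" %in% names(dt)       # Check that dt now has the new column
```

The aggregated form is handy for reporting, while the := form keeps the group median next to each observation for later calculations.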
Related Tutorials
Have a look at the following R programming tutorials. They illustrate topics such as groups and variables:
- Descriptive Statistics Using summary() Function
- Multiple Summary Statistics for Several Variables by Group
Note: This article was created in collaboration with Anna-Lena Wölwer. Anna-Lena is a researcher and programmer who creates tutorials on statistical methodology as well as on the R programming language. You may find more info about Anna-Lena and her other articles on her profile page.