groupBy & Sort PySpark DataFrame in Descending Order in Python (2 Examples)

On this page, I’ll explain how to groupBy and sort a PySpark DataFrame in descending order in the Python programming language.

Table of contents:

  • Introduction
  • Creating Example Data
  • Example 1: groupBy & Sort PySpark DataFrame in Descending Order Using sort() Method
  • Example 2: groupBy & Sort PySpark DataFrame in Descending Order Using orderBy() Method
  • Video, Further Resources & Summary

Let’s get started!

Introduction

PySpark is the Python API for Apache Spark, an open-source framework for storing and processing large amounts of data.

We can create a Spark session by specifying an app name and calling the getOrCreate() method on the SparkSession builder.

SparkSession.builder.appName(app_name).getOrCreate()

After creating the data as a list of dictionaries, we pass it to the createDataFrame() method. By doing this, we create a PySpark DataFrame.

spark.createDataFrame(data)

As a next step, we may display the DataFrame by using the show() method:

dataframe.show()

Now we are set up and can move on to the example data.

Creating Example Data

In this section, we’ll create a DataFrame from a list of dictionaries with eight rows and three columns, containing details of fruits and cities. We can display the DataFrame using the show() method:

# import the pyspark module
import pyspark
 
# import the sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# create a SparkSession and set the app name
spark = SparkSession.builder.appName('statistics_globe').getOrCreate()
 
# create a list of 8 dictionaries,
# each with 3 key-value pairs
data = [{'fruit': 'apple', 'cost': 67, 'city': 'patna'},
        {'fruit': 'mango', 'cost': 87, 'city': 'delhi'},
        {'fruit': 'apple', 'cost': 76, 'city': 'harayana'},
        {'fruit': 'banana', 'cost': 87, 'city': 'hyderabad'},
        {'fruit': 'guava', 'cost': 56, 'city': 'delhi'},
        {'fruit': 'mango', 'cost': 234, 'city': 'patna'},
        {'fruit': 'apple', 'cost': 43, 'city': 'delhi'},
        {'fruit': 'mango', 'cost': 49, 'city': 'banglore'}]
 
 
# create a DataFrame from the list of dictionaries
dataframe = spark.createDataFrame(data)
 
# display the final dataframe
dataframe.show()

groupBy Sort PySpark DataFrame Table 1

The table above illustrates our example DataFrame. As you can see, it contains three columns that are called fruit, cost and city.

Let’s group the DataFrame by the fruit column, sum the cost column within each group, and sort the result in descending order.

Example 1: groupBy & Sort PySpark DataFrame in Descending Order Using sort() Method

This example uses the desc() and sum() functions imported from the pyspark.sql.functions module to calculate the sum by group.

We use the groupBy() and agg() functions to compute the sum per group, and the desc() function to sort the final DataFrame in descending order.

dataframe.groupBy("column_1").agg(sum('column_2').alias("Grouped_column_name")).sort(desc("column_name"))

In this specific example, we group by the fruit column and calculate the total cost for each fruit. Furthermore, we sort the resulting DataFrame in descending order based on the fruit column.

The following Python code imports the sum() and desc() functions from the pyspark.sql.functions module:

from pyspark.sql.functions import sum, desc

In the next step, we calculate the total cost based on the fruit column:

dataframe.groupBy("fruit").agg(sum('cost').alias("Total_Cost")).sort(desc("fruit")).show()

groupBy Sort PySpark DataFrame Table 2

The previous table shows the result of our Python code, i.e. the total cost by group.

Example 2: groupBy & Sort PySpark DataFrame in Descending Order Using orderBy() Method

The method shown in Example 2 is similar to the method explained in Example 1. However, this time we are using the orderBy() function.

The orderBy() function is used with the parameter ascending set to False, which sorts the DataFrame in descending order.

Consider the Python code below:

dataframe.groupBy("column_1").agg(sum('column_2').alias("Grouped_column_name")).orderBy("column_name", ascending=False)

In the first step, we have to import the sum() function from the pyspark.sql.functions module:

from pyspark.sql.functions import sum

Next, we calculate the total cost based on the fruit column and sort the DataFrame in descending order by fruit:

dataframe.groupBy("fruit").agg(sum('cost').alias("Total_Cost")).orderBy("fruit", ascending=False).show()

groupBy Sort PySpark DataFrame Table 3

As you can see, the previous output table is the same as the output table in Example 1. However, this time, we have used the orderBy() function instead of the sort() function.

Video, Further Resources & Summary

Would you like to learn more about how to sum variables by group? Then you may check out the following video on Krish Naik’s YouTube channel.

In the video, he explains how to use the groupBy and aggregate functions:


In addition, you may have a look at some related tutorials on the Data Hacks website.

In summary: This post has explained how to groupBy and order PySpark DataFrames in Python programming. If you have any further questions, you may leave a comment below.

This article was written in collaboration with Gottumukkala Sravan Kumar. You may find more information about Gottumukkala Sravan Kumar and his other articles on his profile page.
