groupBy & Sort PySpark DataFrame in Descending Order in Python (2 Examples)
On this page, I’ll explain how to groupBy and sort a PySpark DataFrame in descending order in the Python programming language.
Table of contents:
- Introduction
- Creating Example Data
- Example 1: groupBy & Sort PySpark DataFrame in Descending Order Using sort() Method
- Example 2: groupBy & Sort PySpark DataFrame in Descending Order Using orderBy() Method
- Video, Further Resources & Summary
Let’s get started!
Introduction
PySpark is the Python API for Apache Spark, an open-source framework for storing and processing large amounts of data.
We can create a SparkSession, the entry point to PySpark, by specifying an app name and calling the getOrCreate() method:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(app_name).getOrCreate()
```
After creating the data as a list of dictionaries, we pass it to the createDataFrame() method. By doing this, we create a PySpark DataFrame.
```python
dataframe = spark.createDataFrame(data)
```
As a next step, we may display the DataFrame by using the show() method:
```python
dataframe.show()
```
Now we are set up and can move on to the example data.
Creating Example Data
In this section, we’ll create a DataFrame from a list of dictionaries with eight rows and three columns, containing details of fruits and cities. We can display the DataFrame using the show() method:
```python
# import the pyspark module
import pyspark

# import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

# create a SparkSession and set the app name
spark = SparkSession.builder.appName('statistics_globe').getOrCreate()

# create a list of eight dictionaries with three key-value pairs each
data = [{'fruit': 'apple', 'cost': 67, 'city': 'patna'},
        {'fruit': 'mango', 'cost': 87, 'city': 'delhi'},
        {'fruit': 'apple', 'cost': 76, 'city': 'harayana'},
        {'fruit': 'banana', 'cost': 87, 'city': 'hyderabad'},
        {'fruit': 'guava', 'cost': 56, 'city': 'delhi'},
        {'fruit': 'mango', 'cost': 234, 'city': 'patna'},
        {'fruit': 'apple', 'cost': 43, 'city': 'delhi'},
        {'fruit': 'mango', 'cost': 49, 'city': 'banglore'}]

# create a DataFrame from the list of dictionaries
dataframe = spark.createDataFrame(data)

# display the final DataFrame
dataframe.show()
```
The table printed by show() illustrates our example DataFrame. As you can see, it contains three columns that are called fruit, cost and city.
Let’s group the DataFrame by the fruit column, aggregate the cost column, and sort the result in descending order.
Example 1: groupBy & Sort PySpark DataFrame in Descending Order Using sort() Method
This example uses the desc() and sum() functions imported from the pyspark.sql.functions module to calculate the sum by group.
We use the groupBy() and agg() functions to aggregate our data, and the desc() function inside sort() to order the final DataFrame in descending order.
```python
dataframe.groupBy("column_1").agg(sum("column_2").alias("Grouped_column_name")).sort(desc("column_name"))
```
In this specific example, we group by the fruit column and calculate the total cost per fruit. Furthermore, we sort the resulting DataFrame in descending order based on the fruit column.
The following Python code imports the sum() and desc() functions from the pyspark.sql.functions module:
```python
from pyspark.sql.functions import sum, desc
```

Note that this import shadows Python’s built-in sum() function in the current namespace.
In the next step, we calculate the total cost based on the fruit column:
```python
dataframe.groupBy("fruit").agg(sum("cost").alias("Total_Cost")).sort(desc("fruit")).show()
```
The previous table shows the result of our Python code, i.e. the total cost by group.
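To double-check the aggregation, here is a minimal plain-Python sketch (no Spark required) that recomputes the same totals from the example data and sorts the fruit names in descending order, mirroring what groupBy(), agg() and sort(desc()) do above:

```python
from collections import defaultdict

# the (fruit, cost) pairs from the example data
rows = [("apple", 67), ("mango", 87), ("apple", 76), ("banana", 87),
        ("guava", 56), ("mango", 234), ("apple", 43), ("mango", 49)]

# sum the costs per fruit, mirroring groupBy("fruit").agg(sum("cost"))
totals = defaultdict(int)
for fruit, cost in rows:
    totals[fruit] += cost

# sort by fruit name in descending order, mirroring sort(desc("fruit"))
result = sorted(totals.items(), key=lambda kv: kv[0], reverse=True)
print(result)
# → [('mango', 370), ('guava', 56), ('banana', 87), ('apple', 186)]
```

This confirms, for instance, that the three apple rows (67 + 76 + 43) add up to a total cost of 186.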
Example 2: groupBy & Sort PySpark DataFrame in Descending Order Using orderBy() Method
The method shown in Example 2 is similar to the method explained in Example 1. However, this time we are using the orderBy() function.
The orderBy() function is used with the ascending parameter set to False, which sorts the DataFrame in descending order.
Consider the Python code below:
```python
dataframe.groupBy("column_1").agg(sum("column_2").alias("Grouped_column_name")).orderBy("column_name", ascending=False)
```
In the first step, we have to import the sum() function from the pyspark.sql.functions module:
```python
from pyspark.sql.functions import sum
```
Next, we calculate the total cost based on the fruit column and sort the DataFrame in descending order based on the fruit column:
```python
dataframe.groupBy("fruit").agg(sum("cost").alias("Total_Cost")).orderBy("fruit", ascending=False).show()
```
As you can see, the previous output table is the same as the output table in Example 1. However, this time, we have used the orderBy() function instead of the sort() function.
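The equivalence of the two approaches is easy to see in plain Python: both sort(desc("fruit")) and orderBy("fruit", ascending=False) correspond to an ordinary sort with reverse=True. A minimal sketch using the fruit names from our example:

```python
# the distinct fruit names from the example data
fruits = ["apple", "mango", "banana", "guava"]

# reverse=True plays the role of desc() in sort() and of
# ascending=False in orderBy(): both yield the same descending order
descending = sorted(fruits, reverse=True)
print(descending)
# → ['mango', 'guava', 'banana', 'apple']
```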
Video, Further Resources & Summary
Would you like to learn more about how to sum variables by group? Then you may check out the following video on Krish Naik’s YouTube channel.
In the video, he explains how to use the groupBy and aggregate functions:
In addition, you may have a look at some related tutorials on the Data Hacks website:
- Add New Column to PySpark DataFrame in Python
- Change Column Names of PySpark DataFrame in Python
- Concatenate Two & Multiple PySpark DataFrames
- Convert PySpark DataFrame Column from String to Double Type
- Convert PySpark DataFrame Column from String to Int Type
- Display PySpark DataFrame in Table Format
- Export PySpark DataFrame as CSV
- Filter PySpark DataFrame Column with None Value in Python
- Import PySpark in Python Shell
- Python Programming Tutorials
In summary: This post has explained how to groupBy and sort PySpark DataFrames in descending order in Python. If you have any further questions, you may leave a comment below.
This article was written in collaboration with Gottumukkala Sravan Kumar. You may find more information about Gottumukkala Sravan Kumar and his other articles on his profile page.