Change Column Names of PySpark DataFrame in Python (4 Examples)

In this tutorial you’ll learn how to change column names in a PySpark DataFrame in the Python programming language.

The table of contents is structured as follows:

  • Introduction
  • Creating Example Data
  • Example 1: Change Column Names in PySpark DataFrame Using select() Function
  • Example 2: Change Column Names in PySpark DataFrame Using selectExpr() Function
  • Example 3: Change Column Names in PySpark DataFrame Using toDF() Function
  • Example 4: Change Column Names in PySpark DataFrame Using withColumnRenamed() Function
  • Video, Further Resources & Summary

Let’s do this!

Introduction

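PySpark is the Python API for Apache Spark, an open-source engine for large-scale data processing. It lets us create and process Spark DataFrames from the Python programming language.

We can create a Spark session with SparkSession.builder, set the application name with appName(), and start the session (or reuse an existing one) with the getOrCreate() method.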

SparkSession.builder.appName(app_name).getOrCreate()

After creating the data with a list of dictionaries, we have to pass the data to the createDataFrame() method. This will create a PySpark DataFrame.

spark.createDataFrame(data)

Next, we can display the DataFrame by using the show() method:

dataframe.show()

Creating Example Data

In this example we are going to create a DataFrame from a list of dictionaries with three rows and three columns, containing student subjects. We are displaying the DataFrame by using the show() method:

# import the pyspark module
import pyspark
 
# import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession
 
# create the Spark session and give the app a name
spark = SparkSession.builder.appName('data_hacks').getOrCreate()
 
# create a list of 3 dictionaries (one per row),
# each with 3 key-value pairs (one per column)
data = [{'first_subject': 'java', 'second_subject': 'hadoop', 'third_subject': 'php'},
        {'first_subject': 'c/c++', 'second_subject': 'hive', 'third_subject': 'jsp'},
        {'first_subject': 'Scala', 'second_subject': 'pig', 'third_subject': 'html/css'}]
 
# create a DataFrame from the given list of dictionaries
dataframe = spark.createDataFrame(data)
 
# display the final DataFrame
dataframe.show()

Table 1 Python PySpark Change Column Names

Table 1 shows our example DataFrame. As you can see, it contains three columns that are called first_subject, second_subject, and third_subject.

Let’s rename these columns!

Example 1: Change Column Names in PySpark DataFrame Using select() Function

The first example shows how to change the column names in a PySpark DataFrame by using the select() function.

The select() method selects columns. We reference each column with the col() function and give it a new name with the alias() method. Note that select() returns only the columns that are listed, so every column we want to keep must appear in the call.

First, we have to import the col() function from the pyspark.sql.functions module. We can rename one or several columns at a time.

dataframe.select(col("old_column1").alias("new_column1"),
                 col("old_column2").alias("new_column2"),
                 ...,
                 col("old_columnN").alias("new_columnN"))

In this example we are renaming the first_subject column to Programming, the second_subject column to Big data, and the third_subject column to Web technologies.

# import the col function from pyspark.sql.functions
from pyspark.sql.functions import col
 
# Select the first_subject columns as 'Programming'
# Select the second_subject columns as 'Big data'
# Select the third_subject columns as 'Web technologies'
# and display the DataFrame by using show() method
dataframe.select(col("first_subject").alias("Programming"),
                 col("second_subject").alias("Big data"),
                 col("third_subject").alias("Web technologies")).show()

Table 2 Python PySpark Change Column Names

Example 2: Change Column Names in PySpark DataFrame Using selectExpr() Function

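The selectExpr() method works like select(), but takes SQL expressions as strings. We can therefore rename a column with the SQL keyword as.

dataframe.selectExpr("oldcolumn1 as newcolumn1", "oldcolumn2 as newcolumn2", ...)

In this example, we rename the same columns as in the previous example, but write the new names without spaces (Programming, Bigdata, and Webtechnologies), because a name that contains a space would have to be enclosed in backticks inside the SQL expression.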

# Select the first_subject columns as 'Programming'
# Select the second_subject columns as 'Bigdata'
# Select the third_subject columns as 'Webtechnologies'
# and display the DataFrame by using show() method
dataframe.selectExpr("first_subject as Programming",
                     "second_subject as Bigdata",
                     "third_subject as Webtechnologies").show()

Table 3 Python PySpark Change Column Names

Example 3: Change Column Names in PySpark DataFrame Using toDF() Function

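In this example we are using the toDF() function. It takes the new column names as separate arguments and renames all columns of the PySpark DataFrame at once, in their current order. For this reason, we have to supply exactly one new name per column. The asterisk in the syntax below unpacks a Python list into these arguments.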

dataframe.toDF(*new_column_list).show()

We are going to rename the columns to Programming, Big data, and Web technologies.

# list with one new name per column, in column order
new_column_list = ['Programming', 'Big data', 'Web technologies']
# Rename the first_subject column to 'Programming'
# Rename the second_subject column to 'Big data'
# Rename the third_subject column to 'Web technologies'
# and display the DataFrame by using show() method
dataframe.toDF(*new_column_list).show()

Table 4 Python PySpark Change Column Names

Example 4: Change Column Names in PySpark DataFrame Using withColumnRenamed() Function

In this example we are going to rename columns with the withColumnRenamed() function and display the DataFrame with the show() method. Each call renames a single column, so we chain one call per column to rename several columns.

dataframe.withColumnRenamed("oldcolumn1", "newcolumn_name").withColumnRenamed("oldcolumn2", "new_columnname")...

We are renaming only the first and second columns in this example.

# Rename the first_subject column to 'Programming'
# Rename the second_subject column to 'Big data'
# and display the DataFrame by using show() method
dataframe.withColumnRenamed("first_subject", "Programming").withColumnRenamed("second_subject", "Big data").show()

Table 5 Python PySpark Change Column Names

Video, Further Resources & Summary

Do you need more explanations on how to modify the variable names of a PySpark DataFrame? Then you may have a look at the following YouTube video of the DecisionForest YouTube channel.

In the video, the speaker explains how to select, rename, transform, and manipulate the columns of a Spark DataFrame in Python.

Summary: This post has illustrated how to rename variables of a PySpark DataFrame in the Python programming language. In case you have any additional questions, you may leave a comment below.

This article was written in collaboration with Gottumukkala Sravan Kumar. You may find more information about Gottumukkala and his other articles on his profile page.
