Change Column Names of PySpark DataFrame in Python (4 Examples)
In this tutorial you’ll learn how to change column names in a PySpark DataFrame in the Python programming language.
The table of contents is structured as follows:
- Introduction
- Creating Example Data
- Example 1: Change Column Names in PySpark DataFrame Using select() Function
- Example 2: Change Column Names in PySpark DataFrame Using selectExpr() Function
- Example 3: Change Column Names in PySpark DataFrame Using toDF() Function
- Example 4: Change Column Names in PySpark DataFrame Using withColumnRenamed() Function
- Video, Further Resources & Summary
Let’s do this!
Introduction
PySpark is an open-source framework for storing and processing large amounts of data using the Python programming language.
We can create a SparkSession object via the SparkSession builder, set the application name with the appName() method, and create (or reuse) the session with the getOrCreate() method.
SparkSession.builder.appName(app_name).getOrCreate()
After defining the data as a list of dictionaries, we pass it to the createDataFrame() method, which creates a PySpark DataFrame.
spark.createDataFrame(data)
Next, we can display the DataFrame by using the show() method:
dataframe.show()
Creating Example Data
In this example, we create a DataFrame from a list of dictionaries with three rows and three columns containing student subjects, and display it using the show() method:
# import the pyspark module
import pyspark

# import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

# create the Spark session and set the app name
spark = SparkSession.builder.appName('data_hacks').getOrCreate()

# create a list of dictionaries with three key-value pairs each
data = [{'first_subject': 'java', 'second_subject': 'hadoop', 'third_subject': 'php'},
        {'first_subject': 'c/c++', 'second_subject': 'hive', 'third_subject': 'jsp'},
        {'first_subject': 'Scala', 'second_subject': 'pig', 'third_subject': 'html/css'}]

# create a DataFrame from the given list of dictionaries
dataframe = spark.createDataFrame(data)

# display the final DataFrame
dataframe.show()
The previously shown table illustrates our example DataFrame. As you can see, it contains three columns called first_subject, second_subject, and third_subject.
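If you want to double-check the current column names before renaming anything, you can inspect the columns attribute of the DataFrame. The following short snippet is only an optional check, assuming the dataframe object created above:

# print the current column names as a Python list
print(dataframe.columns)
# expected: ['first_subject', 'second_subject', 'third_subject']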
Let’s rename these variables!
Example 1: Change Column Names in PySpark DataFrame Using select() Function
This first example demonstrates how to change the column names of a PySpark DataFrame using the select() function.
The select() method is used together with the col() function to select columns and with the alias() function to rename them.
First, we have to import the col() function from the pyspark.sql.functions module. We can rename a single column or multiple columns at a time.
dataframe.select(col("old_column1").alias("new_column1"), col("old_column2").alias("new_column2"), ..., col("old_columnn").alias("new_columnn"))
In this example, we are changing the first_subject column into Programming, the second_subject column into Big data, and the third_subject column into Web technologies.
# import the col function from pyspark.sql.functions
from pyspark.sql.functions import col

# select the first_subject column as 'Programming'
# select the second_subject column as 'Big data'
# select the third_subject column as 'Web technologies'
# and display the DataFrame using the show() method
dataframe.select(col("first_subject").alias("Programming"),
                 col("second_subject").alias("Big data"),
                 col("third_subject").alias("Web technologies")).show()
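Note that select() does not modify dataframe in place; it returns a new DataFrame. If you want to keep working with the renamed columns, assign the result to a new variable. The following sketch uses the hypothetical name dataframe_renamed:

# keep the renamed DataFrame in a new variable instead of only printing it
dataframe_renamed = dataframe.select(col("first_subject").alias("Programming"),
                                     col("second_subject").alias("Big data"),
                                     col("third_subject").alias("Web technologies"))

# the original DataFrame still has the old column names
print(dataframe.columns)          # ['first_subject', 'second_subject', 'third_subject']
print(dataframe_renamed.columns)  # ['Programming', 'Big data', 'Web technologies']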
Example 2: Change Column Names in PySpark DataFrame Using selectExpr() Function
The selectExpr() method works like a SQL select expression: we use the as keyword to assign new names to the columns of the PySpark DataFrame.
dataframe.selectExpr("old_column1 as new_column1", "old_column2 as new_column2", ...)
In this example, we assign new column names similar to the previous example. Note that the new names contain no blanks this time, since column names with blanks would have to be wrapped in backticks inside selectExpr().
# select the first_subject column as 'Programming'
# select the second_subject column as 'Bigdata'
# select the third_subject column as 'Webtechnologies'
# and display the DataFrame using the show() method
dataframe.selectExpr("first_subject as Programming",
                     "second_subject as Bigdata",
                     "third_subject as Webtechnologies").show()
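Since selectExpr() accepts arbitrary SQL expressions, you can also rename only some of the columns and keep the others unchanged. A minimal sketch, again based on the dataframe from above:

# rename only the first column and keep the remaining columns as they are
dataframe.selectExpr("first_subject as Programming",
                     "second_subject",
                     "third_subject").show()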
Example 3: Change Column Names in PySpark DataFrame Using toDF() Function
In this example, we use the toDF() function. This function takes a list of new column names as input and renames all columns of the PySpark DataFrame at once.
dataframe.toDF(*new_column_list).show()
We are going to change the old column names to Programming, Big data, and Web technologies.
# list of new column names
new_column_list = ['Programming', 'Big data', 'Web technologies']

# rename all columns at once using toDF()
# and display the DataFrame using the show() method
dataframe.toDF(*new_column_list).show()
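Keep in mind that toDF() expects exactly one new name per existing column. Instead of typing the list by hand, you could also derive the new names from the old ones; the list comprehension below, with the hypothetical variable name derived_names, is just one possible sketch:

# derive the new names from the old ones, e.g. by dropping the '_subject' suffix
derived_names = [c.replace('_subject', '') for c in dataframe.columns]
dataframe.toDF(*derived_names).show()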
Example 4: Change Column Names in PySpark DataFrame Using withColumnRenamed() Function
In this example, we change one or multiple column names at a time using the withColumnRenamed() function and display the result using the show() method.
dataframe.withColumnRenamed("old_column1", "new_column1").withColumnRenamed("old_column2", "new_column2")...
We are changing only the first and the second column in this example.
# rename the first_subject column to 'Programming'
# rename the second_subject column to 'Big data'
# and display the DataFrame using the show() method
dataframe.withColumnRenamed("first_subject", "Programming") \
         .withColumnRenamed("second_subject", "Big data") \
         .show()
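If you have many columns to rename, chaining withColumnRenamed() calls by hand gets tedious. One common pattern is to loop over a Python dictionary that maps old names to new names; rename_mapping and dataframe_renamed below are hypothetical names used only for illustration:

# map old column names to new column names
rename_mapping = {"first_subject": "Programming",
                  "second_subject": "Big data",
                  "third_subject": "Web technologies"}

# apply the renames one by one; each call returns a new DataFrame
dataframe_renamed = dataframe
for old_name, new_name in rename_mapping.items():
    dataframe_renamed = dataframe_renamed.withColumnRenamed(old_name, new_name)

dataframe_renamed.show()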
Video, Further Resources & Summary
Do you need more explanations on how to modify the variable names of a PySpark DataFrame? Then you may have a look at the following YouTube video of the DecisionForest YouTube channel.
In the video, the speaker explains how to select, rename, transform, and manipulate the columns of a Spark DataFrame in Python.

Furthermore, you may have a look at some of the other tutorials on the Data Hacks website:
- Add New Column to PySpark DataFrame in Python
- Concatenate Two & Multiple PySpark DataFrames
- Convert PySpark DataFrame Column from String to Double Type
- Convert PySpark DataFrame Column from String to Int Type
- Display PySpark DataFrame in Table Format
- Export PySpark DataFrame as CSV
- Filter PySpark DataFrame Column with None Value in Python
- groupBy & Sort PySpark DataFrame in Descending Order
- Import PySpark in Python Shell
- Rename Columns of pandas DataFrame in Python
- Python Programming Tutorials
Summary: This post has illustrated how to rename variables of a PySpark DataFrame in the Python programming language. In case you have any additional questions, you may leave a comment below.
This article was written in collaboration with Gottumukkala Sravan Kumar. You may find more information about Gottumukkala and his other articles on his profile page.