How to Add New Column to PySpark DataFrame in Python (5 Examples)
On this page, you’ll learn how to add a new column to PySpark DataFrame in the Python programming language.
The following sections are explained in this article:
- Introduction
- Creating Example Data
- Example 1: Add New Column with Constant Value
- Example 2: Add New Column based on Another Column in DataFrame
- Example 3: Add New Column Using select() Method
- Example 4: Add New Column Using SQL Expression
- Example 5: Add New Column based on Conditions on Another Column in DataFrame
- Video, Further Resources & Summary
Here’s how to do it!
Introduction
PySpark is an open-source framework that is used to store and process data by using the Python programming language.
We can create a Spark session object and specify the app name; the getOrCreate() method returns an existing session or creates a new one.
SparkSession.builder.appName(app_name).getOrCreate()
After creating the data with a list of dictionaries, we have to pass the data to the createDataFrame() method. This will create our PySpark DataFrame.
spark.createDataFrame(data)
Next, we can display the DataFrame by using the show() method:
dataframe.show()
Creating Example Data
In this example, we are going to create a DataFrame from a list of dictionaries with three rows and three columns, containing student subjects.
We are displaying the DataFrame by using the show() method:
# import the pyspark module
import pyspark

# import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# create a Spark session and give the app a name
spark = SparkSession.builder.appName('data_hacks').getOrCreate()

# create a list of dictionaries with 3 rows and 3 columns
data = [{'first_subject': 'java', 'second_subject': 'hadoop', 'third_subject': 'php'},
        {'first_subject': 'c/c++', 'second_subject': 'hive', 'third_subject': 'jsp'},
        {'first_subject': 'Scala', 'second_subject': 'pig', 'third_subject': 'html/css'}]

# create a DataFrame from the given list of dictionaries
dataframe = spark.createDataFrame(data)

# display the final DataFrame
dataframe.show()
The previously shown table shows our example DataFrame. As you can see, it contains three columns that are called first_subject, second_subject, and third_subject.
Let’s add new columns to this existing DataFrame.
Example 1: Add New Column with Constant Value
This example uses the lit() function to add a column with a constant value.
The withColumn() method adds a new column with a constant value to our example DataFrame.
First, we have to import the lit() function from the pyspark.sql.functions module.
dataframe.withColumn("column_name", lit(value))
In this example, we are adding a Marks column with the constant value 90.
# import the lit function from the pyspark.sql.functions module
from pyspark.sql.functions import lit

# add a Marks column with 90 as the default value and display the result
dataframe.withColumn("Marks", lit(90)).show()
Example 2: Add New Column based on Another Column in DataFrame
This example uses the concat_ws() function, which combines the values from two or more columns and assigns the result to the new column.
First, we have to import the concat_ws() function from the pyspark.sql.functions module.
dataframe.withColumn("column_name", concat_ws("separator", "column1", "column2", ...., "column n"))
In this example, we are going to combine the first_subject and the second_subject columns and assign them to a new column named Computer subjects.
# import the concat_ws function from pyspark.sql.functions
from pyspark.sql.functions import concat_ws

# add a new column named Computer subjects, built from the first_subject
# and second_subject columns separated by a comma, and display the result
dataframe.withColumn("Computer subjects",
                     concat_ws(",", "first_subject", "second_subject")).show()
Example 3: Add New Column Using select() Method
In this example, we are using the select() method along with the lit() method to add a new column and assign a constant value to the DataFrame.
dataframe.select("column1", "column2", lit(constant_value).alias("new_column_name"))
We are going to add a new column called marks with the default value 90 and display it together with the first two columns.
# import the lit function from the pyspark.sql.functions module
from pyspark.sql.functions import lit

# add a column by using the select() method:
# a column named marks with the default value 90, created via lit()
dataframe.select("first_subject", "second_subject", lit(90).alias("marks")).show()
Example 4: Add New Column Using SQL Expression
In this example, we are going to add a new column to the DataFrame from a SQL view through the sql() function and fill the new column with a constant value.
spark.sql("select column1, ............, columnn, 'constant_value' as new_column from view_name")
Here, we are creating a view and from that view we are creating a new column called marks. To this column, we assign the constant value 90.
# add a column to the DataFrame using a SQL expression
# by first creating a temporary view
dataframe.createOrReplaceTempView("data")

# select the existing columns plus the constant column and display the result
spark.sql("select first_subject, second_subject, '90' as marks from data").show()
Example 5: Add New Column based on Conditions on Another Column in DataFrame
In this example, we are going to add a new column based on conditions on another existing column by using the when() method.
This method evaluates a condition and is typically combined with the lit() method.
First, we have to import the when() function from the pyspark.sql.functions module.
dataframe.withColumn("column_name", when(condition, lit(value1)).otherwise(lit(value2)))
Here we are adding a column named Course Domain based on subjects conditions:
– when the third_subject column is html/css assign the Course Domain value as Programming
– when the first_subject column is java and second_subject column is hadoop then assign the Course Domain value as Object oriented
– otherwise assign the Course Domain as Data analysis.
# import the lit and when functions from the pyspark.sql.functions module
from pyspark.sql.functions import lit, when

# add a column named Course Domain based on conditions on the subject columns:
# when the third_subject column is html/css, assign Programming;
# when first_subject is java and second_subject is hadoop, assign Object oriented;
# otherwise assign Data analysis
dataframe.withColumn("Course Domain",
                     when(dataframe.third_subject == "html/css", lit("Programming"))
                     .when((dataframe.first_subject == 'java') &
                           (dataframe.second_subject == 'hadoop'), lit("Object oriented"))
                     .otherwise(lit("Data analysis"))).show()
Video, Further Resources & Summary
Do you need more explanations on how to add new columns to a PySpark DataFrame? Then you may have a look at the following YouTube video from the GeekCoders YouTube channel.
The speaker explains in this video how to add columns to a DataFrame using PySpark.
In addition, you can also have a look at our other tutorials on the Data Hacks website:
- Change Column Names of PySpark DataFrame in Python
- Concatenate Two & Multiple PySpark DataFrames
- Convert PySpark DataFrame Column from String to Double Type
- Convert PySpark DataFrame Column from String to Int Type
- Display PySpark DataFrame in Table Format
- Export PySpark DataFrame as CSV
- Filter PySpark DataFrame Column with None Value in Python
- groupBy & Sort PySpark DataFrame in Descending Order
- Import PySpark in Python Shell
- Python Programming Tutorials
Summary: This post has explained how to insert new columns into a PySpark DataFrame in the Python programming language. In case you have any additional questions, you may leave a comment in the section below.
This article was written in collaboration with Gottumukkala Sravan Kumar. You may find more information about Gottumukkala Sravan Kumar and his other articles on his profile page.