Export PySpark DataFrame as CSV (3 Examples)
This post explains how to export a PySpark DataFrame as a CSV in the Python programming language.
The tutorial consists of these contents:
- Introduction
- Creating Example Data
- Example 1: Using write.csv() Function
- Example 2: Using write.format() Function
- Example 3: Using write.option() Function
- Video, Further Resources & Summary
Let’s dive into it:
Introduction
PySpark is the Python API for Apache Spark, an open-source engine for storing and processing large amounts of data with the Python programming language.
We first construct a SparkSession object, the entry point for working with DataFrames, and specify the app name. The getOrCreate() method returns an existing session or creates a new one:
SparkSession.builder.appName(app_name).getOrCreate()
Once we have defined our data as a list of dictionaries, we pass it to the createDataFrame() method. This creates our PySpark DataFrame.
spark.createDataFrame(data)
To display our DataFrame, we can use the show() method:
dataframe.show()
Creating Example Data
In this case, we are going to create a DataFrame from a list of dictionaries with eight rows and three columns, containing details about fruits and cities. We can display the DataFrame by using the show() method:
# import the pyspark module
import pyspark

# import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

# create a SparkSession and set the app name
spark = SparkSession.builder.appName('statistics_globe').getOrCreate()

# create a list of eight dictionaries with three key-value pairs each
data = [{'fruit': 'apple', 'cost': '67.89', 'city': 'patna'},
        {'fruit': 'mango', 'cost': '87.67', 'city': 'delhi'},
        {'fruit': 'apple', 'cost': '64.76', 'city': 'harayana'},
        {'fruit': 'banana', 'cost': '87.00', 'city': 'hyderabad'},
        {'fruit': 'guava', 'cost': '69.56', 'city': 'delhi'},
        {'fruit': 'mango', 'cost': '234.67', 'city': 'patna'},
        {'fruit': 'apple', 'cost': '143.00', 'city': 'delhi'},
        {'fruit': 'mango', 'cost': '49.0', 'city': 'banglore'}]

# create a DataFrame from the list of dictionaries
dataframe = spark.createDataFrame(data)

# display the final DataFrame
dataframe.show()
The output of the show() method displays our example DataFrame. As you can see, it contains three columns called fruit, cost, and city.
Now let’s export the data from our DataFrame into a CSV.
Example 1: Using write.csv() Function
This example uses the write.csv() method to export the data of our PySpark DataFrame.
dataframe.write.csv("file_name")
In the next step, we export the DataFrame created above to a CSV file:
# export the DataFrame to a folder named final_data
dataframe.write.csv("final_data")
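Note that Spark does not write a single CSV file here: final_data is created as a folder that contains one or more part files, depending on the number of partitions of the DataFrame. If you prefer a single file that also contains the column names, a sketch along the following lines should work (the target folder final_data_single is just a placeholder name):

# collapse the DataFrame into a single partition so only one part file is written,
# include the column names via the header option,
# and overwrite the target folder in case it already exists
dataframe.coalesce(1) \
    .write \
    .option("header", True) \
    .mode("overwrite") \
    .csv("final_data_single")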
Example 2: Using write.format() Function
This example uses the write.format() function in combination with the save() method to export the data in CSV format.
dataframe.write.format("csv").save("file_name")
Next, we export our DataFrame to CSV format using this approach:
dataframe.write.format("csv").save("final_data")
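The format()/save() combination writes the same folder of part files as write.csv() in Example 1, and additional settings can be chained in the same way. The following sketch, for instance, adds the column names and overwrites the final_data folder if it already exists:

# export in CSV format, include the column names,
# and overwrite the final_data folder if it already exists
dataframe.write \
    .format("csv") \
    .option("header", True) \
    .mode("overwrite") \
    .save("final_data")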
Example 3: Using write.option() Function
This example uses the option() method to include the header (i.e. the column names) when exporting the data.
dataframe.write.option("header", True).csv("file_name")
In this example, we export the PySpark DataFrame to a CSV file including the column names:
# export the DataFrame with the header to a folder named final_data
dataframe.write.option("header", True).csv("final_data")
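To double-check the export, the CSV folder can be read back into a new DataFrame. The following is a minimal sketch that assumes the final_data folder created above and uses the header row for the column names:

# read the exported CSV folder back into a DataFrame,
# using the first row of each file as the column names
dataframe_check = spark.read.option("header", True).csv("final_data")

# display the re-imported data
dataframe_check.show()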
Video, Further Resources & Summary
If you need more explanations on how to write files in PySpark, you may have a look at the following video on the Let’s Data YouTube channel!
You may also have a look at the following tutorials on the Data Hacks website:
- Add New Column to PySpark DataFrame in Python
- Change Column Names of PySpark DataFrame in Python
- Concatenate Two & Multiple PySpark DataFrames
- Convert PySpark DataFrame Column from String to Double Type
- Convert PySpark DataFrame Column from String to Int Type
- Display PySpark DataFrame in Table Format
- Filter PySpark DataFrame Column with None Value in Python
- groupBy & Sort PySpark DataFrame in Descending Order
- Import PySpark in Python Shell
- Python Programming Tutorials
Summary: This post has illustrated how to export a PySpark DataFrame as a CSV in the Python programming language. In case you have any additional questions, you may leave a comment below.
This article was written in collaboration with Gottumukkala Sravan Kumar. You may find more information about Gottumukkala Sravan Kumar and his other articles on his profile page.