
Saving a DataFrame

In PySpark, you save a DataFrame through its write attribute, which returns a DataFrameWriter. The writer has a method for each built-in format (CSV, Parquet, JSON, ORC) and a generic format(...).save(...) form for formats supplied by additional packages, such as Avro and Delta Lake.

Let's go through examples of saving a DataFrame to different formats.

First, make sure you have PySpark installed and set up a SparkSession:

python
from pyspark.sql import SparkSession

# Create a SparkSession
spark = (
    SparkSession.builder
    .appName("Save DataFrame Examples")
    .getOrCreate()
)

Next, let's create a sample DataFrame with some data:

python
from pyspark.sql import Row

# Sample data
data = [
    Row(name="Alice", age=28),
    Row(name="Bob", age=32),
    Row(name="Charlie", age=25),
    Row(name="David", age=31),
    Row(name="Eva", age=24),
]

# Create DataFrame from the sample data
df = spark.createDataFrame(data)
df.show()

The output will be:

+-------+---+
|   name|age|
+-------+---+
|  Alice| 28|
|    Bob| 32|
|Charlie| 25|
|  David| 31|
|    Eva| 24|
+-------+---+

1. Save DataFrame as CSV:

python
# Save DataFrame as CSV
df.write.csv("data.csv", header=True, mode="overwrite")

2. Save DataFrame as Parquet:

python
# Save DataFrame as Parquet
df.write.parquet("data.parquet", mode="overwrite")

3. Save DataFrame as JSON:

python
# Save DataFrame as JSON
df.write.json("data.json", mode="overwrite")

4. Save DataFrame as ORC:

python
# Save DataFrame as ORC
df.write.orc("data.orc", mode="overwrite")

5. Save DataFrame as Avro:

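Avro support comes from the external spark-avro module, which is not bundled with Spark by default. No extra Python package is needed. Instead, add the module when launching the application, using the artifact that matches your Scala and Spark versions (your_script.py is a placeholder):

bash
spark-submit --packages org.apache.spark:spark-avro_<scala-version>:<spark-version> your_script.py

Then you can save the DataFrame as Avro:

python
# Save DataFrame as Avro (requires the spark-avro module on the classpath)
df.write.format("avro").save("data.avro", mode="overwrite")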

6. Save DataFrame as Delta Lake:

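To save a DataFrame as a Delta Lake table, first install the delta-spark package (the SparkSession must also be created with the Delta Lake extension and catalog enabled, or the delta format will not be found):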

bash
pip install delta-spark

Then you can save the DataFrame as Delta Lake:

python
# Save DataFrame as Delta Lake
df.write.format("delta").save("data.delta")

The DataFrameWriter returned by write lets you control the file format, compression, partitioning, and more. The examples above that pass mode="overwrite" replace any existing output at the target path. You can also use mode="append" to add the new data alongside what is already there, or mode="ignore" to skip the write silently if the path already exists. The default, mode="error" (also spelled "errorifexists"), raises an error when the path already exists.