# Removing Duplicate Rows from a DataFrame - .dropDuplicates()
## Overview
The `dropDuplicates()` function (also available under the alias `drop_duplicates()`) removes duplicate rows from a DataFrame. Duplicate rows are rows with identical values across all columns. By eliminating this redundant information, `dropDuplicates()` helps keep your data clean. It returns a new DataFrame with the duplicates removed, leaving the original DataFrame unchanged.
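Because the function returns a new DataFrame rather than mutating the one it is called on, you can confirm this by comparing row counts before and after. A minimal sketch (the app name and sample data here are illustrative, not from the examples below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ImmutabilityCheck").getOrCreate()

# Two identical rows plus one distinct row (illustrative data)
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 30), ("Bob", 25)],
    ["name", "age"],
)

deduped = df.dropDuplicates()

print(df.count())       # 3 -- the original DataFrame is unchanged
print(deduped.count())  # 2 -- the returned DataFrame has duplicates removed
```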
## Drop Duplicate Rows
You can use the `dropDuplicates()` function to remove duplicate rows from a DataFrame.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()

# Sample data with duplicate rows as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Alice", "age": 30, "city": "New York"},  # Duplicate row
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},  # Duplicate row
]

# Create a DataFrame
df = spark.createDataFrame(data)

# Drop duplicate rows
df_without_duplicates = df.dropDuplicates()
df_without_duplicates.show()
```
Output:

```
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|Charlie| 35|  Los Angeles|
|    Bob| 25|San Francisco|
|  Alice| 30|     New York|
+-------+---+-------------+
```
The rows with identical values in every column have been removed from the DataFrame. Note that the order of rows in the result is not guaranteed.
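As an aside, calling `dropDuplicates()` with no arguments is equivalent to `distinct()`, so the example above could also be written as:

```python
# distinct() considers all columns, just like dropDuplicates() with no subset
df_without_duplicates = df.distinct()
df_without_duplicates.show()
```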
## Drop Duplicate Rows Based on Specific Columns
You can pass the `dropDuplicates()` function a subset of columns to consider when identifying duplicate rows.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()

# Sample data with duplicate rows as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Alice", "age": 30, "city": "Los Angeles"},  # Duplicate row (same name and age)
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},  # Duplicate row (same name and age)
]

# Create a DataFrame
df = spark.createDataFrame(data)

# Drop duplicate rows based on specific columns
df_without_duplicates = df.dropDuplicates(["name", "age"])
df_without_duplicates.show()
```
Output:

```
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|Charlie| 35|  Los Angeles|
|  Alice| 30|     New York|
|    Bob| 25|San Francisco|
+-------+---+-------------+
```
The rows with the same values in the "name" and "age" columns have been collapsed to a single row each. Spark does not guarantee which of the duplicate rows is kept; in this run the first occurrence (Alice in New York) happened to survive.
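Since `dropDuplicates()` gives you no say over which duplicate survives, a common workaround when you need a deterministic result (sketched below with an illustrative ordering column; this is not part of the `dropDuplicates()` API) is to rank the rows in each duplicate group with a window function and keep only the top-ranked row:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("DeterministicDedup").getOrCreate()

data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 30, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
]
df = spark.createDataFrame(data)

# Rank rows within each (name, age) group; the orderBy column decides
# which row survives -- here the alphabetically first city wins.
window = Window.partitionBy("name", "age").orderBy(col("city"))

deduped = (
    df.withColumn("row_num", row_number().over(window))
      .filter(col("row_num") == 1)
      .drop("row_num")
)
deduped.show()
```

Swap the `orderBy` column for a timestamp or any other field to encode whichever "keep this one" rule your data actually needs.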
📖👉 Official Doc