# Removing Duplicate Rows from a DataFrame - .dropDuplicates()
## Overview
The `dropDuplicates()` function (also available under the alias `drop_duplicates()`) removes duplicate rows from a DataFrame. Duplicate rows are rows with identical values across all columns. By eliminating this redundant information, `dropDuplicates()` helps keep your data clean. It returns a new DataFrame with the duplicates removed, leaving the original DataFrame unchanged.
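Because the function returns a new DataFrame rather than mutating the one it is called on, you can confirm this by comparing row counts before and after. A minimal sketch (the app name and sample data here are illustrative, not from the examples below):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ImmutabilityCheck").getOrCreate()

# Two identical rows plus one distinct row (illustrative data)
df = spark.createDataFrame(
    [("Alice", 30), ("Alice", 30), ("Bob", 25)],
    ["name", "age"],
)

deduped = df.dropDuplicates()

print(df.count())       # 3 -- the original DataFrame is unchanged
print(deduped.count())  # 2 -- the returned DataFrame has duplicates removed
```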
## Drop Duplicate Rows
You can use the `dropDuplicates()` function to remove duplicate rows from a DataFrame.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()

# Sample data with duplicate rows as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Alice", "age": 30, "city": "New York"},  # Duplicate row
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},  # Duplicate row
]

# Create a DataFrame
df = spark.createDataFrame(data)

# Drop duplicate rows
df_without_duplicates = df.dropDuplicates()
df_without_duplicates.show()
```
Output:

```
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|Charlie| 35|  Los Angeles|
|    Bob| 25|San Francisco|
|  Alice| 30|     New York|
+-------+---+-------------+
```
The rows with identical values in every column have been removed from the DataFrame. Note that the order of rows in the result is not guaranteed.
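As an aside, calling `dropDuplicates()` with no arguments is equivalent to `distinct()`, so the example above could also be written as:

```python
# distinct() considers all columns, just like dropDuplicates() with no subset
df_without_duplicates = df.distinct()
df_without_duplicates.show()
```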
## Drop Duplicate Rows Based on Specific Columns
You can pass the `dropDuplicates()` function a subset of columns to consider when identifying duplicate rows.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropDuplicatesExample").getOrCreate()

# Sample data with duplicate rows as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Alice", "age": 30, "city": "Los Angeles"},  # Duplicate row (same name and age)
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},  # Duplicate row (same name and age)
]

# Create a DataFrame
df = spark.createDataFrame(data)

# Drop duplicate rows based on specific columns
df_without_duplicates = df.dropDuplicates(["name", "age"])
df_without_duplicates.show()
```
Output:

```
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|Charlie| 35|  Los Angeles|
|  Alice| 30|     New York|
|    Bob| 25|San Francisco|
+-------+---+-------------+
```
The rows with the same values in the "name" and "age" columns have been collapsed to a single row each. Spark does not guarantee which of the duplicate rows is kept; in this run the first occurrence (Alice in New York) happened to survive.
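Since `dropDuplicates()` gives you no say over which duplicate survives, a common workaround when you need a deterministic result (sketched below with an illustrative ordering column; this is not part of the `dropDuplicates()` API) is to rank the rows in each duplicate group with a window function and keep only the top-ranked row:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("DeterministicDedup").getOrCreate()

data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Alice", "age": 30, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
]
df = spark.createDataFrame(data)

# Rank rows within each (name, age) group; the orderBy column decides
# which row survives -- here the alphabetically first city wins.
window = Window.partitionBy("name", "age").orderBy(col("city"))

deduped = (
    df.withColumn("row_num", row_number().over(window))
      .filter(col("row_num") == 1)
      .drop("row_num")
)
deduped.show()
```

Swap the `orderBy` column for a timestamp or any other field to encode whichever "keep this one" rule your data actually needs.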
📖👉 Official Doc