Selecting Distinct Rows in a DataFrame - `.distinct()`

Overview

The distinct() method returns a new DataFrame that contains only the unique rows of the original. Two rows are treated as duplicates only if they have the same values in every column, so rows that differ in even one column are both kept. The original DataFrame is left unchanged.

Select Distinct Rows

Call distinct() on a DataFrame to remove exact duplicate rows. The following example builds a small DataFrame in which two rows are repeated, then deduplicates it.

python

from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DistinctExample").getOrCreate()

# Sample data with duplicate rows as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Alice", "age": 30, "city": "New York"},  # Duplicate row
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},  # Duplicate row
]

# Create a DataFrame. The select() pins the column order to match the output
# shown below, since the order inferred from dictionaries may differ.
df = spark.createDataFrame(data).select("name", "age", "city")

# Select distinct rows
distinct_df = df.distinct()

distinct_df.show()

Output (row order may vary, because distinct() does not guarantee any ordering):

+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 25|San Francisco|
|Charlie| 35|  Los Angeles|
|  Alice| 30|     New York|
+-------+---+-------------+

The duplicated Alice and Bob rows each appear only once in the result, so the five input rows are reduced to three.

📖👉 Official Doc
