Selecting Distinct Rows in a DataFrame - .distinct()
Overview
The distinct() method selects the distinct rows of a DataFrame. Two rows count as duplicates only when they hold the same values in every column, so a distinct row is one whose full combination of column values is unique. Use distinct() to eliminate duplicate records and focus on unique data. It returns a new DataFrame containing the distinct rows and leaves the original DataFrame unchanged.
Select Distinct Rows
You can use the distinct() method to select distinct rows from a DataFrame.
python
from pyspark.sql import SparkSession
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DistinctExample").getOrCreate()
# Sample data with duplicate rows as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Alice", "age": 30, "city": "New York"},  # Duplicate row
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},  # Duplicate row
]
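# Create a DataFrame; select() pins the column order, because schema
# inference from dictionaries can sort the keys alphabetically
df = spark.createDataFrame(data).select("name", "age", "city")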
# Select distinct rows
distinct_df = df.distinct()
distinct_df.show()
Output (row order may vary, since distinct() does not guarantee ordering):
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 25|San Francisco|
|Charlie| 35|  Los Angeles|
|  Alice| 30|     New York|
+-------+---+-------------+
The distinct() method keeps one copy of each row whose values are unique across all columns, effectively removing duplicate rows.
📖👉 Official Doc