Performing Different Types of Joins in PySpark

The join() method supports several types of joins, mirroring SQL joins. The join type you choose determines which rows appear in the resulting DataFrame. Here are the join types most commonly used in PySpark:
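
In its most common form, join() takes the other DataFrame, a join condition, and the join type. The on argument accepts a single column name, a list of names, or a boolean expression, and how selects the join type (defaulting to "inner"). A minimal sketch of the call shape, using made-up placeholder DataFrames:

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinSignatureExample").getOrCreate()

# Placeholder DataFrames, invented purely to illustrate the call shape
left_df = spark.createDataFrame([(1, "a")], ["id", "left_val"])
right_df = spark.createDataFrame([(1, "b")], ["id", "right_val"])

# on= may be a column name, a list of names, or a boolean join expression;
# how= selects the join type and defaults to "inner"
joined = left_df.join(right_df, on="id", how="inner")
joined.show()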

Inner Join

An inner join returns only the rows whose join-key values are present in both DataFrames; rows without a match on either side are dropped.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("InnerJoinExample").getOrCreate()

# Sample data as lists of dictionaries
data1 = [{"name": "Alice", "age": 30, "department_id": 1},
         {"name": "Bob", "age": 25, "department_id": 2}]

data2 = [{"department_id": 1, "department_name": "HR"},
         {"department_id": 2, "department_name": "Finance"}]

# Create DataFrames
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Perform an inner join on 'department_id' column
inner_join_df = df1.join(df2, on="department_id", how="inner")

inner_join_df.show()

Output:

+-------------+---+-------+---------------+
|department_id|age|   name|department_name|
+-------------+---+-------+---------------+
|            1| 30|  Alice|             HR|
|            2| 25|    Bob|        Finance|
+-------------+---+-------+---------------+
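
The on="department_id" form works here because the key column has the same name on both sides, and it also keeps a single copy of that column in the output. When the key columns are named differently, you can pass a boolean join expression instead; a sketch with invented column names (note that an expression join keeps both key columns in the result):

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExpressionJoinExample").getOrCreate()

# Same data as above, but the key column is named differently on each side
employees = spark.createDataFrame(
    [("Alice", 30, 1), ("Bob", 25, 2)], ["name", "age", "dept_id"])
departments = spark.createDataFrame(
    [(1, "HR"), (2, "Finance")], ["id", "department_name"])

# A boolean expression join; both key columns remain in the result
inner_join_df = employees.join(
    departments, on=employees["dept_id"] == departments["id"], how="inner")
inner_join_df.show()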

Left Join

A left join returns all the rows from the left DataFrame and the matching rows from the right DataFrame. If there is no match in the right DataFrame, the result will contain null values for the right DataFrame's columns.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("LeftJoinExample").getOrCreate()

# Sample data as lists of dictionaries
data1 = [{"name": "Alice", "age": 30, "department_id": 1},
         {"name": "Bob", "age": 25, "department_id": 2},
         {"name": "Charlie", "age": 35, "department_id": 3}]

data2 = [{"department_id": 1, "department_name": "HR"},
         {"department_id": 2, "department_name": "Finance"}]

# Create DataFrames
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Perform a left join on 'department_id' column
left_join_df = df1.join(df2, on="department_id", how="left")

left_join_df.show()

Output:

+-------------+---+-------+---------------+
|department_id|age|   name|department_name|
+-------------+---+-------+---------------+
|            1| 30|  Alice|             HR|
|            2| 25|    Bob|        Finance|
|            3| 35|Charlie|           null|
+-------------+---+-------+---------------+
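
A common follow-up is replacing the nulls that a left join produces for unmatched rows. Continuing from the example above, one way is fillna() with a per-column default (the "Unknown" label is just an illustration):

python
# Replace the null department_name produced for unmatched rows
left_join_df.fillna({"department_name": "Unknown"}).show()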

Right Join

A right join returns all the rows from the right DataFrame and the matching rows from the left DataFrame. If there is no match in the left DataFrame, the result will contain null values for the left DataFrame's columns.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("RightJoinExample").getOrCreate()

# Sample data as lists of dictionaries
data1 = [{"department_id": 1, "department_name": "HR"},
         {"department_id": 2, "department_name": "Finance"}]

data2 = [{"name": "Alice", "age": 30, "department_id": 1},
         {"name": "Bob", "age": 25, "department_id": 2},
         {"name": "Charlie", "age": 35, "department_id": 3}]

# Create DataFrames
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Perform a right join on 'department_id' column
right_join_df = df1.join(df2, on="department_id", how="right")

right_join_df.show()

Output:

+-------------+---------------+---+-------+
|department_id|department_name|age|   name|
+-------------+---------------+---+-------+
|            1|             HR| 30|  Alice|
|            2|        Finance| 25|    Bob|
|            3|           null| 35|Charlie|
+-------------+---------------+---+-------+
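
A right join is simply a left join with the operands swapped, so the result above can also be produced as follows (continuing from the example; the rows match, though the column order may differ):

python
# Equivalent to the right join above: swap the operands, use a left join
equivalent_df = df2.join(df1, on="department_id", how="left")
equivalent_df.show()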

Full Outer Join

A full outer join returns all the rows from both DataFrames. If there is no match in one of the DataFrames, the result will contain null values for the columns of the DataFrame with no match.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("FullOuterJoinExample").getOrCreate()

# Sample data as lists of dictionaries
data1 = [{"name": "Alice", "age": 30, "department_id": 1},
         {"name": "Bob", "age": 25, "department_id": 2}]

data2 = [{"department_id": 1, "department_name": "HR"},
         {"department_id": 3, "department_name": "Sales"}]

# Create DataFrames
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Perform a full outer join on 'department_id' column
full_outer_join_df = df1.join(df2, on="department_id", how="fullouter")

full_outer_join_df.show()

Output:

+-------------+----+-----+---------------+
|department_id| age| name|department_name|
+-------------+----+-----+---------------+
|            1|  30|Alice|             HR|
|            2|  25|  Bob|           null|
|            3|null| null|          Sales|
+-------------+----+-----+---------------+
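
PySpark accepts "outer", "full", "fullouter", and "full_outer" as interchangeable names for this join type. A full outer join is also a convenient way to find rows that exist on only one side; a sketch continuing from the example above:

python
from pyspark.sql.functions import col

# Keep only the rows that matched on just one side of the join
unmatched_df = full_outer_join_df.filter(
    col("name").isNull() | col("department_name").isNull())
unmatched_df.show()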

Left Semi Join

A left semi join returns only the rows from the left DataFrame that have a match in the right DataFrame. Unlike an inner join, it returns only the left DataFrame's columns, and each left row appears at most once even if it matches several right rows.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("LeftSemiJoinExample").getOrCreate()

# Sample data as lists of dictionaries
data1 = [{"name": "Alice", "age": 30, "department_id": 1},
         {"name": "Bob", "age": 25, "department_id": 2},
         {"name": "Charlie", "age": 35, "department_id": 3}]

data2 = [{"department_id": 1, "department_name": "HR"},
         {"department_id": 2, "department_name": "Finance"}]

# Create DataFrames
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Perform a left semi join on 'department_id' column
left_semi_join_df = df1.join(df2, on="department_id", how="leftsemi")

left_semi_join_df.show()

Output:

+-------------+---+-----+
|department_id|age| name|
+-------------+---+-----+
|            1| 30|Alice|
|            2| 25|  Bob|
+-------------+---+-----+
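
A left semi join behaves like filtering the left DataFrame with an IN subquery: no right-side columns are added. The same result can also be written in Spark SQL; a sketch continuing from the example above:

python
# The Spark SQL equivalent of the leftsemi join above
df1.createOrReplaceTempView("employees")
df2.createOrReplaceTempView("departments")
spark.sql("""
    SELECT *
    FROM employees
    LEFT SEMI JOIN departments
    ON employees.department_id = departments.department_id
""").show()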

Left Anti Join

A left anti join returns only the rows from the left DataFrame that have no match in the right DataFrame; like a semi join, it returns only the left DataFrame's columns.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("LeftAntiJoinExample").getOrCreate()

# Sample data as lists of dictionaries
data1 = [{"name": "Alice", "age": 30, "department_id": 1},
         {"name": "Bob", "age": 25, "department_id": 2},
         {"name": "Charlie", "age": 35, "department_id": 3}]

data2 = [{"department_id": 1, "department_name": "HR"},
         {"department_id": 2, "department_name": "Finance"}]

# Create DataFrames
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Perform a left anti join on 'department_id' column
left_anti_join_df = df1.join(df2, on="department_id", how="leftanti")

left_anti_join_df.show()

Output:

+-------------+---+-------+
|department_id|age|   name|
+-------------+---+-------+
|            3| 35|Charlie|
+-------------+---+-------+
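
The same rows can be obtained with a left join followed by a null filter on a right-side column, which is a useful mental model for what leftanti does; a sketch continuing from the example above:

python
from pyspark.sql.functions import col

# Equivalent to the leftanti join above: left join, then keep unmatched rows
df1.join(df2, on="department_id", how="left") \
    .filter(col("department_name").isNull()) \
    .select("department_id", "age", "name") \
    .show()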

The choice of join type depends on the specific use case and the desired output. Understanding the different types of joins in PySpark allows you to perform complex data integrations and manipulations efficiently.
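
As a quick sanity check when choosing a join type, you can compare row counts across the how strings on the same data; a sketch, assuming the df1 and df2 from the left anti join example above:

python
# Compare row counts across join types on the same key
for how in ["inner", "left", "right", "full", "leftsemi", "leftanti"]:
    print(how, df1.join(df2, on="department_id", how=how).count())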