# Combining DataFrames - union, unionAll, unionByName
## Overview

In PySpark, you can combine two or more DataFrames using the `union`, `unionAll`, and `unionByName` methods. These methods stack DataFrames vertically, appending the rows of one DataFrame to another. Their exact behavior differs in ways that are easy to get wrong, however, so it is worth knowing what each one actually does.
## union

The `union` method combines two DataFrames by column *position*: the first column of one DataFrame is stacked onto the first column of the other, and so on. Despite what the SQL `UNION` keyword might suggest, `union` does **not** remove duplicate rows; it follows SQL `UNION ALL` semantics and keeps every row from both source DataFrames. If you need a distinct result, chain `.distinct()` onto the union.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Sample data as lists of (name, age) tuples; an explicit schema
# keeps the column order predictable
data1 = [("Alice", 30), ("Bob", 25)]
data2 = [("Charlie", 35), ("David", 28)]

# Create DataFrames
df1 = spark.createDataFrame(data1, ["name", "age"])
df2 = spark.createDataFrame(data2, ["name", "age"])

# Stack df2 under df1, matching columns by position, using 'union'
union_df = df1.union(df2)
union_df.show()
```
Output:

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
|  David| 28|
+-------+---+
```
## unionAll

The `unionAll` method is simply an alias for `union`. It has been deprecated since Spark 2.0 and is kept for backward compatibility; like `union`, it matches columns by position and retains all rows from the source DataFrames, including duplicates.
```python
# 'unionAll' behaves exactly like 'union' (deprecated alias)
union_all_df = df1.unionAll(df2)
union_all_df.show()
```
Output:

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
|  David| 28|
+-------+---+
```
## unionByName

The `unionByName` method also keeps all rows, including duplicates, but it matches columns by *name* rather than by position. If the source DataFrames list their columns in different orders, `unionByName` aligns them correctly, whereas a positional `union` would silently put values in the wrong columns. Since Spark 3.1 it also accepts an `allowMissingColumns=True` argument for combining DataFrames whose column sets differ.
```python
# Create a DataFrame with the same data but the opposite column order
df2_reordered = df2.select("age", "name")

# Perform union based on column names using 'unionByName'
union_by_name_df = df1.unionByName(df2_reordered)
union_by_name_df.show()
```
Output:

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
|  David| 28|
+-------+---+
```

Even though `df2_reordered` lists `age` before `name`, the rows line up correctly because the match is done by column name.
In these examples, we created two DataFrames, `df1` and `df2`, each with a different set of rows, and combined them with `union`, `unionAll`, and `unionByName`. The key points to remember: `union` and `unionAll` are equivalent and both keep duplicate rows (append `.distinct()` when you want SQL `UNION` semantics), `unionAll` is merely a deprecated alias, and `unionByName` matches columns by name rather than position, making it the safer choice whenever the column order of the source DataFrames might differ.