# Combining DataFrames - union, unionAll, unionByName
## Overview

In PySpark, you can combine two or more DataFrames using the `union`, `unionAll`, and `unionByName` methods. These methods stack DataFrames vertically, appending the rows of one DataFrame to another. Their exact behavior differs in ways that are easy to get wrong, however, so it is worth knowing what each one actually does.
## union

The `union` method combines two DataFrames by column *position*: the first column of one DataFrame is stacked onto the first column of the other, and so on. Despite what the SQL `UNION` keyword might suggest, `union` does **not** remove duplicate rows; it follows SQL `UNION ALL` semantics and keeps every row from both source DataFrames. If you need a distinct result, chain `.distinct()` onto the union.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("UnionExample").getOrCreate()

# Sample data as lists of (name, age) tuples; an explicit schema
# keeps the column order predictable
data1 = [("Alice", 30), ("Bob", 25)]
data2 = [("Charlie", 35), ("David", 28)]

# Create DataFrames
df1 = spark.createDataFrame(data1, ["name", "age"])
df2 = spark.createDataFrame(data2, ["name", "age"])

# Stack df2 under df1, matching columns by position, using 'union'
union_df = df1.union(df2)
union_df.show()
```
Output:

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
|  David| 28|
+-------+---+
```
## unionAll

The `unionAll` method is simply an alias for `union`. It has been deprecated since Spark 2.0 and is kept for backward compatibility; like `union`, it matches columns by position and retains all rows from the source DataFrames, including duplicates.
```python
# 'unionAll' behaves exactly like 'union' (deprecated alias)
union_all_df = df1.unionAll(df2)
union_all_df.show()
```
Output:

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
|  David| 28|
+-------+---+
```
## unionByName

The `unionByName` method also keeps all rows, including duplicates, but it matches columns by *name* rather than by position. If the source DataFrames list their columns in different orders, `unionByName` aligns them correctly, whereas a positional `union` would silently put values in the wrong columns. Since Spark 3.1 it also accepts an `allowMissingColumns=True` argument for combining DataFrames whose column sets differ.
```python
# Create a DataFrame with the same data but the opposite column order
df2_reordered = df2.select("age", "name")

# Perform union based on column names using 'unionByName'
union_by_name_df = df1.unionByName(df2_reordered)
union_by_name_df.show()
```
Output:

```
+-------+---+
|   name|age|
+-------+---+
|  Alice| 30|
|    Bob| 25|
|Charlie| 35|
|  David| 28|
+-------+---+
```

Even though `df2_reordered` lists `age` before `name`, the rows line up correctly because the match is done by column name.
In these examples, we created two DataFrames, `df1` and `df2`, each with a different set of rows, and combined them with `union`, `unionAll`, and `unionByName`. The key points to remember: `union` and `unionAll` are equivalent and both keep duplicate rows (append `.distinct()` when you want SQL `UNION` semantics), `unionAll` is merely a deprecated alias, and `unionByName` matches columns by name rather than position, making it the safer choice whenever the column order of the source DataFrames might differ.