Ordering Rows in a DataFrame - `.orderBy()`

Overview

The orderBy() method (also available as sort()) sorts the rows of a DataFrame by one or more columns, in ascending order by default or in descending order on request. It does not modify the DataFrame it is called on; it returns a new DataFrame with the rows sorted by the specified column(s).

Order by Single Column

Pass a column name to orderBy() to sort the DataFrame by that column. The sort is ascending by default.

python

from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

# Create a DataFrame
df = spark.createDataFrame(data)

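# Order by a single column in ascending order (the default)
sorted_df = df.orderBy("age")

# select() pins the column order; columns inferred from dicts may come back alphabetically
sorted_df.select("name", "age").show()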

Output:

+-------+---+
|   name|age|
+-------+---+
|    Bob| 25|
|  Alice| 30|
|Charlie| 35|
+-------+---+

Order by Single Column in Descending Order

To sort in descending order, wrap the column name in desc() from pyspark.sql.functions and pass the result to orderBy().

python

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

# Create a DataFrame
df = spark.createDataFrame(data)

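# Order by a single column in descending order
sorted_df = df.orderBy(desc("age"))

# select() pins the column order; columns inferred from dicts may come back alphabetically
sorted_df.select("name", "age").show()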

Output:

+-------+---+
|   name|age|
+-------+---+
|Charlie| 35|
|  Alice| 30|
|    Bob| 25|
+-------+---+

Order by Multiple Columns

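Pass several column names to orderBy() to sort by multiple columns. Rows are sorted by the first column, and each later column breaks ties left by the ones before it. The ascending parameter accepts a single boolean or a list with one boolean per column.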

python

from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
]

# Create a DataFrame
df = spark.createDataFrame(data)

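# Option 1: order by multiple columns, ascending by default
sorted_df = df.orderBy("city", "age")

# Option 2: same columns, both descending; this reassignment is the result shown below
sorted_df = df.orderBy(["city", "age"], ascending=[False, False])

# select() pins the column order; columns inferred from dicts may come back alphabetically
sorted_df.select("name", "age", "city").show()

Output:

+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 25|San Francisco|
|  Alice| 30|     New York|
|Charlie| 35|  Los Angeles|
+-------+---+-------------+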

In summary, orderBy() (also available as sort()) sorts the rows of a PySpark DataFrame by one or more columns, in ascending or descending order, and can apply a separate direction to each column.

