Ordering Rows in a DataFrame - .orderBy()

Overview

The orderBy() function (alias: sort()) sorts the rows of a DataFrame by one or more columns, in ascending order (the default) or descending order. It returns a new DataFrame with the rows sorted; the original DataFrame is left unchanged.

Order by Single Column

Pass a column name to orderBy() to sort the DataFrame by that column. The sort is ascending by default.

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

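# Create a DataFrame; select the columns explicitly because schema
# inference from dictionaries may order them alphabetically
df = spark.createDataFrame(data).select("name", "age")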

# Order by a single column in ascending order
sorted_df = df.orderBy("age")

sorted_df.show()

Output:

+-------+---+
|   name|age|
+-------+---+
|    Bob| 25|
|  Alice| 30|
|Charlie| 35|
+-------+---+

Order by Single Column in Descending Order

To sort in descending order, wrap the column name in desc() from pyspark.sql.functions and pass the result to orderBy().

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

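# Create a DataFrame; select the columns explicitly because schema
# inference from dictionaries may order them alphabetically
df = spark.createDataFrame(data).select("name", "age")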

# Order by a single column in descending order
sorted_df = df.orderBy(desc("age"))

sorted_df.show()

Output:

+-------+---+
|   name|age|
+-------+---+
|Charlie| 35|
|  Alice| 30|
|    Bob| 25|
+-------+---+

Order by Multiple Columns

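Pass several columns to orderBy() to sort by the first column and break ties with each subsequent one. Column names can be given as separate arguments or as a list, and the ascending parameter accepts either a single boolean or a list with one boolean per column.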

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 35, "city": "Los Angeles"},
]

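# Create a DataFrame; select the columns explicitly because schema
# inference from dictionaries may order them alphabetically
df = spark.createDataFrame(data).select("name", "age", "city")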

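# Option 1: pass column names; sorts by city, then age, both ascending
sorted_df = df.orderBy("city", "age")

# Option 2: pass a list plus per-column directions; here both descending.
# This reassignment replaces option 1, so it is the result shown below.
sorted_df = df.orderBy(["city", "age"], ascending=[False, False])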

sorted_df.show()

Output:

+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 25|San Francisco|
|  Alice| 30|     New York|
|Charlie| 35|  Los Angeles|
+-------+---+-------------+

In summary, orderBy() (alias: sort()) sorts the rows of a DataFrame by one or more columns. The sort is ascending by default; use desc() or the ascending parameter for descending order, and pass several columns when ties on the first column need to be broken by another.

📖👉 Official Doc