Ordering Rows in a DataFrame - .orderBy()
Overview
The orderBy() function (alias: sort()) sorts the rows of a DataFrame by one or more columns, in ascending order by default or in descending order on request. It returns a new DataFrame with the rows sorted by the specified column(s); the original DataFrame is left unchanged.
Order by Single Column
Pass a column name to the orderBy() function to sort the DataFrame by that single column. The sort is ascending by default.
python
from pyspark.sql import SparkSession
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()
# Sample data as a list of dictionaries
data = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Charlie", "age": 35},
]
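# Create a DataFrame; select() pins the column order, which may otherwise
# follow the inferred (alphabetical) order of the dictionary keys
df = spark.createDataFrame(data).select("name", "age")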
# Order by a single column in ascending order
sorted_df = df.orderBy("age")
sorted_df.show()
Output:
+-------+---+
|   name|age|
+-------+---+
|    Bob| 25|
|  Alice| 30|
|Charlie| 35|
+-------+---+
Order by Single Column in Descending Order
You can use the orderBy() function with the desc() function from pyspark.sql.functions to sort the DataFrame based on a single column in descending order.
python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()
# Sample data as a list of dictionaries
data = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Charlie", "age": 35},
]
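# Create a DataFrame; select() pins the column order, which may otherwise
# follow the inferred (alphabetical) order of the dictionary keys
df = spark.createDataFrame(data).select("name", "age")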
# Order by a single column in descending order
sorted_df = df.orderBy(desc("age"))
sorted_df.show()
Output:
+-------+---+
|   name|age|
+-------+---+
|Charlie| 35|
|  Alice| 30|
|    Bob| 25|
+-------+---+
Order by Multiple Columns
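Pass several column names to the orderBy() function to sort the DataFrame by multiple columns. Rows are ordered by the first column, and each later column breaks ties left by the columns before it.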
python
from pyspark.sql import SparkSession
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("OrderByExample").getOrCreate()
# Sample data as a list of dictionaries
data = [
{"name": "Alice", "age": 30, "city": "New York"},
{"name": "Bob", "age": 25, "city": "San Francisco"},
{"name": "Charlie", "age": 35, "city": "Los Angeles"},
]
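# Create a DataFrame; select() pins the column order shown in the output
df = spark.createDataFrame(data).select("name", "age", "city")
# Option 1: ascending by city, then by age (the default direction)
sorted_asc = df.orderBy("city", "age")
# Option 2: descending by city, then by age
sorted_desc = df.orderBy(["city", "age"], ascending=[False, False])
sorted_desc.show()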
Output:
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 25|San Francisco|
|  Alice| 30|     New York|
|Charlie| 35|  Los Angeles|
+-------+---+-------------+
The orderBy() function in PySpark (alias: sort()) sorts the rows of a DataFrame by one or more columns. Each column can be sorted in ascending or descending order, so you can arrange the DataFrame's rows according to the criteria you need.
📖👉 Official Doc