
User-Defined/Custom Function (UDF) in PySpark

Overview

A User-Defined Function (UDF) is a custom function that allows you to apply arbitrary Python code to perform transformations on DataFrame columns. UDFs provide flexibility to execute complex operations that are not directly supported by built-in Spark functions.
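
As a minimal sketch of the general pattern (the function, column name, and return type below are only illustrative), you wrap an ordinary Python function with udf(), declare its return type, and then apply it like any built-in column function:

python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Wrap a plain Python function as a UDF and declare its return type
def to_upper(s):
    return s.upper() if s is not None else None

to_upper_udf = udf(to_upper, StringType())

# Apply it to a column just like a built-in function
# (assumes a DataFrame df with a string column named "name")
# df = df.withColumn("name_upper", to_upper_udf(col("name")))

A full, runnable example follows in the next section.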

Example

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("UDFExample").getOrCreate()

# Sample data as a list of dictionaries
data = [{"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35},
        {"name": "David", "age": 28},
        {"name": "Eva", "age": 32}]

# Create a DataFrame
df = spark.createDataFrame(data)


# Define a custom UDF to add a bonus to the age based on a condition
@udf(IntegerType())
def add_bonus(age):
    if age < 30:
        return age + 5
    else:
        return age + 3


# Use the UDF to add a new column 'bonus_age'
df = df.withColumn("bonus_age", add_bonus(col("age")))

df.show()

Output:

+-------+---+---------+
|   name|age|bonus_age|
+-------+---+---------+
|  Alice| 30|       33|
|    Bob| 25|       30|
|Charlie| 35|       38|
|  David| 28|       33|
|    Eva| 32|       35|
+-------+---+---------+

In this example, we define a custom function add_bonus() that takes an 'age' as input and adds a bonus based on an age condition. The @udf(IntegerType()) decorator wraps the function as a UDF and declares its return type, making it usable in DataFrame transformations. We then apply the UDF with withColumn() and col("age") to create a new column 'bonus_age' containing the bonus age values derived from the 'age' column.
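
If you also want to call the function from Spark SQL, you can additionally register it with spark.udf.register(). A brief sketch, where the registered SQL name and the temporary view name are illustrative choices:

python
# Register the already-wrapped UDF under a name callable from SQL
spark.udf.register("add_bonus", add_bonus)

# Expose the DataFrame as a temporary view and call the UDF in a SQL query
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age, add_bonus(age) AS bonus_age FROM people").show()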

UDFs are powerful tools that allow you to extend PySpark's capabilities by applying custom Python code to DataFrame columns, giving you the flexibility to handle complex transformations and calculations in your data. However, use UDFs judiciously: each call moves data between the JVM and Python worker processes, which adds serialization overhead on large datasets, and Spark's Catalyst optimizer cannot see inside a UDF, so prefer built-in functions when they can express the same logic.
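
For simple conditional logic like the bonus calculation above, the same result can be expressed with the built-in when()/otherwise() functions, which avoids the Python round trip entirely; a sketch of the equivalent transformation:

python
from pyspark.sql.functions import when, col

# Equivalent logic using built-in expressions instead of a UDF
df_builtin = df.withColumn(
    "bonus_age",
    when(col("age") < 30, col("age") + 5).otherwise(col("age") + 3)
)
df_builtin.show()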