
Applying a Function to Each Row in a DataFrame - .map()

Overview

The map() function applies a transformation function to each element of an RDD. It is not available on DataFrames directly; to apply a Python function to every row of a DataFrame, you call map() on the DataFrame's underlying RDD (df.rdd). The function receives each Row and can return a transformed record, for example one with a new field derived from existing columns. map() returns a new RDD with the transformation applied, which can then be converted back to a DataFrame.

Example

python
from pyspark.sql import Row, SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("MapExample").getOrCreate()

# Sample data
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]

# Create a DataFrame with an explicit column order
df = spark.createDataFrame(data, ["name", "age"])


# Define a function to apply to each row
def add_bonus(row):
    # Row objects are immutable, so copy the fields into a dict,
    # add the new field, and build a new Row from it
    fields = row.asDict()
    age = fields["age"]
    if age < 30:
        fields["bonus_age"] = age + 5
    else:
        fields["bonus_age"] = age + 3
    return Row(**fields)

# Use 'map()' to transform each row in the DataFrame
mapped_rdd = df.rdd.map(add_bonus)

# Convert the transformed RDD back to a DataFrame
mapped_df = mapped_rdd.toDF()

mapped_df.show()

Output:

+-------+---+---------+
|   name|age|bonus_age|
+-------+---+---------+
|  Alice| 30|       33|
|    Bob| 25|       30|
|Charlie| 35|       38|
+-------+---+---------+

In this example, we first define add_bonus() to compute the bonus for a single Row; because Row objects are immutable, the function copies the fields into a dict, adds bonus_age, and returns a new Row. We then call map() on the DataFrame's underlying RDD to apply add_bonus() to every row, and convert the resulting RDD back to a DataFrame with toDF() to obtain mapped_df. Note that map() ships each row through a Python function, which adds serialization overhead; its strength is flexibility, letting you apply arbitrary custom logic that the built-in DataFrame API cannot express.