Applying a Function to Each Row in a DataFrame - .map()
Overview
The map() function applies a transformation function to each row of a DataFrame. It operates on the DataFrame's underlying RDD, which lets you modify the values of a specific column or derive a new column from existing data. The map() function returns a new RDD with the specified transformation applied.
Example
```python
from pyspark.sql import SparkSession, Row

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("MapExample").getOrCreate()

# Sample data as a list of dictionaries
data = [{"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35}]

# Create a DataFrame
df = spark.createDataFrame(data)

# Define a function to apply to each row.
# Row objects are immutable, so we build and return a new Row
# rather than assigning to a field of the input row.
def add_bonus(row):
    age = row["age"]
    bonus = age + 5 if age < 30 else age + 3
    return Row(name=row["name"], age=age, bonus_age=bonus)

# Use 'map()' to transform each row in the DataFrame
mapped_rdd = df.rdd.map(add_bonus)

# Convert the transformed RDD back to a DataFrame
mapped_df = mapped_rdd.toDF()
mapped_df.show()
```
Output:
+-------+---+---------+
| name|age|bonus_age|
+-------+---+---------+
| Alice| 30| 33|
| Bob| 25| 30|
|Charlie| 35| 38|
+-------+---+---------+
In this example, we first define the add_bonus() function, which computes the bonus for a single row. We then call map() on the DataFrame's underlying RDD to apply add_bonus() to every row, producing a new RDD with the specified transformation. Finally, we convert the transformed RDD back to a DataFrame with the toDF() method to obtain the final DataFrame mapped_df. Mapping over the DataFrame's RDD is a flexible way to apply arbitrary Python logic to each row, although it serializes rows into Python objects and bypasses the optimizations of the DataFrame API.