Applying a Function to Each Row in a DataFrame - .foreach()
Overview
The foreach() method applies a function to each row of a DataFrame. It is a higher-order function: you pass it a function that takes a single Row, and Spark calls that function once per row on the executors. foreach() is an action and does not return a new DataFrame (it returns None); it is used for side effects, such as writing data to an external storage system, printing rows, or updating values externally.
Example
python
from pyspark.sql import SparkSession
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("ForeachExample").getOrCreate()
# Sample data as a list of dictionaries
data = [{"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25},
        {"name": "Charlie", "age": 35}]
# Create a DataFrame
df = spark.createDataFrame(data)
# Define a function to apply to each row
def process_row(row):
    # Access row values using the column names
    name = row["name"]
    age = row["age"]
    print(f"Processing row: Name: {name}, Age: {age}")
# Apply the function using 'foreach()'
df.foreach(process_row)
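Output (when run in local mode; row order may vary, and on a cluster the lines appear in the executor logs rather than on the driver):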
Processing row: Name: Alice, Age: 30
Processing row: Name: Bob, Age: 25
Processing row: Name: Charlie, Age: 35
In this example, foreach() applies process_row() to each row of the DataFrame, and the function prints each row's name and age. Because foreach() discards whatever the function returns, it suits tasks built around side effects, such as writing data to external systems or updating values outside of Spark.