
Add/Update columns in a DataFrame - .withColumn()

Overview

The withColumn() function adds or updates a column in a DataFrame. You can use it to create new columns from constant values, conditions, or calculations on existing columns, or to overwrite the values of an existing column. withColumn() returns a new DataFrame with the added or updated column, leaving the original DataFrame unchanged.

Add new column

You can use the withColumn() function to add a new column to the DataFrame. The new column can be a constant value, a value based on a condition, or the result of a calculation.

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("WithColumnExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

# Create a DataFrame
df = spark.createDataFrame(data)

# Add a new column with a constant value
df_with_new_column = df.withColumn("country", lit("USA"))

df_with_new_column.show()

Output:

+-------+---+-------+
|   name|age|country|
+-------+---+-------+
|  Alice| 30|    USA|
|    Bob| 25|    USA|
|Charlie| 35|    USA|
+-------+---+-------+

Calculated from other column

You can use the withColumn() function to add a new column based on calculations or transformations performed on existing columns.

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("WithColumnExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
    {"name": "Charlie", "age": 35},
]

# Create a DataFrame
df = spark.createDataFrame(data)

# Add a new column calculated from other columns
df_with_calculated_column = df.withColumn("age_after_5_years", col("age") + 5)

df_with_calculated_column.show()

Output:

+-------+---+-----------------+
|   name|age|age_after_5_years|
+-------+---+-----------------+
|  Alice| 30|               35|
|    Bob| 25|               30|
|Charlie| 35|               40|
+-------+---+-----------------+

Update an existing column

You can use the withColumn() function to update the values of an existing column based on certain conditions or transformations.

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("WithColumnExample").getOrCreate()

# Sample data as a list of dictionaries
data = [
    {"name": "Alice", "age": 30, "status": "Unknown"},
    {"name": "Bob", "age": 25, "status": "Unknown"},
    {"name": "Charlie", "age": 35, "status": "Unknown"},
]

# Create a DataFrame
df = spark.createDataFrame(data)
df.show()

print("----------------------------")

# Update an existing column based on a condition
df_with_updated_column = df.withColumn("status", when(col("age") >= 30, "Senior").otherwise("Junior"))

df_with_updated_column.show()

Output:

+-------+---+--------+
|   name|age|  status|
+-------+---+--------+
|  Alice| 30| Unknown|
|    Bob| 25| Unknown|
|Charlie| 35| Unknown|
+-------+---+--------+

----------------------------

+-------+---+-------+
|   name|age| status|
+-------+---+-------+
|  Alice| 30| Senior|
|    Bob| 25| Junior|
|Charlie| 35| Senior|
+-------+---+-------+

The withColumn() function in PySpark provides a flexible and powerful way to add or update columns in a DataFrame. It allows you to create new columns with constant values or calculated from other columns, as well as update the values of existing columns based on conditions. This functionality is fundamental for data manipulation and transformation in PySpark data processing pipelines.