Dropping Columns from a DataFrame - .drop()

Overview

The drop() method removes one or more columns from a DataFrame, which is useful for discarding columns you do not need before further processing. It returns a new DataFrame without the specified columns; the original DataFrame is left unchanged.

Drop a Single Column

To remove a single column, pass its name as the argument to drop().

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropColumnExample").getOrCreate()

# Sample data as a list of tuples; the column names are passed
# explicitly so the column order is name, age, country
data = [
    ("Alice", 30, "USA"),
    ("Bob", 25, "Canada"),
    ("Charlie", 35, "UK"),
]

# Create a DataFrame
df = spark.createDataFrame(data, ["name", "age", "country"])

# Drop a single column
df_dropped = df.drop("age")

df_dropped.show()

Output:

+-------+-------+
|   name|country|
+-------+-------+
|  Alice|    USA|
|    Bob| Canada|
|Charlie|     UK|
+-------+-------+

Drop Multiple Columns

To remove several columns at once, pass each column name to drop() as a separate argument. If the names are stored in a Python list, unpack it with the * operator, for example df.drop(*cols).

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropColumnExample").getOrCreate()

# Sample data as a list of tuples; the column names are passed
# explicitly so the column order is name, age, country
data = [
    ("Alice", 30, "USA"),
    ("Bob", 25, "Canada"),
    ("Charlie", 35, "UK"),
]

# Create a DataFrame
df = spark.createDataFrame(data, ["name", "age", "country"])

# Drop multiple columns
df_dropped = df.drop("age", "country")

df_dropped.show()

Output:

+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+

In short, drop() removes columns that are not needed for analysis or further processing. Whether you pass one column name or several, it returns a new DataFrame that contains only the remaining columns.

📖👉 Official Doc