Dropping Columns from a DataFrame - .drop()
Overview
The drop() function removes one or more columns from a DataFrame, which lets you discard columns you do not need and focus on relevant data or simplify later processing. It returns a new DataFrame without the specified columns and leaves the original DataFrame unchanged.
Drop a Single Column
You can use the drop() function to remove a single column from the DataFrame by passing the column name as an argument.
python
from pyspark.sql import SparkSession
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropColumnExample").getOrCreate()
# Sample data as a list of tuples; column names are given explicitly,
# because schemas inferred from dictionaries sort the keys alphabetically
data = [
("Alice", 30, "USA"),
("Bob", 25, "Canada"),
("Charlie", 35, "UK"),
]
# Create a DataFrame with columns in the order name, age, country
df = spark.createDataFrame(data, ["name", "age", "country"])
# Drop a single column
df_dropped = df.drop("age")
df_dropped.show()
Output:
+-------+-------+
|   name|country|
+-------+-------+
|  Alice|    USA|
|    Bob| Canada|
|Charlie|     UK|
+-------+-------+
Drop Multiple Columns
You can also use the drop() function to remove multiple columns from the DataFrame by passing each column name as a separate argument. If the names are stored in a Python list, unpack the list with * when calling drop().
python
from pyspark.sql import SparkSession
# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("DropColumnExample").getOrCreate()
# Sample data as a list of tuples; column names are given explicitly,
# because schemas inferred from dictionaries sort the keys alphabetically
data = [
("Alice", 30, "USA"),
("Bob", 25, "Canada"),
("Charlie", 35, "UK"),
]
# Create a DataFrame with columns in the order name, age, country
df = spark.createDataFrame(data, ["name", "age", "country"])
# Drop multiple columns
df_dropped = df.drop("age", "country")
df_dropped.show()
Output:
+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+
The drop() function in PySpark removes columns that are not needed for analysis or further processing. Whether you drop a single column or several, it returns a DataFrame that keeps only the columns you still need.
📖👉 Official Doc