
Pivoting a DataFrame - .pivot()

Overview

The .pivot() method transforms a DataFrame from long format to wide format: the distinct values of one column are rotated into columns of their own, reorganizing the data into a summarized view. In PySpark, .pivot() is called on the GroupedData object returned by .groupBy() and must be followed by an aggregation such as .count() or .agg(). It is particularly useful for creating cross-tabulations and summarizing data in tabular form.
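
A minimal sketch of the general call pattern, where df, "group_col", "pivot_col", and "value_col" are placeholder names:

python
from pyspark.sql import functions as F

# .pivot() sits between .groupBy() and an aggregation:
wide_df = (df.groupBy("group_col")          # one row per group value
             .pivot("pivot_col")            # one column per distinct pivot value
             .agg(F.sum("value_col")))      # cell values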

Example

python
from pyspark.sql import SparkSession

# Create a SparkSession (if not already created)
spark = SparkSession.builder.appName("PivotExample").getOrCreate()

# Sample data as a list of dictionaries
data = [{"name": "Alice", "age": 30, "department": "HR"},
        {"name": "Bob", "age": 25, "department": "Finance"},
        {"name": "Charlie", "age": 35, "department": "HR"},
        {"name": "David", "age": 28, "department": "Engineering"},
        {"name": "Eva", "age": 32, "department": "Finance"}]

# Create a DataFrame
df = spark.createDataFrame(data)

# Pivot the DataFrame to create a cross-tabulation of 'age' against 'department'
pivot_df = df.groupBy("department").pivot("age").count()

pivot_df.show()

Output:

+-----------+----+----+----+----+----+
| department|  25|  28|  30|  32|  35|
+-----------+----+----+----+----+----+
|    Finance|   1|null|null|   1|null|
|         HR|null|null|   1|null|   1|
|Engineering|null|   1|null|null|null|
+-----------+----+----+----+----+----+

In this example, the DataFrame df contains information about employees: their names, ages, and departments. We use .pivot() together with .groupBy() and .count() to create a cross-tabulation of 'age' against 'department'. In the resulting DataFrame pivot_df, each row represents a unique 'department' and each column represents one of the distinct age values in the original DataFrame. Each cell holds the number of employees in that department with that age; department/age combinations that never occur in the data appear as null.
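
If the pivot values are known in advance, they can be passed explicitly as the second argument to .pivot(), which fixes the column order and lets Spark skip the extra job it otherwise runs to compute the distinct values. The nulls for missing combinations can then be replaced with zeros using .fillna(). A minimal sketch building on the example above:

python
# List the pivot values explicitly to fix the columns up front
pivot_df = df.groupBy("department").pivot("age", [25, 28, 30, 32, 35]).count()

# Replace the nulls produced for missing department/age combinations with 0
pivot_df = pivot_df.fillna(0)
pivot_df.show()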

.pivot() is a powerful tool for summarizing and reshaping data in PySpark, especially when analyzing categorical variables and the relationships between attributes in a DataFrame.
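
It also works with any aggregate function, not just .count(). As an illustrative sketch on the same data, pivoting 'department' into columns and computing the average age per department:

python
from pyspark.sql import functions as F

# With no grouping columns, the result is a single row of averages,
# one column per department
avg_df = df.groupBy().pivot("department").agg(F.avg("age"))
avg_df.show()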