
JSON Functions

from_json():

The from_json() function is used to parse JSON strings in a DataFrame column and convert them into a structured (struct) column. It takes two arguments: the column containing the JSON strings and a schema describing the structure of the JSON data, given either as a StructType or as a DDL-formatted string. The function returns a new column containing the parsed data.

Syntax:

python
from pyspark.sql import functions as F

df = df.withColumn("parsed_column", F.from_json("json_column", json_schema))

Example: Suppose we have a DataFrame containing JSON data in the json_data column, and we want to parse it using a specific JSON schema.

python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F

# Create a SparkSession
spark = SparkSession.builder \
    .appName("from_json Example") \
    .getOrCreate()

# Sample data with JSON strings
data = [
    ('{"name": "Alice", "age": 28}',),
    ('{"name": "Bob", "age": 32}',),
    ('{"name": "Charlie", "age": 25}',),
    ('{"name": "David", "age": 31}',),
    ('{"name": "Eva", "age": 24}',),
]

# Create DataFrame with JSON strings
schema = StructType([StructField("json_data", StringType(), True)])
df = spark.createDataFrame(data, schema)

# Define JSON schema
json_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Parse JSON data using from_json()
df = df.withColumn("parsed_data", F.from_json("json_data", json_schema))
df.show(truncate=False)

The output will be:

+------------------------------+-------------+
|json_data                     |parsed_data  |
+------------------------------+-------------+
|{"name": "Alice", "age": 28}  |{Alice, 28}  |
|{"name": "Bob", "age": 32}    |{Bob, 32}    |
|{"name": "Charlie", "age": 25}|{Charlie, 25}|
|{"name": "David", "age": 31}  |{David, 31}  |
|{"name": "Eva", "age": 24}    |{Eva, 24}    |
+------------------------------+-------------+
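The parsed column is a struct, so individual fields can be read with dot notation or flattened with a star expansion, and the schema can also be supplied as a DDL-formatted string instead of a StructType. The snippet below is a minimal sketch of both, reusing the df and json_schema from the example above; the df_ddl name is purely illustrative.

python
# Read individual fields from the parsed struct with dot notation
df.select("json_data", "parsed_data.name", "parsed_data.age").show(truncate=False)

# Alternatively, pass the schema as a DDL-formatted string
df_ddl = df.withColumn("parsed_data", F.from_json("json_data", "name STRING, age INT"))
df_ddl.select("parsed_data.*").show(truncate=False)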

to_json():

The to_json() function is used to convert a struct column (or an array or map column) of a DataFrame into JSON strings. It takes the column to serialize, plus an optional dictionary of JSON options, and returns a new column containing the JSON representation.

Syntax:

python
from pyspark.sql import functions as F

df = df.withColumn("json_string", F.to_json(struct_col))

Example: Suppose we have a DataFrame with a struct column data containing name and age fields, and we want to convert it to JSON strings.

python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F

# Create a SparkSession
spark = SparkSession.builder \
    .appName("to_json Example") \
    .getOrCreate()

# Sample data with a struct column
data = [
    ("Alice", 28),
    ("Bob", 32),
    ("Charlie", 25),
    ("David", 31),
    ("Eva", 24),
]

# Create DataFrame with a struct column
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame(data, schema)

# Convert struct column to JSON strings using to_json()
df = df.withColumn("json_string", F.to_json(F.struct("name", "age")))
df.show(truncate=False)

The output will be:

+-------+---+---------------------------+
|name   |age|json_string                |
+-------+---+---------------------------+
|Alice  |28 |{"name":"Alice","age":28}  |
|Bob    |32 |{"name":"Bob","age":32}    |
|Charlie|25 |{"name":"Charlie","age":25}|
|David  |31 |{"name":"David","age":31}  |
|Eva    |24 |{"name":"Eva","age":24}    |
+-------+---+---------------------------+
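to_json() is not limited to struct columns; it also serializes ArrayType and MapType columns. The snippet below is a rough sketch reusing the DataFrame above: the all_people and attrs column names are illustrative, and age is cast to a string so the map has a single value type.

python
# Collect all rows into an array of structs and serialize it as one JSON array
df.groupBy().agg(
    F.to_json(F.collect_list(F.struct("name", "age"))).alias("all_people")
).show(truncate=False)

# Serialize a map column as a JSON object (map values must share one type)
df.withColumn(
    "attrs",
    F.to_json(F.create_map(F.lit("name"), F.col("name"),
                           F.lit("age"), F.col("age").cast("string")))
).show(truncate=False)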

json_tuple():

The json_tuple() function is used to extract specific fields from a JSON string in a DataFrame column. It takes the column containing the JSON strings followed by one or more field names to extract, and returns the values of those fields as separate string columns.

Syntax:

python
from pyspark.sql import functions as F

df = df.withColumn("field1", F.json_tuple("json_column", "field1_name"))

Example: Suppose we have a DataFrame containing JSON data in the json_data column, and we want to extract the name field from the JSON strings.

python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create a SparkSession
spark = SparkSession.builder \
    .appName("json_tuple Example") \
    .getOrCreate()

# Sample data with JSON strings
data = [
    ('{"name": "Alice", "age": 28}',),
    ('{"name": "Bob", "age": 32}',),
    ('{"name": "Charlie", "age": 25}',),
    ('{"name": "David", "age": 31}',),
    ('{"name": "Eva", "age": 24}',),
]

# Create DataFrame with JSON strings
df = spark.createDataFrame(data, ["json_data"])

# Extract the 'name' field using json_tuple()
df = df.withColumn("name", F.json_tuple("json_data", "name"))
df.show(truncate=False)

The output will be:

+-------------------------------+---------+
|json_data                      |name     |
+-------------------------------+---------+
|{"name": "Alice", "age": 28}   |Alice    |
|{"name": "Bob", "age": 32}     |Bob      |
|{"name": "Charlie", "age": 25} |Charlie  |
|{"name": "David", "age": 31}   |David    |
|{"name": "Eva", "age": 24}     |Eva      |
+-------------------------------+---------+
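To extract several fields in one pass, json_tuple() is typically used inside select(), with alias() supplying one name per generated column. Note that json_tuple() returns every extracted value as a string, so numeric fields such as age may need a cast afterwards. A short sketch under those assumptions, reusing the df from the example above:

python
# Extract both fields at once; alias() names the generated columns
df_fields = df.select(
    "json_data",
    F.json_tuple("json_data", "name", "age").alias("name", "age")
)

# json_tuple() yields strings, so cast age back to an integer if needed
df_fields = df_fields.withColumn("age", F.col("age").cast("int"))
df_fields.show(truncate=False)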