
For Spark versions without arrays_zip (i.e. before Spark 2.4), we can also do this:

First, read the JSON file into a DataFrame:

    from pyspark.sql import functions as F

    df = spark.read.json("your_json_file.json")
    df.show(truncate=False)
    +---------------------+------------------------------------------+
    |Price                |Product                                   |
    +---------------------+------------------------------------------+
    |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]|
    +---------------------+------------------------------------------+
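
For context, this assumes the JSON stores Price and Product as objects keyed by row index, which is why Spark reads them back as struct columns rather than arrays. A minimal sketch of an equivalent DataFrame built in-memory (the key names "0" to "3" are an assumption, not from the original answer):

    # Build an equivalent DataFrame without a file; Price and Product are
    # nested Rows, so they come out as struct columns, not arrays.
    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([Row(
        Price=Row(**{"0": 700, "1": 250, "2": 800, "3": 1200}),
        Product=Row(**{"0": "Desktop Computer", "1": "Tablet",
                       "2": "iPhone", "3": "Laptop"}),
    )])
    df.printSchema()   # Price and Product show up as structs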

Next, expand the structs into arrays:

    df = df.withColumn('prc_array', F.array(F.expr('Price.*')))
    df = df.withColumn('prod_array', F.array(F.expr('Product.*')))
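
If you prefer to avoid F.expr, the same expansion can be written by pulling the struct's field names from the schema; a sketch that should produce the same prc_array as the expression above:

    # Equivalent to F.array(F.expr('Price.*')): read the struct's field names
    # from the schema and pass each field to F.array. Backticks guard against
    # field names that are not valid identifiers (e.g. "0").
    price_fields = df.schema['Price'].dataType.fieldNames()
    df = df.withColumn(
        'prc_array',
        F.array(*[F.col(f'Price.`{name}`') for name in price_fields])
    )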

Then, create a map between the two arrays:

    df = df.withColumn('prc_prod_map', F.map_from_arrays('prc_array', 'prod_array'))
    df.select('prc_array', 'prod_array', 'prc_prod_map').show(truncate=False)
    +---------------------+------------------------------------------+-----------------------------------------------------------------------+
    |prc_array            |prod_array                                |prc_prod_map                                                           |
    +---------------------+------------------------------------------+-----------------------------------------------------------------------+
    |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]|[700 -> Desktop Computer, 250 -> Tablet, 800 -> iPhone, 1200 -> Laptop]|
    +---------------------+------------------------------------------+-----------------------------------------------------------------------+
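
One caveat with the map-based approach (an observation, not part of the original answer): the prices become map keys, so duplicate prices collide. Spark 3.x rejects duplicate map keys by default, though that behaviour is configurable:

    # Duplicate map keys raise an error on Spark 3.x by default; to keep the
    # last occurrence instead (if duplicate prices are possible in your data):
    spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")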

Finally, apply explode on the map:

    df = df.select(F.explode('prc_prod_map').alias('prc', 'prod'))
    df.show(truncate=False)
    +----+----------------+
    |prc |prod            |
    +----+----------------+
    |700 |Desktop Computer|
    |250 |Tablet          |
    |800 |iPhone          |
    |1200|Laptop          |
    +----+----------------+

This way, we avoid a potentially time-consuming join between two tables.
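
For comparison, on Spark 2.4+ where arrays_zip is available, the map step can be skipped entirely. A sketch, assuming prc_array and prod_array are still present (i.e. taking df from before the map/explode steps above); on recent Spark versions the zipped struct fields take the names of the input columns:

    # arrays_zip pairs the two arrays element-wise into an array of structs,
    # which can be exploded directly; no intermediate map is needed.
    zipped = df.select(F.explode(F.arrays_zip('prc_array', 'prod_array')).alias('z'))
    result = zipped.select(
        F.col('z.prc_array').alias('prc'),
        F.col('z.prod_array').alias('prod'),
    )
    result.show(truncate=False)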