Return to Answer

importing neccessary libraries.

edit approved Jul 7, 2021 at 11:02

 from pyspark.sql import functions as F df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+

df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+

 from pyspark.sql import functions as F df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+

Source Link

answered Sep 5, 2019 at 21:43

niuer

1.7k
2
12
14

For Spark version without array_zip, we can also do this:

First read the json file into a DataFrame

df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+

Next, expand the struct into array:

df = df.withColumn('prc_array', F.array(F.expr('Price.*'))) df = df.withColumn('prod_array', F.array(F.expr('Product.*')))

Then create a map between the two arrays

df = df.withColumn('prc_prod_map', F.map_from_arrays('prc_array', 'prod_array')) df.select('prc_array', 'prod_array', 'prc_prod_map').show(truncate=False) +---------------------+------------------------------------------+-----------------------------------------------------------------------+ |prc_array |prod_array |prc_prod_map | +---------------------+------------------------------------------+-----------------------------------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]|[700 -> Desktop Computer, 250 -> Tablet, 800 -> iPhone, 1200 -> Laptop]| +---------------------+------------------------------------------+-----------------------------------------------------------------------+

Finally, apply explode on the map:

df = df.select(F.explode('prc_prod_map').alias('prc', 'prod')) df.show(truncate=False) +----+----------------+ |prc |prod | +----+----------------+ |700 |Desktop Computer| |250 |Tablet | |800 |iPhone | |1200|Laptop | +----+----------------+

This way, we avoid the potentially time consuming join operation on two tables.

Collectives™ on Stack Overflow

Return to Answer