from pyspark.sql import functions as F df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+ df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+ from pyspark.sql import functions as F df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+ For Spark version without array_zip, we can also do this:
- First read the json file into a DataFrame
df=spark.read.json("your_json_file.json") df.show(truncate=False) +---------------------+------------------------------------------+ |Price |Product | +---------------------+------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]| +---------------------+------------------------------------------+ Next, expand the struct into array:
df = df.withColumn('prc_array', F.array(F.expr('Price.*'))) df = df.withColumn('prod_array', F.array(F.expr('Product.*'))) Then create a map between the two arrays
df = df.withColumn('prc_prod_map', F.map_from_arrays('prc_array', 'prod_array')) df.select('prc_array', 'prod_array', 'prc_prod_map').show(truncate=False) +---------------------+------------------------------------------+-----------------------------------------------------------------------+ |prc_array |prod_array |prc_prod_map | +---------------------+------------------------------------------+-----------------------------------------------------------------------+ |[700, 250, 800, 1200]|[Desktop Computer, Tablet, iPhone, Laptop]|[700 -> Desktop Computer, 250 -> Tablet, 800 -> iPhone, 1200 -> Laptop]| +---------------------+------------------------------------------+-----------------------------------------------------------------------+ Finally, apply explode on the map:
df = df.select(F.explode('prc_prod_map').alias('prc', 'prod')) df.show(truncate=False) +----+----------------+ |prc |prod | +----+----------------+ |700 |Desktop Computer| |250 |Tablet | |800 |iPhone | |1200|Laptop | +----+----------------+ This way, we avoid the potentially time consuming join operation on two tables.
lang-py