
I have a PySpark dataframe df with columns product_id and image. The product_id column identifies the product. The image column holds a list of dictionaries, each containing the url and default setting for an image link associated with that product. The lists of dictionaries have the form:

[{"url":"path","default":true},{"url":"path2","default":false}] 

Some product_id values in df have multiple records with different lists, for example:

+----------+----------------------------------------------------------------+
|product_id|image                                                           |
+----------+----------------------------------------------------------------+
|3         |[{"url":"path3","default":true},{"url":"path5","default":false}]|
|3         |[{"url":"path7","default":true},{"url":"path8","default":false}]|
+----------+----------------------------------------------------------------+

I would like to use PySpark or Spark SQL to combine these lists of dictionaries into a single list for each product_id that has more than one record. An example of the desired output dataframe is below:

+----------+----------------------------------------------------------------+
|product_id|image                                                           |
+----------+----------------------------------------------------------------+
|3         |[{"url":"path3","default":true},{"url":"path5","default":false},|
|          | {"url":"path7","default":true},{"url":"path8","default":false}]|
+----------+----------------------------------------------------------------+

I've tried solutions like the one below:

from pyspark.sql.functions import col, collect_set, explode

# Explode the nested list of dictionaries
exploded_df = df.select(col("product_id"), explode(col("image")).alias("image"))

# Collect unique elements using collect_set
papi_image_merged_df2 = exploded_df.groupBy("product_id").agg(collect_set("image").alias("image"))

But it produces a nested list of lists when combining records that share a product_id:

+----------+------------------------------------------------------------------+
|product_id|image                                                             |
+----------+------------------------------------------------------------------+
|3         |[[{"url":"path3","default":true},{"url":"path5","default":false}],|
|          | [{"url":"path7","default":true},{"url":"path8","default":false}]]|
+----------+------------------------------------------------------------------+

Can anyone suggest how to get the desired output shown above, that is, a single list containing just the unique dictionaries?

1 Answer

Try this:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (3, [{'url': 'path1', 'default': True}, {'url': 'path2', 'default': False}]),
    (3, [{'url': 'path3', 'default': True}, {'url': 'path4', 'default': False}])
], ['product_id', 'image'])

df = (
    df
    .groupBy('product_id')
    # Gather every list for a product into one list of lists
    .agg(f.collect_list(f.col('image')).alias('images'))
    # Flatten the list of lists into a single list
    .select('product_id', f.flatten(f.col('images')).alias('images'))
)

df.show(truncate=False)

And the output is:

+----------+--------------------------------------------------------------------------------------------------------------------------------------+
|product_id|images                                                                                                                                |
+----------+--------------------------------------------------------------------------------------------------------------------------------------+
|3         |[{default -> true, url -> path1}, {default -> false, url -> path2}, {default -> true, url -> path3}, {default -> false, url -> path4}]|
+----------+--------------------------------------------------------------------------------------------------------------------------------------+

2 Comments

Thank you for getting back to me with this suggestion. I tried it, but I get the error: "cannot resolve 'flatten(image)' due to data type mismatch: The argument should be an array of arrays, but 'image' is of array<string> type.;" Do you have a suggestion for how to fix this or cast to a different data type?
@user3476463, maybe when you copied the code you wrote flatten('image') instead of flatten('images')?
