Combine a column of lists of dicts into a list of unique dicts in PySpark

Question

I have a PySpark DataFrame df that has the columns product_id and image. The product_id column is a unique identifier for the product. The image column is a list of dictionaries containing the url and default setting for links to images associated with each product. The lists of dictionaries have the form:

[{"url":"path","default":true},{"url":"path2","default":false}]

Some of the product_id values in df have multiple records with different lists, for example:

+----------+----------------------------------------------------------------+
|product_id|image                                                           |
+----------+----------------------------------------------------------------+
|3         |[{"url":"path3","default":true},{"url":"path5","default":false}]|
|3         |[{"url":"path7","default":true},{"url":"path8","default":false}]|
+----------+----------------------------------------------------------------+

I would like to write PySpark or Spark SQL code that combines these multiple lists of dictionaries into one list for each product_id that has more than one record. An example of the desired output DataFrame is below:

+----------+-----------------------------------------------------------------+
|product_id|image                                                            |
+----------+-----------------------------------------------------------------+
|3         |[{"url":"path3","default":true},{"url":"path5","default":false}, |
|          | {"url":"path7","default":true},{"url":"path8","default":false}] |
+----------+-----------------------------------------------------------------+

I've tried solutions like the one below:

from pyspark.sql.functions import col, collect_set, explode

# Explode the nested list of dictionaries
exploded_df = df.select(col("product_id"), explode(col("image")).alias("image"))

# Collect unique elements using collect_set
papi_image_merged_df2 = exploded_df.groupBy("product_id").agg(collect_set("image").alias("image"))

But it creates a nested list of lists as output when combining the records that have the same product_id:

+----------+-------------------------------------------------------------------+
|product_id|image                                                              |
+----------+-------------------------------------------------------------------+
|3         |[[{"url":"path3","default":true},{"url":"path5","default":false}], |
|          | [{"url":"path7","default":true},{"url":"path8","default":false}]] |
+----------+-------------------------------------------------------------------+

Can anyone suggest how to get an output like the desired example above, where the result is a single list of just the unique dictionaries?

Does this answer your question? GroupBy and concat array columns pyspark
– Emma, Commented Aug 28, 2023 at 21:30

ARCrow · Accepted Answer · 2023-08-29 05:00:46Z

Try this:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (3, [{'url': 'path1', 'default': True}, {'url': 'path2', 'default': False}]),
    (3, [{'url': 'path3', 'default': True}, {'url': 'path4', 'default': False}])
], ['product_id', 'image'])

df = (
    df
    .groupBy('product_id')
    .agg(f.collect_list(f.col('image')).alias('images'))
    .select('product_id', f.flatten(f.col('images')).alias('images'))
)

df.show(truncate=False)

And the output is:

+----------+--------------------------------------------------------------------------------------------------------------------------------------+
|product_id|images                                                                                                                                |
+----------+--------------------------------------------------------------------------------------------------------------------------------------+
|3         |[{default -> true, url -> path1}, {default -> false, url -> path2}, {default -> true, url -> path3}, {default -> false, url -> path4}]|
+----------+--------------------------------------------------------------------------------------------------------------------------------------+
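Note that collect_list followed by flatten concatenates the lists but does not drop repeated dictionaries, which the question also asks for. As a rough sketch of the intended result in plain Python (not Spark; the helper name merge_unique and the third, duplicated row are hypothetical and only there for illustration), the group, flatten and de-duplicate steps look like this:

```python
from collections import defaultdict

# Sample rows modelled on the question; the third row is a hypothetical
# duplicate added to show the de-duplication step.
rows = [
    (3, [{"url": "path3", "default": True}, {"url": "path5", "default": False}]),
    (3, [{"url": "path7", "default": True}, {"url": "path8", "default": False}]),
    (3, [{"url": "path3", "default": True}]),
]


def merge_unique(rows):
    """Group by product_id, flatten the lists, keep first occurrence of each dict."""
    # Step 1: like groupBy + collect_list, this gives a list of lists per key.
    grouped = defaultdict(list)
    for product_id, images in rows:
        grouped[product_id].append(images)

    merged = {}
    for product_id, nested in grouped.items():
        # Step 2: like flatten, turn the list of lists into one list.
        flat = [img for images in nested for img in images]

        # Step 3: drop duplicates while preserving order. Dicts are not
        # hashable, so a sorted tuple of their items is used as the key.
        seen = set()
        unique = []
        for img in flat:
            key = tuple(sorted(img.items()))
            if key not in seen:
                seen.add(key)
                unique.append(img)
        merged[product_id] = unique
    return merged


print(merge_unique(rows))
```

In Spark, pyspark.sql.functions.array_distinct can be applied to the flattened array when the elements are plain strings, as with the array<string> type reported in the error message in the comments. Whether it also accepts map-typed elements like those in the example above is an assumption to check against your Spark version.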

Thank you for getting back to me with this suggestion. I tried it, but I get the error: "cannot resolve 'flatten(image)' due to data type mismatch: The argument should be an array of arrays, but 'image' is of array<string> type.;" Do you have a suggestion for how to fix this or cast to a different data type?
@user3476463, maybe when you copied the code you wrote flatten('image') instead of flatten('images')?
