I have a pyspark dataframe df that has columns product_id and image. the product_id column is a unique identifier for the product. the image columns is a list of dictionaries containing the url and default setting for links to images associated with each product. the lists of dictionaries have the form:
[{"url":"path","default":true},{"url":"path2","default":false}] some of the product_id in the dataframe df have multiple records with different lists, example:
+----------+----------------------------------------------------------------+ |product_id|image | +----------+----------------------------------------------------------------+ |3 |[{"url":"path3","default":true},{"url":"path5","default":false}]| +----------+----------------------------------------------------------------+ |3 |[{"url":"path7","default":true},{"url":"path8","default":false}]| +----------+----------------------------------------------------------------+ I would like to create pyspark or spark sql to combine these multiple lists of dictionaries into one list for product_id in the dataframe that have more than one record. An example of the desired output dataframe is below:
+----------+----------------------------------------------------------------+ |product_id|image | +----------+----------------------------------------------------------------+ |3 |[{"url":"path3","default":true},{"url":"path5","default":false}, +----------+----------------------------------------------------------------- {"url":"path7","default":true},{"url":"path8","default":false}]| ---------------------------------------------------------------+ I've tried solutions like the one below:
# Explode the nested list of dictionaries exploded_df = df.select(col("product_id"), explode(col("image")).alias("image")) # Collect unique elements using collect_set papi_image_merged_df2 = exploded_df.groupBy("product_id").agg(collect_set("image").alias("image")) but it creates a nested list of lists as output when combining the records that have the same product_id:
+----------+----------------------------------------------------------------+ |product_id|image | +----------+----------------------------------------------------------------+ |3 |[[{"url":"path3","default":true},{"url":"path5","default":false}], +----------+----------------------------------------------------------------- [{"url":"path7","default":true},{"url":"path8","default":false}]]| -----------------------------------------------------------------+ can anyone suggest how to get an output like the desired output example I showed above, where it creates a list of just the unique dictionaries?