
This is my data.

CouponNbr,ItemNbr,TypeCode,DeptNbr,MPQ
10,2,1,10,1
10,3,4,50,2
11,2,1,10,1
11,3,4,50,2
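(For reference, the sample above can be recreated as a Spark DataFrame roughly like this; a minimal sketch assuming an active SparkSession named spark:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (10, 2, 1, 10, 1),
    (10, 3, 4, 50, 2),
    (11, 2, 1, 10, 1),
    (11, 3, 4, 50, 2),
]
df = spark.createDataFrame(data, ['CouponNbr', 'ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ'])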

I want to group it in Spark so that it looks like this:

CouponNbr,ItemsInfo
10,[[2,1,10,1],[3,4,50,2]]
11,[[2,1,10,1],[3,4,50,2]]

I tried to group by and convert it to a dictionary with the following code:

df.groupby("CouponNbr").apply(lambda x: x[["ItemNbr", "TypeCode", "DeptNbr", "MPQ"]].to_dict("records"))

But this is in pandas, and it returns the following:

CouponNbr,ItemsInfo
10,[{ItemNbr:2,TypeCode:1,DeptNbr:10,MPQ:1},{ItemNbr:3,TypeCode:4,DeptNbr:50,MPQ:2}]
11,[{ItemNbr:2,TypeCode:1,DeptNbr:10,MPQ:1},{ItemNbr:3,TypeCode:4,DeptNbr:50,MPQ:2}]

Is there a way I can achieve the format I need in PySpark? Thanks.

1 Answer


You can first collect the columns into a single array column using the array function, and then do groupBy.agg with collect_list:

import pyspark.sql.functions as F

df.groupBy('CouponNbr').agg(
    F.collect_list(
        F.array('ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ')
    ).alias('ItemsInfo')
).show(2, False)

+---------+------------------------------+
|CouponNbr|ItemsInfo                     |
+---------+------------------------------+
|10       |[[2, 1, 10, 1], [3, 4, 50, 2]]|
|11       |[[2, 1, 10, 1], [3, 4, 50, 2]]|
+---------+------------------------------+
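If you then want this back on the driver as a plain Python dictionary keyed by CouponNbr (closer to what to_dict gives in pandas), you can collect the grouped DataFrame. A minimal sketch, assuming the grouped result is small enough to fit in driver memory:

grouped = df.groupBy('CouponNbr').agg(
    F.collect_list(
        F.array('ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ')
    ).alias('ItemsInfo')
)

# Build {CouponNbr: [[ItemNbr, TypeCode, DeptNbr, MPQ], ...]} on the driver
items_by_coupon = {row['CouponNbr']: row['ItemsInfo'] for row in grouped.collect()}
# items_by_coupon[10] -> [[2, 1, 10, 1], [3, 4, 50, 2]]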

1 Comment

Thanks, works like a charm. I even tried this:

def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))

and converted the dataframe to JSON before passing it as a parameter to this function. That worked as well. Thanks :)
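(For completeness, the round trip that comment describes looks roughly like this; a sketch assuming the jsonToDataFrame helper above, a SparkSession named spark with SparkContext sc, and a pandas DataFrame named pdf holding the original data. The names pdf and grouped_pdf are illustrative:)

# Group in pandas as in the question, turning each group into a list of lists
grouped_pdf = (
    pdf.groupby('CouponNbr')
       .apply(lambda x: x[['ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ']].values.tolist())
       .reset_index(name='ItemsInfo')
)

# Serialize to a JSON string of records and read it back as a Spark DataFrame
sdf = jsonToDataFrame(grouped_pdf.to_json(orient='records'))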
