
This is my data.

CouponNbr,ItemNbr,TypeCode,DeptNbr,MPQ
10,2,1,10,1
10,3,4,50,2
11,2,1,10,1
11,3,4,50,2
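(For reference, the sample above can be recreated as a Spark DataFrame roughly like this; a minimal sketch assuming an active SparkSession named spark:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (10, 2, 1, 10, 1),
    (10, 3, 4, 50, 2),
    (11, 2, 1, 10, 1),
    (11, 3, 4, 50, 2),
]
df = spark.createDataFrame(data, ['CouponNbr', 'ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ'])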

I want to group it in Spark so that it looks like this:

CouponNbr,ItemsInfo
10,[[2,1,10,1],[3,4,50,2]]
11,[[2,1,10,1],[3,4,50,2]]

I tried to group by and convert it to a dictionary with the following code:

df.groupby("CouponNbr").apply(lambda x: x[["ItemNbr", "TypeCode", "DeptNbr", "MPQ"]].to_dict("records"))

But this is in pandas, and it returns the following:

CouponNbr,ItemsInfo
10,[{ItemNbr:2,TypeCode:1,DeptNbr:10,MPQ:1},{ItemNbr:3,TypeCode:4,DeptNbr:50,MPQ:2}]
11,[{ItemNbr:2,TypeCode:1,DeptNbr:10,MPQ:1},{ItemNbr:3,TypeCode:4,DeptNbr:50,MPQ:2}]

Is there a way I can achieve the format I need in PySpark? Thanks.

1 Answer


You can first collect the columns into a single array column using the array function, and then do groupBy.agg with collect_list:

import pyspark.sql.functions as F

df.groupBy('CouponNbr').agg(
    F.collect_list(
        F.array('ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ')
    ).alias('ItemsInfo')
).show(2, False)

+---------+------------------------------+
|CouponNbr|ItemsInfo                     |
+---------+------------------------------+
|10       |[[2, 1, 10, 1], [3, 4, 50, 2]]|
|11       |[[2, 1, 10, 1], [3, 4, 50, 2]]|
+---------+------------------------------+
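If you then want this back on the driver as a plain Python dictionary keyed by CouponNbr (closer to what to_dict gives in pandas), you can collect the grouped DataFrame. A minimal sketch, assuming the grouped result is small enough to fit in driver memory:

grouped = df.groupBy('CouponNbr').agg(
    F.collect_list(
        F.array('ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ')
    ).alias('ItemsInfo')
)

# Build {CouponNbr: [[ItemNbr, TypeCode, DeptNbr, MPQ], ...]} on the driver
items_by_coupon = {row['CouponNbr']: row['ItemsInfo'] for row in grouped.collect()}
# items_by_coupon[10] -> [[2, 1, 10, 1], [3, 4, 50, 2]]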

1 Comment

Thanks, works like a charm. I even tried this:

def jsonToDataFrame(json, schema=None):
    # SparkSessions are available with Spark 2.0+
    reader = spark.read
    if schema:
        reader.schema(schema)
    return reader.json(sc.parallelize([json]))

and converted the dataframe to JSON before passing it as a parameter to this function. That worked as well. Thanks :)
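(For completeness, the round trip that comment describes looks roughly like this; a sketch assuming the jsonToDataFrame helper above, a SparkSession named spark with SparkContext sc, and a pandas DataFrame named pdf holding the original data. The names pdf and grouped_pdf are illustrative:)

# Group in pandas as in the question, turning each group into a list of lists
grouped_pdf = (
    pdf.groupby('CouponNbr')
       .apply(lambda x: x[['ItemNbr', 'TypeCode', 'DeptNbr', 'MPQ']].values.tolist())
       .reset_index(name='ItemsInfo')
)

# Serialize to a JSON string of records and read it back as a Spark DataFrame
sdf = jsonToDataFrame(grouped_pdf.to_json(orient='records'))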
