
Please bear with me if there are any mistakes, as this is my first post.

This is the dataframe df: column 'a' is a string and the rest are floats.

I have added an image of the dataframe, as the formatting got messed up when I added the data manually.

Dataframe

On the given dataframe df, I want to group by column 'a' and find the min and max of each other column, and get the output as a dictionary. So I converted the resulting PySpark dataframe to JSON and used json.loads to convert it to a dictionary.

Code snippet:

    import json
    import pyspark.sql.functions as F

    cols = ['b', 'c']
    # One struct per column, holding the group key plus that column's min/max
    req_cols = [
        F.struct(
            F.first('a').alias('a'),
            F.max(c).alias('max'),
            F.min(c).alias('min')
        ).alias(c)
        for c in cols
    ]
    df_cache = df.groupby('a').agg(*req_cols).cache()
    result = json.loads(df_cache.toJSON().collect()[0])

My output:

    {
      "b": {"max": …, "min": …, "a": "10"},
      "c": {"max": …, "min": …, "a": "10"}
    }

Required output:

    {
      "b_10": {"max": …, "min": …, "a": "10"},
      "c_10": {"max": …, "min": …, "a": "10"},
      "b_20": {"max": …, "min": …, "a": "20"},
      "c_20": {"max": …, "min": …, "a": "20"},
      "b_30": {"max": …, "min": …, "a": "30"},
      "c_30": {"max": …, "min": …, "a": "30"}
    }


1 Answer

Use pivot when grouping:

    df_cache = df.groupBy().pivot('a').agg(*req_cols).cache()

The column names will differ from your desired output, so you will need to rename them if you want an exact match.
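For reference, when pivoting with multiple aggregation aliases, Spark typically names the result columns as `<pivot value>_<alias>` (e.g. `10_b`), whereas the desired keys are of the form `b_10`. A minimal post-processing sketch in plain Python, assuming that key layout; the min/max values in the example row are made up:

```python
import json

def rename_pivot_keys(json_row: str) -> dict:
    """Convert pivoted keys like '10_b' into the 'b_10' form."""
    raw = json.loads(json_row)
    result = {}
    for key, stats in raw.items():
        # Split only on the first underscore: '10_b' -> ('10', 'b')
        a_value, col = key.split("_", 1)
        result[f"{col}_{a_value}"] = stats
    return result

# Example row, as df_cache.toJSON().collect()[0] might look (values invented):
row = ('{"10_b": {"a": "10", "max": 5.0, "min": 1.0},'
       ' "10_c": {"a": "10", "max": 9.0, "min": 2.0}}')
print(rename_pivot_keys(row))
```

This keeps the renaming on the driver side after collect(), which is fine here because the pivoted result is a single row.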


1 Comment

Thanks a lot. I implemented the same in my code and it works as expected.
