
Please bear with me if there are any mistakes, as this is my first post.

This is the dataframe df: column 'a' is a string and the rest are floats.

I have added an image of the dataframe, as the formatting got messed up when I added the data manually.

Dataframe

On the given dataframe df, I want to group by column 'a' and find the min and max of each other column, and get the output as a dictionary. So I converted the resulting PySpark dataframe to JSON and used json.loads to convert it to a dictionary.

Code snippet:

    import json
    import pyspark.sql.functions as F

    cols = ['b', 'c']
    # One struct per column, holding the group key plus that column's min/max
    req_cols = [
        F.struct(
            F.first('a').alias('a'),
            F.max(c).alias('max'),
            F.min(c).alias('min')
        ).alias(c)
        for c in cols
    ]
    df_cache = df.groupby('a').agg(*req_cols).cache()
    result = json.loads(df_cache.toJSON().collect()[0])

My output:

    {
      "b": {"max": …, "min": …, "a": "10"},
      "c": {"max": …, "min": …, "a": "10"}
    }

Required output:

    {
      "b_10": {"max": …, "min": …, "a": "10"},
      "c_10": {"max": …, "min": …, "a": "10"},
      "b_20": {"max": …, "min": …, "a": "20"},
      "c_20": {"max": …, "min": …, "a": "20"},
      "b_30": {"max": …, "min": …, "a": "30"},
      "c_30": {"max": …, "min": …, "a": "30"}
    }


1 Answer

Use pivot when grouping:

    df_cache = df.groupBy().pivot('a').agg(*req_cols).cache()

The column names will differ from your desired output, so you will need to rename them if you want an exact match.
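For reference, when pivoting with multiple aggregation aliases, Spark typically names the result columns as `<pivot value>_<alias>` (e.g. `10_b`), whereas the desired keys are of the form `b_10`. A minimal post-processing sketch in plain Python, assuming that key layout; the min/max values in the example row are made up:

```python
import json

def rename_pivot_keys(json_row: str) -> dict:
    """Convert pivoted keys like '10_b' into the 'b_10' form."""
    raw = json.loads(json_row)
    result = {}
    for key, stats in raw.items():
        # Split only on the first underscore: '10_b' -> ('10', 'b')
        a_value, col = key.split("_", 1)
        result[f"{col}_{a_value}"] = stats
    return result

# Example row, as df_cache.toJSON().collect()[0] might look (values invented):
row = ('{"10_b": {"a": "10", "max": 5.0, "min": 1.0},'
       ' "10_c": {"a": "10", "max": 9.0, "min": 2.0}}')
print(rename_pivot_keys(row))
```

This keeps the renaming on the driver side after collect(), which is fine here because the pivoted result is a single row.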


1 Comment

Thanks a lot. I implemented the same in my code and it works as expected.
