pivot in PYSPARKSQL

Question

i have a requirement to use pivot on below table.

id,week,score 102,1,96 101,1,138 102,1,37 101,1,59 101,2,282 102,2,212 102,2,78 101,2,97 102,3,60 102,3,123 101,3,220 101,3,87

output

 1 2 3 101 138,59 282,97 220,87 102 96,37 212,78 123,60

here i need to sort score

i have tried below code but it only works when there is a single record on a particular id

df.groupBy("id").pivot("week").agg(first("score"))

No shankar, Here i am expecting all records as an array..please check the expected output. — LUZO
– LUZO, Commented Dec 28, 2017 at 17:06

Prasad Khode · Accepted Answer · 2017-12-28 12:56:38Z

Instead of first you should be collecting all the values using collect_list, this will give you the result in list

import org.apache.spark.sql.functions._ df.groupBy("id").pivot("week").agg(collect_list("score")).show()

output :

+---+---------+---------+---------+ |id |1 |2 |3 | +---+---------+---------+---------+ |101|[138, 59]|[282, 97]|[220, 87]| |102|[96, 37] |[212, 78]|[60, 123]| +---+---------+---------+---------+

May i get the above output using sql query in pyspark by using temptable.

Anahcolus · Accepted Answer · 2017-12-29 06:02:41Z

Equivalent python code for pyspark for the scala answer posted by Prasad Khode is as below

from pyspark.sql import functions as F df.groupBy("id").pivot("week").agg(F.collect_list("score")).show()

If you check the api document you can see

collect_list(Column e)
Aggregate function: returns a list of objects with duplicates.

You can use collect_set as well which will give you the same output with duplicates removed.

df.groupBy("id").pivot("week").agg(F.collect_set("score")).show()

api document says the following

collect_set(Column e)
Aggregate function: returns a set of objects with duplicate elements eliminated.

Thanks for your input Ramesh, May i have a chance of getting same output using sql query in pyspark. i have used this query which shows error "select id,list((case when week = 1 then score else null end) as 1,','),list((case when week = 2 then score else null end) as 2,','),list((case when week = 3 then score else null end) as 3,',') from d group by id"
@LUZO, I am afraid I can't be a good help on that as sql queries are nightmares for me. :) you can post another question with much detail explaining your requirements. You will certainly get an answer.

Collectives™ on Stack Overflow

pivot in PYSPARKSQL

2 Answers 2

1 Comment

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

2 Comments

Linked

Related