pyspark pivot alternative with better performance

Question

Below is my input dataset :

df = spark.createDataFrame([
    ("0","CattyCat","B2K","B"),
    ("0","CattyCat","B3L","I"),
    ("0","CattyCat","B3U","I"),
    ("0","CattyCat","D3J","C"),
    ("0","CattyCat","J1N","H"),
    ("0","CattyCat","K7A","I"),
    ("0","CattyCat","L1B","D"),
    ("0","CattyCat","U3F","B"),
    ("1","CattyCat","B2K","I"),
    ("1","CattyCat","B3L","I"),
    ("1","CattyCat","B3U","I"),
    ("1","CattyCat","D3J","C"),
    ("1","CattyCat","J1N","H"),
    ("1","CattyCat","K7A","I"),
    ("1","CattyCat","L1B","D"),
    ("1","CattyCat","U3F","B"),
    ("2","CattyCat","B2K","B"),
    ("2","CattyCat","B3L","B"),
    ("2","CattyCat","B3U","I"),
    ("2","CattyCat","D3J","C"),
    ("2","CattyCat","J1N","H"),
    ("2","CattyCat","K7A","I"),
    ("2","CattyCat","L1B","D"),
    ("2","CattyCat","U3F","B"),
], ["RowCount","CatName","Name","Value"])

df.show(30)

+--------+--------+----+-----+
|RowCount| CatName|Name|Value|
+--------+--------+----+-----+
|       0|CattyCat| B2K|    B|
|       0|CattyCat| B3L|    I|
|       0|CattyCat| B3U|    I|
|       0|CattyCat| D3J|    C|
|       0|CattyCat| J1N|    H|
|       0|CattyCat| K7A|    I|
|       0|CattyCat| L1B|    D|
|       0|CattyCat| U3F|    B|
|       1|CattyCat| B2K|    I|
|       1|CattyCat| B3L|    I|
|       1|CattyCat| B3U|    I|
|       1|CattyCat| D3J|    C|
|       1|CattyCat| J1N|    H|
|       1|CattyCat| K7A|    I|
|       1|CattyCat| L1B|    D|
|       1|CattyCat| U3F|    B|
|       2|CattyCat| B2K|    B|
|       2|CattyCat| B3L|    B|
|       2|CattyCat| B3U|    I|
|       2|CattyCat| D3J|    C|
|       2|CattyCat| J1N|    H|
|       2|CattyCat| K7A|    I|
|       2|CattyCat| L1B|    D|
|       2|CattyCat| U3F|    B|
+--------+--------+----+-----+

My goal is to pivot (cross-tab) this data. I was able to achieve this using groupBy.pivot.agg as below:

import pyspark.sql.functions as F

display(df.groupBy("RowCount","CatName").pivot("Name").agg(F.first("value")))

+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| RowCount | CatName  | B2K | B3L | B3U | D3J | J1N | K7A | L1B | U3F |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| 0        | CattyCat | B   | I   | I   | C   | H   | I   | D   | B   |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| 1        | CattyCat | I   | I   | I   | C   | H   | I   | D   | B   |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+
| 2        | CattyCat | B   | B   | I   | C   | H   | I   | D   | B   |
+----------+----------+-----+-----+-----+-----+-----+-----+-----+-----+

The problem I'm facing is that when the dataset is huge (hundreds of millions of rows), performance is very poor: a single task in the last stage, on a single executor, is stuck for hours.

P.S.: I also found out that pivot can take a second parameter, a list of the values to pivot on (which become the output column names), and that this might offer better performance. Unfortunately, I cannot know these values in advance.

Is there a more performant way of doing this cross-tab?

You can check this post, which does it without pivot: stackoverflow.com/questions/61686883/… (it's in Scala)
– s.polam, Commented Nov 26, 2020 at 8:54
Thanks, will check it out. I'll have to try converting this to PySpark.
– gbzygil, Commented Nov 26, 2020 at 21:44
