How to pivot a PySpark DataFrame

Question

I have the following challenge: I have a dataframe called hashtags_users_grouped which has the following structure:

hashtag_id | user_id | count
123        | 1       | 1
245        | 1       | 3
123        | 2       | 5

Each row tells me that a certain user mentioned a certain hashtag and how many times they did so. In this example, user 1 mentioned hashtag 123 once and hashtag 245 three times, while user 2 mentioned only hashtag 123, five times.

I want to have a dataframe with the following output:

user | 123 | 245
1    | 1   | 3
2    | 5   | 0

In other words, I want the same information as in the first table, but with one column per hashtag, showing the number of times each user mentioned each hashtag. I read the documentation and tried to run the following, without success:

a = hashtags_users_joined_grouped_df.groupBy("user_id").pivot("hashtag_id")
a.show(5)

I got the following error message:

AttributeError: 'GroupedData' object has no attribute 'show'

Do you know any way to do this?

Does this answer your question? How to pivot Spark DataFrame?
– blackbishop, Commented Nov 27, 2021 at 9:12

Nithish · Accepted Answer · 2021-11-26 20:28:54Z

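After applying pivot you need to perform an aggregation: groupBy(...).pivot(...) returns a GroupedData object, which has no show method until it is aggregated back into a DataFrame. In this case first is a suitable aggregate, as the count metric has already been computed.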

from pyspark.sql import functions as F

df = spark.createDataFrame([(123, 1, 1, ), (245, 1, 3), (123, 2, 5),], ("hashtag_id", "user_id", "count", ))

df.groupBy("user_id")\
  .pivot("hashtag_id")\
  .agg(F.first("count"))\
  .show()

Output

+-------+---+----+
|user_id|123| 245|
+-------+---+----+
|      1|  1|   3|
|      2|  5|null|
+-------+---+----+
