4

I have the following challenge: I have a dataframe called hashtags_users_grouped which has the following structure:

hashtag_id | user_id | count 123 1 1 245 1 3 123 2 5 

In each row, we find values that tell me when a certain user mentioned a certain hashtag and how many times he did it. In this example, user 1 mentioned hashtag 123 one time and 245 three times, while user 2 only mentioned hashtag 123 five times.

I want to have a dataframe with the following output:

user | 123 | 245 1 1 3 2 5 0 

In other words, the same information as the first table, but with a column per hashtag, to know the amount of times a user mentioned each hashtag. I read the documentation and tried to run the following, without success:

a = hashtags_users_joined_grouped_df.groupBy("user_id").pivot("hashtag_id") a.show(5) 

I got the following error message:

AttributeError: 'GroupedData' object has no attribute 'show' 

Do you know any way to do this?

1

1 Answer 1

10

After applying pivot you need to perform an aggregate, in this case the aggregate is first as the count metric has already been computed.

from pyspark.sql import functions as F df = spark.createDataFrame([(123, 1, 1, ), (245, 1, 3), (123, 2, 5),], ("hashtag_id", "user_id", "count", )) df.groupBy("user_id")\ .pivot("hashtag_id")\ .agg(F.first("count"))\ .show() 

Output

+-------+---+----+ |user_id|123| 245| +-------+---+----+ | 1| 1| 3| | 2| 5|null| +-------+---+----+ 
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.