
I have a dataframe topic_data that contains the output of an LDA topic model:

    topic_data.head(15)

        topic                      word     score
    0       0                Automobile  0.063986
    1       0                   Vehicle  0.017457
    2       0                Horsepower  0.015675
    3       0                    Engine  0.014857
    4       0                   Bicycle  0.013919
    5       1                     Sport  0.032938
    6       1      Association_football  0.025324
    7       1                Basketball  0.020949
    8       1                  Baseball  0.016935
    9       1  National_Football_League  0.016597
    10      2                     Japan  0.051454
    11      2                      Beer  0.032839
    12      2                   Alcohol  0.027909
    13      2                     Drink  0.019494
    14      2                     Vodka  0.017908

This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:

    Topic                  0                            1  ...
    Rank
    0      Automobile (0.06)                 Sport (0.03)  ...
    1         Vehicle (0.02)  Association_football (0.03)  ...
    ...                  ...                          ...  ...

What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.

1 Answer


It would be something like this. Note that Rank has to be generated first:

    # df is the topic_data frame from the question.
    # Rank terms within each topic by descending score (0 = highest score).
    df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
    # Build the "word (score)" display string.
    df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
    # Pivot so that the rows are ranks and the columns are topic IDs.
    df2 = df.sort_values(['Rank', 'topic'])[['New_str', 'topic', 'Rank']]
    print(df2.pivot(index='Rank', columns='topic', values='New_str'))

    topic                  0                                1               2
    Rank
    0      Automobile (0.06)                     Sport (0.03)    Japan (0.05)
    1         Vehicle (0.02)      Association_football (0.03)     Beer (0.03)
    2      Horsepower (0.02)                Basketball (0.02)  Alcohol (0.03)
    3          Engine (0.01)                  Baseball (0.02)    Drink (0.02)
    4         Bicycle (0.01)  National_Football_League (0.02)    Vodka (0.02)
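
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea, written with the index-setting and unstacking route the question mentions. It rebuilds only topics 0 and 1 from the question's sample data. The helper name `rank_terms` and the `label` column are hypothetical names chosen for illustration, and numbering rows with `cumcount` after a sort is just one possible way to produce the rank.

```python
import pandas as pd

# Sample rows copied from the question (topics 0 and 1 only)
topic_data = pd.DataFrame({
    "topic": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "word": ["Automobile", "Vehicle", "Horsepower", "Engine", "Bicycle",
             "Sport", "Association_football", "Basketball", "Baseball",
             "National_Football_League"],
    "score": [0.063986, 0.017457, 0.015675, 0.014857, 0.013919,
              0.032938, 0.025324, 0.020949, 0.016935, 0.016597],
})


def rank_terms(df):
    """Hypothetical helper: return a Rank x topic table of 'word (score)' strings."""
    # Sort so the highest score comes first within each topic
    out = df.sort_values(["topic", "score"], ascending=[True, False]).copy()
    # Number the rows within each topic: 0 is the top-scoring term
    out["Rank"] = out.groupby("topic").cumcount()
    # Build the display string, e.g. "Automobile (0.06)"
    out["label"] = out["word"] + out["score"].map(" ({:.2f})".format)
    # Move Rank and topic into the index, then unstack topic into columns
    return out.set_index(["Rank", "topic"])["label"].unstack("topic")


table = rank_terms(topic_data)
print(table)
```

Because the frame is sorted explicitly before ranking, this sketch does not depend on the input rows already being ordered by score within each topic.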