8
$\begingroup$

I have categorized 800,000 documents into 500 categories using Mahout topic modelling.

Instead of representing each topic by its top 5 or 10 words, I want to infer a generic name for the group using an existing algorithm. For the time being, I have used the following procedure to arrive at a name for each topic:

For each topic:

  • Take all the documents belonging to the topic (using the document-topic distribution output).
  • Run Python NLTK to extract the noun phrases.
  • Create the TF (term frequency) file from the output.
  • The name for the topic is the most frequent phrase (limited to a maximum of 5 words).
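
A minimal sketch of the selection step in the list above, assuming the noun phrases have already been extracted (for example with an NLTK chunker) and that the chosen phrase is the one with the highest term frequency. The function name `label_topic` and the sample phrases are illustrative only, not part of my actual pipeline.

```python
from collections import Counter


def label_topic(noun_phrases, max_words=5):
    """Return the most frequent noun phrase with at most `max_words` words.

    `noun_phrases` is assumed to be the flat list of noun phrases extracted
    from every document assigned to one topic.
    """
    counts = Counter(
        phrase.lower().strip()
        for phrase in noun_phrases
        # Skip empty strings and phrases longer than the word limit.
        if 0 < len(phrase.split()) <= max_words
    )
    if not counts:
        return None
    # most_common(1) returns a one-element list of (phrase, count) pairs.
    return counts.most_common(1)[0][0]


# Hypothetical phrases for a single topic.
phrases = [
    "machine learning",
    "Machine Learning",
    "neural network",
    "a very long noun phrase with far too many words",
]
print(label_topic(phrases))
```

Raw frequency tends to favour phrases that are common across the whole corpus, which is one likely reason the resulting names are not very specific.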

Please suggest an approach to arrive at more relevant names for the topics.

$\endgroup$

4 Answers

7
$\begingroup$

I can suggest several papers on this topic:

  • Automatic Labelling of Topic Models
  • Automatic Labeling Hierarchical Topics
  • Representing Topics Labels for Exploring Digital Libraries

You can find more by looking at their citations.

$\endgroup$
1
  • $\begingroup$ Thanks, I will check the papers (in particular the first one). $\endgroup$ Commented Jan 8, 2016 at 19:14
2
$\begingroup$

If you don't want to dig too deeply into NLP for this task, I suggest generating the set of most frequent n-grams (of length 2-5) from your documents and finding the most distinctive n-grams for each category. Use TF-IDF as the measure of a particular n-gram's importance (normalizing by word count), and select the n-grams that are used in a particular category but not (or only rarely) in the others.
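
A rough sketch of that idea, under a few assumptions of mine: each category is treated as one big document, tokenization is a plain whitespace split, term frequency is normalized by the number of n-grams counted in the category, and IDF is `log(N / df)` so that an n-gram present in every category scores zero. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=2, n_max=5):
    """Yield all word n-grams of length n_min..n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def distinctive_ngrams(docs_by_category, top_k=3):
    """Return the top_k n-grams per category, ranked by TF-IDF."""
    # Term frequency of each n-gram, normalized per category.
    tf = {}
    for category, docs in docs_by_category.items():
        counts = Counter()
        for doc in docs:
            counts.update(ngrams(doc.lower().split()))
        total = sum(counts.values()) or 1
        tf[category] = {gram: count / total for gram, count in counts.items()}

    # Document frequency: in how many categories each n-gram occurs.
    df = Counter(gram for grams in tf.values() for gram in grams)
    n_categories = len(tf)

    labels = {}
    for category, grams in tf.items():
        scores = {
            gram: freq * math.log(n_categories / df[gram])
            for gram, freq in grams.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        labels[category] = ranked[:top_k]
    return labels


# Toy corpus with two categories.
docs = {
    "sports": ["the football match was great", "a great football match today"],
    "cooking": ["the pasta recipe was great", "a simple pasta recipe today"],
}
print(distinctive_ngrams(docs, top_k=1))
```

On real data you would probably also drop n-grams that start or end with a stop word before ranking.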

$\endgroup$
1
  • $\begingroup$ Thanks for the suggestion. I had initially tried n-grams (3 words) with a TF-IDF approach, but the labels generated were not that meaningful. Can you suggest any NLP approach that would be more helpful? $\endgroup$ Commented Jan 8, 2016 at 19:13
0
$\begingroup$

You might try using word vectors: average the vectors of the top N words in a topic, then use cosine similarity to find the closest word in the corpus?

Just a quick and dirty idea...
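
Something like the sketch below, assuming you already have a word-to-vector mapping (for instance loaded from pretrained word2vec or GloVe vectors). The tiny two-dimensional `embeddings` dict and the function names are made up purely for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_word(top_words, embeddings, exclude_top_words=True):
    """Return the vocabulary word closest to the centroid of `top_words`."""
    vectors = [embeddings[w] for w in top_words if w in embeddings]
    if not vectors:
        return None
    centroid = mean_vector(vectors)
    # Optionally skip the topic's own top words so the label is a new word.
    candidates = (
        w for w in embeddings if not (exclude_top_words and w in top_words)
    )
    return max(
        candidates,
        key=lambda w: cosine(centroid, embeddings[w]),
        default=None,
    )


# Hypothetical toy embeddings.
embeddings = {
    "dog": [1.0, 0.1],
    "cat": [0.9, 0.2],
    "animal": [0.95, 0.15],
    "car": [0.1, 1.0],
}
print(closest_word(["dog", "cat"], embeddings))
```

Excluding the top words themselves is a design choice; without it, the nearest word to the centroid is often just one of the inputs.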

$\endgroup$
2
  • $\begingroup$ I have tried this approach. I also added TF-IDF so that the words are unique to each topic, but the results are not that encouraging. $\endgroup$ Commented May 17, 2018 at 10:26
  • $\begingroup$ Thanks, I was thinking of trying this myself but won’t bother now. $\endgroup$ Commented May 22, 2018 at 18:59
0
$\begingroup$

A few ideas you'll often see:

  • Generate a list from Wikipedia titles, extract keyphrases, predict the related Wikipedia pages, and use the keyphrases.
  • Generate a hand-labeled dataset.
  • Use a graph populated with topics and the relations between words and topics to predict the most likely topics.
  • Abstractive summarization and keyphrase extraction.
$\endgroup$
