
I'm rather new to Word2Vec, having started working on it about a week ago.

My question is this: is there a way to obtain frequently occurring phrases from a large document using Word2Vec, along with a score that denotes each phrase's frequency?

  • Do you want the frequency of the phrase in the document? If that is the case, you can just use a dictionary counter, right? (A minimal sketch follows these comments.) Commented Aug 30, 2016 at 7:41
  • Look for 'phrase2vec' or, classically, (word-level) 'bi-grams' or 'n-grams'. Commented Oct 13, 2016 at 13:08
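To illustrate the first comment's suggestion, here is a minimal sketch of counting two-word phrases with a dictionary counter; the document text, tokenization, and phrase length are illustrative assumptions only:

from collections import Counter

text = "machine learning makes machine learning useful"  # hypothetical document
tokens = text.split()

# Count adjacent word pairs (bigrams) as candidate phrases.
bigram_counts = Counter(zip(tokens, tokens[1:]))

for (w1, w2), count in bigram_counts.most_common(5):
    print(f"{w1} {w2}: {count}")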

2 Answers


You can use the Phrases module from the gensim library in Python.

You need to provide a threshold value, which acts as a PMI-like score over word pairs: the higher the threshold, the fewer phrases are detected (the default is 10). You can experiment with this value to get good results for your data.

from gensim.models.phrases import Phrases

phrase_threshold = 1

bigram = Phrases(sentences, threshold=phrase_threshold)

This is based on the phrase-detection approach from the skip-gram paper by Tomas Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality".
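To print or save the detected phrases together with their scores, here is a minimal sketch, assuming gensim 4.x (where export_phrases() returns a dict mapping each detected phrase to its score); the corpus is a hypothetical stand-in for your own sentences:

from gensim.models.phrases import Phrases

# Hypothetical tokenized corpus: one list of tokens per sentence.
sentences = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "on", "large", "documents"],
]

bigram = Phrases(sentences, min_count=1, threshold=1)

# In gensim 4.x, export_phrases() returns {phrase: score}.
for phrase, score in bigram.export_phrases().items():
    print(phrase, score)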

  • Can you also please tell how to print/save the frequencies for all phrases in the corpus? Commented May 31, 2017 at 15:04

Choose the implementation according to your need. In this scenario, tf-idf does a better job than word2vec: tf-idf measures the importance of a word in a document by considering its frequency relative to other documents.

Words that occur frequently in one document may also occur frequently in other documents. The tf-idf method gives more weight to words that occur much more frequently in one document than in the others. For more reading, see tf-idf.
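As an illustration only (not part of the original answer), here is a minimal sketch using scikit-learn's TfidfVectorizer with an n-gram range so that two- and three-word phrases are scored per document; the corpus is hypothetical and a recent scikit-learn (with get_feature_names_out) is assumed:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: one string per document.
documents = [
    "word embeddings capture word similarity",
    "frequent phrases can be found with n-gram counts",
]

# ngram_range=(2, 3) scores two- and three-word phrases.
vectorizer = TfidfVectorizer(ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(documents)

# Highest-weighted phrases for the first document.
row = tfidf[0].toarray().ravel()
terms = vectorizer.get_feature_names_out()
for idx in np.argsort(row)[::-1][:5]:
    if row[idx] > 0:
        print(terms[idx], row[idx])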

