I'm rather new to Word2Vec, having started working on it about a week ago.
My question is this: Is there a way to obtain frequently occurring phrases in a large document using Word2Vec along with a score to denote the 'frequency'?
You can use the Phrases module from the gensim library in Python.
You need to provide a threshold value, which acts as a PMI-like score over word pairs. The higher this value, the fewer phrases are detected; the default is 10. You can play around with this value to get good results for your data.
from gensim.models.phrases import Phrases

phrase_threshold = 1
bigram = Phrases(sentences, threshold=phrase_threshold)
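As a minimal sketch of how to get the phrases together with their scores (assuming gensim 4.x, and that sentences is an iterable of token lists from your corpus):

from gensim.models.phrases import Phrases

# sentences is assumed to be an iterable of tokenized sentences,
# e.g. [["new", "york", "is", "big"], ["machine", "learning", "rocks"], ...]
bigram = Phrases(sentences, min_count=5, threshold=1)

# export_phrases() returns a dict mapping each detected phrase to its score
for phrase, score in bigram.export_phrases().items():
    print(phrase, score)

# Applying the model to a sentence joins detected phrases with an underscore
print(bigram[["new", "york", "is", "big"]])  # e.g. ['new_york', 'is', 'big']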
This is based on the phrase-detection approach described in the skip-gram paper by Tomas Mikolov.
Choose the implementation according to your needs. In this scenario, tf-idf does a much better job than word2vec. tf-idf measures the importance of a word in a document by considering its frequency relative to the other documents in the corpus.
This matters because words that occur frequently in one document may also occur frequently in other documents. The tf-idf method gives more weight to words that occur much more often in one document than in the others. For more background, look up tf-idf (term frequency-inverse document frequency).
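As a minimal sketch (this answer names no library, so using scikit-learn's TfidfVectorizer is my assumption), with ngram_range set so that two-word phrases get scored alongside single words:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with your own documents
docs = [
    "new york is a big city",
    "machine learning and deep learning",
    "new york has many machine learning jobs",
]

# ngram_range=(1, 2) scores unigrams and bigrams (candidate phrases)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)

# Print the top-weighted terms/phrases for the first document
terms = vectorizer.get_feature_names_out()
row = tfidf[0].toarray().ravel()
top = np.argsort(row)[::-1][:5]
print([(terms[i], round(row[i], 3)) for i in top])

Terms that are common in one document but rare in the rest of the corpus get the highest tf-idf weights, which is the behaviour described above.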