I'm rather new to Word2Vec, having started working on it about a week ago.
My question is this: Is there a way to obtain frequently occurring phrases in a large document using Word2Vec along with a score to denote the 'frequency'?
You can use the Phrases module from the gensim library in Python.
You need to provide a threshold value, which acts as a PMI-like score over word pairs. The higher this value, the fewer phrases are detected; the default is 10. You can play around with this value to get good results for your data.
from gensim.models.phrases import Phrases

phrase_threshold = 1
bigram = Phrases(sentences, threshold=phrase_threshold)
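As a minimal sketch of how to get the phrases together with their scores (assuming gensim 4.x, and that sentences is an iterable of token lists from your corpus):

from gensim.models.phrases import Phrases

# sentences is assumed to be an iterable of tokenized sentences,
# e.g. [["new", "york", "is", "big"], ["machine", "learning", "rocks"], ...]
bigram = Phrases(sentences, min_count=5, threshold=1)

# export_phrases() returns a dict mapping each detected phrase to its score
for phrase, score in bigram.export_phrases().items():
    print(phrase, score)

# Applying the model to a sentence joins detected phrases with an underscore
print(bigram[["new", "york", "is", "big"]])  # e.g. ['new_york', 'is', 'big']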
This is based on the phrase-detection approach described in the skip-gram paper by Tomas Mikolov.
Choose the implementation according to your needs. In this scenario, tf-idf does a much better job than word2vec. tf-idf measures the importance of a word in a document by considering its frequency relative to the other documents in the corpus.
This matters because words that occur frequently in one document may also occur frequently in other documents. The tf-idf method gives more weight to words that occur much more often in one document than in the others. For more background, look up tf-idf (term frequency-inverse document frequency).
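As a minimal sketch (this answer names no library, so using scikit-learn's TfidfVectorizer is my assumption), with ngram_range set so that two-word phrases get scored alongside single words:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with your own documents
docs = [
    "new york is a big city",
    "machine learning and deep learning",
    "new york has many machine learning jobs",
]

# ngram_range=(1, 2) scores unigrams and bigrams (candidate phrases)
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)

# Print the top-weighted terms/phrases for the first document
terms = vectorizer.get_feature_names_out()
row = tfidf[0].toarray().ravel()
top = np.argsort(row)[::-1][:5]
print([(terms[i], round(row[i], 3)) for i in top])

Terms that are common in one document but rare in the rest of the corpus get the highest tf-idf weights, which is the behaviour described above.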