I am relatively new to word2vec. I am interested in solving the topic-word intrusion task introduced here using the word vector spaces generated by word2vec together with an SVC.
I have a corpus with a vocabulary of 8000 words, all of which appear in Google's pre-trained word2vec model. I was wondering which would provide a better representation of those words: the pre-trained model with its 3M-word vocabulary, or a model trained only on the 8000 words appearing in my corpus?