
Is it possible to update the Google News Word Embedding with a custom text dataset (text data pertaining to a particular domain) ?

The Google News Word2Vec embedding gives us a robust set of word vectors, but unfortunately it cannot be used as-is for many business cases. For example:

```python
from gensim.models import KeyedVectors

# Load the pretrained vectors (assuming the standard Google News binary)
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

embeddings.most_similar('python')
```

```
[('pythons', 0.6688377857208252),
 ('Burmese_python', 0.6680365204811096),
 ('snake', 0.6606293320655823),
 ('crocodile', 0.6591362953186035),
 ('boa_constrictor', 0.6443518996238708),
 ('alligator', 0.6421656608581543),
 ('reptile', 0.6387744545936584),
 ('albino_python', 0.6158879995346069),
 ('croc', 0.6083582639694214),
 ('lizard', 0.601341724395752)]
```

This output is clearly not what we want. We could train a custom word2vec model with the gensim library for this business case, but it would not be exhaustive (the vocabulary would be comparatively small). What is best practice in such cases? Is it possible to update the weights of a pretrained word embedding model so that the embedding also learns from domain text data?


1 Answer


Transfer learning is one possible approach:

  1. Design and implement a neural network matching Google Word2Vec's architecture (number of layers, activation functions, etc.).
  2. Pre-initialize its weights with the pretrained Google News vectors.
  3. Retrain on the domain-specific corpus.

An existing Word2Vec implementation can be used as a base and modified for step 1.
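If the goal is only to continue training word vectors rather than reproduce the full network, steps 2 and 3 can be done in gensim itself. Here is a minimal sketch, assuming gensim 4.x; the file path and `domain_sentences` (an iterable of tokenized sentences) are placeholders:

```python
from gensim.models import Word2Vec, KeyedVectors

# Pretrained Google News vectors (assumed to be available locally)
pretrained = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# Step 2: build a model with matching dimensionality and seed its
# weights from the pretrained vectors wherever the vocabularies overlap.
model = Word2Vec(vector_size=300, window=5, min_count=2)
model.build_vocab(domain_sentences)
for word, idx in model.wv.key_to_index.items():
    if word in pretrained:
        model.wv.vectors[idx] = pretrained[word]

# Step 3: retrain on the domain corpus; seeded vectors drift toward
# domain usage, and purely domain-specific words are learned from scratch.
model.train(domain_sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)
```

gensim also provides `KeyedVectors.intersect_word2vec_format` (with a `lockf` argument controlling whether seeded vectors stay trainable) for the same seeding step, but its behavior has changed across gensim releases, so check the docs for your installed version.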
