
Is it possible to update the Google News Word Embedding with a custom text dataset (text data pertaining to a particular domain) ?

The Google News Word2Vec embedding gives us a robust set of word vectors, but unfortunately it cannot be used as-is for many business cases. For example:

```python
from gensim.models import KeyedVectors

# Load the pretrained vectors (assuming the standard Google News binary)
embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

embeddings.most_similar('python')
```

```
[('pythons', 0.6688377857208252),
 ('Burmese_python', 0.6680365204811096),
 ('snake', 0.6606293320655823),
 ('crocodile', 0.6591362953186035),
 ('boa_constrictor', 0.6443518996238708),
 ('alligator', 0.6421656608581543),
 ('reptile', 0.6387744545936584),
 ('albino_python', 0.6158879995346069),
 ('croc', 0.6083582639694214),
 ('lizard', 0.601341724395752)]
```

This output is clearly not what we want. We could train a custom word2vec model with the gensim library for this business case, but it would not be exhaustive (the vocabulary would be comparatively small). What is best practice in such cases? Is it possible to update the weights of a pretrained word embedding model so that the embedding also learns from domain text data?


1 Answer


Transfer learning is one possible approach:

  1. Design and implement a neural network matching Google Word2Vec's architecture (number of layers, activation functions, etc.).
  2. Pre-initialize its weights with the pretrained Google News vectors.
  3. Retrain on the domain-specific corpus.

An existing Word2Vec implementation can be used as a base and modified for step 1.
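If the goal is only to continue training word vectors rather than reproduce the full network, steps 2 and 3 can be done in gensim itself. Here is a minimal sketch, assuming gensim 4.x; the file path and `domain_sentences` (an iterable of tokenized sentences) are placeholders:

```python
from gensim.models import Word2Vec, KeyedVectors

# Pretrained Google News vectors (assumed to be available locally)
pretrained = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# Step 2: build a model with matching dimensionality and seed its
# weights from the pretrained vectors wherever the vocabularies overlap.
model = Word2Vec(vector_size=300, window=5, min_count=2)
model.build_vocab(domain_sentences)
for word, idx in model.wv.key_to_index.items():
    if word in pretrained:
        model.wv.vectors[idx] = pretrained[word]

# Step 3: retrain on the domain corpus; seeded vectors drift toward
# domain usage, and purely domain-specific words are learned from scratch.
model.train(domain_sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)
```

gensim also provides `KeyedVectors.intersect_word2vec_format` (with a `lockf` argument controlling whether seeded vectors stay trainable) for the same seeding step, but its behavior has changed across gensim releases, so check the docs for your installed version.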
