Thanks Abhishek, I've figured it out! Here are my experiments.

1) We plot a simple toy example:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model
model_1 = Word2Vec(sentences, size=300, min_count=1)

# fit a 2d PCA model to the vectors
X = model_1[model_1.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

(scatter plot: 2-D PCA projection of the toy-corpus word vectors from model_1)

From the plot above, we can see that a model trained only on these few toy sentences cannot separate the words' meanings by distance.
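
As a quick sanity check (a minimal sketch, assuming the model_1 trained above and the gensim 3.x API used in this answer), the cosine similarities on such a tiny corpus are essentially random, and the exact numbers will change from run to run:

# sanity check on model_1 (trained on the toy corpus only)
print(model_1.wv.similarity('first', 'second'))   # roughly arbitrary value
print(model_1.wv.most_similar('sentence', topn=3))  # neighbours are not semantically related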

2) Load pre-trained word embeddings:

from gensim.models import KeyedVectors

# build a new model with the vocabulary of the toy corpus
model_2 = Word2Vec(size=300, min_count=1)
model_2.build_vocab(sentences)
total_examples = model_2.corpus_count

# load the pre-trained vectors
model = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False)

# add the pre-trained vocabulary and copy in the pre-trained weights
model_2.build_vocab([list(model.vocab.keys())], update=True)
model_2.intersect_word2vec_format("glove.6B.300d.txt", binary=False, lockf=1.0)

# continue training on the toy corpus
model_2.train(sentences, total_examples=total_examples, epochs=model_2.iter)

# fit a 2d PCA model to the vectors of the toy-corpus words only
X = model_2[model_1.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model_1.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
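
One caveat, in case KeyedVectors.load_word2vec_format fails on the raw GloVe file: glove.6B.300d.txt as distributed has no word2vec header line, so depending on your gensim version you may need to convert it first (a sketch, assuming your gensim ships gensim.scripts.glove2word2vec):

# optional: convert the GloVe text file to word2vec format (adds the header line)
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.word2vec.txt")

and then point the load and intersect calls above at the converted file.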

(scatter plot: 2-D PCA projection of the same words, now seeded with the pre-trained embeddings)

From the figure above, we can see that the pre-trained word embeddings give much more meaningful distances between the words.
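For a direct comparison with step 1 (again a minimal sketch assuming the model_2 above; the neighbours now come from the merged GloVe vocabulary, so the exact list depends on the file you load):

# the same sanity check as in step 1, now with GloVe-seeded vectors
print(model_2.wv.similarity('first', 'second'))
print(model_2.wv.most_similar('sentence', topn=3))
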
I hope this answer is helpful.
