
I have used Keras with pre-trained word embeddings, but I am not quite sure how to do the same with a scikit-learn model.

I need to do this in sklearn as well because I am using vecstack to ensemble a Keras Sequential model and a sklearn model.

This is what I have done for keras model:

    import os
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Embedding

    # Parse the GloVe file into a {word: vector} dictionary
    glove_dir = '/home/Documents/Glove'
    embeddings_index = {}
    f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), 'r', encoding='utf-8')
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
    f.close()

    # Build the embedding matrix: row i holds the GloVe vector for word index i
    embedding_dim = 200
    embedding_matrix = np.zeros((max_words, embedding_dim))
    for word, i in word_index.items():
        if i < max_words:
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    # ... further layers elided in the original post ...

    # Load the pre-trained vectors into the Embedding layer and freeze it
    model.layers[0].set_weights([embedding_matrix])
    model.layers[0].trainable = False
    model.compile(...)  # arguments elided in the original post
    model.fit(...)      # arguments elided in the original post

I am very new to scikit-learn. From what I have seen, to build a model in sklearn you do:

    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr.predict(X_test)

So, my question is: how do I use pre-trained GloVe embeddings with this model? Where do I pass the pre-trained GloVe embedding_matrix?

Thank you very much and I really appreciate your help.

  • Please describe what model you want to build in sklearn, ideally with a formula and/or a descriptive diagram. Commented Mar 16, 2019 at 16:20
  • Hello, I just want a logistic regression model with pre-trained word embeddings, taking the average of the word embedding vectors (see the sketch after these comments). Commented Mar 16, 2019 at 16:30
  • The input is an Amazon review. Since it's a review (text), word embeddings play a huge role, right? Commented Mar 16, 2019 at 16:44
  • So you want to input... a bag-of-words representation of some text, i.e. a fixed-length vector of counts of individual words in the text? Commented Mar 16, 2019 at 17:03
  • Well, yes and no. I have used Tokenizer to vectorize and convert text into sequences so it can be used as an input. Instead of bag of words I want word embeddings, because I think the bag-of-words approach is very domain-specific and I also want to work cross-domain. Commented Mar 16, 2019 at 17:13
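To make the averaging approach from these comments concrete, here is a minimal sketch: each review becomes the mean of its GloVe word vectors, and that fixed-length feature vector goes into LogisticRegression. It reuses embeddings_index from the question; texts_train, texts_test, and y_train are illustrative names, not from the original post.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def average_glove(texts, embeddings_index, dim=200):
        # One row per text: the mean of the GloVe vectors of its known words
        features = np.zeros((len(texts), dim), dtype='float32')
        for row, text in enumerate(texts):
            vectors = [embeddings_index[w] for w in text.lower().split()
                       if w in embeddings_index]
            if vectors:  # texts with no known words keep the zero vector
                features[row] = np.mean(vectors, axis=0)
        return features

    X_train = average_glove(texts_train, embeddings_index)
    X_test = average_glove(texts_test, embeddings_index)

    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    lr.predict(X_test)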

1 Answer


You can simply use the Zeugma library.

You can install it with pip install zeugma, then create and train your model with the following lines of code (assuming corpus_train and corpus_test are lists of strings):

    from sklearn.linear_model import LogisticRegression
    from zeugma.embeddings import EmbeddingTransformer

    # Turn each document into a fixed-length GloVe-based feature vector
    glove = EmbeddingTransformer('glove')
    x_train = glove.transform(corpus_train)

    model = LogisticRegression()
    model.fit(x_train, y_train)

    x_test = glove.transform(corpus_test)
    model.predict(x_test)

You can also use different pre-trained embeddings (a complete list is in Zeugma's documentation) or train your own (again, see Zeugma's documentation for how to do this).
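Since EmbeddingTransformer follows the scikit-learn transformer API (fit/transform), you can also drop it into a Pipeline so the embedding step and the classifier travel together. A minimal sketch, assuming the same corpus_train, corpus_test, and y_train as above (the step names are illustrative):

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from zeugma.embeddings import EmbeddingTransformer

    # Embed each document, then classify the resulting feature vectors
    pipeline = Pipeline([
        ('embedding', EmbeddingTransformer('glove')),
        ('clf', LogisticRegression()),
    ])
    pipeline.fit(corpus_train, y_train)
    pipeline.predict(corpus_test)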


5 Comments

This code no longer works with Gensim 4.0.0 or higher.
As of today, Zeugma supports Gensim 4.0+. Just upgrade to the latest version (0.49+) with pip install -U zeugma
Yeah, I saw; I'm upgrading it at this moment.
Is there any alternative to Zeugma? It seems to me it's not supported anymore :/
Hey @DanielWiczew, I'm not aware of alternatives, but Zeugma is still maintained; there just haven't been commits recently because none were needed. Let me know if you experience issues with it.
