$\begingroup$

I want to try the skip-gram approach on my dataset, but I do not know how to vectorize it. For example, I have a cleaned document and have extracted its skip-grams. How do I now vectorize it so that I can use it for classification? I use sklearn for all of the above.

$\endgroup$

1 Answer

$\begingroup$

The answer to your question can be found at: https://stackoverflow.com/a/45997893/5312422

To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work: the default analyzer only emits contiguous n-grams, so the skip-gram entries in the vocabulary would never be matched. You need to change how tokens are produced, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams:

    from toolz import compose
    from toolz.curried import map as cmap, sliding_window, pluck
    from sklearn.feature_extraction.text import CountVectorizer


    class SkipGramVectorizer(CountVectorizer):
        def build_analyzer(self):
            # Reuse CountVectorizer's own preprocessing and tokenization.
            preprocess = self.build_preprocessor()
            stop_words = self.get_stop_words()
            tokenize = self.build_tokenizer()
            return lambda doc: self._word_skip_grams(
                compose(tokenize, preprocess, self.decode)(doc),
                stop_words)

        def _word_skip_grams(self, tokens, stop_words=None):
            # handle stop words
            if stop_words is not None:
                tokens = [w for w in tokens if w not in stop_words]

            # Slide a 3-token window, keep the 1st and 3rd token, join with a space.
            return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)

For instance, on this Wikipedia example,

    text = ['the rain in Spain falls mainly on the plain']
    vect = SkipGramVectorizer()
    vect.fit(text)
    # On scikit-learn older than 1.0, call vect.get_feature_names() instead.
    list(vect.get_feature_names_out())

this vectorizer would yield the following tokens (only the pairs that skip exactly one word; adjacent bigrams are not included),

    ['falls on', 'in falls', 'mainly the', 'on plain', 'rain spain', 'spain mainly', 'the in']
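
If you would rather not depend on toolz or subclass anything, the same idea can be sketched as a plain function. This is only an illustration under a few assumptions: the name `skip_bigram_analyzer` and its `skip` and `include_smaller_skips` parameters are made up for this example, and the tokenization simply mimics CountVectorizer's default (lowercasing plus the `\b\w\w+\b` token pattern).

```python
import re

# Mimics CountVectorizer's default token pattern: words of 2+ characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def skip_bigram_analyzer(doc, skip=1, include_smaller_skips=False):
    """Return 'w1 w2' strings for word pairs separated by `skip` tokens.

    Hypothetical helper, not part of scikit-learn. With
    include_smaller_skips=True, pairs with fewer skipped tokens
    (including plain adjacent bigrams) are returned as well.
    """
    tokens = TOKEN_RE.findall(doc.lower())
    gaps = range(skip + 1) if include_smaller_skips else [skip]
    grams = []
    for i, left in enumerate(tokens):
        for gap in gaps:
            j = i + gap + 1  # index of the right-hand word
            if j < len(tokens):
                grams.append(f"{left} {tokens[j]}")
    return grams


text = "the rain in Spain falls mainly on the plain"
print(sorted(skip_bigram_analyzer(text)))
```

Since CountVectorizer accepts a callable as `analyzer`, something like `CountVectorizer(analyzer=skip_bigram_analyzer)` should produce the same features as the subclass above. Keep in mind that with a callable analyzer the vectorizer's own `lowercase`, `stop_words` and `token_pattern` settings are not applied, so any such handling has to live inside the function.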
$\endgroup$
