I want to try the skip-grams approach on my dataset, but I do not know how to vectorize it. For example, I have my cleaned document for which I got its skip-grams. Now, how do I vectorize it so that I can use it further for classification? I use sklearn for all the above purposes.
1 Answer
The answer to your question can be found at: https://stackoverflow.com/a/45997893/5312422
To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work. You need to modify the way tokens are processed, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams (it relies on the third-party toolz library),
from toolz import compose
from toolz.curried import map as cmap, sliding_window, pluck
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):
    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        return lambda doc: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(doc),
            stop_words)

    def _word_skip_grams(self, tokens, stop_words=None):
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]
        # slide a window of 3 tokens over the text and keep the 1st and 3rd
        # token of each window, i.e. skip one word in between (1-skip-2-grams)
        return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)

For instance, on this Wikipedia example,
text = ['the rain in Spain falls mainly on the plain']
vect = SkipGramVectorizer()
vect.fit(text)
vect.get_feature_names()

the vectorizer would yield the following tokens (on newer scikit-learn versions, use get_feature_names_out() instead),
['falls on', 'in falls', 'mainly the', 'on plain', 'rain spain', 'spain mainly', 'the in']
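To use these features for classification, the SkipGramVectorizer can be dropped into a regular scikit-learn pipeline like any other CountVectorizer. Below is a minimal sketch, assuming you have your own lists of cleaned documents and labels; the texts and labels shown here are just placeholders,

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Placeholder data; substitute your own cleaned documents and labels.
texts = ['the rain in Spain falls mainly on the plain',
         'the quick brown fox jumps over the lazy dog']
labels = [0, 1]

# The skip-gram counts behave like ordinary CountVectorizer features,
# so they can feed straight into a classifier.
clf = Pipeline([
    ('vect', SkipGramVectorizer()),
    ('model', LogisticRegression()),
])
clf.fit(texts, labels)
clf.predict(texts)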