
I'm new to data science and NLP. I'm trying to solve a problem with 1 million rows and some 50,000 distinct classes. The dataset has a text column as the predictor and another column with the multilabel responses. I have been using TF-IDF to represent the text field and MultiLabelBinarizer to transform the labels, but MultiLabelBinarizer raises a MemoryError.
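For reference, this is roughly what my current setup looks like (the column names, file name, and max_features value are placeholders):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("data.csv")                       # ~1 million rows

# TF-IDF representation of the text predictor
X = TfidfVectorizer(max_features=50000).fit_transform(df["text"])

# Binarize the multilabel column (each cell holds a list of labels)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(df["labels"])                # raises MemoryError (dense ~1M x 50k matrix)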

I also can't pass the legacy multi-label representation (a sequence of sequences), since it is no longer supported in scikit-learn. So, what should my approach be?

Any help is appreciated. Thanks in advance.

  • You may opt to use TF or PyTorch and pass your data in smaller batches (to avoid the MemoryError). Also, how well distributed is your dataset? Commented Nov 24, 2018 at 8:55

1 Answer


As you already know, the main culprit here is the large number of classes (50k). Every sample gets a label vector of size 50k; even if a sample belongs to only a single class, its label vector is still 50k long.

For example, sample1 is labeled A and B, while sample2 is labeled A. Sample3, on the other hand, is labeled C, and sample4 is labeled A. We can visualize this as:

sample1 - [1, 1, 0, 0, 0, 0, ..., 0]
sample2 - [1, 0, 0, 0, 0, 0, ..., 0]
sample3 - [0, 0, 1, 0, 0, 0, ..., 0]
sample4 - [1, 0, 0, 0, 0, 0, ..., 0]
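A quick back-of-the-envelope calculation shows why this blows up memory (assuming just one byte per entry in a dense indicator matrix):

n_samples = 1_000_000
n_classes = 50_000

# dense label matrix, 1 byte per entry (e.g. numpy uint8)
bytes_needed = n_samples * n_classes
print(bytes_needed / 1e9)   # ~50 GB, before the TF-IDF features are even considered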

One solution here is to use a Siamese neural network, which takes two inputs at a time. Instead of training the network to predict the classes directly, we train it to learn the similarity between two samples. We first select random pairs and label a pair 1 if the two samples belong to the same class, and 0 otherwise.

sample1 & sample2 - 1
sample1 & sample3 - 0
sample2 & sample4 - 1
sample3 & sample4 - 0
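A minimal sketch of this pair construction (here labels is a list of label sets, and a pair is counted as positive when the two samples share at least one class; that is one simple policy, and the alternative of duplicating ambiguous pairs is discussed below):

import random

def make_pairs(labels, n_pairs):
    # labels: list of sets, e.g. [{"A", "B"}, {"A"}, {"C"}, {"A"}]
    pairs, targets = [], []
    n = len(labels)
    for _ in range(n_pairs):
        i, j = random.randrange(n), random.randrange(n)
        pairs.append((i, j))
        # 1 if the two samples share at least one class, else 0
        targets.append(1 if labels[i] & labels[j] else 0)
    return pairs, targets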

This way, even if there are many classes, you are only training with two labels, 1 or 0, which avoids the memory error.

The remaining problem is that yours is a multi-label task. For instance, sample1 and sample2 share class A, but sample1 also belongs to class B. My suggestion, though I'm not totally sure here, is to add multiple instances of that pair.

sample1 & sample2 - 1
sample1 & sample2 - 0
sample1 & sample3 - 0
sample2 & sample4 - 1
sample3 & sample4 - 0

As for the large number of training samples (1 million), this can be handled by using small batches during training.
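For concreteness, here is a rough Keras sketch of such a Siamese setup (the layer sizes and n_features are assumptions, not values from the question); the input pairs and 0/1 targets come from a construction like the one sketched above, and the batch_size argument of fit keeps memory usage bounded:

import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 20000                      # assumed TF-IDF dimensionality

def make_encoder():
    inp = layers.Input(shape=(n_features,))
    x = layers.Dense(256, activation="relu")(inp)
    x = layers.Dense(64, activation="relu")(x)
    return Model(inp, x)

encoder = make_encoder()                # shared weights for both branches

left = layers.Input(shape=(n_features,))
right = layers.Input(shape=(n_features,))
emb_left, emb_right = encoder(left), encoder(right)

# element-wise distance between the two embeddings, then a "same class?" score
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_left, emb_right])
out = layers.Dense(1, activation="sigmoid")(diff)

siamese = Model([left, right], out)
siamese.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_left, X_right: dense feature arrays for the two sides of each pair
# y: 1 if the pair is positive, else 0
# siamese.fit([X_left, X_right], y, batch_size=128, epochs=5)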

  • Thanks for the wonderful answer. I'll try it out. Keeping the question open for some more time to accept new answers. Commented Nov 26, 2018 at 2:52
  • One more thing: I managed to get the labels compressed using the sparse matrix representation of MultiLabelBinarizer, but training with OneVsRest is again taking a long time. Is that acceptable, or do I need to think of another way, like you pointed out? Commented Nov 26, 2018 at 2:54
  • Nice idea. If 50k classes is too many, a good approach is to use a decomposition algorithm such as PCA to reduce the dimensionality, say to about 50-100. Then, instead of a classification task, you can take a regression approach (see the sketch below). Commented Nov 28, 2018 at 12:14
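Following up on the two comments above, a minimal sketch of what the sparse-label plus dimensionality-reduction route could look like; note that scikit-learn's PCA does not accept sparse input, so TruncatedSVD is used here instead, and the component count (64) is only an assumption:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import TruncatedSVD

# Sparse indicator matrix instead of a dense 1M x 50k array
mlb = MultiLabelBinarizer(sparse_output=True)
Y_sparse = mlb.fit_transform(label_lists)      # label_lists: one list of labels per sample

# Compress the 50k-dimensional label space to a small number of components
svd = TruncatedSVD(n_components=64)
Y_reduced = svd.fit_transform(Y_sparse)        # shape: (n_samples, 64)

# A regressor trained to predict Y_reduced from the TF-IDF features can then be
# inverted with svd.inverse_transform(...) and thresholded to recover label sets.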
