
I'm new to data science and NLP. I'm trying to solve a problem with 1 million rows and some 50,000 distinct classes. The dataset has a text column as the predictor and another column with the multilabel responses. I have been using TF-IDF to represent the text field and MultiLabelBinarizer to transform the labels, but MultiLabelBinarizer raises a MemoryError.
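For reference, this is roughly what my current setup looks like (the column names, file name, and max_features value are placeholders):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("data.csv")                       # ~1 million rows

# TF-IDF representation of the text predictor
X = TfidfVectorizer(max_features=50000).fit_transform(df["text"])

# Binarize the multilabel column (each cell holds a list of labels)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(df["labels"])                # raises MemoryError (dense ~1M x 50k matrix)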

I also can't pass the legacy multi-label representation (a sequence of sequences), since it is no longer supported in scikit-learn. So, what should my approach be?

Any help is appreciated. Thanks in advance.

  • You may opt to use TF or PyTorch and pass your data in smaller batches (to avoid the MemoryError). Also, how well distributed is your dataset? Commented Nov 24, 2018 at 8:55

1 Answer


As you already know, the main culprit here is the large number of classes (50k). Every sample gets a label vector of size 50k; even if a sample belongs to only a single class, its label vector is still 50k long.

For example, sample1 is labeled A and B, while sample2 is labeled A. Sample3, on the other hand, is labeled C, and sample4 is labeled A. We can visualize this as:

sample1 - [1, 1, 0, 0, 0, 0, ..., 0]
sample2 - [1, 0, 0, 0, 0, 0, ..., 0]
sample3 - [0, 0, 1, 0, 0, 0, ..., 0]
sample4 - [1, 0, 0, 0, 0, 0, ..., 0]
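A quick back-of-the-envelope calculation shows why this blows up memory (assuming just one byte per entry in a dense indicator matrix):

n_samples = 1_000_000
n_classes = 50_000

# dense label matrix, 1 byte per entry (e.g. numpy uint8)
bytes_needed = n_samples * n_classes
print(bytes_needed / 1e9)   # ~50 GB, before the TF-IDF features are even considered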

One solution here is to use a Siamese neural network, which takes two inputs at a time. Instead of training the network to predict the classes directly, we train it to learn the similarity between two samples. We first select random pairs and label a pair 1 if the two samples belong to the same class, and 0 otherwise.

sample1 & sample2 - 1
sample1 & sample3 - 0
sample2 & sample4 - 1
sample3 & sample4 - 0
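A minimal sketch of this pair construction (here labels is a list of label sets, and a pair is counted as positive when the two samples share at least one class; that is one simple policy, and the alternative of duplicating ambiguous pairs is discussed below):

import random

def make_pairs(labels, n_pairs):
    # labels: list of sets, e.g. [{"A", "B"}, {"A"}, {"C"}, {"A"}]
    pairs, targets = [], []
    n = len(labels)
    for _ in range(n_pairs):
        i, j = random.randrange(n), random.randrange(n)
        pairs.append((i, j))
        # 1 if the two samples share at least one class, else 0
        targets.append(1 if labels[i] & labels[j] else 0)
    return pairs, targets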

This way, even if there are many classes, you are only training with two labels, 1 or 0, which avoids the memory error.

The remaining problem is that yours is a multi-label task. For instance, sample1 and sample2 share class A, but sample1 also belongs to class B. My suggestion, though I'm not totally sure here, is to add multiple instances of that pair.

sample1 & sample2 - 1
sample1 & sample2 - 0
sample1 & sample3 - 0
sample2 & sample4 - 1
sample3 & sample4 - 0

As for the large number of training samples (1 million), this can be handled by using small batches during training.
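For concreteness, here is a rough Keras sketch of such a Siamese setup (the layer sizes and n_features are assumptions, not values from the question); the input pairs and 0/1 targets come from a construction like the one sketched above, and the batch_size argument of fit keeps memory usage bounded:

import tensorflow as tf
from tensorflow.keras import layers, Model

n_features = 20000                      # assumed TF-IDF dimensionality

def make_encoder():
    inp = layers.Input(shape=(n_features,))
    x = layers.Dense(256, activation="relu")(inp)
    x = layers.Dense(64, activation="relu")(x)
    return Model(inp, x)

encoder = make_encoder()                # shared weights for both branches

left = layers.Input(shape=(n_features,))
right = layers.Input(shape=(n_features,))
emb_left, emb_right = encoder(left), encoder(right)

# element-wise distance between the two embeddings, then a "same class?" score
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_left, emb_right])
out = layers.Dense(1, activation="sigmoid")(diff)

siamese = Model([left, right], out)
siamese.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_left, X_right: dense feature arrays for the two sides of each pair
# y: 1 if the pair is positive, else 0
# siamese.fit([X_left, X_right], y, batch_size=128, epochs=5)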

  • Thanks for the wonderful answer. I'll try it out. Keeping the question open for some more time to accept new answers. Commented Nov 26, 2018 at 2:52
  • One more thing: I managed to get the labels compressed using the sparse matrix representation of MultiLabelBinarizer, but training with OneVsRest is again taking a long time. Is that acceptable, or do I need to think of another way, like you pointed out? Commented Nov 26, 2018 at 2:54
  • Nice idea. If 50k classes is too many, a good approach is to use a decomposition algorithm such as PCA to reduce the dimensionality, say to about 50-100. Then, instead of a classification task, you can take a regression approach (see the sketch below). Commented Nov 28, 2018 at 12:14
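Following up on the two comments above, a minimal sketch of what the sparse-label plus dimensionality-reduction route could look like; note that scikit-learn's PCA does not accept sparse input, so TruncatedSVD is used here instead, and the component count (64) is only an assumption:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import TruncatedSVD

# Sparse indicator matrix instead of a dense 1M x 50k array
mlb = MultiLabelBinarizer(sparse_output=True)
Y_sparse = mlb.fit_transform(label_lists)      # label_lists: one list of labels per sample

# Compress the 50k-dimensional label space to a small number of components
svd = TruncatedSVD(n_components=64)
Y_reduced = svd.fit_transform(Y_sparse)        # shape: (n_samples, 64)

# A regressor trained to predict Y_reduced from the TF-IDF features can then be
# inverted with svd.inverse_transform(...) and thresholded to recover label sets.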
