There is one behavior of LabelBinarizer that I don't understand:

import numpy as np
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))
lb.classes_

The output is array([0, 1, 2]). Why is there a 2 there?
I think the documentation is fairly self-explanatory here. fit takes either an array of shape (n_samples,), in which each element is the class of that sample, or, if a sample can belong to multiple classes, a binary indicator matrix of shape (n_samples, n_classes). The latter is what you passed in your example: two samples, each of which can belong to any of three classes. That is why classes_ comes out as [0, 1, 2]. So, as mentioned in the documentation, if you try
>>> lb.transform([0, 1, 2, 0])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

and if you try a class that did not exist at fit time, like

>>> lb.transform([0, 1, 2, 1000])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]])

No class named 1000 exists, so its row in the indicator matrix is simply all zeros: [0, 0, 0]. Hope this helps.
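To make the two accepted input shapes concrete, here is a minimal sketch (only the public LabelBinarizer API is used; the label values 2, 5, 9 are made up for illustration):

import numpy as np
from sklearn import preprocessing

# 1-D input: each element is the single class label of one sample,
# so classes_ becomes the sorted unique labels themselves.
lb1 = preprocessing.LabelBinarizer()
lb1.fit([2, 5, 5, 9])
print(lb1.classes_)   # [2 5 9]

# 2-D binary input: a multilabel indicator matrix of shape
# (n_samples, n_classes); the columns are taken to be classes
# 0 .. n_classes-1, which is where the 2 in the question comes from.
lb2 = preprocessing.LabelBinarizer()
lb2.fit(np.array([[0, 1, 1], [1, 0, 0]]))
print(lb2.classes_)   # [0 1 2]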
Because in lb.fit you fed in a 2-by-3 array, which means 2 samples, each of which can belong to any of 3 classes. Therefore, you get classes 0, 1, 2 here. See:
         class0  class1  class2
sample1       0       1       1
sample2       1       0       0

However, I think LabelBinarizer has one characteristic very unlike other encoders. Note that we usually put the raw form of the labels into an encoder's fit() method; for example:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])

and we expect encoder.transform() to yield the required format for new raw labels, i.e.,
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)

But for LabelBinarizer, I think what we put into lb.fit() is actually already the required coding format, and the true raw labels would look like [[1,2], 0], which does not seem to be a format sklearn can handle, since the dimension varies from sample to sample. Here is the paradox: in the Python documentation, we see an example like this:
>>> lb.transform([0, 1, 2, 1])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0]])

All samples in [0, 1, 2, 1] are uniquely labeled, and if you try to use lb.transform([[1,2], 2]) to indicate that the first sample has multiple labels, you get an error. That is, your raw labels have to be in exactly the same format as the output of lb.transform.
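As a side note on that paradox: if your raw labels really are variable-length lists per sample, sklearn's MultiLabelBinarizer is the encoder that accepts that format directly. A minimal sketch (note that each sample must be an iterable of labels, e.g. [0] rather than a bare 0):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Each sample is an iterable of its labels, so lengths may differ.
print(mlb.fit_transform([[1, 2], [0]]))
# [[0 1 1]
#  [1 0 0]]
print(mlb.classes_)   # [0 1 2]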