Combining classifiers by flipping a coin

Question

I am studying a machine learning course, and the lecture slides contain information that I find contradicts the recommended book.

The problem is the following: there are three classifiers:

classifier A providing better performance in the lower range of the thresholds,
classifier B providing better performance in the higher range of the thresholds,
classifier C, which we get by flipping a p-coin and selecting between the two classifiers.

What will be the performance of classifier C, as viewed on a ROC curve?

The lecture slides state that just by flipping this coin, we are going to get the magical "convex hull" of classifier A's and B's ROC curves.

I don't understand this point. Just by simply flipping a coin, how can we gain information?

The lecture slide

[Image: lecture slides]

What the book says

The recommended book (Data Mining... by Ian H. Witten, Eibe Frank and Mark A. Hall) on the other hand states that:

To see this, choose a particular probability cutoff for method A that gives true and false positive rates of $t_A$ and $f_A$, respectively, and another cutoff for method B that gives $t_B$ and $f_B$. If you use these two schemes at random with probabilities $p$ and $q$, where $p + q = 1$, then you will get true and false positive rates of $p \cdot t_A + q \cdot t_B$ and $p \cdot f_A + q \cdot f_B$. This represents a point lying on the straight line joining the points $(t_A, f_A)$ and $(t_B, f_B)$, and by varying $p$ and $q$ you can trace out the whole line between these two points.
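The arithmetic in this excerpt can be checked with a minimal sketch. The function name `mix_operating_points` and the numeric rates are illustrative assumptions, not values from the book or the slides; points are written here as (false positive rate, true positive rate) pairs.

```python
def mix_operating_points(point_a, point_b, p):
    """Return (FPR, TPR) of a rule that uses A with probability p, else B.

    point_a and point_b are (false positive rate, true positive rate)
    pairs, each for one fixed cutoff of the corresponding method.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    fpr = p * point_a[0] + q * point_b[0]
    tpr = p * point_a[1] + q * point_b[1]
    return fpr, tpr


# Hypothetical operating points, chosen only for illustration.
point_a = (0.1, 0.6)  # conservative cutoff: few false positives
point_b = (0.5, 0.9)  # eager cutoff: many true positives

# Varying p from 1 to 0 walks along the segment from point_a to point_b.
for p in (1.0, 0.75, 0.5, 0.25, 0.0):
    fpr, tpr = mix_operating_points(point_a, point_b, p)
    print(f"p={p:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Every printed point lies on the straight segment between the two chosen operating points, which is all the excerpt claims.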

In my understanding, what the book says is that to actually gain information and reach the convex hull we need to do something more advanced than simply flipping a p-coin.

AFAIK, the correct way (as suggested by the book) is the following:

we should find an optimal threshold Oa for classifier A
we should find an optimal threshold Ob for classifier B
define C as follows:
- If t < Oa, use classifier A with t
- If t > Ob, use classifier B with t
- If Oa < t < Ob, pick between classifier A with Oa and B with Ob, with a probability that varies linearly according to where t lies between Oa and Ob.

Is this correct? If yes, there are a few key differences compared to what the slides suggest.

It's not a simple coin flipping, but a more advanced algorithm which needs manually defined points and picks based on what region we fall into.
It never uses classifier A and B with threshold values between Oa and Ob.

Can you explain this problem to me, and the correct way to understand it if my understanding is wrong?

What would happen if we simply flipped a p-coin, as the slides suggest? I would think that we'd get a ROC curve that is between A and B, but never "better" than the better one at a given point.

As far as I can see, I really don't understand how the slides could be correct. The probabilistic calculation on the left hand side doesn't make sense to me.

Update: Found the article written by the original author who invented the convex hull method: http://www.bmva.org/bmvc/1998/pdf/p082.pdf

From my reading of both the slide you posted and the book excerpt, they seem to be describing the exact same thing, and the slides are not in error.
– cardinal, Commented May 15, 2012 at 18:52
Note that it is also not too difficult to construct a simulation to convince yourself of the fact stated in the slide. The only difficulty you may have is constructing two ROC curves that look roughly like that, but it's manageable, say, using a Gaussian mixture model to generate the observations and some suboptimal decision rules.
– cardinal, Commented May 15, 2012 at 18:58

leonbloy · Accepted Answer · 2012-05-15 19:29:09Z

(Edited)

The lecture slides are right.

Method A has an "optimal point" that gives particular true and false positive rates (TPA and FPA in the graph, respectively). This point corresponds to a threshold or, more generally[*], an optimal decision boundary for A. The same goes for B. (The thresholds and boundaries of the two methods are unrelated.)

The graph shows that classifier A performs well under the preference "minimize false positives" (conservative strategy) and classifier B when we want to "maximize true positives" (eager strategy).

~~The answer to your first question is basically yes, except that the probability of the coin is (in some sense) arbitrary. The final classifier would be:~~

- If $x$ belongs to the "optimal acceptance region for A" (conservative), use classifier A (i.e., accept it).
- If $x$ belongs to the "optimal rejection region for B" (eager), use classifier B (i.e., reject it).
- Elsewhere, flip a coin with probability $p$ and use classifier A or B accordingly.

(Corrected: actually, the lecture slides are completely right; we can just flip the coin in every case. See the diagrams.)

You can use any fixed $p$ in the range (0,1). The choice depends on whether you want to be more or less conservative, i.e., whether you want to land nearer to one of the points or in the middle.

[*] You should be general here: if you think in terms of a single scalar threshold, all this makes little sense; a one-dimensional feature with a threshold-based classifier does not give you enough degrees of freedom to have different classifiers such as A and B that perform along different curves as the free parameter (decision boundary = threshold) varies. In other words, A and B are called "methods" or "systems", not "classifiers", because each is a whole family of classifiers, parametrized by some (scalar) parameter that determines a decision boundary rather than just a scalar threshold.

I added some diagrams to make this clearer:

[Diagram: samples in a two-dimensional feature space with the decision boundaries of method A (orange) and method B (blue)]

Suppose a two-dimensional feature. The diagram displays some samples: the green points are the "good" ones, the red the "bad" ones. Suppose that method A has a tunable parameter $t$ (threshold, offset, bias); higher values of $t$ make the classifier more eager to accept ('Yes'). The orange lines correspond to the decision boundary of this method for different values of $t$. The diagram shows that this method (actually a family of classifiers) performs particularly well at $t_A=2$, in the sense that it has very few false positives for a moderate number of true positives. By contrast, method B (blue), which has its own tunable parameter $t$ (unrelated to that of A), performs particularly well at $t_B=4$, in the region of high acceptance: the solid blue line attains a high true positive rate.

In this scenario, one can say that the solid orange line is the "optimal A classifier" (within its family), and the same for B. But one cannot tell whether the orange line is better than the blue line: one performs better when we assign a high cost to false positives, the other when false negatives are much more costly.

[Diagram: ROC plane showing classifier A (orange dot) and classifier B (blue dot)]

Now, it might happen that these two classifiers are too extreme for our needs; we'd like both types of errors to have similar weights. Instead of using classifier A (orange dot) or B (blue dot), we'd prefer to attain a performance that lies in between them. As the course says, one can attain that result by just flipping a coin and choosing one of the classifiers at random.

Just by simply flipping a coin, how can we gain information?

We don't gain information. Our new randomized classifier is not simply "better" than A or B; its performance is a sort of average of A and B with respect to the costs assigned to each type of error. That may or may not be beneficial to us, depending on our costs.

AFAIK, the correct way (as suggested by the book) is the following ... Is this correct?

Not really. The correct way is simply: flip a coin with probability $p$, choose a classifier (the optimal A or the optimal B) and classify using that classifier.
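A rough simulation of this recipe, under invented assumptions: the Gaussian data, the two fixed decision rules `classifier_a` and `classifier_b`, and the value of `p` are all hypothetical stand-ins for "the optimal A" and "the optimal B", not the classifiers from the slides.

```python
import random


def classifier_a(x):
    # Hypothetical conservative rule: accepts only when feature 0 is large.
    return x[0] > 1.5


def classifier_b(x):
    # Hypothetical eager rule: accepts unless feature 1 is quite small.
    return x[1] > -0.5


def randomized_c(x, p, rng):
    # Flip a p-coin for each sample: use A with probability p, else B.
    return classifier_a(x) if rng.random() < p else classifier_b(x)


def rates(predict, samples, labels):
    """Empirical (FPR, TPR) of a decision rule on labelled samples."""
    tp = sum(1 for x, y in zip(samples, labels) if y and predict(x))
    fp = sum(1 for x, y in zip(samples, labels) if not y and predict(x))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


rng = random.Random(0)
n = 20000
labels = [rng.random() < 0.5 for _ in range(n)]
# Positives are centred at (1, 1), negatives at (0, 0), unit variance.
samples = [
    (rng.gauss(1.0 if y else 0.0, 1.0), rng.gauss(1.0 if y else 0.0, 1.0))
    for y in labels
]

p = 0.3
fpr_a, tpr_a = rates(classifier_a, samples, labels)
fpr_b, tpr_b = rates(classifier_b, samples, labels)
fpr_c, tpr_c = rates(lambda x: randomized_c(x, p, rng), samples, labels)

# The convex combination predicted for the coin-flip rule.
expected_fpr = p * fpr_a + (1 - p) * fpr_b
expected_tpr = p * tpr_a + (1 - p) * tpr_b

print(f"A: FPR={fpr_a:.3f} TPR={tpr_a:.3f}")
print(f"B: FPR={fpr_b:.3f} TPR={tpr_b:.3f}")
print(f"C: FPR={fpr_c:.3f} TPR={tpr_c:.3f}")
print(f"expected for C: FPR={expected_fpr:.3f} TPR={expected_tpr:.3f}")
```

Up to sampling noise, the measured rates of C should sit on the segment joining the operating points of A and B, at the position set by $p$.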

@leonbloy I believe that x is the threshold and for low values of x classifier A works best. For high values of x classifier B works best. By best I mean that for the given false positive rate the true positive rate is the highest. If all we know is that A works best up to a single point where they cross and B for all thresholds above that, then any algorithm that gives weight less than 1 to A in the region between FPa and FPb where A has the higher TP cannot perform as well as A. So such an algorithm C has to fall below A in that region.
– Michael R. Chernick, Commented May 15, 2012 at 15:45
Similarly, in the region between FPa and FPb where TP is higher for B, no algorithm with p greater than 0 will perform better than B. The formula for TPc is correct, but a fixed weighted average between TPb and TPa cannot be larger than the greater of TPa and TPb. It has to fall between them. But the diagram always shows TPc above TPa and TPb throughout the region from FPa to FPb. Do you see something here that we are missing? I don't find it in your answer.
– Michael R. Chernick, Commented May 15, 2012 at 15:51
Okay, the light bulb went on! X is a vector in your mind rather than a scalar threshold. Does that really change anything? The FP axis is a scalar probability. My crossing point is the FP point of equality for A and B. There could be many vectors X that lead to it. I am just saying that at any point along the FP axis between FPa and FPb, TPc = p TPa + (1-p) TPb. The line in the plot is in the TP vs FP plane. How could that line go through the points above the curves for both A and B, as the OP questioned (I think properly)?
– Michael R. Chernick, Commented May 15, 2012 at 17:14
@Michael: I think of A and B as distinct methods that give different decision boundaries. Each one has an adjustable parameter (what in 1D is a threshold); the parameters are independent and give (for each) a family of classifiers. I'll try to plot a diagram to clarify, hold on.
– leonbloy, Commented May 15, 2012 at 17:44
I gave leonbloy an upvote for that pretty description. But I like cardinal's final comment because that argument is clear to me and agrees with my latest thinking. @leonbloy The one thing missing from your diagram is a plot of the points for the randomized rule that beats both individual ones. I guess you can describe the new rule as one that weights the two errors differently, but it is not necessary, and I think it is less confusing if you leave that argument out.
– Michael R. Chernick, Commented May 15, 2012 at 19:26

Michael R. Chernick · Accepted Answer · 2012-05-15 15:04:55Z

2

I agree with your reasoning. If you flip a coin to pick one of the classifiers when you are between points A and B, your point on the curve would always be below the better classifier and above the poorer one, and could not possibly be above both! There must be something wrong with the diagram. At the point where the two ROC curves cross, the random selection algorithm will have the same performance as the two algorithms. It will not be above them the way the diagram depicts.

edited May 15, 2012 at 15:04

answered May 6, 2012 at 14:16

Michael R. Chernick

I believe the slide is correct. If you use two different decision procedures with two different thresholds and then take a randomized decision, you will get a convex combination which will give a point lying in between the two. This point may be above both (!) of the curves at the same false positive rate. This is because the threshold used for each procedure is different at that point.
– cardinal, Commented May 15, 2012 at 18:54
So the A and B in the convex combination are different from the A and B that would be chosen individually at that false positive rate. I just think the diagram was confusing, as I did not see that A and B were selected from a family of classifiers.
– Michael R. Chernick, Commented May 15, 2012 at 19:11
Yes, the diagram is a bit confusing, as is the surrounding text! It took me a few minutes to unravel it. You fix two decision procedures, one from each curve, say at the lower "hump" for the $A$ curve and the upper hump for the $B$ curve. This defines a rule for each. Then, the randomized rule is to flip a coin and take the decision from the corresponding classifier. This yields a new rule with different TP and FP rates from both of them, giving the convex combination shown. (NB: To get even a single ROC curve at all, one needs a parametrized family of classifiers.)
– cardinal, Commented May 15, 2012 at 19:18
I believe that this answer is the correct one, when supplemented with cardinal's comment! Getting out of the intersection area might happen, but it's not a method. I've found the original paper from the author who invented this method, and it explains it very well! bmva.org/bmvc/1998/pdf/p082.pdf
– hyperknot, Commented May 19, 2012 at 12:54
@zsero: I believe that even Michael will acknowledge that this answer was based on the understanding of the diagram at the time the answer was posted and his interpretation of it has changed since the comments and other answer appeared. Just as the figure depicts, one can achieve via randomization any point on any line between a point on the first curve and a point on the second, even if the resulting true positive rate dominates the other two curves for a given false positive rate.
– cardinal, Commented May 20, 2012 at 14:56
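The point made in this last comment can be illustrated with a small numeric sketch. The two piecewise-linear ROC curves below are invented for illustration (they are not the curves from the slides), and `interpolate` is a hypothetical helper.

```python
def interpolate(curve, fpr):
    """Linearly interpolate the TPR at a given FPR on a piecewise-linear ROC curve."""
    for (f0, t0), (f1, t1) in zip(curve, curve[1:]):
        if f0 <= fpr <= f1:
            return t0 + (t1 - t0) * (fpr - f0) / (f1 - f0)
    raise ValueError("fpr must lie in [0, 1]")


# Hypothetical ROC curves as lists of (FPR, TPR) points.
curve_a = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.7), (1.0, 1.0)]  # strong at low FPR
curve_b = [(0.0, 0.0), (0.3, 0.3), (0.5, 0.9), (1.0, 1.0)]  # strong at high FPR

# Fix one operating point on each curve (the "humps").
point_a = (0.1, 0.6)
point_b = (0.5, 0.9)

# The coin-flip rule with p = 0.5 lands halfway along the chord.
p = 0.5
mix_fpr = p * point_a[0] + (1 - p) * point_b[0]
mix_tpr = p * point_a[1] + (1 - p) * point_b[1]

# What each method achieves on its own at that same false positive rate.
tpr_a_same_fpr = interpolate(curve_a, mix_fpr)
tpr_b_same_fpr = interpolate(curve_b, mix_fpr)

print(f"randomized rule: FPR={mix_fpr:.2f} TPR={mix_tpr:.2f}")
print(f"A alone at that FPR: TPR={tpr_a_same_fpr:.2f}")
print(f"B alone at that FPR: TPR={tpr_b_same_fpr:.2f}")
```

In this made-up example the randomized rule sits above both curves at its false positive rate, because the two rules being mixed use different settings from the ones either method would need to reach that rate alone.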
