I found this a very interesting question and I struggled to think of scenarios where binning a response variable would lead to better predictions.
The best I could come up with is a scenario like this one (all code is attached at the end): the red class corresponds to $y \leq 1$, the blue class to $y > 1$, and we have one (or of course more) predictor that is uncorrelated with $y$ within each class but separates the classes perfectly.
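To make the construction concrete, here is a condensed sketch of just the data-generating step (the full code at the end adds the train/test split and the plot):

```r
library(tibble)

# Condensed data-generating sketch: within each class, y and the predictor
# are drawn independently (so the predictor carries no information about y
# within a class), but the predictor separates red (y <= 1) from blue (y > 1)
# perfectly because its range differs between the classes.
n = 5000
sketch = tibble(
  y         = c(runif(n, min = 0, max = 1), runif(n, min = 1, max = 2)),
  class     = rep(c(0L, 1L), each = n),   # 0 = red, 1 = blue
  predictor = c(runif(n, min = 0, max = 1), runif(n, min = 1, max = 2)))
```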
Here, a Firth penalized logistic regression
```
          Predicted
Truth      red   blue
  red     5000      0
  blue       2   4998
```
beats a simple linear model (followed by classifying based on whether predictions are >1):
```
          Predicted
Truth      red   blue
  red     4970     30
  blue       0   5000
```
However, let's be honest: part of the problem is that a linear regression is not a great model for this problem in the first place. Replacing the linear regression and the logistic regression with a regression and a classification random forest, respectively, deals with this perfectly. Both produce the following result (code at the end):
```
          Predicted
Truth      red   blue
  red     5000      0
  blue       0   5000
```
Still, I guess this is at least an example where binning seems to do a little better, as long as we restrict ourselves to models with a linear regression equation (which of course still totally ignores the possibility of using splines etc.; a quick spline sketch follows the code below).
```r
library(tidyverse)
library(ranger)
library(ggrepel)
library(logistf)

# Set defaults for ggplot2 ----
theme_set(theme_bw(base_size=18) + theme(legend.position = "none"))
scale_colour_discrete <- function(...) {
  # Alternative: ggsci::scale_color_nejm(...)
  scale_colour_brewer(..., palette="Set1")
}
scale_fill_discrete <- function(...) {
  # Alternative: ggsci::scale_fill_nejm(...)
  scale_fill_brewer(..., palette="Set1")
}
scale_colour_continuous <- function(...) {
  scale_colour_viridis_c(..., option="turbo")
}
update_geom_defaults("point", list(size=2))
update_geom_defaults("line", list(size=1.5))
# To allow adding label to points e.g. as geom_text_repel(data=. %>% filter(1:n()==n()))
update_geom_defaults("text_repel", list(label.size = NA, fill = rgb(0,0,0,0),
                                        segment.color = "transparent", size=6))

# Start program ----
set.seed(1234)
records = 5000

# Create the example data including a train-test split
example = tibble(y = c(runif(n=records*2, min = 0, max=1),
                       runif(n=records*2, min = 1, max=2)),
                 class = rep(c(0L,1L), each=records*2),
                 test = factor(rep(c(0,1,0,1), each=records), levels=0:1,
                               labels=c("Train", "Test")),
                 predictor = c(runif(n=records*2, min = 0, max=1),
                               runif(n=records*2, min = 1, max=2)))

# Plot the dataset
example %>%
  ggplot(aes(x=predictor, y=y, col=factor(class))) +
  geom_point(alpha=0.3) +
  facet_wrap(~test)

# Linear regression
lm1 = lm(data=example %>% filter(test=="Train"), y ~ predictor)
# Performance of linear regression prediction followed by classifying by prediction>1
table(example %>% filter(test=="Test") %>% pull(class),
      predict(lm1, example %>% filter(test=="Test")) > 1)

# Firth penalized logistic regression
glm1 = logistf(data=example %>% filter(test=="Train"), class ~ predictor, pl=F)
# Performance of classifying by predicted log-odds from Firth LR being >0
table(example %>% filter(test=="Test") %>% pull(class),
      predict(glm1, example %>% filter(test=="Test")) > 0)

# Now, let's try this with RF instead:
# First, binary classification RF
rf1 = ranger(formula = class ~ predictor,
             data=example %>% filter(test=="Train"),
             classification = T)
table(example %>% filter(test=="Test") %>% pull(class),
      predict(rf1, example %>% filter(test=="Test"))$predictions)

# Now regression RF
rf2 = ranger(formula = y ~ predictor,
             data=example %>% filter(test=="Train"),
             classification = F)
table(example %>% filter(test=="Test") %>% pull(class),
      predict(rf2, example %>% filter(test=="Test"))$predictions > 1)
```
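And just to gesture at the splines remark above, here is a sketch (not part of the comparison above, with an arbitrary, untuned choice of `df = 10`) of letting the regression route approximate the step in the mean of $y$ via a natural cubic spline basis; it reuses the `example` tibble created by the code above:

```r
library(splines)

# Sketch only: a more flexible "regression-style" model using a natural
# cubic spline basis for the predictor (df = 10 is an arbitrary, untuned
# choice), followed by the same classify-if-prediction-exceeds-1 rule.
lm_spline = lm(data = example %>% filter(test == "Train"),
               y ~ ns(predictor, df = 10))

table(example %>% filter(test == "Test") %>% pull(class),
      predict(lm_spline, example %>% filter(test == "Test")) > 1)
```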