$\begingroup$

Let's say I put the following two datasets in the best possible model (same model for both):

  • A raw dataset, the variables as they came just from the query.
  • A feature-engineered dataset, with hundreds of created variables, which came from the same raw dataset I just mentioned.

Could the difference between both AUCs be high? How much?

$\endgroup$
  • $\begingroup$ Any ground-rules here, on what "raw vs feature-engineered" and "best possible model" can mean? $\endgroup$ Commented Jan 17, 2020 at 21:58
  • $\begingroup$ Yes. Raw: the variables have missing values; no grouped variables are derived (e.g., a mean by group); no sums (A+B), differences (A−B), or ratios (A/B) are calculated. Feature-engineered: mean encoding, frequency encoding, impact encoding, binning into ranges, ranks, lagged variables, a new variable derived from clustering. Best model: let's say XGBoost. $\endgroup$ Commented Jan 17, 2020 at 22:07
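As an illustrative sketch of two of the encodings named in the comment above (frequency encoding and mean encoding), assuming pandas is available; the column names `cat` and `y` are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "b", "a", "c", "b", "a"],
                   "y":   [1, 0, 1, 0, 1, 0]})

# Frequency encoding: replace each category by its relative frequency.
df["cat_freq"] = df["cat"].map(df["cat"].value_counts(normalize=True))

# Mean (target) encoding: replace each category by the mean of y within it.
# In practice this should be computed out-of-fold to avoid target leakage.
df["cat_mean"] = df["cat"].map(df.groupby("cat")["y"].mean())
```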

2 Answers

$\begingroup$

Yes, the performance can vary a lot using feature engineering.

Example: suppose a dataset where the response variable $y$ is true if $x$ is odd.

  x    y
 346   F
  13   T
 178   F
  64   F
 987   T
 ...

Most learning models will fail to identify this pattern and will perform poorly, typically falling back to always predicting the majority class. However, simply adding a feature $x \% 2$ to the data will allow virtually any model to perform perfectly.

Of course, this is a toy example, but the point stands: a single well-chosen feature can drastically change performance. Naturally, the size of the improvement depends entirely on the data and the nature of the features added.
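As a rough sketch of the parity example, assuming scikit-learn is available (the dataset, split, and model choice here are illustrative, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=2000)
y = x % 2  # target: 1 when x is odd

def auc_for(X):
    """Fit a logistic regression and return its held-out AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

auc_raw = auc_for(x.reshape(-1, 1))                    # raw x only
auc_engineered = auc_for(np.column_stack([x, x % 2]))  # raw x plus parity

# Parity is uncorrelated with magnitude, so the raw AUC hovers near chance,
# while the engineered feature makes the classes (almost) perfectly separable.
print(auc_raw, auc_engineered)
```

The same gap would appear with XGBoost or any other learner that cannot invent modular arithmetic on its own.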

$\endgroup$
$\begingroup$

I would say that the best possible model for the raw data would derive all the meaningful features that you would otherwise have created from the data yourself.

And I would say that the best possible model for the feature-engineered dataset would remove or ignore unnecessary features.

The best possible model would have an AUC of 1 anyway, since it makes every prediction correctly. Even in a noisy setting where an AUC of 1 cannot be achieved, I think the argument holds.

Training time and convergence speed may differ between the two, however.

$\endgroup$
