One-hot encode a numeric categorical feature (e.g. year built, satisfaction out of 10, etc) or not?

Question

I'm trying to understand the pros and cons of different approaches for encoding a certain feature rather than keeping its numerical value.

Let's say we have a dataframe with a Satisfaction column whose values range from 1 to 10, and we are trying to regress on a continuous Y value: the probability that the client returns.

10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor

In this example the variable is categorical, but there is an order relationship between the values which could be useful for predicting Y. A user with a 10 would be more inclined to return than a user with a 1.

But the model would probably be able to figure out by itself which category is more likely to return, by looking at the Y variable in the training dataset.

Also, by keeping the 1-10 order relationship we assume that the distance between every pair of adjacent categories is the same, while I'd say the difference in sentiment between Average and Good is probably not the same as the one between Excellent and Very Excellent.

I could:

One-hot encode the column to eliminate the numerical relationship between the values
Keep the data as numeric
Do both: keep the values as numeric and add a one-hot encoding
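For concreteness, here is a minimal sketch of the three options using pandas. The toy ratings, the sat prefix and the variable names are assumptions made up for illustration, not part of the original dataset.

```python
import pandas as pd

# Hypothetical toy data; only the Satisfaction column name comes from the question.
df = pd.DataFrame({"Satisfaction": [1, 5, 7, 10, 9]})

# Declare all ten levels so unobserved ratings still get a column.
levels = pd.CategoricalDtype(categories=list(range(1, 11)))

# Option 1: one-hot encode, which discards the order between ratings.
one_hot = pd.get_dummies(df["Satisfaction"].astype(levels), prefix="sat")

# Option 2: keep the rating as a single numeric column.
numeric = df[["Satisfaction"]]

# Option 3: keep the numeric column and add the one-hot columns next to it.
both = pd.concat([numeric, one_hot], axis=1)

print(one_hot.shape, numeric.shape, both.shape)
```

Option 1 gives the model ten independent columns, option 2 a single column that carries the order, and option 3 eleven columns with partly redundant information.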

Can someone shed light on the nuances between these options?

Daniel Warfield · Accepted Answer · 2022-02-10 20:31:23Z

That rating system is not categorical, it's ordinal, meaning there is a scale and order to the data.

There are few hard and fast rules, because sometimes doing things "wrong" produces better results under particular circumstances. However, I would recommend using numeric data. Predicting a regression instead of a classification will help the model to understand that there is an order to the output, which will likely improve performance in a use case like this. If you present it as categorical, your model has to spend more "effort" learning that there is an order to the output. This may require a more complex model, meaning longer training times, less efficient prediction, and a higher risk of overfitting.

Your output will likely be easier to interpret too. Instead of

80% very poor 10% poor 3% moderate

you would get a single value between 0 and 1, for instance 0.13. (You will want to normalize your data range by shrinking it to be between 0 and 1.)
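One common way to do that shrinking is min-max scaling. The sketch below assumes the scale bounds are the 1 and 10 from the question; the function name min_max_scale is made up for illustration.

```python
def min_max_scale(value, low=1, high=10):
    """Map a rating from [low, high] linearly onto [0, 1]."""
    return (value - low) / (high - low)


# Hypothetical ratings on the 1-10 scale from the question.
ratings = [1, 4, 10]
scaled = [min_max_scale(r) for r in ratings]
print(scaled)
```

The lowest rating maps to 0 and the highest to 1, and the order of the ratings is preserved.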
