0
$\begingroup$

I'm trying to understand the pros and cons of different approaches for encoding a certain feature rather than keeping its numerical value.

Let's say we have a dataframe that has a Satisfaction column with values in the range 1-10 and we were trying to regress on a continuous Y value which is the rate of probability of the client to return

 10 Very Excellent 9 Excellent 8 Very Good 7 Good 6 Above Average 5 Average 4 Below Average 3 Fair 2 Poor 1 Very Poor 

In this example the variable is categorical but we have an order relationship between the values which could be useful for predicting the Y. An user with a 10 would be more inclined to return than an user with a 0.

But the model would probably be able to figure out by itself which category is more likely to return while looking at the Y variable in the train dataset during training.

Also by keeping the order relationship 1-10 we assume that between all of the category there is the same distance, while I'd say that there probably is a different distance in emotion between Average and Good with respect to Excellent and Very Excellent.

I could:

  • One hot-encode the column to eliminate the numerical relationship between the values
  • Transform the data as numeric
  • Do both, keep the values as numeric and add a one-hot encoding

Can someone make light on what would be the nuances between all of the options?

$\endgroup$

1 Answer 1

1
$\begingroup$

That rating system is not categorical, it's ordinal, meaning there is a scale and order to the data.

There are few hard and fast rules, because sometimes doing things "wrong" produces better results under particular circumstances. However, I would recommend using numeric data. Predicting a regression instead of a classification will help the model to understand that there is an order to the output, which will likely improve performance in a use case like this. If you present it as categorical, your model has to spend more "effort" learning that there is an order to the output. This may require a more complex model, meaning longer training times, less efficient prediction, and a higher risk of overfitting.

Your output will likely be easier to interpret too. Instead of

80% very poor 10% poor 3% moderate 

you would get a single value between 0-1. So 0.13 for instance. (you will want to normalize your data range by shrinking it to be between 0 and 1).

types of data

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.