I'm trying to understand the pros and cons of different approaches for encoding a certain feature rather than keeping its numerical value.
Let's say we have a dataframe with a Satisfaction column whose values range from 1 to 10, and we are trying to regress on a continuous Y value: the probability that the client returns.
The scale is:
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor

In this example the variable is categorical, but there is an order relationship between the values which could be useful for predicting Y. A user with a 10 would be more inclined to return than a user with a 1.
But the model would probably be able to figure out by itself which categories are more likely to return, by looking at the Y variable in the training set.
Also, by keeping the 1-10 order relationship we assume that the distance between every pair of adjacent categories is the same, while I'd say the emotional distance between Average and Good is probably different from the one between Excellent and Very Excellent.
I could:
- One-hot encode the column to eliminate the numerical relationship between the values
- Keep the column as a numeric feature
- Do both, keep the values as numeric and add a one-hot encoding
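
To make the three options concrete, here is a minimal pandas sketch of what each one would produce. The toy values, the `sat` column prefix and the variable names are made up for illustration; only the `Satisfaction` column name comes from the description above.

```python
import pandas as pd

# Hypothetical toy data: six clients with made-up satisfaction scores.
df = pd.DataFrame({"Satisfaction": [1, 5, 7, 10, 9, 3]})

# Declare all ten levels so that scores absent from this sample
# still get their own one-hot column.
levels = list(range(1, 11))
as_category = df["Satisfaction"].astype(pd.CategoricalDtype(categories=levels))

# Option 1: one-hot encoding, one 0/1 column per level (sat_1 ... sat_10).
one_hot = pd.get_dummies(as_category, prefix="sat", dtype=int)

# Option 2: keep the score as a single numeric feature.
numeric = df[["Satisfaction"]].astype(float)

# Option 3: both, the numeric score side by side with the one-hot columns.
both = pd.concat([numeric, one_hot], axis=1)

print(one_hot.shape)  # 10 indicator columns
print(numeric.shape)  # 1 numeric column
print(both.shape)     # 11 columns in total
```

With option 1 the model sees ten unrelated indicator columns, with option 2 it sees one column where each step of the scale has the same size, and with option 3 it sees both representations at once.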
Can someone shed light on the nuances of each of these options?