I did some research online about categorical variables with high cardinality. Many posts and papers stop short at the conclusion that "it skews the model's performance" without going into detail about why and how high cardinality skews model performance.
In particular, how does it skew tree-based and distance-based models respectively? I have two thoughts about why high cardinality causes problems:
i) under some encoding methods, e.g. one-hot encoding, it leads to the curse of dimensionality.
ii) under label encoding, the interval/distance between the encoded values carries no meaning, e.g. the distance between categories 1 and 2 means nothing to a distance-based algorithm.
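To make these two points concrete, here is a small sketch using pandas (the `zip_code` column and its values are made up for illustration):

```python
import pandas as pd

# A toy high-cardinality categorical column: one distinct level per row.
df = pd.DataFrame({"zip_code": ["90210", "10001", "60601", "94105"]})

# (i) One-hot encoding creates one column per category level,
# so a column with k distinct levels adds k dimensions.
one_hot = pd.get_dummies(df["zip_code"])
print(one_hot.shape)  # (4, 4): 4 rows, 4 new columns for 4 levels

# (ii) Label encoding assigns arbitrary integers (here, alphabetical
# order), so the numeric distance between codes reflects nothing
# about the categories themselves.
codes = df["zip_code"].astype("category").cat.codes
print(codes.tolist())  # [2, 0, 1, 3]
```

So with, say, 10,000 zip codes, one-hot encoding would add 10,000 sparse columns, while label encoding would impose an arbitrary ordering on them.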
But beyond these two points, I am looking for more (i.e. why and how high cardinality skews models). A further question also branches naturally from the above logic:
If high cardinality is bad in categorical variables, why does no one complain about "too high cardinality" in numerical variables, especially since numerical and categorical variables are usually mixed in a single tabular dataset?
Thank you in advance!