I did some research online about categorical variables with high cardinality. Many posts and papers stop short at the conclusion that "it skews the model's performance" without going into detail about why and how high cardinality skews model performance.
In particular, how does it skew tree-based and distance-based models respectively? I have two thoughts about why high cardinality causes problems:
i) under some encoding methods, e.g. one-hot encoding, it leads to the curse of dimensionality.
ii) under label encoding, the interval/distance between the encoded values carries no meaning, e.g. the distance between categories 1 and 2 means nothing to a distance-based algorithm.
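To make these two points concrete, here is a small sketch using pandas (the `zip_code` column and its values are made up for illustration):

```python
import pandas as pd

# A toy high-cardinality categorical column: one distinct level per row.
df = pd.DataFrame({"zip_code": ["90210", "10001", "60601", "94105"]})

# (i) One-hot encoding creates one column per category level,
# so a column with k distinct levels adds k dimensions.
one_hot = pd.get_dummies(df["zip_code"])
print(one_hot.shape)  # (4, 4): 4 rows, 4 new columns for 4 levels

# (ii) Label encoding assigns arbitrary integers (here, alphabetical
# order), so the numeric distance between codes reflects nothing
# about the categories themselves.
codes = df["zip_code"].astype("category").cat.codes
print(codes.tolist())  # [2, 0, 1, 3]
```

So with, say, 10,000 zip codes, one-hot encoding would add 10,000 sparse columns, while label encoding would impose an arbitrary ordering on them.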
But beyond these two points, I am looking for more (i.e. why and how high cardinality skews models). A further question also branches naturally from the above logic:
If high cardinality is bad in categorical variables, why does no one complain about "too high cardinality" in numerical variables, especially since numerical and categorical variables are usually mixed in a single tabular dataset?
Thank you in advance!