I am using CatBoost, and one thing I noticed in the guide is that it says not to one-hot encode categorical features during preprocessing.
Each row in my data has a single target, but one of my features is multi-valued: it can take thousands of distinct values, and a single row can have several of them at once. Any given row could have zero, one, or twenty of these values associated with it. I am struggling with how best to present this data to CatBoost.
My first thought was to use something like 20 'holder' columns and put each row's feature values into them. However, the values have no particular order, so the same value would tend to 'jump' between columns from row to row.
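To make the 'holder' column idea concrete, here is a minimal sketch in plain Python. The row layout, the slot count of 20, and the `"none"` placeholder are all assumptions for illustration, not my real data.

```python
# Hypothetical rows: one target each, plus an unordered list of feature values.
rows = [
    {"target": 1.0, "values": ["a", "c"]},
    {"target": 0.0, "values": ["c"]},
    {"target": 1.0, "values": []},
]

N_SLOTS = 20   # assumed upper bound on values per row
PAD = "none"   # assumed placeholder for unused slots


def to_holder_columns(values, n_slots=N_SLOTS, pad=PAD):
    """Place values into n_slots fixed columns, padding the unused ones."""
    if len(values) > n_slots:
        raise ValueError(f"row has {len(values)} values, only {n_slots} slots")
    return list(values) + [pad] * (n_slots - len(values))


holder_table = [to_holder_columns(r["values"]) for r in rows]

# The value "c" lands in slot 1 for the first row but slot 0 for the second.
# Each slot would be treated as a separate feature, hence the 'jumping'.
print(holder_table[0][:3])
print(holder_table[1][:3])
```

Sorting the values within each row would make the placement deterministic, but a given value would still shift slots depending on which other values are present.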
My second thought is one-hot encoding, with a column per possible feature value. This would create thousands of new features, each active in only a small percentage of rows. I feel like this is the wrong approach, especially since CatBoost explicitly says not to do this.
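As a sketch of what that column-per-value layout would look like (strictly a multi-hot encoding, since one row can switch on several columns), assuming a toy list of per-row values with hypothetical names:

```python
# Hypothetical per-row value lists; my real data has thousands of distinct values.
row_values = [["a", "c"], ["c"], []]

# One column per distinct value, in a fixed (sorted) order.
vocab = sorted({v for values in row_values for v in values})
column_of = {value: i for i, value in enumerate(vocab)}


def to_indicator_row(values):
    """Return a 0/1 vector with a 1 in the column of each value present."""
    row = [0] * len(vocab)
    for v in values:
        row[column_of[v]] = 1
    return row


indicator_matrix = [to_indicator_row(values) for values in row_values]

# Fraction of rows in which each column is active; with thousands of values
# most of these fractions would be very small.
active_fraction = [sum(col) / len(row_values) for col in zip(*indicator_matrix)]
```

With a real vocabulary the matrix would be thousands of columns wide and almost entirely zeros, which is the part that worries me.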
My third thought was to duplicate rows so that this feature has a single value per row, repeating the target sample once for each feature value that is active for it. So if a sample has 10 associated values, I would create 10 rows with the same target and a different value for the feature in each row.
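A minimal sketch of that duplication, again in plain Python. The `id` field and the `"none"` placeholder for samples with no values are assumptions I added so that empty samples are not dropped and duplicated rows can be traced back to their source sample.

```python
# Hypothetical samples: one target each, zero or more feature values.
samples = [
    {"id": 0, "target": 1.0, "values": ["a", "c"]},
    {"id": 1, "target": 0.0, "values": ["c"]},
    {"id": 2, "target": 1.0, "values": []},
]


def explode(samples, pad="none"):
    """Emit one row per (sample, value) pair, repeating the sample's target."""
    long_rows = []
    for s in samples:
        # A sample with no values still gets one row, using the placeholder.
        for v in s["values"] or [pad]:
            long_rows.append({"id": s["id"], "target": s["target"], "value": v})
    return long_rows


long_rows = explode(samples)
for row in long_rows:
    print(row)
```

One thing I am unsure about with this layout is that rows sharing an `id` are not independent, so any train/validation split would presumably need to keep them together, and samples with many values would carry more weight than samples with few.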
My main confusion is how to handle the one-to-many relationship between a target and its feature values. I have read about 'feature extraction' but am not sure whether it applies in my case. Any given value of the feature is active only a small percentage of the time. Should I just ignore these feature values, even though I know they should have an impact on the target?
There are likely groups of values that have the same general effect on the target variable, but I do not know ahead of time which values belong together.
Thoughts?