I have a sales dataset with a binary output label: "Business win" or "Business loss" for our products.
We have a set of first-level customers (let's call that group jacks) with whom we do business. These jacks then sell our products to end customers (let's call that group roses).
The sales data contains fields such as sales id, product id, product name, product type, market segment (APAC, EMEA, etc.), jack id, jack category, jack region, rose id, rose category, rose region, project id, project name, etc.
A single jack can sell the same product across multiple projects (to the same or different roses).
As you can see, most of my input variables are categorical.
I would like to find out which features influence the business outcome, that is, win or loss.
For a given business win or loss, I would also like to find out why it happened (using LIME, SHAP, etc.).
My questions:
a) Since there are more than 100 unique products, should I create a one-hot encoded variable for each of them? We would like to find out whether the product is one of the features that can help us predict whether a deal is likely to be won or lost. For example: when Product A is ordered, there is an 80% chance that the business is lost. This is the sort of detailed insight I am after. I don't just want to know that the variable "product" is an important factor; I want to know which specific products lead to a business loss or win. I hope this clarifies what I am looking for.
b) I understand we can create one-hot encoded variables for the region variables because they have only about 4 values, such as APAC, EMEA, GC and EUROPE.
c) My dataset has 300K rows in total, but as you can see, most of the categorical variables have about 100 unique values. How should I decide whether a variable should be one-hot encoded or not?
d) Is there a better or alternative method for doing this?
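
To make the kind of per-product insight I mean in (a) concrete, here is a minimal sketch in plain Python. The rows, product names and the `loss_rate_by_category` helper are all made up for illustration and are not from my real data. It only computes a raw loss rate per product value; it is not a model and does not account for the other fields.

```python
from collections import Counter

# Hypothetical toy rows: (product_name, outcome). Not real data.
rows = [
    ("Product A", "loss"),
    ("Product A", "loss"),
    ("Product A", "loss"),
    ("Product A", "loss"),
    ("Product A", "win"),
    ("Product B", "win"),
    ("Product B", "win"),
    ("Product B", "loss"),
    ("Product C", "win"),
]


def loss_rate_by_category(rows):
    """Return {category: (n_rows, loss_rate)} for each category value."""
    totals = Counter()
    losses = Counter()
    for category, outcome in rows:
        totals[category] += 1
        if outcome == "loss":
            losses[category] += 1
    return {c: (n, losses[c] / n) for c, n in totals.items()}


summary = loss_rate_by_category(rows)

# Show products from highest to lowest loss rate, with their row counts.
for product, (n, rate) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
    print(f"{product}: {n} rows, loss rate {rate:.0%}")
```

The row count is printed next to each rate because a product with only a handful of rows (like the hypothetical Product C here) gives a rate that probably should not be trusted much, which is part of what I am unsure about in (c).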