Categorical to One hot encoding - Big data [closed]

Question

I have a sales dataset with a binary output label: "Business win" or "Business loss" for our products.

We have a set of first-level customers (let's call that group jacks) with whom we do business. These jacks then sell our products to end customers (let's call that group roses).

These sales data contain fields such as sales id, product id, product name, product type, market segment (like APAC, EMEA etc), jack id, jack category, jack region, rose id, rose category, rose region, project id, project name etc.

A single jack can sell the same product across multiple projects (to the same or different roses).

As you can see, most of my input variables are categorical.

I would like to find out which features influence the business outcome, that is, win or loss.

Whether it's a business win or loss, I would like to find out why (using Lime, SHAP, etc.).

My question

a) Since there are more than 100 unique products, should I create one-hot encoded variables for all of them? We would like to find out whether the product is one of the features that can help us predict whether we are likely to win or lose the business. For example: when Product A is ordered, there is an 80% chance that the business is a loss. That is one such feature, and I would like to get this sort of detailed insight for the others. I don't simply want to know that the variable "product" is an important factor; I would like to know which product leads to a business loss or win.

b) I understand we can create one-hot encoding variables for region variables because it has only 4 values like APAC, EMEA, GC, EUROPE, etc.

c) My dataset has 300K rows in total. But as you can see, most of the categorical variables have 100 unique values. How should I decide whether a variable should be one-hot encoded or not?

d) Is there any other better or alternative method to do this?

spectre · Accepted Answer · 2022-01-01 05:48:57Z

Let's answer your questions one by one.

a) Since there are more than 100 unique products, should I create one hot encoding variables for all my 100 products?

There are many ways to encode a categorical variable; you can find a list of them here. Which one you should use depends on your data. Categorical variables can be ordinal or nominal, and of high or low cardinality, and not all encoders work with all of these types. A simple Google search will lead you to articles explaining which encoder to use when. Here are a few articles I found: article 1, article 2, article 3.

Since your cardinality is more than 100, using OneHotEncoder will add one column per product, which increases dimensionality considerably and is not a good thing. So you should go for other encoders such as OrdinalEncoder, TargetEncoder, or others, depending on your data type.
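As a rough illustration of what a target encoder does, here is a minimal standard-library sketch of smoothed mean (target) encoding. The function name `target_encode`, the `smoothing` parameter, and the toy data are assumptions made for this example; it is not the implementation of any particular library, and the exact smoothing formula varies between implementations.

```python
from collections import defaultdict


def target_encode(categories, labels, smoothing=10.0):
    """Map each category to a smoothed mean of the binary label.

    Rare categories are pulled toward the global mean, so a product
    seen only once does not get an extreme value of 0.0 or 1.0.
    """
    global_mean = sum(labels) / len(labels)
    counts = defaultdict(int)
    sums = defaultdict(float)
    for cat, y in zip(categories, labels):
        counts[cat] += 1
        sums[cat] += y
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }


# Hypothetical toy data: 1 = business win, 0 = business loss.
products = ["A", "A", "A", "A", "B", "B", "C"]
wins = [0, 0, 0, 1, 1, 1, 1]

encoding = target_encode(products, wins, smoothing=2.0)
# One numeric column replaces what would be one column per product.
encoded_column = [encoding[p] for p in products]
```

The mapping should be fitted on the training split only and then applied to validation and test data; fitting it on all rows leaks the label into the feature.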

I would like to know which product leads to business loss or win.

You can get this kind of insight easily using SHAP and/or Lime.
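Alongside model explanations, a plain per-product loss rate gives a quick check on statements like "Product A has an 80% chance of loss". The sketch below uses only the standard library; `loss_rates`, `min_count`, and the toy data are hypothetical names and values chosen for illustration, and a raw rate like this does not control for the other variables the way a model-based explanation does.

```python
from collections import Counter


def loss_rates(products, outcomes, min_count=1):
    """Return (product, (loss_rate, n_rows)) pairs, highest loss rate first.

    Products with fewer than `min_count` rows are dropped, since a rate
    computed from a handful of rows is not reliable.
    """
    totals = Counter(products)
    losses = Counter(p for p, o in zip(products, outcomes) if o == "loss")
    rates = {
        p: (losses[p] / n, n) for p, n in totals.items() if n >= min_count
    }
    return sorted(rates.items(), key=lambda kv: kv[1][0], reverse=True)


# Hypothetical toy data.
products = ["A"] * 5 + ["B"] * 4 + ["C"]
outcomes = [
    "loss", "loss", "loss", "loss", "win",  # product A
    "win", "win", "loss", "win",            # product B
    "loss",                                 # product C (only one row)
]

table = loss_rates(products, outcomes, min_count=2)
for product, (rate, n) in table:
    print(f"{product}: loss rate {rate:.0%} over {n} rows")
```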

b) I understand we can create one-hot encoding variables for region variable because it has only 4 values like APAC, EMEA, GC, EUROPE etc.

Yes, you can one-hot encode them, provided there is no sense of order between them.
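For a low-cardinality variable such as region, one-hot encoding only adds a handful of columns. A minimal standard-library sketch follows; the helper `one_hot` and the sample values are assumptions for illustration, and in practice an encoder from a library would normally be used instead.

```python
def one_hot(values):
    """Return the sorted category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical sample of the region column.
regions = ["APAC", "EMEA", "GC", "EUROPE", "APAC"]

columns, encoded = one_hot(regions)
# Each row has exactly one 1, in the column of its own region.
for region, row in zip(regions, encoded):
    print(region, dict(zip(columns, row)))
```

With four regions this yields four columns; the same approach on a 100-value variable would yield 100 mostly-zero columns, which is why other encoders are suggested for the product field.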

c) My dataset has 300K rows in total. But as you can see, most of the categorical variables have 100 unique values. How should I decide whether a variable should be one-hot encoded or not?

As said above, which encoder to use depends on your data type and your problem type, i.e. whether it is a classification or a regression problem.

d) Is there any other better or alternative method to do this?

Yes, there are. Check out the links I mentioned above! :D

Cheers!

Carlos Mougan · Accepted Answer · 2021-12-31 09:15:08Z

I would recommend encoding high-cardinality categorical variables with target encoding methods:

Python Library: https://contrib.scikit-learn.org/category_encoders/
Paper: https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14

If you want to understand what the model is doing, I would recommend looking at the interpretability book: https://christophm.github.io/interpretable-ml-book/
