1
$\begingroup$

I'm currently trying to analyse a dataset containing multiple non-ordinal categorical features and a binary target variable. The table looks something like this:

+------------+---------+------------+--------+ | Col1 | .... | Col14 | Target | +------------+---------+------------+--------+ | cat 1 | cat 1 | cat 1 | 0 | | ... | ... | ... | ... | | cat 9 | cat 50 | cat 450 | 1 | +------------+---------+------------+--------+ 

The entire table is 400.000 rows x 15 columns, from which the last column is the target variable. Each feature has multiple non-ordinal categories ranging from 9 categories to multiple hundreds of categories.

My first instinct would be to one hot encode all the categorical variables. However, I'm scared that doing so will make any model prone to overfitting.

How could I handle/encode the features variables to analyse their effect on the target variable, using Python?

$\endgroup$

1 Answer 1

0
$\begingroup$

It looks like a case where target encoding will shine.

Target encoding replaces a category for the mean target in that category. You have to be careful not to overfit, but it is a very effective method to use categorical features with many levels.

There's a python package that implements target encoding using the syntax of Scikit Learn.

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.