Handling a Pandas Data Frame containing multiple non-ordinal categorical features

Question

I'm currently trying to analyse a dataset containing multiple non-ordinal categorical features and a binary target variable. The table looks something like this:

+------------+---------+------------+--------+
| Col1       | ....    | Col14      | Target |
+------------+---------+------------+--------+
| cat 1      | cat 1   | cat 1      | 0      |
| ...        | ...     | ...        | ...    |
| cat 9      | cat 50  | cat 450    | 1      |
+------------+---------+------------+--------+

The entire table is 400,000 rows x 15 columns, of which the last column is the target variable. Each feature is non-ordinal, with anywhere from 9 to several hundred categories.

My first instinct would be to one-hot encode all the categorical variables. However, I'm worried that the resulting number of columns will make any model prone to overfitting.

How could I handle/encode the feature variables to analyse their effect on the target variable, using Python?

David Masip · Accepted Answer · 2020-06-03 12:55:15Z

It looks like a case where target encoding will shine.

Target encoding replaces each category with the mean of the target within that category. You have to be careful not to overfit, since the encoding is computed from the target itself, but it is a very effective way to use categorical features with many levels.
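To make the idea concrete, here is a minimal pandas sketch of target encoding, not a definitive implementation. The function name `target_encode_oof`, the `smoothing` parameter, and the toy data are assumptions made for illustration. Two common guards against overfitting are assumed: each row is encoded using statistics from the other folds only (out-of-fold), and category means are shrunk towards the overall mean so that rare categories do not get extreme values.

```python
import numpy as np
import pandas as pd


def target_encode_oof(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold, smoothed mean-target encoding for one categorical column.

    Hypothetical helper for illustration; `smoothing` acts like a number of
    pseudo-observations placed at the overall target mean.
    """
    rng = np.random.default_rng(seed)
    # Shuffle row positions and split them into roughly equal folds.
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    encoded = np.full(len(df), np.nan)

    for enc_idx in folds:
        # Fit the statistics on every row that is NOT in the current fold.
        fit_mask = np.ones(len(df), dtype=bool)
        fit_mask[enc_idx] = False
        fit = df.iloc[fit_mask]

        prior = fit[target].mean()
        stats = fit.groupby(col)[target].agg(["sum", "count"])
        # Shrink each category mean towards the prior.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)

        # Categories unseen in the fitting rows fall back to the prior.
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded[enc_idx] = values.to_numpy(dtype=float)

    return pd.Series(encoded, index=df.index, name=f"{col}_te")


# Toy data standing in for one of the categorical columns.
toy = pd.DataFrame({
    "Col1": ["a", "a", "b", "b", "b", "c"] * 20,
    "Target": [1, 1, 0, 0, 1, 0] * 20,
})
toy["Col1_te"] = target_encode_oof(toy, "Col1", "Target")
print(toy.groupby("Col1")["Col1_te"].mean())
```

For unseen data, the encoding would be fitted once on the full training set and then applied to the new rows, so that test rows never contribute to their own encoding.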

There's a Python package that implements target encoding using the scikit-learn API.
