Treating missing data in categorical features

Question

I have a dataset in which one categorical column has a considerable number of missing values. The interesting thing about this column is that it has values only for one particular category in another column.

For example:

    column1     column2
    ========================================
    Google      -
    Google      -
    Google      -
    Google      -
    Facebook    Image
    Facebook    Video
    Facebook    Image

My column of interest has values only for one category (Facebook) of the other column. Therefore, the missing values for Google cannot be imputed with an average, cannot be predicted, and those rows cannot be dropped either.

In such a situation, is it wise to treat the missing values ('-') as a separate category in one-hot encoding? Or will this hurt my machine learning model?

In my view it depends: you will have to test the model both with and without the variable. Did you also try merging column 1 and column 2 into a single variable? (In your example, that would give three values: Google, FacebookImage and FacebookVideo.) That is another way to avoid having two highly correlated columns.
– Adept, Commented Aug 21, 2020 at 14:19
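The merge suggested in the comment above could be sketched as follows. This is a minimal illustration, not the commenter's code: the toy DataFrame and the names `combined` and `encoded` are assumptions, and the missing `column2` entries are assumed to be stored as real missing values rather than the `-` placeholder.

```python
import pandas as pd

# Toy data mirroring the question (assumed; None marks the missing entries).
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": [None, None, None, None, "Image", "Video", "Image"],
})

# Merge both columns into one label; rows without a column2 value
# simply keep their column1 value ("Google").
combined = df["column1"] + df["column2"].fillna("")

# One-hot encode the merged label: exactly one indicator is set per row.
encoded = pd.get_dummies(combined, dtype=int)
print(encoded)
```

With this encoding there is a single set of mutually exclusive indicators, so no separate "missing" category is needed for the Google rows.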

Shiv · Accepted Answer · 2020-08-23 20:39:48Z

3

You could split column 2 from your example into several indicator columns: Image, Video, ...

The new features would then look like this:

    Column1     Image   Video
    Google      0       0
    Google      0       0
    Facebook    1       0
    Facebook    0       1

Can we follow this method for all kinds of categorical columns?
– Vikas Ukani, Commented Sep 19, 2020 at 5:23

Suppose a categorical feature has too many unique values. This method would go wrong in that case, right?
– Vikas Ukani, Commented Sep 19, 2020 at 5:24


Soumendra Mishra · Accepted Answer · 2020-08-24 18:35:15Z

2

You can try this:

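    import pandas as pd

    # df is the question's DataFrame; missing column2 entries are assumed to be NaN, not '-'.
    # Rows with NaN get 0 in every new column. dtype=int keeps 0/1 output on pandas 2.0+,
    # where get_dummies returns booleans by default.
    df_new = pd.get_dummies(df, columns=['column2'], dtype=int)
    print(df_new)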

Output:

        column1  column2_Image  column2_Video
    0    Google              0              0
    1    Google              0              0
    2    Google              0              0
    3    Google              0              0
    4  Facebook              1              0
    5  Facebook              0              1
    6  Facebook              1              0
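To relate this output to the original question, here is a hedged sketch comparing the two treatments of the `-` placeholder. The toy DataFrame and the names `with_placeholder` and `as_missing` are assumptions made for illustration.

```python
import pandas as pd

# Toy data mirroring the question, with '-' stored as a literal string (assumed).
df = pd.DataFrame({
    "column1": ["Google"] * 4 + ["Facebook"] * 3,
    "column2": ["-", "-", "-", "-", "Image", "Video", "Image"],
})

# Option 1: leave '-' in place, so it becomes its own indicator column.
with_placeholder = pd.get_dummies(df, columns=["column2"], dtype=int)

# Option 2: turn '-' into a real missing value first; get_dummies then
# skips it and the Google rows are all zeros in the new columns.
cleaned = df.assign(column2=df["column2"].mask(df["column2"] == "-"))
as_missing = pd.get_dummies(cleaned, columns=["column2"], dtype=int)

print(with_placeholder.columns.tolist())
print(as_missing.columns.tolist())
```

In this toy data the extra `column2_-` indicator is 1 exactly on the Google rows, so it duplicates information already carried by `column1`. That is the correlation concern raised in the first comment.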

What if there are many unique values in column2, for instance Image, Video, PDF, DOC, Excel, Audio, etc.?
– Vikas Ukani, Commented Sep 19, 2020 at 5:26

It will work. For example, if you add a new value (Email), a new column will be added: column2_Email, column2_Image, column2_Video.
– Soumendra Mishra, Commented Sep 19, 2020 at 6:37

Are there any disadvantages to having too many feature columns? Suppose I use this method and end up with 200+ features in my DataFrame. Is there a downside to that?
– Vikas Ukani, Commented Sep 19, 2020 at 6:51

There are no performance issues. It all depends on your use case.
– Soumendra Mishra, Commented Sep 19, 2020 at 6:54

@VikasUkani If you want to use one-hot encoding, I would suggest using scikit-learn's OneHotEncoder instead of pd.get_dummies. Both produce the same kind of encoding, but OneHotEncoder has an advantage at deployment time: if a category that was not seen during training appears, pd.get_dummies will produce a set of columns that no longer matches the training data, which typically makes the model fail. OneHotEncoder avoids this if you specify the parameter handle_unknown='ignore'.
– spectre, Commented Aug 6, 2021 at 9:46
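A minimal sketch of the difference described in the last comment, assuming scikit-learn is installed. The `train` and `new` frames and the `Email` value are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categories available at training time (assumed toy data).
train = pd.DataFrame({"column2": ["Image", "Video", "Image"]})
# Data seen at deployment time; "Email" never appeared in training.
new = pd.DataFrame({"column2": ["Video", "Email"]})

# The fitted encoder remembers the training categories; unknown values
# are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_new = encoder.transform(new).toarray()
print(encoder.categories_[0])  # categories learned from train
print(encoded_new)             # [[0. 1.], [0. 0.]]

# get_dummies has no memory of the training data, so the columns it
# builds here differ from the ones the model was trained on.
dummies_new = pd.get_dummies(new, dtype=int)
print(dummies_new.columns.tolist())
```

The encoder output keeps the same two columns as in training regardless of what arrives later, which is what makes it safer to use inside a deployed pipeline.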
