
I am currently working on the Boston housing competition hosted on Kaggle. The dataset is nothing like the Titanic dataset: there are many categorical columns, and I'm trying to one-hot encode them. I've decided to start with the MSZoning column to get the approach working, then work out a strategy for applying it to the other categorical columns. This is a small snippet of the dataset:

[screenshot: first few rows of the dataset]

Here are the distinct values present in MSZoning, so plain integer encoding on its own would obviously be a bad idea:

['RL' 'RM' 'C (all)' 'FV' 'RH']

Here is my attempt in Python to assign the new one-hot-encoded data back to MSZoning. I do know that one-hot encoding turns each value into a column of its own and assigns binary values to each of them, so I realize that writing the result back into a single column isn't exactly a good idea; I wanted to try it anyway:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

labelEncoder = LabelEncoder()
train['MSZoning'] = labelEncoder.fit_transform(train['MSZoning'])

train_OHE = OneHotEncoder(categorical_features=train['MSZoning'])
train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()

print(train['MSZoning'])

Which is giving me the following (obvious) error:

C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:392: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 11, in <module>
    train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 511, in fit_transform
    self._handle_deprecations(X)
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 394, in _handle_deprecations
    n_features = X.shape[1]
IndexError: tuple index out of range

I did read through some Medium posts on this, but they didn't quite relate to what I'm trying to do with my dataset, since they were working with toy data containing only a couple of categorical columns. What I want to know is: how do I make proper use of one-hot encoding after the (attempted) step above?

  • Quick note: you have loaded the same dataframe for both train and test. – Commented Jun 10, 2019 at 9:41

2 Answers


First of all, I noticed you have loaded the same dataframe for both train and test. Change the code like this:

import numpy as np
import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

At this point, one-hot encode each variable you want with pandas' get_dummies() function:

# One-hot encode a given variable
OHE_MSZoning = pd.get_dummies(train['MSZoning'])

It will be returned as a pandas dataframe. In my Jupyter Notebook it looks like this:

OHE_MSZoning.head() 

[screenshot: output of OHE_MSZoning.head()]

You can repeat the same command for all the variables you want to one-hot encode.
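If the goal is to handle many categorical columns at once, a minimal sketch of the same idea (assuming the categoricals are stored as object-dtype string columns, so select_dtypes can pick them up) would be:

# Pick the string-typed columns and expand each into indicator columns;
# the numeric columns are passed through unchanged.
categorical_cols = train.select_dtypes(include='object').columns
train_encoded = pd.get_dummies(train, columns=categorical_cols)

train_encoded.filter(like='MSZoning').head()

Passing columns= keeps the rest of the dataframe intact and prefixes each indicator column with the original column name (e.g. MSZoning_RL), so you can tell the encoded columns apart.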

Hope this helps, otherwise let me know.

  • How come you're using pandas.get_dummies() over the sklearn function? – Commented Jun 10, 2019 at 9:57
  • It's the method I'm used to; I work all the time with pandas dataframes and I find it useful. It's not necessarily better than sklearn, but I used it because I'm sure it works. – Commented Jun 10, 2019 at 10:51
  • I'll definitely give it a try. Thank you, I'll accept your answer. If you think this was a well-asked question, could you give me an upvote? – Commented Jun 10, 2019 at 10:52
  • So do you just create a new variable for every instance of this? This dataset has many categorical variables. – Commented Jun 11, 2019 at 6:43

Here is an approach using the encoders from sklearn:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

labelEncoder = LabelEncoder()
MSZoning_label = labelEncoder.fit_transform(train['MSZoning'])

The mapping between classes and integer labels produced by sklearn's LabelEncoder can be seen in its classes_ attribute:

labelEncoder.classes_ 
array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object) 
onehotEncoder = OneHotEncoder(n_values=len(labelEncoder.classes_))
MSZoning_onehot_sparse = onehotEncoder.fit_transform([MSZoning_label])
  • Convert MSZoning_onehot_sparse from a sparse array to a dense array
  • Reshape the dense array to (n_examples, n_classes)
  • Convert from float to int type
MSZoning_onehot = MSZoning_onehot_sparse.toarray().reshape(len(MSZoning_label),-1).astype(int) 
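As an aside, the n_values argument is deprecated from scikit-learn 0.20 onward and removed in later releases. On a newer version, a minimal equivalent sketch is to fit the encoder directly on the column as a 2-D array and let it infer the categories itself (no LabelEncoder step needed):

from sklearn.preprocessing import OneHotEncoder

# Fit on a 2-D array of shape (n_samples, 1); the categories are inferred from the data.
encoder = OneHotEncoder()
MSZoning_onehot = encoder.fit_transform(train[['MSZoning']]).toarray().astype(int)

# One array of category labels per encoded column
print(encoder.categories_)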

Pack it back into a dataframe if you want:

MSZoning_label_onehot = pd.DataFrame(MSZoning_onehot, columns=labelEncoder.classes_)
MSZoning_label_onehot.head(10)

[screenshot: output of MSZoning_label_onehot.head(10)]
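If you'd rather fold the encoded columns back into the original data instead of keeping a separate dataframe, a minimal sketch reusing MSZoning_label_onehot from above could be:

# Drop the original string column and append the indicator columns;
# both frames share the default RangeIndex, so the rows line up.
train_encoded = pd.concat([train.drop(columns='MSZoning'), MSZoning_label_onehot], axis=1)
train_encoded[labelEncoder.classes_].head()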

  • I don't get this line: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object). MSZoning is already of type object. – Commented Jun 11, 2019 at 9:39
  • That is the output of the line above it: In [1]: labelEncoder.classes_  Out[1]: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object) – Commented Jun 11, 2019 at 9:52
  • When you pack it back into the dataframe, the dataframe isn't train. Shouldn't you put your OHE variables back into the mother data? – Commented Jun 11, 2019 at 10:20
  • I created a new dataframe in the example, but you can add it back to the train dataframe if you like; the indexes between the two are aligned. – Commented Jun 22, 2019 at 7:44
