
I am currently working on the Boston housing competition hosted on Kaggle. The dataset is nothing like the Titanic dataset: there are many categorical columns, and I'm trying to one-hot encode them. I've decided to start with the MSZoning column to get the approach working, then work out a strategy for applying it to the other categorical columns. This is a small snippet of the dataset:

[screenshot: first few rows of the dataset]

Here are the distinct values present in MSZoning, so plain integer encoding on its own would obviously be a bad idea:

['RL' 'RM' 'C (all)' 'FV' 'RH']

Here is my attempt in Python to assign the new one-hot-encoded data back to MSZoning. I do know that one-hot encoding turns each value into a column of its own and assigns binary values to each of them, so I realize that writing the result back into a single column isn't exactly a good idea; I wanted to try it anyway:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")

labelEncoder = LabelEncoder()
train['MSZoning'] = labelEncoder.fit_transform(train['MSZoning'])

train_OHE = OneHotEncoder(categorical_features=train['MSZoning'])
train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()

print(train['MSZoning'])

Which is giving me the following (obvious) error:

C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:392: DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.
  "use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Boston-Kaggle/Boston.py", line 11, in <module>
    train['MSZoning'] = train_OHE.fit_transform(train['MSZoning']).toarray()
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 511, in fit_transform
    self._handle_deprecations(X)
  File "C:\Users\security\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 394, in _handle_deprecations
    n_features = X.shape[1]
IndexError: tuple index out of range

I did read through some Medium posts on this, but they didn't quite relate to what I'm trying to do with my dataset, since they were working with toy data containing only a couple of categorical columns. What I want to know is: how do I make proper use of one-hot encoding after the (attempted) step above?

  • Quick note: you have loaded the same dataframe for both train and test. – Commented Jun 10, 2019 at 9:41

2 Answers


First of all, I noticed you have loaded the same dataframe for both train and test. Change the code like this:

import numpy as np
import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

At this point, one-hot encode each variable you want with pandas' get_dummies() function:

# One-hot encode a given variable
OHE_MSZoning = pd.get_dummies(train['MSZoning'])

It will be returned as a pandas dataframe. In my Jupyter Notebook it looks like this:

OHE_MSZoning.head() 

[screenshot: output of OHE_MSZoning.head()]

You can repeat the same command for all the variables you want to one-hot encode.
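If the goal is to handle many categorical columns at once, a minimal sketch of the same idea (assuming the categoricals are stored as object-dtype string columns, so select_dtypes can pick them up) would be:

# Pick the string-typed columns and expand each into indicator columns;
# the numeric columns are passed through unchanged.
categorical_cols = train.select_dtypes(include='object').columns
train_encoded = pd.get_dummies(train, columns=categorical_cols)

train_encoded.filter(like='MSZoning').head()

Passing columns= keeps the rest of the dataframe intact and prefixes each indicator column with the original column name (e.g. MSZoning_RL), so you can tell the encoded columns apart.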

Hope this helps, otherwise let me know.

  • How come you're using pandas.get_dummies() over the sklearn function? – Commented Jun 10, 2019 at 9:57
  • It's the method I'm used to; I work all the time with pandas dataframes and I find it useful. It's not necessarily better than sklearn, but I used it because I'm sure it works. – Commented Jun 10, 2019 at 10:51
  • I'll definitely give it a try. Thank you, I'll accept your answer. If you think this was a well-asked question, could you give me an upvote? – Commented Jun 10, 2019 at 10:52
  • So do you just create a new variable for every instance of this? This dataset has many categorical variables. – Commented Jun 11, 2019 at 6:43

Here is an approach using the encoders from sklearn:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Boston-Kaggle/master/test.csv")

labelEncoder = LabelEncoder()
MSZoning_label = labelEncoder.fit_transform(train['MSZoning'])

The mapping between classes and integer labels produced by sklearn's LabelEncoder can be seen in its classes_ attribute:

labelEncoder.classes_ 
array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object) 
onehotEncoder = OneHotEncoder(n_values=len(labelEncoder.classes_))
MSZoning_onehot_sparse = onehotEncoder.fit_transform([MSZoning_label])
  • Convert MSZoning_onehot_sparse from a sparse array to a dense array
  • Reshape the dense array to (n_examples, n_classes)
  • Convert from float to int type
MSZoning_onehot = MSZoning_onehot_sparse.toarray().reshape(len(MSZoning_label),-1).astype(int) 
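As an aside, the n_values argument is deprecated from scikit-learn 0.20 onward and removed in later releases. On a newer version, a minimal equivalent sketch is to fit the encoder directly on the column as a 2-D array and let it infer the categories itself (no LabelEncoder step needed):

from sklearn.preprocessing import OneHotEncoder

# Fit on a 2-D array of shape (n_samples, 1); the categories are inferred from the data.
encoder = OneHotEncoder()
MSZoning_onehot = encoder.fit_transform(train[['MSZoning']]).toarray().astype(int)

# One array of category labels per encoded column
print(encoder.categories_)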

Pack it back into a dataframe if you want:

MSZoning_label_onehot = pd.DataFrame(MSZoning_onehot, columns=labelEncoder.classes_)
MSZoning_label_onehot.head(10)

[screenshot: output of MSZoning_label_onehot.head(10)]
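If you'd rather fold the encoded columns back into the original data instead of keeping a separate dataframe, a minimal sketch reusing MSZoning_label_onehot from above could be:

# Drop the original string column and append the indicator columns;
# both frames share the default RangeIndex, so the rows line up.
train_encoded = pd.concat([train.drop(columns='MSZoning'), MSZoning_label_onehot], axis=1)
train_encoded[labelEncoder.classes_].head()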

  • I don't get this line: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object). MSZoning is already of type object. – Commented Jun 11, 2019 at 9:39
  • That is the output of the line above it: In [1]: labelEncoder.classes_  Out[1]: array(['C (all)', 'FV', 'RH', 'RL', 'RM'], dtype=object) – Commented Jun 11, 2019 at 9:52
  • When you pack it back into the dataframe, the dataframe isn't train. Shouldn't you put your OHE variables back into the mother data? – Commented Jun 11, 2019 at 10:20
  • I created a new dataframe in the example, but you can add it back to the train dataframe if you like; the indexes between the two are aligned. – Commented Jun 22, 2019 at 7:44
