First, let's create an MCVE to play with:
    import pandas as pd
    import numpy as np

    In [1]: categorical_array = np.random.choice(['Var1','Var2','Var3'], size=(5,3), p=[0.25,0.5,0.25])
            df = pd.DataFrame(categorical_array,
                              columns=map(lambda x: chr(97+x), range(categorical_array.shape[1])))
            # Add another column that isn't categorical but float
            df['d'] = np.random.rand(len(df))
            print(df)
    Out[1]:
          a     b     c         d
    0  Var3  Var3  Var3  0.953153
    1  Var1  Var2  Var1  0.924896
    2  Var2  Var2  Var2  0.273205
    3  Var2  Var1  Var3  0.459676
    4  Var2  Var1  Var1  0.114358
Now we can use pd.get_dummies to encode the first three columns.
Note that I'm using the `drop_first` parameter, because N-1 dummies are sufficient to fully describe N possibilities (e.g. if `a_Var2` and `a_Var3` are 0, then it's `a_Var1`). Also, I'm explicitly specifying the columns, but I don't have to: by default they will be the columns with dtype either `object` or `category` (more below).
    In [2]: df_encoded = pd.get_dummies(df, columns=['a','b', 'c'], drop_first=True)
            print(df_encoded)
    Out[2]:
              d  a_Var2  a_Var3  b_Var2  b_Var3  c_Var2  c_Var3
    0  0.953153       0       1       0       1       0       1
    1  0.924896       0       0       1       0       0       0
    2  0.273205       1       0       1       0       1       0
    3  0.459676       1       0       0       0       0       1
    4  0.114358       1       0       0       0       0       0
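For contrast, here's a quick sketch of the same call without `drop_first`, reusing the `df` built above; you get one dummy per category instead of N-1:

    # Sketch: without drop_first you also get the redundant first level,
    # e.g. a_Var1, b_Var1, c_Var1 (for whichever categories actually appear
    # in this random draw).
    df_full = pd.get_dummies(df, columns=['a','b', 'c'])
    print(df_full.columns.tolist())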
In your specific application, you'll have to provide a list of the columns that are categorical, or you'll have to infer which columns are categorical.
In the best case, your DataFrame already has these columns with `dtype=category`, and you can pass `columns=df.columns[df.dtypes == 'category']` to `get_dummies`.
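A minimal sketch of that, assuming you already know which columns to cast (here the toy columns `a`, `b`, `c` from above):

    # Sketch: cast the known categorical columns first (assumption: you know
    # which ones they are), then let the dtype do the selection.
    df[['a', 'b', 'c']] = df[['a', 'b', 'c']].astype('category')

    cat_cols = df.columns[df.dtypes == 'category']
    df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)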
Otherwise, I suggest setting the dtype of all other columns as appropriate (hint: `pd.to_numeric`, `pd.to_datetime`, etc.); you'll then be left with columns that have an `object` dtype, and those should be your categorical columns.
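Roughly, that workflow looks like this sketch (using the toy frame from above, where `d` is the only non-categorical column):

    # Sketch: coerce everything you *know* isn't categorical to a proper dtype
    # first (d is already float here, so this is a no-op in the toy example)...
    df['d'] = pd.to_numeric(df['d'])
    # ...similarly pd.to_datetime(...) for date-like columns, etc.

    # ...then whatever is still object-dtyped is presumably categorical.
    object_cols = df.columns[df.dtypes == 'object']
    df_encoded = pd.get_dummies(df, columns=object_cols, drop_first=True)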
The `pd.get_dummies` parameter `columns` defaults as follows:

    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
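So once the dtypes are in order, a sketch of the fully-default call would be:

    # Sketch: with the dtypes set correctly, this encodes the same columns
    # as the explicit columns=['a','b', 'c'] call above.
    df_encoded = pd.get_dummies(df, drop_first=True)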