Impute categorical missing values in scikit-learn

Question

I have a pandas DataFrame in which some columns are of text type, and those columns contain NaN values. I'm trying to impute the NaNs with sklearn.preprocessing.Imputer, replacing each NaN with the most frequent value. The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. When I run:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)

Python generates an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome

Imputer works on numbers, not strings. Convert to numbers, then impute, then convert back.
– Fred Foo, Commented Aug 11, 2014 at 9:32
Are there any suitable ways to automate it via scikit-learn?
– night_bat, Commented Aug 11, 2014 at 20:51
Why would it not allow categorical variables for the most_frequent strategy? Strange.
– ksha, Commented Dec 16, 2016 at 19:24
You can now use from sklearn.impute import SimpleImputer and then imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
– pentandrous, Commented Sep 19, 2019 at 1:13
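
A minimal sketch of the round trip described in the first comment (convert to numbers, impute, convert back), assuming a single text column. The sample Series and the choice of pd.factorize and SimpleImputer are illustrative assumptions, not something the commenters specified:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data: one text column with a missing value
s = pd.Series(['run1', 'run2', 'run1', np.nan], dtype=object)

# 1. Convert to numbers: factorize maps each label to an integer code, NaN to -1
codes, uniques = pd.factorize(s)
codes = codes.astype(float)
codes[codes == -1] = np.nan

# 2. Impute the numeric codes with the most frequent code
imputed = SimpleImputer(strategy='most_frequent').fit_transform(codes.reshape(-1, 1))

# 3. Convert back: look each code up in the array of original labels
restored = pd.Series(np.asarray(uniques)[imputed.ravel().astype(int)], index=s.index)
print(restored.tolist())
```

The integer codes carry no ordering, so this only makes sense with the most_frequent strategy; a mean or median of codes would be meaningless.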

sveitser · Accepted Answer · 2014-08-29 15:29:17Z

To use mean values for numeric columns and the most frequent value for non-numeric columns you could do something like this. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin


class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)


data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,

before...
     0    1    2
0    a    1    2
1    b    1    1
2    b    2    2
3  NaN  NaN  NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667

Great job. I'm going to use your snippet in xtoy :) If you have any further suggestions, I'd be happy to hear them.
This is great, but if any column has all NaN values, it won't work. Such all-NaN columns should be dropped from the DataFrame.
Great :) I'm going to use this but change it a bit so that it uses mean for floats, median for ints, mode for strings.
DataFrameImputer() raises a "does not have get_params()" attribute error when used in GridSearchCV. The fix is to also inherit from sklearn.base.BaseEstimator.
@mamun The fit_transform method is provided by the TransformerMixin class.
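
The two issues raised in these comments (all-NaN columns and the missing get_params()) could be handled along the following lines. This is only a sketch: the class name SafeDataFrameImputer and the decision to leave all-NaN columns untouched rather than drop them are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SafeDataFrameImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of DataFrameImputer (name chosen for illustration)."""

    def fit(self, X, y=None):
        fill = {}
        for c in X.columns:
            observed = X[c].dropna()
            if observed.empty:
                continue  # all-NaN column: nothing to learn, leave it untouched
            if pd.api.types.is_numeric_dtype(observed):
                fill[c] = observed.mean()                    # numeric: mean
            else:
                fill[c] = observed.value_counts().index[0]   # other: most frequent
        self.fill_ = fill
        return self

    def transform(self, X):
        # fillna with a dict fills each listed column with its own value
        return X.fillna(self.fill_)


X = pd.DataFrame({
    'kind': ['a', 'b', 'b', np.nan],
    'value': [1.0, 1.0, 2.0, np.nan],
    'empty': [np.nan] * 4,
})
imputer = SafeDataFrameImputer()
xt = imputer.fit_transform(X)
params = imputer.get_params()  # provided by BaseEstimator
print(xt)
```

Inheriting from BaseEstimator supplies get_params() and set_params(), which is what GridSearchCV and clone() rely on.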

Austin · Accepted Answer · 2018-01-09 06:51:35Z

You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:

First, (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow) you can have subpipelines for numerical and string/categorical features, where each subpipeline's first transformer is a selector that takes a list of column names (and the full_pipeline.fit_transform() takes a pandas DataFrame):

from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return the selected columns as a NumPy array
        return X[self.attribute_names].values

You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in the num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in the cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
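
Putting the pieces of this answer together might look like the sketch below. It swaps in sklearn.impute.SimpleImputer for both Imputer and CategoricalImputer so that it runs without sklearn-pandas; that substitution, the step names, and the sample DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class DataFrameSelector(TransformerMixin, BaseEstimator):
    """Select a list of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Illustrative data: one numeric and one string column, each with a gap
df = pd.DataFrame({
    'size': [1.0, np.nan, 3.0],
    'kind': pd.Series(['a', np.nan, 'a'], dtype=object),
})

num_pipeline = Pipeline([
    ('select', DataFrameSelector(['size'])),
    ('impute', SimpleImputer(strategy='mean')),
])
cat_pipeline = Pipeline([
    ('select', DataFrameSelector(['kind'])),
    ('impute', SimpleImputer(strategy='most_frequent')),  # stand-in for CategoricalImputer
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])

result = full_pipeline.fit_transform(df)  # columns: imputed 'size', imputed 'kind'
print(result)
```

The result is a plain NumPy array, so column names are lost; numeric columns come first because num_pipeline is listed first.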

prashanth · Accepted Answer · 2018-11-15 10:21:45Z

There is a package, sklearn-pandas, which has an option for imputing categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> import numpy as np
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)

I back this answer; the official sklearn-pandas documentation on the PyPI website mentions this: "CategoricalImputer Since the scikit-learn Imputer transformer currently only works with numbers, sklearn-pandas provides an equivalent helper transformer that do work with strings, substituting null values with the most frequent value in that column." pypi.org/project/sklearn-pandas/1.5.0

Piyush · Accepted Answer · 2018-11-13 06:17:04Z

With sklearn.preprocessing.Imputer, strategy = 'most_frequent' can be used only with quantitative features, not with qualitative ones. The custom imputer below can be used for both. Also, with the scikit-learn Imputer we can either apply it to the whole data frame (if all features are quantitative) or use a for loop over a list of columns of similar type (see the example below). The custom imputer can be used with any combination.
```
from sklearn.preprocessing import Imputer

impute = Imputer(strategy='mean')
for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
    xx[cols] = impute.fit_transform(xx[[cols]])
```

Custom Imputer :

import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.base import TransformerMixin


class CustomImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def transform(self, df):
        X = df.copy()
        impute = Imputer(strategy=self.strategy)
        if self.cols is None:
            self.cols = list(X.columns)
        for col in self.cols:
            if X[col].dtype == np.dtype('O'):
                # Qualitative column: fill with the most frequent value
                X[col] = X[col].fillna(X[col].value_counts().index[0])
            else:
                # Quantitative column: fill with the mean or median
                X[col] = impute.fit_transform(X[[col]])
        return X

    def fit(self, *_):
        return self

Dataframe:

X = pd.DataFrame({'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
                  'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
                  'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like', 'somewhat like', 'dislike'],
                  'quantitative_column': [1, 11, -.5, 10, np.nan, 20]})

            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1            NaN      no           like                 11.0
2         london     NaN  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0

1) Can be used with a list of features of similar type.

cci = CustomImputer(cols=['city', 'boolean'])  # here default strategy = mean
cci.fit_transform(X)

2) Can be used with strategy = median.

sd = CustomImputer(['quantitative_column'], strategy='median')
sd.fit_transform(X)

3) Can be used with the whole data frame; it uses the default mean (or we can change it to median). For qualitative features it uses strategy = 'most_frequent', and for quantitative features mean/median.
```
call = CustomImputer()
call.fit_transform(X)
```

user1367204 · Accepted Answer · 2017-03-17 15:06:14Z

Copying and modifying sveitser's answer, I made an imputer for a pandas.Series object

import numpy
import pandas

from sklearn.base import TransformerMixin


class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.
        """

    def fit(self, X, y=None):
        if X.dtype == numpy.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

To use it you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])

a = SeriesImputer()   # Initialize the imputer
a.fit(s1)             # Fit the imputer
s2 = a.transform(s1)  # Get a new series

Gautham Kumaran · Accepted Answer · 2017-11-07 20:58:41Z

Inspired by the answers here, and for want of a go-to imputer for all use cases, I ended up writing this. It supports four imputation strategies (mean, mode, median, fill) and works on both pd.DataFrame and pd.Series.

mean and median work only for numeric data; mode and fill work for both numeric and categorical data.

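import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.filler = filler  # stored under the parameter name so get_params() works

    def fit(self, X, y=None):
        self.fill = self.filler
        if self.strategy in ['mean', 'median']:
            # X.dtypes is a Series for a DataFrame; a Series has a single dtype
            dtypes = [X.dtype] if isinstance(X, pd.Series) else X.dtypes
            if not all(pd.api.types.is_numeric_dtype(dt) for dt in dtypes):
                raise ValueError('dtypes mismatch: a numeric dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            # A list of fillers is matched to the DataFrame columns by position
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict([(cname, v) for cname, v in zip(X.columns, self.fill)])
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)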

Usage:

>> df
     MasVnrArea FireplaceQu
Id
1         196.0         NaN
974       196.0         NaN
21        380.0          Gd
5         350.0          TA
651         NaN          Gd

>> CustomImputer(strategy='mode').fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          Gd
974       196.0          Gd
21        380.0          Gd
5         350.0          TA
651       196.0          Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
     MasVnrArea FireplaceQu
Id
1         196.0          NA
974       196.0          NA
21        380.0          Gd
5         350.0          TA
651         0.0          Gd

user2458922 · Accepted Answer · 2022-05-15 11:42:16Z

MissForest can be used to impute missing values in a categorical variable along with other categorical features. It works iteratively, similar to IterativeImputer, with a random forest as the base model.

The following code label-encodes the features along with the target variable, fits the model to impute the NaN values, and decodes the features back:

import sys

import pandas as pd
import sklearn.neighbors._base
from sklearn.preprocessing import LabelEncoder

# alias the old module path before importing missingpy
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

from missingpy import MissForest


def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        # encode only the non-null values; nulls stay NaN for the imputer
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders


# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))

scottlittle · Accepted Answer · 2016-07-28 16:32:07Z

This code fills in a series with the most frequent category:

import pandas as pd
import numpy as np

# create fake data
m = pd.Series(list('abca'))
m.iloc[1] = np.nan  # artificially introduce nan

print('m = ')
print(m)

# make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0]


def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x


new_m = m.map(replace_most_common)  # apply function to original data

print('new_m = ')
print(new_m)

Outputs:

m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object

Digvijay · Accepted Answer · 2020-09-23 18:33:41Z

Using sklearn.impute.SimpleImputer instead of Imputer resolves this easily, since it can handle categorical variables.

As per the scikit-learn documentation: If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

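from sklearn.impute import SimpleImputer

impute_size = SimpleImputer(strategy="most_frequent")
# fit_transform is needed here: transform alone fails on an unfitted imputer
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()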
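
As a small self-contained illustration of the documented behaviour, the sketch below fits SimpleImputer on one string column and reuses the learned value on new data. The train and test frames are made up for the example; only the Outlet_Size column name comes from the answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test data with a single string column
train = pd.DataFrame({'Outlet_Size': pd.Series(['Small', 'Medium', 'Medium', np.nan], dtype=object)})
test = pd.DataFrame({'Outlet_Size': pd.Series([np.nan, 'High'], dtype=object)})

imp = SimpleImputer(strategy='most_frequent')
imp.fit(train)                 # learns the most frequent value per column
learned = imp.statistics_[0]   # the value learned for 'Outlet_Size'

filled = imp.transform(test)   # 2-D NumPy array with the same column order
test['Outlet_Size'] = filled.ravel()
print(learned, test['Outlet_Size'].tolist())
```

Fitting on the training data and only transforming the test data keeps the fill value from leaking information out of the test set.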

qAp · Accepted Answer · 2017-07-24 07:46:49Z

Similar. Modify Imputer for strategy='most_frequent':

import pandas as pd
from sklearn.preprocessing import Imputer


class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            # Most frequent value of each column
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value for each column and then pandas.DataFrame.fillna() fills missing values with these. Other strategy values are still handled the same way by Imputer.

sunnyspain1 · Accepted Answer · 2020-02-17 15:50:42Z

You could try the following:

# idxmax() returns the most frequent label; argmax() would return its position
replace = df['<yourcolumn>'].value_counts().idxmax()
df['<yourcolumn>'] = df['<yourcolumn>'].fillna(replace)
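
The same idea extends to several categorical columns at once. The sketch below is one possible way to do it with DataFrame.mode(); the column names and data are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented example data: two categorical columns and one numeric column
df = pd.DataFrame({
    'color': ['red', np.nan, 'red', 'blue'],
    'shape': ['sq', 'sq', np.nan, 'ci'],
    'n': [1.0, 2.0, np.nan, 4.0],
})

cat_cols = ['color', 'shape']  # the columns to impute with their mode

# mode() returns the most frequent value(s) per column; row 0 holds the first mode
most_frequent = df[cat_cols].mode().iloc[0]

# fillna with a Series fills each column with the value under its own label
df[cat_cols] = df[cat_cols].fillna(most_frequent)
print(df)
```

The numeric column n is deliberately left alone here, so its NaN is still present afterwards.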

GSA · Accepted Answer · 2023-05-27 16:29:48Z

This is my attempt at multiple imputation, based on @Gautham Kumaran's ideas. It uses the mode ("most frequent") to replace missing categorical values and then does multiple imputation via regression for numeric variables.

# Missing-value imputation
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.base import BaseEstimator, TransformerMixin


# class for missing data imputation
# =============================================================
class MVImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        self.cat_cols_ = list(X.select_dtypes(include=['object', 'category', 'bool']).columns)
        self.num_cols_ = list(X.select_dtypes(include='number').columns)
        if self.cat_cols_:
            # Most frequent value of each categorical column
            self.cat_fill_ = X[self.cat_cols_].mode().iloc[0]
        if self.num_cols_:
            num = X[self.num_cols_]
            # Keep imputed values within each column's observed range;
            # sample_posterior=True lets different seeds give different draws
            self.num_imputer_ = IterativeImputer(
                max_iter=10, random_state=self.random_state, sample_posterior=True,
                min_value=num.min().to_numpy(), max_value=num.max().to_numpy())
            self.num_imputer_.fit(num)
        return self

    def transform(self, X, y=None):
        X = X.copy()
        if self.cat_cols_:
            X[self.cat_cols_] = X[self.cat_cols_].fillna(self.cat_fill_)
        if self.num_cols_:
            X[self.num_cols_] = self.num_imputer_.transform(X[self.num_cols_])
        return X


# call for single imputed dataframe
imp = MVImputer()
imp.fit_transform(df)

# multiple imputed dict of dataframes, one per seed
mvi = {}
for i in range(3):
    imp = MVImputer(random_state=i)
    mvi[i] = imp.fit_transform(df)
