I'm trying to learn how to implement MICE in imputing missing values for my datasets. I've heard about fancyimpute's MICE, but I also read that sklearn's IterativeImputer class can accomplish similar results. From sklearn's docs:

Our implementation of IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations) [1], but differs from it by returning a single imputation instead of multiple imputations. However, IterativeImputer can also be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True

I've seen "seeds" being used in different pipelines, but I never understood them well enough to implement them in my own code. I was wondering if anyone could explain and provide an example on how to implement seeds for a MICE imputation using sklearn's IterativeImputer? Thanks!

1 Comment
If you are willing to forgo sklearn, you can try miceforest. Commented Dec 16, 2021 at 15:13

2 Answers


IterativeImputer's behavior can change depending on its random state. The random state, which you can set explicitly, is also called a "seed".

As the documentation states, we can get multiple imputations by setting sample_posterior to True and varying the random seed, i.e. the random_state parameter.

Here is an example of how to use it:

import numpy as np
# this import is required to enable IterativeImputer, which is still experimental
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

X_train = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
X_test = [[np.nan, 2], [np.nan, np.nan], [np.nan, 6]]

# each seed (random_state) yields a different imputation
for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    imp.fit(X_train)
    print(f"imputation {i}:")
    print(np.round(imp.transform(X_test)))

It outputs:

imputation 0:
[[ 1.  2.]
 [ 5. 10.]
 [ 3.  6.]]
imputation 1:
[[1. 2.]
 [0. 1.]
 [3. 6.]]
imputation 2:
[[1. 2.]
 [1. 2.]
 [3. 6.]]

We can observe the three different imputations.


3 Comments

Would it be correct to pool the three imputations into a single set? If so, how would you accomplish this? I'm probably misunderstanding your explanation, but it looks like I would be creating 3 different datasets, each representing a different imputation seed.
It is indeed creating 3 different datasets. How to use them depends on your final task (classification, regression, etc., or just inferring the missing values of your features). I would suggest asking another question, and it probably belongs on Cross Validated rather than Stack Overflow.
@GlennG. were you able to figure out how to pool the datasets into a single dataset? I am also currently in the same position, and would like to fill the missing values in my features.

A way to go about stacking the data might be to change @Stanislas' code around a bit like so:

mvi = {}  # just my preference for a dict; you can use a list too
# mvi collects each imputed dataset into the dict, keyed 0 through 2
for i in range(3):
    imp = IterativeImputer(max_iter=10, random_state=i, sample_posterior=True)
    mvi[i] = np.round(imp.fit_transform(X_train))

Combine the imputations into a single dataset using either of the following:

import pandas as pd

# a. pandas concat (wrap each numpy array in a DataFrame first)
stacked_df = pd.concat([pd.DataFrame(a) for a in mvi.values()], axis=0)

# b. numpy stack
dfs = np.stack(list(mvi.values()), axis=0)

pd.concat produces a 2D result; np.stack, on the other hand, creates a 3D array that you can reshape into 2D. The numpy 3D array breaks down as follows (a quick shape check is sketched after the list):

  • axis 0: number of imputed datasets
  • axis 1: number of rows in the original data
  • axis 2: number of columns in the original data
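
Here is a minimal, self-contained sketch of that shape check; the 5x2 arrays are hypothetical stand-ins for the three imputed datasets collected in mvi above:

import numpy as np

# hypothetical stand-ins for the three 5x2 imputed arrays collected in mvi
mvi = {i: np.full((5, 2), float(i)) for i in range(3)}

dfs = np.stack(list(mvi.values()), axis=0)
print(dfs.shape)  # (3, 5, 2): (imputed datasets, rows, columns)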

Create a 2D array from the 3D array

You can use numpy reshape like so:

# collapse the first two axes: (3, 5, 2) -> (3*5, 2)
np.reshape(dfs, newshape=(dfs.shape[0]*dfs.shape[1], -1))

which means you essentially multiply axis 0 by axis 1 to stack the dataframes into one big dataframe. The -1 at the end just means: use whatever axis is left over, in this case the columns.
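
As a quick sanity check, continuing the hypothetical stack from the sketch above, the reshape keeps the imputations in order, so the first block of rows in the 2D result is imputation 0:

import numpy as np

# hypothetical 3 x 5 x 2 stack, as in the earlier sketch
dfs = np.stack([np.full((5, 2), float(i)) for i in range(3)], axis=0)

stacked = np.reshape(dfs, (dfs.shape[0] * dfs.shape[1], -1))
print(stacked.shape)                        # (15, 2): 3 imputations * 5 rows
print(np.array_equal(stacked[:5], dfs[0]))  # True: rows 0-4 are imputation 0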

