Multiple imputation by chained equations (MICE), implemented from scratch.
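For orientation, the chained-equations idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the code in this repository: the name `mice_sketch` and its parameters are hypothetical, each column is modelled with ordinary least squares, and noise is drawn from the residual variance only (a full MICE implementation would also draw the regression parameters from their posterior).

```python
import numpy as np


def mice_sketch(X, n_iterations=10, seed=0):
    """Return one completed copy of X (2-D float array with NaNs).

    Hypothetical helper for illustration only; assumes every column
    has at least one observed value.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initialise every missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])

    # Step 2: cycle through the columns, re-imputing each from the others.
    for _ in range(n_iterations):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            obs_j = ~miss_j
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])

            # Least-squares fit on the rows where column j was observed.
            beta, *_ = np.linalg.lstsq(design[obs_j], filled[obs_j, j], rcond=None)
            resid = filled[obs_j, j] - design[obs_j] @ beta
            dof = max(obs_j.sum() - design.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)

            # Predict the missing rows and add residual noise, so that
            # different seeds give different plausible values.
            noise = rng.normal(0.0, sigma, miss_j.sum())
            filled[miss_j, j] = design[miss_j] @ beta + noise
    return filled


# Toy usage: m = 5 imputed datasets obtained from different seeds.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
X_demo[:, 2] += X_demo[:, 0]                      # make columns correlated
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan   # knock out ~20% of entries
completed = [mice_sketch(X_demo, seed=s) for s in range(5)]
```

Running the sketch with several seeds yields several completed datasets that agree on the observed entries and differ only in the imputed ones, which is what the diagnostic plots later in this README compare.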
Load the iris data from scikit-learn and introduce missing values with the pyampute package:

    from sklearn.datasets import load_iris
    from pyampute.ampute import MultivariateAmputation

    iris = load_iris(as_frame=True, return_X_y=False)["data"]
    ma = MultivariateAmputation()
    X_amp = ma.fit_transform(iris.to_numpy())  # pyampute requires the input as a NumPy array

Now we can apply MICE to the amputed dataset:

    from src import mice

    # Pass the amputed array created above
    imp = mice.mice(X_amp, n_iterations=20, m_imputations=10, seed=42)

After imputation, make diagnostic plots and compare the distribution of each multiply imputed dataset with the complete case data. Below is the plot for the example we provide in the /tests directory:

    import seaborn as sns
    import matplotlib.pyplot as plt

    p = 3  # column to be plotted

    # Proxy lines for the legend, one per data source
    custom_lines = [plt.Line2D([0], [0], color="red", lw=4),
                    plt.Line2D([0], [0], color="grey", lw=4),
                    plt.Line2D([0], [0], color="blue", lw=4)]

    fig, ax = plt.subplots()

    # One thin density curve per imputed dataset
    for m in range(len(imp)):
        sns.kdeplot(imp[m][:, p], label="Imputed", color="black", lw=0.2, ax=ax)

    # Density of the observed values in the amputed data
    sns.kdeplot(X_amp[:, p], label="Missing", color="blue", ax=ax)
    # df: the complete DataFrame before amputation, from the /tests example
    sns.kdeplot(df.to_numpy()[:, p], label="Complete", color="red", ax=ax)

    plt.xlabel("Age (years)")
    ax.legend(custom_lines, ["Complete", "Imputed", "Missing"], loc="upper left")
    plt.savefig("qol_distribution_mice.png")

This is a low-performance implementation meant for pedagogical purposes only. It has several limitations and leaves room for improvement. For research, please use one of the available packages for multiple imputation:
