$\begingroup$

I'm currently working with a dataset from a molecular epidemiology study involving controls and cases for a cardiovascular event. The dataset includes several categorical health- and lifestyle-related variables such as smoking status, educational level, diabetes diagnosis, and hypercholesterolemia. Some of these variables have missing values, and I'm exploring the use of the missForest R package for imputation.

I understand that missForest is a nonparametric method that uses random forests to impute missing values, supporting both categorical and continuous variables.
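To make the question concrete, here is a minimal sketch of how I would call it (assuming a data frame `df` in which the categorical variables are already coded as factors, which missForest requires; the variable names are placeholders for my own):

```r
library(missForest)

set.seed(42)
imp <- missForest(df)    # random-forest imputation of all NAs in df
df_imputed <- imp$ximp   # the completed data set
imp$OOBerror             # out-of-bag error: NRMSE for continuous variables,
                         # PFC (proportion falsely classified) for factors
```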

However, I have two key questions:

  1. Is there a recommended threshold for the proportion of missing values beyond which imputation (even with a robust method like missForest) may become unreliable or invalid? For example, is there a state-of-the-art or commonly accepted threshold, such as < 5%?

  2. My data are longitudinal, with repeated measurements that include nonlinearities and interactions, particularly with time and group. Would missForest be considered an appropriate method? Note that these variables may not be missing completely at random, but I would use the remaining available variables to impute the missing ones, e.g., diabetes, hypertension, and smoking status to impute hypercholesterolemia.

Any practical advice, theoretical insights, or references would be very helpful. Thank you!

$\endgroup$

1 Answer

$\begingroup$

Instead of focusing on a particular way to impute missing data, make sure that you understand the underlying issues. I know of two highly useful, freely available resources on missing data. Based on your description of the data, I suspect that you will need to do multiple imputation.

Chapter 3 of Frank Harrell's Regression Modeling Strategies is a helpful, concise overview. Section 3.11 summarizes the general issues. In particular:

  • "Reason for missings [are] more important than number of missing values."
  • "Extreme amount of missing data does not prevent one from using multiple imputation, because alternatives are worse."
  • Only doing a single imputation of missing values (as I think that missForest provides by default) can lead to bias in coefficient estimates and low estimates of coefficient standard errors (Table 3.1)
  • If $f$, the fraction of cases with any missing data, is greater than 0.03, construct $100 f$ imputed data sets (with a minimum of 5 sets). For imputation, "predictive mean matching is usually preferred."
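As a sketch of how these recommendations translate into code with the mice package (assuming a data frame `df`; the outcome `event` and the predictors in the model formula are placeholders for your own analysis model):

```r
library(mice)

f <- mean(!complete.cases(df))   # fraction of cases with any missing data
m <- max(5, ceiling(100 * f))    # Harrell's rule: ~100f imputed sets, minimum 5

# mice defaults to predictive mean matching ("pmm") for numeric variables
# and logreg/polyreg for factors
imp  <- mice(df, m = m, seed = 42)
fits <- with(imp, glm(event ~ smoking + diabetes + hypercholesterolemia,
                      family = binomial))
pool(fits)   # combine estimates and standard errors via Rubin's rules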

Stef van Buuren's Flexible Imputation of Missing Data (FIMD) is an extensive reference, by a respected authority who developed the R mice package for multiple imputation. As you talk about time as a variable, it seems that you have repeated measurements on the same individuals. Chapter 7 on Multilevel Multiple Imputation should help point you in the right direction in that case. Section 7.3.1 outlines the issues that can arise, depending on the nature of the data and the analysis you ultimately will conduct.

Questions to address, from Table 7.1 of FIMD:

  1. Will the complete-data model include random slopes?
  2. Will the data contain systematically missing values?
  3. Will the distribution of the residuals be non-normal?
  4. Will the error variance differ over clusters?
  5. Will there be small clusters?
  6. Will there be a small number of clusters?
  7. Will the complete-data model have cross-level interactions?
  8. Will the dataset be very large?

The answer to the question you pose here will depend on your answers to those questions about your data and intended analysis.

There is not one super-method that will address all such issues. In practice, we may need to emphasize certain issues at the expense of others. In order to gauge the complexity of the imputation task for a particular dataset and model, ask yourself the questions listed in Table 7.1. If your answer to all questions is "NO", then there are several methods for multilevel MI that are available in standard software. If many of your answers are "YES", the situation is less clear-cut, and you may need to think about the relative priority of the questions in light of the needs of the application.

The rest of Chapter 7 illustrates ways to proceed.
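As a hedged sketch of what a multilevel imputation might look like in mice (assuming long-format data with a subject identifier `id` and a continuous target `chol`; the `2l.pan` method additionally requires the pan package to be installed):

```r
library(mice)

# In the predictor matrix, -2 flags the column that defines the clusters
# (here, the subject identifier) for the row being imputed
pred <- make.predictorMatrix(df)
pred["chol", "id"] <- -2

meth <- make.method(df)
meth["chol"] <- "2l.pan"   # linear mixed-effects imputation model

imp <- mice(df, method = meth, predictorMatrix = pred, m = 20, seed = 42)
```

Which `2l.*` method is appropriate depends on your answers to the Table 7.1 questions; Chapter 7 of FIMD walks through the alternatives.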

$\endgroup$
  • $\begingroup$ Saying random forest imputation does just a single imputation might be misleading, Ref.: stats.stackexchange.com/a/296021/163114 $\endgroup$ Commented May 24 at 15:41
  • $\begingroup$ The problem with missing values in longitudinal data is that, when making imputations for the same individuals, the imputed data change across imputations. It is quite strange, because the auxiliary variables used for the imputation are the categorical ones that don't change over time. This is why I'm quite skeptical about the methodology $\endgroup$ Commented May 26 at 8:57
  • $\begingroup$ @jay.sf although missForest provides error estimates for the imputed variables, it's not clear that it provides a way to use those error estimates for correcting standard errors of regression coefficients in downstream work. For applying Rubin's rules downstream, as I understand it you would either have to take multiple samples from the multivariate error distributions or run missForest multiple times to get multiple imputed data sets. $\endgroup$ Commented May 26 at 15:56
  • $\begingroup$ @JavierHernando the whole point of multiple versus single imputation is to allow the values of imputed variables to vary among the imputed data sets. That variability represents the uncertainty in the values of the imputed variables, and is taken into account when applying Rubin's rules to combine the separate modeling results from the multiple imputed sets to get adjusted standard errors for coefficients. In general, the imputation shouldn't just depend on categorical auxiliary variables but on all variables that might include information about missingness, including outcome values. $\endgroup$ Commented May 26 at 16:39
  • 1
    $\begingroup$ @JavierHernando if the problem is that there are some imputed values that are known to be constant over time but the imputation method is having them change over time, then you could just impute the value for one time and treat the values at the other times as derived variables equal to that value. $\endgroup$ Commented May 26 at 16:44
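One way to implement the suggestion in the last comment for long-format data (assuming an imputed data frame `df_imputed` with a subject identifier `id` and a time-constant variable, here hypothetically called `educ`):

```r
library(dplyr)

# Propagate the imputed baseline value of a time-constant variable
# to all other rows of the same subject
df_fixed <- df_imputed %>%
  group_by(id) %>%
  mutate(educ = first(educ)) %>%
  ungroup()
```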
