I'm currently working with a dataset from a molecular epidemiology study involving controls and cases for a cardiovascular event. The dataset includes several categorical health and lifestyle variables, such as smoking status, educational level, diabetes diagnosis, and hypercholesterolemia. Some of these variables have missing values, and I'm exploring the use of the missForest R package for imputation.
I understand that missForest is a nonparametric method that uses random forests to impute missing values, supporting both categorical and continuous variables.
However, I have two key questions:
1. Is there a recommended threshold for the proportion of missing values beyond which imputation (even with a robust method like missForest) may become unreliable or invalid? For example, is there a state-of-the-art or accepted threshold such as < 5%?
2. My data are longitudinal, with repeated measurements that include nonlinearities and interactions, particularly in relation to time and group. Would missForest be considered an appropriate method here? The values may not be missing completely at random, but my plan is to use the remaining available variables to impute the missing ones, e.g. diabetes, hypertension, and smoking status to impute hypercholesterolemia.
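To make the setup concrete, here is a minimal sketch of the kind of call I have in mind. It runs on simulated data, not the study data: the variable names, the sample size, and the 10% missingness introduced completely at random with prodNA() are all assumptions for illustration. Because the simulated variables are independent of each other, the error estimates say nothing about how well this would work on real data.

```r
library(missForest)

set.seed(1)
n <- 200

# Hypothetical toy data (NOT the study dataset); all variables are factors,
# which is what missForest expects for categorical columns.
dat <- data.frame(
  smoking   = factor(sample(c("never", "former", "current"), n, replace = TRUE)),
  education = factor(sample(c("primary", "secondary", "tertiary"), n, replace = TRUE)),
  diabetes  = factor(sample(c("no", "yes"), n, replace = TRUE)),
  hyperchol = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Introduce 10% missing values completely at random (assumption for the demo).
dat_mis <- prodNA(dat, noNA = 0.1)

# Proportion of missing values per variable.
colMeans(is.na(dat_mis))

# Impute; variablewise = TRUE returns one out-of-bag error per variable.
imp <- missForest(dat_mis, ntree = 100, variablewise = TRUE)

# Out-of-bag error estimate (proportion of falsely classified entries for factors).
imp$OOBerror

# Completed data frame.
head(imp$ximp)
```

Note that this treats every row as independent, so it ignores the repeated-measures structure entirely, which is part of what I'm asking about.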
Any practical advice, theoretical insights, or references would be very helpful. Thank you!