I am in a situation where I have multiple variables with missing values measured at time $t_0$, and others measured at time $t_1$, which can be several years later. I need to impute the missing values in the variables at $t_0$, for which I will use any information available at that time. Does it also make sense to use information from variables that come later in time? I am aware of the discussions around whether to include the response variable, but in this case there is a clear directionality in the relationship between the variables. The goal (downstream analysis) is to estimate causal effects of some exposures at time $t_0$ on some outcomes at time $t_1$. The outcomes are used in the imputation process to impute other variables, but are not themselves imputed.
- $\begingroup$ Are you doing prediction or causal work? What is the purpose of your model? $\endgroup$ – Kuku, Jan 31, 2025 at 12:05
- $\begingroup$ The causality tag in the description suggests it is the latter; in that case please expand the context of your problem to give a better answer (e.g. what do you want to estimate) $\endgroup$ – Kuku, Jan 31, 2025 at 12:40
- $\begingroup$ My main answer (which has been thoroughly edited now) has maybe strayed too far away by answering the broader general question in the title rather than your specific problem (can answers be migrated?). In the specific case you mention, do the variables at $t_1$ have no missing values? Can you argue for a MAR assumption in your case? $\endgroup$ – Kuku, Feb 5, 2025 at 23:13
2 Answers
I think it is appropriate to include later information in the imputation process because it will improve the accuracy of the estimates. The goal in imputation is to substitute the "best" values for the missing ones, along with the "best" estimate of their variance (done through multiple imputation).
We aren't using the later information to predict the earlier; we are trying to figure out what values we would have gotten if we had had the data. We should use any information we can to improve those estimates.
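To make the benign case concrete, here is a small sketch (an editorial illustration with an assumed data generating process, not from the original post): missingness in $X$ depends only on a fully observed later measurement $M$, so the complete-case mean of $X$ is biased, while even a simple imputation model that uses $M$ recovers it.

```r
# Illustrative sketch (assumed setup): target is the mean of x, and
# missingness in x depends only on a later, fully observed measurement m.
set.seed(1)
n <- 100000
x <- rnorm(n)                 # true mean is 0
m <- x + rnorm(n)             # later variable, correlated with x
x_obs <- x
x_obs[m > 1] <- NA            # MAR: missingness depends on observed m only

mean(x_obs, na.rm = TRUE)     # complete-case mean, biased downwards

fit <- lm(x_obs ~ m)          # imputation model that uses the later m
x_imp <- ifelse(is.na(x_obs),
                predict(fit, newdata = data.frame(m = m)),
                x_obs)
mean(x_imp)                   # close to the true mean of 0
```

This is single (regression) imputation for brevity; in practice you would use multiple imputation to also propagate the imputation uncertainty.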
- $\begingroup$ @Kuku even if you want to make causal inference, what Peter stated is correct under the missing at random assumption, which is what you typically assume with multiple imputation. $\endgroup$ – Dimitris Rizopoulos, Jan 31, 2025 at 12:16
- $\begingroup$ @PeterFlom we do not have further information from the original problem to know whether it could be a mediator or not (it could be a mediator for the effect of the exposure at time zero on an outcome at time two). So, in that sense, I am objecting both to the particular and the general case. $\endgroup$ – Kuku, Jan 31, 2025 at 12:30
- $\begingroup$ @Kuku I would argue that if a variable is relevant to impute the missing values, it should be included in the imputation model. In the analysis model, you can exclude mediators. $\endgroup$ – Dimitris Rizopoulos, Jan 31, 2025 at 12:40
- $\begingroup$ @PeterFlom You can certainly do imputation in a purely predictive setting. But when it is in a causal inference setting, you need to consider the DAG (see for example Karthika Mohan's work here). How else do you define the variables required to satisfy the MAR assumption if not through an (implicit) causal model? Correlation is certainly not sufficient (and not even necessary in the general sense!) $\endgroup$ – Kuku, Jan 31, 2025 at 13:15
- $\begingroup$ Interesting conversation. The ultimate goal is doing causal inference using observational data. But here I am only talking about imputation. $\endgroup$ – wrong_path, Jan 31, 2025 at 13:42
Short answer
Yes, as Karthika Mohan concludes "missing data is a causal inference problem"$^1$. As such, the structural causal model of the problem matters, and that includes temporal causal relationships.
In short, you should note that:
- The MCAR/MAR/MNAR taxonomy gives an incomplete picture and can be misleading (e.g. some MNAR problems can be solved with complete case analysis).
- You should not blindly use all available information.
- You should not use only association measures (such as correlations) to justify inclusion of a variable in an imputation model.
Examples below illustrate each one of these points. The full generality of these problems is captured by the literature on m-graphs (see Graphical Models for Processing Missing Data (Mohan & Pearl, 2021)), which notes that a proper answer to any problem will require a causal DAG that includes respective missingness indicators and an explicit target estimand you care about.
There have been great advances in the last decade in the missing data literature, but the adoption of these findings has been slow, especially for fields outside the Pearl camp of causal inference such as statistics, social sciences, economics and epidemiology$^2$.
Long answer
Common misconception
You will find in multiple sources the notion that the goal of multiple imputation is to maximize the predictive power of the imputation model. This is wrong, and is wrong even if your aim is only to obtain descriptive statistics such as the mean of a variable.
Let's clarify the disagreement with the other answer and comments. The answer by Peter Flom states that
I think it is appropriate to include later information in the imputation process because it will improve the accuracy of the estimates. [...] We should use any information we can to improve those estimates.
However, I argue this is not correct, since whether later information improves the estimates or not depends on the causal model. Dimitris responds that:
even if you want to make causal inference, [the quote above] is correct under the missing at random assumption
But this is not a contradiction of my caveat, since the only way to make a missing at random assumption is based on an (explicit or implicit) causal model. Hence, the following comment by Peter is not correct:
I don't think you need a causal model to decide what is relevant. Correlation is enough. You don't need causation. You can do imputation when there is no hint of causation at all.
Since imputation depends on a MAR assumption, and there is no way to argue for a MAR assumption without positing causes of the missingness mechanism (i.e. what missingness 'depends on'), a causal model is necessary. As Judea Pearl quips: "A fisherman enters a restaurant, orders a fried fish and tells his friend: You see, some fish need no catching." The examples below show how correlations can lead to error.
Then there are these two extracts in comments by Dimitris (emphasis mine):
MAR states that all available observed information should be used when imputing.
The question is whether the probability that you have missing data on $X$ depends on $M$. If it does, then you need it in the imputation model
If all available observed information has to be used, it seems to be irrelevant which observed variables cause missing data (they will be in the imputation model nonetheless). Instead, the reasoning goes:
- I think missingness depends only on these observed variables $\boldsymbol{M}$, so I will assume MAR holds
- Since MAR holds, I can do multiple imputation
- To get correct results with multiple imputation, I need to use all available observed information
The first statement is a causal one, and it can be made explicit with an m-graph. The third statement is partly true: it is sufficient but not necessary (depending on the DAG), but it is true that for any MAR$^3$ problem the joint distribution of all variables can be recovered by (equation 34.4 in Mohan, 2022, p. 659):
$$ \operatorname{P}(V_{\text{observed}}, V_{\text{missing}}) = \operatorname{P}(V^{*} \mid V_{\text{observed}}, R = 0) \operatorname{P}(V_{\text{observed}}) $$

where $V_{\text{observed}}$ and $V_{\text{missing}}$ are the variables with no missing values and the variables with some missing values, respectively; $R$ are the missingness mechanisms or indicators for the variables with some missing values; and $V^{*}$ is the proxy variable that denotes the available values of the variables with some missing values.
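To illustrate this recovery formula numerically, here is a small discrete sketch (a toy setup of my own, not from the paper): a binary $Z$ plays the role of $V_{\text{observed}}$, missingness in $X$ depends on $Z$ only, and averaging the available-case estimate of $\operatorname{P}(X \mid Z, R = 0)$ over $\operatorname{P}(Z)$ recovers $\operatorname{P}(X)$, while the naive available-case mean does not.

```r
# Toy MAR example (assumed setup): missingness in x depends only on observed z.
set.seed(42)
n <- 200000
z <- rbinom(n, 1, 0.5)                        # fully observed variable
x <- rbinom(n, 1, ifelse(z == 1, 0.8, 0.2))   # true P(x = 1) = 0.5
r <- rbinom(n, 1, ifelse(z == 1, 0.6, 0.1))   # missingness indicator, depends on z only
x_star <- ifelse(r == 1, NA, x)               # proxy variable V*

mean(x_star, na.rm = TRUE)                    # naive available-case estimate, biased

# Recovery formula: sum over z of P(x = 1 | z, R = 0) * P(z)
p_z1 <- mean(z)
recovered <- mean(x_star[z == 1], na.rm = TRUE) * p_z1 +
             mean(x_star[z == 0], na.rm = TRUE) * (1 - p_z1)
recovered                                     # close to the true P(x = 1) = 0.5
```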
Examples
Collider example
The present example shows a scenario where multiple imputation using all available data leads to bias whereas complete case analysis does not.
First, let us take the perspective of a researcher who only has access to the observed data.
```r
> head(cbind(X, M, Y))
               X         M         Y
[1,] -0.56047565 0.8201498  4.477585
[2,]          NA 2.2733152 11.306199
[3,]  1.55870831 1.8900208  5.345983
[4,]  0.07050839 0.8457453  3.522221
[5,]  0.12928774 1.2122557  5.486042
[6,]          NA 3.9800753 11.293618
```

The misleading predictive perspective says we have three variables, two of which are fully observed, and thus we have to use all available information to maximize the accuracy of our imputed values. An expert in the field tells us that there is an unobserved variable, $U$, that is the true driver of missingness, but that $Y$ has a correlation of above 0.95 with it, so it is a good proxy.
Encouraged by this solid justification, we perform multiple imputation using $Y$ and $M$ in the imputation model rather than using 'outdated' complete case analysis methods. Turns out, we have massively biased our results:
```r
set.seed(123)

# Generate data according to a structural model ---------------------------
N <- 100000L
X <- rnorm(N, mean = 0, sd = 1)
U <- rnorm(N, mean = 1, sd = 1)
M <- X + 2*U + rnorm(N, mean = 0, sd = 0.4) # M can be seen as a future measurement of X
Y <- X + 4*U + rnorm(N, mean = 0, sd = 0.3) # M is not a mediator of X, X has a direct effect on Y
cor(M, X)
#> [1] 0.4366151

# Full data ---------------------------------------------------------------
# Average Treatment Effect of X on Y
true_ATE <- lm(Y ~ X)$coefficients[["X"]]

# Introduce missingness ---------------------------------------------------
# MAR under unobserved variable U holds
X[U > 2.3] <- NA_real_
head(cbind(X, M, Y))
#>                X        M         Y
#> [1,] -0.56047565 2.085143  4.477585
#> [2,]          NA 5.104063 11.306199
#> [3,]  1.55870831 2.830643  5.345983
#> [4,]  0.07050839 1.792536  3.522221
#> [5,]  0.12928774 2.650160  5.486042
#> [6,]          NA 6.317524 11.293618
cor(X, M, use = "complete")
#> [1] 0.4962392
cor(X, Y, use = "complete")
#> [1] 0.280185
cor(U, Y, use = "complete")
#> [1] 0.9677232

# Average Treatment Effect of complete case
cc_ATE <- lm(Y ~ X)$coefficients[["X"]]

# Multiple imputation with all observed data
DF <- data.frame(Y = Y, X = X, M = M)
library("mice")
imp <- mice(DF, m = 20L, printFlag = FALSE)
mi_fit <- mipo(with(imp, lm(Y ~ X)))
mi_ATE <- mi_fit$pooled$estimate[[2]]

true_ATE - mi_ATE # Bias from multiple imputation
#> [1] -0.1747555
true_ATE - cc_ATE # Bias from complete case analysis
#> [1] -0.003026327
```

We see that both $M$ and $Y$ were colliders between $U$ and $X$: i.e. $X \rightarrow M \leftarrow U$ and $X \rightarrow Y \leftarrow U$. Using them in the imputation model opens the collider path, introducing bias into our estimate, even if we do not use them in the outcome model!
Even more critically, now that we can see the data generating process, we see that MAR under $U$ holds, yet the complete case analysis is unbiased! This goes against typical MCAR/MAR/MNAR guidance, but it is a result that follows clearly from the m-graph of this data generating process, showing that a structural causal perspective is necessary.
However, while we say MAR under $U$ holds from the perspective of the data generating process, from the perspective of the researcher $U$ is a latent variable so that MNAR holds, as Dimitris rightly notes.
MAR under M holds, but M is a collider
EDIT: The last comment in this answer by Dimitris asks:
Do you have an example in which missingness depends on observed variables (i.e., variables that will be available in the imputation step), and using them in the imputation will lead to bias
We construct this last example where MAR under $M$ holds, so that the missingness mechanism depends structurally on $M$, yet imputing with $M$ introduces bias. For simplicity, we make it so that $X$ has no causal effect on $Y$, i.e. the true ATE is zero.
```r
set.seed(123)

# Generate data according to a structural model ---------------------------
N <- 100000L
U <- rnorm(N, mean = 1, sd = 1) # Common cause of M and Y
X <- rnorm(N, mean = 0, sd = 1)
M <- X + 10*U + rnorm(N, mean = 0, sd = 0.4)
Y <- 5*U + rnorm(N, mean = 0, sd = 0.3) # M is not a mediator of X, X has no causal effect on Y
cor(M, X)
#> [1] 0.09722705

# Average Treatment Effect of X on Y
true_ATE <- 0L

# Introduce missingness ---------------------------------------------------
# MAR under observed variable M
X[M > 1.5] <- NA_real_
sum(is.na(X)) / length(X) # Proportion missing
#> [1] 0.80158
head(cbind(X, M, Y))
#>       X         M         Y
#> [1,] NA  4.775869  2.175708
#> [2,] NA  9.201718  4.062499
#> [3,] NA 24.918396 12.818329
#> [4,] NA 10.480321  5.017092
#> [5,] NA 11.375845  5.251577
#> [6,] NA 28.415660 13.804082
cor(X, M, use = "complete")
#> [1] 0.05508296
cor(X, Y, use = "complete")
#> [1] -0.1546432
cor(U, Y, use = "complete")
#> [1] 0.9982

# Average Treatment Effect of complete case
cc_ATE <- lm(Y ~ X)$coefficients[["X"]]

# Multiple imputation with M
library("mice")
DF <- data.frame(X = X, M = M)
imp <- mice(DF, m = 20L, printFlag = FALSE)
mi_fit <- mipo(with(imp, lm(Y ~ X)))
mi_ATE <- mi_fit$pooled$estimate[[2]]

true_ATE - mi_ATE # Bias from multiple imputation
#> [1] -2.339173
true_ATE - cc_ATE
#> [1] 0.3729012
```

The m-graph would show that $M$ is both a collider between $X$ and $U$ and a cause of the missingness mechanism for $X$, $R_X$, i.e.: $X \rightarrow M \leftarrow U$ and $R_X \leftarrow M$. Imputing with $M$, even if it is not used in the outcome model, opens the collider path, biasing the results. Dimitris correctly notes that using all available information provides the correct answer, but that is only because MAR holds in reality, not because we ought always to use all available information (as the previous example showed).
$^1$ As concluded in Mohan 'Causal Graphs for Missing Data: A Gentle Introduction' (2022, p. 666).
$^2$ Recently in epidemiology there has been some derivative work by Moreno-Betancur and colleagues on m-graphs via 'canonical causal diagrams' (however, note the errata).
$^3$ Strictly speaking, v-MAR is the assumption, which is relative to graphs rather than events and thus slightly different than MAR (see Mohan & Pearl, 2021).
- $\begingroup$ This is single imputation and does not use all data. You get the correct result if you do multiple imputation with all available information. In steps: (1) `DF <- data.frame(Y = Y, X = X, M = M)`; (2) `library("mice"); imp <- mice(DF, m = 20L, printFlag = FALSE)`; (3) `mipo(with(imp, lm(Y ~ X)))`. MAR states that all available observed information should be used when imputing. $\endgroup$ – Dimitris Rizopoulos, Jan 31, 2025 at 14:31
- $\begingroup$ @DimitrisRizopoulos You are right this example ended up being too simple. Will edit later to show an m-graph with a collider where multiply imputing with $M$ and $Y$ leads to biased results versus complete case analysis. $\endgroup$ – Kuku, Jan 31, 2025 at 15:10
- $\begingroup$ @DimitrisRizopoulos have expanded with a new example and conclusion addressing your concerns, hopefully it's clear. $\endgroup$ – Kuku, Jan 31, 2025 at 16:13
- $\begingroup$ I think this does not cover it. The missingness depends on $U$, which is unavailable during imputation (i.e., in the dataset `DF`). This means that the missing data mechanism is MNAR. Multiple imputation works with MAR. Hence, the bias stems from not assuming the correct missing data mechanism. $\endgroup$ – Dimitris Rizopoulos, Jan 31, 2025 at 16:48
- $\begingroup$ @Kuku, thanks for this additional example. However, I'm afraid that it does not address my concerns. As previously mentioned, when MAR holds, all available information must be used, meaning also the outcome $Y$. Actually, it is known that not including the outcome results in bias. If in your example you set `DF <- data.frame(X = X, M = M, Y = Y)` it works, and has lower bias than the complete case analysis. $\endgroup$ – Dimitris Rizopoulos, Feb 5, 2025 at 15:00

