Short answer

Yes, as Karthika Mohan concludes "missing data is a causal inference problem"$^1$. As such, the structural causal model of the problem matters, and that includes temporal causal relationships.

In short, you should note that:

  • The MCAR/MAR/MNAR taxonomy gives an incomplete picture and can be misleading (e.g. some MNAR problems can be solved correctly with complete case analysis).
  • You should not blindly use all available information.
  • You should not use association measures (such as correlations) to justify inclusion of a variable in an imputation model.

Examples below illustrate each of these points. The full generality of these problems is captured by the literature on m-graphs (see Graphical Models for Processing Missing Data (Mohan & Pearl, 2021)), which shows that a proper answer to any such problem requires a causal DAG that includes the respective missingness indicators, together with an explicit statement of the target estimand you care about.

There have been great advances in the missing data literature over the last decade, but adoption of these findings has been slow, especially in fields outside the Pearl tradition of causal inference, such as statistics, the social sciences, economics and epidemiology$^2$.

Long answer

Common misconception

You will find in multiple sources the notion that the goal of multiple imputation is to maximize the predictive power of the imputation model. This is wrong, and is wrong even if your aim is only to obtain descriptive statistics such as the mean of a variable.

Let's clarify the disagreement with the other answer and comments. The answer by Peter Flom states that

I think it is appropriate to include later information in the imputation process because it will improve the accuracy of the estimates. [...] We should use any information we can to improve those estimates.

However, I maintain that this is not correct, since whether later information improves the estimates or not depends on the causal model. Dimitris responds that:

even if you want to make causal inference, [the quote above] is correct under the missing at random assumption

But this does not contradict my caveat, since the only way to make a missing at random assumption is through an (explicit or implicit) causal model. Hence, the following comment by Peter is not correct:

I don't think you need a causal model to decide what is relevant. Correlation is enough. You don't need causation. You can do imputation when there is no hint of causation at all.

Since imputation depends on a MAR assumption, and there is no way to argue for a MAR assumption without positing causes of the missingness mechanisms (i.e. what missingness 'depends on'), a causal model is necessary. As Judea Pearl quips, "A fisherman enters a restaurant, orders a fried fish and tells his friend: You see, some fish need no catching." The examples below show how correlations can lead to error.

Then there are these two extracts in comments by Dimitris (emphasis mine):

MAR states that **all available observed information** should be used when imputing.

The question is whether the probability that you have missing data on $X$ depends on $M$. If it does, then you need it in the imputation model.

If all available observed information has to be used, it seems to be irrelevant which observed variables cause missing data (they will be in the imputation model nonetheless). Instead, the reasoning goes:

  1. I think missingness depends only on these observed variables $\boldsymbol{M}$, so I will assume MAR holds
  2. Since MAR holds, I can do multiple imputation
  3. To get correct results with multiple imputation, I need to use all available observed information

The first statement is a causal one, and it can be made explicit with an m-graph. The third statement is partially true: using all variables is not necessary, in the sense that (depending on the DAG) you can get correct answers without them; but it is true that for any MAR$^3$ problem the joint distribution of all variables can be recovered via (equation 34.4 in Mohan, 2022, p. 659):

$$ \operatorname{P}(V_{\text{observed}}, V_{\text{missing}}) = \operatorname{P}(V^{*} \mid V_{\text{observed}}, R = 0) \operatorname{P}(V_{\text{observed}}) $$

where $V_{\text{observed}}$ and $V_{\text{missing}}$ are the variables with no missing values and the variables with some missing values, respectively; $R$ are the missingness mechanisms (indicators) for the variables with some missing values; and $V^{*}$ are the proxy variables that hold the available values of the variables with some missing values.
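To make equation 34.4 concrete, here is a minimal numeric sketch (a toy setup added purely for illustration, separate from the examples below) with a fully observed $Z$ and a variable $X$ that is MAR given $Z$. A target such as $\operatorname{E}[X]$ can be recovered by combining the complete-case conditional $\operatorname{P}(X \mid Z, R_X = 0)$ with the full distribution of $Z$, exactly as the factorization above suggests:

```r
# Toy illustration of equation 34.4 (not one of the answer's examples)
set.seed(1)
N  <- 100000L
Z  <- rnorm(N)                      # V_observed: fully observed
X  <- 2 * Z + rnorm(N)              # V_missing: will receive missing values
R  <- rbinom(N, 1, plogis(2 * Z))   # missingness indicator; depends only on Z (MAR)
Xs <- ifelse(R == 0, X, NA_real_)   # proxy V*: observed values of X

mean(X)                 # true mean (unknowable in practice), ~0
mean(Xs, na.rm = TRUE)  # naive complete-case mean: biased, because R depends on Z

# Recover E[X] = E_Z[ E[X | Z, R = 0] ], averaging over the *full* distribution of Z
fit <- lm(Xs ~ Z, subset = R == 0)
mean(predict(fit, newdata = data.frame(Z = Z)))  # close to the true mean
```

The same logic underlies multiple imputation: the imputation model estimates $\operatorname{P}(V^{*} \mid V_{\text{observed}}, R = 0)$, and the observed data supply $\operatorname{P}(V_{\text{observed}})$.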

Examples

Collider example

The present example shows a scenario where multiple imputation using all available data leads to bias whereas complete case analysis does not.

First, let us take the perspective of a researcher who only has access to the observed data.

```r
> head(cbind(X, M, Y))
               X        M         Y
[1,] -0.56047565 2.085143  4.477585
[2,]          NA 5.104063 11.306199
[3,]  1.55870831 2.830643  5.345983
[4,]  0.07050839 1.792536  3.522221
[5,]  0.12928774 2.650160  5.486042
[6,]          NA 6.317524 11.293618
```

The misleading predictive perspective says we have three variables, two of which are fully observed, and thus we have to use all available information to maximize the accuracy of our imputed values. An expert in the field tells us that there is an unobserved variable, $U$, that is the true driver of missingness, but that $Y$ has a correlation of above 0.95 with it, so it is a good proxy.

Encouraged by this solid justification, we perform multiple imputation using $Y$ and $M$ in the imputation model rather than relying on 'outdated' complete case analysis. It turns out we have massively biased our results:

```r
set.seed(123)

# Generate data according to a structural model ---------------------------
N <- 100000L
X <- rnorm(N, mean = 0, sd = 1)
U <- rnorm(N, mean = 1, sd = 1)
M <- X + 2*U + rnorm(N, mean = 0, sd = 0.4) # M can be seen as a future measurement of X
Y <- X + 4*U + rnorm(N, mean = 0, sd = 0.3) # M is not a mediator of X, X has a direct effect on Y
cor(M, X)
#> [1] 0.4366151

# Full data ---------------------------------------------------------------
# Average Treatment Effect of X on Y
true_ATE <- lm(Y ~ X)$coefficients[["X"]]

# Introduce missingness ---------------------------------------------------
# MAR under unobserved variable U holds
X[U > 2.3] <- NA_real_
head(cbind(X, M, Y))
#>                X        M         Y
#> [1,] -0.56047565 2.085143  4.477585
#> [2,]          NA 5.104063 11.306199
#> [3,]  1.55870831 2.830643  5.345983
#> [4,]  0.07050839 1.792536  3.522221
#> [5,]  0.12928774 2.650160  5.486042
#> [6,]          NA 6.317524 11.293618
cor(X, M, use = "complete")
#> [1] 0.4962392
cor(X, Y, use = "complete")
#> [1] 0.280185
cor(U, Y, use = "complete")
#> [1] 0.9677232

# Average Treatment Effect of complete case
cc_ATE <- lm(Y ~ X)$coefficients[["X"]]

# Multiple imputation with all observed data
library("mice")
DF <- data.frame(Y = Y, X = X, M = M)
imp <- mice(DF, m = 20L, printFlag = FALSE)
mi_fit <- mipo(with(imp, lm(Y ~ X)))
mi_ATE <- mi_fit$pooled$estimate[[2]]

true_ATE - mi_ATE # Bias from multiple imputation
#> [1] -0.1747555
true_ATE - cc_ATE # Bias from complete case analysis
#> [1] -0.003026327
```

We see that both $M$ and $Y$ are colliders between $X$ and $U$: i.e. $X \rightarrow M \leftarrow U$ and $X \rightarrow Y \leftarrow U$. Using them in the imputation model opens the collider path, introducing bias into our estimate, even though we do not use them in the outcome model!

Even more critically, now that we can see the data-generating process, we see that MAR under $U$ holds, yet the complete case analysis is unbiased! This goes against typical MCAR/MAR/MNAR guidance, but it is a result that can be read off the m-graph (sketched below), showing that a structural causal perspective is necessary.

Note, however, that MAR under $U$ holds only from the perspective of the data-generating process; from the perspective of the researcher, $U$ is a latent variable, so MNAR holds, as Dimitris rightly notes.
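As a rough check of the graphical claim, the m-graph for this example can be written down and queried for d-separation. This is only a sketch assuming the dagitty package, which the code above does not use:

```r
library("dagitty")

# m-graph for the collider example: X -> M <- U, X -> Y <- U, U -> R_X
g <- dagitty("dag {
  U [latent]
  X -> M ; U -> M
  X -> Y ; U -> Y
  U -> R_X
}")

dseparated(g, "X", "U")        # TRUE: X and U are marginally independent
dseparated(g, "X", "U", "M")   # FALSE: conditioning on the collider M opens the path
dseparated(g, "X", "U", "Y")   # FALSE: same for the collider Y
```

Complete case analysis effectively selects on $U$ alone; because $X$ and $U$ are independent, this selection leaves the coefficient of $X$ in the outcome model intact, whereas imputing from the colliders $M$ or $Y$ does not.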

MAR under M holds, but M is a collider

EDIT: The last comment on this answer by Dimitris asks:

Do you have an example in which missingness depends on observed variables (i.e., variables that will be available in the imputation step), and using them in the imputation will lead to bias

We construct a final example in which MAR under $M$ holds, so that the missingness mechanism depends structurally on $M$, and yet imputing with $M$ introduces bias. For simplicity, we make it so that $X$ has no causal effect on $Y$, i.e. the true ATE is zero.

```r
set.seed(123)

# Generate data according to a structural model ---------------------------
N <- 100000L
U <- rnorm(N, mean = 1, sd = 1) # Common cause of M and Y
X <- rnorm(N, mean = 0, sd = 1)
M <- X + 10*U + rnorm(N, mean = 0, sd = 0.4)
Y <- 5*U + rnorm(N, mean = 0, sd = 0.3) # M is not a mediator of X, X no causal effect on Y
cor(M, X)
#> [1] 0.09722705

# Average Treatment Effect of X on Y
true_ATE <- 0L

# Introduce missingness ---------------------------------------------------
# MAR under observed variable M
X[M > 1.5] <- NA_real_
sum(is.na(X)) / length(X) # Proportion missing
#> [1] 0.80158
head(cbind(X, M, Y))
#>       X         M         Y
#> [1,] NA  4.775869  2.175708
#> [2,] NA  9.201718  4.062499
#> [3,] NA 24.918396 12.818329
#> [4,] NA 10.480321  5.017092
#> [5,] NA 11.375845  5.251577
#> [6,] NA 28.415660 13.804082
cor(X, M, use = "complete")
#> [1] 0.05508296
cor(X, Y, use = "complete")
#> [1] -0.1546432
cor(U, Y, use = "complete")
#> [1] 0.9982

# Average Treatment Effect of complete case
cc_ATE <- lm(Y ~ X)$coefficients[["X"]]

# Multiple imputation with M
library("mice")
DF <- data.frame(X = X, M = M)
imp <- mice(DF, m = 20L, printFlag = FALSE)
mi_fit <- mipo(with(imp, lm(Y ~ X)))
mi_ATE <- mi_fit$pooled$estimate[[2]]

true_ATE - mi_ATE # Bias from multiple imputation
#> [1] -2.339173
true_ATE - cc_ATE
#> [1] 0.3729012
```

The m-graph shows that $M$ is both a collider between $X$ and $U$ and a cause of the missingness mechanism for $X$, $R_X$, i.e.: $X \rightarrow M \leftarrow U$ and $M \rightarrow R_X$. Imputing with $M$, even if it is not used in the outcome model, opens the collider path and biases the results. Dimitris correctly notes that using all available information provides the correct answer, but that is only because MAR holds in reality, not because we should always use all available information (as the previous example showed).
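The same kind of d-separation check can be sketched for this second m-graph (again assuming the dagitty package, which the code above does not use):

```r
library("dagitty")

# m-graph for the second example: X -> M <- U, U -> Y, M -> R_X
g2 <- dagitty("dag {
  U [latent]
  X -> M ; U -> M
  U -> Y
  M -> R_X
}")

dseparated(g2, "X", "U")          # TRUE: marginally independent
dseparated(g2, "X", "U", "M")     # FALSE: M is a collider between X and U
dseparated(g2, "X", "U", "R_X")   # FALSE: R_X is a descendant of the collider M
```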


$^1$ As concluded in Mohan 'Causal Graphs for Missing Data: A Gentle Introduction' (2022, p. 666).

$^2$ Recently in epidemiology there has been some derivative work by Moreno-Betancur and colleagues on m-graphs via 'canonical causal diagrams' (however, note the errata).

$^3$ Strictly speaking, v-MAR is the assumption, which is relative to graphs rather than events and thus slightly different than MAR (see Mohan & Pearl, 2021).
