I have a large dataset of 300,000 records, each representing a customer, and a variable holding their incomes.
Since there were missing values, I used Multiple Imputation by Chained Equations (the mice package in R) to impute them, obtaining 10 imputed datasets from the original one.
When I plotted the densities of the Observed (blue) and Imputed (red) values for each of the 10 imputed datasets separately, I got the following plot:
When I then proceeded to average the imputed incomes across the 10 imputed datasets and again plotted the Observed (blue) and averaged Imputed (red) values, I got a different picture:
It is clear that the distribution of the averaged Income has shifted and become more akin to a normal distribution.
My understanding is that this is a manifestation of the Central Limit Theorem. However, I have difficulty articulating how exactly the Central Limit Theorem applies here.
I have 11 vectors, 1 observed and 10 imputed, each consisting of thousands of values. Row-wise, each value across the 11 datasets corresponds to the same customer and is drawn from the same conditional distribution (where the conditioning factors are other variables correlated with Income, such as Profession, Region, Age, etc.).
Now, the Central Limit Theorem states that the standardized sum of i.i.d. random variables converges in distribution to the normal distribution. Which are the i.i.d. random variables here? More generally, why does the row-wise average of the imputed datasets tend toward a normal distribution?
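To make the question concrete, here is a minimal simulation of the effect I am describing. It is not my actual R/mice workflow; it is a hypothetical NumPy sketch where each customer's 10 "imputed" values are independent draws from a skewed conditional distribution around a customer-specific mean. Averaging the 10 draws per row shrinks the within-customer noise variance by a factor of 10 and visibly reduces the skewness, which is the shift toward normality seen in my second plot:

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers, m = 100_000, 10  # customers, number of imputations (hypothetical sizes)

# Customer-specific conditional means (standing in for the effect of
# Profession, Region, Age, etc.) plus skewed lognormal "imputation" noise.
mu = rng.normal(50_000, 10_000, size=n_customers)
draws = mu[:, None] + rng.lognormal(mean=9.0, sigma=1.0, size=(n_customers, m))

single = draws[:, 0]           # one imputed dataset
averaged = draws.mean(axis=1)  # row-wise average over the m imputations


def skewness(x):
    """Sample skewness: third standardized central moment."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)


# Averaging shrinks the noise variance by ~1/m and pulls the
# distribution of the averages toward the (normal) mu component.
print(f"variance: single={single.var():.3e}  averaged={averaged.var():.3e}")
print(f"skewness: single={skewness(single):.2f}  averaged={skewness(averaged):.2f}")
```

The averaged values have both smaller variance and much smaller skewness than any single imputed dataset, even though each row is averaged over only 10 draws.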

