I have a large dataset of 300,000 records, each representing a customer, and a variable holding their incomes.
Since there were missing values, I used Multiple Imputation by Chained Equations (the mice package in R) to impute them, obtaining 10 imputed datasets from the original one.
When I plotted the densities of the Observed (blue) and Imputed (red) values for each of the 10 imputed datasets separately, I got the following plot:
When I then proceeded to average the imputed incomes across the 10 imputed datasets and again plotted the Observed (blue) and averaged Imputed (red) values, I got a different picture:
It is clear that the distribution of the averaged Income has shifted and become more akin to a normal distribution.
My understanding is that this is a manifestation of the Central Limit Theorem. However, I have difficulty articulating how exactly the Central Limit Theorem applies here.
I have 11 vectors, 1 observed and 10 imputed, each consisting of thousands of values. Row-wise, each value across the 11 datasets corresponds to the same customer and is drawn from the same conditional distribution (where the conditioning factors are other variables correlated with Income, such as Profession, Region, Age, etc.).
Now, the Central Limit Theorem states that the standardized sum of i.i.d. random variables converges in distribution to the normal distribution. Which are the i.i.d. random variables here? More generally, why does the row-wise average of the imputed datasets tend toward a normal distribution?
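To make the question concrete, here is a minimal simulation of the effect I am describing. It is not my actual R/mice workflow; it is a hypothetical NumPy sketch where each customer's 10 "imputed" values are independent draws from a skewed conditional distribution around a customer-specific mean. Averaging the 10 draws per row shrinks the within-customer noise variance by a factor of 10 and visibly reduces the skewness, which is the shift toward normality seen in my second plot:

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers, m = 100_000, 10  # customers, number of imputations (hypothetical sizes)

# Customer-specific conditional means (standing in for the effect of
# Profession, Region, Age, etc.) plus skewed lognormal "imputation" noise.
mu = rng.normal(50_000, 10_000, size=n_customers)
draws = mu[:, None] + rng.lognormal(mean=9.0, sigma=1.0, size=(n_customers, m))

single = draws[:, 0]           # one imputed dataset
averaged = draws.mean(axis=1)  # row-wise average over the m imputations


def skewness(x):
    """Sample skewness: third standardized central moment."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)


# Averaging shrinks the noise variance by ~1/m and pulls the
# distribution of the averages toward the (normal) mu component.
print(f"variance: single={single.var():.3e}  averaged={averaged.var():.3e}")
print(f"skewness: single={skewness(single):.2f}  averaged={skewness(averaged):.2f}")
```

The averaged values have both smaller variance and much smaller skewness than any single imputed dataset, even though each row is averaged over only 10 draws.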

