Imputing values with linear regression, valid strategy or creating biases?

Question

I am practicing on the Titanic competition from Kaggle. In the dataset the Age variable has a number of missing values, and I have to decide how to handle them.

I suspect that Age, Gender and class are the leading variables ('Women and children first!').

Dropping the rows with missing age values could skew the dataset, as the missing values might not be evenly distributed (for example, they could be mainly third class passengers).

Replacing them with the mean or median does not seem much better, as I fear that filling in a large number of identical values might weaken the existing correlations.

I was thinking of training a regressor to impute the missing age values. I was able to get an R2 score of around 54% and most of the predictions seemed to be within a few years of the real age, but most of the literature I've read does not mention using regression and instead defaults to mean / median.
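
A minimal sketch of regression-based imputation along these lines. It runs on synthetic data rather than the Kaggle file: the `pclass`, `fare` and `age` columns, the coefficients that generate them, and the helper name `regression_impute` are all illustrative assumptions, and plain least squares stands in for whichever regressor is actually used.

```python
import numpy as np

def regression_impute(X, y):
    """Fill NaNs in y with least-squares predictions from the columns of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    # Add an intercept column, then fit only on rows where y is observed.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    filled = y.copy()
    filled[missing] = A[missing] @ coef
    return filled

# Synthetic stand-in for the Titanic columns (assumed, not the real data).
rng = np.random.default_rng(0)
n = 500
pclass = rng.integers(1, 4, size=n)
fare = rng.gamma(2.0, 15.0, size=n)
age = 45.0 - 6.0 * pclass + 0.05 * fare + rng.normal(0.0, 10.0, size=n)

# Age is missing more often in third class, mirroring the concern above.
p_missing = np.where(pclass == 3, 0.4, 0.1)
age_observed = np.where(rng.random(n) < p_missing, np.nan, age)

X = np.column_stack([pclass, fare])
age_filled = regression_impute(X, age_observed)
```

Every imputed age lies exactly on the fitted regression surface, so the filled column has less spread than the true ages would have.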

Is there a reason for this? Am I wrong in my assessment that a regression based imputation would be vastly preferable to the mean / median approach?

Christopher Blier-Wong · Accepted Answer · 2018-08-23 13:45:00Z

You have a case of data not missing at random. One solution to this problem is stochastic imputation with a regression, where you sample from the residuals of your regression and add them to the predicted values. See

Multiple Imputation and its Application, Wiley, 2012.
Statistical Analysis with Missing Data, Second Edition, Wiley, 2014

for other techniques to deal with missing data. Note that when data are not missing at random, the correlation between variables will change and the variance of the imputed attribute may also shrink. This effect is lessened if stochastic imputation is applied.
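
A minimal sketch of the difference between deterministic and stochastic regression imputation, under stated assumptions: the data are synthetic, the helper `stochastic_regression_impute` is a hypothetical name, and the noise is drawn by resampling the observed residuals of a single-predictor least-squares fit. It is one simple way to realise the idea, not the procedure from the books cited above.

```python
import numpy as np

def stochastic_regression_impute(X, y, rng):
    """Return (deterministic, stochastic) imputations of the NaNs in y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    # Fit least squares with an intercept on the observed rows only.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A[~missing], y[~missing], rcond=None)
    residuals = y[~missing] - A[~missing] @ coef
    predicted = A[missing] @ coef

    # Deterministic: plug in the fitted value.
    deterministic = y.copy()
    deterministic[missing] = predicted

    # Stochastic: fitted value plus a residual resampled from the observed rows.
    noise = rng.choice(residuals, size=int(missing.sum()), replace=True)
    stochastic = y.copy()
    stochastic[missing] = predicted + noise
    return deterministic, stochastic

# Synthetic data (assumed): y depends on x, and missingness depends on x.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
y_true = 2.0 * x + rng.normal(0.0, 3.0, size=n)
p_missing = np.where(x > 0, 0.5, 0.2)
y_observed = np.where(rng.random(n) < p_missing, np.nan, y_true)

y_det, y_sto = stochastic_regression_impute(x.reshape(-1, 1), y_observed, rng)
var_det = np.var(y_det)
var_sto = np.var(y_sto)
```

Comparing `var_det` and `var_sto` with `np.var(y_true)` shows the effect: the deterministic fill loses the residual spread for the imputed rows, while the stochastic fill restores it. A single stochastic draw still treats imputed values as if they were observed, which is the gap that multiple imputation is meant to address.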
