I am practicing on the Titanic competition from Kaggle. In the dataset, the Age variable has a number of missing values, and I need to decide how to handle them.
I suspect that Age, Gender and passenger class are the leading variables ('Women and children first!').
Dropping the rows with missing age values could skew the dataset, as the missing values might not be evenly distributed (for example, they could be mainly third class passengers).
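To check whether the missing ages really are concentrated in one class, something like the following could be used. This is only a sketch: the column names `Age` and `Pclass` follow the Kaggle `train.csv` file, and the handful of rows below are made up for illustration, not taken from the real data.

```python
import numpy as np
import pandas as pd

# Made-up toy rows standing in for train.csv; only the column names follow the Kaggle file.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Age": [38.0, 35.0, np.nan, 27.0, np.nan, np.nan, 22.0, np.nan],
})

# Fraction of rows with a missing Age within each passenger class.
missing_by_class = df["Age"].isna().groupby(df["Pclass"]).mean()
print(missing_by_class)
```

If the fractions differ a lot between classes, dropping the incomplete rows would change the class mix of the training set.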
Replacing them with the mean or median does not seem much better: I fear that filling in a large number of identical values would weaken the existing correlations between Age and the other variables.
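A small experiment on synthetic data illustrates the effect I am worried about. Everything here is an assumption made for the demonstration: `age` and `other` are randomly generated, the 30% missingness rate is arbitrary, and the values are removed completely at random, which may not match the Titanic data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins: `age` and a second variable `other` that depends on it.
age = rng.normal(loc=30, scale=14, size=n)
other = 0.5 * age + rng.normal(scale=5, size=n)

# Hide about 30% of the ages at random, then fill them with the observed mean.
missing = rng.random(n) < 0.3
age_filled = age.copy()
age_filled[missing] = age[~missing].mean()

# Compare the correlation before and after mean imputation.
corr_full = np.corrcoef(age, other)[0, 1]
corr_filled = np.corrcoef(age_filled, other)[0, 1]
print(f"correlation with true ages:   {corr_full:.3f}")
print(f"correlation after mean fill:  {corr_filled:.3f}")
print(f"std of true ages: {age.std():.2f}, after mean fill: {age_filled.std():.2f}")
```

The filled rows all sit at one value, so they carry no information about `other` and also shrink the spread of the age column.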
I was thinking of training a regressor to impute the missing age values. I was able to get an R² score of around 0.54, and most of the predictions seemed to be within a few years of the real age, but most of the literature I've read does not mention regression and instead defaults to the mean / median.
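For reference, a minimal sketch of what I mean by regression-based imputation is below. It is not the exact regressor I used: the data are synthetic, the choice of `Pclass` and `SibSp` as predictors and of a plain linear model are assumptions for illustration, and `Age_imputed` is a hypothetical column name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic data: age loosely depends on class and number of siblings/spouses.
pclass = rng.integers(1, 4, size=n)
sibsp = rng.integers(0, 4, size=n)
age = np.clip(45 - 6 * pclass - 3 * sibsp + rng.normal(scale=10, size=n), 1, 80)
df = pd.DataFrame({"Pclass": pclass, "SibSp": sibsp, "Age": age})

# Remove roughly 20% of the ages to simulate missing values.
df.loc[rng.random(n) < 0.2, "Age"] = np.nan

features = ["Pclass", "SibSp"]
known = df["Age"].notna()

# Fit on rows where Age is known, then predict it for the rows where it is missing.
model = LinearRegression().fit(df.loc[known, features], df.loc[known, "Age"])
df["Age_imputed"] = df["Age"]
df.loc[~known, "Age_imputed"] = model.predict(df.loc[~known, features])

print(df["Age_imputed"].isna().sum())
```

Observed ages are left untouched; only the missing entries receive a prediction.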
Is there a reason for this? Am I wrong in my assessment that regression-based imputation would be vastly preferable to the mean / median approach?