
Suppose that I have missing values in one of my features, both in the train and test sets. I want to impute using the median of the observed values of that feature. Should I:

  • A) compute the median on the train set, $m_{\text{train}}$, and use this in place of all missing values for both train and test?

  • B) compute the median on the train set, $m_{\text{train}}$, and use this in place of all missing values for just the train set. Also compute the median on the test set, $m_{\text{test}}$, and use this in place of all missing values for just the test set.

To me, A) seems like a terrible idea because I am using information from the training data to influence my test data, so any test error estimate I get is useless since I am biasing the test data towards the train data. In the extreme case, suppose that most or all of that feature is missing in the test set and I impute using $m_{\text{train}}$; then my estimate of test error (for that feature) is close to zero, even though I have learned absolutely nothing.

B) makes more sense because I am applying a consistent procedure to both train and test data, so the resulting test error estimate should be a better reflection of the true error.

Is there a reference that studies this problem rigorously, or can anyone tell me if A is in fact better than B? Or if there is an even better alternative?
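
For concreteness, here is a minimal sketch of the two options in Python/pandas (the DataFrame names, the single feature `x`, and the toy data are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data with one numeric feature "x"; names and sizes are illustrative.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.normal(size=100)})
test = pd.DataFrame({"x": rng.normal(size=50)})
train.loc[rng.random(100) < 0.2, "x"] = np.nan   # inject missingness
test.loc[rng.random(50) < 0.2, "x"] = np.nan

m_train = train["x"].median()   # median of observed training values
m_test = test["x"].median()     # median of observed test values

# Option A: the training median is used everywhere.
train_a = train["x"].fillna(m_train)
test_a = test["x"].fillna(m_train)

# Option B: each set is imputed with its own median.
train_b = train["x"].fillna(m_train)
test_b = test["x"].fillna(m_test)
```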

  • One uses training info all the time for testing, e.g. when applying the fitted model to the test data... So, option A. Commented Apr 4 at 18:02
  • Imputing is a very technical-sounding word for fabricating data; are you sure you want to do so? Fabricating data gets people in trouble. If data is missing, no matter how smart your algorithm is (and using the median (or the mean) is about the worst way you could do that), you cannot conjure it up. Use your dataset "as is". Commented Apr 4 at 18:26
  • @jginestet I am failing to see the relevance of this comment. There is a whole literature on methods for missing data, and virtually every analysis of collected data will have at least one missing observation. Like it or not, one cannot "do nothing" about missing data - even restricting the analysis to complete cases (the R default) is a method for handling missing data that carries certain assumptions. Commented Apr 4 at 21:40
  • @AdamO, if only a small proportion of the data is missing, imputing or not (e.g. using only complete data) will not make any practical difference; so why impute? If a large proportion is missing, fabricating data will invalidate all further work; so why impute? In addition, one does not need to drop the whole row (listwise deletion), but only the missing datapoints (pairwise deletion). Moreover, blind imputation glosses over the reasons for missingness; the fact that some data is missing may be information in its own right. In the end, the data is the only ground truth one has; any attempt at adding to it is suspect. Commented Apr 4 at 22:03
  • ...ctd. My primary issue is with the tone of such questions, which ask "what is the best way to impute?" as opposed to "is it OK to impute in the first place?". If the latter question were asked, one could get into clarifying "how much of the data is missing", "which specific variables", "why is it missing", "is it missing at random", etc. Imputing is not acceptable at face value (no matter the volume of the literature), and it needs to be rigorously justified. Barring this justification, my answer to the OP's question is "neither". As you noted, there must be a reason complete-case analysis is R's default. Commented Apr 4 at 22:21

2 Answers


You should know that imputing the mean (or the median) is demonstrably the poorest form of imputation: it takes no account of the uncertainty in the missing value, nor of the range of predictions that could arise from the plausible values. The literature on this is extensive. In the context of prediction, distributional approaches such as multiple imputation or expectation maximization use the whole distribution of predictors and responses, and can be shown to give the most precise predictions for both continuous and binary outcomes.
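
As an illustration of the distributional idea (a sketch only, not a prescription from this answer; it assumes scikit-learn and a binary outcome), multiple imputation for prediction can be approximated by drawing several completed datasets, fitting the model on each, and pooling the predictions:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

def pooled_predictions(X_train, y_train, X_test, n_imputations=5):
    """Average predicted probabilities over several stochastic imputations."""
    probs = []
    for seed in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        Xtr = imputer.fit_transform(X_train)   # imputation model fit on the training data only
        Xte = imputer.transform(X_test)        # same fitted imputer applied to the test data
        model = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
        probs.append(model.predict_proba(Xte)[:, 1])
    return np.mean(probs, axis=0)              # pool by averaging across imputations
```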

Having said that, few if any predictive modelers pay due attention to the issue of missing data. They, like many, restrict to complete cases when developing the model and expect that users will supply only complete predictors when applying the model prospectively. The problem is bad enough that many medical experts who use online risk calculators simply "guess" at values where a rigorous treatment would be more useful. A small step up from this is to numerically code missing values so that a kind of imputation is performed, but missingness indicators are problematic: uncommon missingness patterns can produce highly unstable predictions and, depending on the endpoint, can bias all of the predictions.

A good predictive modeler has to think prospectively about how a model will be applied. The model developer has to ask questions such as: in what setting will the model be used? Is it an arithmetic model that will be computed by hand or in Excel, without access to extensive data backends? Or can the model be developed and batch processed using applets on a stable website, where complex imputations can be performed? And what are the relative merits of taking on increasingly complex logistical hurdles?

We rarely have the ability to observe all "test" data in batch. In most examples I can think of, it is unrealistic to expect that, when a user applies your model, they will have reasonable access to a robust dataset with which to perform imputation on their own. And if they do, they risk severe misuse of the prediction. A weather forecast, for instance, does not have future values on hand with which to infill the past.

If your prediction model is good, and your distributions are representative, then there is theoretically no problem with using the "train" set to infill missing values in your "test" set - regardless of the methodology. Put another way, if the "train" set is no good at infilling missing data, then how shall we trust it to be reliable at predicting the response? Second, as alluded to earlier, your model is a kind of "package deal" comprising the imputation strategy as well as the prediction method. If your imputation method is so bad as to adversely affect predictive ability, it is desirable to demonstrate this in terms of poorly performing operating characteristics. Just the same, if missing data are expertly infilled so that reliable and precise predictions are obtained, we expect this model to outperform the earlier example.
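
One way to see the "package deal" in practice (a hypothetical sketch assuming scikit-learn, with `X`, `y` standing in for your development data and NaN marking missing entries): put the imputer inside the model pipeline, so that cross-validation refits it on each training fold, applies it to the held-out fold, and the reported score reflects imputation and classifier together.

```python
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imputer and classifier are refit together on each training fold; the
# held-out fold only ever sees values learned from that training fold.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),        # swap in a richer imputer to compare strategies
    RandomForestClassifier(random_state=0),
)
# scores = cross_val_score(pipe, X, y, cv=5)
```

Comparing such pipelines head to head is one way to demonstrate, in the operating characteristics, whether a given imputation strategy helps or hurts.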

I know of no better reference for all this than Little and Rubin's Statistical Analysis with Missing Data, now in its 3rd edition.

  • Regarding your first point, the discussion of imputation using medians was just for the sake of asking a simple question. A more generic question is whether one should apply the same 'procedure' to impute values in train and test, vs. using training information to impute into the test set. Your third point regarding practical considerations is reasonable but doesn't really answer the question. I'm not sure I follow your second point. Why is there theoretically no problem? Surely the two-stage model renders the final test error estimate worthless without some sort of adjustment for the imputation model. Commented Apr 5 at 1:45
  • I'm not sure I quite follow the sentence 'you should necessarily be penalized in the operator characteristics...' either. I'll check out the reference, thanks. Commented Apr 5 at 1:46
  • @WeakLearner Do read the reference. You are waving your hand about the specific imputation strategy, but then you don't understand how one makes "adjustment for the [imputation] model" - this is why multiple imputation exists. As for the second point - people developing predictive models rarely think about how missing data occur in application. You need to have methods for missing values to render useful predictions - this is the so-called "package deal". It seems hard to make a case that you can observe the future "test" set in batch and re-develop an imputation model. Commented Apr 7 at 15:57
  • @WeakLearner your clarification is good. Please check the rewrite and see if the comment still stands. Commented Apr 16 at 17:24

To just address the question and not discuss whether median imputation makes sense here or not:

The principle of splitting data into training and test sets is that you want to use the test set to evaluate how well prediction based on the training set works. For this to work without bias, you should not use information from the test set, so your option A looks correct. Note that if you apply your predictor in practice to predict one individual unseen observation, you can only use the median of the set on which the predictor was trained, as at that point you don't have any more observations than that.

This real process is mimicked by the training/test set split, and the relevant situation is that you predict an individual observation from the test set based on the predictor trained on the training set. In order to mimic the real situation where you would only rely on the information used for training, you should not use the whole test set for your imputation.
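
A minimal sketch of that procedure (assuming scikit-learn and numeric arrays `X_train`, `X_test` with NaN for missing values; the names are illustrative):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)  # median estimated from the training set only
X_test_imp = imputer.transform(X_test)        # the same training median reused at "prediction time"
```

The same fit-on-train, apply-to-test discipline applies to any other preprocessing learned from the data (scaling, encoding, and so on).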

