You should know that imputing the mean (or the median) is demonstrably the poorest form of imputation. It takes no account of the uncertainty in the missing value, nor of the range of predictions that could arise from the plausible values it might take. The literature on this is extensive. In the context of prediction, distributional approaches such as multiple imputation or expectation maximization use the whole distribution of predictors and responses to obtain the most precise predictions for both continuous and binary outcomes.
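To make the contrast concrete, here is a minimal sketch assuming scikit-learn, with `IterativeImputer(sample_posterior=True)` standing in for a multiple-imputation-style procedure; the simulated data and column relationships are purely illustrative:

```python
# Contrast single mean imputation with a draw-based (multiple-imputation-style)
# approach; data and the 30% missingness rate are hypothetical.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] += 0.8 * X[:, 0]              # relate columns so imputation has signal to use
X[rng.random(200) < 0.3, 2] = np.nan  # 30% missing in the third column

# Single mean imputation: every missing value gets the same number,
# so all uncertainty about the true value is discarded.
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Multiple-imputation style: draw several plausible completed datasets,
# then fit/predict on each and pool the results downstream.
completed = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
spread = np.std([c[np.isnan(X[:, 2]), 2] for c in completed], axis=0)
print("average spread of imputed values across draws:", spread.mean())
```

The nonzero spread across draws is exactly the uncertainty that a single mean fill hides.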
Having said that, few if any predictive modelers pay due attention to the issue of missing data. They, like many, restrict model development to complete cases and expect users to supply complete predictors when applying the model prospectively. The problem is bad enough that many medical experts who use online risk calculators simply guess at values when a rigorous treatment would serve them better. A small step up from this is to numerically code missing values so that a crude form of imputation is performed, as in the sketch below. But missingness indicators are problematic: uncommon missing-data patterns can yield highly unstable predictions and, depending on the endpoint, can introduce bias into all predictions.
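Here is a short sketch of that "code the missingness" approach, assuming scikit-learn; the constant fill value, the classifier, and the simulated data are my own illustrative choices, not anything prescribed above:

```python
# Fill missing values with a constant and append an indicator column per
# feature; rare combinations of indicator columns are the "uncommon missing
# patterns" that can produce unstable predictions.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)
X[rng.random((500, 4)) < 0.2] = np.nan   # scatter missingness across features

model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
```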
A good predictive modeler has to think prospectively about how a model will be applied. The developer has to ask questions such as: In what setting will the model be used? Is it an arithmetic model that will be computed by hand or in Excel, without access to extensive data backends? Or can the model be developed and batch processed through applets on a stable website, where complex imputations can be performed? And what are the relative merits of imposing increasingly complex logistical hurdles on the user?
We rarely have the ability to observe all "test" data in batch. In most examples I can think of, it is unrealistic to expect that, when users apply your model, they will have access to a robust data source with which to perform imputation on their own. And if they do, they risk severe misuse of the prediction. A weather forecast, for instance, cannot draw on future observations to fill in what is missing at the moment the forecast is made.
If your prediction model is good, and your distributions are representative, then there is theoretically no problem with using the "train" set to infill missing values in your "test" set, regardless of the methodology. Put another way: if the "train" set is no good at infilling missing data, why should we trust it to be reliable at predicting the response? Second, as alluded to earlier, your model is a kind of "package deal" comprising the imputation strategy as well as the prediction method itself. If the imputation is bad enough to adversely affect predictive ability, that should show up as poorly performing operating characteristics. Just the same, if missing data are expertly infilled so that reliable and precise predictions are obtained, we expect that model to outperform the one above.
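A minimal sketch of that "package deal", again assuming scikit-learn: the imputer is fit on the training folds only and applied to the held-out fold, so the cross-validated operating characteristics reflect the imputation strategy and the prediction model together. The data generation and the choice of AUC as the metric are illustrative assumptions:

```python
# Evaluate imputer + model as one pipeline so poor imputation shows up
# directly as degraded discrimination.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))
y = (X @ np.array([1.0, 0.8, 0.0, -0.5, 0.3]) + rng.normal(size=600) > 0).astype(int)
X[rng.random((600, 5)) < 0.15] = np.nan

pipeline = make_pipeline(IterativeImputer(random_state=0),
                         LogisticRegression(max_iter=1000))

# The imputer is refit within each training fold, mimicking prospective use.
auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", auc.mean().round(3))
```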
I know of no better reference for all this than Little and Rubin's "Statistical Analysis with Missing Data", now in its 3rd edition.