
I need to regress a continuous y on multi-dimensional X (mostly for prediction, not inference, but I do need the betas to make sense). I only have y for some of the rows, but I have many more rows of X. So there are a lot of rows for which I only know X.

I'm pretty sure the extra data is usable in some form (a better covariance matrix? correcting for sampling error?).

This seems like a common problem, but I haven't been able to find a good text that explains the math and the caveats (most of what I find deals with missing cells in X).


Actually, I'll give a specific example. We're estimating DUPRs for pickleball players (for player feedback, not gambling). Here is my labeled sample's correlation matrix for the DUPR estimate and two strong features (rows in the same order as the columns):

┌───────────────────────────┬──────────────────────────────┬──────────┐
│ dupr_estimated_recentered ┆ total_distance_covered_per_s ┆ 0_speeds │
╞═══════════════════════════╪══════════════════════════════╪══════════╡
│ 1.0                       ┆ 0.751742                     ┆ 0.65232  │
│ 0.751742                  ┆ 1.0                          ┆ 0.706524 │
│ 0.65232                   ┆ 0.706524                     ┆ 1.0      │
└───────────────────────────┴──────────────────────────────┴──────────┘

The high correlation between those two predictors makes them less powerful together.

But on the bigger sample, the correlation between total_distance_covered_per_s and 0_speeds is only 0.3323. So there is some hope that they would work better together as predictors.

(I have a bunch of other features beyond those two.)

  • So you have predictor data in $X$ that you don't have the outcome ($y$) for? (Commented Aug 12 at 13:34)
  • Yes. I put your comment into the post to make it clearer, thanks. (Commented Aug 12 at 13:42)
  • Related question: do you want to impute the missing $Y$'s? (Commented Aug 12 at 13:44)
  • What would it mean to impute them? Is this like pseudo-labeling: regress the known ones, then label the rest, then regress again? I'm worried this could go awry – that's why I'm looking for a good text on doing it properly (if this is the approach to use). (Commented Aug 12 at 13:44)
  • In machine learning this sort of thing is known as "semi-supervised learning". It is more common with classification problems (logistic regression), but I suspect it has been used with continuous regression as well. This scholar search may be useful? scholar.google.com/… (Commented Aug 12 at 14:00)

2 Answers


For what it's worth, point 5 of van Ginkel et al. (2020) discusses "Outcome variables must not be imputed" as a misconception.

Multiple imputation is (as far as I know) the gold standard here. If you're working in R, the mice package is well established and convenient, with a nice website. van Ginkel et al. summarize:

To conclude, using multiple imputation does not confirm an incorrectly assumed linear model any more than analyzing a data set without missing values. Neither does it confirm a linear relationship that only applies to the observed part of the data any more than a biased sample without missing data does. What is important is that, regardless of whether there are missing data, data are inspected in advance before blindly estimating a linear regression model on highly nonlinear data. As previously stated, when this data inspection reveals that there are nonlinear relations in the data, it is important that this nonlinearity is accounted for in both the analysis (by including nonlinear terms) and the imputation process (by including the same nonlinear terms as in the analysis, or by means of PMM).

(This paragraph doesn't explicitly mention that we are imputing the response variable, but it's in the section that discusses that case.)
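
For concreteness, here is a minimal sketch of that workflow in mice, imputing the outcome by predictive mean matching (PMM) and pooling with Rubin's rules. The data frame dat and the column names y, x1, x2 are placeholders for illustration, not from the question:

library(mice)

## Hypothetical data: a data frame 'dat' with outcome y (partly missing)
## and fully observed predictors x1, x2. PMM fills in y by borrowing
## observed y values from rows with similar predicted means, so imputed
## values stay within the support of the observed data.
imp  <- mice(dat, method = "pmm", m = 20, printFlag = FALSE)
fits <- with(imp, lm(y ~ x1 + x2))   # analysis model on each completed set
summary(pool(fits))                  # pool estimates via Rubin's rules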


van Ginkel, Joost R., Marielle Linting, Ralph C. A. Rippe, and Anja van der Voort. 2020. “Rebutting Existing Misconceptions About Multiple Imputation as a Method for Handling Missing Data.” Journal of Personality Assessment 102 (3): 297–308. https://doi.org/10.1080/00223891.2018.1530680.


This is one of those questions where the answer depends on exactly what the question is.

Simplest version

Suppose you had a linear regression with all of your $X$s on $n+m$ people and $Y$ on a random sample of $n$ of them. Let $\sigma^2$ be the residual mean square for the true line. The variance of your regression coefficient estimates (call them $\hat\beta$) would be $n^{-1}E[X^TX]^{-1}\sigma^2$ if you just used the complete cases.

If you did proper multiple imputation and it worked perfectly, your within-imputation variance would be $(n+m)^{-1}E[X^TX]^{-1}\sigma^2$. The between-imputation variance, however, would include a term like $n^{-1}E[X^TX]^{-1}\sigma^2$ for the posterior uncertainty in the model parameters for $Y|X$ in the imputation. You end up not gaining anything, at least in large samples.
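
To spell out that bookkeeping (a sketch; I write $M$ for the number of imputations to avoid a clash with the $m$ unlabeled rows), Rubin's rules give a pooled variance of

$$T \;=\; \bar W + \left(1 + \tfrac{1}{M}\right) B \;\approx\; \underbrace{(n+m)^{-1}E[X^TX]^{-1}\sigma^2}_{\text{within}} \;+\; \underbrace{n^{-1}E[X^TX]^{-1}\sigma^2}_{\text{between}},$$

which is at least as large as the complete-case variance $n^{-1}E[X^TX]^{-1}\sigma^2$ no matter how large $M$ is.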

As an example (in mice):

library(mice)     # mice() and with() for imputation and per-imputation fits
library(mitools)  # MIcombine(), used to pool the fitted models below

XY <- MASS::mvrnorm(500, mu = c(0, 0, 0, 0, 0), Sigma = diag(5) + 1)
X  <- XY[, -1]                        # four correlated, fully observed predictors
Yfull <- Y <- XY[, 1]
Y[1:200] <- NA                        # outcome missing for 200 of the 500 rows
mouse <- mice(cbind(X, Y), m = 50)    # 50 multiply imputed data sets
fits  <- with(mouse, lm(Y ~ X))       # fit the analysis model in each
> MIcombine(fits[[4]])
Multiple imputation results:
      MIcombine.default(fits[[4]])
                results         se
(Intercept) -0.04420686 0.07213510
X1           0.23700481 0.05430861
X2           0.22520806 0.05612435
X3           0.27398454 0.05608058
X4           0.14665648 0.05527157

> coef(summary(lm(Y~X)))
               Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) -0.03942651 0.06353258 -0.6205716 5.353607e-01
X1           0.24226806 0.05374903  4.5073938 9.474943e-06
X2           0.23016556 0.05500893  4.1841488 3.780295e-05
X3           0.26290128 0.05641605  4.6600442 4.792135e-06
X4           0.15634203 0.05704397  2.7407285 6.504224e-03

Variations

  1. Suppose the subsample with $Y$ is not representative but is sampled based on $X$ (and the original $X$ is more representative). In that case you still don't gain anything if the model is correctly specified, but if it's misspecified you get a better fit to the population by using all the $X$s.

  2. Suppose that you have some additional variables $Z$ in addition to $X$ and $Y$ (and that you don't want $Z$ in your analysis model). In that case, using $(Z,X)$ to predict $Y$ will give you genuine additional information and better estimates of $Y|X$.

  3. Suppose you know (or want to assume) that structures in $X$ are aligned with the conditional distribution of $Y|X$. For example, suppose $X$ is a mixture distribution and you think the relationship between $Y$ and $X$ will be different for the components of the mixture. Or suppose you think that $Y$ is likely to be related to the first few principal components of $X$. There is no mathematical reason for these to be true, but there might well be domain reasons (e.g., if the first few principal components of your genetic data measure ancestry). In this setting, using the full data to learn about the structures in $X$ will let you partition the points with known $Y$ into their mixture components, or let you estimate the first few principal components more accurately, and so can be used to get better inference. This is semi-supervised inference in machine learning (a toy sketch follows this list).
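
As a toy sketch of the principal-components version of this idea (the simulation and all names here are made up for illustration, not from the answer): estimate the components on all $n+m$ rows of $X$, then run the supervised step on the $n$ labeled rows only.

## Toy illustration: learn PCs from ALL rows of X (labeled + unlabeled),
## then regress the n observed Y values on the leading component scores.
set.seed(1)
n <- 100; m <- 900; p <- 20
X_all <- matrix(rnorm((n + m) * p), ncol = p)   # labeled + unlabeled X
pc <- prcomp(X_all, scale. = TRUE)              # PCs estimated on all n + m rows
sc <- pc$x[1:n, 1:3]                            # scores for the labeled rows
Y  <- drop(sc %*% c(1, 0.5, -0.5)) + rnorm(n)   # Y driven by the first PCs
summary(lm(Y ~ sc))                             # supervised step on n rows only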

  • Thank you. I added a more specific example to address (partly?) "This is one of those questions where the answer depends on exactly what the question is." (Commented Aug 13 at 11:50)
  • "Suppose $X$ is a mixture distribution and you think the relationship between $Y$ and $X$ will be different for the components of the mixture": could a case of this be that $(Y, X)$ is simply not that nicely distributed? E.g. we have strong players, weak players, and intermediate players; obviously it's a bit of a scale, but it's not that simple: there are different kinds of strength, and many displays of weakness... (Commented Aug 13 at 11:53)
  • If @valya needs formal literature references, see Section 2.7, "When not to use multiple imputation," of van Buuren's book. It addresses many of the points in this answer and provides links to the literature. I'm particularly fond of Variation 2 in this answer. (+1) (Commented Aug 13 at 13:14)
  • @valya Yes, but that's not quite the point I'm making. In general there is no reason why structure in $X$ has to tell you anything useful about $Y|X$, but it certainly can. (Commented Aug 14 at 19:51)
