I need to regress continuous y on multi-dimensional X (mostly for prediction, not inference, but I do need the betas to make sense). I only have y for some of the rows, but many rows of X, i.e. a lot of rows for which I only know X.
I'm pretty sure the extra X-only data is usable in some form (a better covariance matrix? correcting sampling error?).
This seems like a common problem, but I wasn't able to find a good text that explains the math and the caveats (most of what I find deals with missing cells in X, not missing y).
Actually, I'll give the specific example. We're estimating DUPRs for Pickleball players (for player feedback, not gambling). Here's the correlation matrix of two strong features in my labeled sample:
┌───────────────────────────┬──────────────────────────────┬──────────┐
│ dupr_estimated_recentered ┆ total_distance_covered_per_s ┆ 0_speeds │
╞═══════════════════════════╪══════════════════════════════╪══════════╡
│ 1.0                       ┆ 0.751742                     ┆ 0.65232  │
│ 0.751742                  ┆ 1.0                          ┆ 0.706524 │
│ 0.65232                   ┆ 0.706524                     ┆ 1.0      │
└───────────────────────────┴──────────────────────────────┴──────────┘

The high correlation between those two predictors makes them less powerful together.
But in the bigger sample, the correlation between total_distance_covered_per_s and 0_speeds is only 0.3323. So there should be some hope that they'd be more useful together as predictors.
(I have a bunch of other features beyond those two.)
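To make my "better covariance matrix" idea concrete, here's a minimal sketch on simulated data (all names and numbers are made up, not my real features): keep the X'y cross-moment from the labeled rows, but plug in the X'X moment estimated from all rows. I don't know if this is a standard estimator or what its caveats are, which is exactly what I'm asking about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: many rows of X, only a small random subset labeled with y.
n_all, n_lab = 5000, 100
cov_X = np.array([[1.0, 0.33],
                  [0.33, 1.0]])          # "population" feature covariance
X_all = rng.multivariate_normal([0.0, 0.0], cov_X, size=n_all)
beta_true = np.array([1.0, 0.5])

idx = rng.choice(n_all, n_lab, replace=False)
X_lab = X_all[idx]
y_lab = X_lab @ beta_true + rng.normal(0.0, 1.0, n_lab)

# Plain OLS: uses X'X from the labeled rows only.
beta_ols = np.linalg.solve(X_lab.T @ X_lab, X_lab.T @ y_lab)

# Plug-in version: E[XX'] estimated from ALL rows, E[Xy] from labeled rows.
S_all = (X_all.T @ X_all) / n_all        # second-moment matrix from the big sample
c_lab = (X_lab.T @ y_lab) / n_lab        # cross-moment from the labeled sample
beta_plug = np.linalg.solve(S_all, c_lab)

print(beta_ols, beta_plug)
```

One caveat I already suspect: this sketch assumes the labeled rows are a random subsample of the big sample. In my case the feature correlation is 0.75 in the labeled data versus 0.33 overall, which suggests the labeled rows are selected somehow, and then I'd expect mixing moments from the two samples to introduce bias.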