I need to regress continuous y on multi-dimensional X (mostly for prediction, not inference, but I do need the betas to make sense). I only have y for some of the rows, but many rows of X, i.e. a lot of rows for which I only know X.
I'm pretty sure the extra X-only data is usable in some form (a better covariance matrix? correcting sampling error?).
This seems like a common problem, but I wasn't able to find a good text that explains the math and the caveats (most of what I find deals with missing cells in X, not missing y).
Actually, I'll give the specific example. We're estimating DUPRs for Pickleball players (for player feedback, not gambling). Here's the correlation matrix of two strong features in my labeled sample:
┌───────────────────────────┬──────────────────────────────┬──────────┐
│ dupr_estimated_recentered ┆ total_distance_covered_per_s ┆ 0_speeds │
╞═══════════════════════════╪══════════════════════════════╪══════════╡
│ 1.0                       ┆ 0.751742                     ┆ 0.65232  │
│ 0.751742                  ┆ 1.0                          ┆ 0.706524 │
│ 0.65232                   ┆ 0.706524                     ┆ 1.0      │
└───────────────────────────┴──────────────────────────────┴──────────┘

The high correlation between those two predictors makes them less powerful together.
But in the bigger sample, the correlation between total_distance_covered_per_s and 0_speeds is only 0.3323. So there should be some hope that they'd be more useful together as predictors.
(I have a bunch of other features beyond those two.)
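To make my "better covariance matrix" idea concrete, here's a minimal sketch on simulated data (all names and numbers are made up, not my real features): keep the X'y cross-moment from the labeled rows, but plug in the X'X moment estimated from all rows. I don't know if this is a standard estimator or what its caveats are, which is exactly what I'm asking about:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: many rows of X, only a small random subset labeled with y.
n_all, n_lab = 5000, 100
cov_X = np.array([[1.0, 0.33],
                  [0.33, 1.0]])          # "population" feature covariance
X_all = rng.multivariate_normal([0.0, 0.0], cov_X, size=n_all)
beta_true = np.array([1.0, 0.5])

idx = rng.choice(n_all, n_lab, replace=False)
X_lab = X_all[idx]
y_lab = X_lab @ beta_true + rng.normal(0.0, 1.0, n_lab)

# Plain OLS: uses X'X from the labeled rows only.
beta_ols = np.linalg.solve(X_lab.T @ X_lab, X_lab.T @ y_lab)

# Plug-in version: E[XX'] estimated from ALL rows, E[Xy] from labeled rows.
S_all = (X_all.T @ X_all) / n_all        # second-moment matrix from the big sample
c_lab = (X_lab.T @ y_lab) / n_lab        # cross-moment from the labeled sample
beta_plug = np.linalg.solve(S_all, c_lab)

print(beta_ols, beta_plug)
```

One caveat I already suspect: this sketch assumes the labeled rows are a random subsample of the big sample. In my case the feature correlation is 0.75 in the labeled data versus 0.33 overall, which suggests the labeled rows are selected somehow, and then I'd expect mixing moments from the two samples to introduce bias.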