I need to regress continuous y on multi-dimensional X (mostly for prediction, not inference, but I do need the betas to make sense). I only have y for some of the rows, but I have X for many more rows. So there are a lot of rows for which I only know X.

I'm pretty sure the extra data is usable in some form (better covariance matrix? correcting sampling error?).
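To make the "better covariance matrix" idea concrete, here's a minimal sketch of the kind of thing I have in mind (the arrays and numbers below are made-up stand-ins, and it assumes the unlabeled rows come from the same X distribution as the labeled ones): estimate Cov(X) from all rows, estimate Cov(X, y) from the labeled rows only, and plug both into the normal equations.

```python
import numpy as np

# Made-up stand-ins for the real data:
#   X_all  - features for every row (labeled + unlabeled), shape (N, p)
#   X_lab  - features for the labeled rows only, shape (n, p)
#   y_lab  - the known y values for those rows, shape (n,)
rng = np.random.default_rng(0)
X_all = rng.normal(size=(5000, 2))
labeled = rng.choice(5000, size=200, replace=False)
X_lab = X_all[labeled]
y_lab = X_lab @ np.array([0.8, 0.3]) + rng.normal(scale=0.5, size=200)

# Predictor covariance estimated from ALL rows -- this is where the extra X data enters
Sigma_xx = np.cov(X_all, rowvar=False)

# Cross-covariance between X and y, which necessarily uses the labeled rows only
s_xy = np.cov(np.column_stack([X_lab, y_lab]), rowvar=False)[:-1, -1]

# Moment-based slopes: beta = Sigma_xx^{-1} s_xy, plus an intercept from the means
# (feature means taken from the full sample as well)
beta = np.linalg.solve(Sigma_xx, s_xy)
intercept = y_lab.mean() - X_all.mean(axis=0) @ beta
```

Whether that's actually a sensible way to use the extra rows, and what it assumes (e.g. the labeled rows being a random-ish subset of X), is part of what I'm asking.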

This seems like a common problem, but I wasn't able to find a good text that explains the math and the caveats (most of what I find deals with missing cells in X, not missing y).


Here's the specific example: we're estimating DUPRs for pickleball players (for player feedback, not gambling). This is the correlation matrix for two strong features on my labeled sample:

┌───────────────────────────┬──────────────────────────────┬──────────┐
│ dupr_estimated_recentered ┆ total_distance_covered_per_s ┆ 0_speeds │
╞═══════════════════════════╪══════════════════════════════╪══════════╡
│ 1.0                       ┆ 0.751742                     ┆ 0.65232  │
│ 0.751742                  ┆ 1.0                          ┆ 0.706524 │
│ 0.65232                   ┆ 0.706524                     ┆ 1.0      │
└───────────────────────────┴──────────────────────────────┴──────────┘

The high correlation between those two predictors makes them less powerful together.

But on the bigger sample (all the rows with X), the correlation between total_distance_covered_per_s and 0_speeds is only 0.3323, so there's some hope they'd be more useful together as predictors.
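As a back-of-the-envelope check of that hope (a sketch only: it pretends the feature-to-y correlations from my labeled sample would carry over, which is exactly the kind of caveat I'm unsure about), the R^2 of a regression on two standardized predictors is (r1^2 + r2^2 - 2*r1*r2*rho) / (1 - rho^2), so the lower rho would matter a lot:

```python
# r1, r2: correlations of the two features with y (from my labeled sample)
# rho:    correlation between the two features
def r2_two_predictors(r1, r2, rho):
    return (r1**2 + r2**2 - 2 * r1 * r2 * rho) / (1 - rho**2)

r1, r2 = 0.751742, 0.65232
print(r2_two_predictors(r1, r2, 0.706524))  # labeled-sample rho -> ~0.59
print(r2_two_predictors(r1, r2, 0.3323))    # full-sample rho    -> ~0.75
```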

(I have a bunch of other features beyond these two.)
