
Introduction

I haven’t had the chance to work on a real data-science project yet — this would be my first one. I hope someone here can help.

Problem

For past years we have plan data X_p = (X1, X2, X3), and for those same observations we also have the actual granular realizations X_g = (X1, X2, X3, X4, X5, X6, Y), i.e. the richer feature set plus the outcome Y.

Goal

For future years I will only have the plan data X_p = (X1, X2, X3) (no realizations yet). I want to predict/estimate Y for those future plan records.

Naive proposal

Bring the granular data X_g = (X1, X2, X3, X4, X5, X6, Y) to the same level as X_p by creating aggregated or learned features Z from the granular inputs, so that the enriched representation becomes: X_g → (X1, X2, X3, Z1, Z2, ..., Zp, Y).

Fit a model g that maps the plan features to those aggregated features: g: (X1, X2, X3) → (Z1, ..., Zp). (So for a future plan record you can produce estimated Ẑ = g(X_p).)

Fit a second model f that predicts the target from plan features and the aggregated features: f: (X1, X2, X3, Z1..Zp) → Y. In production (e.g. for 2026) use f(X_p, g(X_p)) to predict Y. This is a two-stage approach.
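To make this concrete, here is a minimal sketch of the two stages in Python with scikit-learn. Everything in it is a stand-in: the synthetic data, the choice Z = (X4, X5, X6), and the random-forest models are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-ins for the historical data: X_plan holds the plan
# features X1..X3, X_gran the extra granular columns X4..X6.
rng = np.random.default_rng(0)
n = 500
X_plan = rng.normal(size=(n, 3))
X_gran = 0.7 * X_plan + rng.normal(scale=0.5, size=(n, 3))  # correlated with the plan
y = (X_plan @ np.array([1.0, -0.5, 0.2])
     + X_gran @ np.array([0.8, 0.1, 0.3])
     + rng.normal(scale=0.1, size=n))

# Step 1: aggregate the granular inputs into Z. Here Z is trivially the
# granular columns themselves; in practice it could be sums, ratios, or
# learned features.
Z = X_gran

# Step 2: g maps the plan features to the aggregated features Z.
g = MultiOutputRegressor(RandomForestRegressor(n_estimators=200, random_state=0))
g.fit(X_plan, Z)

# Step 3: f predicts Y from the plan features plus Z. It is trained on the
# estimated Z-hat = g(X_plan), not the true Z, so that training matches
# what f will actually receive in production.
Z_hat = g.predict(X_plan)
f = RandomForestRegressor(n_estimators=200, random_state=0)
f.fit(np.hstack([X_plan, Z_hat]), y)

# Production (e.g. a 2026 plan record): only X1..X3 are known.
X_future = rng.normal(size=(1, 3))
y_pred = f.predict(np.hstack([X_future, g.predict(X_future)]))
```

One design choice worth noting: f is fit on Ẑ = g(X_p) rather than on the true Z, so that the second stage sees the same kind of (noisier) inputs during training that it will see at prediction time.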

Question

  1. Which methods or approaches can I apply to this problem, where I want to learn from rich granular data but must make predictions using only limited plan features? Are there papers, tutorials, or step-by-step guides that walk through this kind of setup? Are there specific keywords or search terms I could use to find more literature or discussions on this type of problem?

  2. How do I decide which Z features are “good”? Some candidate Z1 might be very predictive of Y but hard to estimate from X_p; another candidate Z2 might be less predictive of Y but easy to learn from X_p.
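To illustrate that trade-off, here is a rough diagnostic I could imagine running (again a sketch with scikit-learn; score_candidate is a hypothetical helper, not an established procedure): for each candidate Z_j, measure both how learnable it is from X_p and how predictive it is of Y.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def score_candidate(X_plan, z_j, y, cv=5):
    """Two cross-validated R^2 scores for one candidate feature z_j:
    learnability = how well z_j can be predicted from the plan features,
    relevance    = how well z_j alone predicts Y."""
    learnability = cross_val_score(
        RandomForestRegressor(n_estimators=100, random_state=0),
        X_plan, z_j, cv=cv, scoring="r2").mean()
    relevance = cross_val_score(
        RandomForestRegressor(n_estimators=100, random_state=0),
        z_j.reshape(-1, 1), y, cv=cv, scoring="r2").mean()
    return learnability, relevance
```

For example, score_candidate(X_plan, Z[:, 0], y) would score the first candidate. Presumably the criterion that matters in the end is end-to-end, i.e. whether adding Ẑ = g(X_p) improves cross-validated prediction of Y over using X_p alone, but per-candidate scores like these might help narrow the search.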

Comment (Sep 25 at 14:26): I'd be tempted to explore the following baseline models first, to get a sense of how predictive X_p is vs. X_g. [1] A model that uses only X_p for both training and inference: how well does it perform? [2] A model that uses X_g for both training and inference: how much better does it do? If it's a lot better, which feature(s) drive the improvement? This gives you a good baseline (presuming X_p is decent) and also insight into whether X_g is worth the added complexity, or whether you can approximate its key feature(s) and add them to X_p.
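A minimal sketch of that baseline comparison, reusing the hypothetical arrays from the sketch above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# [1] plan features only vs. [2] plan + granular features, both on
# historical data where X_gran is available (X_plan, X_gran, y as above).
score_plan = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X_plan, y, cv=5, scoring="r2").mean()
score_full = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=0),
    np.hstack([X_plan, X_gran]), y, cv=5, scoring="r2").mean()
print(f"X_p only: {score_plan:.3f}   X_p + granular: {score_full:.3f}")
```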
