Introduction
I haven’t had the chance to work on a real data-science project yet — this would be my first one. I hope someone here can help.
Problem
For past years we have plan data X_p = (X1, X2, X3) and we also have the actual granular realizations for those same observations: X_g = (X1, X2, X3, X4, X5, X6, Y) (i.e. the richer feature set plus the outcome Y).
Goal
For future years I will only have the plan data X_p = (X1, X2, X3) (no realizations yet). I want to predict/estimate Y for those future plan records.
Naive Proposal
1. Bring the granular data X_g = (X1, X2, X3, X4, X5, X6, Y) to the same level as X_p by creating aggregated or learned features Z from the granular inputs, so that the enriched representation becomes: X_g → (X1, X2, X3, Z1, Z2, ..., Zp, Y).
2. Fit a model g that maps the plan features to those aggregated features: g: (X1, X2, X3) → (Z1, ..., Zp). For a future plan record this produces the estimate Ẑ = g(X_p).
3. Fit a second model f that predicts the target from the plan features and the aggregated features: f: (X1, X2, X3, Z1, ..., Zp) → Y. In production (e.g. for 2026), use f(X_p, g(X_p)) to predict Y. This is a two-stage approach.
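The two-stage proposal can be sketched as follows. This is only a minimal illustration on synthetic data, assuming NumPy and plain least squares for both g and f. The data-generating numbers, the choice of two Z features, and the helper names `fit_linear` and `predict_linear` are all assumptions made for the example, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the historical data; every number here is made up.
n = 500
X_p = rng.normal(size=(n, 3))                                      # plan features X1..X3
Z = X_p @ rng.normal(size=(3, 2)) + 0.5 * rng.normal(size=(n, 2))  # aggregated features Z1, Z2
y = (X_p @ np.array([1.0, -2.0, 0.5])
     + Z @ np.array([0.8, -1.2])
     + 0.1 * rng.normal(size=n))

def fit_linear(A, B):
    """Least-squares fit with an intercept; returns the coefficient array."""
    A1 = np.column_stack([A, np.ones(len(A))])
    coef, *_ = np.linalg.lstsq(A1, B, rcond=None)
    return coef

def predict_linear(coef, A):
    """Apply coefficients from fit_linear to new rows."""
    return np.column_stack([A, np.ones(len(A))]) @ coef

# Pretend the last 100 rows are "future" records where only X_p is known.
train, future = slice(0, 400), slice(400, n)

# Stage 1: g maps plan features to the aggregated features.
g = fit_linear(X_p[train], Z[train])
# Stage 2: f maps (plan features, actual Z) to Y on historical data.
f = fit_linear(np.column_stack([X_p[train], Z[train]]), y[train])

# Prediction for future records: f(X_p, g(X_p)).
Z_hat = predict_linear(g, X_p[future])
y_hat_two_stage = predict_linear(f, np.column_stack([X_p[future], Z_hat]))

# Reference: a single model trained directly on X_p.
direct = fit_linear(X_p[train], y[train])
y_hat_direct = predict_linear(direct, X_p[future])

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

print("two-stage RMSE:", rmse(y[future], y_hat_two_stage))
print("direct RMSE:   ", rmse(y[future], y_hat_direct))
```

One thing this sketch makes visible: when both stages are ordinary least squares with an intercept, f(X_p, g(X_p)) is itself a linear function of X_p and coincides with the direct regression of Y on X_p. Any gain from the two-stage setup would therefore have to come from nonlinear models, extra structure, or extra data in one of the stages.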
Question
Which methods or approaches can I apply to this problem, where I want to learn from rich granular data but make predictions using only limited plan features? Are there any papers, tutorials, or step-by-step guides that walk through this kind of setup? Are there any specific keywords or search terms I could use to find more literature or discussions on this type of problem?
How do I decide which Z features are “good”? Some candidate Z1 might be very predictive of Y but hard to estimate from X_p; another candidate Z2 might be less predictive of Y but easy to learn from X_p.
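One possible way to make this trade-off concrete is to score each candidate on two held-out diagnostics: how well it can be predicted from X_p, and how much it improves the prediction of Y when its actual value is added to X_p. The sketch below assumes NumPy, linear least squares, and synthetic data in which Z1 and Z2 are deliberately built to match the two cases described; the names `holdout_r2`, `learnability`, and `usefulness` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X_p = rng.normal(size=(n, 3))  # plan features X1..X3

# Made-up candidates: Z1 is mostly noise relative to X_p, Z2 is close to X2.
candidates = {
    "Z1": 0.2 * X_p[:, 0] + rng.normal(size=n),
    "Z2": X_p[:, 1] + 0.3 * rng.normal(size=n),
}
# Made-up target: Z1 matters a lot for Y, Z2 only a little.
y = (X_p[:, 0] + 2.0 * candidates["Z1"] + 0.5 * candidates["Z2"]
     + 0.1 * rng.normal(size=n))

def holdout_r2(A, target, n_train=800):
    """Fit least squares on the first n_train rows, return R^2 on the rest."""
    A1 = np.column_stack([A, np.ones(len(A))])
    coef, *_ = np.linalg.lstsq(A1[:n_train], target[:n_train], rcond=None)
    resid = target[n_train:] - A1[n_train:] @ coef
    centered = target[n_train:] - target[n_train:].mean()
    return float(1.0 - (resid @ resid) / (centered @ centered))

base_r2 = holdout_r2(X_p, y)  # how far X_p alone gets

learnability = {}  # R^2 of predicting the candidate from X_p
usefulness = {}    # gain in R^2 for Y when the actual candidate is added to X_p
for name, z in candidates.items():
    learnability[name] = holdout_r2(X_p, z)
    usefulness[name] = holdout_r2(np.column_stack([X_p, z]), y) - base_r2
    print(name, "learnability:", round(learnability[name], 3),
          "usefulness:", round(usefulness[name], 3))
```

A candidate that scores high on only one of the two diagnostics is exactly the Z1 or Z2 situation from the question. The usefulness number is computed with the actual Z, so it is an upper bound on what an estimated Ẑ could contribute.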
X_p is vs X_g. [1] A model that uses only X_p during training and inference: how well does it perform? [2] A model that uses X_g during both training and inference: how much better does it do? If it is a lot better, which feature(s) drive that improvement? This analysis gives you a good baseline model (presuming X_p is decent) and also shows whether X_g is worth including (it adds complexity), or whether you can approximate the key feature(s) of X_g and add them to X_p.
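The comparison of [1] and [2] could look roughly like the sketch below. It assumes NumPy, linear least squares, and synthetic data in which only X5 carries information beyond the plan features. That data-generating choice and the helper name `holdout_rmse` are assumptions for illustration. Dropping one granular-only feature at a time is used here as one simple way to see which feature drives the improvement.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
X_p = rng.normal(size=(n, 3))        # plan features X1..X3
X_extra = rng.normal(size=(n, 3))    # granular-only features X4..X6
X_g = np.column_stack([X_p, X_extra])

# Made-up target: beyond the plan features, only X5 matters.
y = (X_p @ np.array([1.0, 0.5, -0.5])
     + 1.5 * X_extra[:, 1]
     + 0.2 * rng.normal(size=n))

def holdout_rmse(A, target, n_train=800):
    """Fit least squares on the first n_train rows, return RMSE on the rest."""
    A1 = np.column_stack([A, np.ones(len(A))])
    coef, *_ = np.linalg.lstsq(A1[:n_train], target[:n_train], rcond=None)
    resid = target[n_train:] - A1[n_train:] @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

rmse_plan = holdout_rmse(X_p, y)      # [1] X_p only
rmse_granular = holdout_rmse(X_g, y)  # [2] full X_g
print("X_p only:", round(rmse_plan, 3), " X_g:", round(rmse_granular, 3))

# Ablation: how much worse does [2] get when one granular-only feature is removed?
ablation = {}
for j, name in enumerate(["X4", "X5", "X6"]):
    keep = [c for c in range(X_g.shape[1]) if c != 3 + j]
    ablation[name] = holdout_rmse(X_g[:, keep], y) - rmse_granular
    print("drop", name, "-> RMSE increase:", round(ablation[name], 3))

driver = max(ablation, key=ablation.get)
print("feature driving the improvement:", driver)
```

Model [2] cannot be used for future records as it stands, since X4..X6 are not available then. In this sketch it serves only as a ceiling to compare the X_p baseline against.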