I'm building a churn prediction model to estimate how long users will remain active in an application. I plan to use survival analysis because it handles censoring and provides time-to-event probabilities. What is the correct way to define the event indicator (churned vs. censored) given a training data cutoff date?
Approach 1 (Retrospective):
- Features (regressors): collected using data from registration up to the training cutoff date
- Duration: time from registration to last activity before cutoff
- Event indicator: 1 (churned) if the user's last activity was 30 or more days before the cutoff, 0 (censored) otherwise
- Observation window: from registration up to the training cutoff
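
For concreteness, Approach 1 could be sketched in pandas roughly as below. The table layout and the names `retrospective_labels`, `registered_at`, `last_activity_at` and `churn_gap_days` are assumptions made for illustration only; the 30-day threshold comes from the definition above.

```python
import pandas as pd

def retrospective_labels(users, cutoff, churn_gap_days=30):
    """Build (duration, event) pairs using only data up to the cutoff.

    `users` is a hypothetical table with columns user_id, registered_at and
    last_activity_at, where last_activity_at is the last activity <= cutoff.
    """
    cutoff = pd.Timestamp(cutoff)
    out = users[["user_id"]].copy()
    # Duration: registration to last activity before the cutoff.
    out["duration_days"] = (users["last_activity_at"] - users["registered_at"]).dt.days
    # Event: 1 if the user has already been silent for churn_gap_days or more.
    inactive_days = (cutoff - users["last_activity_at"]).dt.days
    out["event"] = (inactive_days >= churn_gap_days).astype(int)
    return out

# Made-up example data, not real users.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "registered_at": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-05-01"]),
    "last_activity_at": pd.to_datetime(["2023-02-15", "2023-06-25", "2023-05-20"]),
})
labels = retrospective_labels(users, "2023-06-30")
print(labels)
```

In this sketch user 2 was active 5 days before the cutoff, so they are treated as censored (event 0), while users 1 and 3 have been silent for more than 30 days and are treated as churned (event 1).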
Approach 2 (Prospective):
- Features: collected using data from registration up to the training cutoff date
- Duration: time from registration to last activity before cutoff
- Event indicator: based on user status after the cutoff (e.g., 1 if the user churned within 1 week post-cutoff, 0 if still active). This assumes the model will be used to predict survival 1 week into the future.
- Observation window: extends beyond training cutoff to label events
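
Approach 2 could be sketched in the same way. Here "churned within 1 week post-cutoff" is interpreted, as an assumption, to mean "no activity at all in the 7 days after the cutoff"; the activity-log layout and the names `prospective_labels`, `ts` and `horizon_days` are likewise hypothetical.

```python
import pandas as pd

def prospective_labels(users, activity, cutoff, horizon_days=7):
    """Duration from pre-cutoff data, event from a post-cutoff window.

    `users` has columns user_id, registered_at; `activity` is a hypothetical
    event log with columns user_id, ts (one row per activity).
    """
    cutoff = pd.Timestamp(cutoff)
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    # Last activity at or before the cutoff, per user.
    before = activity[activity["ts"] <= cutoff]
    last_before = before.groupby("user_id")["ts"].max()

    # Users seen at least once in (cutoff, cutoff + horizon].
    in_window = (activity["ts"] > cutoff) & (activity["ts"] <= horizon_end)
    active_after = activity.loc[in_window, "user_id"].unique()

    out = users[["user_id"]].copy()
    last = users["user_id"].map(last_before)
    out["duration_days"] = (last - users["registered_at"]).dt.days
    # Assumed churn definition: no activity inside the post-cutoff window.
    out["event"] = (~users["user_id"].isin(active_after)).astype(int)
    return out

# Made-up example data, not real users.
users = pd.DataFrame({
    "user_id": [1, 2],
    "registered_at": pd.to_datetime(["2023-01-01", "2023-03-01"]),
})
activity = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2023-06-20", "2023-07-03", "2023-06-25"]),
})
labels = prospective_labels(users, activity, "2023-06-30")
print(labels)
```

Note that in this sketch the features and duration stop at the cutoff while the label looks past it, which is exactly the asymmetry the question is about.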
Which approach is correct for survival analysis, or are both valid depending on the use case?
Thanks!