lsfischer

Will historical data lead to target leakage?

I'm building an employee churn model. I have employee data from 2016 to 2019 (covering people who stayed at or left the company). My goal is to train on data from 2016 to 2018 and predict on 2019.

Since some people did not leave the company between 2016 and 2019, a lot of employees are repeated across years. To have each person appear only once in the training dataset, my training set is: 2018 data for employees who did not leave the company, combined with data for employees who left the company in 2016, 2017, or 2018.
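
A minimal sketch of that construction in pandas, to make the setup concrete. The column names `employee_id`, `year`, and `left`, the helper `build_training_set`, and the toy records are all assumptions made for illustration, not the real schema or data.

```python
import pandas as pd


def build_training_set(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per employee: leavers at their exit year, stayers at 2018.

    Assumes hypothetical columns: employee_id, year, left (1 = left that year).
    """
    train = df[df["year"] <= 2018]
    # Employees who left in 2016, 2017 or 2018: keep the row from the year they left.
    leavers = train[train["left"] == 1]
    # Employees who did not leave in the training window: keep only their 2018 row.
    stayers = train[
        (train["year"] == 2018)
        & ~train["employee_id"].isin(leavers["employee_id"])
    ]
    return pd.concat([leavers, stayers], ignore_index=True)


# Made-up records: employee 1 stays, 2 leaves in 2017, 3 in 2016, 4 in 2018.
records = pd.DataFrame({
    "employee_id": [1, 1, 1, 2, 2, 3, 4, 4, 4],
    "year": [2016, 2017, 2018, 2016, 2017, 2016, 2016, 2017, 2018],
    "left": [0, 0, 0, 0, 1, 1, 0, 0, 1],
})

train_set = build_training_set(records)

# Cross-tabulating year against the target shows how the label is distributed per year.
print(pd.crosstab(train_set["year"], train_set["left"]))
```

With this construction, every row dated 2016 or 2017 belongs to someone who left, which is the pattern the questions below are about.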

My questions

  • Does having only the people who left for 2016/2017 lead to target leakage?
  • I'm not using time dummies, but can my model overfit by learning that employees are more likely to leave the company in 2016/2017, because that is what it sees most often?
  • If so, how can I avoid this problem?

Thanks :)!