$\begingroup$

I'm building an employee churn model. I have employee data from 2016 to 2019 (covering people who stayed at or left the company). My goal is to train on data from 2016 to 2018 and predict on 2019.

Since some people did not leave the company between 2016 and 2019, there are a lot of repeated employees. So my training set is: 2018 data about employees who did not leave the company, combined with data about employees who left the company in 2016, 2017, or 2018, so that each person appears only once in the training dataset.

My questions:

  • Does including only the leavers from 2016/2017 (and not the stayers from those years) lead to target leakage?
  • I'm not using time dummies, but can my model overfit by learning that employees are more likely to leave the company in 2016/2017, because that is what it sees more often?
  • And if so, how can I avoid this problem?

Thanks :)!

$\endgroup$

1 Answer

$\begingroup$

I'm going to set your question aside for a moment. To me it sounds like your dataset essentially contains each employee's start date and end date, with no end date if the employee is still working for you. This is the standard case for a class of regression techniques called 'survival regression', or survival analysis. I suggest you look into it.
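To give a feel for what survival analysis does with this kind of data, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The tenure values and the `left` flags are made up for illustration, and `kaplan_meier` is a hypothetical helper name, not something from your dataset or from a particular library. Employees who are still with the company are treated as censored: they count towards the people at risk up to their current tenure, but never as an exit.

```python
from collections import Counter

def kaplan_meier(durations, left):
    """Return [(t, S(t))] for each tenure t at which at least one exit occurred.

    durations: observed tenure per employee (e.g. in years)
    left:      True if the employee left, False if still employed (censored)
    """
    exits = Counter(t for t, e in zip(durations, left) if e)
    removed = Counter(durations)   # everyone drops out of the risk set at their tenure
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(removed):
        d = exits.get(t, 0)
        if d:
            # Multiply by the share of at-risk employees who did NOT leave at t.
            surv *= 1 - d / at_risk
            curve.append((t, surv))
        at_risk -= removed[t]
    return curve

# Hypothetical data: tenure in years, and whether the person left.
tenure = [1, 2, 2, 3, 4, 4, 5, 5]
left = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(tenure, left)
for t, s in curve:
    print(f"tenure {t}: estimated share still employed = {s:.3f}")
```

The point of the sketch is only that stayers are not thrown away or duplicated per year: each person appears once, with a tenure and a left/still-here flag.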

Now for your question: I'm not too sure about target leakage, as I cannot quite see how it would arise. However, if there is a structural change in your employees from year to year, then there is definitely some leakage.
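One rough way to look for such a structural change is to compare a feature's distribution among leavers across exit years. The sketch below is only an illustration under assumptions: the field names `exit_year` and `salary`, the numbers, and the helper `drift_by_year` are all hypothetical. It reports, per exit year, the feature mean and how far that mean sits from the pooled mean in pooled standard deviations.

```python
from statistics import mean, pstdev

def drift_by_year(rows, feature):
    """Per exit year: (mean of feature, standardized gap from the pooled mean)."""
    values = [r[feature] for r in rows]
    pooled_mean, pooled_sd = mean(values), pstdev(values)
    report = {}
    for year in sorted({r["exit_year"] for r in rows}):
        yearly_mean = mean(r[feature] for r in rows if r["exit_year"] == year)
        # A constant feature has zero spread, so report no gap instead of dividing by zero.
        gap = 0.0 if pooled_sd == 0 else (yearly_mean - pooled_mean) / pooled_sd
        report[year] = (yearly_mean, gap)
    return report

# Hypothetical leavers, one row per person (salary in thousands).
leavers = [
    {"exit_year": 2016, "salary": 30}, {"exit_year": 2016, "salary": 32},
    {"exit_year": 2017, "salary": 31}, {"exit_year": 2017, "salary": 33},
    {"exit_year": 2018, "salary": 40}, {"exit_year": 2018, "salary": 44},
]

report = drift_by_year(leavers, "salary")
for year, (avg, gap) in report.items():
    print(f"{year}: mean salary {avg}, standardized gap {gap:+.2f}")
```

If one year stands far apart from the others on features the model uses, the model can partly learn "which year is this row from" rather than "who leaves", which is the situation to watch for. How large a gap matters is a judgment call; the sketch does not set a threshold.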

$\endgroup$
  • $\begingroup$ I believe we have some alignment and some misalignment. My dataset is similar to what you described, although I only have the latest information about each employee (recorded at the moment of exit, or at the present if the person has not quit). I don't think survival analysis quite matches my intention, as I'm not interested in predicting time until quitting (nor do I have the information to do so, I believe). My example resembles the IBM employee churn dataset, but in my dataset I also add info about people who left in other years. Thanks though :)! $\endgroup$ Commented Sep 23, 2019 at 11:48
  • $\begingroup$ Hi, well, technically employee churn can be modelled as a survival function: essentially births + deaths. However, I do see that you might want to model churn directly as a function of time, and you can do that. From a purely time-series point of view, you should then check for structural changes in your data; if there are none, you can safely use your past data. $\endgroup$ Commented Sep 23, 2019 at 12:28
