I'm working on a research project whose objective is to predict each customer's probability of churning in the next month. We have a dataset of monthly records for each customer, with variables including (the list below is not exhaustive):
- `month`: month of the record
- `customer_id`: customer ID
- `tenure`: number of months the customer has stayed
- `gender`: whether the customer is male or female
- `churn`: whether the customer churned or not
A part of the dataset looks like:
| month | customer_id | tenure | gender | ... | churn |
|---|---|---|---|---|---|
| 2022-01 | 1 | 6 | 1 | ... | 1 |
| 2022-01 | 2 | 15 | 1 | ... | 0 |
| 2022-01 | 3 | 12 | 0 | ... | 0 |
| 2022-02 | 2 | 16 | 1 | ... | 0 |
| 2022-02 | 3 | 13 | 0 | ... | 0 |
| 2022-02 | 4 | 0 | 1 | ... | 0 |
| 2022-03 | 2 | 17 | 1 | ... | 0 |
| 2022-03 | 3 | 14 | 0 | ... | 1 |
| 2022-03 | 4 | 1 | 1 | ... | 0 |
Currently, I have problems with model selection and data preparation.
Problem 1: should I choose a CoxPH model (Cox proportional hazards model) or a logistic regression model?
CoxPH: the tenure variable can be treated as the time to event (churn), and we can easily determine whether a record is censored. Then, with the survival function $S(t \mid x) = S_0(t)^{\exp(x^\top \beta)}$, where $S_0(t)$ is the baseline survival function and $x$ is the customer's covariate vector, we can calculate the probability that a customer survives (does not churn) beyond time $t$.
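Since the goal is a next-month probability, the survival function would presumably be used conditionally: for a customer who has already reached tenure $t$, the probability of churning during the following month is $1 - S(t+1 \mid x) / S(t \mid x)$. The sketch below only illustrates that arithmetic; the baseline survival values and the linear predictor are hypothetical numbers, not estimates from the data above, and the function names are made up for the example.

```python
import math


def survival(s0_t: float, linear_predictor: float) -> float:
    """S(t | x) = S_0(t) ** exp(x^T beta)."""
    return s0_t ** math.exp(linear_predictor)


def next_month_churn_prob(s0_t: float, s0_t_plus_1: float, linear_predictor: float) -> float:
    """P(churn in (t, t+1] | survived to t) = 1 - S(t+1 | x) / S(t | x)."""
    return 1.0 - survival(s0_t_plus_1, linear_predictor) / survival(s0_t, linear_predictor)


# Hypothetical values, for illustration only.
s0_12, s0_13 = 0.80, 0.76  # assumed baseline survival at tenure 12 and 13 months
lp = 0.3                   # assumed x^T beta for one customer

p_churn = next_month_churn_prob(s0_12, s0_13, lp)
print(round(p_churn, 4))
```

In practice $S_0(t)$ and $\beta$ would come from the fitted Cox model rather than being typed in by hand.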
Logistic regression: logistic regression also seems suitable for this case. Tenure would be an explanatory variable and churn would be the target variable.
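To make the logistic option concrete, here is a minimal sketch of how a fitted model would turn one monthly record into a churn probability. The coefficients are hypothetical placeholders (nothing has been fitted here), and only `tenure` and `gender` are used as features for brevity.

```python
import math


def churn_probability(tenure: float, gender: int, coef: dict) -> float:
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1 * tenure + b2 * gender)))."""
    z = coef["intercept"] + coef["tenure"] * tenure + coef["gender"] * gender
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
coef = {"intercept": -1.5, "tenure": -0.08, "gender": 0.2}

p = churn_probability(tenure=12, gender=0, coef=coef)
print(round(p, 4))
```

Here tenure enters as an ordinary covariate, whereas in the Cox setup it plays the role of the time axis.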
Problem 2: how should I prepare data for a model?
If we choose Cox regression, we need to select only one row (maybe the last one) for each individual customer. The data would then look like:
| month | customer_id | tenure | gender | ... | churn |
|---|---|---|---|---|---|
| 2022-01 | 1 | 6 | 1 | ... | 1 |
| 2022-03 | 2 | 17 | 1 | ... | 0 |
| 2022-03 | 3 | 14 | 0 | ... | 1 |
| 2022-03 | 4 | 1 | 1 | ... | 0 |
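The reduction to one row per customer can be sketched as follows, using the sample rows from the first table. It is written in plain Python to stay dependency-free (a dataframe library would do the same job), and the helper name `last_record_per_customer` is made up for this example.

```python
# Monthly records from the sample: (month, customer_id, tenure, gender, churn).
records = [
    ("2022-01", 1, 6, 1, 1),
    ("2022-01", 2, 15, 1, 0),
    ("2022-01", 3, 12, 0, 0),
    ("2022-02", 2, 16, 1, 0),
    ("2022-02", 3, 13, 0, 0),
    ("2022-02", 4, 0, 1, 0),
    ("2022-03", 2, 17, 1, 0),
    ("2022-03", 3, 14, 0, 1),
    ("2022-03", 4, 1, 1, 0),
]


def last_record_per_customer(rows):
    """Keep each customer's most recent monthly record."""
    latest = {}
    for row in rows:
        month, customer_id = row[0], row[1]
        # "YYYY-MM" strings sort chronologically, so string comparison is enough.
        if customer_id not in latest or month > latest[customer_id][0]:
            latest[customer_id] = row
    return [latest[cid] for cid in sorted(latest)]


survival_rows = last_record_per_customer(records)
for row in survival_rows:
    print(row)
```

In the resulting rows, `tenure` would serve as the duration and `churn` as the event indicator, with `churn = 0` marking a censored customer.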
If we choose logistic regression, we fit the model on all rows (every month for every customer).
Am I thinking correctly about the problems?