Difference between CoxPH and logistic regression; data preparation for each model

Question

I'm working on a research project whose objective is to predict each customer's churn probability in the next month. We have a dataset of monthly records for each customer, with variables including (the list below is not exhaustive):

month: month
customer_id: customer ID
tenure: number of months the customer has stayed
gender: whether the customer is male or female
churn: whether the customer churned or not

A part of the dataset looks like:

	month	customer_id	tenure	gender	churn
1	2022-01	1	6	1	1
2	2022-01	2	15	1	0
3	2022-01	3	12	0	0
4	2022-02	2	16	1	0
5	2022-02	3	13	0	0
5	2022-02	4	0	1	0
6	2022-03	2	17	1	0
7	2022-03	3	14	0	1
8	2022-03	4	1	1	0

Currently, I have problems with model selection and data preparation.

Problem 1: should I choose a CoxPH model (Cox proportional hazards model) or a logistic regression model?

CoxPH: the tenure variable can be considered as time to event (churn) and we can also easily determine if a record is censored. Then with the survival function $S(t \mid x) = S_0(t)^{\exp(x^\top \beta)}$, we calculate the probability of survival (non-churn) at time $t$ for a customer.

Logistic regression: logistic regression also seems suitable for this case. Tenure would be an explanatory variable and churn would be the target variable.
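For the CoxPH option, the quantity of interest (churn in the next month for a customer still active at tenure $t$) would follow from the survival function as a conditional probability. A sketch, assuming tenure is measured in months and writing $T$ for the time of churn (a symbol introduced here for illustration):

$$
P(t < T \le t+1 \mid T > t, x) = 1 - \frac{S(t+1 \mid x)}{S(t \mid x)} = 1 - \left(\frac{S_0(t+1)}{S_0(t)}\right)^{\exp(x^\top \beta)}
$$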

Problem 2: how should I prepare data for a model?

If we choose Cox regression, we need to select only one row (perhaps the last one) for each individual customer. That would look like:

	month	customer_id	tenure	gender	churn
1	2022-01	1	6	1	1
6	2022-03	2	17	1	0
7	2022-03	3	14	0	1
8	2022-03	4	1	1	0

If we choose logistic regression, we fit the model with all data rows (every month for every customer).

Am I thinking correctly about the problems?

The entire problem boils down to: what is the nature of your response data? I only know "churn" in the context of butter, so I have no idea what your data are. However, is this a model for the time to churn, or is it a model of churn? In other words, when "churn" is 0 at the observed "tenure", is it possible it evaluates to 1 at the next observed unit of "tenure"? That would be censoring and the Cox model handles censoring.
– AdamO, Commented Apr 15, 2022 at 20:50
@AdamO "churn" can mean something like a customer who doesn't renew an insurance policy, as on this page. That's clearly like a death in survival analysis. Sometimes, however, "churn" is used to describe something like a failure of a customer to return to a website after a period of time. How to handle that isn't so clear to me.
– EdM, Commented Apr 15, 2022 at 21:20
In general, when time is relevant or there may be censoring, logistic regression is not appropriate.
– Frank Harrell, Commented Apr 15, 2022 at 21:36
I thought censoring in survival analysis meant people leaving the clinical experiment (or whatever is being observed), not death. Is that correct? @EdM
– M. Chris, Commented Dec 10, 2024 at 8:47
@M.Chris it’s more general. The “right censoring” typical of survival studies means that an individual hasn’t experienced the event (whatever it might be, and for whatever reason) as of the last observation time. Thus the time to event for that individual has a minimum value set by the last observation time, giving a right-censored time to event.
– EdM, Commented Dec 10, 2024 at 9:09

EdM · Accepted Answer · 2022-04-17 20:06:13Z

As time and censoring are important, this is clearly a survival-model situation. You have to decide what you want to choose as time = 0 for the model.

If you want to model tenure as an outcome, then you would effectively set time = 0 to the time that each individual started as a customer by using tenure as the (potentially censored) outcome in a survival model, as you propose for a Cox model. If no covariate values change with time and no customer departs and returns, then you can use just the last observed tenure value along with a censoring indicator as the outcome in a Cox (or other proportional-hazards) model.
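The reduction to one row per customer can be sketched as follows. This is a minimal illustration using the sample rows from the question, assuming covariates are fixed over time and no customer departs and returns; the function name `last_row_per_customer` and the output keys `duration` and `event` are hypothetical names chosen here.

```python
# Monthly rows from the question: (month, customer_id, tenure, gender, churn).
rows = [
    ("2022-01", 1, 6, 1, 1),
    ("2022-01", 2, 15, 1, 0),
    ("2022-01", 3, 12, 0, 0),
    ("2022-02", 2, 16, 1, 0),
    ("2022-02", 3, 13, 0, 0),
    ("2022-02", 4, 0, 1, 0),
    ("2022-03", 2, 17, 1, 0),
    ("2022-03", 3, 14, 0, 1),
    ("2022-03", 4, 1, 1, 0),
]

def last_row_per_customer(rows):
    """Keep each customer's latest monthly row.

    duration = last observed tenure; event = 1 if churned, 0 if censored.
    """
    latest = {}
    for month, cid, tenure, gender, churn in rows:
        # "YYYY-MM" strings sort chronologically, so string comparison suffices.
        if cid not in latest or month > latest[cid]["month"]:
            latest[cid] = {
                "month": month,
                "customer_id": cid,
                "duration": tenure,
                "gender": gender,
                "event": churn,
            }
    return [latest[cid] for cid in sorted(latest)]

survival_rows = last_row_per_customer(rows)
for r in survival_rows:
    print(r)
```

The resulting `duration` and `event` columns are the pair that a Cox model routine typically expects as the (potentially censored) outcome.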

You might, however, want to consider time = 0 as some fixed calendar date. See this answer and the linked reference to a thesis that used that approach instead for modeling insurance-customer churn. Then you could use tenure prior to that starting date as a predictor.

That's your choice depending on just what you want to model.

If you only have a small number of possible event times (e.g., monthly data over a year or so), you probably should be using discrete-time survival analysis. That can be set up as a logistic regression based on data for each individual at each at-risk time (to handle censoring; you evidently have data in that format already) and that includes time as a modeled covariate. This answer provides several links for study and to tools for setting up such data.
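To make the person-period idea concrete, here is a minimal sketch without covariates, on a small hypothetical dataset (not the asker's data). Each customer contributes one row per month at risk, and the discrete-time hazard at tenure $t$ is estimated as churn events divided by customers at risk. That ratio is what a logistic regression on these rows would fit if it had a separate intercept for each period and nothing else; adding covariates such as gender is what the logistic regression contributes beyond this.

```python
from collections import defaultdict

# Hypothetical person-period rows: (customer_id, tenure, churn).
# One row per customer per month at risk; churn = 1 only in the event month.
rows = [
    (1, 1, 0), (1, 2, 0), (1, 3, 1),   # churned in month 3
    (2, 1, 0), (2, 2, 1),              # churned in month 2
    (3, 1, 0), (3, 2, 0), (3, 3, 0),   # censored after month 3
    (4, 1, 0), (4, 2, 0),              # censored after month 2
]

def discrete_hazard(rows):
    """Estimate h(t) = events at t / customers at risk at t."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for _cid, tenure, churn in rows:
        at_risk[tenure] += 1
        events[tenure] += churn
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}

def survival_curve(hazard):
    """S(t) = product over j <= t of (1 - h(j))."""
    surv, s = {}, 1.0
    for t in sorted(hazard):
        s *= 1.0 - hazard[t]
        surv[t] = s
    return surv

hazard = discrete_hazard(rows)
survival = survival_curve(hazard)
print(hazard)
print(survival)
```

Censored customers simply stop contributing rows after their last observed month, which is how this data layout handles censoring. In this setup, the hazard for the coming period is itself the next-month churn probability for a customer who is still active.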

Finally, this will be most reliable if the "churn" is an active event, like the refusal to renew an insurance policy. If it's just that you haven't seen the customer for a long time, at which point you declare a "churn", then you might need to model this more subtly.

Thanks for your reply. That's very helpful. I have another question. If I choose a logistic regression (or a machine learning model), I must use all user-period data, right? And why do we use only the last record per user, instead of all records, for a Cox model?
– micmia, Commented Apr 22, 2022 at 15:15
@micmia yes, to handle the time-dependent part of the survival data you need to use all user-period data. In this situation it would be better to call your "logistic regression" or other binary-outcome model a "discrete-time survival model", with time included as a covariate. Although the underlying solution might be via logistic regression, the phrase "logistic regression" is more typically used for a single binary outcome without any modeling of time. A link in the answer provides resources for learning more about discrete-time survival models.
– EdM, Commented Apr 22, 2022 at 15:48
