Probability expression in Multi-Task Logistic Regression

Question

I'm trying to understand how the authors of this paper (Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors) obtain the general formula on page for the probability of observing the survival status sequence.

Apparently, the R-package MTLR is based on this approach. For a given random variable $T$, the probability of survival of the organism described by $\vec x$ for $T\geq t$ is assumed to be

$$ P(T>t|\vec x) = \frac{1}{1+ e^{\vec\theta_t\cdot \vec x + b_t}}. $$

Suppose now that we want to study the probability of survival through a time series $\{t_i\}$, i.e. we want to compute the probability that the organism dies at $T=t\in(t_i,t_{i+1})$. As the authors say, the ultimate goal is to obtain the organism's survival time distribution.

We still assume that

$$ p_i = P(T>t_i|\vec x) = \frac{1}{1+ e^{\vec\theta_i\cdot \vec x + b_i}}. $$

The proposed expression by the authors is

$$ P(\vec y|\vec x) = \frac{e^{y_1(\vec\theta_1\cdot\vec x+b_1) + \cdots + y_n(\vec\theta_n\cdot\vec x+b_n)}}{1+e^{\vec\theta_n\cdot\vec x + b_n}+ e^{\vec\theta_{n-1}\cdot\vec x+b_{n-1} + \vec\theta_n\cdot\vec x+b_n} + \cdots + e^{\vec\theta_1\cdot\vec x+b_1 + \cdots + \vec\theta_n\cdot\vec x+b_n}}, $$

where $\vec y$ holds the results of the observations over the time series: $y_i = y(t_i) = 0$ if the organism is still alive and $1$ otherwise.

I've tried to derive the formula for $P(\vec y|\vec x)$ from the different $p_i$, but without success. The formula is not their product, as $p_i$ depends on $p_{i-1}$, and I don't know how they are combined.

I was trying to understand the easiest case, $n=2$ and $\vec y=(0,0)$ (that is, the organism is always alive). For this case

$$ P = \frac{1}{e^{\vec\theta_2\cdot\vec x + b_2} + e^{\vec\theta_1\cdot\vec x + b_1 + \vec\theta_2\cdot\vec x + b_2}}. $$

Any ideas? Thanks

This resembles the sequential logit model applied to discrete-time survival analysis. — dimitriy
– dimitriy, Commented Nov 8, 2024 at 6:06
@dimitriy Thanks for the comment. Unfortunately I'm not very familiar with these models and I'm looking for the mathematical explanation. Could you add more details, please? — Dog_69
– Dog_69, Commented Nov 8, 2024 at 8:58
I think that there's an error in your rewriting of the authors' form for $P(\vec y|\vec x)$. The denominator in the paper (page 4) has one more term than the total number of time intervals, while yours only has a number of terms equal to the number of time intervals. You seem to have omitted the term for their "boundary case" $f_{\Theta}(\vec x, m) = 0$ (all $m$ elements of $\vec y$ equal to 0), which provides an additive term of 1 in the denominator after exponentiation. — EdM
– EdM, Commented Nov 10, 2024 at 21:28
The way that the authors structure their model, it's not clear that "$p_i$ depends on $p_{i-1}$" as you write. Their equation 2 does not seem to include any additional conditioning on having survived to time $i-1$; that seems to be implicit in their use of $P(T \ge t_i |\vec x)$ in their formulation of the logistic regression. Also, that means your formula for $P$ should be amended to $P(T \ge t_i |\vec x)$ from $P(T > t_i |\vec x)$. It might be easier to compare against the paper if you used the authors' notation of $m$ for the number of time points and $n$ for the number of individuals. — EdM
– EdM, Commented Nov 10, 2024 at 21:57
@EdM Thanks for your comments. The dependency of $p_i$ on $p_{i-1}$ is suggested (in my opinion) by the following sentence: ''However the outputs of these logistic regression models are not independent, as a death event at or before time $t_i$ implies death at all subsequent time points $t_j$ for all $j > i$. MTLR enforces the dependency of the outputs by predicting the survival status of a patient at each of the time snapshots $t_i$ jointly instead of independently.'' — Dog_69
– Dog_69, Commented Nov 11, 2024 at 9:01

EdM · Accepted Answer · 2024-11-11 13:30:21Z

TL;DR

It's not clear that the coefficients $b_i$ and $\theta_i$ have the same meanings in the "multi-task logistic regression" (MTLR) formula for $P(\vec y|\vec x)$ as they do in the formula for the probability $P(T \ge t_i|\vec x)$. The way MTLR is parameterized, they don't have to. MTLR is set up similarly to how state probabilities are presented in statistical mechanics: a set of values, one for each state, normalized by the partition function, the sum of all the values for the individual states.

MTLR is an attempt to re-invent discrete-time survival models with a focus on the survival function over time, $S(t)$. That leads to some awkwardness that isn't present in standard discrete-time models that focus on hazards. It's not clear that MTLR does anything that a standard discrete-time model can't do.

Standard discrete-time survival

Principles and methods of discrete-time survival analysis are covered in detail for example in Tutz and Schmid, Modeling Discrete Time-to-Event Data (Springer, 2016). There are also several pages about discrete-time survival models on this site, for example here and here.

A standard discrete-time survival model, as described by Tutz and Schmid, is a binomial regression that evaluates the discrete-time hazard, the probability of having an event during one time interval given that there was survival until the start of that time interval, as a function of covariate values. It can be thought of as a sequence of binomial regressions over the time intervals, each only evaluating individuals that are still at risk for an event during the time interval. That ultimately provides a cumulative event probability over time, often written as $F(t)$, as a function of covariate values. The survival function is then simply the complement of the cumulative event probability, $S(t)=1-F(t)$.
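As a minimal sketch of that last step (not code from Tutz and Schmid), the snippet below turns a set of per-interval hazards into $S(t)$ and $F(t)$. The hazard values and the function name `survival_from_hazards` are made up for illustration; in practice the hazards would come from the fitted binomial regression.

```python
from itertools import accumulate
import operator


def survival_from_hazards(hazards):
    """Return S(t_1), ..., S(t_m) given discrete-time hazards h_1, ..., h_m."""
    # Surviving interval j, given survival up to its start, has probability
    # 1 - h_j, so S(t_j) is the running product of those conditional terms.
    return list(accumulate((1.0 - h for h in hazards), operator.mul))


# Hypothetical hazards for three intervals (illustrative values only).
hazards = [0.10, 0.20, 0.25]
survival = survival_from_hazards(hazards)

# Cumulative event probability is the complement: F(t) = 1 - S(t).
cumulative_event = [1.0 - s for s in survival]
```

Because each factor is a conditional probability for individuals still at risk, the resulting $S(t)$ is non-increasing by construction.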

Focus on hazards starting from time 0 simplifies two common situations in survival analysis: those who have events and those who are lost to follow-up and have right-censored event times. Those individuals are simply omitted from the analysis of time intervals during which they were no longer at risk.
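A rough sketch of how that omission is usually arranged: each individual is expanded into one row per interval at risk (the "person-period" layout), and no rows are created after an event or after censoring. The field names and the helper `person_period_rows` are hypothetical, chosen only for this example.

```python
def person_period_rows(subject_id, last_interval, had_event):
    """Expand one subject into one row per interval during which they were at risk.

    last_interval: 1-based index of the last interval in which the subject was observed.
    had_event: True if the event occurred in that interval, False if censored there.
    """
    rows = []
    for j in range(1, last_interval + 1):
        # The event indicator is 1 only in the interval where the event happened.
        event = int(had_event and j == last_interval)
        rows.append({"id": subject_id, "interval": j, "event": event})
    return rows


# Subject "A" has an event in interval 2; subject "B" is censored after interval 3.
rows = person_period_rows("A", 2, True) + person_period_rows("B", 3, False)
```

A binomial regression of `event` on the interval and the covariates, fitted to rows like these, then estimates the discrete-time hazard. Subject "A" contributes nothing to interval 3 and "B" nothing beyond interval 3.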

MTLR

MTLR attempts to model the survival function $S(t)=1-F(t)$ directly instead of evaluating the hazard for each interval to first get $F(t)$ and then $S(t)$. It's not clear what advantage that provides, and it leads to several difficulties.

The authors note the following problem arising from their focus on $S(t)$ instead of on hazards, and the way they chose to deal with it:

a death event at or before time $t_i$ implies death at all subsequent time points $t_j$ for all $j > i$. MTLR enforces the dependency of the outputs by predicting the survival status of a patient at each of the time snapshots $t_i$ jointly instead of independently.

To try to estimate the entire survival function, MTLR thus evaluates the set of all possible sequences of alive/dead 0/1 indicators for an individual, encoded in the outcome vector $\vec y$. As events are terminal, all its elements for an individual equal 1 at and after the time of the event, with 0 values prior to that. If there are $m$ time intervals, then there are $m+1$ possible $\vec y$ sequences (one with all 0 values, and $m$ for events at each of the evaluation times).
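To make the count concrete, here is a small sketch that filters all $2^m$ binary vectors down to those in which an event is terminal (once a 1 appears, every later entry is also 1). The function name `valid_sequences` is just an illustrative choice.

```python
from itertools import product


def valid_sequences(m):
    """All 0/1 status sequences of length m in which an event is terminal."""
    # A sequence is valid when it never goes from 1 (dead) back to 0 (alive),
    # i.e. when it is non-decreasing.
    return [
        y
        for y in product((0, 1), repeat=m)
        if all(a <= b for a, b in zip(y, y[1:]))
    ]


sequences = valid_sequences(2)
```

For any $m$ this leaves $m+1$ of the $2^m$ candidates, which is why the normalizing sum in MTLR has $m+1$ terms.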

The authors parameterize a score for each of those $m+1$ sequences, given by the numerator of the formula for $P(\vec y|\vec x)$. Consider the values of the numerator for the 3 possible sequences with $m=2$:

(0,0): $\exp(0)=1$;

(0,1): $\exp(\vec\theta_2\cdot x+b_2)$;

(1,1): $\exp(\vec\theta_2\cdot x+b_2 + \vec\theta_1\cdot x+b_1)$.

The denominator of the formula for $P(\vec y|\vec x)$ with $m=2$ is just the sum of those 3 scores. As a result, you can think of MTLR as parameterizing the probability of each of those sequences in this way.
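The same calculation can be sketched for general $m$. This is only an illustration of the formula as written in this answer, not the code of the R package; `scores[i]` stands in for $\vec\theta_{i+1}\cdot\vec x+b_{i+1}$, and the numeric values are invented.

```python
import math


def mtlr_sequence_probabilities(scores):
    """Probability of each valid status sequence for one individual.

    scores[i] plays the role of theta_{i+1} . x + b_{i+1} (hypothetical values).
    """
    m = len(scores)
    # k leading zeros (alive) followed by m - k ones (dead), for k = m, ..., 0.
    sequences = [tuple([0] * k + [1] * (m - k)) for k in range(m, -1, -1)]
    # Numerator for each sequence: exp(sum_i y_i * score_i).
    numerators = [
        math.exp(sum(y * s for y, s in zip(seq, scores))) for seq in sequences
    ]
    # Denominator: the sum of all m + 1 numerators (the "partition function").
    z = sum(numerators)
    return {seq: num / z for seq, num in zip(sequences, numerators)}


scores = [-1.0, 0.5]  # illustrative scores for m = 2
probs = mtlr_sequence_probabilities(scores)

# Survival at each time point is the total probability of the sequences
# that still have a 0 at that position.
survival_t1 = probs[(0, 0)] + probs[(0, 1)]
survival_t2 = probs[(0, 0)]
```

The all-zero sequence contributes $\exp(0)=1$ to the denominator, which is where the leading 1 comes from.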

The interpretations of the coefficients don't really matter; I'm not sure whether or how they relate to standard logistic regression coefficients. The authors call MTLR a "generalization of the logistic regression model." The solution will find parameter values that maximize the likelihood of the data under this parameterization (subject to the penalization constraints for smoothing described in the paper), whether or not the parameterization makes any sense.
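As a rough check under the sign and indexing conventions used in this answer, write $s_i = \vec\theta_i\cdot\vec x + b_i$ (shorthand introduced here, not the paper's notation) and take $m=2$. Summing the probabilities of the sequences that still have a 0 at each time point gives

$$ P(y_2 = 0|\vec x) = \frac{1}{1+e^{s_2}+e^{s_1+s_2}}, \qquad P(y_1 = 0|\vec x) = \frac{1+e^{s_2}}{1+e^{s_2}+e^{s_1+s_2}} = \frac{1}{1+e^{s_1}\,\dfrac{e^{s_2}}{1+e^{s_2}}}. $$

Neither marginal has the single-logistic form $1/(1+e^{s_i})$ in general, which suggests that the same symbols need not carry the same meaning in the joint formula as in the separate logistic regressions.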

Potential problems with MTLR

First, it includes an individual with an early event in the calculations for the coefficients associated with all later events. That somehow seems wrong (although it might be similar to how Fine-Gray models handle competing risks after the first type of event).

Second, MTLR needs to go through an additional expectation maximization or gradient descent step to deal with individuals having right-censored event times. Those individuals are much more readily handled in standard discrete-time survival analysis: just ignore them after they no longer have data to provide.

Third, although the magnitudes and time courses of the regression coefficient vectors $\vec \theta_i$ are smoothed by penalization, I don't see that the "thresholds" $b_i$ are smoothed/penalized at all. I suspect that can lead to overfitting.

Fourth, if you allow for time-varying covariate values, what do you choose for the values of an individual after death?

Is MTLR an advance?

I think that the claims made by the authors for the advantages of MTLR are overstated. With time-varying coefficients, MTLR can handle strange shapes of survival curves, survival curves that cross in time, etc. They only compared it, however, against methods that by design cannot do that: standard Cox proportional hazards and Aalen additive hazards models. Standard survival models with time-varying coefficients, as Tutz and Schmid describe for standard discrete-time models in Section 5.3, "Time-Varying Coefficients," or Cox models with time-varying coefficients, can also handle a wider variety of survival curve shapes.

Another alleged advantage of MTLR, providing individual-specific survival curves, can also be done by other survival analysis methods once the covariate values are specified and the baseline hazard that the covariates' regression coefficients alter has been calculated. Several standard methods accommodate time-varying covariate values; it's not clear how MTLR does that properly for an individual with an early event in time.

Peer review questions

First, I don't see that the paper has been seriously peer reviewed by experts in survival analysis. It was included in peer-reviewed conference proceedings, Advances in Neural Information Processing Systems 24 (NIPS 2011), Edited by: J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira and K.Q. Weinberger; ISBN: 9781618395993. Given the conference's focus on machine learning, I wonder whether peer review for this paper at NIPS in 2011 would have included a reviewer with substantive expertise in survival analysis per se.

Second, the organizers of the NIPS conference bravely did a study involving re-review of 10% of the papers submitted for the 2014 meeting. See Wikipedia, and this page that notes:

between half and two-thirds of papers accepted at NIPS would have been rejected if reviewed a second time.

That's certainly not a problem specific to the NIPS meeting, but it does raise questions about how thoroughly MTLR was vetted in 2011. I have been unable to find any subsequent peer-reviewed publications documenting the MTLR method itself, and all of the few dozen references to it that I could find were to the 2011 NIPS proceedings.

I've chosen this paper because the R-package MTLR is based on it (see here). And could you please explain where the $+1$ in the denominator comes from? It's not clear to me from the formula, since for $k=0$ you have $i=1$ (using the original notation in the formula of $P_\Theta$). — Dog_69
– Dog_69, Commented Nov 11, 2024 at 13:13
@Dog_69 On re-evaluation the "empty" sum to $m+1$ is probably not a typo but an indication to flip consideration out to the "boundary case" for $k=m$. Better peer review might have helped clarify. The authors specifically define $f_{\Theta}(\vec x, m) = 0$ in the text below the formula as corresponding to "the sequence of all ‘0’s," which leads to an additive term of $\exp(0)=1$ in the denominator. It's needed for the probabilities of all possible $\vec y$ sequences to sum to 1. R packages don't have to meet standards for statistical reliability or usefulness. — EdM
– EdM, Commented Nov 11, 2024 at 13:22
