I'm analyzing machine failure data from a predictive maintenance perspective. In our system, machines are installed new and inspected from time to time. An inspection can happen:
- “randomly” (someone happens to check on a machine),
- based on some pre-set dates for future inspections (e.g., “set a 30-day schedule to check on the machine”),
- or because a client reports that the machine actually failed and we send inspectors to check (at which point we observe that it has failed in the past).
At inspection, we only learn whether a machine is broken or not; we do not observe the actual time of the failure event. I want to infer the unobserved failure time (i.e., the moment when the machine actually fails) from the observed inspection time, the inspection outcome (machine is broken vs. not), and covariates related both to machine failure causes and to inspection frequency / success rates.
This has to work at prediction time as well, i.e., I want to use such a model to run an inventory of all current machines and estimate, as of today $t$, in how many days each machine will likely fail, or whether it has likely already failed (actual failure time $< t$), so that we can dispatch an inspection earlier than our (naive) regular inspection interval would suggest. The goal is to move the inspection that reveals a broken machine closer to the true (but unobserved) failure time.
Most importantly: it is generally true that some machines have already failed at inference time, so any algorithm we use must be able to “predict” a failure time that lies in the past for the latent event. Standard survival models that only predict waiting times $\geq 0$ from today $t$ onward do not suffice here.
I’m considering modeling the overall delay from installation to detection as the sum of two waiting times:
Time to Failure ($t_1$): This is the time from when a machine is installed until it actually fails.
$$ t_{1,i} \sim \text{Exponential}(\lambda_{1,i}), \quad \lambda_{1,i} = \exp\bigl(\alpha_1 + \beta_1 x_{1,i}\bigr), $$
where $x_{1,i}$ could include factors like operating conditions or machine age.
Time from Failure to Detection ($t_2$): This is the time from when the machine actually fails until the failure is detected at the next inspection.
$$ t_{2,i} \sim \text{Exponential}(\lambda_{2,i}), \quad \lambda_{2,i} = \exp\bigl(\alpha_2 + \beta_2 x_{2,i}\bigr), $$

where $x_{2,i}$ might capture factors such as the maintenance schedule, inspection frequency, staff workload, etc.

The only observable is the total time

$$ T_i = t_{1,i} + t_{2,i}. $$

When $\lambda_{1,i} \neq \lambda_{2,i}$, the density of $T_i$ is the convolution of the two exponential densities, i.e., a hypoexponential distribution (https://en.wikipedia.org/wiki/Hypoexponential_distribution):

$$ f_T(T_i \mid x_{1,i}, x_{2,i}, \theta) = \frac{\lambda_{1,i}\lambda_{2,i}}{\left|\lambda_{2,i} - \lambda_{1,i}\right|}\Bigl| e^{-\lambda_{1,i}T_i} - e^{-\lambda_{2,i}T_i} \Bigr|, $$

with $\theta = (\alpha_1, \beta_1, \alpha_2, \beta_2)$.
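A side effect of this decomposition (unless I've slipped on the conditioning) is that it directly yields the “already failed but not yet detected” probability required above: for a machine with no detected failure as of today $t$,

$$ \Pr\bigl(t_{1,i} \le t \mid T_i > t\bigr) = 1 - \frac{e^{-\lambda_{1,i} t}}{S_T(t)}, \qquad S_T(t) = \frac{\lambda_{2,i}\, e^{-\lambda_{1,i} t} - \lambda_{1,i}\, e^{-\lambda_{2,i} t}}{\lambda_{2,i} - \lambda_{1,i}}, $$

using $\Pr(t_{1,i} > t,\ T_i > t) = \Pr(t_{1,i} > t)$, since $T_i \ge t_{1,i}$.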
We have multiple observations per machine, i.e., a longitudinal dataset with the remaining duration until an inspection reveals the failure, $Y_{i,t} = T_i - t$, and features $X_{i,t}$, for $N$ machines $i = 1, \ldots, N$, each observed at multiple (irregularly spaced) times $t = 1, \ldots, t_i$; a subset of machines has not been inspected (yet), and some have not (yet) shown a failure.
I've implemented a toy version of this in Stan, which works well (i.e., it recovers the true $\theta$); a minimal sketch is below. However, this does not scale to millions of observations and hundreds of features.
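For concreteness, here is roughly what I mean (simplified to one fully observed total time per machine and a single covariate per stage; priors and variable names are placeholders):

```stan
functions {
  // Log density of the hypoexponential distribution: the sum of two
  // independent exponentials with DISTINCT rates l1 != l2.
  // Written via log1m_exp for numerical stability; with equal rates
  // the sum would instead be Gamma(2, rate).
  real hypoexp_lpdf(real t, real l1, real l2) {
    real a = fmin(l1, l2);
    real b = fmax(l1, l2);
    return log(l1) + log(l2) - log(b - a)
           - a * t + log1m_exp(-(b - a) * t);
  }
}
data {
  int<lower=1> N;
  vector<lower=0>[N] y;   // observed total time T_i (installation -> detection)
  vector[N] x1;           // covariate affecting time to failure t1
  vector[N] x2;           // covariate affecting detection delay t2
}
parameters {
  real alpha1;
  real beta1;
  real alpha2;
  real beta2;
}
model {
  // placeholder weakly informative priors
  alpha1 ~ normal(0, 1);
  beta1 ~ normal(0, 1);
  alpha2 ~ normal(0, 1);
  beta2 ~ normal(0, 1);
  for (i in 1:N) {
    real l1 = exp(alpha1 + beta1 * x1[i]);
    real l2 = exp(alpha2 + beta2 * x2[i]);
    y[i] ~ hypoexp(l1, l2);  // equivalently: target += hypoexp_lpdf(y[i] | l1, l2);
  }
}
```

(The per-observation loop is the part that doesn't scale; within-chain parallelization via `reduce_sum` helps somewhat, but I'm asking about fundamentally cheaper formulations.)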
My questions are:
- Is this convolution-based likelihood a standard problem framing for inferring the distribution of the actual failure time from features $X$ (i.e., $\Pr(t_1 \mid X)$)? Are there issues with this setup that I'm overlooking (e.g., identifiability problems if common covariates $X$ move $\lambda_1$ and $\lambda_2$ in parallel, so that $t_1$ and $t_2$ cannot be separated)?
- Are there alternative model formulations that might be more robust or computationally efficient for this type of two-stage waiting-time problem?
- What other framings / terminology in the literature address this particular problem? Areas I came across are “multi-state survival modeling”, [obviously] hidden Markov models, and lastly “current status data”.
- Lastly, in practice we can often lower-bound $t_1$: when an inspection happens and the machine is not broken, we know that $t_1 \geq$ the previous inspection time. Any references (literature and/or implementations) on left-truncated hypoexponential distributions? (I sketch my own attempt right below.)
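For that last question, here is the version I would write down myself; I'd appreciate a check on the algebra. Left-truncating $T$ at a known bound $a$ just renormalizes the density by the survival function,

$$ f_{T \mid T > a}(t) = \frac{f_T(t)}{S_T(a)}, \quad t > a, $$

with $S_T$ as given earlier. And when the bound is on $t_1$ rather than on $T$ (our case, since the machine was seen working at time $a$), memorylessness of the exponential gives $(t_1 - a) \mid t_1 > a \sim \text{Exponential}(\lambda_1)$, so $(T - a) \mid \{t_1 > a\}$ is again hypoexponential with the same rates, and no specially truncated distribution seems needed.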
Any insights, alternative approaches, or references would be greatly appreciated!
UPDATE (after a comment by EdM): Our current baseline is a plain-vanilla interval-censored survival model (Weibull), with lower bound = time of the most recent inspection without failure (0 if there has been none) and upper bound = time of the inspection that found the failure ($\infty$ if no failure has been observed yet), throwing all features into the model and thereby ignoring the knowledge that some features can only affect $t_1$ while others can only affect $t_2$. My concerns / questions about this baseline:

- a) We lose information by ignoring the feature $\rightarrow$ $t_1$ / $t_2$ structure, which seems critical to the problem and to the justification for being able to infer $P(t_1 \mid X, t)$ in the first place.
- b) At inference time, when using "remaining time to event" as the target, predictions are necessarily in the future (we plan to switch to distributions on the full real line, e.g., Normal, to avoid that). When using "total time to event", predictions can lie in the past, but at training time the model overindexes on the "days since machine installed" feature, because it leaks information about the "total time" target, and out-of-sample metrics look "better" simply because evaluation happens at times closer to the end of the life cycle. Also, a model "prediction" at time $t$ can still come out far below the known lower bound for the observed machine (just because the survival model got it wrong). That creates an odd situation where, say, we are on day 100 and we know a machine had its last failure-free inspection on day 80, yet the interval-censored survival model predicts around day 40. I guess we'd have to incorporate the lower bound for machine $i$ into its conditional distribution at inference time to avoid those cases; see the formula below.
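Concretely, for that last point, I think the fix is to renormalize the predictive distribution by the known bound. Writing $L_i$ for the time of machine $i$'s most recent failure-free inspection and $F_1(\cdot \mid X_i)$ for the model's CDF of $t_1$ (notation introduced here only for this formula),

$$ \Pr\bigl(t_{1,i} \le s \mid t_{1,i} > L_i, X_i\bigr) = \frac{F_1(s \mid X_i) - F_1(L_i \mid X_i)}{1 - F_1(L_i \mid X_i)}, \quad s > L_i, $$

which rules out by construction predictions like “day 40” after a clean inspection on day 80.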
Note: I asked a related question previously (How to estimate when an event of interest is overdue?), but here I'm framing the problem differently, more closely aligned with the existing survival / queuing-theory literature.
