
I'm analyzing machine failure data from a predictive maintenance perspective. In our system, machines are installed as new and are inspected from time to time. That inspection can happen

  • “randomly” (someone checks on a machine by accident),
  • based on some pre-set dates for future inspections (e.g., “set a 30d schedule to check on the machine”),
  • or because a client reports that the machine actually failed and we send inspectors to check (at which point we observe that it has failed in the past).

At inspection, we only learn whether a machine is broken or not; we do not observe the actual time of the failure event. I want to infer the unobserved failure time (i.e., the moment when the machine actually fails) from the observed inspection time, the inspection outcome (machine is broken vs. not), and features/covariates related both to machine failure causes and to inspection frequency / success rates.

[Figure: timeline from installation to latent failure to inspection/detection]

This has to work at prediction time as well, i.e., I want to use such a model to run an inventory of all current machines and estimate, as of today $t$, in how many days each machine will likely fail or whether it has likely already failed (if the actual failure event time is $< t$) – so that we can schedule an inspection earlier than the regular inspection interval would (naively) dictate. The goal is to bring the inspection/realization that a machine is broken closer to the true (but unobserved) failure time.

Most importantly: it is generally true that some machines have already failed at inference time, so any algorithm we use must be able to “predict” a failure time that lies in the past for the latent event. Standard survival models that only predict waiting times $\geq 0$ as of today $t$ do not suffice here.

I’m considering modeling the overall delay from installation to detection as the sum of two waiting times:

Time to Failure ($t_1$): This is the time from when a machine is installed until it actually fails.

$$ t_{1,i} \sim \text{Exponential}(\lambda_{1,i}), \quad \lambda_{1,i} = \exp\bigl(\alpha_1 + \beta_1 x_{1,i}\bigr), $$

where $x_{1,i}$ could include factors like operating conditions or machine age.

Time from Failure to Detection ($t_2$): This is the time from when the machine actually fails until the failure is detected at the next inspection.

$$ t_{2,i} \sim \text{Exponential}(\lambda_{2,i}), \quad \lambda_{2,i} = \exp\bigl(\alpha_2 + \beta_2 x_{2,i}\bigr), $$

where $x_{2,i}$ might capture factors such as maintenance schedule, inspection frequency, workload of staff, etc.

The only observable is the total time
$$ T_i = t_{1,i} + t_{2,i}. $$
When $\lambda_{1,i} \neq \lambda_{2,i}$, the density of $T_i$ is the convolution of the two exponential densities (the hypoexponential distribution, https://en.wikipedia.org/wiki/Hypoexponential_distribution):
$$ f_T(T_i \mid x_{1,i}, x_{2,i}, \theta) = \frac{\lambda_{1,i}\lambda_{2,i}}{\left|\lambda_{2,i} - \lambda_{1,i}\right|}\Bigl| e^{-\lambda_{1,i}T_i} - e^{-\lambda_{2,i}T_i} \Bigr|, $$
with $\theta = (\alpha_1, \beta_1, \alpha_2, \beta_2)$.
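For concreteness, here is a minimal R sketch of this likelihood with log-linear rates (the actual implementation is in Stan, as mentioned below; the model-matrix names `X1`, `X2` and the optimizer call are placeholders, not the real code):

```r
# Negative log-likelihood of T = t1 + t2 under the hypoexponential model above.
# X1, X2 are model matrices (including intercept columns) for the two stages.
hypoexp_nll <- function(theta, T_obs, X1, X2) {
  k1 <- ncol(X1)
  lambda1 <- exp(drop(X1 %*% theta[1:k1]))                    # rate for time-to-failure t1
  lambda2 <- exp(drop(X2 %*% theta[(k1 + 1):length(theta)]))  # rate for detection delay t2
  # hypoexponential density; assumes lambda1 != lambda2 for every observation
  logf <- log(lambda1) + log(lambda2) - log(abs(lambda2 - lambda1)) +
          log(abs(exp(-lambda1 * T_obs) - exp(-lambda2 * T_obs)))
  -sum(logf)
}

# fit <- optim(par = rep(0, ncol(X1) + ncol(X2)), fn = hypoexp_nll,
#              T_obs = total_times, X1 = X1, X2 = X2, method = "BFGS")
```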

We have multiple observations per machine, i.e., a longitudinal dataset with the remaining duration until an inspection reveals the failure, $Y_{i,t} = T_i - t$, and features $X_{i,t}$, for $N$ machines $i = 1, \ldots, N$, each with multiple (irregularly spaced) observations $t = 1, \ldots, t_{i}$; a subset of machines has not been inspected (yet) and some have not shown a failure.

I've implemented a toy version in Stan which works well (i.e., it can recover the true $\theta$); however, it does not scale well to millions of examples and hundreds of features.

My questions are:

  • Is this convolution-based likelihood a standard framing for inferring the distribution of the actual failure time from features $X$ (i.e., $\Pr(t_1 \mid X)$)? Are there issues with that setup that I'm overlooking (e.g., identifiability issues if covariates $X$ shared between $\lambda_1$ and $\lambda_2$ move $t_1$ and $t_2$ in parallel)?

  • Are there alternative model formulations that might be more robust or computationally efficient for this type of two-stage waiting-time problem?

  • What other terminology / areas of the literature address this particular problem? Areas I came across are “multi-state survival modeling”, (obviously) hidden Markov models, and lastly “current status data”.

  • Lastly, in practice we can often lower-bound $t_1$: when an inspection happens and the machine is not broken, we know that $t_1 \geq$ “previous inspection time”. Any references (literature and/or implementations) on left-truncated hypoexponential distributions?

Any insights, alternative approaches, or references would be greatly appreciated!

UPDATE (after comment by EdM): Our current baseline is a plain-vanilla interval-censored survival model (Weibull), with lower bound = time of the most recent inspection without failure (0 if there was none) and upper bound = time of the inspection that found the failure ($\infty$ if no failure has been observed yet), throwing all features into the model and ignoring the knowledge that some features can only affect $t_1$ and some can only affect $t_2$. My concerns/questions with that setup are: a) We lose information by ignoring the feature $\to$ $t_1$ / $t_2$ relationship, which seems critical to the problem and to the justification for being able to infer $P(t_1 \mid X, t)$ in the first place. b) At inference time, when using "remaining time to event" as the target, predictions are necessarily in the future (we plan to switch to distributions on the full real line, e.g., Normal, to avoid that). When using total time to event, predictions can be in the past, but at training time models over-index on the "days since machine installed" feature, since it leaks information about the "total time" target, and out-of-sample metrics look "better" simply because evaluation happens at times closer to the end of the lifecycle. Also, a model "prediction" at time $t$ can still 'predict' a value far below the known lower bound for the observed machine (just because the survival model got it wrong). That creates an odd situation where, say, we are on day 100, we know a machine had its last failure-free inspection on day 80, yet the interval-censored survival model gives a prediction around day 40. I guess we'd have to incorporate machine $i$'s lower bound into the conditional distribution at inference time to avoid those cases (see the sketch below).
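For concreteness, a minimal sketch of that conditioning, assuming the baseline Weibull model was fit with R's survreg() (as in the answer below); the object names `fit` and `machine_i` are placeholders, and `machine_i` is a one-row data frame with the model's covariates for a machine whose last failure-free inspection was on day 80:

```r
library(survival)

# Translate survreg's Weibull output into pweibull's parameterization for this machine.
lp    <- predict(fit, newdata = machine_i, type = "lp")  # X'beta on the log-time scale
shape <- 1 / fit$scale
scale <- exp(lp)
S <- function(t) pweibull(t, shape = shape, scale = scale, lower.tail = FALSE)

# Condition on the known lower bound: P(T > t | T > 80) = S(t) / S(80).
# This puts zero mass before day 80, so the model can no longer "predict day 40".
t_grid    <- seq(81, 365, by = 1)
cond_surv <- S(t_grid) / S(80)
```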

Note: I asked a related question previously (How to estimate when an event of interest is overdue?), but I'm framing the problem here differently, more aligned with existing survival / queueing-theory literature.

  • Don't use a "days since machine installed" feature. That's inherent survivorship bias. Let the Weibull model handle the time since installation. There might also be survivorship bias if "some features can only affect $t_1$ and some can only affect $t_2$." You probably shouldn't include features that "predict" observation times. Also, the model predicts the distribution of survival times among machines as a function of features. Even if, say, the predicted mean failure time is 40 days after installation, there's nothing to prevent a particular machine from functioning at 80 days. Commented Mar 7 at 17:21
  • The goal is to estimate when a machine will fail or has failed. If I'm looking at a machine on day 100 and I know it has not failed at day 80 because of a previous inspection, then a model that puts any mass (not just the mean, but any >0 probability) on day 40 is clearly not the best we can do. At day 0 (or any day before the day-80 inspection) it's fine to have a predicted mean failure time of, say, 40 (and of course any random machine could still be functioning at day 80). But not on day 100, after observing a lower bound of 80 for this particular machine of interest. Commented Mar 8 at 2:56
  • Telling the inspection crew on day 100 that they should have inspected this machine at day 40 and are 60 days late, when in reality they checked on it at day 80 and it was fine, is not a useful exercise. Granted, maybe the survival-analysis framing is not the right way to look at this problem; hence why I posted here for potentially better alternatives to a classic survival setup. Commented Mar 8 at 2:58
  • @Georg some thoughts, enumerated for clarity and discussion: 1) Do you have historic ground truth on a) failure time and b) the observation time x days later? If not, you do not have ground truth and cannot learn either time, because it means you have label noise for the times, and then you cannot train reliable models that are better than this noise. 2) Regarding your point (prediction of failure at day 40 but the team checked at day 80 and it was fine): as EdM said, your survival model is probabilistic. If you encounter this during inference, why not filter those events and give no alarm? Commented Mar 8 at 17:58
  • @GeorgM.Goerg "a probabilistic model should condition on the t=80 'not failed' observation, not ignore it": a probabilistic model with installation time as reference can do that. See this page. The problem is that the prediction depends on the shape of the hazard curve. If hazard decreases with time, estimated remaining lifetime increases with time. If hazard increases with time, estimated remaining lifetime decreases with time. If hazard is constant (exponential model), so is estimated remaining lifetime. A Weibull model can have any of those. Commented Mar 8 at 21:15

1 Answer


First, use the date of installation of each machine as its time = 0. Then you can use "standard survival models with (predicted) waiting times ≥ 0" relative to that origin. A survival model can (in principle) provide the distribution of potential prior failure times given that a failure happened before some time T after installation. What you probably want, however, is something else: a model of the overall distribution of failure times, given a set of features X, so that you can design an inspection schedule to keep failure durations below some target value.

Second, if you have both an upper and a lower bound for an event time (e.g., you "randomly" find a failure some time after installation or after a prior inspection), you have an interval-censored event time. It's possible to combine interval-censored event times along with known event times (e.g., a customer reports a failure at the time of failure) in survival regression models.

The survreg() function in the R survival package can fit parametric survival models with both types of event times provided that other outcome-associated covariates are fixed in time. The eha package can (with some caveats and assumptions) additionally handle time-varying covariates; the icenReg package also might (I haven't used it in that way).
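As a minimal sketch of such a fit (the data frame and column names `machines`, `lower`, `upper`, `temp`, `load` are hypothetical; `lower` is the last failure-free inspection time and `upper` the inspection time at which the failure was found, both measured from installation):

```r
library(survival)

# Coding of event times with Surv(type = "interval2"), per observation:
#   lower == upper -> exact event time (e.g., customer-reported failure)
#   upper is NA    -> right-censored (no failure observed yet)
#   lower < upper  -> failure occurred somewhere in (lower, upper]
fit <- survreg(Surv(lower, upper, type = "interval2") ~ temp + load,
               data = machines, dist = "weibull")
summary(fit)
```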

Modeling survival that way allows more flexibility than what you propose, which seems to assume strictly exponential survival functions and seems to involve unnecessary computational complexity.

I fear, however, that attempts to infer "the actual failure time from the observed total time T and features X" will end up being disappointing. Typically, there is a broad distribution of failure times well beyond what is accounted for by the "features X." Even knowing that a failure occurred before some particular time T won't help much; you will often just end up with a broad distribution of potential prior failure times.

Response to comments

at training time models overindex on "days since machine installed" feature as that's leaking information on the "total time" target and out-of-sample model metrics become "better" just by evaluating on times closer to the end of the lifecycle.

That's why you shouldn't include things like "days since machine installed" as a feature if you are modeling event time after installation. It introduces explicit survivorship bias.

some features can only affect $t_1$ [start of time interval] and some can only affect $t_2$ [end of interval]

Such features need to be incorporated very carefully if at all. I suspect that you need a combined model that includes how they affect the likelihood of making an observation at a certain time, not just how they affect the likelihood of finding an event at/before/after that observation. That's beyond my expertise.

Also a model "prediction" at time $t$, could still 'predict' a value that's << lower bound for the observed machine (just because the survival model got it wrong).

The survival model didn't get it wrong. The observed machine was just lucky. Survival analysis evaluates the probability distribution of failure/event times for a population of machines with the same outcome-associated covariate values. How well it can estimate the time that a particular machine did fail in the past or might fail in the future depends on the nature of the underlying survival process.

Consider the general form of an accelerated failure time (AFT) regression model with constant covariate values $X$:

$$\log T = X^T\beta + \sigma W ,$$

where $T$ is failure time, $\beta$ is the vector of covariate coefficients, $W$ is a standard probability distribution (standard minimum extreme value for a Weibull model) and $\sigma$ is a scale factor.

$X^T \beta$ is the estimated log(event time) when $W=0$. How close that estimate comes to any individual event time, however, depends on the spread of $\sigma W$. With a very small $\sigma$ the estimate might be quite good individually, but a large-enough $\sigma$ could provide such a wide distribution that the point estimate $X^T \beta$ isn't very helpful for any particular machine.
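One way to see how much that spread matters is to look at predicted failure-time quantiles rather than a single point estimate; a sketch using a survreg() Weibull fit like the one above (the covariate profile is hypothetical):

```r
# Predicted failure-time distribution for one covariate profile: the gap between the
# 10th and 90th percentiles shows how informative the point estimate X'beta really is.
newd <- data.frame(temp = 70, load = 0.8)
predict(fit, newdata = newd, type = "quantile", p = c(0.1, 0.5, 0.9))
```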

With time-varying covariates you might refine your estimates of future survival based on changes in $X^T \beta$, but you are still just estimating the distribution of failure times for a population of machines with that set of covariate values. You should look at this page for discussion about whether/when predictions based on time-varying covariates even make epistemological sense.

You could consider a joint model of covariates over time along with event times. The random effects used to model individual machine covariates over time, however, might not be easy to incorporate into individual predictions for new machines.

What you can do

The practical problem is how to design inspection or replacement schedules to minimize overall cost. If you know the cost of an inspection or replacement and you can estimate the cost of a (potentially unidentified) machine failure as a function of time since failure, you can use your survival model to find the inspection/replacement schedule that minimizes net cost. That's also outside my expertise, but a quick web search identifies potentially useful references like: The Complete Guide to Industrial Maintenance Optimization and How to Determine Optimal Maintenance Intervals Using Reliability Centered Maintenance.
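As a rough sketch of that idea (not taken from those references, with hypothetical costs): for a periodic inspection every `tau` days, the expected time a failure stays undetected within the first inspection cycle is $\int_0^\tau F(t)\,dt$, which can be computed from the fitted survival curve and traded off against the inspection cost.

```r
# Crude first-cycle approximation of expected cost per day when inspecting every tau days.
# Uses the Weibull fit and covariate profile newd from the sketches above;
# C_insp and C_down are hypothetical costs per inspection and per undetected-failure day.
S <- function(t) pweibull(t, shape = 1 / fit$scale,
                          scale = exp(predict(fit, newdata = newd, type = "lp")),
                          lower.tail = FALSE)

cost_rate <- function(tau, C_insp = 100, C_down = 25) {
  t <- seq(0, tau, length.out = 1000)
  expected_down <- sum(1 - S(t)) * tau / length(t)  # approximates integral of F(t) dt on [0, tau]
  (C_insp + C_down * expected_down) / tau
}

taus <- seq(5, 180, by = 5)
taus[which.min(sapply(taus, cost_rate))]  # inspection interval with the lowest expected cost rate
```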

It's possible that changes in covariate values for an individual machine at an inspection time might lead to a change in an optimal inspection/replacement schedule, but there's still no assurance that the prediction will work well for that particular machine.

  • Thanks for the response. IIUC, what you are proposing is exactly the default setup we are using, but it's not giving us an estimate of the failure-time distribution; it gives failure time + delay. For example, if today is day 100 for a machine since t = 0, then a survival model on day 100 will give us an estimate of the (remaining) duration to the event (inspection finds a failure), which is necessarily in the future -- even though we know that for some machines this event is indeed in the past. Commented Mar 7 at 16:22
  • If we use a total-duration survival model (with the event being "inspected and found failure"), then it's possible to get "past" estimates of when the failure happened (though that then confounds model errors and correct past detection); it's just unclear to me how we are not biasing the estimates with the inherent inspection delay. Also, to clarify: I meant a distribution of failure times, not an "actual" (super-accurate) point estimate. It's perfectly fine to have a broad prior failure-time distribution without the bias of delayed inspection. Commented Mar 7 at 16:27
  • @GeorgM.Goerg a properly constructed parametric survival model avoids bias due to "inspection delay" by including the entire interval-censored time period in the likelihood function. This page shows that the likelihood of an interval-censored event-time observation is proportional to the difference in survival probability between the left and right ends of the time interval. That provides less information than an exact lifetime, but it still provides information about the modeled distribution of exact failure times since installation. Commented Mar 7 at 17:02
  • Thanks, I will review that page. Note that I posted an "UPDATE" in my OP above to clarify what our current baseline is (exactly what you propose here) and what my (potentially unwarranted) concern/question with that setup is. Commented Mar 7 at 17:07
  • @Ggjj11 for a discrete-time survival model like you describe in your last comment (for event probability in the next time interval), or a survival model of event time since last inspection, it makes sense to include "days since machine installed" as a feature. If you are modeling the event time "since machine installed," however, then using "days since machine installed" as a feature leads to circular reasoning. As the R time-dependence vignette puts it: "large values of time appear to predict long survival because long survival leads to large values for time." Commented Mar 8 at 19:09
