
I'm a software engineer working with medical device AI models that predict diseases and other conditions. For the most part, I don't design the models but I help with getting FDA clearance for them. These models are generally produced with supervised learning, and most of them are boosted decision trees, with a few CNNs or SVMs here and there.

Getting the initial training data for these models is often challenging, as the patient cases have to be reviewed by a physician and need to be labeled with a "gold standard" diagnosis (usually a binary label indicating whether the disease/condition is present or absent).

Once deployed, these models are often integrated directly into Electronic Health Record (EHR) pipelines, where they ingest large amounts of new patient data, but without the "gold standard" labels (unless a patient happens to have been evaluated for the specific disease or condition).

I have noticed that some models will generate a label for new patient data (i.e., the model's own prediction) and then use that freshly labeled data to train the next iteration of the model. To me, this does not make sense: if the prediction is correct, the model already got it right without further training; if it is incorrect, the model's mistakes (false positives in particular) are unduly reinforced.
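To make the pattern concrete, here is a minimal sketch of what I am describing, using scikit-learn and synthetic data (the estimator, toy dataset, and variable names are placeholders, not our actual pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a small physician-labeled set and a larger unlabeled EHR pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_gold, X_pool, y_gold, _ = train_test_split(X, y, test_size=0.75, random_state=0)

# Iteration 1: train on the physician-labeled ("gold standard") cases only.
model = GradientBoostingClassifier(random_state=0).fit(X_gold, y_gold)

# Iteration 2: label the new, unlabeled cases with the model's own predictions
# and retrain on the combined set -- this is the step I am questioning.
pseudo_labels = model.predict(X_pool)
X_next = np.vstack([X_gold, X_pool])
y_next = np.concatenate([y_gold, pseudo_labels])
model_v2 = GradientBoostingClassifier(random_state=0).fit(X_next, y_next)
```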

Is there a legitimate reason for this approach, or will it generally make models worse over time?

  • A keyword for the dangers of this practice is "Model Collapse". I think there is debate on the pros/cons, but maybe searching for that phrase can lead you to more information. Commented Nov 4 at 19:39
  • Here's an article that's not exactly specific to your question, but related, I think: nature.com/articles/s41586-024-07566-y Commented Nov 4 at 21:08

1 Answer


This approach has a few names, including pseudo-labeling and self-training. Searching for those keywords will get you lots of papers.

Those who use it typically do so because labeling is expensive. It is inherently a risky strategy for the reasons you mention, and in a high-stakes domain you would likely need a solid justification that the method is reliable. How appropriate it is depends heavily on your specific domain, your dataset, and the performance of the non-bootstrapped model.

In an attempt to curb its weaknesses, many implementations apply some sort of confidence cutoff and only keep predictions made with high confidence, as in the sketch below.
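For illustration, here is a minimal, self-contained sketch of confidence-thresholded pseudo-labeling in scikit-learn (the 0.95 cutoff and the toy data are arbitrary assumptions, not recommendations):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data: a small gold-standard set and a larger unlabeled pool.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_gold, y_gold = X[:500], y[:500]
X_pool = X[500:]

# Train on the gold-standard labels only.
model = GradientBoostingClassifier(random_state=0).fit(X_gold, y_gold)

# Pseudo-label the pool, but keep only highly confident predictions.
proba = model.predict_proba(X_pool)
keep = proba.max(axis=1) >= 0.95          # illustrative cutoff
pseudo_y = proba.argmax(axis=1)[keep]
X_pseudo = X_pool[keep]

# Retrain on the gold-standard labels plus the confident pseudo-labels.
model_v2 = GradientBoostingClassifier(random_state=0).fit(
    np.vstack([X_gold, X_pseudo]),
    np.concatenate([y_gold, pseudo_y]),
)
```

Note that a cutoff only filters out low-confidence mistakes; confidently wrong predictions still get reinforced.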

If you continue with this approach indefinitely then yes, the model may get worse over time and reinforce its existing biases. Beyond a certain point, fresh human labeling is needed.

If you can share info about the domain, data, models and training approaches that are used, maybe I can understand the justification.

My source for this information is my training as a PhD student working on large language models.

  • Also a risk of overfitting. Commented Nov 5 at 3:09
