
Background

I'm implementing a production MLOps pipeline for part classification using Databricks AutoML. My pipeline automatically retrains models when new data arrives and compares performance with existing production models.

The Challenge

I've encountered a data drift issue that affects fair model comparison. Here's my scenario:

  1. Historical parts data: door lengths range 1-4 meters
  2. StandardScaler fitted on this range: scaler_v1.fit([1m, 2m, 3m, 4m])
  3. Model v1 trained on this scaled data

New Data Arrives:

  1. New parts: door lengths now range 2-6 meters
  2. Different distribution: new_data = [2m, 3m, 5m, 6m]
  3. Question: How should I handle scaling for fair model comparison?

Current Pipeline:

  • New data arrives
  • New data loaded into new_data_df
  • Train new model
  • Compare with existing model - but this is where the scaling problem arises: in my opinion, it makes no sense to compare the old and new models this way

Specific Questions:

Scaling Strategy: Should I:

  • Retrain new model on combined historical + new data for consistent scaling?
  • Use original scaler on new data (might clip values outside original range)?
  • Fit new scaler on combined dataset and retrain both models?

Data Drift Detection

What's the best approach to detect when scaling changes require full retraining vs. incremental updates? In MLOps, should I:

  • Always retrain on cumulative data (historical + new)?

  • Use statistical tests to detect drift threshold?

  • Maintain separate scalers and transform data appropriately for each model?

Environment:

  • Databricks AutoML for model training
  • MLflow Model Registry for deployment
  • StandardScaler for feature scaling
  • Engineering data with naturally evolving distributions

1 Answer

This is a very wide-ranging question, on a very important topic, which unfortunately runs the risk of being closed for that reason (lack of focus), but hopefully I can provide a decent enough answer to cover most of the bases (though I probably omitted some important ones). It's taken me a while to post this because I wanted to provide some working code, which I now have, in appendices with a link to GitHub.

Fair Scaling and Drift Handling in Production MLOps

TL;DR

To ensure fair model comparison amidst data drift, it is important to treat the preprocessing scaler and the model as a single, versioned pipeline. Competing model pipelines should be evaluated on the same time-ordered, raw holdout dataset, with each pipeline applying its own versioned scaler. The retraining strategy - choosing between an expanding or a sliding window of data - should be determined by backtesting, guided by continuous monitoring for data drift using a combination of statistical tests, stability indices, and business performance metrics.


1. The Core Problem: A Tale of Two Comparisons

When a feature distribution shifts (e.g., door length from $[1,4]$ metres to $[2,6]$ metres), the central challenge extends beyond simple rescaling. It requires us to conduct a scientifically valid comparison between the deployed model ($M_{old}$) and a candidate model ($M_{new}$). A model is not merely an algorithm but a composite system, $f$, which applies its preprocessing transform, $g$ (e.g., a StandardScaler), before its learned estimator, $h$:

$$ f = h_{\theta_{model}} \circ g_{\theta_{scaler}}, \qquad f(x) = h_{\theta_{model}}\!\left(g_{\theta_{scaler}}(x)\right) $$

The deployed system is $f_{old} = h_{old} \circ g_{old}$, and the candidate is $f_{new} = h_{new} \circ g_{new}$. To compare them fairly, we can distinguish between two different goals: a deployment decision, and architecture benchmarking.

For a deployment decision, we compare the actual, deployable systems as they were trained. This involves evaluating both $f_{old}$ and $f_{new}$ on the same raw, time-ordered holdout dataset, with each pipeline applying its own versioned scaler. Any other approach, such as refitting the old model with the new scaler, evaluates a counterfactual system and leads to invalid conclusions about promotion.

For architecture benchmarking, we might choose to retrain different model families (e.g., Logistic Regression vs. LightGBM) on the exact same, up-to-date scaled data. This can help answer which algorithm is superior for the current data reality, but it does not replace the primary deployment comparison.

2. Strategy 1: Package and Version the Entire Pipeline

Here we enforce the principle that a scaler is inseparable from its model. Perhaps the most robust way to achieve this is to build an sklearn.Pipeline containing both the scaler and the estimator as sequential steps. This entire pipeline object can then be logged to a model registry like MLflow as a single, versioned artefact. This design approach helps to ensure the inference service always receives raw features and routes them to the selected model version, which handles its own transformation, thereby guaranteeing training-serving parity.
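
As a condensed sketch of this packaging principle (the synthetic data and the registered model name door_part_classifier are illustrative assumptions, and the full two-step logging-then-registering flow appears in Appendix 2):

import numpy as np
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training set (raw, unscaled door lengths in metres)
rng = np.random.default_rng(0)
X_train_raw = pd.DataFrame({"door_length": rng.normal(2.5, 0.5, 500)})
y_train = (X_train_raw["door_length"] > 2.8).astype(int)

# The scaler (g) and estimator (h) are steps of ONE object ...
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train_raw, y_train)

# ... which is logged as a single versioned artefact; at inference time the
# service loads this version and passes it RAW features only.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        sk_model=pipeline,
        artifact_path="model",
        registered_model_name="door_part_classifier",  # illustrative name
        input_example=X_train_raw.head(),
    )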

3. Strategy 2: Select the Training Window with Backtesting

The assumption that "more data is always better" is often false, and can be actively harmful in the presence of drift. To select the right strategy, it is important to distinguish between drift's different forms. Covariate Shift (or data drift) occurs when the input distribution $P(X)$ changes while the underlying relationship $P(Y|X)$ remains stable. More dangerous is Concept Drift, where that fundamental relationship $P(Y|X)$ itself changes over time, rendering the model's logic obsolete. A third type, Label Shift, involves a change in the class priors $P(Y)$ while the class-conditional distributions $P(X|Y)$ stay constant. Monitoring techniques can detect all three, but the retraining strategies in this answer primarily address covariate and concept drift. Our primary strategic defence is to choose the training data window empirically:

$$ \begin{align} \text{Train on} &\quad [t_0, t_k] \\ \text{or} &\quad [t_{k-W}, t_k] \\ \text{Validate on} &\quad (t_k, t_{k+\Delta}] \end{align} $$

An expanding window uses all historical data and is effective for gradual, additive drift. A sliding window of size $W$ uses only the most recent data, which is better for handling abrupt concept drift, where older data becomes irrelevant or even harmful and the model needs to discard those outdated patterns (Gama et al., 2014). The code in Appendix 1 simulates a severe case of this, rendering the old model's knowledge actively harmful.

The optimal strategy and window size $W$ are then determined by whichever choice minimises validation error in backtests.
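
As a rough illustration of such a backtest, the sketch below compares an expanding window with two sliding windows on simulated, time-ordered batches. Everything here is an assumption for demonstration purposes: the make_batch helper, the batch sizes, the drift point, and the candidate window sizes are hypothetical, not part of the asker's pipeline.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def make_batch(rng, n, mean, flip):
    """Hypothetical time-ordered batch; `flip` reverses the concept to simulate drift."""
    x = rng.normal(mean, 0.5, n)
    logits = 5 * ((2.8 - x) if flip else (x - 2.8))
    y = rng.binomial(1, 1 / (1 + np.exp(-logits)))
    return pd.DataFrame({"door_length": x, "target": y})

rng = np.random.default_rng(0)
# Six time-ordered batches; the concept flips in the last two (abrupt drift).
batches = [make_batch(rng, 500, mean=2.5 + 0.3 * t, flip=(t >= 4)) for t in range(6)]

def fit_and_score(train_batches, valid_batch):
    """Fit a fresh scaler+model pipeline on the window and score it on the next batch."""
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
    train = pd.concat(train_batches)
    pipe.fit(train[["door_length"]], train["target"])
    return roc_auc_score(valid_batch["target"],
                         pipe.predict_proba(valid_batch[["door_length"]])[:, 1])

# Backtest: train up to batch k, validate on batch k+1, for each candidate window.
for W in (None, 1, 2):          # None = expanding window; 1, 2 = sliding windows
    scores = []
    for k in range(2, 5):
        history = batches[:k + 1] if W is None else batches[k + 1 - W:k + 1]
        scores.append(fit_and_score(history, batches[k + 1]))
    label = "expanding" if W is None else f"sliding W={W}"
    print(f"{label:>12}: mean validation AUC = {np.mean(scores):.3f}")

The window with the best backtested validation score is the one used to train the candidate pipeline.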

4. Strategy 3: Implement Trigger-Based Retraining

Retraining should be an event triggered by evidence, not a fixed schedule. We can implement a multi-layered monitoring system to provide these triggers.

4.1 Covariate Drift Monitoring

We compare the distribution of new data to a stable reference set (e.g., the previous training window). Key tools include univariate tests like the Kolmogorov-Smirnov (KS) test for continuous features, the heuristic Population Stability Index (PSI) to quantify the magnitude of change, and more advanced multivariate methods like Maximum Mean Discrepancy (MMD) for detecting subtle changes in the joint distribution (Rabanser et al., 2019).
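
A small sketch of the univariate checks is shown below. The KS test comes from scipy; the PSI function is one common heuristic formulation using quantile bins from the reference window (the bin count and the 0.25 rule of thumb are conventions, not fixed rules), and the simulated reference/current arrays are illustrative.

import numpy as np
from scipy.stats import ks_2samp

def psi(reference, current, n_bins=10):
    """Population Stability Index with quantile bins taken from the reference window."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the old range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)       # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(2.5, 0.5, 2000)   # e.g. door lengths from the last training window
current = rng.normal(4.0, 0.5, 1200)     # newly arrived door lengths

ks_stat, ks_p = ks_2samp(reference, current)
print(f"KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3g}")
print(f"PSI = {psi(reference, current):.2f}  (rule of thumb: > 0.25 suggests a major shift)")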

4.2 Performance Drift Monitoring

The ultimate trigger is a degradation in business-critical metrics, so we must continuously monitor the performance (e.g., AUROC, log-loss) of the live or shadow-deployed model. A significant drop in performance is a "hard trigger" for immediate retraining and evaluation, whereas significant covariate drift without a performance drop can be a "soft trigger" to initiate an offline backtesting cycle.
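
In sketch form, the combined trigger policy might look like this (the function name, the AUC-drop tolerance, and the PSI threshold are illustrative assumptions, not prescriptions):

def retraining_decision(live_auc, baseline_auc, psi_value,
                        auc_drop_tol=0.05, psi_threshold=0.25):
    """Illustrative trigger policy: performance loss is a hard trigger,
    covariate drift alone is a soft trigger for offline backtesting."""
    if live_auc < baseline_auc - auc_drop_tol:
        return "HARD trigger: retrain and evaluate a candidate immediately"
    if psi_value > psi_threshold:
        return "SOFT trigger: launch offline backtest (expanding vs. sliding window)"
    return "No action: keep monitoring"

print(retraining_decision(live_auc=0.71, baseline_auc=0.82, psi_value=0.31))
print(retraining_decision(live_auc=0.81, baseline_auc=0.82, psi_value=0.31))
print(retraining_decision(live_auc=0.81, baseline_auc=0.82, psi_value=0.08))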

5. Answering the Specific Questions

Scaling Strategy: How should we handle scaling for fair model comparison?

We train the new model ($M_{new}$) on a chosen data window (expanding or sliding), fitting a new scaler ($g_{new}$) only to that data. We then package ($g_{new}, h_{new}$) as a single pipeline artefact. The comparison is then made by evaluating both the old pipeline and the new pipeline on the same raw holdout set, where each applies its own internal scaler. We must not use the old scaler to train the new model, nor should we refit both models under a shared scaler for a deployment decision.

Data Drift Detection: What's the best approach?

We should move away from an "always retrain on cumulative data" policy. Doing so 'pollutes' the training data with obsolete information, forcing the model to find a suboptimal balance between past and present realities, while also incurring significant computational costs. Instead, we use drift detection as a trigger. Statistical tests like the KS test or metrics like PSI with established thresholds (e.g., PSI > 0.25) initiate the retraining workflow. Within this workflow, we use backtesting to determine whether an expanding or sliding window is more appropriate for training the new candidate model. We maintain separate, versioned scalers as part of each model pipeline artefact within the MLflow Model Registry.

[Some simple Python code demonstrating these principles is in the appendices below and also in one of my GitHub repos.]

6. Risks and Alternatives

For severe covariate shift, we might consider RobustScaler (using median/IQR), which is less sensitive to outliers. If label shift is the primary issue (i.e., the prevalence of classes changes but the features given a class remain stable), performance may be recoverable through model re-calibration without a full retrain. Finally, while tree-based models are invariant to monotonic transformations like scaling, their split-point logic can still be negatively affected by significant distribution drift.
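
For the RobustScaler option, the only change to the packaged pipeline is the scaler step; a minimal sketch, not a recommendation to switch by default:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# Same packaging principle as before; only the transform g changes.
# RobustScaler centres on the median and scales by the IQR, so a handful of
# extreme door lengths shifts the fitted scaling parameters far less than
# StandardScaler's mean/std would.
robust_pipeline = Pipeline([
    ("scaler", RobustScaler()),
    ("clf", LogisticRegression()),
])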

References

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 44. https://doi.org/10.1145/2523813

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://jmlr.org/papers/v12/pedregosa11a.html

Rabanser, S., Günnemann, S., & Lipton, Z. (2019). Failing loudly: An empirical study of methods for detecting dataset shift. Advances in Neural Information Processing Systems, 32. https://proceedings.neurips.cc/paper/2019/hash/846c260d715e5b854ffad5f70a516c88-Abstract.html

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., & Young, M. (2015). Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems, 28. https://papers.nips.cc/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html


Appendix 1 - Minimal, Runnable Example (scikit-learn only)

# This code provides a self-contained demonstration of the core comparison
# principle without requiring MLflow or any other external registry.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.stats import ks_2samp

def generate_data_with_concept(rng, n_samples, mean, std_dev, concept_threshold, reverse_concept=False):
    """Generates data with a probabilistic (noisy) target label."""
    lengths = rng.normal(mean, std_dev, n_samples)
    # The relationship between length and target can be reversed for concept drift
    if not reverse_concept:
        # Standard concept: longer doors -> more likely to be class 1
        logits = 5 * (lengths - concept_threshold)
    else:
        # Reversed concept: shorter doors -> more likely to be class 1
        logits = 5 * (concept_threshold - lengths)
    probabilities = 1 / (1 + np.exp(-logits))
    labels = rng.binomial(1, probabilities)
    return pd.DataFrame({"door_length": lengths, "target": labels})

def run_minimal_demo():
    """Demonstrates fair model comparison with versioned pipelines."""
    rng = np.random.default_rng(42)

    # 1) Simulate historic and new data with different concepts
    # Historic: Longer doors are class 1
    hist = generate_data_with_concept(rng, 2000, mean=2.5, std_dev=0.5,
                                      concept_threshold=2.8, reverse_concept=False)
    # New: Covariate drift (mean changed) AND Concept drift (relationship reversed)
    new = generate_data_with_concept(rng, 1200, mean=4.0, std_dev=0.5,
                                     concept_threshold=3.8, reverse_concept=True)

    # 2) Univariate KS drift test on raw feature
    ks_stat, ks_p = ks_2samp(hist["door_length"], new["door_length"])
    print("--- Drift Analysis ---")
    print(f"KS p-value (hist vs. new): {ks_p:.4g}")
    if ks_p < 0.01:
        print("Result: Significant drift detected. Retraining is warranted.")
    else:
        print("Result: No significant drift.")

    # 3) Define and train two deployable pipelines
    pipe_old = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
    pipe_new = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

    # Old model was trained on historic data only
    pipe_old.fit(hist[["door_length"]], hist["target"])

    # New model must be trained on a SLIDING WINDOW (new data only) to learn the new concept
    new_train_data = new.iloc[:800]
    pipe_new.fit(new_train_data[["door_length"]], new_train_data["target"])

    # 4) Fair evaluation on the same RAW holdout from recent data
    holdout = new.iloc[800:].copy()
    X_hold_raw = holdout[["door_length"]]
    y_hold = holdout["target"]

    auc_old = roc_auc_score(y_hold, pipe_old.predict_proba(X_hold_raw)[:, 1])
    auc_new = roc_auc_score(y_hold, pipe_new.predict_proba(X_hold_raw)[:, 1])

    print("\n--- Fair Comparison on Holdout Set ---")
    print(f"Old Pipeline AUC: {auc_old:.3f}")
    print(f"New Pipeline AUC: {auc_new:.3f}")
    print("Recommendation:", "Promote NEW" if auc_new > auc_old else "Keep OLD")

if __name__ == '__main__':
    run_minimal_demo()

which produces this output:

--- Drift Analysis ---
KS p-value (hist vs. new): 0
Result: Significant drift detected. Retraining is warranted.

--- Fair Comparison on Holdout Set ---
Old Pipeline AUC: 0.107
New Pipeline AUC: 0.893
Recommendation: Promote NEW

This appendix serves as the "Science Experiment". Its purpose is to provide an immediate, zero-friction demonstration of the core machine learning concept, isolated from any complex tooling. The output is a powerful illustration of severe (albeit somewhat artificial) concept drift:

  • Old Pipeline AUC (0.107): An AUC score of 0.5 represents a random guess. A score significantly below 0.5, like 0.107, proves that the old model is now actively harmful. Its internal logic has become anti-correlated with the new reality, making its predictions worse than random.
  • New Pipeline AUC (0.893): This high score shows that the new model, by training on a sliding window of only recent data, successfully discarded the obsolete information and learned the new, reversed relationship.

Appendix 2 - The 'Engineering Blueprint' (MLflow example)

This code uses the more robust, two-step method for logging and then registering the pipeline discussed above (log the model artifact first, then create a registered model version from it):

import mlflow
import mlflow.sklearn
import os
import shutil
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from scipy.stats import ks_2samp

def generate_data_with_concept(rng, n_samples, mean, std_dev, concept_threshold, reverse_concept=False):
    """Generates data with a probabilistic (noisy) target label."""
    lengths = rng.normal(mean, std_dev, n_samples)
    if not reverse_concept:
        logits = 5 * (lengths - concept_threshold)
    else:
        logits = 5 * (concept_threshold - lengths)
    probabilities = 1 / (1 + np.exp(-logits))
    labels = rng.binomial(1, probabilities)
    return pd.DataFrame({"door_length": lengths, "target": labels})

def run_mlflow_demo():
    """Demonstrates the full MLOps cycle with MLflow."""
    if os.path.exists("mlruns"):
        shutil.rmtree("mlruns")

    MODEL_NAME = "door_classifier_mlflow"
    rng = np.random.default_rng(42)
    hist = generate_data_with_concept(rng, 2000, mean=2.5, std_dev=0.5,
                                      concept_threshold=2.8, reverse_concept=False)
    new = generate_data_with_concept(rng, 1200, mean=4.0, std_dev=0.5,
                                     concept_threshold=3.8, reverse_concept=True)

    client = mlflow.tracking.MlflowClient()

    # Ensure the registered model exists before creating versions
    try:
        client.create_registered_model(MODEL_NAME)
    except mlflow.exceptions.MlflowException:
        pass  # Model already exists

    # ---- 1) Train, Log, and Register the initial model (V1) ----
    print("--- Training and Registering V1 & Setting 'champion' Alias ---")
    with mlflow.start_run(run_name="v1_historic_training") as run_v1:
        pipe_v1 = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
        pipe_v1.fit(hist[["door_length"]], hist["target"])

        # Step 1: Log the model artifact
        model_info_v1 = mlflow.sklearn.log_model(
            sk_model=pipe_v1,
            artifact_path="model",
            input_example=hist[["door_length"]].head()
        )
        # Step 2: Register the logged model to create a version
        model_version_v1 = client.create_model_version(
            name=MODEL_NAME,
            source=model_info_v1.model_uri,
            run_id=run_v1.info.run_id
        )
        # Set the initial "champion" alias to version 1
        client.set_registered_model_alias(name=MODEL_NAME, alias="champion",
                                          version=model_version_v1.version)

    # ---- 2) Detect drift and train a challenger model (V2) ----
    ks_stat, ks_p = ks_2samp(hist["door_length"], new["door_length"])
    if ks_p >= 0.01:
        print("No drift, workflow ends.")
        return

    print("\n--- Drift detected. Training and Registering V2 ---")
    new_train_data = new.iloc[:800]
    with mlflow.start_run(run_name="v2_sliding_window_training") as run_v2:
        pipe_v2 = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
        pipe_v2.fit(new_train_data[["door_length"]], new_train_data["target"])
        mlflow.log_metric("ks_p_value_trigger", ks_p)

        model_info_v2 = mlflow.sklearn.log_model(
            sk_model=pipe_v2,
            artifact_path="model",
            input_example=new_train_data[["door_length"]].head()
        )
        model_version_v2 = client.create_model_version(
            name=MODEL_NAME,
            source=model_info_v2.model_uri,
            run_id=run_v2.info.run_id
        )

    # ---- 3) Fair comparison using registered models ----
    print("\n--- Comparing V1 and V2 on Holdout Data ---")
    holdout = new.iloc[800:].copy()
    X_hold_raw = holdout[["door_length"]]
    y_hold = holdout["target"]

    model_v1 = mlflow.sklearn.load_model(model_uri=f"models:/{MODEL_NAME}/1")
    model_v2 = mlflow.sklearn.load_model(model_uri=f"models:/{MODEL_NAME}/2")

    auc_v1 = roc_auc_score(y_hold, model_v1.predict_proba(X_hold_raw)[:, 1])
    auc_v2 = roc_auc_score(y_hold, model_v2.predict_proba(X_hold_raw)[:, 1])
    print(f"Registered Model V1 AUC: {auc_v1:.3f}")
    print(f"Registered Model V2 AUC: {auc_v2:.3f}")

    # ---- 4) Automate promotion using modern 'aliases' ----
    print("\n--- Automating Promotion Logic with Aliases ---")
    if auc_v2 > auc_v1:
        print(f"Promoting V2 to be the new 'champion' (AUC {auc_v2:.3f} > {auc_v1:.3f}).")
        client.set_registered_model_alias(name=MODEL_NAME, alias="champion",
                                          version=model_version_v2.version)
    else:
        print(f"Keeping V1 as 'champion' (AUC {auc_v1:.3f} >= {auc_v2:.3f}).")
        # No action needed, V1 remains the champion.

if __name__ == '__main__':
    run_mlflow_demo()

This produces:

--- Comparing V1 and V2 on Holdout Data ---
Registered Model V1 AUC: 0.107
Registered Model V2 AUC: 0.893

--- Automating Promotion Logic with Aliases ---
Promoting V2 to be the new 'champion' (AUC 0.893 > 0.107).

This appendix serves as the "Engineering Blueprint". It takes the scientific principle validated in Appendix 1 and demonstrates the basics of how to operationalise it within a robust MLOps framework using MLflow.

The output shows the exact same, correct AUC comparison. However, the additional log messages reveal the full "Ops" lifecycle:

  1. A versioned model artefact is programmatically created and logged for both the old and new models.
  2. The performance comparison is executed by loading these specific, versioned artefacts from a central registry.
  3. The final line, "Promoting V2 to be the new 'champion'...", is the vital automated conclusion. It shows the system programmatically updating the model registry to point the production "champion" alias to the superior model, completing the deployment cycle.

Together, these appendices attempt to provide a complete picture: Appendix 1 tells us why the strategy (hopefully) works, and Appendix 2 shows the basics of how to build it into a reliable, automated production system.

  • I don't have the same output. Mine is 0.9 AUC in both V1 and V2. Commented Aug 19 at 16:17
  • @underflow please can you provide the complete output and I will take a look. Commented Aug 19 at 18:31
  • @robert first of all, thank you so much for this magnificent explanation! I feel more confident, especially since I was opting for strategy 1 all the way. Now I will investigate the other strategies. My question is: when dealing with supervised learning use cases, can we limit the process to model performance monitoring (AUC, ROC, FPR, F1-score, Accuracy) and ignore the statistical metrics and statistical tests? Commented Aug 25 at 12:55
  • @RobertLong Sorry for the delay, and sorry, it was my mistake - the code works well. Commented Aug 30 at 12:10
