Lineage for the output table with predictions is not tracked in MLflow when training a model

Save the output table as a CSV file and log it as an artifact.

Last published at: April 25th, 2025

Problem

When training a model using data stored in a table in Unity Catalog, the lineage to the upstream dataset(s) is tracked using mlflow.log_input, which logs the input table with the MLflow run. However, lineage for the output table (containing predictions) is not tracked.

Cause

There is no built-in method in MLflow to log output tables, similar to mlflow.log_input for input tables.

Solution

Save the output table as a CSV file and log it as an artifact. This way, you can indirectly track the lineage of the output table. You can use the following code.

import mlflow
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import tempfile
import os

# Load dataset
dataset = mlflow.data.load_delta(table_name="<your-catalog>.<your-schema>.<your-table-name>", version="0")
pd_df = dataset.df.toPandas()
X = pd_df.drop("species", axis=1)
# RandomForestRegressor needs a numeric target, so "species" must be numerically encoded
y = pd_df["species"]

# Train model and log input table
with mlflow.start_run() as run:
    clf = RandomForestRegressor(n_estimators=100)
    clf.fit(X, y)
    mlflow.log_input(dataset, "training")

    # Make predictions
    predictions = clf.predict(X)
    pd_df["predictions"] = predictions

    # Save predictions to an output table
    # (spark is the SparkSession that Databricks notebooks provide)
    output_table_name = "<your-catalog>.<your-schema>.<your-iris-output>"
    output_df = spark.createDataFrame(pd_df)
    output_df.write.format("delta").mode("overwrite").saveAsTable(output_table_name)

    # Log the output table
    with tempfile.TemporaryDirectory() as tmpdir:
        temp_path = os.path.join(tmpdir, "predictions.csv")
        pd_df.to_csv(temp_path, index=False)
        # Log the temporary file as an artifact
        mlflow.log_artifact(temp_path, "output_table")

print(f"Output table {output_table_name} created successfully.")

To locate the logged copy of the output table, open the Artifacts page of the specific MLflow run. The file is stored as predictions.csv in the output_table folder.
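The logged file can also be read back programmatically instead of through the UI. The following is a minimal sketch, assuming the artifact was logged under output_table/predictions.csv as in the code above. The helper name load_logged_predictions and its download parameter are hypothetical and exist only for illustration; the actual download is done by mlflow.artifacts.download_artifacts.

```python
import csv


def load_logged_predictions(run_id, artifact_path="output_table/predictions.csv", download=None):
    """Return the predictions CSV logged with an MLflow run as a list of row dicts.

    `download` is a hypothetical hook (not part of MLflow) that lets the helper
    be exercised without a tracking server. By default it is MLflow's own
    download_artifacts function.
    """
    if download is None:
        # Imported lazily so the module can be loaded without MLflow installed.
        from mlflow.artifacts import download_artifacts
        download = download_artifacts

    # download_artifacts copies the artifact locally and returns the local path.
    local_path = download(run_id=run_id, artifact_path=artifact_path)

    with open(local_path, newline="") as f:
        return list(csv.DictReader(f))
```

In a notebook, calling load_logged_predictions("<your-run-id>") should return one dictionary per row of the logged file. All values come back as strings because the csv module does not infer types.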

With these steps, a copy of the predictions is stored with the run that produced them, so the output table can be traced back to the corresponding run.
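The CSV artifact holds a copy of the data but does not record which table it was written to. One possible complement, not part of the original solution, is to tag the run with the output table name. The sketch below assumes you call it inside the active run. The tag keys output_table.name and output_table.version and both helper names are hypothetical; only mlflow.set_tags is an MLflow call. Tags of this kind are searchable in MLflow but are not Unity Catalog lineage.

```python
def output_table_tags(table_name, table_version=None):
    """Build run tags that describe the output table (tag keys are illustrative)."""
    tags = {"output_table.name": table_name}
    if table_version is not None:
        # MLflow stores tag values as strings.
        tags["output_table.version"] = str(table_version)
    return tags


def tag_run_with_output_table(table_name, table_version=None):
    """Attach the output table tags to the currently active MLflow run."""
    # Imported lazily so output_table_tags can be used without MLflow installed.
    import mlflow

    mlflow.set_tags(output_table_tags(table_name, table_version))
```

For example, calling tag_run_with_output_table("<your-catalog>.<your-schema>.<your-iris-output>") inside the with mlflow.start_run() block records the table name on the same run that holds the CSV artifact.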
