This is a template or sample for MLOps for Python-based source code in Azure Databricks using MLflow, without using MLflow Project.
- MLflow Project is a format for packaging data science code in a reusable and reproducible way. It is a perfect fit for several use cases; refer to Run MLflow Projects on Azure Databricks for more details.
- However, in some scenarios MLflow Project presents challenges:
  - A new Databricks cluster is created each time an MLflow Project is run on Databricks; running Projects against existing clusters is not supported. This can be a problem because:
    - Run duration increases, since a cluster is created for every run.
    - Permission to create Databricks clusters is required, which may not be possible in some restricted environments.
  - An MLflow Project is invoked via the `mlflow run` command, so a machine that executes `mlflow run` is needed with access to both the source code repository and the Databricks environment, which may not be possible when Databricks is restricted to run behind a VPN.
  - Standard Databricks features like Widgets and Notebook-scoped Python libraries cannot be used in MLflow Project entry point scripts.
  - Hard to integrate with other Databricks pipelines (for example, data processing pipelines).
  - Testability of MLOps code (MLflow Project entry point scripts):
    - If they are plain Python scripts, unit testing may be hard since they live outside of a Python package (see the sketch after this list).
    - Since they are not Databricks Notebooks, integration testing may be challenging.
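To illustrate the testability point: once the MLOps code lives in an installable Python package instead of MLflow Project entry point scripts, standard unit testing applies. A minimal sketch, assuming a hypothetical packaged training function `diabetes.train.train` that returns a metrics dict (the import path and return contract are illustrative, not this repo's actual API):

```python
# test_train.py - pytest sketch; the import path and return contract of
# `train` are assumptions for illustration, not this repo's actual API.
from diabetes.train import train


def test_train_returns_metrics():
    metrics = train()
    # Assumed contract: training returns a dict of metric name -> value.
    assert isinstance(metrics, dict)
    assert "rmse" in metrics
```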
This template provides samples with the following features:
- A way to run Python-based MLOps without MLflow Project, while still using MLflow to manage the end-to-end machine learning lifecycle (sketched after this list)
- Sample machine learning source code structure along with unit test cases
- Sample MLOps code structure along with unit test cases
- Demo setup to try on the user's subscription:
  - Azure Databricks workspace
  - Azure Data Lake Storage Gen2 account
- Visual Studio Code in local environment for development
- Docker in local environment for development
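A minimal sketch of the "MLflow without MLflow Project" idea: an orchestrator notebook imports the wheel-packaged code and wraps it in an MLflow run. The experiment path, import path, and return value of `train` are assumptions for illustration, not this repo's actual API:

```python
import mlflow

# Assumed import from the wheel-built ML package; the real package layout
# in this repo may differ.
from diabetes.train import train

# Experiment path is illustrative; in this repo the experiment is created
# during Databricks initialization.
mlflow.set_experiment("/Shared/diabetes-experiment")

with mlflow.start_run(run_name="diabetes-training"):
    metrics = train()            # a plain Python call, no `mlflow run` needed
    mlflow.log_metrics(metrics)  # MLflow still manages the ML lifecycle
```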
- `git clone https://github.com/Azure-Samples/azure-databricks-mlops-mlflow.git`
- `cd azure-databricks-mlops-mlflow`
- Open the cloned repository in a Visual Studio Code Remote Container
- Open a terminal in Remote Container from Visual Studio Code
- `make install` to install the sample packages (`diabetes` and `diabetes_mlops`) locally
- `make test` to unit test the code locally
- `make dist` to build the ML and MLOps wheel packages (`diabetes` and `diabetes_mlops`) locally
- `make databricks-deploy-code` to deploy the Databricks orchestrator notebooks and the ML and MLOps Python wheel packages, whenever code changes
- `make databricks-deploy-jobs` to deploy the Databricks Jobs, whenever the job specs change
- To trigger training, execute `make run-diabetes-model-training`
- To trigger batch scoring, execute `make run-diabetes-batch-scoring`
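For context, these make targets trigger existing Databricks Jobs; the same can be done directly against the Databricks Jobs 2.1 REST API. A hedged sketch (the environment variable names and job ID handling are illustrative, not necessarily what the make targets do):

```python
import os

import requests

# Hedged sketch: trigger an existing Databricks job via the Jobs 2.1 REST
# API. The environment variable names and job ID are illustrative.
host = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-xxxx.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]
job_id = int(os.environ["TRAINING_JOB_ID"])  # hypothetical variable

response = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id},
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json()["run_id"])
```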
[TODO]
NOTE: For deployment and running, the Databricks environment must be created first; to create a demo environment, follow the Demo chapter.
- `ml_data` - dummy data for the sample model
- `ml_ops` - sample MLOps code along with unit test cases, orchestrator, and deployment setup
- `ml_source` - sample ML code along with unit test cases
- `Makefile` - for build and test in the local environment
- `requirements.txt` - Python dependencies
1. Create a Databricks workspace, a storage account (Azure Data Lake Storage Gen2), and Application Insights
   - Create an Azure account
   - Deploy resources from the custom ARM template
2. Initialize Databricks (create cluster, base workspace, MLflow experiment, secret scope)
   - Get the Databricks CLI host and token
   - Authenticate the Databricks CLI by executing `make databricks-authenticate`
   - Execute `make databricks-init` (a sketch of the MLflow experiment part follows this step)
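A hedged sketch of one part of the initialization: creating the MLflow experiment in the Databricks workspace. It assumes the Databricks CLI (or `DATABRICKS_HOST`/`DATABRICKS_TOKEN`) is already authenticated; the experiment path is illustrative:

```python
import mlflow

# Point MLflow at the Databricks-hosted tracking server; authentication is
# picked up from the Databricks CLI profile or environment variables.
mlflow.set_tracking_uri("databricks")

# The experiment path is illustrative, not necessarily this repo's.
experiment_id = mlflow.create_experiment("/Shared/diabetes-experiment")
print("Created experiment:", experiment_id)
```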
3. Create an Azure Data Lake Storage Gen2 container and upload data
   - Create an Azure Data Lake Storage Gen2 container named `diabetes`
   - Upload the diabetes data files as blobs into the container named `diabetes` (see the sketch after this step)
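A hedged sketch of the container creation and upload using the `azure-storage-blob` SDK; the credential handling and local file name are illustrative:

```python
import os

from azure.storage.blob import BlobServiceClient

# Illustrative credentials; in practice these come from the storage account
# created in step 1. The environment variable names are assumptions.
account = os.environ["STORAGE_ACCOUNT_NAME"]
key = os.environ["STORAGE_ACCOUNT_KEY"]

service = BlobServiceClient(
    account_url=f"https://{account}.blob.core.windows.net", credential=key
)
container = service.create_container("diabetes")  # raises if it already exists

with open("ml_data/diabetes.csv", "rb") as data:  # hypothetical data file
    container.upload_blob(name="diabetes.csv", data=data, overwrite=True)
```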
4. Put secrets to mount ADLS Gen2 storage using a shared access key
   - Get the Azure Data Lake Storage Gen2 account name created in step 1
   - Get the shared key for the Azure Data Lake Storage Gen2 account
   - Execute `make databricks-secrets-put` to put the secrets in the Databricks secret scope (the mount itself is sketched after this step)
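A hedged sketch of how such a mount typically looks inside a Databricks notebook (where `dbutils` is predefined); the secret scope, key names, and account name are illustrative, not necessarily what this repo uses:

```python
# Runs inside a Databricks notebook, where `dbutils` is predefined.
account = "mystorageaccount"  # illustrative ADLS Gen2 account name
shared_key = dbutils.secrets.get(scope="demo-scope", key="adls-shared-key")

dbutils.fs.mount(
    source=f"abfss://diabetes@{account}.dfs.core.windows.net/",
    mount_point="/mnt/diabetes",
    extra_configs={
        f"fs.azure.account.key.{account}.dfs.core.windows.net": shared_key
    },
)
```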
5. Put the Application Insights key as a secret in the Databricks secret scope (optional)
   - Get the Application Insights key created in step 1
   - Execute `make databricks-add-app-insights-key` to put the secret in the Databricks secret scope (a logging sketch follows this step)
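A hedged sketch of how the key can then be used to send logs to Application Insights, here via the `opencensus-ext-azure` package (the package choice and secret names are assumptions for illustration):

```python
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

# Inside Databricks, the key would typically be read from the secret scope,
# e.g. dbutils.secrets.get(scope="demo-scope", key="app-insights-key");
# the placeholder below is illustrative.
instrumentation_key = "<application-insights-key>"

logger = logging.getLogger("diabetes_mlops")
logger.addHandler(
    AzureLogHandler(connection_string=f"InstrumentationKey={instrumentation_key}")
)
logger.warning("Diabetes model training started")  # shows up in App Insights traces
```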
6. Package and deploy into Databricks (Databricks Jobs, orchestrator notebooks, ML and MLOps Python wheel packages)
   - Execute `make deploy`
7. Run Databricks Jobs
   - To trigger training, execute `make run-diabetes-model-training`
   - To trigger batch scoring, execute `make run-diabetes-batch-scoring`
8. Expected results
- Azure Databricks
- MLflow
- MLflow Project
- Run MLflow Projects on Azure Databricks
- Databricks Widgets
- Databricks Notebook-scoped Python libraries
- Databricks CLI
- Azure Data Lake Storage Gen2
- Application Insights
- Application developer: a role that works mainly on operationalizing machine learning.
- Data scientist: a role that performs the data science parts of the project.