Azure Databricks MLOps using MLflow

Overview

This is a template (with samples) for MLOps for Python-based source code in Azure Databricks, using MLflow to manage the machine learning lifecycle without using MLflow Project.

Why not MLflow Project

  • MLflow Project is a format for packaging data science code in a reusable and reproducible way. It is a good fit for many use cases; refer to Run MLflow Projects on Azure Databricks for more details.
  • However, MLflow Project has some limitations in certain scenarios:
    • A new Databricks cluster is created every time an MLflow Project runs on Databricks; running Projects against existing clusters is not supported. This can be a problem because:
      • Run duration increases, since a cluster must be created for each run.
      • Running an MLflow Project requires permission to create Databricks clusters, which may not be granted in restricted environments.
    • An MLflow Project is invoked via the mlflow run command, so a machine with access to both the source code repository and the Databricks environment is needed to execute it; this may not be possible when Databricks is restricted to run behind a VPN.
    • Standard Databricks features such as Widgets and Notebook-scoped Python libraries cannot be used in MLflow Project entry point scripts (the sketch after this list shows how this template uses them instead).
    • It is hard to integrate with other Databricks pipelines (for example, data processing pipelines).
    • Testability of MLOps code (MLflow Project entry point scripts):
      • If the entry points are plain Python scripts, unit testing is hard because they sit outside a Python package.
      • Since they are not Databricks Notebooks, integration testing is also challenging.
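
For contrast, here is a minimal sketch of the approach this template takes instead: the training logic ships as a Python wheel, and a regular Databricks orchestrator notebook drives it with Widgets while plain MLflow tracking records the run. The diabetes_mlops.train module, the train_model function, and its signature are assumptions for illustration, not the repository's actual code.

    # Hypothetical orchestrator notebook cell; it runs on an existing
    # Databricks cluster, and dbutils is provided by the Databricks runtime.
    import mlflow
    from diabetes_mlops.train import train_model  # assumed wheel entry point

    # Standard Databricks Widgets parameterize the run - a feature that
    # MLflow Project entry point scripts cannot use.
    dbutils.widgets.text("experiment_name", "/Shared/diabetes-demo")
    dbutils.widgets.text("alpha", "0.05")

    mlflow.set_experiment(dbutils.widgets.get("experiment_name"))
    with mlflow.start_run(run_name="diabetes-training"):
        alpha = float(dbutils.widgets.get("alpha"))
        mlflow.log_param("alpha", alpha)
        train_model(alpha=alpha)  # assumed signature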

Features

This template provides samples of the following:

  • A way to run Python-based MLOps without MLflow Project, while still using MLflow to manage the end-to-end machine learning lifecycle.
  • A sample machine learning source code structure, along with unit test cases.
  • A sample MLOps code structure, along with unit test cases.
  • A demo setup to try out in your own Azure subscription.
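
The lifecycle management itself stays on the plain MLflow tracking API. A minimal sketch of that lifecycle (the model, parameter, and metric choices below are illustrative, not the template's actual code):

    # Train a model on the scikit-learn diabetes dataset and record the
    # run with MLflow: a parameter, a metric, and the fitted model.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run(run_name="diabetes-ridge"):
        model = Ridge(alpha=0.05).fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        mlflow.log_param("alpha", 0.05)
        mlflow.log_metric("mse", mse)
        mlflow.sklearn.log_model(model, "model")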

Architecture

Model Training

(model training architecture diagram)

Batch Scoring

(batch scoring architecture diagram)

Getting Started

Prerequisites

Development

  1. git clone https://github.com/Azure-Samples/azure-databricks-mlops-mlflow.git
  2. cd azure-databricks-mlops-mlflow
  3. Open the cloned repository in a Visual Studio Code Remote Container
  4. Open a terminal in the Remote Container from Visual Studio Code
  5. Run make install to install the sample packages (diabetes and diabetes_mlops) locally
  6. Run make test to unit test the code locally (a sample test is sketched below)
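
For orientation, a unit test for the ML package might look like the following sketch; the diabetes.train module, the train_model function, and its signature are assumptions for illustration, not the repository's actual code.

    # tests/test_train.py - sketch of the kind of test `make test` could run.
    import pandas as pd
    from diabetes.train import train_model  # hypothetical module and function

    def test_train_model_predicts_one_value_per_row():
        # Tiny in-memory dataset so the test needs no external data.
        df = pd.DataFrame({
            "bmi": [20.0, 25.0, 30.0, 35.0],
            "progression": [50.0, 100.0, 150.0, 200.0],
        })
        model = train_model(df, target="progression")  # assumed signature
        predictions = model.predict(df[["bmi"]])
        assert len(predictions) == len(df)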

Package

  1. Run make dist to build the ML and MLOps wheel packages (diabetes and diabetes_mlops) locally; a packaging sketch follows
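
Each package is an ordinary Python wheel. A minimal sketch of the kind of setup.py that make dist could build from (the name, version, and dependencies are illustrative; the repository's actual packaging metadata may differ):

    # setup.py - minimal packaging sketch for one of the sample packages.
    from setuptools import find_packages, setup

    setup(
        name="diabetes",
        version="0.1.0",
        packages=find_packages(exclude=["tests"]),
        install_requires=["mlflow", "pandas", "scikit-learn"],
    )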

Deployment

  1. Run make databricks-deploy-code to deploy the Databricks orchestrator notebooks and the ML and MLOps Python wheel packages (needed whenever code changes; a rough sketch of the underlying API call follows this list).
  2. Run make databricks-deploy-jobs to deploy the Databricks jobs (needed whenever the job specs change).
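
Under the hood, a code-deploy target has to get the built wheels into the workspace. A rough sketch of one way to do that with the Databricks REST API (the environment variable names, wheel filename, and DBFS path are assumptions):

    # Upload a built wheel to DBFS via the REST API (dbfs/put endpoint).
    # Note: this single-shot endpoint limits contents to about 1 MB; larger
    # files need the create/add-block/close streaming endpoints.
    import base64
    import os

    import requests

    host = os.environ["DATABRICKS_HOST"]  # e.g. https://adb-....azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]

    with open("dist/diabetes-0.1.0-py3-none-any.whl", "rb") as f:
        contents = base64.b64encode(f.read()).decode()

    response = requests.post(
        f"{host}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "path": "/FileStore/wheels/diabetes-0.1.0-py3-none-any.whl",
            "contents": contents,
            "overwrite": True,
        },
    )
    response.raise_for_status()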

Run training and batch scoring

  1. To trigger training, execute make run-diabetes-model-training
  2. To trigger batch scoring, execute make run-diabetes-batch-scoring
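
These targets trigger the deployed Databricks jobs. A minimal sketch of doing the same via the Jobs REST API (the job name and environment variable names are assumptions; jobs/list paginates, which the sketch ignores):

    # Trigger a deployed Databricks job by name using the Jobs API 2.1.
    import os

    import requests

    host = os.environ["DATABRICKS_HOST"]
    headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    # Find the job id by its (assumed) name.
    jobs = requests.get(f"{host}/api/2.1/jobs/list", headers=headers).json()
    job_id = next(
        j["job_id"]
        for j in jobs.get("jobs", [])
        if j["settings"]["name"] == "diabetes-model-training"
    )

    # Start a run and print its id.
    run = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers=headers,
        json={"job_id": job_id},
    )
    run.raise_for_status()
    print("Started run:", run.json()["run_id"])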

NOTE: the Databricks environment must be created before anything can be deployed or run; follow the Demo chapter below to create a demo environment.

Demo

  1. Create a Databricks workspace and a storage account (Azure Data Lake Storage Gen2)
    1. Create an Azure account
    2. Deploy the resources from the custom ARM template
  2. Initialize Databricks (create the cluster, base workspace, MLflow experiment, and secret scope)
    1. Get the Databricks CLI host and token
    2. Authenticate the Databricks CLI: make databricks-authenticate
    3. Execute make databricks-init
  3. Create an Azure Data Lake Storage Gen2 container and upload data
    1. Create an Azure Data Lake Storage Gen2 container named diabetes
    2. Upload the diabetes data files as blobs into the diabetes container
  4. Put secrets for mounting the ADLS Gen2 storage using a shared access key (see the mount sketch after this list)
    1. Get the name of the Azure Data Lake Storage Gen2 account created in step 1
    2. Get the shared key for the Azure Data Lake Storage Gen2 account
    3. Execute make databricks-secrets-put to put the secrets into the Databricks secret scope
  5. Package and deploy into Databricks (Databricks jobs, orchestrator notebooks, and ML and MLOps Python wheel packages)
    1. Execute make deploy
  6. Run the Databricks jobs
    1. To trigger training, execute make run-diabetes-model-training
    2. To trigger batch scoring, execute make run-diabetes-batch-scoring
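
For reference, mounting the diabetes container from a Databricks notebook with the shared key stored in step 4 might look like this sketch (the secret scope and key names are assumptions; dbutils is provided by the Databricks runtime):

    # Mount the ADLS Gen2 container using the account's shared access key
    # read from the Databricks secret scope (scope/key names are assumed).
    scope = "azure-databricks-mlops-mlflow"
    storage_account = dbutils.secrets.get(scope=scope, key="storage-account-name")
    shared_key = dbutils.secrets.get(scope=scope, key="storage-account-shared-key")

    dbutils.fs.mount(
        source=f"abfss://diabetes@{storage_account}.dfs.core.windows.net/",
        mount_point="/mnt/diabetes",
        extra_configs={
            f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": shared_key
        },
    )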

Repository Structure

  • ml_data - dummy data for the sample model
  • ml_ops - sample MLOps code along with unit test cases, orchestrator notebooks, and deployment setup
  • ml_source - sample ML code along with unit test cases
  • Makefile - for building and testing in the local environment
  • requirements.txt - Python dependencies

Resources

  1. Azure Databricks
  2. MLflow
  3. MLflow Project
  4. Run MLflow Projects on Azure Databricks
  5. Databricks Widgets
  6. Databricks Notebook-scoped Python libraries
  7. Databricks CLI
  8. Azure Data Lake Storage Gen2

Glossary

  1. Application developer: a role focused mainly on operationalizing machine learning.
  2. Data scientist: a role that performs the data science parts of the project.
