This is a template or sample for MLOps for Python-based source code in Azure Databricks using MLflow, without using MLflow Project.
- MLflow Project is a format for packaging data science code in a reusable and reproducible way. It is a perfect fit for several use cases; refer to Run MLflow Projects on Azure Databricks for more details.
- However, in some scenarios MLflow Project presents challenges:
  - A new Databricks cluster is created each time an MLflow Project is run on Databricks; running Projects against existing clusters is not supported. This can be a problem because:
    - Run duration increases, since a cluster is created for every run.
    - Permission to create Databricks clusters is required, which may not be possible in some restricted environments.
  - An MLflow Project is invoked via the `mlflow run` command, so a machine that executes `mlflow run` is needed with access to both the source code repository and the Databricks environment, which may not be possible when Databricks is restricted to run behind a VPN.
  - Standard Databricks features like Widgets and Notebook-scoped Python libraries cannot be used in MLflow Project entry point scripts.
  - Hard to integrate with other Databricks pipelines (for example, data processing pipelines).
  - Testability of MLOps code (MLflow Project entry point scripts):
    - If they are plain Python scripts, unit testing may be hard since they live outside of a Python package (see the sketch after this list).
    - Since they are not Databricks Notebooks, integration testing may be challenging.
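To illustrate the testability point: once the MLOps code lives in an installable Python package instead of MLflow Project entry point scripts, standard unit testing applies. A minimal sketch, assuming a hypothetical packaged training function `diabetes.train.train` that returns a metrics dict (the import path and return contract are illustrative, not this repo's actual API):

```python
# test_train.py - pytest sketch; the import path and return contract of
# `train` are assumptions for illustration, not this repo's actual API.
from diabetes.train import train


def test_train_returns_metrics():
    metrics = train()
    # Assumed contract: training returns a dict of metric name -> value.
    assert isinstance(metrics, dict)
    assert "rmse" in metrics
```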
This template provides samples with the following features:
- A way to run Python-based MLOps without MLflow Project, while still using MLflow to manage the end-to-end machine learning lifecycle (sketched after this list)
- Sample machine learning source code structure along with unit test cases
- Sample MLOps code structure along with unit test cases
- Demo setup to try on the user's subscription:
  - Azure Databricks workspace
  - Azure Data Lake Storage Gen2 account
- Visual Studio Code in local environment for development
- Docker in local environment for development
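A minimal sketch of the "MLflow without MLflow Project" idea: an orchestrator notebook imports the wheel-packaged code and wraps it in an MLflow run. The experiment path, import path, and return value of `train` are assumptions for illustration, not this repo's actual API:

```python
import mlflow

# Assumed import from the wheel-built ML package; the real package layout
# in this repo may differ.
from diabetes.train import train

# Experiment path is illustrative; in this repo the experiment is created
# during Databricks initialization.
mlflow.set_experiment("/Shared/diabetes-experiment")

with mlflow.start_run(run_name="diabetes-training"):
    metrics = train()            # a plain Python call, no `mlflow run` needed
    mlflow.log_metrics(metrics)  # MLflow still manages the ML lifecycle
```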
- `git clone https://github.com/Azure-Samples/azure-databricks-mlops-mlflow.git`
- `cd azure-databricks-mlops-mlflow`
- Open the cloned repository in a Visual Studio Code Remote Container
- Open a terminal in Remote Container from Visual Studio Code
- `make install` to install the sample packages (`diabetes` and `diabetes_mlops`) locally
- `make test` to unit test the code locally
- `make dist` to build the ML and MLOps wheel packages (`diabetes` and `diabetes_mlops`) locally
- `make databricks-deploy-code` to deploy the Databricks orchestrator notebooks and the ML and MLOps Python wheel packages, whenever code changes
- `make databricks-deploy-jobs` to deploy the Databricks Jobs, whenever the job specs change
- To trigger training, execute `make run-diabetes-model-training`
- To trigger batch scoring, execute `make run-diabetes-batch-scoring`
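For context, these make targets trigger existing Databricks Jobs; the same can be done directly against the Databricks Jobs 2.1 REST API. A hedged sketch (the environment variable names and job ID handling are illustrative, not necessarily what the make targets do):

```python
import os

import requests

# Hedged sketch: trigger an existing Databricks job via the Jobs 2.1 REST
# API. The environment variable names and job ID are illustrative.
host = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-xxxx.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]
job_id = int(os.environ["TRAINING_JOB_ID"])  # hypothetical variable

response = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": job_id},
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json()["run_id"])
```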
[TODO]
NOTE: For deployment and running, the Databricks environment must be created first; to create a demo environment, follow the Demo chapter.
- `ml_data` - dummy data for the sample model
- `ml_ops` - sample MLOps code along with unit test cases, orchestrator, and deployment setup
- `ml_source` - sample ML code along with unit test cases
- `Makefile` - for build and test in the local environment
- `requirements.txt` - Python dependencies
1. Create a Databricks workspace, a storage account (Azure Data Lake Storage Gen2), and Application Insights
   - Create an Azure account
   - Deploy resources from the custom ARM template
2. Initialize Databricks (create cluster, base workspace, MLflow experiment, secret scope)
   - Get the Databricks CLI host and token
   - Authenticate the Databricks CLI by executing `make databricks-authenticate`
   - Execute `make databricks-init` (a sketch of the MLflow experiment part follows this step)
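A hedged sketch of one part of the initialization: creating the MLflow experiment in the Databricks workspace. It assumes the Databricks CLI (or `DATABRICKS_HOST`/`DATABRICKS_TOKEN`) is already authenticated; the experiment path is illustrative:

```python
import mlflow

# Point MLflow at the Databricks-hosted tracking server; authentication is
# picked up from the Databricks CLI profile or environment variables.
mlflow.set_tracking_uri("databricks")

# The experiment path is illustrative, not necessarily this repo's.
experiment_id = mlflow.create_experiment("/Shared/diabetes-experiment")
print("Created experiment:", experiment_id)
```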
3. Create an Azure Data Lake Storage Gen2 container and upload data
   - Create an Azure Data Lake Storage Gen2 container named `diabetes`
   - Upload the diabetes data files as blobs into the container named `diabetes` (see the sketch after this step)
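A hedged sketch of the container creation and upload using the `azure-storage-blob` SDK; the credential handling and local file name are illustrative:

```python
import os

from azure.storage.blob import BlobServiceClient

# Illustrative credentials; in practice these come from the storage account
# created in step 1. The environment variable names are assumptions.
account = os.environ["STORAGE_ACCOUNT_NAME"]
key = os.environ["STORAGE_ACCOUNT_KEY"]

service = BlobServiceClient(
    account_url=f"https://{account}.blob.core.windows.net", credential=key
)
container = service.create_container("diabetes")  # raises if it already exists

with open("ml_data/diabetes.csv", "rb") as data:  # hypothetical data file
    container.upload_blob(name="diabetes.csv", data=data, overwrite=True)
```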
4. Put secrets to mount ADLS Gen2 storage using a shared access key
   - Get the Azure Data Lake Storage Gen2 account name created in step 1
   - Get the shared key for the Azure Data Lake Storage Gen2 account
   - Execute `make databricks-secrets-put` to put the secrets in the Databricks secret scope (the mount itself is sketched after this step)
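A hedged sketch of how such a mount typically looks inside a Databricks notebook (where `dbutils` is predefined); the secret scope, key names, and account name are illustrative, not necessarily what this repo uses:

```python
# Runs inside a Databricks notebook, where `dbutils` is predefined.
account = "mystorageaccount"  # illustrative ADLS Gen2 account name
shared_key = dbutils.secrets.get(scope="demo-scope", key="adls-shared-key")

dbutils.fs.mount(
    source=f"abfss://diabetes@{account}.dfs.core.windows.net/",
    mount_point="/mnt/diabetes",
    extra_configs={
        f"fs.azure.account.key.{account}.dfs.core.windows.net": shared_key
    },
)
```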
5. Put the Application Insights key as a secret in the Databricks secret scope (optional)
   - Get the Application Insights key created in step 1
   - Execute `make databricks-add-app-insights-key` to put the secret in the Databricks secret scope (a logging sketch follows this step)
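A hedged sketch of how the key can then be used to send logs to Application Insights, here via the `opencensus-ext-azure` package (the package choice and secret names are assumptions for illustration):

```python
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

# Inside Databricks, the key would typically be read from the secret scope,
# e.g. dbutils.secrets.get(scope="demo-scope", key="app-insights-key");
# the placeholder below is illustrative.
instrumentation_key = "<application-insights-key>"

logger = logging.getLogger("diabetes_mlops")
logger.addHandler(
    AzureLogHandler(connection_string=f"InstrumentationKey={instrumentation_key}")
)
logger.warning("Diabetes model training started")  # shows up in App Insights traces
```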
6. Package and deploy into Databricks (Databricks Jobs, orchestrator notebooks, ML and MLOps Python wheel packages)
   - Execute `make deploy`
7. Run Databricks Jobs
   - To trigger training, execute `make run-diabetes-model-training`
   - To trigger batch scoring, execute `make run-diabetes-batch-scoring`
8. Expected results
- Azure Databricks
- MLflow
- MLflow Project
- Run MLflow Projects on Azure Databricks
- Databricks Widgets
- Databricks Notebook-scoped Python libraries
- Databricks CLI
- Azure Data Lake Storage Gen2
- Application Insights
- Application developer: a role that works mainly on operationalizing machine learning.
- Data scientist: a role that performs the data science parts of the project.