Introduction to the TVM Open Source Deep Learning Compiler Stack

Luis Ceze, with Tianqi Chen, Thierry Moreau, Jared Roesch, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Chien-Yu Lin, Haichen Shen, Leyuan Wang, Yuwei Hu, Carlos Guestrin, Arvind Krishnamurthy, Zach Tatlock, and many in the Apache TVM community!

© 2020 OctoML and University of Washington

A perfect storm

• Growing set of requirements: cost, latency, power, security & privacy
• Cambrian explosion of models, workloads, and use cases (CNN, GAN, RNN, MLP, DQNN)
• Rapidly evolving ML software ecosystem
• Silicon scaling limitations (Dennard and Moore)
• Cambrian explosion of HW backends; heterogeneous HW

Current Dominant Deep Learning Systems Landscape

• Orchestrators: Azure ML, GCP Datalab
• Frameworks and inference engines
• DL compilers: open source, automated end-to-end optimization framework for deep learning
• Kernel libraries: cuDNN, NNPack, MKL-DNN (hand optimized)
• Hardware

Stack

End-to-end, framework-to-metal open stack. Research and deployment.

[Diagram: High-Level Differentiable IR; Tensor Expression IR; LLVM, CUDA, Metal; VTA; Edge FPGA, Cloud FPGA, ASIC]

VTA: open source synthesizable deep learning accelerator design

Automated by Machine Learning

[Diagram: High-Level Differentiable IR; Tensor Expression IR; LLVM, CUDA, Metal; VTA; Edge FPGA, Cloud FPGA, ASIC]

ML-based optimization: AutoTVM, AutoVTA, hardware fleet

Reference: TVM: Automated End-to-end Optimizations for Deep Learning. Chen et al. OSDI 18

End-user perspective: Compile & deploy

    import tvm
    from tvm import relay

    # Import the Keras model into Relay, then build it for the chosen target.
    mod, params = relay.frontend.from_keras(keras_resnet50)
    graph, lib, params = relay.build(mod, target, params=params)

[Diagram labels: Compile, Deploy]

Open Source Community and Impact

• Open source: 420+ contributors from UW, Berkeley, Cornell, UCLA, Amazon, Huawei, NTT, Facebook, Microsoft, Qualcomm, Alibaba, Intel, …
• Incubated as Apache TVM. Independent governance, allowing competitors to collaborate.
• Used in production at leading companies: deep learning compiler service; DSP/tensor engine for mobile; mobile and server optimizations; cloud-side model optimization

Existing Deep Learning Frameworks

[Diagram: Frameworks; High-level data flow graph; Primitive tensor operators such as Conv2D; Hardware]

Primitive operators are offloaded to a heavily optimized DNN operator library, e.g. cuDNN.

Engineering costs limit progress

• cuDNN: engineering intensive
• New operator introduced by operator fusion optimization; potential benefit: 1.5x speedup

Our approach: Learning-based Learning System

[Diagram: Frameworks; High-level data flow graph and optimizations; Machine Learning based Program Optimizer; Hardware]

Directly generate optimized programs for new operator workloads and hardware.

Tensor Compilation/Optimization as a search problem

Tensor Expression (Specification):

    C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

[Diagram: Tensor Expression; Search Space of Possible Program Optimizations; Low-level Program Variants]

Search Space Example (1/3)

Search Space of Possible Program Optimizations: Vanilla Code

Tensor Expression (Specification):

    C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
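
As a rough illustration of what "vanilla code" could look like, the tensor expression above can be read as a plain triple loop. This is a pure-Python sketch, not code generated by TVM; the name `matmul_vanilla` and the list-of-lists data layout are assumptions made for the example.

```python
def matmul_vanilla(A, B):
    """Compute C[y][x] = sum over k of A[k][y] * B[k][x] with plain loops.

    A is K x M and B is K x N (lists of lists), so C is M x N,
    matching the (m, n) output shape of the tensor expression.
    """
    K, M, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for y in range(M):
        for x in range(N):
            for k in range(K):          # reduction axis
                C[y][x] += A[k][y] * B[k][x]
    return C


A = [[1, 2], [3, 4], [5, 6]]            # K = 3, M = 2
B = [[1, 0, 2], [0, 1, 2], [1, 1, 0]]   # K = 3, N = 3
C = matmul_vanilla(A, B)
print(C)
```

Every other variant in the search space has to compute the same values as this loop nest; only the loop structure and its mapping to hardware change.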

Search Space Example (2/3)

Search Space of Possible Program Optimizations: Loop Tiling for Locality

Tensor Expression (Specification):

    C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
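
Loop tiling can be sketched in the same spirit: each loop is split into an outer loop over tiles and an inner loop within a tile, so that a small block of data is reused while it is still close to the processor. The code below is a hand-written pure-Python illustration, not TVM output; `matmul_tiled`, `reference`, and the `tile` parameter are hypothetical names.

```python
def reference(A, B):
    """Untiled result, used only to check the tiled version."""
    K, M, N = len(A), len(A[0]), len(B[0])
    return [[sum(A[k][y] * B[k][x] for k in range(K)) for x in range(N)]
            for y in range(M)]


def matmul_tiled(A, B, tile=2):
    """Same computation as C[y][x] = sum_k A[k][y] * B[k][x], tiled loops."""
    K, M, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    # Outer loops walk over tiles; min() handles sizes that are
    # not a multiple of the tile size.
    for y0 in range(0, M, tile):
        for x0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # Inner loops stay inside one tile.
                for y in range(y0, min(y0 + tile, M)):
                    for x in range(x0, min(x0 + tile, N)):
                        for k in range(k0, min(k0 + tile, K)):
                            C[y][x] += A[k][y] * B[k][x]
    return C


A = [[k * 5 + y for y in range(5)] for k in range(3)]   # K = 3, M = 5
B = [[k + 2 * x for x in range(4)] for k in range(3)]   # K = 3, N = 4
same = matmul_tiled(A, B, tile=2) == reference(A, B)
print(same)
```

The tile size is exactly the kind of knob that makes up the search space: every value gives a correct program, but the running time differs by hardware.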

Search Space Example (3/3)

Search Space of Possible Program Optimizations: Map to Accelerators

Tensor Expression (Specification):

    C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Optimization space is really large…

• Dimensions of the space: loop transformations, thread bindings, cache locality, thread cooperation, tensorization, latency hiding
• Billions of possible optimization choices
• Typically explored via human intuition. How can we automate this? Auto-tuning is too slow.

Tensor Expression (Specification):

    C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))
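
To get a feel for how quickly such a space grows, the sketch below counts tiling choices for a single operator. The numbers are illustrative assumptions (three loops of extent 1024, each split into four nested levels, plus two hypothetical knobs), not figures from the talk.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(extent, parts):
    """Number of ordered ways to write `extent` as a product of `parts` positive integers."""
    if parts == 1:
        return 1
    return sum(count_splits(extent // d, parts - 1)
               for d in range(1, extent + 1) if extent % d == 0)


per_loop = count_splits(1024, 4)        # ways to tile one loop into 4 levels
tiling_choices = per_loop ** 3          # three loops: y, x, and the reduction k
total = tiling_choices * 4 * 2          # hypothetical: 4 unroll depths, vectorize on/off
print(per_loop, tiling_choices, total)
```

Even this small model of the space reaches hundreds of millions of configurations for one operator, before thread bindings or tensorization are considered.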

Problem Formalization

[Diagram: Expression; Search Space; AutoOpt; Optimization Configuration; Code Generator; Program]

Objective: cost (execution time)

Black-box Optimization

[Diagram: Expression; Search Space; AutoTVM; Code Generator]

• Try each configuration until we find a good one
• Challenge: lots of experimental trials, each trial costs ~1 second
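
The black-box loop can be sketched as follows. The measurement is a synthetic stand-in (`fake_measure`) for compiling and running a candidate on real hardware, and the search space of `(tile_y, tile_x)` pairs is hypothetical; neither comes from the talk.

```python
import random


def black_box_search(space, measure, trials, seed=0):
    """Measure `trials` randomly chosen configurations; return (best_cost, best_config)."""
    rng = random.Random(seed)
    best_cost, best_cfg = None, None
    for cfg in rng.sample(space, min(trials, len(space))):
        cost = measure(cfg)     # in practice: generate code and time it on hardware
        if best_cost is None or cost < best_cost:
            best_cost, best_cfg = cost, cfg
    return best_cost, best_cfg


# Hypothetical search space: (tile_y, tile_x) pairs.
space = [(ty, tx) for ty in (1, 2, 4, 8, 16, 32) for tx in (1, 2, 4, 8, 16, 32)]


def fake_measure(cfg):
    """Synthetic cost with its minimum at (8, 16); stands in for a real timing run."""
    ty, tx = cfg
    return abs(ty - 8) + abs(tx - 16) + 1.0


best_cost, best_cfg = black_box_search(space, fake_measure, trials=len(space))
print(best_cfg, best_cost)
```

With one trial per configuration the search is exhaustive, which is only feasible for toy spaces; at roughly one second per real measurement, a large space cannot be covered this way.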

Cost-model Driven Approach

[Diagram: Expression; Search Space; AutoOpt; Cost Model; Code Generator]

• Use a cost model to pick the configuration
• Challenge: need a reliable cost model per hardware

Statistical Cost Model

[Diagram: Expression; Search Space; AutoOpt; Code Generator; Training data; Learning; Statistical Cost Model]

• Our approach: use machine learning to learn a statistical cost model
• Benefit: automatically adapts to the hardware type
• Important: how to design the cost model
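
A minimal sketch of the idea, under loud assumptions: a nearest-neighbour predictor stands in for the learned cost model, the raw configuration tuple is used as the feature vector purely for brevity, and `fake_measure` replaces a real hardware run. None of these choices are taken from AutoTVM itself.

```python
def knn_predict(train, cfg, k=3):
    """Predict the cost of cfg as the mean cost of its k nearest measured configs."""
    if not train:
        return 0.0                      # no data yet: every candidate looks the same
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], cfg)),
    )[:k]
    return sum(cost for _, cost in nearest) / len(nearest)


def model_guided_search(space, measure, rounds, batch):
    """Alternate between ranking candidates with the model and measuring the top few."""
    history = []                        # (config, measured cost) training data
    measured = set()
    for _ in range(rounds):
        candidates = [c for c in space if c not in measured]
        if not candidates:
            break
        # Cheap step: rank every unmeasured candidate with the model.
        candidates.sort(key=lambda c: knn_predict(history, c))
        # Expensive step: measure only the most promising ones.
        for cfg in candidates[:batch]:
            history.append((cfg, measure(cfg)))
            measured.add(cfg)
    best = min(history, key=lambda item: item[1])
    return best, history


space = [(ty, tx) for ty in (1, 2, 4, 8, 16, 32) for tx in (1, 2, 4, 8, 16, 32)]


def fake_measure(cfg):
    """Synthetic cost with its minimum at (8, 16); stands in for a real timing run."""
    ty, tx = cfg
    return abs(ty - 8) + abs(tx - 16) + 1.0


best, history = model_guided_search(space, fake_measure, rounds=4, batch=4)
print(best, len(history))
```

Only 16 of the 36 configurations are measured here; the model decides which ones. This sketch is purely greedy, so it can miss the optimum; a practical tuner would also need some exploration.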

AutoTVM Overview

[Diagram: Expression; Search Space; AutoTVM; Code Generator; Training data; Learning; Statistical Cost Model]

• Cost model: O(microseconds) inference vs. O(seconds) execution
• Inputs: high-level configurations and the low-level Abstract Syntax Tree (AST)
• Benefit: the low-level AST is a common representation (general, task invariant)
• Model: statistical features of the AST + your favourite model
• Transfer learning: a shared cost model uses historical data from related operators (tasks, e.g. Conv2D, Matmul) for new tasks; this needs a task-invariant representation

Reference: Learning to Optimize Tensor Programs. Chen et al. NeurIPS 18

Does it work?

• Better than hand-tuned code in a few minutes
• 1.50x faster than hand-tuned in steady state
• AutoTVM + transferred model: 3x to 10x faster tuning w/ transfer learning

Device Fleet: Distributed Test Bed for AutoTVM

• Resource Manager (Tracker): resource allocation via resource tokens
• Devices, each running an RPC runtime (RPC RT): Nvidia GPU server (CUDA), Android phone (OpenCL), Zynq FPGA board (bitstream)
• AutoTVM experiments (Experiment 1, Experiment 2, …) each hold a persistent remote session
• Scale up optimization; resource sharing
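
The token-based sharing of devices can be pictured with a toy resource manager. This is a deliberately simplified sketch and not TVM's RPC tracker: the `Tracker` class, its methods, and the device names are all hypothetical.

```python
from collections import defaultdict, deque


class Tracker:
    """Toy resource manager: hands out device tokens by device kind."""

    def __init__(self):
        self._free = defaultdict(deque)     # kind -> idle device names

    def register(self, kind, device):
        """A device announces itself as available."""
        self._free[kind].append(device)

    def request(self, kind):
        """Return an idle device of this kind, or None if all are busy."""
        return self._free[kind].popleft() if self._free[kind] else None

    def release(self, kind, device):
        """Give a device back so another experiment can use it."""
        self._free[kind].append(device)


tracker = Tracker()
tracker.register("cuda", "gpu-server-0")
tracker.register("opencl", "android-phone-0")

token_a = tracker.request("cuda")   # experiment 1 gets the GPU server
token_b = tracker.request("cuda")   # experiment 2 finds no idle GPU
tracker.release("cuda", token_a)
token_c = tracker.request("cuda")   # after the release, experiment 2 can proceed
print(token_a, token_b, token_c)
```

Several tuning experiments can share one pool of boards this way, each holding a device only while it is measuring.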

State-of-the-art performance

[Charts: Nvidia Titan X; ARM GPU (MALI); ARM CPU (Cortex-A53)]

Key point: TVM offers good performance with low manual effort.

Stack

End-to-end, framework-to-metal open stack. Research and deployment.

[Diagram: High-Level Differentiable IR; Tensor Expression IR; LLVM, CUDA, Metal; VTA; Edge FPGA, Cloud FPGA, ASIC]

VTA: open source synthesizable deep learning accelerator design

DL Accelerator Design Challenges

Workloads: CNN, GAN, RNN, MLP, DQNN

• Keeping up with algorithmic changes (VTA: two-level ISA, templatized design)
• Finding the right generality/efficiency trade-off (VTA: templatized design + HW parameter search)
• Enabling a “day-0” software stack on top (VTA: tight coupling with TVM)

VTA: Open & Flexible Deep Learning Accelerator

[Diagram: Current TVM Stack; VTA Runtime & JIT Compiler; VTA Hardware/Software Interface (ISA); VTA MicroArchitecture; VTA Simulator]

• Move hardware complexity to software via a two-level ISA
• Runtime JIT-compiles accelerator microcode
• Native support in TVM
• Support for heterogeneous devices (split graph)
• Support for secure execution (soon)

VTA Open Source Deep Learning accelerator

• Decoupled access-execute with explicit software control
• Two-level ISA: JIT breaks multi-cycle “CISC” instructions into micro-ops
• Enables model retargeting without HW changes
• Focused on FPGA deployments so far. Exploring custom silicon possibilities

Note: HW-SW Blueprint for Flexible Deep Learning Acceleration. Moreau et al. IEEE Micro 2019.

µTVM - Bare-metal model deployment for edge devices

• Optimize, compile and package model for standalone bare-metal deployment
• See recent demo on TVM for Azure Sphere deployment.

[Diagram: ML model; µTVM; optimized model (optimized operators, standalone runtime); flash code; edge device board (ARM, MIPS, RISC-V, ...)]

Coming Soon - Ultra low bit-width quantization

• Automatic quantization: 5-20x performance gains with reasonable accuracy loss
• TVM supports flexible code generation for a variety of data types

[Chart: Squeezenet on RaspberryPi 3]
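
For readers unfamiliar with low bit-width quantization, the numeric side of it can be sketched as below. This is a generic symmetric uniform quantizer written for illustration, not TVM's quantization pass; `quantize`, `dequantize`, and the sample weights are assumptions, and the sketch says nothing about the performance gains quoted above.

```python
def quantize(values, bits):
    """Symmetric uniform quantization of floats to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits, 1 for 2 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0        # float value of one integer step
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


weights = [0.4, -1.0, 0.2, 0.8]
q8, s8 = quantize(weights, bits=8)
q2, s2 = quantize(weights, bits=2)              # only the levels -1, 0, 1 remain

err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err2 = max(abs(a - b) for a, b in zip(weights, dequantize(q2, s2)))
print(q8, q2, err8, err2)
```

Fewer bits mean cheaper arithmetic and less memory traffic, at the price of a larger rounding error, which is the accuracy trade-off the slide refers to.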

What about training?

• Direct support for training in Apache TVM coming soon!
• Automatic generation of gradient programs
• Support for customized data types and training on FPGAs

[Diagram: High-Level Differentiable IR; Automatic Differentiation; Gradient Program for Training; Tensor Expression IR; LLVM, CUDA, Metal; VTA; Edge FPGA, Cloud FPGA, ASIC; Standalone training deployment; Standalone inference deployment]
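
To make "automatic differentiation" concrete, here is a tiny reverse-mode sketch. It uses operator overloading on a hypothetical `Var` class and is not how TVM derives gradient programs from its IR; it only illustrates the chain-rule bookkeeping that any such system automates.

```python
class Var:
    """A scalar value that records how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuples of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(out):
    """Propagate d(out)/d(node) to every node, outputs first."""
    order, seen = [], set()

    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)             # post-order: parents before children

    visit(out)
    out.grad = 1.0
    for v in reversed(order):       # children before parents
        for parent, local in v.parents:
            parent.grad += local * v.grad


x, w, b = Var(3.0), Var(2.0), Var(1.0)
y = x * w + b * x                   # dy/dx = w + b, dy/dw = x, dy/db = x
backward(y)
print(y.value, x.grad, w.grad, b.grad)
```

Because `x` is used twice, its gradient is the sum of two contributions; accumulating with `+=` is what makes that case come out right.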

Other Ongoing TVM efforts

• Autoscheduling (Zheng et al. OSDI’20 @ UCBerkeley)
• Automatic synthesis of operator implementations (Cowan et al. CGO’20 @ UWash)
• Sparse support (NLP, graph convolutional neural networks, etc…)
• Secure enclaves
• …
• Join the community!

https://tvm.ai

• 2nd TVM conference on Dec 5, 2019. 200+ ppl last year!
• Video tutorials
• iPython notebook tutorials
• 3rd TVM conference on Dec 3/4, 2020. https://tvmconf.org

https://octoml.ai

What I would like you to remember…

TVM is an emerging open source standard for ML compilation and optimization.

TVM offers:
• Improved time to market for ML
• Performance
• Unified support for CPU, GPU, accelerators
• On the framework of your choice

OctoML is here to help you succeed in your ML deployment needs.

End-to-end, framework-to-metal open stack. Research and deployment.

[Diagram: High-Level Differentiable IR; Tensor Expression IR; LLVM, CUDA, Metal; VTA; Edge FPGA, Cloud FPGA, ASIC]

“Introduction to the TVM Open Source Deep Learning Compiler Stack,” a Presentation from OctoML
