Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs
Girish Venkataramani and Avinash Nehemiah, MathWorks, May 2017
Talk Outline
1. Design Deep Learning & Vision Algorithms. Highlights: manage large image sets; automate image labeling; easy access to models; pre-built training frameworks.
2. Accelerate and Scale Training. Highlights: acceleration with GPUs; scale to clusters.
3. High Performance Embedded Implementation. Highlights: automated compilation of MATLAB to CUDA; 14x faster than pyCaffe, 60% faster than C++ Caffe, 3x faster than TensorFlow.
Let's Use Object Detection as an Example
In our example we'll use deep learning to detect vehicles such as trucks, SUVs, and cars.
Transfer Learning Workflow
Load Reference Network (AlexNet, VGG-16, VGG-19, GoogLeNet) → Modify Network Structure → Learn New Weights from the training images and labels (Car, Truck, Large Truck, SUV, Van) → New Classifier.
Manage Large Sets of Images
Easily manage large sets of images: a single line of code gives access to the images, and the datastore operates on disk, in a database, or on a big-data file system. Organize the images in folders (roughly 10,000 images across 5 folders):
    imageData = imageDatastore('vehicles')
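A hedged sketch of this step (the folder name 'vehicles' follows the slide; the 80/20 split is an assumption, not from the talk), letting the folder names supply the class labels:

    % Create a datastore over the folder tree; folder names become class labels.
    imds = imageDatastore('vehicles', ...
        'IncludeSubfolders', true, ...
        'LabelSource', 'foldernames');
    countEachLabel(imds)                                              % images per label
    [trainImds, valImds] = splitEachLabel(imds, 0.8, 'randomized');   % hypothetical train/validation split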
Automate Ground Truth Labeling
Ground truth labeling produces the Labels input of the transfer learning workflow, and MATLAB helps automate it.
Access Reference Models in MATLAB
Easily load reference networks; access models with one line of MATLAB code:
    Net1 = alexnet
    Net2 = vgg16
    Net3 = vgg19
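A minimal sketch of loading and inspecting one of these networks (the pretrained models require their support packages to be installed; the inspection lines are illustrative additions, not from the slide):

    net = alexnet;                        % returns a SeriesNetwork
    net.Layers                            % list AlexNet's 25 layers
    inputSize = net.Layers(1).InputSize   % expected image size, [227 227 3]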
Access Reference Models in MATLAB
Pretrained models are provided through: 1. Reference Models, 2. Model Importer, 3. Tutorials.
Modify Network Structure
Simple MATLAB API to modify layers:
    layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
    layers(25) = classificationLayer('Name', 'VehicleClassifier')
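Put together, a hedged sketch of the structure modification (layer indices 23 and 25 follow the slide and assume AlexNet's layer ordering; the class count of 5 matches the vehicle labels):

    net    = alexnet;
    layers = net.Layers;                                            % copy the reference layers
    layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');             % 5 vehicle classes
    layers(25) = classificationLayer('Name', 'VehicleClassifier');  % new output layer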
Training Object Detectors
Train any network: trainNetwork(datastore, layers, options)
Pre-built frameworks for computer vision:
• Deep learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine learning: ACF, cascade object detectors
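A minimal training sketch, assuming the datastore and modified layers from the previous sketches; the solver settings are illustrative, and the commented detector call shows the pre-built framework pattern with a hypothetical trainingData table of images and box labels:

    options = trainingOptions('sgdm', ...
        'InitialLearnRate', 1e-4, ...
        'MaxEpochs', 10, ...
        'MiniBatchSize', 64);
    vehicleNet = trainNetwork(trainImds, layers, options);   % learn new weights
    % Pre-built detector frameworks follow the same pattern, for example:
    % detector = trainFasterRCNNObjectDetector(trainingData, layers, options);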
Visualizing and Debugging Intermediate Results
Many options for visualization and debugging, with examples to get started: filters and layer activations, Deep Dream, feature visualization, and training accuracy plots.
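Two of these options as a hedged sketch (layer names assume AlexNet; 'peppers.png' is just a stand-in image):

    net = alexnet;
    im  = imresize(imread('peppers.png'), net.Layers(1).InputSize(1:2));
    act = activations(net, im, 'conv1', 'OutputAs', 'channels');  % activations of the first conv layer
    imshow(mat2gray(act(:, :, 1)))                                 % view one feature map
    dream = deepDreamImage(net, 'conv5', 1);                       % Deep Dream image for one channel
    figure, imshow(dream)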
Real World Systems Use More Than Deep Learning
The deep learning vehicle detector's performance degraded under environmental effects such as fog, so a fog removal stage was added.
Challenge: deep learning frameworks do not include "classical" computer vision.
Solution: convert MATLAB code that combines deep learning and computer vision into an embedded implementation.
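A hypothetical sketch of that combination, a classical preprocessing step feeding the deep learning detector in one piece of MATLAB code (the contrast stretch stands in for the demo's actual fog-rectification algorithm; foggyFrame and detector are assumed inputs):

    defogged = imadjust(foggyFrame, stretchlim(foggyFrame));   % classical vision: simple dehaze/contrast fix
    [bboxes, scores] = detect(detector, defogged);             % deep learning: vehicle detection
    annotated = insertObjectAnnotation(defogged, 'rectangle', bboxes, scores);
    imshow(annotated)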
Talk Outline
Design Deep Learning & Vision Algorithms, Accelerate and Scale Training, High Performance Embedded Implementation.
Can you solve "real" problems for production systems with MATLAB?
Accelerate and Scale Computing
A single code change selects the execution environment:
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'CPU')        multi-core CPU
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'GPU')        GPU
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'multi-GPU')  multiple GPUs
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'parallel')   cluster / cloud
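Spelled out as a runnable sketch (in the MATLAB API the option values are lowercase; other solver arguments are omitted here):

    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'cpu');        % multi-core CPU
    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'gpu');        % single GPU
    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'multi-gpu');  % multiple local GPUs
    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'parallel');   % cluster / cloud pool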
After Many Iterations to Find the Best Model
Talk Outline
Design Deep Learning & Vision Algorithms, Accelerate and Scale Training, High Performance Embedded Implementation.
Can you create a high-performance implementation from MATLAB code?
Presenting the MATLAB to CUDA Parallelizing Compiler
Why? AlexNet inference using the MATLAB solution is:
• ~14x faster than pyCaffe and ~50% faster than C++ Caffe
• ~4x faster than TensorFlow, with ~3x less memory use
Sample Generated CUDA Code
Figure: MATLAB source code shown alongside the auto-generated CUDA code.
MATLAB to CUDA Compiler Flow
Front end → control-flow graph intermediate representation (CFG-IR) → traditional compiler optimizations → library function mapping → parallel loop creation → CUDA kernel creation → cudaMemcpy minimization → shared memory synthesis → CUDA kernel optimizations → CUDA code emission.
• Library function mapping: matrix multiply (×) to cuBLAS gemm, linear-system solves to cuSOLVER calls, fft to cuFFT calls, neural network layers (nnet) to cuDNN calls.
• Parallel loop creation: identify loop nests that will become CUDA kernels.
• CUDA kernel creation: convert each parallel loop to a CUDA kernel; threads and blocks are inferred from the loop dimensions.
• cudaMemcpy minimization: perform use-def analysis; cudaMalloc GPU variables and insert memcpy calls.
• Shared memory synthesis: infer data locality, map data to shared memory, and synthesize the shared memory accesses.
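As a hypothetical illustration (not from the talk) of source patterns the library-mapping stage recognizes:

    function [C, z, X, scores] = gpuLibraryDemo(A, B, b, x, net, im)
    % Hypothetical example: each statement exercises one mapping named on the slide.
    C      = A * B;             % matrix multiply   -> cuBLAS gemm
    z      = A \ b;             % linear solve      -> cuSOLVER
    X      = fft(x);            % FFT               -> cuFFT
    scores = predict(net, im);  % network inference -> cuDNN
    end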
MATLAB to CUDA Compiler: Creating Large Parallel Loops
After the front end and traditional compiler optimizations on the CFG-IR, a set of loop optimizations (scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement) runs before library function mapping, parallel loop creation, CUDA kernel creation, cudaMemcpy minimization, shared memory synthesis, and CUDA code emission.
MATLAB to CUDA Compiler: Creating Large Parallel Loops
Example effect of scalarization, loop fusion, and scalar replacement: before, 2 kernels (size N) with 20*N bytes of memory traffic; after, 1 kernel (size N) with 16*N bytes.
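One concrete reading of those numbers, offered as an assumption since the slide's source code is not shown: with single-precision arrays of length N and both C and D kept as outputs, the unfused version runs kernel 1 (read A and B, write C, 12*N bytes) and kernel 2 (read C, write D, 8*N bytes) for 20*N bytes in total, while the fused kernel reads A and B and writes C and D for 16*N bytes.

    C = A .* B;   % kernel 1 before fusion
    D = C + 1;    % kernel 2 before fusion; fusion merges both element-wise loops into one kernel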
cudaMemcpy Minimization
Original MATLAB source, assuming gA, gB, and gC are mapped to GPU memory:
    A(:) = ....            % A is written on the CPU: cudaMemcpy to the GPU *definitely* needed
    C(:) = ....
    for i = 1:N
       ....
       gB = kernel1(gA);
       gA = kernel2(gB);   % data stays on the GPU between kernels: cudaMemcpy *not* needed
       if (some_condition)
          gC = kernel3(gA, gB);
       end
       ....
    end
    .... = C;              % gC is written only conditionally: cudaMemcpy back *may be* needed
Observations:
• Equivalent to partial redundancy elimination (PRE).
• Dynamic strategy: track each variable's memory location with a status flag.
• Use-def analysis determines where to insert the memcpy.
Generated (pseudo) code:
    A(:) = ...
    A_isDirtyOnCpu = true;
    ...
    for i = 1:N
       if (A_isDirtyOnCpu)
          cudaMemcpy(gA, A);
          A_isDirtyOnCpu = false;
       end
       gB = kernel1(gA);
       gA = kernel2(gB);
       if (some_condition)
          gC = kernel3(gA, gB);
          C_isDirtyOnGpu = true;
       end
       ...
    end
    ...
    if (C_isDirtyOnGpu)
       cudaMemcpy(C, gC);
       C_isDirtyOnGpu = false;
    end
    ... = C;
Example: Compiling the Fog-Rectification Algorithm
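The deck does not show the compiler's command-line interface. Purely as an assumption, the sketch below borrows the codegen-style workflow of the later GPU Coder product to suggest how compiling a MATLAB function (a hypothetical fog_rectification.m that takes a uint8 RGB frame) to CUDA might be invoked:

    cfg = coder.gpuConfig('mex');                                                % target: CUDA MEX
    codegen -config cfg fog_rectification -args {ones(480, 640, 3, 'uint8')}     % generate and build CUDA code
    out = fog_rectification_mex(foggyFrame);                                     % call the compiled version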
MATLAB to CUDA Compilation of Computer Vision Applications
Examples: distance transform, fog removal, SURF feature extraction, ray tracing, stereo disparity.
Deep Learning Prediction Performance: AlexNet
Chart: frame rate (FPS) versus batch size (1, 16, 32, 64); frameworks compared include Py-Caffe and TensorFlow. Hardware: Intel Xeon E5-1650 v3 CPU @ 3.50 GHz, NVIDIA Tesla K40c GPU.
Deep Learning Prediction Performance: AlexNet (Memory)
Chart: CPU resident memory and GPU peak memory (nvidia-smi), in GB, versus batch size (1, 16, 32, 64), comparing Py-Caffe, TensorFlow, C++ Caffe, MATLAB on CPU+GPU, and the MATLAB to CUDA compiler. Hardware: Intel Xeon E5-1650 v3 CPU @ 3.50 GHz, NVIDIA Tesla K40c GPU.
Deep Learning Prediction Performance: AlexNet on Jetson (Tegra) TX1
Chart: frame rate (FPS) versus batch size (1, 16, 32, 64, 128), comparing C++ Caffe and the MATLAB to CUDA compiler.
Create CNNs with MATLAB, Deploy with the MATLAB to CUDA Compiler
Examples: AlexNet, YOLO, people detection, and lane detection, with reported frame rates of roughly 20 FPS (K40c), 30 FPS (Tegra X1), and 66 FPS (Tegra X1).
Conclusions
• Design Deep Learning & Vision Algorithms: deep learning design is easy in MATLAB.
• Accelerate and Scale Training: managing datasets and scaling up training is easy in MATLAB.
• High Performance Embedded Implementation: the MATLAB to CUDA compiler is 10x to 14x faster than pyCaffe, 1.3x to 4x faster than TensorFlow, and 1.07x to 1.6x faster than C++ Caffe.
What Next?
• MATLAB to CUDA compiler: sign up for our beta program at www.mathworks.com/matlab-cuda-beta
• Try deep learning in MATLAB
• Visit our booth (#808) and see our demos