Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs
Girish Venkataramani and Avinash Nehemiah, MathWorks, May 2017
Talk Outline
1. Design Deep Learning & Vision Algorithms. Highlights: manage large image sets; automate image labeling; easy access to models; pre-built training frameworks.
2. Accelerate and Scale Training. Highlights: acceleration with GPUs; scale to clusters.
3. High Performance Embedded Implementation. Highlights: automated compilation of MATLAB to CUDA; 14x faster than pyCaffe, 60% faster than C++ Caffe, 3x faster than TensorFlow.
Let's Use Object Detection as an Example
In our example we'll use deep learning to detect vehicles such as trucks, SUVs, and cars.
Transfer Learning Workflow
Load Reference Network (AlexNet, VGG-16, VGG-19, GoogLeNet) → Modify Network Structure → Learn New Weights from the training images and labels (Car, Truck, Large Truck, SUV, Van) → New Classifier.
Manage Large Sets of Images
Easily manage large sets of images: a single line of code gives access to the images, and the datastore operates on disk, in a database, or on a big-data file system. Organize the images in folders (roughly 10,000 images across 5 folders):
    imageData = imageDatastore('vehicles')
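A hedged sketch of this step (the folder name 'vehicles' follows the slide; the 80/20 split is an assumption, not from the talk), letting the folder names supply the class labels:

    % Create a datastore over the folder tree; folder names become class labels.
    imds = imageDatastore('vehicles', ...
        'IncludeSubfolders', true, ...
        'LabelSource', 'foldernames');
    countEachLabel(imds)                                              % images per label
    [trainImds, valImds] = splitEachLabel(imds, 0.8, 'randomized');   % hypothetical train/validation split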
Automate Ground Truth Labeling
Ground truth labeling produces the Labels input of the transfer learning workflow, and MATLAB helps automate it.
Access Reference Models in MATLAB
Easily load reference networks; access models with one line of MATLAB code:
    Net1 = alexnet
    Net2 = vgg16
    Net3 = vgg19
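A minimal sketch of loading and inspecting one of these networks (the pretrained models require their support packages to be installed; the inspection lines are illustrative additions, not from the slide):

    net = alexnet;                        % returns a SeriesNetwork
    net.Layers                            % list AlexNet's 25 layers
    inputSize = net.Layers(1).InputSize   % expected image size, [227 227 3]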
Access Reference Models in MATLAB
Pretrained models are provided through: 1. Reference Models, 2. Model Importer, 3. Tutorials.
Modify Network Structure
Simple MATLAB API to modify layers:
    layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
    layers(25) = classificationLayer('Name', 'VehicleClassifier')
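Put together, a hedged sketch of the structure modification (layer indices 23 and 25 follow the slide and assume AlexNet's layer ordering; the class count of 5 matches the vehicle labels):

    net    = alexnet;
    layers = net.Layers;                                            % copy the reference layers
    layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');             % 5 vehicle classes
    layers(25) = classificationLayer('Name', 'VehicleClassifier');  % new output layer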
Training Object Detectors
Train any network: trainNetwork(datastore, layers, options)
Pre-built frameworks for computer vision:
• Deep learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine learning: ACF, cascade object detectors
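A minimal training sketch, assuming the datastore and modified layers from the previous sketches; the solver settings are illustrative, and the commented detector call shows the pre-built framework pattern with a hypothetical trainingData table of images and box labels:

    options = trainingOptions('sgdm', ...
        'InitialLearnRate', 1e-4, ...
        'MaxEpochs', 10, ...
        'MiniBatchSize', 64);
    vehicleNet = trainNetwork(trainImds, layers, options);   % learn new weights
    % Pre-built detector frameworks follow the same pattern, for example:
    % detector = trainFasterRCNNObjectDetector(trainingData, layers, options);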
Visualizing and Debugging Intermediate Results
Many options for visualization and debugging, with examples to get started: filters and layer activations, Deep Dream, feature visualization, and training accuracy plots.
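Two of these options as a hedged sketch (layer names assume AlexNet; 'peppers.png' is just a stand-in image):

    net = alexnet;
    im  = imresize(imread('peppers.png'), net.Layers(1).InputSize(1:2));
    act = activations(net, im, 'conv1', 'OutputAs', 'channels');  % activations of the first conv layer
    imshow(mat2gray(act(:, :, 1)))                                 % view one feature map
    dream = deepDreamImage(net, 'conv5', 1);                       % Deep Dream image for one channel
    figure, imshow(dream)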
Real World Systems Use More Than Deep Learning
The deep learning vehicle detector's performance degraded under environmental effects such as fog, so a fog removal stage was added.
Challenge: deep learning frameworks do not include "classical" computer vision.
Solution: convert MATLAB code that combines deep learning and computer vision into an embedded implementation.
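A hypothetical sketch of that combination, a classical preprocessing step feeding the deep learning detector in one piece of MATLAB code (the contrast stretch stands in for the demo's actual fog-rectification algorithm; foggyFrame and detector are assumed inputs):

    defogged = imadjust(foggyFrame, stretchlim(foggyFrame));   % classical vision: simple dehaze/contrast fix
    [bboxes, scores] = detect(detector, defogged);             % deep learning: vehicle detection
    annotated = insertObjectAnnotation(defogged, 'rectangle', bboxes, scores);
    imshow(annotated)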
Talk Outline
Design Deep Learning & Vision Algorithms, Accelerate and Scale Training, High Performance Embedded Implementation.
Can you solve "real" problems for production systems with MATLAB?
Accelerate and Scale Computing
A single code change selects the execution environment:
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'CPU')        multi-core CPU
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'GPU')        GPU
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'multi-GPU')  multiple GPUs
    trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'parallel')   cluster / cloud
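Spelled out as a runnable sketch (in the MATLAB API the option values are lowercase; other solver arguments are omitted here):

    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'cpu');        % multi-core CPU
    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'gpu');        % single GPU
    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'multi-gpu');  % multiple local GPUs
    opts = trainingOptions('sgdm', 'ExecutionEnvironment', 'parallel');   % cluster / cloud pool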
After Many Iterations to Find the Best Model
Talk Outline
Design Deep Learning & Vision Algorithms, Accelerate and Scale Training, High Performance Embedded Implementation.
Can you create a high-performance implementation from MATLAB code?
Presenting the MATLAB to CUDA Parallelizing Compiler
Why? AlexNet inference using the MATLAB solution is:
• ~14x faster than pyCaffe and ~50% faster than C++ Caffe
• ~4x faster than TensorFlow, with ~3x less memory use
Sample Generated CUDA Code
Figure: MATLAB source code shown alongside the auto-generated CUDA code.
MATLAB to CUDA Compiler Flow
Front end → control-flow graph intermediate representation (CFG-IR) → traditional compiler optimizations → library function mapping → parallel loop creation → CUDA kernel creation → cudaMemcpy minimization → shared memory synthesis → CUDA kernel optimizations → CUDA code emission.
• Library function mapping: matrix multiply (×) to cuBLAS gemm, linear-system solves to cuSOLVER calls, fft to cuFFT calls, neural network layers (nnet) to cuDNN calls.
• Parallel loop creation: identify loop nests that will become CUDA kernels.
• CUDA kernel creation: convert each parallel loop to a CUDA kernel; threads and blocks are inferred from the loop dimensions.
• cudaMemcpy minimization: perform use-def analysis; cudaMalloc GPU variables and insert memcpy calls.
• Shared memory synthesis: infer data locality, map data to shared memory, and synthesize the shared memory accesses.
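As a hypothetical illustration (not from the talk) of source patterns the library-mapping stage recognizes:

    function [C, z, X, scores] = gpuLibraryDemo(A, B, b, x, net, im)
    % Hypothetical example: each statement exercises one mapping named on the slide.
    C      = A * B;             % matrix multiply   -> cuBLAS gemm
    z      = A \ b;             % linear solve      -> cuSOLVER
    X      = fft(x);            % FFT               -> cuFFT
    scores = predict(net, im);  % network inference -> cuDNN
    end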
MATLAB to CUDA Compiler: Creating Large Parallel Loops
After the front end and traditional compiler optimizations on the CFG-IR, a set of loop optimizations (scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement) runs before library function mapping, parallel loop creation, CUDA kernel creation, cudaMemcpy minimization, shared memory synthesis, and CUDA code emission.
MATLAB to CUDA Compiler: Creating Large Parallel Loops
Example effect of scalarization, loop fusion, and scalar replacement: before, 2 kernels (size N) with 20*N bytes of memory traffic; after, 1 kernel (size N) with 16*N bytes.
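One concrete reading of those numbers, offered as an assumption since the slide's source code is not shown: with single-precision arrays of length N and both C and D kept as outputs, the unfused version runs kernel 1 (read A and B, write C, 12*N bytes) and kernel 2 (read C, write D, 8*N bytes) for 20*N bytes in total, while the fused kernel reads A and B and writes C and D for 16*N bytes.

    C = A .* B;   % kernel 1 before fusion
    D = C + 1;    % kernel 2 before fusion; fusion merges both element-wise loops into one kernel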
cudaMemcpy Minimization
Original MATLAB source, assuming gA, gB, and gC are mapped to GPU memory:
    A(:) = ....            % A is written on the CPU: cudaMemcpy to the GPU *definitely* needed
    C(:) = ....
    for i = 1:N
       ....
       gB = kernel1(gA);
       gA = kernel2(gB);   % data stays on the GPU between kernels: cudaMemcpy *not* needed
       if (some_condition)
          gC = kernel3(gA, gB);
       end
       ....
    end
    .... = C;              % gC is written only conditionally: cudaMemcpy back *may be* needed
Observations:
• Equivalent to partial redundancy elimination (PRE).
• Dynamic strategy: track each variable's memory location with a status flag.
• Use-def analysis determines where to insert the memcpy.
Generated (pseudo) code:
    A(:) = ...
    A_isDirtyOnCpu = true;
    ...
    for i = 1:N
       if (A_isDirtyOnCpu)
          cudaMemcpy(gA, A);
          A_isDirtyOnCpu = false;
       end
       gB = kernel1(gA);
       gA = kernel2(gB);
       if (some_condition)
          gC = kernel3(gA, gB);
          C_isDirtyOnGpu = true;
       end
       ...
    end
    ...
    if (C_isDirtyOnGpu)
       cudaMemcpy(C, gC);
       C_isDirtyOnGpu = false;
    end
    ... = C;
Example: Compiling the Fog-Rectification Algorithm
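The deck does not show the compiler's command-line interface. Purely as an assumption, the sketch below borrows the codegen-style workflow of the later GPU Coder product to suggest how compiling a MATLAB function (a hypothetical fog_rectification.m that takes a uint8 RGB frame) to CUDA might be invoked:

    cfg = coder.gpuConfig('mex');                                                % target: CUDA MEX
    codegen -config cfg fog_rectification -args {ones(480, 640, 3, 'uint8')}     % generate and build CUDA code
    out = fog_rectification_mex(foggyFrame);                                     % call the compiled version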
MATLAB to CUDA Compilation of Computer Vision Applications
Examples: distance transform, fog removal, SURF feature extraction, ray tracing, stereo disparity.
Deep Learning Prediction Performance: AlexNet
Chart: frame rate (FPS) versus batch size (1, 16, 32, 64); frameworks compared include Py-Caffe and TensorFlow. Hardware: Intel Xeon E5-1650 v3 CPU @ 3.50 GHz, NVIDIA Tesla K40c GPU.
Deep Learning Prediction Performance: AlexNet (Memory)
Chart: CPU resident memory and GPU peak memory (nvidia-smi), in GB, versus batch size (1, 16, 32, 64), comparing Py-Caffe, TensorFlow, C++ Caffe, MATLAB on CPU+GPU, and the MATLAB to CUDA compiler. Hardware: Intel Xeon E5-1650 v3 CPU @ 3.50 GHz, NVIDIA Tesla K40c GPU.
Deep Learning Prediction Performance: AlexNet on Jetson (Tegra) TX1
Chart: frame rate (FPS) versus batch size (1, 16, 32, 64, 128), comparing C++ Caffe and the MATLAB to CUDA compiler.
Create CNNs with MATLAB, Deploy with the MATLAB to CUDA Compiler
Examples: AlexNet, YOLO, people detection, and lane detection, with reported frame rates of roughly 20 FPS (K40c), 30 FPS (Tegra X1), and 66 FPS (Tegra X1).
Conclusions
• Design Deep Learning & Vision Algorithms: deep learning design is easy in MATLAB.
• Accelerate and Scale Training: managing datasets and scaling up training is easy in MATLAB.
• High Performance Embedded Implementation: the MATLAB to CUDA compiler is 10x to 14x faster than pyCaffe, 1.3x to 4x faster than TensorFlow, and 1.07x to 1.6x faster than C++ Caffe.
What Next?
• MATLAB to CUDA compiler: sign up for our beta program at www.mathworks.com/matlab-cuda-beta
• Try deep learning in MATLAB
• Visit our booth (#808) and see our demos