Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Ryosuke Iwanaga, Solutions Architect, Amazon Web Services Japan
October 2016
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Recommendation and DSSTNE
• Data science productivity with AWS
Note: Details are not the actual Amazon case, but a general pattern
Recommendation and DSSTNE
Product Recommendations What are people who bought items A, B, C … Z most likely to purchase next?
Input and Output
Input: purchase history for each customer
Output: probability of buying each product, for each customer
Machine Learning for Recommendation
Lots of algorithms: Matrix Factorization, Logistic Regression, Naïve Bayes, etc. => Neural Network
Neural Networks for Product Recommendations
[Diagram: Input layer (10K-10M units) → Hidden layer (100-1K units) → Output layer (10K-10M units)]
This Is A Huge Sparse Data Problem
• Uncompressed sparse data either eats a lot of memory or it eats a lot of bandwidth uploading it to the GPU
• Naively running networks with uncompressed sparse data leads to lots of multiplications of zero by zero. This wastes memory, power, and time
• Product Recommendation Networks can have billions of parameters that cannot fit in a single GPU
So, summarizing...
Framework Requirements (2014)
• Efficient support for large input and output layers
• Efficient handling of sparse data (i.e. don't store zero)
• Automagic multi-GPU support for large networks and scaling
• Avoids multiplying zero and/or by zero
• 24 hour or less training and recommendations turnaround
• Human-readable descriptions of networks
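Picking up the sparse-data point above: a minimal Scala sketch, not from the talk, contrasting a dense purchase vector with an index-based sparse representation. The catalog size and purchased items are made-up numbers for illustration.

// Hypothetical: one customer's purchases over a 1M-item catalog
val catalogSize = 1000000

// Dense representation: one float per catalog item, almost all zeros.
// ~4 MB per customer, and a naive network multiplies through every zero.
val dense = Array.fill[Float](catalogSize)(0.0f)
dense(42) = 1.0f
dense(100003) = 1.0f

// Sparse representation: store only the non-zero indices.
// A handful of ints per customer, which is what DSSTNE-style engines exploit.
val sparseIndices = Array(42, 100003)

println(s"dense values stored: ${dense.length}, sparse indices stored: ${sparseIndices.length}")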
DSSTNE: Deep Sparse Scalable Tensor Network Engine*
• A Neural Network framework released into OSS by Amazon
• Optimized for large sparse data problems and for fully connected layers
• Extremely efficient model-parallel multi-GPU support
• 100% Deterministic Execution
• Full SM 3.x, 5.x, and 6.x support (Kepler or better GPUs)
• Distributed training support OOTB (~20 lines of MPI calls)
* Pronounced "Destiny"
Describes Neural Networks As JSON Objects

{
    "Version" : 0.7,
    "Name" : "AE",
    "Kind" : "FeedForward",
    "SparsenessPenalty" : { "p" : 0.5, "beta" : 2.0 },
    "ShuffleIndices" : false,
    "Denoising" : { "p" : 0.2 },
    "ScaledMarginalCrossEntropy" : {
        "oneTarget" : 1.0, "zeroTarget" : 0.0, "oneScale" : 1.0, "zeroScale" : 1.0
    },
    "Layers" : [
        { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "input", "Sparse" : true },
        { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
        { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true }
    ],
    "ErrorFunction" : "ScaledMarginalCrossEntropy"
}
Summary for DSSTNE
• Very efficient performance for sparse, fully-connected NNs
• Multi-GPU via model parallelism and data parallelism
• Networks declared in a human-readable JSON format
• 100% deterministic execution
Data science productivity with AWS
Productivity
Agile iteration is the most important thing for productivity: design => train => predict => evaluate => design => …
Training: GPU (DSSTNE and others)
Pre/post processing: CPU
How to unify these different workloads? Data scientists don't want to use too many tools
What are Containers?
OS virtualization, process isolation, images, automation
[Diagram: Server, Guest OS, Bins/Libs, App1, App2]
Deep Learning meets Docker (Containers)
A lot of Deep Learning frameworks: DSSTNE, Caffe, Theano, TensorFlow, etc.
To compare frameworks on the same input and output, containerize each framework
Just swap the container image and configuration
No more worrying about setting up machines!
Spark moves at interactive speed
[Diagram: an RDD lineage A-F split into Stages 1-3 by map, join, filter, and groupBy, with cached partitions marked]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle
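As a rough illustration of that pattern (not code from the talk; the bucket, paths, and columns are hypothetical), a Spark 2.x job can cache a cleaned DataFrame once and then iterate on joins and aggregations interactively:

import org.apache.spark.sql.SparkSession

object PurchaseStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("purchase-stats").getOrCreate()
    import spark.implicits._

    // Hypothetical purchase log on S3: (customerId, itemId, price)
    val purchases = spark.read.parquet("s3://example-bucket/purchases/")
      .filter($"price" > 0)
      .cache()  // keep the cleaned partitions in memory for repeated queries

    // Hypothetical item catalog: (itemId, category)
    val items = spark.read.parquet("s3://example-bucket/items/")

    // groupBy and join both reuse the cached partitions instead of re-reading S3
    purchases.groupBy($"itemId").count()
      .join(items, "itemId")
      .orderBy($"count".desc)
      .show(10)

    spark.stop()
  }
}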
Apache Zeppelin notebook to develop queries
Architecture
Control the CPU cluster and the GPU cluster
Both CPU and GPU jobs are submitted via the Spark driver
CPU jobs: normal Spark tasks running on Amazon EMR
GPU jobs: Spark submits jobs to Amazon ECS
Not only DSSTNE but also other DL frameworks, packaged with Docker
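A rough sketch of that control flow from the driver's point of view; everything here (paths, method names) is a hypothetical placeholder rather than the actual Amazon pipeline. CPU preprocessing runs as Spark tasks on EMR, the result is staged on S3, and the GPU step is handed to ECS:

import org.apache.spark.sql.SparkSession

object TrainingPipeline {
  // Placeholder: implemented with the ECS RunTask API (see the sketch later in this deck)
  def submitGpuTraining(inputPath: String, configPath: String): String = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("recs-pipeline").getOrCreate()

    // CPU work: preprocess purchase history into the training input on EMR
    val purchases = spark.read.parquet("s3://example-bucket/purchases/")
    purchases.select("customerId", "itemId").distinct()
      .write.mode("overwrite").parquet("s3://example-bucket/dsstne-input/")

    // GPU work: submitted from the same driver, executed on the ECS GPU cluster
    submitGpuTraining("s3://example-bucket/dsstne-input/", "s3://example-bucket/config.json")

    spark.stop()
  }
}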
Amazon EMR
Why EMR?
• Automation
• Decouple
• Elastic
• Integration
• Low-cost
• Current
Why EMR? Automation
EC2 provisioning, cluster setup, Hadoop configuration, installing applications, job submission, monitoring and failure handling
Why EMR? Decoupled Architecture
• Separate compute and storage
• Resize and shut down with no data loss
• Point multiple clusters at the same data on Amazon S3
• Easily evolve infrastructure as technology evolves
• HDFS for iterative and disk I/O intensive workloads
• Save with Spot and Reserved Instances
Why EMR? Decouple Storage and Compute
[Diagram: Amazon S3 as the shared storage layer, fed by Amazon Kinesis (Streams, Firehose) and used by a persistent cluster for interactive queries (Spark-SQL | Presto | Impala), transient clusters for batch jobs (X hours nightly, add/remove nodes), ETL jobs, and workload-specific clusters (different sizes, different versions), with a Hive external metastore, e.g. Amazon RDS]
create external table t_name(..) ... location s3://bucketname/path-to-file/
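That decoupling is also what the Spark jobs in this pattern rely on: any cluster can read from and write back to the same S3 data. A minimal sketch, assuming a SparkSession named spark (as in spark-shell or a Zeppelin notebook) and made-up bucket paths:

// Any EMR cluster, persistent or transient, can point at the same S3 data
val events = spark.read.json("s3://example-bucket/raw-events/2016/10/")

events.createOrReplaceTempView("events")
spark.sql("SELECT customerId, COUNT(*) AS purchases FROM events GROUP BY customerId")
  .write.mode("overwrite")
  .parquet("s3://example-bucket/aggregates/purchases-per-customer/")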
EMR 5.0 - Applications
Amazon ECS
Amazon EC2 Container Service (ECS)
• Container Management at Any Scale
• Flexible Container Placement
• Integration with the AWS Platform
Components of Amazon ECS
• Task: the actual containers running on instances
• Task Definition: definition of the containers and environment for a task
• Cluster: fleet of EC2 instances on which tasks run
• Manager: manages cluster resources and the state of tasks
• Scheduler: places tasks considering cluster status
• Agent: coordinates between EC2 instances and the Manager
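To make the Task Definition piece concrete, here is a hedged sketch of registering one for a DSSTNE container using the AWS SDK for Java (v1) from Scala. The family name, image URI, resource sizes, and command are hypothetical, not from the talk:

import com.amazonaws.services.ecs.AmazonECSClientBuilder
import com.amazonaws.services.ecs.model.{ContainerDefinition, RegisterTaskDefinitionRequest}

val ecs = AmazonECSClientBuilder.standard().build()

// Hypothetical container definition for a DSSTNE training image stored in ECR
val container = new ContainerDefinition()
  .withName("dsstne-train")
  .withImage("123456789012.dkr.ecr.us-east-1.amazonaws.com/dsstne:latest")
  .withCpu(4096)
  .withMemory(30000)
  .withCommand("train", "-c", "/config/config.json")  // illustrative; the real entry point depends on the image

val request = new RegisterTaskDefinitionRequest()
  .withFamily("dsstne-train")
  .withContainerDefinitions(container)

val taskDefArn = ecs.registerTaskDefinition(request).getTaskDefinition.getTaskDefinitionArn
println(s"Registered task definition: $taskDefArn")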
How Amazon ECS runs a Task
[Diagram: Scheduler, Manager, Cluster, Task Definition, Task, and Agent interacting to place and run a task]
Integration with Spark and ECS
• Install the AWS SDK for Java on the EMR cluster
• Create a Task Definition for each Deep Learning framework
• Call the RunTask API
• The ECS Scheduler will try to find enough space to run it
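A hedged sketch of that RunTask call as it might be issued from the Spark driver, again using the AWS SDK for Java (v1) from Scala. The cluster name, task definition family, container name, and environment variables are hypothetical; this fills in the submitGpuTraining placeholder from the earlier pipeline sketch:

import com.amazonaws.services.ecs.AmazonECSClientBuilder
import com.amazonaws.services.ecs.model.{ContainerOverride, KeyValuePair, RunTaskRequest, TaskOverride}
import scala.collection.JavaConverters._

def submitGpuTraining(inputPath: String, configPath: String): String = {
  val ecs = AmazonECSClientBuilder.standard().build()

  // Override the container environment so one task definition can be reused
  // for different datasets and network configurations
  val overrides = new TaskOverride().withContainerOverrides(
    new ContainerOverride()
      .withName("dsstne-train")
      .withEnvironment(
        new KeyValuePair().withName("INPUT_PATH").withValue(inputPath),
        new KeyValuePair().withName("CONFIG_PATH").withValue(configPath)))

  val result = ecs.runTask(new RunTaskRequest()
    .withCluster("gpu-cluster")          // ECS cluster of GPU instances
    .withTaskDefinition("dsstne-train")  // registered earlier
    .withCount(1)
    .withOverrides(overrides))

  // The ECS scheduler places the task wherever it finds enough capacity
  result.getTasks.asScala.head.getTaskArn
}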
Training: Model parallel
Prediction: Data parallel
Why AWS?
• Scalability
• Fully-managed services
• GPU instances
Summary
Amazon Personalization runs on AWS
• Spark and Zeppelin as the single interface for data scientists
• DSSTNE helps run DL on huge, sparse NNs
• Amazon EMR for CPU and Amazon ECS for GPU
• You can do it!
Thank you!