WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Gurpreet Singh, Microsoft DASK and Apache Spark #UnifiedAnalytics #SparkAISummit
3#UnifiedAnalytics #SparkAISummit …this talk is also about Scaling Python for Data Analysis & Machine Learning! We’ll start with briefly reviewing the Scalability Challenge in the PyData stack… …before Comparing and Contrasting …
Pandas & Scikit-learn 4#UnifiedAnalytics #SparkAISummit Options to Scale… GIL Challenge Multiprocessing Concurrent Futures JobLib Partial_Fit() Hashing Trick
5#UnifiedAnalytics #SparkAISummit High Level APIs RDD Directed Acyclic Graph (DAG) Lazy Execution Task Scheduler Synchronous Multiprocessing Threaded Local Spark Standalone Low Level APIs Custom Algorithms Spark Streaming Spark MLlib GraphFrames Spark SQL / DataFrames Distributed DASK Delayed DASK Futures Design Approach DASK Arrays [Parallel NumPy] DASK DataFrames [Parallel Pandas] DASK-ML [Parallel Scikit-learn] DASK Bag [Parallel Lists] Mesos YARN Local Scheduler Pluggable Task Scheduling System Custom Graphs Submit graph as Python Dictionary object GraphX doesn’tsupport Python
DASK DataFrame & PySpark 6#UnifiedAnalytics #SparkAISummit DASK DataFrames [Parallel Pandas] § Performance Concerns due to the PySpark Design§ DASK DataFrames API is not identical with Pandas API § Performance Concerns with Operations involving Shuffling § Inefficiencies of Pandas are carried over Challenges Challenges § Follow the Pandas Performance tips § Avoid Shuffle, Use pre-sorting, Persist the Results § Use DataFrames API § Use Vectorized/Pandas UDF (Spark v2.3 onwards) RecommendationsRecommendations
7#UnifiedAnalytics #SparkAISummit Code Review Wildcard Wildcard Additional Details
8#UnifiedAnalytics #SparkAISummit While Pandas display a sample of the data, DASK and PySpark show metadata of the DataFrame. The npartitions value shows how many partitions the DataFrame is split into. DASK created a DAG with 99 nodes to process the data. Code Review
9#UnifiedAnalytics #SparkAISummit Code Review .compute() or .head() method tells Dask to go ahead and run the computation and display the results. .show() displays the DataFrame in a tabular form.
How does DASK-ML work? Parallelize Scikit-Learn Re-implement Algorithms Partner with existing Libraries Scalable Machine Learning 10#UnifiedAnalytics #SparkAISummit OCT ‘17 - DASK-ML Spark MLlib - As of Spark 2.0, the primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. It provides: Distributed JobLib Algorithms Featurization Pipelines Persistence Utilities Scalable ML Approaches § Spark for Feature Engineering + Scikit-learn etc. for Learning § Distributed ML Algorithms from Spark MLlib § Train/evaluate Scikit-learn models in parallel (spark-sklearn) from sklearn.externals.joblib import _dask, parallel_backend from sklearn.utils import register_parallel_backend register_parallel_backend('distributed', _dask.DaskDistributedBackend) from dask_ml.xgboost import XGBRegressor est = XGBRegressor(...) est.fit(train, train_labels) prediction = est.predict(test) Common algorithms e.g. classification, regression, & clustering Feature extraction, Transformation, Dimensionality Reduction Constructing, evaluating and tuning ML pipelines Save/Load Algorithms, Models, and Pipelines Linear Algebra, Statistics, and Data Handling Transformers Estimators Pipelines Chains multiple Transformers and Estimators as per the ML Flow Algorithm that transfers one DataFrame into another Algorithm that trains and produces a model
Distributed Deep Learning 11#UnifiedAnalytics #SparkAISummit with Deep Learning Pipelines Project Hydrogen Peer a DASK Distributed cluster with TensorFlow running in distributed mode. § APIs for scalable deep learning in Python from Databricks § Provides a suite of tools covering loading, Training, Tuning and Deploying § Simplifies Distributed Deep Learning Training § Supports TensorFlow, Keras and PyTorch § Integration with PySpark New scheduling option called Gang Scheduler
Other Dev Considerations… § Workloads/APIs § Custom Algorithms (only in DASK) § SQL, Graph (only in Spark) § Debugging Challenges § DASK Distributed may not align with normal Python Debugging Tools/Practices § PySpark errors may have a mix of JVM and Python Stack Trace § Visualization Options § Down-sample and use Pandas DataFrames § Use open source Libraries e.g. D3, Seaborn, Datashader (only for DASK) etc. § Use Databricks Visualization Feature 12#UnifiedAnalytics #SparkAISummit
Which one to Use? 13#UnifiedAnalytics #SparkAISummit There Are No Solutions, There Are Only Tradeoffs! – Thomas Sowell
Questions? 14#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT

DASK and Apache Spark

  • 1.
    WIFI SSID:SparkAISummit |Password: UnifiedAnalytics
  • 2.
    Gurpreet Singh, Microsoft DASKand Apache Spark #UnifiedAnalytics #SparkAISummit
  • 3.
    3#UnifiedAnalytics #SparkAISummit …this talkis also about Scaling Python for Data Analysis & Machine Learning! We’ll start with briefly reviewing the Scalability Challenge in the PyData stack… …before Comparing and Contrasting …
  • 4.
    Pandas & Scikit-learn 4#UnifiedAnalytics#SparkAISummit Options to Scale… GIL Challenge Multiprocessing Concurrent Futures JobLib Partial_Fit() Hashing Trick
  • 5.
    5#UnifiedAnalytics #SparkAISummit High LevelAPIs RDD Directed Acyclic Graph (DAG) Lazy Execution Task Scheduler Synchronous Multiprocessing Threaded Local Spark Standalone Low Level APIs Custom Algorithms Spark Streaming Spark MLlib GraphFrames Spark SQL / DataFrames Distributed DASK Delayed DASK Futures Design Approach DASK Arrays [Parallel NumPy] DASK DataFrames [Parallel Pandas] DASK-ML [Parallel Scikit-learn] DASK Bag [Parallel Lists] Mesos YARN Local Scheduler Pluggable Task Scheduling System Custom Graphs Submit graph as Python Dictionary object GraphX doesn’tsupport Python
  • 6.
    DASK DataFrame &PySpark 6#UnifiedAnalytics #SparkAISummit DASK DataFrames [Parallel Pandas] § Performance Concerns due to the PySpark Design§ DASK DataFrames API is not identical with Pandas API § Performance Concerns with Operations involving Shuffling § Inefficiencies of Pandas are carried over Challenges Challenges § Follow the Pandas Performance tips § Avoid Shuffle, Use pre-sorting, Persist the Results § Use DataFrames API § Use Vectorized/Pandas UDF (Spark v2.3 onwards) RecommendationsRecommendations
  • 7.
  • 8.
    8#UnifiedAnalytics #SparkAISummit While Pandasdisplay a sample of the data, DASK and PySpark show metadata of the DataFrame. The npartitions value shows how many partitions the DataFrame is split into. DASK created a DAG with 99 nodes to process the data. Code Review
  • 9.
    9#UnifiedAnalytics #SparkAISummit Code Review .compute()or .head() method tells Dask to go ahead and run the computation and display the results. .show() displays the DataFrame in a tabular form.
  • 10.
    How does DASK-ML work? ParallelizeScikit-Learn Re-implement Algorithms Partner with existing Libraries Scalable Machine Learning 10#UnifiedAnalytics #SparkAISummit OCT ‘17 - DASK-ML Spark MLlib - As of Spark 2.0, the primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. It provides: Distributed JobLib Algorithms Featurization Pipelines Persistence Utilities Scalable ML Approaches § Spark for Feature Engineering + Scikit-learn etc. for Learning § Distributed ML Algorithms from Spark MLlib § Train/evaluate Scikit-learn models in parallel (spark-sklearn) from sklearn.externals.joblib import _dask, parallel_backend from sklearn.utils import register_parallel_backend register_parallel_backend('distributed', _dask.DaskDistributedBackend) from dask_ml.xgboost import XGBRegressor est = XGBRegressor(...) est.fit(train, train_labels) prediction = est.predict(test) Common algorithms e.g. classification, regression, & clustering Feature extraction, Transformation, Dimensionality Reduction Constructing, evaluating and tuning ML pipelines Save/Load Algorithms, Models, and Pipelines Linear Algebra, Statistics, and Data Handling Transformers Estimators Pipelines Chains multiple Transformers and Estimators as per the ML Flow Algorithm that transfers one DataFrame into another Algorithm that trains and produces a model
  • 11.
    Distributed Deep Learning 11#UnifiedAnalytics#SparkAISummit with Deep Learning Pipelines Project Hydrogen Peer a DASK Distributed cluster with TensorFlow running in distributed mode. § APIs for scalable deep learning in Python from Databricks § Provides a suite of tools covering loading, Training, Tuning and Deploying § Simplifies Distributed Deep Learning Training § Supports TensorFlow, Keras and PyTorch § Integration with PySpark New scheduling option called Gang Scheduler
  • 12.
    Other Dev Considerations… §Workloads/APIs § Custom Algorithms (only in DASK) § SQL, Graph (only in Spark) § Debugging Challenges § DASK Distributed may not align with normal Python Debugging Tools/Practices § PySpark errors may have a mix of JVM and Python Stack Trace § Visualization Options § Down-sample and use Pandas DataFrames § Use open source Libraries e.g. D3, Seaborn, Datashader (only for DASK) etc. § Use Databricks Visualization Feature 12#UnifiedAnalytics #SparkAISummit
  • 13.
    Which one toUse? 13#UnifiedAnalytics #SparkAISummit There Are No Solutions, There Are Only Tradeoffs! – Thomas Sowell
  • 14.
  • 15.
    DON’T FORGET TORATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT