Introduction to Spark ML Machine learning at scale ApacheCon 2016 Hella-Legit
Who am I? Holden ● I prefer she/her for pronouns ● Co-author of the Learning Spark book ● Software Engineer at IBM’s Spark Technology Center ● @holdenkarau ● http://www.slideshare.net/hkarau ● https://www.linkedin.com/in/holdenkarau
What we are going to explore together! ● Who I think you all are ● Spark’s two different ML APIs ● Running through a simple example with one ● A brief detour into some codegen funtimes ● Exercises! ● Model save/load ● Discussion of “serving” options
The different pieces of Spark: Apache Spark core ● SQL & DataFrames ● Streaming ● Language APIs (Scala, Java, Python, & R) ● Graph tools (Bagel & GraphX) ● Spark ML ● MLlib ● Community Packages
Who do I think you all are? ● Nice people* ● Some knowledge of Apache Spark core & maybe SQL ● Interested in using Spark for Machine Learning ● Familiar-ish with Scala or Java or Python Amanda
Skipping intro & set-up time :)
But maybe time to upgrade... ● Spark 1.5+ (Spark 1.6 would be best!) ○ (built with Hive support if building from source) Amanda
Some pages to keep open: http://bit.ly/sparkDocs http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc http://bit.ly/sparkMLGuide https://github.com/holdenk/spark-intro-ml-pipeline-workshop http://www.slideshare.net/hkarau Download census data: https://archive.ics.uci.edu/ml/datasets/Adult Dwight Sipler
Getting some data for working with: ● census data: https://archive.ics.uci.edu/ml/datasets/Adult ● Goal: predict income > 50k ● Also included in the github repo ● Download that now if you haven’t already ● We will add a header to the data ○ http://pastebin.ca/3318687 Till Westermayer
So what are the two APIs? ● Traditional and Pipeline ○ Pipeline is the new shiny future which will fix all problems* ● Traditional API works on RDDs ○ Data preparation work is generally done in traditional Spark transformations ● Pipeline API works on DataFrames ○ Often we want to apply some transformations to our data before feeding to the machine learning algorithm ○ Makes it easy to chain these together (*until replaced by a newer shinier future) Steve Jurvetson
So what are DataFrames? ● Spark SQL’s version of RDDs of the world (it’s for more than just SQL) ● Restricted data types, schema information, compile-time untyped ● Restricted operations (more relational style) ● Allow lots of fun extra optimizations ○ Tungsten etc. ● We’ll talk more about them (& Datasets) when we do the Spark SQL component of this workshop
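As a quick illustration (not part of the workshop flow, and using a made-up two-column DataFrame rather than the census data), this is roughly what working with a DataFrame feels like:

# Illustrative sketch only: a tiny hand-built DataFrame, not the census data.
toy_df = sqlContext.createDataFrame(
    [("pandas", 2), ("coffee", 3)], ["thing", "count"])
toy_df.printSchema()                       # the schema travels with the data
toy_df.filter(toy_df["count"] > 2).show()  # relational-style operations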
Transformers, Estimators and Pipelines ● Transformers transform a DataFrame into another ● Estimators can be trained on a DataFrame to produce a transformer ● Pipelines chain together multiple transformers and estimators
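A minimal sketch of that contract (on a made-up DataFrame; we’ll use these same pieces on the real data shortly): an Estimator’s fit() returns a Transformer, and a Transformer’s transform() returns a new DataFrame.

from pyspark.ml.feature import StringIndexer
# Hypothetical one-column DataFrame, just to show the contract.
toy = sqlContext.createDataFrame([("a",), ("b",), ("a",)], ["category"])
indexer = StringIndexer(inputCol="category", outputCol="category-index")  # an Estimator
indexer_model = indexer.fit(toy)      # fit() gives back a Transformer (a Model)
indexer_model.transform(toy).show()   # transform() returns a new DataFrame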
Let’s start with loading some data ● We’ve got some CSV data; we could use textFile and parse by hand ● spark-packages saves us the trouble by providing the spark-csv package by Hossein Falaki ○ If we were building a Java project we could include the Maven coordinates ○ For the Spark shell, add when launching: --packages com.databricks:spark-csv_2.10:1.3.0 Jess Johnson
Loading with sparkSQL & spark-csv: sqlContext.read returns a DataFrameReader. We can specify general properties & data-specific options ● option(“key”, “value”) ○ the spark-csv ones we will use are header & inferSchema ● format(“formatName”) ○ built-in formats include parquet, jdbc, etc.; today we will use com.databricks.spark.csv ● load(“path”) Jess Johnson
Loading with sparkSQL & spark-csv:
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
Jess Johnson
Let’s explore training a Decision Tree ● Step 1: Data loading (done!) ● Step 2: Data prep (select features, etc.) ● Step 3: Train ● Step 4: Predict
Data prep / cleaning ● We need to predict a double (can be 0.0, 1.0, but type must be double) ● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
Huang Yun Chung
Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category",
                        outputCol="category-index")
# A Pipeline can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang Yun Chung
So a bit more about that pipeline ● Each of our previous components has a “fit” & “transform” stage ● Constructing the pipeline this way makes it easier to work with (we only need to call one fit & one transform) ● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
Andrey
What does our pipeline look like so far? Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors. The StringIndexer, while not an ML learning algorithm, still needs to be fit; the assembler is a regular transformer, so no fitting is required.
Let's train a model on our prepared data:
# Specify the model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data: pipeline_model.transform(df).select("prediction", "category-index").take(20)
Exercise 1: Go from the index to something useful ● We could manually look up the labels and then write a select statement ● Or we could look at the labels on the StringIndexerModel and use IndexToString ● Our pipeline has an array of stages we can use for this
Solution:
from pyspark.ml.feature import IndexToString
labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
So what could we do for other types of data? ● org.apache.spark.ml.feature has a lot of options ○ HashingTF ○ Tokenizer ○ Word2Vec ○ etc.
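For example, a rough sketch for text data (the column name "notes" is hypothetical; the census data has no free-text column):

from pyspark.ml.feature import Tokenizer, HashingTF
# Hypothetical text column "notes" -> bag-of-words feature vector.
tokenizer = Tokenizer(inputCol="notes", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="text-features", numFeatures=1024)
# Both are plain Transformers, so they can be dropped straight into a Pipeline.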
Exercise 2: Add more features to your tree ● Finished quickly? Help others! ● Or tell me if adding these features helped or not… ○ We can download a reserve “test” dataset but how would we know if we couldn’t do that? cobra libre
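One way to sanity-check this without a separate test file is to hold back part of the training data; a minimal sketch (the split ratio and seed are arbitrary, and it reuses pipeline_and_model from earlier):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Hold out 20% of the data, fit the pipeline on the rest, and score the held-out part.
train, test = df.randomSplit([0.8, 0.2], seed=42)
fitted = pipeline_and_model.fit(train)
evaluator = MulticlassClassificationEvaluator(labelCol="category-index",
                                              predictionCol="prediction")
print(evaluator.evaluate(fitted.transform(test)))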
And not just for getting data into doubles... ● Maybe a customer’s cat food preference only matters if the owns_cats boolean is true ● Maybe the scale is _way_ off ● Maybe we’ve got stop words ● Maybe we know one component has a non-linear relation ● etc.
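ml.feature has transformers for several of these cases too; a hedged sketch (column names assume the earlier "features" vector and a hypothetical "words" column):

from pyspark.ml.feature import StandardScaler, StopWordsRemover
# StandardScaler is an Estimator: it needs a fit() pass to learn the scaling stats.
scaler = StandardScaler(inputCol="features", outputCol="scaled-features")
# StopWordsRemover is a plain Transformer that drops common words from a token column.
remover = StopWordsRemover(inputCol="words", outputCol="filtered-words")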
Cross-validation because saving a test set is effort ● Automagically* fit your model params ● Because thinking is effort ● org.apache.spark.ml.tuning has the tools ○ (not in Python yet so skipping for now) Jonathan Kotta
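If the tuning API is available in your PySpark version (pyspark.ml.tuning), a rough sketch of cross-validating the earlier pipeline would look something like this; the grid values are arbitrary:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Try a few tree depths and pick the best by cross-validated metric.
grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 10]).build()
cv = CrossValidator(estimator=pipeline_and_model,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(
                        labelCol="category-index", predictionCol="prediction"),
                    numFolds=3)
cv_model = cv.fit(df)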
The Pipeline API has many models: ● org.apache.spark.ml.classification ○ LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc. ● org.apache.spark.ml.regression ○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc. ● org.apache.spark.ml.recommendation ○ ALS carterse
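Because they all share the Estimator interface, swapping the final stage is about all it takes; a hedged sketch reusing the earlier stages:

from pyspark.ml.classification import LogisticRegression
# Same feature prep, different final estimator.
lr = LogisticRegression(labelCol="category-index", featuresCol="features")
lr_pipeline = Pipeline().setStages([assembler, indexer, lr])
lr_model = lr_pipeline.fit(df)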
Exercise 3: Train a new model type ● Your choice! ● If you want to do regression - change what we are predicting
So serving... ● Generally refers to using your model online ○ Generating recommendations... ● In batch mode you can “just” save & use the Spark bits ● Spark’s “native” formats (often parquet w/metadata) ○ Understood by Spark libraries and that’s pretty much it ○ If you are serving in the JVM you can load these but need the Spark dependencies (albeit often not a Spark cluster) ● Some models support PMML export ○ https://github.com/jpmml/openscoring etc. ● We can also write our own export & serving by hand :( Ambernectar 13
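As a sketch of the “just save & use the Spark bits” path with a traditional-API model (labeled_points_rdd is hypothetical here, and persistence for new-style pipelines varies by Spark version):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
# Train an old-style model on a hypothetical RDD of LabeledPoints, save it, reload it.
old_model = LogisticRegressionWithLBFGS.train(labeled_points_rdd)
old_model.save(sc, "/tmp/lr-model")
reloaded = LogisticRegressionModel.load(sc, "/tmp/lr-model")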
So what models are PMML exportable? ● Right now “old” style models ○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and binary LogisticRegression ○ However if we look in the code we can sometimes find converters to the old-style models and use those to export our “new” style models ● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models ● Not getting in for 2.0 :(
How to PMML export: toPMML ● returns a string, or ● takes a path to the local fs and saves the results, or ● takes a SparkContext & a distributed path and saves, or ● takes a stream and writes the result to the stream
Optional* exercise time ● Take a model you trained and save it to PMML ○ You will have to dig around in the Spark code to be able to do this ● Look at the file ● Load it into a serving system and try some predictions ● Note: PMML export currently only includes the model - not any transformations beforehand ● Also: you might need to train a new model ● If you don’t get it don’t worry - hints to follow :)