Introduction to Spark ML Machine learning at scale ApacheCon 2016 Hella-Legit
Who am I? Holden ● I prefer she/her for pronouns ● Co-author of the Learning Spark book ● Software Engineer at IBM’s Spark Technology Center ● @holdenkarau ● http://www.slideshare.net/hkarau ● https://www.linkedin.com/in/holdenkarau
What we are going to explore together! ● Who I think you all are ● Spark’s two different ML APIs ● Running through a simple example with one ● A brief detour into some codegen funtimes ● Exercises! ● Model save/load ● Discussion of “serving” options
The different pieces of Spark: Apache Spark core ● SQL & DataFrames ● Streaming ● Language APIs (Scala, Java, Python, & R) ● Graph tools (Bagel & GraphX) ● Spark ML ● MLlib ● Community Packages
Who do I think you all are? ● Nice people* ● Some knowledge of Apache Spark core & maybe SQL ● Interested in using Spark for Machine Learning ● Familiar-ish with Scala or Java or Python Amanda
Skipping intro & set-up time :)
But maybe time to upgrade... ● Spark 1.5+ (Spark 1.6 would be best!) ○ (built with Hive support if building from source) Amanda
Some pages to keep open: http://bit.ly/sparkDocs http://bit.ly/sparkPyDocs OR http://bit.ly/sparkScalaDoc http://bit.ly/sparkMLGuide https://github.com/holdenk/spark-intro-ml-pipeline-workshop http://www.slideshare.net/hkarau Download census data: https://archive.ics.uci.edu/ml/datasets/Adult Dwight Sipler
Getting some data for working with: ● census data: https://archive.ics.uci.edu/ml/datasets/Adult ● Goal: predict income > 50k ● Also included in the github repo ● Download that now if you haven’t already ● We will add a header to the data ○ http://pastebin.ca/3318687 Till Westermayer
So what are the two APIs? ● Traditional and Pipeline ○ Pipeline is the new shiny future which will fix all problems* ● Traditional API works on RDDs ○ Data preparation work is generally done in traditional Spark transformations ● Pipeline API works on DataFrames ○ Often we want to apply some transformations to our data before feeding to the machine learning algorithm ○ Makes it easy to chain these together (*until replaced by a newer shinier future) Steve Jurvetson
So what are DataFrames? ● Spark SQL’s version of RDDs of the world (it’s for more than just SQL) ● Restricted data types, schema information, compile-time untyped ● Restricted operations (more relational style) ● Allow lots of fun extra optimizations ○ Tungsten etc. ● We’ll talk more about them (& Datasets) when we do the Spark SQL component of this workshop
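As a quick illustration (not part of the workshop flow, and using a made-up two-column DataFrame rather than the census data), this is roughly what working with a DataFrame feels like:

# Illustrative sketch only: a tiny hand-built DataFrame, not the census data.
toy_df = sqlContext.createDataFrame(
    [("pandas", 2), ("coffee", 3)], ["thing", "count"])
toy_df.printSchema()                       # the schema travels with the data
toy_df.filter(toy_df["count"] > 2).show()  # relational-style operations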
Transformers, Estimators and Pipelines ● Transformers transform a DataFrame into another ● Estimators can be trained on a DataFrame to produce a transformer ● Pipelines chain together multiple transformers and estimators
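A minimal sketch of that contract (on a made-up DataFrame; we’ll use these same pieces on the real data shortly): an Estimator’s fit() returns a Transformer, and a Transformer’s transform() returns a new DataFrame.

from pyspark.ml.feature import StringIndexer
# Hypothetical one-column DataFrame, just to show the contract.
toy = sqlContext.createDataFrame([("a",), ("b",), ("a",)], ["category"])
indexer = StringIndexer(inputCol="category", outputCol="category-index")  # an Estimator
indexer_model = indexer.fit(toy)      # fit() gives back a Transformer (a Model)
indexer_model.transform(toy).show()   # transform() returns a new DataFrame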
Let’s start with loading some data ● We’ve got some CSV data; we could use textFile and parse by hand ● spark-packages saves us the trouble by providing the spark-csv package by Hossein Falaki ○ If we were building a Java project we could include the Maven coordinates ○ For the Spark shell, add when launching: --packages com.databricks:spark-csv_2.10:1.3.0 Jess Johnson
Loading with sparkSQL & spark-csv: sqlContext.read returns a DataFrameReader. We can specify general properties & data-specific options ● option(“key”, “value”) ○ the spark-csv ones we will use are header & inferSchema ● format(“formatName”) ○ built-in formats include parquet, jdbc, etc.; today we will use com.databricks.spark.csv ● load(“path”) Jess Johnson
Loading with sparkSQL & spark-csv:
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("resources/adult.data"))
Jess Johnson
Let’s explore training a Decision Tree ● Step 1: Data loading (done!) ● Step 2: Data prep (select features, etc.) ● Step 3: Train ● Step 4: Predict
Data prep / cleaning ● We need to predict a double (can be 0.0, 1.0, but type must be double) ● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
Huang Yun Chung
Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")
# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category",
                        outputCol="category-index")
# A Pipeline can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang Yun Chung
So a bit more about that pipeline ● Each of our previous components has a “fit” & “transform” stage ● Constructing the pipeline this way makes it easier to work with (we only need to call one fit & one transform) ● Can re-use the fitted model on future data
model = pipeline.fit(df)
prepared = model.transform(df)
Andrey
What does our pipeline look like so far? Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors. The StringIndexer, while not an ML learning algorithm, still needs to be fit; the assembler is a regular transformer, so no fitting is required.
Let's train a model on our prepared data:
# Specify the model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data: pipeline_model.transform(df).select("prediction", "category-index").take(20)
Exercise 1: Go from the index to something useful ● We could manually look up the labels and then write a select statement ● Or we could look at the labels on the StringIndexerModel and use IndexToString ● Our pipeline has an array of stages we can use for this
Solution:
from pyspark.ml.feature import IndexToString
labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar
So what could we do for other types of data? ● org.apache.spark.ml.feature has a lot of options ○ HashingTF ○ Tokenizer ○ Word2Vec ○ etc.
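For example, a rough sketch for text data (the column name "notes" is hypothetical; the census data has no free-text column):

from pyspark.ml.feature import Tokenizer, HashingTF
# Hypothetical text column "notes" -> bag-of-words feature vector.
tokenizer = Tokenizer(inputCol="notes", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="text-features", numFeatures=1024)
# Both are plain Transformers, so they can be dropped straight into a Pipeline.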
Exercise 2: Add more features to your tree ● Finished quickly? Help others! ● Or tell me if adding these features helped or not… ○ We can download a reserve “test” dataset but how would we know if we couldn’t do that? cobra libre
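One way to sanity-check this without a separate test file is to hold back part of the training data; a minimal sketch (the split ratio and seed are arbitrary, and it reuses pipeline_and_model from earlier):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Hold out 20% of the data, fit the pipeline on the rest, and score the held-out part.
train, test = df.randomSplit([0.8, 0.2], seed=42)
fitted = pipeline_and_model.fit(train)
evaluator = MulticlassClassificationEvaluator(labelCol="category-index",
                                              predictionCol="prediction")
print(evaluator.evaluate(fitted.transform(test)))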
And not just for getting data into doubles... ● Maybe a customer’s cat food preference only matters if the owns_cats boolean is true ● Maybe the scale is _way_ off ● Maybe we’ve got stop words ● Maybe we know one component has a non-linear relation ● etc.
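ml.feature has transformers for several of these cases too; a hedged sketch (column names assume the earlier "features" vector and a hypothetical "words" column):

from pyspark.ml.feature import StandardScaler, StopWordsRemover
# StandardScaler is an Estimator: it needs a fit() pass to learn the scaling stats.
scaler = StandardScaler(inputCol="features", outputCol="scaled-features")
# StopWordsRemover is a plain Transformer that drops common words from a token column.
remover = StopWordsRemover(inputCol="words", outputCol="filtered-words")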
Cross-validation because saving a test set is effort ● Automagically* fit your model params ● Because thinking is effort ● org.apache.spark.ml.tuning has the tools ○ (not in Python yet so skipping for now) Jonathan Kotta
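If the tuning API is available in your PySpark version (pyspark.ml.tuning), a rough sketch of cross-validating the earlier pipeline would look something like this; the grid values are arbitrary:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Try a few tree depths and pick the best by cross-validated metric.
grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 10]).build()
cv = CrossValidator(estimator=pipeline_and_model,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(
                        labelCol="category-index", predictionCol="prediction"),
                    numFolds=3)
cv_model = cv.fit(df)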
The Pipeline API has many models: ● org.apache.spark.ml.classification ○ LogisticRegression, DecisionTreeClassifier, GBTClassifier, etc. ● org.apache.spark.ml.regression ○ DecisionTreeRegressor, GBTRegressor, IsotonicRegression, LinearRegression, etc. ● org.apache.spark.ml.recommendation ○ ALS carterse
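Because they all share the Estimator interface, swapping the final stage is about all it takes; a hedged sketch reusing the earlier stages:

from pyspark.ml.classification import LogisticRegression
# Same feature prep, different final estimator.
lr = LogisticRegression(labelCol="category-index", featuresCol="features")
lr_pipeline = Pipeline().setStages([assembler, indexer, lr])
lr_model = lr_pipeline.fit(df)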
Exercise 3: Train a new model type ● Your choice! ● If you want to do regression - change what we are predicting
So serving... ● Generally refers to using your model online ○ Generating recommendations... ● In batch mode you can “just” save & use the Spark bits ● Spark’s “native” formats (often parquet w/metadata) ○ Understood by Spark libraries and that’s pretty much it ○ If you are serving in the JVM you can load these but need the Spark dependencies (albeit often not a Spark cluster) ● Some models support PMML export ○ https://github.com/jpmml/openscoring etc. ● We can also write our own export & serving by hand :( Ambernectar 13
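As a sketch of the “just save & use the Spark bits” path with a traditional-API model (labeled_points_rdd is hypothetical here, and persistence for new-style pipelines varies by Spark version):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
# Train an old-style model on a hypothetical RDD of LabeledPoints, save it, reload it.
old_model = LogisticRegressionWithLBFGS.train(labeled_points_rdd)
old_model.save(sc, "/tmp/lr-model")
reloaded = LogisticRegressionModel.load(sc, "/tmp/lr-model")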
So what models are PMML exportable? ● Right now “old” style models ○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and binary LogisticRegression ○ However if we look in the code we can sometimes find converters to the old-style models and use those to export our “new” style models ● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models ● Not getting in for 2.0 :(
How to PMML export: toPMML ● returns a string, or ● takes a path to the local fs and saves the results, or ● takes a SparkContext & a distributed path and saves, or ● takes a stream and writes the result to the stream
Optional* exercise time ● Take a model you trained and save it to PMML ○ You will have to dig around in the Spark code to be able to do this ● Look at the file ● Load it into a serving system and try some predictions ● Note: PMML export currently only includes the model - not any transformations beforehand ● Also: you might need to train a new model ● If you don’t get it don’t worry - hints to follow :)