MINING BIG DATA STREAMS WITH APACHE SAMOA Albert Bifet @abifet #J_OnTheBeach Malaga, 20 May 2016
MOTIVATION
REALTIME ANALYTICS
REALTIME ANALYTICS eal time analytics
REALTIME ANALYTICS real time analytics
APACHE SA(MOA)VISION • Data Stream mining platform • Library of state-of-the-art algorithms
 for practitioners • Development and collaboration framework
 for researchers • Algorithms & Systems
IMPORTANCE • Example: spam detection in comments onYahoo News • Trends change in time • Need to retrain model with new data Importance$of$O •  As$spam$trends$change retrain$the$model$with
INTERNET OF THINGS • EMC Digital Universe, 2014 digital universe Figure 3: EMC Digital Universe, 2014 7
BIG DATA STREAM • Volume +Velocity (+Variety) • Too large for single commodity server main memory • Too fast for single commodity server CPU • A solution should be: • Distributed • Scalable
BIG DATA PROCESSING ENGINES • Low latency • High Latency (Not real time) apache storm Storm characteristics for real-time data processing workloads 1 Fast 2 Scalable 3 Fault-tolerant 4 Reliable 5 Easy to operate apache samza from linkedin Storm and Samza are fairly similar. Both systems provide: 1 a partitioned stream model, 2 a distributed execution environment, 3 an API for stream processing, 4 fault tolerance, 5 Kafka integration real time computation: streaming computation MapReduce Limitations Example How compute in real time (latency less than 1 second): 1 predictions 2 frequent items as Twitter hashtags 3 sentiment analysis 14 apache spark streaming
MACHINE LEARNING • Classification • Regression • Clustering • Frequent Pattern Mining
WHAT IS MOA?
MOA • {M}assive {O}nline {A}nalysis is a framework for online learning from data streams. • It is closely related to WEKA • It includes a collection of offline and online as well as tools for evaluation: • classification, regression • clustering, frequent pattern mining • Easy to extend, design and run experiments {M}assive {O}nline {A} MOA (Bifet et al. 20 {M}assive {O}nline {A}nalysis is a framework learning from data streams. It is closely related to WEKA
STREAM SETTING • Process an example at a time,and inspect it only once (at most) • Use a limited amount of memory • Work in a limited amount of time • Be ready to predict at any point
STREAM EVALUATION • Holdout Evaluation • InterleavedTest-Then-Train or Prequential
STREAM EVALUATION Holdout an independent test set • Apply the current decision model to the test set, at regular time intervals • The loss estimated in the holdout is an unbiased estimator
STREAM EVALUATION Prequential Evaluation • The error of a model is computed from the sequence of examples. • For each example in the stream, the actual model makes a prediction based only on the example attribute-values.
CLUSTERING
COMMAND LINE • java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -l DecisionStump -s generators.WaveformGenerator -n 100000 -i 100000000 -f 1000000" > dsresult.csv • This command creates a comma separated values file: • training the DecisionStump classifier on the WaveformGenerator data, • using the first 100 thousand examples for testing, • training on a total of 100 million examples, • and testing every one million examples
WHAT IS APACHE SAMOA?
STREAMING MODEL • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time (concept drift) • Approximation algorithms
 (small error with high probability) • Single pass, one data item at a time • Sub-linear space and time per data item
TAXONOMY Data Mining Distributed Batch Hadoop Mahout Stream Storm, S4, Samza SAMOA Non Distributed Batch R, WEKA, … Stream MOA
ARCHITECTURE An adapter for integrating Apache Flink into Apache SAMOA was implemente n scope of this master thesis, with the main parts of its implementation bein addressed in this section. With the use of our adapter, ML algorithms can b executed on top of Apache Flink. The implemented adapter will be used for th evaluation of the ML pipelines and HT algorithm variations. Figure 20: Apache SAMOA’s high level architecture.
STATUSSTATUS • Parallel algorithms • Classification (Vertical HoeffdingTree) • Clustering (CluStream) • Regression (Adaptive Model Rules) • Execution engines
IS SAMOA USEFUL FORYOU? • Only if you need to deal with: • Large fast data • Evolving process (model updates) • What is happening now? • Use feedback in real-time • Adapt to changes faster
ML DEVELOPER API Processing Item Processor Stream
ML DEVELOPER API TopologyBuilder builder; Processor sourceOne = new SourceProcessor(); builder.addProcessor(sourceOne); Stream streamOne = builder.createStream(sourceOne); Processor sourceTwo = new SourceProcessor(); builder.addProcessor(sourceTwo); Stream streamTwo = builder.createStream(sourceTwo); Processor join = new JoinProcessor()); builder.addProcessor(join) .connectInputShuffle(streamOne) .connectInputKey(streamTwo);
VERTICAL HOEFFDINGTREE (VHT)
DECISIONTREE • Nodes are tests on attributes • Branches are possible outcomes • Leafs are class assignments
 
 Class Instance Attributes Road Tested? Mileage? Age? NoYes High ✅ ❌ Low OldRecent ✅ ❌ Car deal?
HOEFFDINGTREE • Sample of stream enough for near optimal decision • Estimate merit of alternatives from prefix of stream • Choose sample size based on statistical principles • When to expand a leaf? • Let x1 be the most informative attribute,
 x2 the second most informative one • Hoeffding bound: split if G(x1, x2) > ✏ = r R2 ln(1/ ) 2n P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
PARALLEL DECISIONTREES • Which kind of parallelism? • Task • Data • Horizontal • Vertical Data Attributes Instances
HORIZONTAL PARALLELISM Y. Ben-Haim and E.Tom-Tov,“A Streaming Parallel DecisionTree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010 Stats Stats Stats Stream Histograms Model Instances Model Updates Aggregation to compute splits Single attribute tracked in multiple node 32
HOEFFDINGTREE PROFILING Other 6 % Split 24 % Learn 70 % CPU time for training
 100 nominal and 100 numeric attributes
VERTICAL PARALLELISM Single attribute tracked in single node Stats Stats Stats Stream Model Attributes Splits
ADVANTAGES OFVERTICAL • High number of attributes => high level of parallelism
 (e.g., documents) • Vs task parallelism • Parallelism observed immediately • Vs horizontal parallelism • Reduced memory usage (no model replication) • Parallelized split computation
VERTICAL HOEFFDINGTREE Control Split Result Source (n) Model (n) Stats (n) Evaluator (1) InstanceStream Shuffle Grouping Key Grouping All Grouping
ACCURACY No. Leaf Nodes VHT2 – tree-100 30 Very close and very high accuracy
PERFORMANCE 35 0 50 100 150 200 250 MHT VHT2-par-3 ExecutionTime(seconds) Classifier Profiling Results for text-10000 with 100000 instances t_calc t_comm t_serial Throughput VHT2-par-3: 2631 inst/sec MHT : 507 inst/sec
SUMMARY • Streaming is an importantV of Big Data • Mining big data streams is an open field • MOA: Massive Online Analytics • Available and open-source http://moa.cms.waikato.ac.nz/ • SAMOA:A Platform for Mining Big Data Streams • Available and open-source (incubating @ASF)
 http://samoa.incubator.apache.org
OPEN CHALLENGES • Distributed stream mining algorithms • Active & semi-supervised learning + crowdsourcing • Millions of classes (e.g.,Wikipedia pages) • Multi-target learning • System issues (load balancing, communication) • Programming paradigms and abstractions
SAMOATEAM Albert
 Bifet Matthieu
 Morel Gianmarco
 De Francisci Morales Arinto
 Murdopo Nicolas
 Kourtellis Olivier
 Van Laere
SUPPORTING ORGANISATIONS
THANKS! https://samoa.incubator.apache.org @ApacheSAMOA

Mining Big Data Streams with APACHE SAMOA

  • 1.
    MINING BIG DATASTREAMS WITH APACHE SAMOA Albert Bifet @abifet #J_OnTheBeach Malaga, 20 May 2016
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
    APACHE SA(MOA)VISION • DataStream mining platform • Library of state-of-the-art algorithms
 for practitioners • Development and collaboration framework
 for researchers • Algorithms & Systems
  • 7.
    IMPORTANCE • Example: spamdetection in comments onYahoo News • Trends change in time • Need to retrain model with new data Importance$of$O •  As$spam$trends$change retrain$the$model$with
  • 8.
    INTERNET OF THINGS •EMC Digital Universe, 2014 digital universe Figure 3: EMC Digital Universe, 2014 7
  • 9.
    BIG DATA STREAM •Volume +Velocity (+Variety) • Too large for single commodity server main memory • Too fast for single commodity server CPU • A solution should be: • Distributed • Scalable
  • 10.
    BIG DATA PROCESSING ENGINES •Low latency • High Latency (Not real time) apache storm Storm characteristics for real-time data processing workloads 1 Fast 2 Scalable 3 Fault-tolerant 4 Reliable 5 Easy to operate apache samza from linkedin Storm and Samza are fairly similar. Both systems provide: 1 a partitioned stream model, 2 a distributed execution environment, 3 an API for stream processing, 4 fault tolerance, 5 Kafka integration real time computation: streaming computation MapReduce Limitations Example How compute in real time (latency less than 1 second): 1 predictions 2 frequent items as Twitter hashtags 3 sentiment analysis 14 apache spark streaming
  • 11.
    MACHINE LEARNING • Classification •Regression • Clustering • Frequent Pattern Mining
  • 12.
  • 13.
    MOA • {M}assive {O}nline{A}nalysis is a framework for online learning from data streams. • It is closely related to WEKA • It includes a collection of offline and online as well as tools for evaluation: • classification, regression • clustering, frequent pattern mining • Easy to extend, design and run experiments {M}assive {O}nline {A} MOA (Bifet et al. 20 {M}assive {O}nline {A}nalysis is a framework learning from data streams. It is closely related to WEKA
  • 14.
    STREAM SETTING • Processan example at a time,and inspect it only once (at most) • Use a limited amount of memory • Work in a limited amount of time • Be ready to predict at any point
  • 15.
    STREAM EVALUATION • HoldoutEvaluation • InterleavedTest-Then-Train or Prequential
  • 16.
    STREAM EVALUATION Holdout anindependent test set • Apply the current decision model to the test set, at regular time intervals • The loss estimated in the holdout is an unbiased estimator
  • 17.
    STREAM EVALUATION Prequential Evaluation •The error of a model is computed from the sequence of examples. • For each example in the stream, the actual model makes a prediction based only on the example attribute-values.
  • 18.
  • 19.
    COMMAND LINE • java-cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -l DecisionStump -s generators.WaveformGenerator -n 100000 -i 100000000 -f 1000000" > dsresult.csv • This command creates a comma separated values file: • training the DecisionStump classifier on the WaveformGenerator data, • using the first 100 thousand examples for testing, • training on a total of 100 million examples, • and testing every one million examples
  • 20.
  • 21.
    STREAMING MODEL • Sequenceis potentially infinite • High amount of data, high speed of arrival • Change over time (concept drift) • Approximation algorithms
 (small error with high probability) • Single pass, one data item at a time • Sub-linear space and time per data item
  • 22.
  • 23.
    ARCHITECTURE An adapter forintegrating Apache Flink into Apache SAMOA was implemente n scope of this master thesis, with the main parts of its implementation bein addressed in this section. With the use of our adapter, ML algorithms can b executed on top of Apache Flink. The implemented adapter will be used for th evaluation of the ML pipelines and HT algorithm variations. Figure 20: Apache SAMOA’s high level architecture.
  • 24.
    STATUSSTATUS • Parallel algorithms •Classification (Vertical HoeffdingTree) • Clustering (CluStream) • Regression (Adaptive Model Rules) • Execution engines
  • 25.
    IS SAMOA USEFULFORYOU? • Only if you need to deal with: • Large fast data • Evolving process (model updates) • What is happening now? • Use feedback in real-time • Adapt to changes faster
  • 26.
    ML DEVELOPER API ProcessingItem Processor Stream
  • 27.
    ML DEVELOPER API TopologyBuilderbuilder; Processor sourceOne = new SourceProcessor(); builder.addProcessor(sourceOne); Stream streamOne = builder.createStream(sourceOne); Processor sourceTwo = new SourceProcessor(); builder.addProcessor(sourceTwo); Stream streamTwo = builder.createStream(sourceTwo); Processor join = new JoinProcessor()); builder.addProcessor(join) .connectInputShuffle(streamOne) .connectInputKey(streamTwo);
  • 28.
  • 29.
    DECISIONTREE • Nodes aretests on attributes • Branches are possible outcomes • Leafs are class assignments
 
 Class Instance Attributes Road Tested? Mileage? Age? NoYes High ✅ ❌ Low OldRecent ✅ ❌ Car deal?
  • 30.
    HOEFFDINGTREE • Sample ofstream enough for near optimal decision • Estimate merit of alternatives from prefix of stream • Choose sample size based on statistical principles • When to expand a leaf? • Let x1 be the most informative attribute,
 x2 the second most informative one • Hoeffding bound: split if G(x1, x2) > ✏ = r R2 ln(1/ ) 2n P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
  • 31.
    PARALLEL DECISIONTREES • Whichkind of parallelism? • Task • Data • Horizontal • Vertical Data Attributes Instances
  • 32.
    HORIZONTAL PARALLELISM Y. Ben-Haimand E.Tom-Tov,“A Streaming Parallel DecisionTree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010 Stats Stats Stats Stream Histograms Model Instances Model Updates Aggregation to compute splits Single attribute tracked in multiple node 32
  • 33.
    HOEFFDINGTREE PROFILING Other 6 % Split 24 % Learn 70 % CPU time fortraining
 100 nominal and 100 numeric attributes
  • 34.
    VERTICAL PARALLELISM Single attributetracked in single node Stats Stats Stats Stream Model Attributes Splits
  • 35.
    ADVANTAGES OFVERTICAL • Highnumber of attributes => high level of parallelism
 (e.g., documents) • Vs task parallelism • Parallelism observed immediately • Vs horizontal parallelism • Reduced memory usage (no model replication) • Parallelized split computation
  • 36.
    VERTICAL HOEFFDINGTREE Control Split Result Source (n)Model (n) Stats (n) Evaluator (1) InstanceStream Shuffle Grouping Key Grouping All Grouping
  • 37.
    ACCURACY No. Leaf NodesVHT2 – tree-100 30 Very close and very high accuracy
  • 38.
    PERFORMANCE 35 0 50 100 150 200 250 MHT VHT2-par-3 ExecutionTime(seconds) Classifier Profiling Resultsfor text-10000 with 100000 instances t_calc t_comm t_serial Throughput VHT2-par-3: 2631 inst/sec MHT : 507 inst/sec
  • 39.
    SUMMARY • Streaming isan importantV of Big Data • Mining big data streams is an open field • MOA: Massive Online Analytics • Available and open-source http://moa.cms.waikato.ac.nz/ • SAMOA:A Platform for Mining Big Data Streams • Available and open-source (incubating @ASF)
 http://samoa.incubator.apache.org
  • 40.
    OPEN CHALLENGES • Distributedstream mining algorithms • Active & semi-supervised learning + crowdsourcing • Millions of classes (e.g.,Wikipedia pages) • Multi-target learning • System issues (load balancing, communication) • Programming paradigms and abstractions
  • 41.
  • 42.
  • 43.