Machine Learning with Apache Mahout
http://twitter.com/danielglauser http://www.linkedin.com/in/danglauser danglauser@gmail.com
What is Machine Learning?
• A branch of Artificial Intelligence
• Creative use of statistics
• Smart decisions from large data sets
• All of the above
Common Applications
Spam Filtering
Credit Card Fraud
Medical Diagnostics
Search Engines
Sentiment Analysis
Math Alert
If you want to go big with Machine Learning, math is necessary
What math?
• Statistics
• Discrete Math
• Linear Algebra
• Probability
Apache Mahout
• A platform for Machine Learning
• Roll your own algorithm, use the platform
• Easy integration with Hadoop
History
• 2005 The Taste framework
• 2008 Services built on Lucene
Mahout is composed of...
• Recommender Engines
• Classification
• Clustering
• Frequent itemsets
A brief intro to:
• Recommender Engines
• Classification
• Clustering
Recommendations
• For a given set of inputs, make a recommendation
• Rank the best out of many possibilities
Recommenders are typically User based or Item based
Neighborhood
• Nearest N Users
• Threshold
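A minimal sketch of the two neighborhood styles in plain Java. This is not Mahout code: the class and method names are hypothetical, and the similarity scores are assumed to be computed already.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout code: picks a neighborhood from
// precomputed scores (user id -> similarity to the target user).
final class NeighborhoodSketch {

    // Nearest N: keep the n users with the highest similarity.
    static List<Long> nearestN(Map<Long, Double> similarities, int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(similarities.entrySet());
        entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    // Threshold: keep every user whose similarity is at least the threshold.
    static List<Long> aboveThreshold(Map<Long, Double> similarities, double threshold) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Double> e : similarities.entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Double> sims = new LinkedHashMap<>();
        sims.put(2L, 0.9);
        sims.put(3L, -0.2);
        sims.put(4L, 0.5);
        sims.put(5L, 0.7);
        System.out.println(nearestN(sims, 2));
        System.out.println(aboveThreshold(sims, 0.6));
    }
}
```

Nearest N always returns a fixed-size neighborhood, even if some neighbors are poor matches. Threshold returns only good matches, but possibly none at all.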
Similarity
PearsonCorrelationSimilarity
• Produces a value between 1 and -1
• Measures the tendency of two series to move together
• 1 - the two series are similar
• 0 - no similarity
• -1 - opposite similarity
PearsonCorrelationSimilarity Problems
• Doesn’t take into account how many items overlap between users
• Cannot find similarity between two users if they only have one item in common
• Undefined if either user's preference values are all identical (no variance to correlate)
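A rough sketch of the Pearson calculation in plain Java, assuming the two arrays hold each user's preference values for the items both have rated. The class name is hypothetical and this is not Mahout's implementation. It returns NaN in the undefined cases: a single shared item, or a series with no variance (for example a user who rates everything the same).

```java
// Hypothetical sketch of the Pearson correlation between two users'
// preference values for the items they have both rated.
final class PearsonSketch {

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("series must be non-empty and the same length");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double covariance = 0;
        double varianceX = 0;
        double varianceY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // When either series has zero variance this is 0/0, which is NaN in Java.
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
        System.out.println(pearson(new double[] {4}, new double[] {5}));
        System.out.println(pearson(new double[] {3, 3, 3}, new double[] {1, 4, 5}));
    }
}
```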
Similarity Algorithms
• PearsonCorrelationSimilarity
• EuclideanDistanceSimilarity
• TanimotoCoefficientSimilarity
• LogLikelihoodSimilarity
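As one example from this list, the Tanimoto coefficient ignores preference values and only asks which items two users share: the size of the intersection divided by the size of the union. A small sketch in plain Java; the class name is hypothetical and this is not Mahout's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the Tanimoto coefficient over two users' item id sets.
final class TanimotoSketch {

    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        // Two empty sets have no defined similarity.
        return unionSize == 0 ? Double.NaN : (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> userA = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> userB = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        // Two shared items out of four distinct items.
        System.out.println(tanimoto(userA, userB));
    }
}
```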
To the code!
How big is a Java Object?
GenericPreference
• user id - long - 8 bytes
• item id - long - 8 bytes
• preference value - float - 4 bytes
PreferenceArray
• Why not just use an array or an ArrayList?
• A little overhead x millions of items = a *lot* of overhead
GenericUserPreferenceArray
• item id - long - 8 bytes (x millions)
• preference value - float - 4 bytes (x millions)
• user id - long - 8 bytes (just one)
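A back-of-the-envelope version of that comparison. The field sizes come from the slides; the 16-byte object header and 8-byte reference are assumptions that vary by JVM and settings, and alignment padding and array headers are ignored. The class and method names are hypothetical.

```java
// Hypothetical estimate: one object per preference versus parallel arrays
// that share a single user id. Header and reference sizes are assumptions.
final class PreferenceMemorySketch {

    static final int LONG_BYTES = 8;
    static final int FLOAT_BYTES = 4;
    static final int ASSUMED_OBJECT_HEADER_BYTES = 16; // assumption, JVM dependent
    static final int ASSUMED_REFERENCE_BYTES = 8;      // assumption, JVM dependent

    // One object per preference: header + user id + item id + value,
    // plus the reference that the containing array holds to it.
    static long objectPerPreferenceBytes(long preferences) {
        long perObject = ASSUMED_OBJECT_HEADER_BYTES + LONG_BYTES + LONG_BYTES + FLOAT_BYTES;
        return preferences * (perObject + ASSUMED_REFERENCE_BYTES);
    }

    // Parallel arrays: one shared user id, then an item id and a value per preference.
    static long parallelArrayBytes(long preferences) {
        return LONG_BYTES + preferences * (LONG_BYTES + FLOAT_BYTES);
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println("object per preference: " + objectPerPreferenceBytes(n) + " bytes");
        System.out.println("parallel arrays:       " + parallelArrayBytes(n) + " bytes");
    }
}
```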
Phew!
Clustering
• Surface naturally occurring groups of data
• A notion of similarity (and dissimilarity)
• Algorithms do not require training
• Stopping condition - iterate until close enough
Common Clustering Algorithms
• K-Means
• Fuzzy K-Means
• Mean Shift
• Centroid generation
• Dirichlet clustering
Representing Data
• Feature Selection
• Vectorization
Feature Selection
• Figure out what features of your data are interesting
Vectorization
• Represent the interesting features in an n-dimensional space
N-Dimensional Space
• Every word in a group of documents
• Size, shape, color of an object
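A small sketch of the "every word is a dimension" idea: build a vocabulary from the documents, then turn each document into a vector of word counts. Plain Java with hypothetical names, whitespace tokenization assumed; this is not Mahout's vectorizer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bag-of-words sketch: one dimension per distinct word.
final class VectorizationSketch {

    // Assigns each distinct word a dimension index in first-seen order.
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String document : documents) {
            for (String word : tokenize(document)) {
                vocabulary.putIfAbsent(word, vocabulary.size());
            }
        }
        return vocabulary;
    }

    // Counts how often each vocabulary word occurs; unknown words are skipped.
    static double[] vectorize(String document, Map<String, Integer> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (String word : tokenize(document)) {
            Integer index = vocabulary.get(word);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }

    // Lower-cases and splits on whitespace (a deliberately naive tokenizer).
    static String[] tokenize(String document) {
        String trimmed = document.toLowerCase().trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("the cat sat", "the dog sat down");
        Map<String, Integer> vocabulary = buildVocabulary(documents);
        System.out.println(vocabulary);
        for (String document : documents) {
            System.out.println(Arrays.toString(vectorize(document, vocabulary)));
        }
    }
}
```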
Representing Vectors
• DenseVector
• RandomAccessSparseVector
• SequentialAccessSparseVector
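The dense versus sparse trade-off, sketched in plain Java. A dense vector stores every dimension; a sparse one stores only the non-zero entries. The names here are hypothetical and the HashMap is only a stand-in for the idea of random access by index, not how Mahout's vector classes are built.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a sparse vector as a map from dimension index to value.
final class SparseVectorSketch {

    // Keeps only the non-zero entries of a dense array.
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new HashMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                sparse.put(i, dense[i]);
            }
        }
        return sparse;
    }

    // Random access: dimensions that are not stored are zero.
    static double get(Map<Integer, Double> sparse, int index) {
        return sparse.getOrDefault(index, 0.0);
    }

    // Dot product that only visits the stored entries of the first vector.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0;
        for (Map.Entry<Integer, Double> entry : a.entrySet()) {
            sum += entry.getValue() * get(b, entry.getKey());
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] dense = new double[1000];
        dense[3] = 2.0;
        dense[500] = 4.0;
        Map<Integer, Double> sparse = toSparse(dense);
        System.out.println("dense slots: " + dense.length + ", sparse entries: " + sparse.size());
        System.out.println("value at 500: " + get(sparse, 500));
    }
}
```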
Hadoop SequenceFiles
• Input vectors - SequenceFile(s)
• Initial centroids - SequenceFile(s)
K-Means
• 50+ years old, in common use for 25 years
• Set the number of clusters - k
• Works well even if you don’t pick a good k
K-Means
• Guess at initial placement of the centers (centroids)
• Expectation - assign the nearest points to each centroid
• Maximization - reposition the centroid
• Wash, rinse, repeat (the Expectation and Maximization steps)
[Figure sequence: centroids C1, C2 and C3 are repositioned over successive K-Means iterations, ending with "Stop!"]
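The loop above, sketched in plain Java: assign each point to its nearest centroid, move each centroid to the mean of its points, and stop when the centroids barely move. This is a single-machine illustration with hypothetical names and Euclidean distance assumed; it is not Mahout's distributed implementation.

```java
import java.util.Arrays;

// Hypothetical single-machine K-Means sketch using Euclidean distance.
final class KMeansSketch {

    static double[][] cluster(double[][] points, double[][] initialCentroids,
                              int maxIterations, double epsilon) {
        int k = initialCentroids.length;
        int dims = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            // Expectation: assign each point to its nearest centroid.
            for (double[] point : points) {
                int nearest = nearest(point, centroids);
                counts[nearest]++;
                for (int d = 0; d < dims; d++) {
                    sums[nearest][d] += point[d];
                }
            }
            // Maximization: move each centroid to the mean of its points.
            double maxShift = 0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double[] moved = new double[dims];
                for (int d = 0; d < dims; d++) {
                    moved[d] = sums[c][d] / counts[c];
                }
                maxShift = Math.max(maxShift, Math.sqrt(squaredDistance(centroids[c], moved)));
                centroids[c] = moved;
            }
            // Stopping condition: close enough.
            if (maxShift < epsilon) {
                break;
            }
        }
        return centroids;
    }

    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9}};
        double[][] guesses = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(cluster(points, guesses, 100, 1e-6)));
    }
}
```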
Classification
Classification
• BFF39D 577335 B3E631 D0F5B0 90B073 AFCF3C
• Green
Attributes of Classification Algorithms
• Require training (supervised)
• Make a single decision with a very limited set of outcomes
Classification
• Typical answers naturally fit into categories
Examples of Classification
• Credit card fraud prediction
• Customer attrition
• Diabetes detector
• Search Engine
Training - learning process that produces a model
Model - output of the training algorithm
Predictor variable - input for classification model
Target variable - what we are trying to predict
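To make the four terms concrete, here is a toy nearest-centroid classifier in plain Java. The predictor variables are the red, green and blue components of a color, the target variable is the color name, training averages the labeled examples, and the model is one mean RGB point per label. The Green samples are the hex codes from the earlier slide; the Red samples and all names are made up for illustration, and this is not one of Mahout's classifiers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical nearest-centroid color classifier.
final class ColorClassifierSketch {

    // Predictor variables: the red, green and blue components of a hex color.
    static double[] rgb(int hex) {
        return new double[] {(hex >> 16) & 0xFF, (hex >> 8) & 0xFF, hex & 0xFF};
    }

    // Training: produces the model, here one mean RGB point per label.
    static Map<String, double[]> train(Map<String, int[]> labeledColors) {
        Map<String, double[]> model = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> entry : labeledColors.entrySet()) {
            int[] samples = entry.getValue();
            double[] mean = new double[3];
            for (int hex : samples) {
                double[] components = rgb(hex);
                for (int i = 0; i < 3; i++) {
                    mean[i] += components[i] / samples.length;
                }
            }
            model.put(entry.getKey(), mean);
        }
        return model;
    }

    // Prediction: the target variable is the label of the nearest mean.
    static String classify(Map<String, double[]> model, int hex) {
        double[] components = rgb(hex);
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : model.entrySet()) {
            double distance = 0;
            for (int i = 0; i < 3; i++) {
                double diff = components[i] - entry.getValue()[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, int[]> training = new LinkedHashMap<>();
        training.put("Green", new int[] {0xBFF39D, 0x577335, 0xB3E631, 0xD0F5B0, 0x90B073, 0xAFCF3C});
        training.put("Red", new int[] {0xD0312D, 0xB22222, 0xE34234}); // made-up samples
        Map<String, double[]> model = train(training);
        System.out.println(classify(model, 0x7CBF4E));
        System.out.println(classify(model, 0xCC3333));
    }
}
```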
Common Algorithms
• Stochastic Gradient Descent (SGD)
• Support Vector Machine (SVM)
• Naive Bayes
• Complementary Naive Bayes
• Random Forest
Going Distributed
Overhead
• Parallel processing requires management overhead
• Especially when spread over multiple machines
Vector SequenceFile
• Keys - implement WritableComparable
• Values - implement Writable
Java counterparts
• WritableComparable - Comparable
• Writable - Serializable
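A sketch of what "writable and comparable" means for a key, in plain Java. The nested interface is a local stand-in that mirrors the two methods of Hadoop's Writable (write and readFields), declared here only so the example compiles without Hadoop on the classpath. The key class and the round-trip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of a key that can be written, read back and compared.
final class WritableRoundTripSketch {

    // Local stand-in mirroring the two methods of Hadoop's Writable.
    interface SketchWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // A key wrapping a user id: serializable to bytes and sortable.
    static final class UserIdKey implements SketchWritable, Comparable<UserIdKey> {
        long userId;

        UserIdKey() { }

        UserIdKey(long userId) {
            this.userId = userId;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(userId);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            userId = in.readLong();
        }

        @Override
        public int compareTo(UserIdKey other) {
            return Long.compare(userId, other.userId);
        }
    }

    // Writes the key to a byte array and reads it back into a fresh object.
    static UserIdKey roundTrip(UserIdKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            UserIdKey copy = new UserIdKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        UserIdKey copy = roundTrip(new UserIdKey(42L));
        System.out.println("user id after round trip: " + copy.userId);
    }
}
```

Unlike java.io.Serializable, this style writes only the raw field values, with no class metadata in the byte stream.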
Recap
Recommender - Rank large datasets
Clustering - Group your data
Classification - Train me to think like you
Integration with Hadoop - Through SequenceFiles and Map/Reduce jobs
Resources
• n-dimensional space http://en.wikipedia.org/wiki/File:Coord_system_CA_0.svg
• Batman http://www.flickr.com/photos/farukahmet/3005752670/sizes/l/in/photostream/
• duke http://kenai.com/projects/duke/pages/Home
• mahout logo http://mahout.apache.org/
• scalability diagram http://manning.com/owen/
• classification diagram http://manning.com/owen/
• phew http://www.flickr.com/photos/iain/1022210850/
• clouds http://www.flickr.com/photos/spazzo_1493/3682989696/
• spam http://www.flickr.com/photos/johotravels/4334224546/
• credit card http://www.flickr.com/photos/thetruthabout/4542026865/
• medical diagnostics http://www.flickr.com/photos/adrianclarkmbbs/3063516728/
• search engines http://www.flickr.com/photos/enda/144377951/
• angry http://www.flickr.com/photos/jmgasalla/3467458535/
• crystal ball http://www.flickr.com/photos/mache/142561526/
• glasses http://www.flickr.com/photos/nickwheeleroz/2220008689/
• coffee http://www.flickr.com/photos/mr_t_in_dc/2818254382/
Thanks!