MongoDB and Apache Flink / Spark “How to do Data Processing?” Marc Schwering Sr. Solution Architect – EMEA marc@mongodb.com @m4rcsch
2 Agenda For This Session • Data Processing Architectural Overview • The Life of an Application • Separation of Concerns / Real World Architecture • Apache Spark and Flink Data Processing Projects • Clustering with Apache Flink • Next Steps
3 Data Processing Architectural Overview 1. Profile created 2. Enrich with public data 3. Capture activity 4. Clustering analysis 5. Define Personas 6. Tag with personas 7. Personalize interactions Batch analytics Public data Common technologies • R • Hadoop • Spark • Python • Java • Many other options Personas change much less often than tagging
4 Evolution of a Profile (1)
{
  "_id" : ObjectId("553ea57b588ac9ef066428e1"),
  "ipAddress" : "216.58.219.238",
  "referrer" : "kay.com",
  "firstName" : "John",
  "lastName" : "Doe",
  "email" : "johndoe@gmail.com"
}
5 Evolution of a Profile (n+1)
{
  "_id" : ObjectId("553e7dca588ac9ef066428e0"),
  "firstName" : "John",
  "lastName" : "Doe",
  "address" : "229 W. 43rd St.",
  "city" : "New York",
  "state" : "NY",
  "zipCode" : "10036",
  "age" : 30,
  "email" : "john.doe@mongodb.com",
  "twitterHandle" : "johndoe",
  "gender" : "male",
  "interests" : [ "electronics", "basketball", "weightlifting", "ultimate frisbee", "traveling", "technology" ],
  "visitedCounts" : { "watches" : 3, "shirts" : 1, "sunglasses" : 1, "bags" : 2 },
  "purchases" : [
    { "id" : 1, "desc" : "Power Oxford Dress Shoe", "category" : "Mens shoes" },
    { "id" : 2, "desc" : "Striped Sportshirt", "category" : "Mens shirts" }
  ],
  "persona" : "shoe-fanatic"
}
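To make the jump from profile (1) to profile (n+1) concrete, here is a minimal sketch (not from the deck) of how an application might enrich the profile incrementally with the MongoDB Java driver; the connection string, collection name, and update pattern are assumptions.

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.*;

public class ProfileUpdates {
    public static void main(String[] args) {
        // Hypothetical database/collection names; not part of the presented example.
        MongoCollection<Document> profiles = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("shop").getCollection("profiles");

        // Steps 2/3 of the overview: enrich with public data and capture activity.
        profiles.updateOne(eq("email", "johndoe@gmail.com"),
                combine(set("twitterHandle", "johndoe"),
                        addToSet("interests", "electronics"),
                        inc("visitedCounts.watches", 1)));

        // Step 6: tag the profile with the persona computed by the batch job.
        profiles.updateOne(eq("email", "johndoe@gmail.com"),
                set("persona", "shoe-fanatic"));
    }
}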
6 One size/document fits all? • Profile Data – Preferences – Personal information • Contact information • DOB, gender, ZIP... • Customer Data – Purchase History – Marketing History • „Session Data“ – View History – Shopping Cart Data – Information Broker Data • Personalisation Data – Persona Vectors – Product and Category recommendations Application Batch analytics
7 Separation of Concerns • Profile Data – Preferences – Personal information • Contact information • DOB, gender, ZIP... • Customer Data – Purchase History – Marketing History • „Session Data“ – View History – Shopping Cart Data – Information Broker Data • Personalisation Data – Persona Vectors – Product and Category recommendations Batch analytics Layer Frontend - System Profile Service Customer Service Session Service Persona Service
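A hypothetical sketch of what this separation could look like as data: one document per concern, each owned by its service and stored in its own collection. Field names mirror the profile example above; the collection layout itself is an assumption, not from the deck.

import java.util.Arrays;
import org.bson.Document;

public class SeparatedDocuments {
    public static void main(String[] args) {
        String userId = "553e7dca588ac9ef066428e0";

        // Profile Service: preferences and personal information ("profiles" collection)
        Document profile = new Document("_id", userId)
                .append("firstName", "John").append("lastName", "Doe")
                .append("gender", "male").append("zipCode", "10036");

        // Customer Service: purchase and marketing history ("customers" collection)
        Document customer = new Document("_id", userId)
                .append("purchases", Arrays.asList(
                        new Document("id", 1).append("category", "Mens shoes")));

        // Session Service: view history and cart data ("sessions" collection)
        Document session = new Document("_id", userId)
                .append("visitedCounts", new Document("watches", 3).append("bags", 2));

        // Persona Service: personalization output ("personas" collection)
        Document persona = new Document("_id", userId)
                .append("persona", "shoe-fanatic");

        System.out.println(profile.toJson());
    }
}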
8 Benefits • Code does less; document and code stay focused • Ability to split – Different Teams – New Languages – Defined Dependencies
9 Advice for Developers (1) • Code does less; document and code stay focused • Ability to split – Different Teams – New Languages – Defined Dependencies KISS => Keep it simple, stupid! => Clean Code <= • Robert C. Martin: https://cleancoders.com/ • M. Fowler / B. Meyer et al.: Command Query Separation
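As a toy illustration of the Command Query Separation principle cited above (the ShoppingCart class is made up for this sketch, not one of the services in the deck): commands change state and return nothing, queries return data and cause no side effects.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ShoppingCart {
    private final List<String> items = new ArrayList<>();

    // Command: mutates state, returns void.
    void addItem(String sku) {
        items.add(sku);
    }

    // Query: returns data, has no side effects.
    List<String> items() {
        return Collections.unmodifiableList(items);
    }
}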
Analytics and Personalization From Query to Clustering
11 Separation of Concerns • Profile Data – Preferences – Personal information • Contact information • DOB, gender, ZIP... • Customer Data – Purchase History – Marketing History • „Session Data“ – View History – Shopping Cart Data – Information Broker Data • Personalisation Data – Persona Vectors – Product and Category recommendations Batch analytics Layer Frontend – System Profile Service Customer Service Session Service Persona Service
12 Separation of Concerns • Profile Data – Preferences – Personal information • Contact information • DOB, gender, ZIP... • Customer Data – Purchase History – Marketing History • „Session Data“ – View History – Shopping Cart Data – Information Broker Data • Personalisation Data – Persona Vectors – Product and Category recommendations Batch analytics Layer Frontend – System Profile Service Customer Service Session Service Persona Service
13 Architecture revised Profile Service Customer Service Session Service Persona Service Frontend – System Backend – Systems Data Processing
14 Advice for Developers (2) • OWN YOUR DATA! (but only relevant Data) • Say no! (to direct Data ie. DB Access)
Data Processing Solutions
16 Hadoop in a Nutshell • An open-source framework for distributed storage and distributed, batch-oriented processing • Hadoop Distributed File System (HDFS) to store data on commodity hardware • YARN as the resource management platform • MapReduce as the programming model working on top of HDFS
17 Spark in a Nutshell • Spark is a top-level Apache project • Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB • Fast and general engine for large-scale data processing and analytics • Advanced DAG execution engine with support for data locality and in-memory computing
18 Flink in a Nutshell • Flink is a top-level Apache project • Can be run on top of YARN and can read any Hadoop API data, including HDFS or MongoDB • A distributed streaming dataflow engine • Streaming and batch • Iterative, in-memory execution and handling • Cost-based optimizer
19 Latency of query operations (chart): query, aggregation, MapReduce, and cluster algorithms plotted against time, comparing MongoDB, Hadoop, and Spark/Flink
Iterative Algorithms / Clustering
21 K-Means in Pictures • Source: Wikipedia K-Means
22 K-Means as a Process
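The pictures boil down to a simple loop: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. Below is a framework-free sketch of one such iteration, with Point/Centroid shapes chosen to roughly match the Flink example later in the deck; it is illustrative, not the presenter's code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KMeansStep {

    static class Point { double x, y; Point(double x, double y) { this.x = x; this.y = y; } }
    static class Centroid extends Point { int id; Centroid(int id, double x, double y) { super(x, y); this.id = id; } }

    // One iteration: assign each point to its nearest centroid, then move every
    // centroid to the mean of the points assigned to it.
    static List<Centroid> iterate(List<Point> points, List<Centroid> centroids) {
        Map<Integer, double[]> sums = new HashMap<>(); // centroid id -> {sumX, sumY, count}
        for (Point p : points) {
            Centroid nearest = null;
            double best = Double.MAX_VALUE;
            for (Centroid c : centroids) {
                double d = Math.hypot(p.x - c.x, p.y - c.y);
                if (d < best) { best = d; nearest = c; }
            }
            double[] s = sums.computeIfAbsent(nearest.id, k -> new double[3]);
            s[0] += p.x; s[1] += p.y; s[2] += 1;
        }
        List<Centroid> updated = new ArrayList<>();
        for (Centroid c : centroids) {
            double[] s = sums.get(c.id);
            updated.add(s == null ? c : new Centroid(c.id, s[0] / s[2], s[1] / s[2]));
        }
        return updated;
    }
}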
23 Iterations in Hadoop and Spark
24 Iterations in Flink • Dedicated iteration operators • Tasks keep running for the iterations, not redeployed for each step • Caching and optimizations done automatically
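As a sketch of what those dedicated operators look like in the DataSet API, modeled on Flink's public K-means example: SelectNearestCenter, CountAppender, CentroidAccumulator and CentroidAverager are the usual user functions from that example and are assumed to exist here; only the iteration wiring is shown.

// Bulk iteration wiring (DataSet API, org.apache.flink.api.java.operators.IterativeDataSet)
public static DataSet<Centroid> kMeans(DataSet<Point> points,
                                       DataSet<Centroid> centroids,
                                       int numIterations) {
    IterativeDataSet<Centroid> loop = centroids.iterate(numIterations);

    DataSet<Centroid> newCentroids = points
            // assign each point to its nearest centroid (centroids arrive via broadcast)
            .map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
            .map(new CountAppender())                     // (centroidId, point, 1)
            .groupBy(0).reduce(new CentroidAccumulator()) // sum points and counts per centroid
            .map(new CentroidAverager());                 // divide by count => new center

    // closeWith feeds the new centroids back into the next iteration; the operators
    // stay deployed for all iterations instead of being re-scheduled per step.
    return loop.closeWith(newCentroids);
}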
Example Code
26 Reader / Writer Config
// reader config
public static DataSet<Tuple2<BSONWritable, BSONWritable>> readFromMongo(ExecutionEnvironment env, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.input.uri", uri);
    MongoInputFormat mongoInputFormat = new MongoInputFormat();
    return env.createHadoopInput(mongoInputFormat, BSONWritable.class, BSONWritable.class, conf);
}

// writer config
public static void writeToMongo(DataSet<Tuple2<BSONWritable, BSONWritable>> result, String uri) {
    JobConf conf = new JobConf();
    conf.set("mongo.output.uri", uri);
    MongoOutputFormat<BSONWritable, BSONWritable> mongoOutputFormat = new MongoOutputFormat<BSONWritable, BSONWritable>();
    result.output(new HadoopOutputFormat<BSONWritable, BSONWritable>(mongoOutputFormat, conf));
}
27 Import data
// points
DataSet<Tuple2<BSONWritable, BSONWritable>> inPoints = readFromMongo(env, mongoInputUri + pointsSource);
// centers
DataSet<Tuple2<BSONWritable, BSONWritable>> inCenters = readFromMongo(env, mongoInputUri + centerSource);

DataSet<Point> points = convertToPointSet(inPoints);
DataSet<Centroid> centroids = convertToCentroidSet(inCenters);
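The convertToPointSet helper is not shown on the slides; one possible shape of it, assuming the points are stored with "x"/"y" fields as they are written back on the next slide, is:

public static DataSet<Point> convertToPointSet(DataSet<Tuple2<BSONWritable, BSONWritable>> input) {
    return input.map(new MapFunction<Tuple2<BSONWritable, BSONWritable>, Point>() {
        @Override
        public Point map(Tuple2<BSONWritable, BSONWritable> tuple) throws Exception {
            // the value side of the tuple carries the full BSON document
            BSONObject doc = tuple.f1.getDoc();
            double x = ((Number) doc.get("x")).doubleValue();
            double y = ((Number) doc.get("y")).doubleValue();
            return new Point(x, y);
        }
    });
}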
28 Converting
public Tuple2<BSONWritable, BSONWritable> map(Tuple2<Integer, Point> integerPointTuple2) throws Exception {
    Integer id = integerPointTuple2.f0;
    Point point = integerPointTuple2.f1;

    // key: the document _id
    BasicDBObject idDoc = new BasicDBObject();
    idDoc.put("_id", id);
    BSONWritable bsonId = new BSONWritable();
    bsonId.setDoc(idDoc);

    // value: the full point document with its coordinates
    BasicDBObject doc = new BasicDBObject();
    doc.put("_id", id);
    doc.put("x", point.x);
    doc.put("y", point.y);
    BSONWritable bsonDoc = new BSONWritable();
    bsonDoc.setDoc(doc);

    return new Tuple2<BSONWritable, BSONWritable>(bsonId, bsonDoc);
}
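For completeness, one possible end-of-job wiring (a sketch, not the presenter's code): PointToBson is an assumed MapFunction wrapping the map() above, SelectNearestCenter is the assignment function from the iteration sketch, and mongoOutputUri/resultTarget are hypothetical variables.

// tag each point with its final centroid, convert back to BSON tuples, write, and run
DataSet<Tuple2<Integer, Point>> tagged = points
        .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");

DataSet<Tuple2<BSONWritable, BSONWritable>> out = tagged.map(new PointToBson());
writeToMongo(out, mongoOutputUri + resultTarget);

env.execute("Flink k-means with MongoDB input/output");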
29 Result
30 More…?
31 Takeaways • Evolution is amazing and exciting! – Be ready to learn new things, ask questions across silos! • Stay focused => Start and stay small – Evaluate with big documents, but do a PoC focused on the topic • Extending functionality could be challenging – Evolution is outpacing help channels – A lot of options (Spark, Flink, Storm, Hadoop, …) – More than just a binary choice • Extending functionality is easy – Aggregation, MapReduce – Connectors open up a new variety of use cases
32 Next Steps • Try out Flink – http://flink.apache.org/ – https://github.com/mongodb/mongo-hadoop – https://github.com/m4rcsch/flink-mongodb-example – http://sparkbigdata.com • Participate and ask Questions! – @m4rcsch – marc@mongodb.com
Thank you! Marc Schwering Sr. Solutions Architect – EMEA marc@mongodb.com @m4rcsch

Editor's Notes

  • #2 I do not write a lot of code anymore, but I visit a lot of prospects and customers handling tons of data and different deployments every week. This talk wants to prevent you from some pitfalls and should give you some advice on how to do it right.
  • #3 Personalization Process Review (What We Heard) Access Pattern and Development Cycle Separation of Concerns (MongoDB Point of View)
  • #4 Todo: zoom in common tech
  • #6 Even counts, and therefore personas, are very helpful. A good problem to have is too much information to personalize with – start simple, measure, and add.
  • #7 Profile: show logical document parts
  • #8 Frontend caching system like Varnish.
  • #10 KISS => Keep it simple, stupid! Todo: References!!!
  • #16 Hadoop: great for big data that is partitionable. Spark: MapReduce iterations are fast.
  • #17 Amongst Hadoop and others these are... In a distributed system, a conventional program would not work as the data is split across nodes. DAG (Directed Acyclic Graph) is a programming style for distributed systems - you can think of it as an alternative to MapReduce. While MR has just two steps (map and reduce), DAG can have multiple levels that can form a tree structure. Say you want to execute a SQL query: DAG is more flexible, with more functions like map, filter, union etc. DAG execution is also faster, as in the case of Apache Tez (which succeeds MR), because intermediate results are not written to disk. Coming to Spark, the main concept is the "RDD" - Resilient Distributed Dataset. To understand the Spark architecture, it's best to read the Berkeley paper (Page on berkeley.edu). In brief, RDDs are distributed data sets that can stay in memory and fall back to disk gracefully. RDDs, if lost, can easily be rebuilt using a graph that says how to reconstruct them. RDDs are great if you want to keep holding a data set in memory and fire a series of queries - this works better than fetching data from disk every time. Another important RDD concept is that there are two types of things that can be done on an RDD: 1) transformations like map and filter that result in another RDD, 2) actions like count that result in an output. A Spark job comprises a DAG of tasks executing transformations and actions on RDDs.
  • #20 Better graphic... If necessary, take Chris's and adapt them. Cluster algorithms... Chris's slides.
  • #22 Wikipedia! Gray squares!
  • #23 Todo: proper graphic
  • #24 Todo: add reference
  • #25 Todo: redesign graphic into MongoDB version. No black box; logic and hook.
  • #26 K-means explained, more complex theme also explained.
  • #30 Insert graph.
  • #32 Don't buy in too early. Solve real problems; choose the right tool. RDD and/or clustering jobs are “natural”. Stay operational and low-latency focused.
  • #33 New frontier, use it!