Warehousing MongoDB Data Using Apache Beam and BigQuery Sandeep Parikh Head of Solutions Architecture, Americas East @crcsmnky
Google Cloud Platform 2 About Me
Agenda
● MongoDB on Google Cloud Platform
● What is Data Warehousing
● Tools & Technologies
● Example Use Case
● Show, Don’t Tell
Confidential & Proprietary Google Cloud Platform 4 MongoDB on Google Cloud Platform
MongoDB on Google Cloud Platform
Manually Deploying MongoDB
Google Cloud Launcher
MongoDB Cloud Manager
MongoDB Cloud Manager: How do you automate this?
Bootstrapping MongoDB Cloud Manager: Deployment Manager Template
Cloud Deployment Manager
● Provision, configure your deployment
● Configuration as code
● Declarative approach to configuration
● Template-driven
● Supports YAML, Jinja, and Python
● Use schemas to constrain parameters
● References control order and dependencies
Bootstrapping Cloud Manager
● Schema, Configuration & Template posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager
● Three Compute Engine instances, each with 500 GB PD-SSD
● MongoDB Cloud Manager automation agent pre-installed and configured
$ gcloud deployment-manager deployments create mongodb-cloud-manager --config mongodb-cloud-manager.jinja --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY
What’s a Data Warehouse
Data Warehouses are central repositories of integrated data from one or more disparate sources https://en.wikipedia.org/wiki/Data_warehouse
Data Warehouse [diagram labels: Money, Data, Data, Data, Insights, Profit!]
Tools and Technologies
Where: BigQuery
● Complex, petabyte-scale data warehousing made simple
● Scales automatically; no setup or admin
● Foundation for analytics and machine learning
How: Apache Beam (incubating)
[Diagram labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow]
Understand What, Where, When, How
1 Classic Batch
2 Windowed Batch
3 Streaming
4 Streaming + Accumulation
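The step from classic batch to windowed batch comes down to grouping elements by the time window their event timestamp falls into. The following is a minimal sketch of that idea in plain Java, without the Beam SDK. The class name `FixedWindowSketch`, its method names, and the sample timestamps are illustrative assumptions, and the sketch ignores triggers, accumulation, and late data.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Beam: assigns event timestamps to
// fixed (tumbling) windows aligned to the Unix epoch.
class FixedWindowSketch {

    // Returns the start of the fixed window that contains the timestamp.
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMs = size.toMillis();
        long ts = timestamp.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    // Counts events per window, keyed by window start (sorted by time).
    static Map<Instant, Long> countPerWindow(List<Instant> eventTimes, Duration size) {
        Map<Instant, Long> counts = new TreeMap<>();
        for (Instant t : eventTimes) {
            counts.merge(windowStart(t, size), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample event times are made up for illustration.
        List<Instant> events = List.of(
            Instant.parse("2016-01-01T10:01:00Z"),
            Instant.parse("2016-01-01T10:04:59Z"),
            Instant.parse("2016-01-01T10:05:00Z"));
        System.out.println(countPerWindow(events, Duration.ofMinutes(5)));
    }
}
```

With five-minute windows, the first two sample events share a window and the third starts the next one.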
Pipelines in Beam: Batch to Streaming

Batch:
  // Slide shorthand; exact method names vary by SDK version.
  Pipeline p = Pipeline.create();
  p.begin()
    .apply(TextIO.Read.from("gs://…"))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.create())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://…"));
  p.run();

Streaming (swap the source and sink for Pub/Sub and add a window):
  Pipeline p = Pipeline.create();
  p.begin()
    .apply(PubsubIO.Read.from("input_topic"))
    // Group the unbounded input into five-minute fixed windows.
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.create())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.Write.to("output_topic"));
  p.run();
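To make the stages of the batch pipeline concrete, here is a rough in-memory sketch of what each step might compute, written in plain Java without the Beam SDK. The slide does not show the bodies of `ExtractTags` or `ExpandPrefixes`, so their behavior here (pull `#hashtags` out of a line, then emit every prefix of each tag) is an assumption, as are the class name `TagAutocompleteSketch` and the sample lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for the slide's pipeline stages; not Beam code.
class TagAutocompleteSketch {

    // Assumed ExtractTags behavior: keep tokens that start with '#'.
    static List<String> extractTags(String line) {
        return Arrays.stream(line.split("\\s+"))
            .filter(w -> w.startsWith("#") && w.length() > 1)
            .map(w -> w.substring(1).toLowerCase())
            .collect(Collectors.toList());
    }

    // Count step: number of occurrences of each tag across all lines.
    static Map<String, Long> countTags(List<String> lines) {
        return lines.stream()
            .flatMap(l -> extractTags(l).stream())
            .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
    }

    // Assumed ExpandPrefixes + top-n step: for every prefix of every tag,
    // keep the n most frequent tags (ties broken alphabetically).
    static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        Map<String, List<String>> byPrefix = new TreeMap<>();
        for (String tag : counts.keySet()) {
            for (int i = 1; i <= tag.length(); i++) {
                byPrefix.computeIfAbsent(tag.substring(0, i), k -> new ArrayList<>()).add(tag);
            }
        }
        Comparator<String> byCountDesc = Comparator
            .comparing((String t) -> counts.get(t))
            .reversed()
            .thenComparing(Comparator.<String>naturalOrder());
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : byPrefix.entrySet()) {
            result.put(e.getKey(),
                e.getValue().stream().sorted(byCountDesc).limit(n).collect(Collectors.toList()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input is made up for illustration.
        List<String> lines = List.of(
            "#beam is great #bigquery",
            "#beam #bigtable",
            "#beam #bigquery #mongodb");
        System.out.println(topPerPrefix(countTags(lines), 3));
    }
}
```

In a real pipeline each of these steps would run as a distributed transform over a collection; the sketch only shows the data flowing through them.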
Apache Beam Vision
[Diagram labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Beam Model: Fn Runners; Apache Flink, Apache Spark, Cloud Dataflow; Execution]
Running Apache Beam: Cloud Dataflow, Local Runner
Cloud Dataflow Service
A great place for executing Beam pipelines which provides:
● Fully managed, no-ops execution environment
● Integration with Google Cloud Platform
● Java support in GA. Python in Alpha
Fully Managed: Worker Lifecycle Management (Deploy, Tear Down)
Fully Managed: Dynamic Worker Scaling
Fully Managed: Dynamic Work Rebalancing (100 mins. vs. 65 mins.)
Integrated: Monitoring UI
Integrated: Distributed Logging
Integrated: Part of Google Cloud Platform
[Diagram stages: Capture, Store, Process, Analyze. Components shown: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud Datastore, Cloud Dataflow (Batch, Stream), Cloud Dataproc, BigQuery Analytics (SQL), Cloud Monitoring, Real-time analytics and alerts]
Example Use Case
Sensor Data
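One step a pipeline like this typically needs is turning a nested MongoDB document into a row that fits a warehouse table. The slides do not show the sensor schema or how the demo maps it, so the following is only a sketch under assumptions: the field names (`sensorId`, `ts`, `reading`) are hypothetical, the document is modeled as a plain `Map`, and nested fields are flattened into underscore-joined column names. BigQuery can also store nested records directly, so flattening is one possible design choice, not a requirement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: flattens a nested document into a single-level row.
class SensorRowSketch {

    // Returns a flat row; nested keys are joined with '_' (e.g. reading_temperature).
    static Map<String, Object> flatten(Map<String, Object> doc) {
        Map<String, Object> row = new LinkedHashMap<>();
        flattenInto("", doc, row);
        return row;
    }

    private static void flattenInto(String prefix, Map<String, Object> doc, Map<String, Object> row) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String column = prefix.isEmpty() ? e.getKey() : prefix + "_" + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                // Recurse into sub-documents, carrying the column prefix along.
                @SuppressWarnings("unchecked")
                Map<String, Object> nested = (Map<String, Object>) value;
                flattenInto(column, nested, row);
            } else {
                row.put(column, value);
            }
        }
    }

    public static void main(String[] args) {
        // Sample document with made-up field names and values.
        Map<String, Object> reading = new LinkedHashMap<>();
        reading.put("temperature", 21.5);
        reading.put("humidity", 40);
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("sensorId", "s-1");
        doc.put("ts", "2016-01-01T10:01:00Z");
        doc.put("reading", reading);
        System.out.println(flatten(doc));
    }
}
```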
Show, Don’t Tell
Insert Demo Here

MongoDB Europe 2016 - Warehousing MongoDB Data using Apache Beam and BigQuery
