Apache Spark Core APIs: RDDs, DataFrames, Datasets; Spark SQL; GraphX / GraphFrames (graph); Structured Streaming; MLlib (machine learning). (Source: Spark: The Definitive Guide)
Managed Apache Spark platform optimized for Azure. (Source: Microsoft Azure)
[Azure Databricks architecture diagram: optimized Databricks Runtime engine (Databricks I/O, serverless) plus a collaborative workspace; inputs from cloud storage, data warehouses, Hadoop storage, IoT / streaming data, and REST APIs; outputs to machine learning models, BI tools, data exports, and data warehouses; Apache Spark multi-stage pipelines with a job scheduler and notifications & logs for deploying production jobs and workflows; personas: data engineer, data scientist, business analyst; taglines: enhance productivity, build on a secure and trusted cloud, scale without limits.]
[Diagram: DBFS on top of Azure Blob storage, with CLI access.]
[Diagram: MongoDB Connector for Spark. A Spark master and executors (Executor 0 … Executor 7) running tasks, with Spark connections to a MongoDB replica set: one primary and two secondaries.]
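A minimal sketch of how a Databricks notebook might read that replica set through the connector. It assumes PySpark with the MongoDB Spark connector (the v3.x-style "mongo" data source) attached to the cluster as a library; the connection URI, database, and collection names are hypothetical placeholders.

```python
# Assumes a Databricks notebook where `spark` (a SparkSession) already exists
# and the MongoDB Spark connector library is attached to the cluster.
# The URI, database, and collection below are hypothetical placeholders.

people = (
    spark.read.format("mongo")                            # v3.x-style data source name
    .option("uri", "mongodb://host1:27017/demo.people")   # replica-set members can be listed in the URI
    .load()
)

people.printSchema()
people.show(5)

# Writing back goes through the same connector.
(
    people.limit(10)
    .write.format("mongo")
    .option("uri", "mongodb://host1:27017/demo.people_sample")
    .mode("overwrite")
    .save()
)
```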
References: Official Apache Spark website · Azure Databricks Documentation · MongoDB Connector for Apache Spark
MongoDB and Azure Databricks



Editor's Notes

  • #9 Objective: show the heterogeneous set of tools in the big data world; a slice of the big data ecosystem.
  • #10 Talking points: a unified computing engine, not a storage solution (it interfaces with existing storage); libraries (MLlib, GraphX, Spark SQL, Structured Streaming, and open-source packages). A small illustration of the unified APIs follows this note.
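A minimal sketch of that unified surface: the same in-memory data queried through the DataFrame API and through Spark SQL. It assumes PySpark in a Databricks-style environment where a SparkSession named spark already exists; the sample data is made up.

```python
# Assumes `spark` (a SparkSession) is predefined, as in a Databricks notebook.
from pyspark.sql import functions as F

# A tiny in-memory DataFrame (made-up sample data).
events = spark.createDataFrame(
    [("mobile", 3), ("web", 5), ("mobile", 7)],
    ["channel", "clicks"],
)

# 1) DataFrame API
events.groupBy("channel").agg(F.sum("clicks").alias("total_clicks")).show()

# 2) Spark SQL over the same data
events.createOrReplaceTempView("events")
spark.sql(
    "SELECT channel, SUM(clicks) AS total_clicks FROM events GROUP BY channel"
).show()
```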
  • #12 Developers can also choose to cache a particular Dataset for jobs that reuse it over and over again, as in the sketch below.
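A minimal caching sketch (PySpark, assuming an existing SparkSession named spark; the Parquet path is hypothetical):

```python
# Assumes `spark` (a SparkSession) exists; the path below is a hypothetical example.
ratings = spark.read.parquet("/mnt/demo/ratings.parquet")

# Mark the DataFrame for caching; it is materialized in memory on first use.
ratings.cache()

# Subsequent jobs over `ratings` reuse the cached copy instead of re-reading storage.
print(ratings.count())
ratings.groupBy("rating").count().show()

# Release the memory once the data is no longer needed.
ratings.unpersist()
```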
  • #14 Fun fact: Databricks employees have written over 75% of the code in Apache Spark. Why it's important: a scalable, distributed computing environment with pay-as-you-go (PAYG) pricing. https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks
  • #16 Databricks concepts:
    Workspaces: let you organize all the work you are doing on Databricks. Like a folder structure on your computer, a workspace lets you save notebooks and libraries and share them with other users. Workspaces are not connected to data and should not be used to store data; they simply hold the notebooks and libraries you use to operate on and manipulate your data.
    Notebooks: a set of any number of cells that let you execute commands. Cells hold code in any of the following languages: Scala, Python, R, SQL, or Markdown. Notebooks have a default language, but each cell can override it by starting with %[language name], for instance %python (see the sketch after this note). A notebook must be attached to a cluster to execute commands, but it is not permanently tied to one, so notebooks can be shared via the web or downloaded to your local machine. (A demonstration video of Notebooks is available.)
    Dashboards: can be created from notebooks as a way of displaying the output of cells without the code that generates them. Notebooks can also be scheduled as jobs in one click, either to run a data pipeline, update a machine learning model, or update a dashboard.
    Libraries: packages or modules that provide additional functionality needed to solve your business problems. These may be custom-written Scala or Java JARs, Python eggs, or custom-written packages. You can write and upload these manually, or install them directly via package-management utilities like PyPI or Maven.
    Tables: structured data that you and your team use for analysis. Tables can exist in several places: stored in cloud storage, stored on the cluster you are currently using, or cached in memory. For more about tables, see the documentation.
    Clusters: groups of computers that you treat as a single computer; in Databricks this means you can effectively treat 20 computers as you might treat one. Clusters let you execute code from notebooks or libraries on a set of data, whether raw data located in cloud storage or structured data you uploaded as a table to the cluster you are working on. Note that clusters have access controls governing who has access to each cluster. (A demonstration video of Clusters is available.)
    Jobs: the tool for scheduling execution, either on an existing cluster or on a cluster of its own. Jobs can run notebooks as well as JARs or Python scripts, and can be created manually or via the REST API. (A demonstration video of Jobs is available.)
    Apps: third-party integrations with the Databricks platform, such as Tableau.
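A minimal sketch of the per-cell language override mentioned above, assuming a Databricks notebook whose default language is Python; the second cell is shown as a comment because a magic like %sql must be the first line of its own cell:

```python
# Cell 1 (notebook default language: Python).
# Assumes `spark` (a SparkSession) is predefined, as in a Databricks notebook.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("demo")

# Cell 2 would override the default language with a magic on its first line:
# %sql
# SELECT id, label FROM demo WHERE id > 1
```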
  • #17 If Spark is a computing engine, where does Databricks store the data? (See the DBFS sketch below.)
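A minimal sketch of poking at DBFS, the Databricks file system layered over cloud object storage (Azure Blob storage on Azure Databricks). It assumes a Databricks notebook, where dbutils, display, and spark are predefined; the write path is a hypothetical example.

```python
# Assumes a Databricks notebook: `dbutils`, `display`, and `spark` are built in.

# List the root of DBFS (backed by cloud object storage, not by the cluster itself).
display(dbutils.fs.ls("/"))

# Write a small DataFrame to a hypothetical DBFS path and read it back.
spark.range(10).write.mode("overwrite").parquet("/tmp/demo_numbers")
spark.read.parquet("/tmp/demo_numbers").show()
```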
  • #18 Objective: show how easy it is to get started: create a Databricks workspace, create a Spark cluster, create a notebook, then import an example notebook: https://databricks.com/resources/type/example-notebooks (https://cdn2.hubspot.net/hubfs/438089/notebooks/Quick_Start/Quick_Start_Using_Python.html). A quick-start sketch follows this note.
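A minimal quick-start sketch in the spirit of that Python example notebook, assuming a notebook attached to a running cluster (so spark is predefined). The diamonds CSV path is the sample dataset commonly shipped with Databricks workspaces; treat the path and its columns as assumptions.

```python
# Assumes a Databricks notebook attached to a cluster; `spark` is predefined.
# The CSV path refers to the built-in sample datasets (assumed available).

diamonds = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")
)

diamonds.printSchema()

# Register as a temp view and query it with Spark SQL.
diamonds.createOrReplaceTempView("diamonds")
spark.sql(
    "SELECT cut, AVG(price) AS avg_price FROM diamonds GROUP BY cut ORDER BY avg_price DESC"
).show()
```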