WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Sri Chintala, Microsoft | Cosmos DB Real-time Advanced Analytics Workshop | #UnifiedDataAnalytics #SparkAISummit
Today’s customer scenario
• Woodgrove Bank provides payment processing services for commerce.
• They want to build a PoC of an innovative online fraud detection solution.
• Goal: monitor fraud in real time across millions of transactions to prevent financial loss and detect widespread attacks.
Part 1: Customer scenario
• Woodgrove Bank’s customers – end merchants – are located all around the world.
• The right solution minimizes the latency merchants experience when using the service by distributing the solution as close as possible to the regions where those customers operate.
Part 1: Customer scenario
• They have decades’ worth of historical transactional data, including transactions identified as fraudulent.
• The data is in tabular format and can be exported to CSVs.
• Their analysts are very interested in the notebook-driven approach to data science and data engineering tasks.
• They would prefer a solution that features notebooks to explore and prepare data, build models, and define the logic for scheduled processing.
Part 1: Customer needs
• Provide fraud detection services to merchant customers, using incoming payment transaction data to give early warning of fraudulent activity.
• Schedule offline scoring of “suspicious activity” using the trained model, and make the results globally available.
• Store data from streaming sources in long-term storage without interfering with read jobs.
• Use a standard platform that supports the near-term data pipeline needs and serves as the long-term standard for data science, data engineering, and development.
Common scenarios
Part 2: Design the solution (10 min)
• Design a solution and prepare to present it to the target customer audience in a chalk-talk format.
Part 3: Discuss preferred solution
Preferred solution – overall (architecture diagram)
Preferred solution – Data Ingest
• Payment transactions can be ingested in real time using Event Hubs or Azure Cosmos DB.
• Factors to consider:
  • rate of flow (how many transactions per second)
  • data source and compatibility
  • level of effort to implement
  • long-term storage needs
Preferred solution – Data Ingest
• Cosmos DB:
  • optimized for high write throughput
  • provides streaming through its change feed
  • TTL (time to live) gives automatic expiration and saves on storage cost (see the sketch below)
• Event Hubs:
  • data streams through and can be persisted (Capture) to Blob storage or ADLS
• Both guarantee event ordering per partition, so how you partition your data matters with either service.
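To make the TTL point concrete, here is a minimal sketch using the azure-cosmos Python SDK; the account, database, container, partition key, and expiry values are illustrative assumptions, not part of the workshop.

```python
# Minimal sketch (azure-cosmos Python SDK): enable TTL on the container, then let
# individual transaction documents expire on their own schedule. All names are illustrative.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<account-key>")
db = client.create_database_if_not_exists("Woodgrove")

# default_ttl=-1 turns TTL on without expiring items by default; items that carry their own
# "ttl" property (in seconds) are expired automatically, saving storage cost on raw events.
container = db.create_container_if_not_exists(
    id="transactions",
    partition_key=PartitionKey(path="/ipCountryCode"),   # illustrative partition key
    default_ttl=-1,
)

container.upsert_item({
    "id": "txn-0001",
    "ipCountryCode": "US",
    "amount": 129.99,
    "ttl": 60 * 60 * 24 * 30,   # keep this raw event for 30 days, then let Cosmos DB expire it
})
```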
Preferred solution – Data Ingest
• Cosmos DB is likely easier for Woodgrove to integrate because they already write payment transactions to a database.
• Cosmos DB multi-master accepts writes in any region (failover automatically redirects to the next available region).
• Event Hubs requires multiple instances in different geographies (failover requires more planning).
• Recommendation: Cosmos DB – think of it as a “persistent event store”.
Preferred solution – Data pipeline processing
• Azure Databricks:
  • a managed Spark environment that can process streaming and batch data
  • supports data science, data engineering, and development needs
• Features it provides on top of standard Apache Spark include:
  • AAD integration and RBAC
  • collaborative features such as shared workspaces and Git integration
  • scheduled jobs for automatic notebook/library execution
  • integration with Azure Key Vault
  • training and evaluating machine learning models at scale
Preferred solution – Data pipeline processing
• Azure Databricks can connect to both Event Hubs and Cosmos DB using their Spark connectors.
• Use Spark Structured Streaming to process real-time payment transactions into Databricks Delta tables (a sketch follows below).
• Be sure to set a checkpoint directory on your streams so stream processing can be restarted if the job is stopped at any point.
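A minimal sketch of that streaming hop, assuming the azure-cosmosdb-spark connector (Spark 2.4 era, as used around the time of this workshop); option names, secret scope, and paths are illustrative and may differ by connector version.

```python
# Read the payment transactions from the Cosmos DB change feed and stream them into a
# Databricks Delta table. Runs in a Databricks notebook, where `spark` and `dbutils` exist.
change_feed_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": dbutils.secrets.get("woodgrove-kv", "cosmos-account-key"),  # see next slide
    "Database": "Woodgrove",
    "Collection": "transactions",
    "ReadChangeFeed": "true",
    "ChangeFeedQueryName": "payments-to-delta",
    "ChangeFeedStartFromTheBeginning": "false",
    "ChangeFeedCheckpointLocation": "/tmp/changefeed-checkpoints",
}

payments = (spark.readStream
    .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSourceProvider")
    .options(**change_feed_config)
    .load())

# Always set checkpointLocation so the stream can pick up where it left off after a restart.
(payments.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/checkpoints/transactions")
    .start("/delta/transactions"))
```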
Preferred solution – Data pipeline processing
• Store secrets such as account keys and connection strings centrally in Azure Key Vault.
• Set Key Vault as the source for secret scopes in Azure Databricks; secret values read through a scope surface as [REDACTED] when printed in notebook output.
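Once a Key Vault-backed secret scope exists (created through the Databricks UI or CLI), notebooks read secrets with dbutils; the scope and key names below are illustrative.

```python
# Reading secrets from a Key Vault-backed scope inside a Databricks notebook.
cosmos_key = dbutils.secrets.get(scope="woodgrove-kv", key="cosmos-account-key")
eventhub_conn = dbutils.secrets.get(scope="woodgrove-kv", key="eventhub-connection-string")

print(cosmos_key)   # notebook output shows [REDACTED] rather than the secret value
```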
Preferred solution – Data pipeline processing
• Databricks Delta tables are Spark tables with built-in reliability and performance optimizations.
• They support batch and streaming with additional features:
  • ACID transactions: multiple writers can modify data simultaneously without interfering with jobs reading the data set.
  • DELETEs / UPDATEs / UPSERTs: standard DML is supported directly against the table (a sketch follows below).
  • Automatic file management: data access speeds up by organizing data into large files that can be read efficiently.
  • Statistics and data skipping: reads are 10-100x faster when statistics are tracked about the data in each file, allowing irrelevant data to be skipped.
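A short sketch of the DML features above, expressed as Spark SQL against hypothetical Delta tables; table and column names are illustrative.

```python
# Upsert newly scored rows into the Delta table; concurrent readers are not blocked.
spark.sql("""
  MERGE INTO transactions AS t
  USING transaction_updates AS u
  ON t.transactionID = u.transactionID
  WHEN MATCHED THEN UPDATE SET t.isSuspicious = u.isSuspicious
  WHEN NOT MATCHED THEN INSERT *
""")

# Deletes and updates are plain SQL statements as well.
spark.sql("DELETE FROM transactions WHERE ipCountryCode = 'ZZ'")
```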
Preferred solution – overall (architecture diagram)
Preferred solution – Model training & deployment
• Azure Databricks supports machine learning training at scale.
• Train the model using historical payment transaction data (a sketch follows below).
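One way the training step could look, sketched with Spark ML; the feature columns and the model family are illustrative assumptions rather than the workshop’s exact pipeline.

```python
# Train a fraud classifier on historical payment transactions stored as a Delta table.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

history = spark.read.format("delta").load("/delta/transactions_history")

# Illustrative numeric feature columns; `isFraud` is assumed to be a 0/1 label.
assembler = VectorAssembler(
    inputCols=["transactionAmount", "localHour", "digitalItemCount", "physicalItemCount"],
    outputCol="features",
)
gbt = GBTClassifier(labelCol="isFraud", featuresCol="features")

train, test = history.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, gbt]).fit(train)

# Evaluate on the held-out split before registering the model.
predictions = model.transform(test)
```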
Preferred solution – overall (architecture diagram)
Preferred solution – Model training & deployment
• Use Azure Machine Learning service (AML) to:
  • register the trained model
  • deploy it to an Azure Kubernetes Service (AKS) cluster for easy web accessibility and high availability.
• For scheduled batch scoring, access the model from a notebook and write the results to Cosmos DB via the Cosmos DB Spark connector (a sketch follows below).
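A rough sketch of both paths, assuming the azureml-core (v1) SDK and the azure-cosmosdb-spark connector; the AKS web-service deployment step is omitted, and all names, paths, and options are illustrative.

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.get(name="woodgrove-aml",
                   subscription_id="<subscription-id>",
                   resource_group="woodgrove-rg")

# 1. Register the trained model so it is versioned and can be deployed to AKS later.
Model.register(workspace=ws,
               model_path="/dbfs/models/fraud_model",   # illustrative path to the saved model
               model_name="woodgrove-fraud")

# 2. Scheduled batch scoring: score recent transactions in a notebook job and upsert the
#    suspicious ones into Cosmos DB. `model` is the Spark ML pipeline from the training sketch.
recent = spark.read.format("delta").load("/delta/transactions")
suspicious = model.transform(recent).filter("prediction = 1.0")

cosmos_write_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": dbutils.secrets.get("woodgrove-kv", "cosmos-account-key"),
    "Database": "Woodgrove",
    "Collection": "suspiciousTransactions",
    "Upsert": "true",
}
(suspicious.write
    .format("com.microsoft.azure.cosmosdb.spark")
    .mode("append")
    .options(**cosmos_write_config)
    .save())
```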
Preferred solution – overall (architecture diagram)
Preferred solution – Serving pre-scored data
Use Cosmos DB for storing offline-scored suspicious transaction data globally.
• Add the applicable customer regions.
• Estimate the RU/s needed – Cosmos DB can scale up and down to handle the workload.
• Consistency: Session consistency.
• Partition key: choose one that gives an even distribution of request volume and storage (see the point-read sketch below).
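On the consumption side, a minimal sketch of how a merchant-facing service might read a pre-scored result from the globally distributed container with the azure-cosmos Python SDK, using Session consistency; names and the partition key value are illustrative.

```python
from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/",
                      credential="<read-only-key>",
                      consistency_level="Session")   # matches the consistency recommended above

container = (client.get_database_client("Woodgrove")
                   .get_container_client("suspiciousTransactions"))

# A point read (id + partition key) is the cheapest, lowest-latency way to look up one transaction.
item = container.read_item(item="txn-0001", partition_key="US")
print(item.get("prediction"))
```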
Preferred solution – overall (architecture diagram)
Preferred solution – Long-term storage
• Use Azure Data Lake Storage Gen2 (ADLS Gen2) as the underlying long-term file store for Databricks Delta tables.
• Databricks Delta can compact small files together into larger files up to 1 GB in size using the OPTIMIZE command, which can improve query performance over time.
• Define file paths in ADLS for query, dimension, and summary tables, and point to those paths when saving to Delta (a sketch follows below).
• Delta tables can be accessed by Power BI through a JDBC connector.
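A sketch of what that layout might look like; the storage account, container, and table names are illustrative, and `transactions_df` stands in for any curated batch DataFrame.

```python
# Delta tables stored long-term on ADLS Gen2, addressed by abfss:// paths.
base = "abfss://delta@<storageaccount>.dfs.core.windows.net"

# Save each curated table to a well-known ADLS path...
transactions_df.write.format("delta").mode("append").save(f"{base}/transactions")

# ...and register a metastore table over that path so notebooks and BI tools can query it by name.
spark.sql(f"CREATE TABLE IF NOT EXISTS transactions USING DELTA LOCATION '{base}/transactions'")

# Periodically compact small files into larger ones (up to ~1 GB) to keep reads fast.
spark.sql(f"OPTIMIZE delta.`{base}/transactions`")
```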
Preferred solution – overall (architecture diagram)
Preferred solution – Dashboards & Reporting
• Connect to Databricks Delta tables from Power BI to allow analysts to build reports and dashboards.
• The connection is made using a JDBC connection string to an Azure Databricks cluster; querying the tables is similar to querying a more traditional relational database.
• Data scientists and data engineers can use Azure Databricks notebooks to craft complex queries and data visualizations (an example follows below).
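For example, the kind of summary query an analyst might run in a notebook, or as the equivalent SELECT from Power BI over the JDBC connection; column names are illustrative.

```python
# Aggregate suspicious-transaction counts by country from the Delta table.
summary = spark.sql("""
  SELECT ipCountryCode,
         COUNT(*)                                      AS txnCount,
         SUM(CASE WHEN isSuspicious THEN 1 ELSE 0 END) AS suspiciousCount
  FROM transactions
  GROUP BY ipCountryCode
  ORDER BY suspiciousCount DESC
""")
display(summary)   # Databricks' display() renders the result as a table or chart
```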
Preferred solution – Dashboards & Reporting
• A more cost-effective option for serving summary data to business analysts in Power BI is Azure Analysis Services.
• This eliminates the need to keep a dedicated Databricks cluster running at all times for reporting and analysis.
• Data is stored in a tabular semantic data model.
• Write to it during stream processing (using rolling aggregates – see the sketch below), or schedule batch writes via a Databricks job or Azure Data Factory (ADF).
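A sketch of the rolling-aggregate idea: a windowed aggregation over the payment stream from the earlier ingest sketch, written to a Delta table that a scheduled job could then push into the Analysis Services model; column names and window sizes are illustrative.

```python
from pyspark.sql.functions import window

# Five-minute transaction counts per country, with a watermark so late events are bounded.
rolling = (payments
    .withWatermark("transactionTime", "10 minutes")
    .groupBy(window("transactionTime", "5 minutes"), "ipCountryCode")
    .count()
    .withColumnRenamed("count", "txnCount"))

(rolling.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/checkpoints/rolling_aggregates")
    .start("/delta/rolling_aggregates"))
```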
Preferred solution – overall (architecture diagram)
Participant Guide • https://aka.ms/cosmos-mcw
DON’T FORGET TO RATE AND REVIEW THE SESSIONS – SEARCH “SPARK + AI SUMMIT”
