WiFi SSID: Spark+AISummit | Password: UnifiedDataAnalytics
Sri Chintala, Microsoft | Cosmos DB Real-time Advanced Analytics Workshop | #UnifiedDataAnalytics #SparkAISummit
Today’s customer scenario
• Woodgrove Bank provides payment processing services for commerce.
• They want to build a PoC of an innovative online fraud detection solution.
• Goal: monitor fraud in real time across millions of transactions to prevent financial loss and detect widespread attacks.
Part 1: Customer scenario
• Woodgrove Bank’s customers – end merchants – are located all around the world.
• The right solution minimizes the latency merchants experience when using the service by distributing the solution as close as possible to the regions where those customers operate.
Part 1: Customer scenario
• They have decades’ worth of historical transactional data, including transactions identified as fraudulent.
• The data is in tabular format and can be exported to CSVs.
• Their analysts are very interested in the notebook-driven approach to data science and data engineering tasks.
• They would prefer a solution that features notebooks to explore and prepare data, build models, and define the logic for scheduled processing.
Part 1: Customer needs
• Provide fraud detection services to merchant customers, using incoming payment transaction data to give early warning of fraudulent activity.
• Schedule offline scoring of “suspicious activity” using the trained model, and make the results globally available.
• Store data from streaming sources in long-term storage without interfering with read jobs.
• Use a standard platform that supports the near-term data pipeline needs and serves as the long-term standard for data science, data engineering, and development.
Common scenarios
Part 2: Design the solution (10 min)
• Design a solution and prepare to present it to the target customer audience in a chalk-talk format.
Part 3: Discuss preferred solution
Preferred solution – overall (architecture diagram)
Preferred solution – Data Ingest
• Payment transactions can be ingested in real time using Event Hubs or Azure Cosmos DB.
• Factors to consider:
  • rate of flow (how many transactions per second)
  • data source and compatibility
  • level of effort to implement
  • long-term storage needs
Preferred solution – Data Ingest
• Cosmos DB:
  • optimized for high write throughput
  • provides streaming through its change feed
  • TTL (time to live) gives automatic expiration and saves on storage cost (see the sketch below)
• Event Hubs:
  • data streams through and can be persisted (Capture) to Blob storage or ADLS
• Both guarantee event ordering per partition, so how you partition your data matters with either service.
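To make the TTL point concrete, here is a minimal sketch using the azure-cosmos Python SDK; the account, database, container, partition key, and expiry values are illustrative assumptions, not part of the workshop.

```python
# Minimal sketch (azure-cosmos Python SDK): enable TTL on the container, then let
# individual transaction documents expire on their own schedule. All names are illustrative.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<account-key>")
db = client.create_database_if_not_exists("Woodgrove")

# default_ttl=-1 turns TTL on without expiring items by default; items that carry their own
# "ttl" property (in seconds) are expired automatically, saving storage cost on raw events.
container = db.create_container_if_not_exists(
    id="transactions",
    partition_key=PartitionKey(path="/ipCountryCode"),   # illustrative partition key
    default_ttl=-1,
)

container.upsert_item({
    "id": "txn-0001",
    "ipCountryCode": "US",
    "amount": 129.99,
    "ttl": 60 * 60 * 24 * 30,   # keep this raw event for 30 days, then let Cosmos DB expire it
})
```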
Preferred solution – Data Ingest
• Cosmos DB is likely easier for Woodgrove to integrate because they already write payment transactions to a database.
• Cosmos DB multi-master accepts writes in any region (failover automatically redirects to the next available region).
• Event Hubs requires multiple instances in different geographies (failover requires more planning).
• Recommendation: Cosmos DB – think of it as a “persistent event store”.
Preferred solution – Data pipeline processing
• Azure Databricks:
  • a managed Spark environment that can process streaming and batch data
  • supports data science, data engineering, and development needs
• Features it provides on top of standard Apache Spark include:
  • AAD integration and RBAC
  • collaborative features such as shared workspaces and Git integration
  • scheduled jobs for automatic notebook/library execution
  • integration with Azure Key Vault
  • training and evaluating machine learning models at scale
Preferred solution – Data pipeline processing
• Azure Databricks can connect to both Event Hubs and Cosmos DB using their Spark connectors.
• Use Spark Structured Streaming to process real-time payment transactions into Databricks Delta tables (a sketch follows below).
• Be sure to set a checkpoint directory on your streams so stream processing can be restarted if the job is stopped at any point.
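A minimal sketch of that streaming hop, assuming the azure-cosmosdb-spark connector (Spark 2.4 era, as used around the time of this workshop); option names, secret scope, and paths are illustrative and may differ by connector version.

```python
# Read the payment transactions from the Cosmos DB change feed and stream them into a
# Databricks Delta table. Runs in a Databricks notebook, where `spark` and `dbutils` exist.
change_feed_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": dbutils.secrets.get("woodgrove-kv", "cosmos-account-key"),  # see next slide
    "Database": "Woodgrove",
    "Collection": "transactions",
    "ReadChangeFeed": "true",
    "ChangeFeedQueryName": "payments-to-delta",
    "ChangeFeedStartFromTheBeginning": "false",
    "ChangeFeedCheckpointLocation": "/tmp/changefeed-checkpoints",
}

payments = (spark.readStream
    .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSourceProvider")
    .options(**change_feed_config)
    .load())

# Always set checkpointLocation so the stream can pick up where it left off after a restart.
(payments.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/checkpoints/transactions")
    .start("/delta/transactions"))
```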
Preferred solution – Data pipeline processing
• Store secrets such as account keys and connection strings centrally in Azure Key Vault.
• Set Key Vault as the source for secret scopes in Azure Databricks; secret values read through a scope surface as [REDACTED] when printed in notebook output.
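Once a Key Vault-backed secret scope exists (created through the Databricks UI or CLI), notebooks read secrets with dbutils; the scope and key names below are illustrative.

```python
# Reading secrets from a Key Vault-backed scope inside a Databricks notebook.
cosmos_key = dbutils.secrets.get(scope="woodgrove-kv", key="cosmos-account-key")
eventhub_conn = dbutils.secrets.get(scope="woodgrove-kv", key="eventhub-connection-string")

print(cosmos_key)   # notebook output shows [REDACTED] rather than the secret value
```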
Preferred solution – Data pipeline processing
• Databricks Delta tables are Spark tables with built-in reliability and performance optimizations.
• They support batch and streaming with additional features:
  • ACID transactions: multiple writers can modify data simultaneously without interfering with jobs reading the data set.
  • DELETEs / UPDATEs / UPSERTs: standard DML is supported directly against the table (a sketch follows below).
  • Automatic file management: data access speeds up by organizing data into large files that can be read efficiently.
  • Statistics and data skipping: reads are 10-100x faster when statistics are tracked about the data in each file, allowing irrelevant data to be skipped.
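A short sketch of the DML features above, expressed as Spark SQL against hypothetical Delta tables; table and column names are illustrative.

```python
# Upsert newly scored rows into the Delta table; concurrent readers are not blocked.
spark.sql("""
  MERGE INTO transactions AS t
  USING transaction_updates AS u
  ON t.transactionID = u.transactionID
  WHEN MATCHED THEN UPDATE SET t.isSuspicious = u.isSuspicious
  WHEN NOT MATCHED THEN INSERT *
""")

# Deletes and updates are plain SQL statements as well.
spark.sql("DELETE FROM transactions WHERE ipCountryCode = 'ZZ'")
```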
Preferred solution – overall (architecture diagram)
Preferred solution – Model training & deployment
• Azure Databricks supports machine learning training at scale.
• Train the model using historical payment transaction data (a sketch follows below).
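One way the training step could look, sketched with Spark ML; the feature columns and the model family are illustrative assumptions rather than the workshop’s exact pipeline.

```python
# Train a fraud classifier on historical payment transactions stored as a Delta table.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

history = spark.read.format("delta").load("/delta/transactions_history")

# Illustrative numeric feature columns; `isFraud` is assumed to be a 0/1 label.
assembler = VectorAssembler(
    inputCols=["transactionAmount", "localHour", "digitalItemCount", "physicalItemCount"],
    outputCol="features",
)
gbt = GBTClassifier(labelCol="isFraud", featuresCol="features")

train, test = history.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, gbt]).fit(train)

# Evaluate on the held-out split before registering the model.
predictions = model.transform(test)
```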
Preferred solution – overall (architecture diagram)
Preferred solution – Model training & deployment
• Use Azure Machine Learning service (AML) to:
  • register the trained model
  • deploy it to an Azure Kubernetes Service (AKS) cluster for easy web accessibility and high availability.
• For scheduled batch scoring, access the model from a notebook and write the results to Cosmos DB via the Cosmos DB Spark connector (a sketch follows below).
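A rough sketch of both paths, assuming the azureml-core (v1) SDK and the azure-cosmosdb-spark connector; the AKS web-service deployment step is omitted, and all names, paths, and options are illustrative.

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.get(name="woodgrove-aml",
                   subscription_id="<subscription-id>",
                   resource_group="woodgrove-rg")

# 1. Register the trained model so it is versioned and can be deployed to AKS later.
Model.register(workspace=ws,
               model_path="/dbfs/models/fraud_model",   # illustrative path to the saved model
               model_name="woodgrove-fraud")

# 2. Scheduled batch scoring: score recent transactions in a notebook job and upsert the
#    suspicious ones into Cosmos DB. `model` is the Spark ML pipeline from the training sketch.
recent = spark.read.format("delta").load("/delta/transactions")
suspicious = model.transform(recent).filter("prediction = 1.0")

cosmos_write_config = {
    "Endpoint": "https://<account>.documents.azure.com:443/",
    "Masterkey": dbutils.secrets.get("woodgrove-kv", "cosmos-account-key"),
    "Database": "Woodgrove",
    "Collection": "suspiciousTransactions",
    "Upsert": "true",
}
(suspicious.write
    .format("com.microsoft.azure.cosmosdb.spark")
    .mode("append")
    .options(**cosmos_write_config)
    .save())
```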
Preferred solution – overall (architecture diagram)
Preferred solution – Serving pre-scored data
Use Cosmos DB for storing offline-scored suspicious transaction data globally.
• Add the applicable customer regions.
• Estimate the RU/s needed – Cosmos DB can scale up and down to handle the workload.
• Consistency: Session consistency.
• Partition key: choose one that gives an even distribution of request volume and storage (see the point-read sketch below).
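On the consumption side, a minimal sketch of how a merchant-facing service might read a pre-scored result from the globally distributed container with the azure-cosmos Python SDK, using Session consistency; names and the partition key value are illustrative.

```python
from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/",
                      credential="<read-only-key>",
                      consistency_level="Session")   # matches the consistency recommended above

container = (client.get_database_client("Woodgrove")
                   .get_container_client("suspiciousTransactions"))

# A point read (id + partition key) is the cheapest, lowest-latency way to look up one transaction.
item = container.read_item(item="txn-0001", partition_key="US")
print(item.get("prediction"))
```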
Preferred solution – overall (architecture diagram)
Preferred solution – Long-term storage
• Use Azure Data Lake Storage Gen2 (ADLS Gen2) as the underlying long-term file store for Databricks Delta tables.
• Databricks Delta can compact small files together into larger files up to 1 GB in size using the OPTIMIZE command, which can improve query performance over time.
• Define file paths in ADLS for query, dimension, and summary tables, and point to those paths when saving to Delta (a sketch follows below).
• Delta tables can be accessed by Power BI through a JDBC connector.
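A sketch of what that layout might look like; the storage account, container, and table names are illustrative, and `transactions_df` stands in for any curated batch DataFrame.

```python
# Delta tables stored long-term on ADLS Gen2, addressed by abfss:// paths.
base = "abfss://delta@<storageaccount>.dfs.core.windows.net"

# Save each curated table to a well-known ADLS path...
transactions_df.write.format("delta").mode("append").save(f"{base}/transactions")

# ...and register a metastore table over that path so notebooks and BI tools can query it by name.
spark.sql(f"CREATE TABLE IF NOT EXISTS transactions USING DELTA LOCATION '{base}/transactions'")

# Periodically compact small files into larger ones (up to ~1 GB) to keep reads fast.
spark.sql(f"OPTIMIZE delta.`{base}/transactions`")
```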
Preferred solution – overall (architecture diagram)
Preferred solution – Dashboards & Reporting
• Connect to Databricks Delta tables from Power BI to allow analysts to build reports and dashboards.
• The connection is made using a JDBC connection string to an Azure Databricks cluster; querying the tables is similar to querying a more traditional relational database.
• Data scientists and data engineers can use Azure Databricks notebooks to craft complex queries and data visualizations (an example follows below).
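For example, the kind of summary query an analyst might run in a notebook, or as the equivalent SELECT from Power BI over the JDBC connection; column names are illustrative.

```python
# Aggregate suspicious-transaction counts by country from the Delta table.
summary = spark.sql("""
  SELECT ipCountryCode,
         COUNT(*)                                      AS txnCount,
         SUM(CASE WHEN isSuspicious THEN 1 ELSE 0 END) AS suspiciousCount
  FROM transactions
  GROUP BY ipCountryCode
  ORDER BY suspiciousCount DESC
""")
display(summary)   # Databricks' display() renders the result as a table or chart
```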
Preferred solution – Dashboards & Reporting
• A more cost-effective option for serving summary data to business analysts in Power BI is Azure Analysis Services.
• This eliminates the need to keep a dedicated Databricks cluster running at all times for reporting and analysis.
• Data is stored in a tabular semantic data model.
• Write to it during stream processing (using rolling aggregates – see the sketch below), or schedule batch writes via a Databricks job or Azure Data Factory (ADF).
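A sketch of the rolling-aggregate idea: a windowed aggregation over the payment stream from the earlier ingest sketch, written to a Delta table that a scheduled job could then push into the Analysis Services model; column names and window sizes are illustrative.

```python
from pyspark.sql.functions import window

# Five-minute transaction counts per country, with a watermark so late events are bounded.
rolling = (payments
    .withWatermark("transactionTime", "10 minutes")
    .groupBy(window("transactionTime", "5 minutes"), "ipCountryCode")
    .count()
    .withColumnRenamed("count", "txnCount"))

(rolling.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/checkpoints/rolling_aggregates")
    .start("/delta/rolling_aggregates"))
```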
Preferred solution – overall (architecture diagram)
Participant Guide • https://aka.ms/cosmos-mcw
DON’T FORGET TO RATE AND REVIEW THE SESSIONS – SEARCH “SPARK + AI SUMMIT”
