Grokking Techtalk #42: Engineering challenges on building data platform for ML application

ML IN DATA PLATFORM: A Case Study with NLP Application
- US Office: 2150 Ringwood Ave, San Jose, CA 95131
- UK Office: 3 Beeston Place, Belgravia, London SW1W 0JJ, UK
- Vietnam Office: Floor #1-4, 302 Le Van Sy, Ward 1, Tan Binh District, HCMC, Vietnam
- SG Office: 6A Shenton Way #04-08 OUE Downtown Gallery Singapore 068815

Table of contents
1. Introduction
2. Data Platform – ETL Process
3. Data Platform – Analytics Workflow
4. Afterthoughts

01 INTRODUCTION
1. Introduction to Case Study
2. Introduction to Data Platform

1.1.1. Potential Values of ML/NLP Application
- ML applications can unlock new value
- Case study: Online Review Analytics
- Opinions from other customers increasingly guide purchases => implications for growth, improvement, and investment
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/

1.1.2. Dealing with text data
- An insight-mining platform for review text is highly valuable, but difficult to build
- Engineering challenges
  - Getting the reviews => web-scraping, data collection
  - Storing reviews => moving, maintaining, and deduplicating large amounts of text
  - Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
  - Natural Language Processing (NLP)
  - Insight communication: dashboards and visualization

1.2.1. Data Platform overall architecture

1.2.2. Example: output from ETL Process

1.3. Example: output from Analytics Workflow

1.3. Example: insight communication – Web Application

02 ETL PROCESS
1. Extract, Transform, Load
2. Data Collection
3. Data Storage

2.1. Extract, Transform, Load
- Extract:
  - Data Collector: collects data from websites
  - Extract and map fields from the raw collected data
- Transform: clean up data (trimming, special characters, …), deduplicate, etc.
- Load: write to databases for storage and analysis (MongoDB, BigQuery)
- Batching: split large amounts of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
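The Transform and Batching steps above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the function names (`clean_text`, `deduplicate`, `batches`), the review record shape (a dict with a `text` field), and the choice of a SHA-256 hash of the lower-cased text as the deduplication key are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator, List


def clean_text(raw: str) -> str:
    """Transform: normalize unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep newlines/tabs for now; they are collapsed to single spaces below.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(reviews: Iterable[dict]) -> Iterator[dict]:
    """Yield cleaned reviews, skipping empty texts and exact (case-insensitive) repeats."""
    seen = set()
    for review in reviews:
        cleaned = clean_text(review["text"])
        if not cleaned:
            continue
        # Assumed dedup key: hash of the cleaned, lower-cased text.
        key = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        yield {**review, "text": cleaned}


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Batching: group items into lists of at most `size` for parallel workers."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch could then be handed to a separate worker, which would run its own small extract-transform-load cycle on that slice of the data.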

2.1. Data Collection: web-scraping
Web Scraper

2.1. Data Collection: Benefit & Challenge
Benefit
- It’s free
- It’s big data
Challenge
- Fake data
- Hard to collect
  - Captcha
  - IP blocking
  - JavaScript rendering

2.1. Data Collection: How to deal with challenges?
- Proxy: to avoid IP blocking & Captcha
- Selenium (web browser): to overcome JavaScript rendering by controlling the browser from code
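As a rough sketch of the proxy idea only, the snippet below rotates outgoing requests across a pool of proxies using Python's standard library. The proxy addresses, the user-agent string, and the helper names are hypothetical placeholders. It does not render JavaScript; pages that need rendering would still require a code-controlled browser such as Selenium, which is not shown here.

```python
import itertools
import urllib.request
from typing import Iterable, Iterator

# Hypothetical proxy endpoints, for illustration only.
PROXIES = ["http://proxy-a.example:8080", "http://proxy-b.example:8080"]


def proxy_cycle(proxies: Iterable[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the proxy pool."""
    return itertools.cycle(proxies)


def build_opener_for(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", "review-collector-sketch/0.1")]
    return opener


def fetch(url: str, rotation: Iterator[str], timeout: float = 10.0) -> bytes:
    """Fetch a page through the next proxy in the rotation."""
    opener = build_opener_for(next(rotation))
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Spreading requests over several exit addresses lowers the request rate seen from any single IP, which is the property the slide relies on to avoid blocking.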

2.2. Data Storage
- PostgreSQL: stores process metadata (used by the orchestrator)
- Google Cloud Storage: stores intermediary CSV files
- MongoDB: flexible, persistent storage for text documents; allows easy and frequent edits
- Google BigQuery: analytics data storage and distributed processing engine using SQL, a familiar language for Data Analysts

03 ANALYTICS WORKFLOW
1. First Implementation
2. Inference Services

3.1.1 Analytics Workflow
- After the ETL process, data is available for further processing and analysis
- Analytics Workflow:
  - A part of the Data Platform
  - Extracts information from data for insights
- Machine Learning models are an integral part of text analytics
- Extracted information is pushed to BigQuery for querying

3.1.2 First implementation
- Implement each model as a worker
- Advantages:
  - Easy to implement
  - Suitable for early stages: fast implementation and acceptable performance
- Several drawbacks (technical debt):
  - Mixing of concerns
  - Low flexibility
  - Limited scalability

3.1.3 First implementation: mixing of concerns
- Data Platform’s intended purpose: moving data, processing it, and interacting with various APIs on the way => mostly I/O operations
- Computationally heavy tasks are usually delegated, e.g. to BigQuery
- Running models inside workers mixes I/O with computation

3.1.4 First implementation: scalability
- Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of various lengths, some very long)
- Manual scaling: replicate workers -> VM resource/cost constraints
- GPU acceleration? -> ETL workers don’t need GPUs

3.1.5 First implementation: monitoring and maintenance
- No real monitoring components for performance degradation
  - Data drift, concept drift?
- If needed, the model is inspected manually
- Data is collected and processed, and models are re-trained, manually
- Trained model is uploaded to GCS, then workers are re-deployed

3.2.1 Inference Services: separation of concerns
- Introducing Inference Services
  - No direct I/O for data; they only accept HTTP requests carrying the input and respond with the computed results
=> Easier to maintain and optimize both ends
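A minimal sketch of such a service, using only Python's standard library, might look like the following. The `/predict` path, the JSON request shape (`{"text": ...}`), and the keyword-based `predict_sentiment` function are illustrative assumptions; the keyword rule is a stand-in for a real ML model, and the talk does not say which web framework was used.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_sentiment(text: str) -> dict:
    """Placeholder model: a keyword rule standing in for a real ML model."""
    positive = {"good", "great", "love"}
    negative = {"bad", "poor", "hate"}
    words = set(text.lower().split())
    score = len(words & positive) - len(words & negative)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score}


class InferenceHandler(BaseHTTPRequestHandler):
    """Accepts POST /predict with a JSON body and returns the model output as JSON."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            text = payload["text"]
            if not isinstance(text, str):
                raise TypeError("text must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "expected a JSON body with a string 'text' field")
            return
        body = json.dumps(predict_sentiment(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port: int = 0) -> HTTPServer:
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), InferenceHandler)
```

Calling `make_server(8080).serve_forever()` would start the service locally. The service never touches MongoDB, BigQuery, or GCS itself, so an ETL worker only needs to POST text and store the response.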

3.2.2 Inference Services: overall architecture

3.2.3 Inference Services: solving redundancy and reusability
- Each ML model is treated as a microservice
- Several ML models can be connected as an inference pipeline for complex tasks
- Promotes reusability and flexibility => saves resources
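The idea of chaining reusable models into a pipeline can be sketched as below. For simplicity the "services" are local Python functions; in the architecture described here each step would presumably be an HTTP call to a separate microservice. The step names, the registry, and the rule-based logic are hypothetical stand-ins for real models.

```python
import re
from typing import Callable, Dict, List

Step = Callable[[dict], dict]


def split_sentences(doc: dict) -> dict:
    """Stand-in for a sentence-segmentation service."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    return {**doc, "sentences": [p.strip() for p in parts if p.strip()]}


def tag_sentiment(doc: dict) -> dict:
    """Stand-in for a sentiment service; expects the output of split_sentences."""
    negative = {"bad", "slow", "broken"}
    labels = []
    for sentence in doc["sentences"]:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        labels.append("negative" if words & negative else "non-negative")
    return {**doc, "sentiment": labels}


def build_pipeline(registry: Dict[str, Step], names: List[str]) -> Step:
    """Compose registered steps, in order, into a single callable."""
    steps = [registry[name] for name in names]

    def run(doc: dict) -> dict:
        for step in steps:
            doc = step(doc)
        return doc

    return run


REGISTRY: Dict[str, Step] = {
    "split_sentences": split_sentences,
    "tag_sentiment": tag_sentiment,
}

# The same segmentation step is reused by two different pipelines.
segment_only = build_pipeline(REGISTRY, ["split_sentences"])
review_pipeline = build_pipeline(REGISTRY, ["split_sentences", "tag_sentiment"])
```

Because each step is registered once and referenced by name, a new pipeline can reuse an existing model without deploying a second copy of it.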

3.2.4 Inference Services: solving scalability
- Services are containerized, run, and deployed independently
- Can be migrated to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier
  - Rollout deployments: no/minimal downtime
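To make the autoscaling point concrete, the sketch below reproduces the core scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). It is a simplified model for intuition only; the real controller also handles pod readiness, missing metrics, and stabilization windows. The replica bounds and example numbers are assumptions, not values from the talk.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified Horizontal Pod Autoscaler rule (ignores readiness and stabilization)."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 3 replicas averaging 90% CPU against a 60% target would scale to `desired_replicas(3, 90, 60)`, which is 5.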

3.2.5. Inference Services: monitoring
- Metrics are logged to a central data-lake and visualized in a dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
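One possible way to produce such metrics is to wrap each inference call and emit a JSON record per request, as sketched below. The record fields (`model`, `input_chars`, `latency_ms`, `status`) and the `emit` callback are assumptions for illustration; the talk does not specify which metrics were collected or how they were shipped to the data lake.

```python
import json
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def record_inference(
    model: str,
    text: str,
    emit: Callable[[str], None] = print,
) -> Iterator[dict]:
    """Time one inference call and emit a JSON metric record when it finishes."""
    metric = {"model": model, "input_chars": len(text), "ts": time.time()}
    start = time.perf_counter()
    try:
        # The caller may attach extra fields (e.g. the predicted label).
        yield metric
        metric["status"] = "ok"
    except Exception:
        metric["status"] = "error"
        raise
    finally:
        metric["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        # `emit` is a hypothetical sink; it could write to a log shipper or queue.
        emit(json.dumps(metric, sort_keys=True))
```

A handler would use it as `with record_inference("sentiment-v1", text) as m: m["label"] = model(text)`. Aggregating `input_chars` and predicted labels over time is one simple way to spot shifts in the incoming data.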

3.2.6. Inference Services: results and drawbacks
- Results
  - A more flexible and effective solution
  - More resilient ETL process: less complex
  - Reduced ETL resource consumption and processing time
  - The new system of services can be developed and maintained separately
- Drawbacks
  - More infrastructure and tools -> management overhead
  - Complex inter-dependency between inference services as the system expands
  - Requires more expertise in managing K8s clusters and deployments

4.1. What We Learned?
- ML applications can be tricky to get right
  - Few resources and established best practices
  - Solved by: thorough analysis of use-cases
  - Solved by: proper scoping and sizing
- Separating I/O-intensive from computationally intensive tasks
  - ETL components
  - ML components
- Good architecture design from the beginning can save time and cost later
  - Over-engineered vs under-engineered
  - Easy in hindsight, difficult in practice
Hope these ideas help you in designing your next ML application.
