ML IN DATA PLATFORM
A Case Study with NLP Application
US Office: 2150 Ringwood Ave, San Jose, CA 95131
UK Office: 3 Beeston Place, Belgravia, London SW1W 0JJ, UK
Vietnam Office: Floor #1-4, 302 Le Van Sy, Ward 1, Tan Binh District, HCMC, Vietnam
SG Office: 6A Shenton Way #04-08 OUE Downtown Gallery Singapore 068815
Table of contents
1. Introduction
2. Data Platform – ETL Process
3. Data Platform – Analytics Workflow
4. Afterthoughts
01 INTRODUCTION
1. Introduction to Case Study
2. Introduction to Data Platform
1.1.1. Potential Values of ML/NLP Application
- ML applications can unlock new sources of value
- Case study: Online Review Analytics
- Opinions from others increasingly guide customers' purchases => implications for growth, improvement, and investment
Refs
- https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/five-star-growth-using-online-ratings-to-design-better-products
- https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/customer-review-preference-statistics/
1.1.2. Dealing with text data
- An insight-mining platform for review text is highly valuable, but difficult to build
- Engineering challenges
  - Getting the reviews => web-scraping, data collection
  - Storing reviews => moving, maintaining, and deduplicating large amounts of text
  - Processing reviews => text cleaning, processing, and analytics at scale
- Analytics challenges
  - Natural Language Processing (NLP)
  - Insight communication: dashboards and visualization
1.2.1. Data Platform overall architecture
1.2.2. Example: output from ETL Process
1.3. Example: output from Analytics Workflow
1.3. Example: insight communication – Web Application
02 ETL PROCESS
1. Extract, Transform, Load
2. Data Collection
3. Data Storage
2.1. Extract, Transform, Load
- Extract:
  - Data Collector: collects data from websites
  - Extract and map fields from the raw data collected
- Transform: clean up data (trimming, special characters, …), deduplication, etc.
- Load: write to databases for storage and analysis: MongoDB, BigQuery
- Batching: split large amounts of data into batches for parallel processing
- Worker: a container that moves/processes data -> Mini-ETL
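The batching and worker ideas above can be illustrated with a minimal Python sketch. It is not the platform's actual code: the function names (`batches`, `transform`, `worker`, `run_etl`), the use of a thread pool in place of containers, and the content hash used as a deduplication key are all assumptions made for illustration, and the load step is left out.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator


def batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transform(raw: dict) -> dict:
    """Trim the text and collapse runs of whitespace into single spaces."""
    text = " ".join(raw["text"].split())
    # Assumption of this sketch: a case-insensitive content hash is the dedup key.
    key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
    return {"id": key, "text": text}


def worker(batch: list) -> list:
    """Mini-ETL over one batch: transform each record, drop in-batch duplicates."""
    seen = {}
    for raw in batch:
        row = transform(raw)
        seen.setdefault(row["id"], row)  # keep the first occurrence
    return list(seen.values())


def run_etl(raw_records: list, batch_size: int = 2, max_workers: int = 4) -> list:
    """Process batches in parallel, then deduplicate across batches."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, batches(raw_records, batch_size)))
    merged = {}
    for rows in results:
        for row in rows:
            merged.setdefault(row["id"], row)
    return list(merged.values())
```

Calling `run_etl([{"text": "  Great   phone "}, {"text": "great phone"}])` returns a single cleaned record, because both inputs normalize to the same key. In a real deployment each `worker` call would presumably run in its own container rather than a thread.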
2.1. Data Collection: web-scraping
Web Scraper
2.1. Data Collection: Benefit & Challenge
Benefit
- It's free
- It's big data
Challenge
- Fake data
- Hard to collect
  - Captcha
  - IP blocking
  - JavaScript rendering
2.1. Data Collection: How to deal with challenges?
- PROXY: to avoid IP blocking & Captcha
- WEB BROWSER: to overcome JavaScript rendering
- SELENIUM: control the browser by code
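As a rough sketch of how the proxy and Selenium pieces might fit together, the Python below rotates through a list of proxies and drives Chrome through one of them so that JavaScript-rendered content ends up in the returned HTML. The helper names and the proxy addresses in the usage note are hypothetical, and Chrome is an assumption (the slides name only Selenium). A proxy lowers the chance of IP blocks and Captcha prompts but does not solve a Captcha once it is shown.

```python
import itertools


def proxy_cycle(proxies):
    """Return an endless round-robin iterator over proxy addresses."""
    return itertools.cycle(proxies)


def chrome_arguments(proxy: str, headless: bool = True) -> list:
    """Build Chrome command-line flags that route traffic through `proxy`."""
    args = [f"--proxy-server={proxy}"]
    if headless:
        args.append("--headless=new")
    return args


def fetch_rendered_html(url: str, proxy: str) -> str:
    """Load `url` in a real browser so JavaScript runs, then return the page HTML."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(proxy):
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A scraper could call `fetch_rendered_html(url, next(pool))` with `pool = proxy_cycle(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])`, so that consecutive requests leave from different addresses.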
2.2. Data Storage
- PostgreSQL: stores process metadata (used by the orchestrator)
- Google Cloud Storage: stores intermediary CSV files
- MongoDB: flexible, persistent storage for text documents; allows easy and frequent edits
- Google BigQuery: analytics data storage and distributed processing engine using SQL, a familiar language for Data Analysts
03 ANALYTICS WORKFLOW
1. First Implementation
2. Inference Services
3.1.1 Analytics Workflow
- After the ETL process, data is available for further processing and analysis
- Analytics Workflow:
  - A part of the Data Platform
  - Extracts information from data for insights
  - Machine Learning models are an integral part of text analytics
  - Information is extracted and pushed to BigQuery for queries
3.1.2 First implementation
- Implement each model as a worker
- Advantages:
  - Easy to implement
  - Suitable for early stages: fast implementation and acceptable performance
- Several drawbacks (technical debt):
  - Mixing of concerns
  - Low flexibility
  - Limited scalability
3.1.3 First implementation: mixing of concerns
- Data Platform’s intended purpose: moving data, processing it, and interacting with various APIs along the way => mostly I/O operations
- Computationally heavy tasks are usually delegated, e.g. to BigQuery
- Running ML models inside workers mixes I/O with computation
3.1.4 First implementation: scalability
- Everything seems OK until we must process many reviews (100,000s to 1,000,000s, of various lengths, some very long)
- Manual scaling: replicate workers -> constrained by VM resources and cost
- GPU acceleration? -> ETL workers don’t need a GPU
3.1.5 First implementation: monitoring and maintenance
- No real monitoring components for performance degradation
  - Data drift, concept drift?
- If needed, the model is inspected manually
- Collect, process, and re-train models manually
- Upload the trained model to GCS, re-deploy workers
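To make the data-drift question concrete, here is a small sketch of one common check, the Population Stability Index (PSI), applied to a single numeric feature such as review length. The talk does not name a drift metric, so PSI, the bin edges, and the function names are assumptions chosen for illustration only.

```python
import math


def histogram(values, edges):
    """Return the fraction of `values` falling in each bin defined by sorted `edges`."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index between a reference sample and a current sample."""
    ref = histogram(reference, edges)
    cur = histogram(current, edges)
    score = 0.0
    for r, c in zip(ref, cur):
        # Clamp empty bins to a small value so the logarithm stays defined.
        r, c = max(r, eps), max(c, eps)
        score += (c - r) * math.log(c / r)
    return score
```

A score of zero means the two samples fill the bins identically; larger values indicate that the incoming reviews look less like the data the model was trained on, which is a hint that manual inspection or re-training may be due.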
3.2.1 Inference Services: separation of concerns
- In come the Inference Services
- No direct I/O for data; they only accept HTTP requests with input and respond with computed results
=> Easier to maintain and optimize both ends
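A minimal sketch of such a service, using only the Python standard library, is shown below. The request shape (`{"texts": [...]}`), the port, and the toy word-count `predict` function are assumptions standing in for a real NLP model; the point is only that the service reads nothing but the request body and writes nothing but the response.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


def predict(text: str) -> dict:
    """Placeholder model: a toy word-count sentiment score."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": score}


def handle(body: bytes):
    """Pure request logic: JSON bytes in, (status, JSON bytes) out. No data-store access."""
    error = json.dumps({"error": "expected a JSON object with a 'texts' list of strings"})
    try:
        texts = json.loads(body)["texts"]
    except (ValueError, KeyError, TypeError):
        return 400, error.encode()
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        return 400, error.encode()
    return 200, json.dumps({"results": [predict(t) for t in texts]}).encode()


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, out = handle(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)


def serve(port: int = 8080):
    """Start the service; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

Keeping `handle` free of sockets and storage means the ETL side only needs an HTTP client, and the model side can be tested and optimized without touching the data platform.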
3.2.2 Inference Services: overall architecture
3.2.3 Inference Services: solving redundancy and reusability
- Each ML model is treated as a microservice
- Several ML models can be connected into an inference pipeline for complex tasks
- Promotes reusability and flexibility => saves resources
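The reuse argument can be sketched in Python by treating each service as a step that takes a document and returns it with extra fields. The three steps here (`split_sentences`, `tag_sentiment`, `count_words`) are hypothetical toy stand-ins; in the architecture described, each would be an HTTP call to a separate microservice rather than a local function.

```python
def split_sentences(doc: dict) -> dict:
    """Toy sentence splitter shared by both pipelines below."""
    parts = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "sentences": parts}


def tag_sentiment(doc: dict) -> dict:
    """Toy tagger: a sentence is 'negative' if it contains a listed word."""
    negative = {"bad", "poor", "slow"}
    tags = [
        "negative" if negative & set(s.lower().split()) else "non-negative"
        for s in doc["sentences"]
    ]
    return {**doc, "sentiment": tags}


def count_words(doc: dict) -> dict:
    """Attach the word count of each sentence."""
    return {**doc, "word_counts": [len(s.split()) for s in doc["sentences"]]}


def pipeline(*steps):
    """Chain steps so the output of one becomes the input of the next."""
    def run(doc: dict) -> dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run


# Two pipelines reuse the same first step instead of duplicating it.
sentiment_pipeline = pipeline(split_sentences, tag_sentiment)
length_pipeline = pipeline(split_sentences, count_words)
```

Because `split_sentences` is shared, improving or scaling it benefits every pipeline that includes it, which is the resource saving the slide points to.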
3.2.4 Inference Services: solving scalability
- Services are containerized, run, and deployed independently
- Can be migrated to any environment with relative ease
- For maximum scalability => K8s cluster (GKE) with autoscaling
- Thanks to K8s, deployment is easier
  - Rollout deployments: no/minimal downtime
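For intuition on what autoscaling does, the sketch below reproduces in Python the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. Whether the platform used this particular autoscaler is an assumption, and the replica bounds and metric values are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA scaling rule, clamped to [min, max]."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: do not scale.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, four replicas running at 90% average CPU against a 60% target would be scaled to six, while the same four replicas at 62% would be left alone.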
3.2.5. Inference Services: monitoring
- Metrics are logged to a central data lake and visualized in a dashboard.
Image from https://www.datarobot.com/wiki/machine-learning-operations-mlops/
3.2.6. Inference Services: results and drawbacks
- Results
  - A more flexible and effective solution
  - More resilient ETL process: less complex
  - Reduced ETL resource consumption and processing time
  - The new system of services can be developed and maintained separately
- Drawbacks
  - More infrastructure and tools -> management overhead
  - Complex inter-dependency among inference services as the system expands
  - Requires more expertise in managing K8s clusters and deployment
04 WHAT WE LEARNED
4.1. What We Learned?
- ML applications can be tricky to get right
  - Few resources and best practices are available
  - Solved by: thorough analysis of use cases
  - Solved by: proper scoping and sizing
- Separate I/O-intensive tasks from computationally intensive tasks
  - ETL components
  - ML components
- Good architecture design from the beginning can save time and cost later
  - Over-engineered vs under-engineered
  - Easy in hindsight, difficult in practice
Hope these ideas help you in designing your next ML application
THANK YOU – Q&A

Grokking Techtalk #42: Engineering challenges on building data platform for ML application
