Lambda Architecture Platform Using SQL Sep 13 2014 HadoopCon 2014 Taiwan TAGOMORI Satoshi (@tagomoris)
Taipei
Topics About Me & LINE Data analytics workloads Batch processing Stream processing Lambda architecture Lambda architecture using SQL Norikra: Stream processing with SQL 13:30-14:20 4F
@tagomoris Satoshi Tagomori (田籠 聡) LINE Corporation Analytics Platform Team
Tokyo
LINE Offices Tokyo HQ Spain Thailand Taipei USA Korea
LINE is born! JUNE 23, 2011
Data Analytics Workload Part 01
Various Data Analytics Workload Reports Monthly/Daily reports Hourly (or shorter) news Real-time metrics Automatically updated reports/graphs Alerts for abuse of services, overload, ...
Batch Processing Hadoop MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...) For reports MPP Engines Cloudera Impala, Apache Drill, Facebook Presto, ... For interactive analysis For reports of shorter window
Stream Processing Apache Storm Incubator project “Distributed and fault-tolerant realtime computation” Norikra by tagomoris Non-distributed “Stream processing with SQL”
Why Stream Processing? Less latency Realtime metrics Short-term prompt reports Less computing power 10Mbps for batch processing: 100GB/day 10Mbps for stream processing: 1 Server No query schedule management Once query registered, it runs forever
Disadvantage of Stream Processing Queries must be written before data There should be another way to query past data Queries cannot be run twice All results will be lost when any error occurs All data have gone when bugs found Disorders of events break results Recorded time based queries? Or arrival time based queries?
Part 02 Lambda Architecture
Lambda Architecture “The Lambda-architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.” http://lambda-architecture.net/
Lambda Architecture: Overview new data batch layer master dataset serving layer view speed layer real-time view query
Twitter Summingbird Lambda architecture library Batch mode: Scalding on Hadoop MapReduce Realtime mode: Storm Word counting by Summingbird (scala): def wordCount[P <: Platform[P]] (source: Producer[P, String], store: P#Store[String, Long]) = source.flatMap { sentence => toWords(sentence).map(_ -> 1L) }.sumByKey(store) https://github.com/twitter/summingbird https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
What Lambda Architecture Provides Replayable queries Redo queries anytime if results of speed layer are broken Accurate results on demand Prompt reports in speed layer with arrival time Fixed reports in batch layer with recorded time ... And many more benefits of stream processing
Why All of Us Don’t Use It? Storm doesn’t fit well with many uses Storm requires computer resources too big to deploy Summingbird requires many steps to deploy Many directors/analysts don’t write Scala/Java Summingbird DSL is not enough easy for non-professional people
Lambda Architecture Using SQL Part 03
Existing Hadoop Platform new data HDFS hive query Fluentd presto query
Norikra Schema-less stream processing with SQL “Norikra is a open source server software provides "Stream Processing" with SQL, written in JRuby, runs on JVM, licensed under GPLv2.” SELECT path, COUNT(1, status=200) AS success_count, COUNT(1, status=500) AS server_error_count, COUNT(*) AS count FROM AccessLog.win:time_batch(10 min, 0L) WHERE service='myservice' AND path LIKE '/api/%' GROUP BY path http://norikra.github.io/
Added-on Lambda Architecture Platform new data presto query HDFS hive query norikra query
“Pseudo Lambda” Architecture Using SQL Lambda architecture platform with almost same queries SELECT path, COUNT(IF(status=200,1,NULL)) AS success_count, COUNT(IF(status=500,1,NULL)) AS server_error_count, COUNT(*) AS count FROM AccessLog WHERE service='myservice' AND path LIKE '/api/%' AND timestamp >= ‘2014-09-13 10:40:00’ AND timestamp < ‘2014-09-13 10:50:00’ GROUP BY path SELECT path, COUNT(1, status=200) AS success_count, COUNT(1, status=500) AS server_error_count, COUNT(*) AS count FROM AccessLog.win:time_batch(10 min, 0L) WHERE service='myservice' AND path LIKE '/api/%' GROUP BY path
“Pseudo Lambda” Architecture Using SQL SQL dialects are easy to learn! Standard SQL, Hive, Presto, Impala, Drill, ... + Norikra For non-professional people too! SQL queries are very easy to write twice!
Use Cases in LINE Prompt reports for Ads service Short-term prompt reports by Norikra Daily fixed reports by Hive Summary of application server error log Aggregate error log for alerting by Norikra Check details with Hive, Presto (or grep!) See you later for details!
TMTOWTDI “There’s more than one way to do it.” - Perl programming language
SHARE What I want & What I’m doing! - tagomoris
Q & A

Lambda Architecture Using SQL

  • 1.
    Lambda Architecture Platform Using SQL Sep 13 2014 HadoopCon 2014 Taiwan TAGOMORI Satoshi (@tagomoris)
  • 2.
  • 3.
    Topics About Me& LINE Data analytics workloads Batch processing Stream processing Lambda architecture Lambda architecture using SQL Norikra: Stream processing with SQL 13:30-14:20 4F
  • 4.
    @tagomoris Satoshi Tagomori(田籠 聡) LINE Corporation Analytics Platform Team
  • 5.
  • 7.
    LINE Offices TokyoHQ Spain Thailand Taipei USA Korea
  • 8.
    LINE is born!JUNE 23, 2011
  • 10.
  • 11.
    Various Data AnalyticsWorkload Reports Monthly/Daily reports Hourly (or shorter) news Real-time metrics Automatically updated reports/graphs Alerts for abuse of services, overload, ...
  • 13.
    Batch Processing Hadoop MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...) For reports MPP Engines Cloudera Impala, Apache Drill, Facebook Presto, ... For interactive analysis For reports of shorter window
  • 14.
    Stream Processing ApacheStorm Incubator project “Distributed and fault-tolerant realtime computation” Norikra by tagomoris Non-distributed “Stream processing with SQL”
  • 15.
    Why Stream Processing? Less latency Realtime metrics Short-term prompt reports Less computing power 10Mbps for batch processing: 100GB/day 10Mbps for stream processing: 1 Server No query schedule management Once query registered, it runs forever
  • 16.
    Disadvantage of StreamProcessing Queries must be written before data There should be another way to query past data Queries cannot be run twice All results will be lost when any error occurs All data have gone when bugs found Disorders of events break results Recorded time based queries? Or arrival time based queries?
  • 17.
    Part 02 LambdaArchitecture
  • 18.
    Lambda Architecture “TheLambda-architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.” http://lambda-architecture.net/
  • 19.
    Lambda Architecture: Overview new data batch layer master dataset serving layer view speed layer real-time view query
  • 20.
    Twitter Summingbird Lambdaarchitecture library Batch mode: Scalding on Hadoop MapReduce Realtime mode: Storm Word counting by Summingbird (scala): def wordCount[P <: Platform[P]] (source: Producer[P, String], store: P#Store[String, Long]) = source.flatMap { sentence => toWords(sentence).map(_ -> 1L) }.sumByKey(store) https://github.com/twitter/summingbird https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
  • 21.
    What Lambda ArchitectureProvides Replayable queries Redo queries anytime if results of speed layer are broken Accurate results on demand Prompt reports in speed layer with arrival time Fixed reports in batch layer with recorded time ... And many more benefits of stream processing
  • 22.
    Why All ofUs Don’t Use It? Storm doesn’t fit well with many uses Storm requires computer resources too big to deploy Summingbird requires many steps to deploy Many directors/analysts don’t write Scala/Java Summingbird DSL is not enough easy for non-professional people
  • 23.
  • 24.
    Existing Hadoop Platform new data HDFS hive query Fluentd presto query
  • 25.
    Norikra Schema-less streamprocessing with SQL “Norikra is a open source server software provides "Stream Processing" with SQL, written in JRuby, runs on JVM, licensed under GPLv2.” SELECT path, COUNT(1, status=200) AS success_count, COUNT(1, status=500) AS server_error_count, COUNT(*) AS count FROM AccessLog.win:time_batch(10 min, 0L) WHERE service='myservice' AND path LIKE '/api/%' GROUP BY path http://norikra.github.io/
  • 26.
    Added-on Lambda ArchitecturePlatform new data presto query HDFS hive query norikra query
  • 27.
    “Pseudo Lambda” ArchitectureUsing SQL Lambda architecture platform with almost same queries SELECT path, COUNT(IF(status=200,1,NULL)) AS success_count, COUNT(IF(status=500,1,NULL)) AS server_error_count, COUNT(*) AS count FROM AccessLog WHERE service='myservice' AND path LIKE '/api/%' AND timestamp >= ‘2014-09-13 10:40:00’ AND timestamp < ‘2014-09-13 10:50:00’ GROUP BY path SELECT path, COUNT(1, status=200) AS success_count, COUNT(1, status=500) AS server_error_count, COUNT(*) AS count FROM AccessLog.win:time_batch(10 min, 0L) WHERE service='myservice' AND path LIKE '/api/%' GROUP BY path
  • 28.
    “Pseudo Lambda” ArchitectureUsing SQL SQL dialects are easy to learn! Standard SQL, Hive, Presto, Impala, Drill, ... + Norikra For non-professional people too! SQL queries are very easy to write twice!
  • 29.
    Use Cases inLINE Prompt reports for Ads service Short-term prompt reports by Norikra Daily fixed reports by Hive Summary of application server error log Aggregate error log for alerting by Norikra Check details with Hive, Presto (or grep!) See you later for details!
  • 30.
    TMTOWTDI “There’s morethan one way to do it.” - Perl programming language
  • 31.
    SHARE What Iwant & What I’m doing! - tagomoris
  • 32.