Lambda Architecture Using SQL

Lambda Architecture Platform Using SQL Sep 13 2014 HadoopCon 2014 Taiwan TAGOMORI Satoshi (@tagomoris)

Topics About Me & LINE Data analytics workloads Batch processing Stream processing Lambda architecture Lambda architecture using SQL Norikra: Stream processing with SQL 13:30-14:20 4F

@tagomoris Satoshi Tagomori (田籠聡) LINE Corporation Analytics Platform Team

LINE Offices Tokyo HQ Spain Thailand Taipei USA Korea

Data Analytics Workload Part 01

Various Data Analytics Workload Reports Monthly/Daily reports Hourly (or shorter) news Real-time metrics Automatically updated reports/graphs Alerts for abuse of services, overload, ...

Batch Processing Hadoop MapReduce (or Spark, Tez) & DSLs (Hive, Pig, ...) For reports MPP Engines Cloudera Impala, Apache Drill, Facebook Presto, ... For interactive analysis For reports of shorter window

Stream Processing Apache Storm Incubator project “Distributed and fault-tolerant realtime computation” Norikra by tagomoris Non-distributed “Stream processing with SQL”

Why Stream Processing? Less latency Realtime metrics Short-term prompt reports Less computing power 10Mbps for batch processing: 100GB/day 10Mbps for stream processing: 1 Server No query schedule management Once query registered, it runs forever

Disadvantage of Stream Processing Queries must be written before data There should be another way to query past data Queries cannot be run twice All results will be lost when any error occurs All data have gone when bugs found Disorders of events break results Recorded time based queries? Or arrival time based queries?

Lambda Architecture “The Lambda-architecture aims to satisfy the needs for a robust system that is fault-tolerant, both against hardware failures and human mistakes, being able to serve a wide range of workloads and use cases, and in which low-latency reads and updates are required. The resulting system should be linearly scalable, and it should scale out rather than up.” http://lambda-architecture.net/

Lambda Architecture: Overview new data batch layer master dataset serving layer view speed layer real-time view query

Twitter Summingbird Lambda architecture library Batch mode: Scalding on Hadoop MapReduce Realtime mode: Storm Word counting by Summingbird (scala): def wordCount[P <: Platform[P]] (source: Producer[P, String], store: P#Store[String, Long]) = source.flatMap { sentence => toWords(sentence).map(_ -> 1L) }.sumByKey(store) https://github.com/twitter/summingbird https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird

What Lambda Architecture Provides Replayable queries Redo queries anytime if results of speed layer are broken Accurate results on demand Prompt reports in speed layer with arrival time Fixed reports in batch layer with recorded time ... And many more benefits of stream processing

Why All of Us Don’t Use It? Storm doesn’t fit well with many uses Storm requires computer resources too big to deploy Summingbird requires many steps to deploy Many directors/analysts don’t write Scala/Java Summingbird DSL is not enough easy for non-professional people

Lambda Architecture Using SQL Part 03

Existing Hadoop Platform new data HDFS hive query Fluentd presto query

Norikra Schema-less stream processing with SQL “Norikra is a open source server software provides "Stream Processing" with SQL, written in JRuby, runs on JVM, licensed under GPLv2.” SELECT path, COUNT(1, status=200) AS success_count, COUNT(1, status=500) AS server_error_count, COUNT(*) AS count FROM AccessLog.win:time_batch(10 min, 0L) WHERE service='myservice' AND path LIKE '/api/%' GROUP BY path http://norikra.github.io/

Added-on Lambda Architecture Platform new data presto query HDFS hive query norikra query

“Pseudo Lambda” Architecture Using SQL Lambda architecture platform with almost same queries SELECT path, COUNT(IF(status=200,1,NULL)) AS success_count, COUNT(IF(status=500,1,NULL)) AS server_error_count, COUNT(*) AS count FROM AccessLog WHERE service='myservice' AND path LIKE '/api/%' AND timestamp >= ‘2014-09-13 10:40:00’ AND timestamp < ‘2014-09-13 10:50:00’ GROUP BY path SELECT path, COUNT(1, status=200) AS success_count, COUNT(1, status=500) AS server_error_count, COUNT(*) AS count FROM AccessLog.win:time_batch(10 min, 0L) WHERE service='myservice' AND path LIKE '/api/%' GROUP BY path

“Pseudo Lambda” Architecture Using SQL SQL dialects are easy to learn! Standard SQL, Hive, Presto, Impala, Drill, ... + Norikra For non-professional people too! SQL queries are very easy to write twice!

Use Cases in LINE Prompt reports for Ads service Short-term prompt reports by Norikra Daily fixed reports by Hive Summary of application server error log Aggregate error log for alerting by Norikra Check details with Hive, Presto (or grep!) See you later for details!

TMTOWTDI “There’s more than one way to do it.” - Perl programming language

SHARE What I want & What I’m doing! - tagomoris

Lambda Architecture Using SQL

More Related Content

What's hot

Viewers also liked

Similar to Lambda Architecture Using SQL

More from SATOSHI TAGOMORI

Recently uploaded

Lambda Architecture Using SQL