Introduction to Real-time Data Processing
Yogi Devendra (yogidevendra@apache.org)
Agenda
● What is big data?
● Data at rest Vs Data in motion
● Batch processing Vs Real-time data processing (streaming)
● Examples
● When to use: Batch? Real-time?
● Current trends
Big data
Image ref [4]
Exploding sizes of datasets
● Google
  ○ >100 PB of data every day [3]
● Large Hadron Collider:
  ○ 150M sensors x 40M samples per sec x 600M collisions per sec
  ○ >500 exabytes per day [2]
  ○ 0.0001% of the data is actually analysed
Data at rest Vs Data in motion
● At rest:
  ○ Dataset is fixed
  ○ a.k.a. bounded [15]
● In motion:
  ○ Continuously incoming data
  ○ a.k.a. unbounded
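The bounded/unbounded distinction can be sketched in a few lines of Python. This is only an illustration: the names `bounded_dataset` and `unbounded_source` and the sample values are assumptions made for the example, not part of the original slides.

```python
import itertools
from typing import Iterator, List


def bounded_dataset() -> List[int]:
    """Data at rest: a fixed dataset whose size is known up front (made-up values)."""
    return [3, 1, 4, 1, 5]


def unbounded_source() -> Iterator[int]:
    """Data in motion: a hypothetical source that never ends on its own."""
    n = 0
    while True:
        yield n
        n += 1


# A full pass over bounded data terminates, so a single final answer exists.
total = sum(bounded_dataset())

# sum(unbounded_source()) would never return; we can only look at a finite
# slice of the stream at any given time.
first_five = list(itertools.islice(unbounded_source(), 5))
```

The practical consequence is that any computation needing "all the data" only makes sense on the bounded side.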
Data at rest Vs Data in motion (continued)
● Generally, big data has velocity
  ○ Continuous data
● The difference lies in when you analyze your data [5]
  ○ After the event occurs ⇒ at rest
  ○ As the event occurs ⇒ in motion
Examples
● Data at rest
  ○ Finding stats about a group in a closed room
  ○ Analyzing sales data for the last month to make strategic decisions
● Data in motion
  ○ Finding stats about a group in a marathon
  ○ e-commerce order processing
Batch processing
● Problem statement:
  ○ Process this entire data
  ○ Give the answer for X at the end.
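As a minimal sketch of this problem statement, assuming a made-up list of `(product, amount)` orders, a batch job reads the whole dataset and returns one answer only after the last record:

```python
from typing import Dict, List, Tuple


def batch_sales_summary(orders: List[Tuple[str, float]]) -> Dict[str, float]:
    """Process the entire dataset, then give the answer at the end."""
    totals: Dict[str, float] = {}
    for product, amount in orders:
        totals[product] = totals.get(product, 0.0) + amount
    return totals  # nothing is reported until every record has been seen


# Hypothetical "last month" data, already complete when the job starts.
last_month = [("tea", 10.0), ("coffee", 25.0), ("tea", 5.0)]
summary = batch_sales_summary(last_month)
```

The function name and the data are illustrative; the point is that no partial result is visible while the job runs.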
Batch processing: Use-cases
● Sales summary for the previous month [5]
● Model training for spam emails
Batch processing: Characteristics
● Access to the entire data
● Split decided at launch time.
● Capable of doing complex analysis (e.g. model training) [6]
● Optimized for throughput (data processed per second)
● Example frameworks: MapReduce, Apache Spark [6]
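The "split decided at launch time" point can be illustrated roughly as follows. This is a single-process sketch in the spirit of map and reduce, not how MapReduce or Spark actually schedule work; `split` and `run_batch_job` are hypothetical helper names.

```python
from typing import List


def split(data: List[int], num_splits: int) -> List[List[int]]:
    """Fix the partitioning once, before any processing starts."""
    size = max(1, -(-len(data) // num_splits))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_batch_job(data: List[int], num_splits: int) -> int:
    partitions = split(data, num_splits)          # decided at launch
    partial_sums = [sum(p) for p in partitions]   # "map" over each split
    return sum(partial_sums)                      # "reduce" to one answer


partitions = split(list(range(1, 11)), 3)
answer = run_batch_job(list(range(1, 11)), 3)
```

Because the whole dataset is available up front, the splits can be sized for throughput; a stream offers no such opportunity since its length is unknown.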
Real-time data processing
● a.k.a. stream processing
● Problem statement:
  ○ Process the incoming stream of data
  ○ to give the answer for X at this moment.
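By contrast with batch, a stream processor can be sketched as something that emits an up-to-date answer after every record. The example below assumes the answer "X" is a running average, which is a choice made only for illustration:

```python
from typing import Iterable, Iterator


def running_average(stream: Iterable[float]) -> Iterator[float]:
    """Emit the answer 'at this moment' after each incoming record."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # current answer, available immediately


# A short list stands in for the stream so the example terminates.
answers = list(running_average([10.0, 20.0, 30.0]))
```

Only a constant amount of state (`count` and `total`) is kept, so the same code would work on an endless source.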
Stream processing: Use-cases
● e-commerce order processing
● Credit card fraud detection
● Label a given email as: spam vs non-spam
Image ref [7]
Stream processing: Characteristics
● Results for X are based on the current data
● Computes a function on one record or a small window. [6]
● Optimized for latency (average time taken per record)
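A computation over a small window might look like the sketch below. It assumes a count-based sliding window and uses `max` as the function; both are illustrative choices, and real frameworks typically also offer time-based windows.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_max(stream: Iterable[int], window_size: int) -> Iterator[int]:
    """Yield the maximum over the most recent `window_size` records."""
    window = deque(maxlen=window_size)  # old records fall out automatically
    for record in stream:
        window.append(record)
        yield max(window)


maxima = list(windowed_max([1, 5, 2, 3, 0], window_size=3))
```

Each result depends only on the last few records, which keeps per-record work small and latency low.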
Stream processing: Characteristics (continued)
● Needs to complete computations in near real-time
● Computes something relatively simple, e.g. using a pre-defined model to label a record.
● Example frameworks: Apache Apex, Apache Storm
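The "pre-defined model" idea can be sketched like this. The model here is just a hypothetical keyword set standing in for whatever a batch training job would have produced; it is not a real spam classifier.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical pre-defined model: in practice this would be trained in batch.
SPAM_WORDS = frozenset({"lottery", "winner", "free"})


def label(email_text: str) -> str:
    """Label one record using only the pre-built model."""
    words = set(email_text.lower().split())
    return "spam" if words & SPAM_WORDS else "non-spam"


def label_stream(emails: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Apply the model to each email as it arrives."""
    for email in emails:
        yield email, label(email)


labels = [tag for _, tag in label_stream(["You are a lottery winner",
                                          "Meeting at noon"])]
```

The expensive part (building the model) stays in batch, while the per-record step is cheap enough for near real-time use.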
Batch Vs Streaming
● Pani puri ⇒ streaming (image ref [9])
● Wada ⇒ batch (image ref [8])
Micro-batch
● Create batches of small size
● Process each micro-batch separately
● Example frameworks: Spark Streaming
● Pani puri ⇒ micro-batch (image ref [10])
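A rough sketch of micro-batching follows. It groups records by count for simplicity, which is an assumption of this example; Spark Streaming forms its batches by time interval rather than by record count.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut the incoming stream into small batches of at most `batch_size`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted (a real stream would keep going)
        yield batch


# Each micro-batch is processed separately, here by summing it.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

Smaller batches lower the delay before a result appears, at the cost of more per-batch overhead.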
When to use: Batch Vs Streaming?
● Depends on the use-case
  ○ Some are suitable for batch
  ○ Some are suitable for streaming
  ○ Some can be solved by either one
  ○ Some might need a combination of the two.
When to use: Batch Vs Real-time? (continued)
● Answers for the current snapshot ⇒ Real-time
  ○ Answers at the end ⇒ Open (either approach works)
● Complex calculations, multiple iterations over the entire data ⇒ Batch
  ○ Simple computations ⇒ Open
● Low latency requirements (< 1s) ⇒ Real-time
When to use: Batch Vs Real-time? (continued)
● Each record can be processed independently ⇒ Open
  ○ Independent processing not possible ⇒ Batch
● Depends on the use-case
  ○ Some use-cases can be solved by either one
  ○ Some others might need a combination of the two.
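The guidelines on these slides can be read as a small decision rule. The function below is one possible encoding, written for illustration; the parameter names and the way conflicting signals map to "combination" are assumptions of this sketch, not a rule stated by the author.

```python
def suggest_mode(needs_current_answers: bool,
                 needs_full_data_iterations: bool,
                 max_latency_seconds: float,
                 records_independent: bool) -> str:
    """Return 'real-time', 'batch', 'combination', or 'open' (either works)."""
    # Current-snapshot answers or sub-second latency point to real-time.
    wants_realtime = needs_current_answers or max_latency_seconds < 1.0
    # Iterating over the entire data, or records that cannot be processed
    # independently, point to batch.
    wants_batch = needs_full_data_iterations or not records_independent
    if wants_realtime and wants_batch:
        return "combination"
    if wants_realtime:
        return "real-time"
    if wants_batch:
        return "batch"
    return "open"


fraud_check = suggest_mode(True, False, 0.5, True)
model_training = suggest_mode(False, True, 3600.0, True)
monthly_report = suggest_mode(False, False, 3600.0, True)
```

For example, fraud detection lands on real-time, model training on batch, and a simple end-of-month report stays open.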
Can one replace the other?
● Batch processing is designed for ‘data at rest’. ‘Data in motion’ becomes stale if processed in batch mode.
● Real-time processing is designed for ‘data in motion’, but it can be used for ‘data at rest’ as well (in many cases).
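To see why streaming logic can often handle data at rest, consider replaying a stored dataset through a streaming function. In this sketch (word counting is an arbitrary choice and `streaming_count` is a hypothetical name), the last snapshot of the stream equals the batch answer:

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def streaming_count(stream: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Emit an updated count snapshot after every record."""
    counts: Dict[str, int] = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield dict(counts)  # copy, so earlier snapshots are not mutated


stored = ["a", "b", "a"]               # data at rest
snapshots = list(streaming_count(stored))
batch_answer = dict(Counter(stored))   # one pass, answer at the end
```

The reverse does not hold as easily: a batch job has to wait for the data to stop arriving, which is why data in motion goes stale under batch processing.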
Quiz: is this Batch or Real-time?
● Queue for a roller coaster ride (image ref [11])
● Queue at the petrol pump (image ref [12])
Quiz: is this Batch or Real-time?
● Selecting a relevant ad to show for the requested page (image ref [13])
● Courier dispatch from city A to B (image ref [14])
Current trends
● Difficulty in splitting problems as MapReduce: alternative paradigms for expressing user intent.
● More and more use-cases demand faster insight into data (near real-time)
● ‘Data in motion’ is common.
● ‘Real-time data processing’ is gaining traction.
Questions
Image ref [16]
References
1. Big Data | Gartner IT Glossary http://www.gartner.com/it-glossary/big-data/
2. Big Data | Wikipedia https://en.wikipedia.org/wiki/Big_data
3. Data size estimates | Follow the data https://followthedata.wordpress.com/2014/06/24/data-size-estimates/
4. Data Never Sleeps 2.0 | Domo https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
5. Data in motion vs. data at rest | Internap http://www.internap.com/2013/06/20/data-in-motion-vs-data-at-rest/
6. Difference between batch processing and stream processing | Quora https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems/answer/Sean-Owen?srid=O9ht
7. How FAST is Credit Card Fraud Detection | FICO http://www.fico.com/en/latest-thinking/infographic/how-fast-is-credit-card-fraud-detection
8. CULINARY TERMS | panjakhada http://panjakhada.com/the-basics/
9. Crispy Chaat ... | grabhouse http://grabhouse.com/urbancocktail/11-crispy-chaat-joints-food-lovers-hyderabad/
10. Paani puri stall | cityshor http://www.cityshor.com/pune/food/street-food/camp/murali-paani-puri-stall/
11. Great Inventions: The Roller Coaster | findingdulcinea http://www.findingdulcinea.com/features/science/innovations/great-inventions/the-roller-coaster.html
12. RIL petrol pump network | economictimes http://articles.economictimes.indiatimes.com/2015-05-24/news/62583419_1_petrol-and-diesel-fuel-retailing-ril
13. Publishers | Propellerads https://propellerads.com/publishers/
14. Michael Bishop Couriers | Google plus https://plus.google.com/110684176517668223067
15. The world beyond batch: Streaming 101 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html
16. How to Answer the Question http://www.clipartpanda.com/clipart_images/how-to-answer-the-question-46954146
17. Thank You http://www.planwallpaper.com/thank-you
