MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCH
Yann Cluchey, CTO @ Cogenta, CTO @ GfK Online Pricing Intelligence
What is Enterprise Data?
Online Pricing Intelligence
1. Gather data from 500+ eCommerce sites
2. Organise it into a high-quality market view
3. Competitive intelligence tools
Custom Crawler
- Parse web content
- Discover product data
- Tracking 20m products
- Daily+
[Diagram: crawlers turn HTML pages into Price, Stock, Meta records]
Processing, Storage
- Enrichment
- Persistent storage
- Product catalogue
- + time series data
[Diagram: processing feeds the database]
Thing #1 - Detection
- Identify distinct products
- Automated information retrieval
- Lucene + custom index builder
- Continuous process
- (Humans for QA)
[Diagram: Database -> Index Builder -> Lucene Index -> Matcher / GUI]
Thing #2 - BI Tools
- Web applications
- Also based on Lucene
- Batch index build process
- Per-customer indexes
[Diagram: Database -> Index Builder -> Customer Indexes 1-3 -> BI Tools]
Thing #1 - Pain
- Continuous indexing
- Track changes, read them back out to index
- Drain on performance
- Latency, coping with peaks
- Full rebuild for an index schema change or inconsistencies
- Full rebuild doesn't scale well...
- Unnecessary work?
[Diagram: Database -> Index Builder -> Lucene Index -> GUI]
Thing #2 - Pain
- Twice-daily batch rebuild, per customer
- Very slow
- Moar customers? Moar data? Moar often?
- Data set too complex, keeps changing
- Index shipping
- Moar web servers?
[Diagram: Database -> Index Builder -> Customer Indexes 1-3, batch-synced out to Web Servers 1 and 2 for the BI Tools]
Pain Points
- As data and customers scale, processes slow down
- Adapting to change: easy to layer on, hard to make fundamental changes
- Read vs write concerns
- Database maintenance
[Diagram: Database -> Index Builder -> Index]
Goals
- Eliminate latencies
- Improve scalability
- Improve availability
- Something achievable
- Your mileage will vary
elasticsearch
- Open source, distributed search engine
- Based on Lucene, fully featured API
- Querying, filtering, aggregation (see the sketch below)
- Text processing / IR
- Schema-free
- Yummy (real-time, sharding, highly available)
- Silver bullets not included
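As a taste of the query API, here is a minimal sketch combining a full-text query, a filter and an aggregation over the REST interface. The local node on :9200, the "products" index and its "name", "brand" and "in_stock" fields are assumptions for illustration (ES 1.x-era "filtered" query syntax), not details from the talk.

```python
# Minimal sketch: full-text query + filter + terms aggregation via the REST API.
# Assumptions (not from the talk): local node on :9200, a "products" index with
# "name", "brand" and "in_stock" fields.
import json
import requests

body = {
    "query": {
        "filtered": {
            "query":  {"match": {"name": "widget"}},
            "filter": {"term": {"in_stock": True}},
        }
    },
    "aggs": {"by_brand": {"terms": {"field": "brand"}}},
    "size": 10,
}

resp = requests.post("http://localhost:9200/products/_search", data=json.dumps(body))
resp.raise_for_status()
result = resp.json()
hits = result["hits"]["hits"]                           # top matching documents
brands = result["aggregations"]["by_brand"]["buckets"]  # doc counts per brand
print(len(hits), brands)
```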
Our Pipeline
[Diagram: Crawlers -> Processors -> Database -> Indexers -> Indexes -> Web Servers]
Our New Pipeline
[Diagram: the same components - Crawlers, Processors, Database, Indexers, Indexes, Web Servers]
Event Hooks
- Messages fired OnCreate and OnUpdate
- Payload contains everything needed for indexing: the data, keys (still mastered in SQL), versioning
- Sender has all the information already
- Use RabbitMQ to control event message flow (publishing sketch below)
- Messages are durable
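To make the event hook concrete, here is a minimal sketch of publishing a durable event message to RabbitMQ with pika. The queue name "index-events" and the payload fields are assumptions for illustration, not the schema used in the talk.

```python
# Minimal sketch: publish a durable event message carrying everything the indexer needs.
# Assumptions (not from the talk): local RabbitMQ, queue "index-events", illustrative fields.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="index-events", durable=True)  # queue survives broker restarts

event = {
    "key": "product-123",   # key still mastered in SQL
    "version": 42,          # monotonically increasing revision
    "data": {"name": "Widget", "price": 9.99, "in_stock": True},
}

channel.basic_publish(
    exchange="",
    routing_key="index-events",
    body=json.dumps(event),
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```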
Indexing Strategy
- RESTful API (HTTP, Thrift, Memcache)
- Use bulk methods; they support percolation (bulk sketch below)
- Rivers (pull): RabbitMQ River, JDBC River, Mongo/Couch/etc. River, Logstash
[Diagram: Event Q -> Indexer -> Index Q]
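For the push approach, here is a minimal sketch of the _bulk endpoint over plain HTTP. The local node on :9200, the "products" index, the "product" type and the field names are illustrative assumptions.

```python
# Minimal sketch of the _bulk endpoint over plain HTTP.
# Assumptions (not from the talk): index "products", type "product", illustrative fields.
import json
import requests

docs = [
    {"_id": "p-1", "name": "Widget", "price": 9.99},
    {"_id": "p-2", "name": "Gadget", "price": 19.99},
]

# The bulk body is newline-delimited JSON: one action line, then one source line.
lines = []
for doc in docs:
    source = {k: v for k, v in doc.items() if k != "_id"}
    lines.append(json.dumps({"index": {"_index": "products", "_type": "product", "_id": doc["_id"]}}))
    lines.append(json.dumps(source))
body = "\n".join(lines) + "\n"   # the trailing newline is required

resp = requests.post("http://localhost:9200/_bulk", data=body,
                     headers={"Content-Type": "application/x-ndjson"})
resp.raise_for_status()
print(resp.json()["errors"])     # False when every item succeeded
```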
Model Your Data
- What's in your documents?
- Database = Index, Table = Type ...?
- Start backwards: what do your applications need? How will they need to query the data?
- Prototyping! Fail quickly!
- elasticsearch supports nested objects and parent/child docs
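As one possible document model, the sketch below defines a mapping that keeps per-retailer offers inside each product as a nested object. The index name, type name and fields are assumptions for illustration (ES 1.x-era string / not_analyzed syntax), not the mapping used in production.

```python
# Minimal sketch of a mapping with a nested "offers" object.
# Assumptions (not from the talk): index "products", type "product", illustrative fields.
import json
import requests

mapping = {
    "product": {
        "properties": {
            "name":  {"type": "string"},
            "brand": {"type": "string", "index": "not_analyzed"},
            "offers": {
                "type": "nested",   # keeps each retailer's fields together at query time
                "properties": {
                    "retailer": {"type": "string", "index": "not_analyzed"},
                    "price":    {"type": "double"},
                    "in_stock": {"type": "boolean"},
                },
            },
        }
    }
}

requests.put("http://localhost:9200/products")   # create the index if it does not exist yet
resp = requests.put("http://localhost:9200/products/_mapping/product",
                    data=json.dumps(mapping))
resp.raise_for_status()
```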
Joins
- Events relate to line-items: Amazon decreased a price, Pixmania is running a promotion
- Need to group by Product
- Use a key/value store: get the full Product document, modify it, write it back
- Enqueue an indexing instruction (sketch below)
[Diagram: Event Q -> Indexer -> Key/value store -> Index Q]
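The read-modify-write step might look like the sketch below, using MongoDB as the key/value store and RabbitMQ for the index queue. The "pricing" database, "products" collection, "index-instructions" queue and event fields are illustrative assumptions, not the talk's actual schema.

```python
# Minimal sketch: fold a line-item event into the full Product document, then
# enqueue an indexing instruction. Assumptions (not from the talk): pymongo + pika,
# db "pricing", collection "products", queue "index-instructions", illustrative fields.
import json
import pika
from pymongo import MongoClient

products = MongoClient("mongodb://localhost")["pricing"]["products"]

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="index-instructions", durable=True)

def apply_event(event):
    # 1. Get the full Product document (or start a new one for an unseen product).
    product = products.find_one({"_id": event["product_id"]}) or \
              {"_id": event["product_id"], "offers": {}}

    # 2. Fold the line-item event into the per-retailer offers.
    product["offers"][event["retailer"]] = {
        "price": event["price"],
        "in_stock": event["in_stock"],
    }

    # 3. Write the whole document back.
    products.replace_one({"_id": product["_id"]}, product, upsert=True)

    # 4. Enqueue an indexing instruction for the grouped Product.
    channel.basic_publish(exchange="", routing_key="index-instructions",
                          body=json.dumps({"product_id": product["_id"]}),
                          properties=pika.BasicProperties(delivery_mode=2))
```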
Where to join?
- elasticsearch: consider performance; depends how data is structured/indexed (e.g. parent/child); compression, collisions
- In-memory cache (e.g. Memcache)
- Persistent storage (e.g. Cassandra or Mongo)
- Two awesome benefits: quickly re-index if needed, and updates have access to the full Product data
- Serialisation is costly
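If the join is pushed into elasticsearch itself, parent/child documents are one option. The sketch below uses an ES 1.x-style _parent mapping and a has_child query; the "offer" and "product" types and their fields are illustrative assumptions, not the talk's design.

```python
# Minimal sketch of the parent/child alternative inside elasticsearch.
# Assumptions (not from the talk): "offer" docs are children of "product" docs,
# ES 1.x-style _parent mapping and has_child query; all names are illustrative.
import json
import requests

base = "http://localhost:9200/products"

# The child type declares its parent; children are routed to the parent's shard.
offer_mapping = {
    "offer": {
        "_parent": {"type": "product"},
        "properties": {
            "retailer": {"type": "string", "index": "not_analyzed"},
            "price":    {"type": "double"},
        },
    }
}
requests.put(base + "/_mapping/offer", data=json.dumps(offer_mapping))

# Index a child document against its parent product.
requests.put(base + "/offer/o-1?parent=p-1",
             data=json.dumps({"retailer": "Amazon", "price": 9.99}))

# Find products that currently have an offer under 10.
query = {"query": {"has_child": {"type": "offer",
                                 "query": {"range": {"price": {"lt": 10}}}}}}
resp = requests.post(base + "/product/_search", data=json.dumps(query))
print(resp.json()["hits"]["total"])
```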
Synchronisation & Concurrency
- Fault tolerance: code to expect missing data and out-of-sequence events
- Concurrency control: apply Optimistic Concurrency Control at Mongo (sketch below)
- Optimise for collisions
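Here is a minimal sketch of Optimistic Concurrency Control against MongoDB: read the document, mutate it, and only write it back if its revision has not changed underneath you. The numeric "_rev" field and the retry count are assumptions for illustration, not the exact scheme from the talk.

```python
# Minimal sketch of Optimistic Concurrency Control against MongoDB.
# Assumptions (not from the talk): every document carries a numeric "_rev"
# starting at 0; collection and field names are illustrative.
from pymongo import MongoClient

products = MongoClient("mongodb://localhost")["pricing"]["products"]

def update_with_occ(product_id, mutate, max_retries=5):
    for _ in range(max_retries):
        doc = products.find_one({"_id": product_id})
        if doc is None:
            return None
        expected_rev = doc["_rev"]
        mutate(doc)                         # apply the change in memory
        doc["_rev"] = expected_rev + 1
        # The filter only matches if no one has bumped "_rev" since we read the doc.
        result = products.replace_one({"_id": product_id, "_rev": expected_rev}, doc)
        if result.matched_count == 1:
            return doc                      # write accepted
        # Collision: another writer got there first, so re-read and retry.
    raise RuntimeError("too many OCC collisions for %r" % product_id)
```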
Synchronisation & Concurrency (continued)
- Synchronisation: out-of-sequence index instructions; elasticsearch external versioning (sketch below); can rebuild from scratch if needed
- Consistency: which version is right? Dates? Revision numbers from SQL? Independent updates
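External versioning lets elasticsearch reject stale writes: with version_type=external, a write is only applied if the supplied version is higher than the stored one. The sketch below assumes the SQL revision number is used as the version and that index, type and id names are illustrative.

```python
# Minimal sketch of elasticsearch external versioning over HTTP.
# Assumptions (not from the talk): the SQL revision number is the version;
# index/type/id names are illustrative.
import json
import requests

def index_with_external_version(product, revision):
    # With version_type=external, elasticsearch only applies the write if the
    # supplied version is higher than the stored one, so out-of-sequence index
    # instructions are dropped instead of overwriting newer data.
    url = ("http://localhost:9200/products/product/%s?version=%d&version_type=external"
           % (product["id"], revision))
    resp = requests.put(url, data=json.dumps(product))
    if resp.status_code == 409:
        return False   # stale instruction: a newer revision is already indexed
    resp.raise_for_status()
    return True
```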
Figures
- Ingestion: 20m data points/day (continuously), ~200GB, 3K msgs/second at peak
- Hardware:
  - SQL: 2 x 12-core, 64GB, 72-spindle SAN
  - Indexing: 4 x 4-core, 8GB
  - Mongo: 1 x 4-core, 16GB, 1 x SSD
  - Elastic: 5 x 4-core, 16GB, 1 x SSD

              Custom-built Lucene   elasticsearch
  Latency     3 hours               < 1 second
  Bottleneck  Disk (SQL)            CPU
Managing Change
[Diagram: the Indexer, fed from the Event Q and the key/value store, builds Index_A and Index_B; the Client reads through an Alias pointing at the live Index. Alias swap sketch below.]
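Switching the alias between the old and new index is a single atomic call. The sketch below assumes the application reads through a hypothetical alias "products_live"; "index_a" and "index_b" stand in for the slide's Index_A and Index_B (lowercased because elasticsearch requires lowercase index names).

```python
# Minimal sketch of an atomic alias swap via the _aliases endpoint.
# Assumptions (not from the talk): alias "products_live"; index_a / index_b stand
# in for the old and new physical indexes.
import json
import requests

def switch_alias(alias, old_index, new_index):
    # One _aliases call applies both actions atomically, so clients never see a
    # moment where the alias points at nothing (or at both indexes).
    body = {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add":    {"index": new_index, "alias": alias}},
        ]
    }
    resp = requests.post("http://localhost:9200/_aliases", data=json.dumps(body))
    resp.raise_for_status()

switch_alias("products_live", "index_a", "index_b")
```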
Thanks
- @YannCluchey
- Concurrency Patterns with MongoDB: http://slidesha.re/YFOehF
- Consistency without Consensus, Peter Bourgon, SoundCloud: http://bit.ly/1DUAO1R
- Eventually Consistent Data Structures, Sean Cribbs, Basho: https://vimeo.com/43903960


Editor's Notes

  • #4 Complex, relational, spanning systems, built up over a long time. Real-world requirements keep changing. Designed by children, dangerous at night time; like Minecraft, very valuable. Opportunity for more value: cost savings, efficiencies or new services. Paywall, baggage.
  • #5 My use case, the problems we faced, and how we solved them. Premise: we can't start from scratch, and we need to deliver more.
  • #7 I'm lying: separate arrows, because all processing is async.
  • #8 Two specific things we do once data is in the system. Need to make sense of the raw data; need to understand that these two products are in fact the same. Lucene with a customised analysis chain. Volume and rate of change = automated.
  • #9 Very similar; batch on a twice-daily schedule.
  • #10 Embarrass myself. I know better now.. ;-) Effort writing data to the DB, then dragging it back out again.
  • #11 Very similar process, also using Lucene, batch on a twice-daily schedule.
  • #12 So where does it hurt? Constantly changing data, hard to read back out (consistently) without impacting performance.
  • #13 Get data where it's needed (quickly). Real-time is not a requirement, just better than Microsoft minutes ;-) Consider any one of these a success (cost benefit). Not from scratch: roll out incrementally, in a reasonable time-frame. So not a lot then! ** My environment is unique and my specific problems won't apply to you.
  • #14 Sharding, replication, simple clustering
  • #15 Where is the bottleneck?
  • #16 Where is the bottleneck now?
  • #18 Watch out for slow HTTP clients. Rivers use a subscription model. Answer: roll your own, with a queue. We use Thrift for API calls and a custom RMQ river (QoS).
  • #19 Black holes. General tip for NoSQL/denormalisation. Countless prototypes worked but were too slow. Good at prototyping, bad at failing fast. No silver bullet! Nuances in performance and querying. Blog post.
  • #20 Since your data is complicated and you haven’t got one document per row.
  • #21 Can use ES as the canonical store. Prototyped incremental map/reduce; a CouchDB view was a good logical fit but didn't perform. The Mongo API was lightning fast.
  • #22 Distributed event processing is hard: messages go missing. Be forgiving, support recovery. OCC pattern with Mongo; clever tricks for performance / high collision rates. My talk.
  • #23 Use a universal versioning scheme. ES has very clever versioning, e.g. deletion memory (efficient thanks to Lucene). OCC protects against accidental overwrites; it doesn't protect against writing the wrong thing. Consistency without Consensus, Peter Bourgon, SoundCloud. Eventually Consistent Data Structures, Sean Cribbs, Basho.
  • #24 Bottlenecking on everything: CPU, LAN. Go SSD. The cost of scale is massively reduced. Active data only, 90%.
  • #25 Atomic switch. Capacity.