elasticsearch ADVANCED FEATURES IN PRACTICE @JSUCHAL #RUBYSLAVA

elasticwhat?
- based on Apache Lucene
- REST API
- Data & API in JSON
- Schema-free
- Real time
- Distributed
- Advanced functionality

Quickstart
1. Download & extract from http://www.elasticsearch.org/download/
2. $ bin/elasticsearch -f
3. There is no step 3.

Quickstart - index

    $ curl -XPOST 'http://localhost:9200/rubyslava/talks/1' -d '{
        "title" : "elasticsearch - advanced features in practice",
        "presenter" : "jsuchal",
        "presented_at" : "2011-09-22T19:00:00",
        "message" : "hopefully clear",
        "tags" : ["elasticsearch", "rocks"]
      }'
    => {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}

Quickstart - search

    $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
    => {
      "took" : 2,
      "timed_out" : false,
      "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
      "hits" : {
        "total" : 1,
        "max_score" : 0.054244425,
        "hits" : [ {
          "_index" : "rubyslava",
          "_type" : "talks",
          "_id" : "1",
          "_score" : 0.054244425,
          "_source" : {
            "title" : "elasticsearch - advanced features in practice",
            "presenter" : "jsuchal",
            "presented_at" : "2011-09-22T19:00:00",
            "message" : "hopefully clear",
            "tags" : ["elasticsearch", "rocks"]
          }
        } ]
      }
    }

Advanced features
- Search
  - analyzer, stemming, ngrams, ASCII folding & custom analyzers
  - boosting, fragment highlighting, fuzzy search
  - …
- Facets
- Percolate
- Scroll
- and more…

The Case Study
- Find suspicious government contracts
  - using heuristics
    - IT contract where price > 1M euro
    - Supplier company age < 3 months
  - using crowdsourcing
- Data
  - Central government contract repositories
    - www.crz.gov.sk, zmluvy.egov.sk
  - ~70K contracts in 8 months
  - 100+ GB pdf/doc/scan
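
The two example heuristics can be sketched as plain Ruby predicates. This is only an illustration of the rule logic, not how the talk stores them (there they are saved searches). The `Contract` struct, its field names, and the heuristic names are assumptions made up for this sketch.

```ruby
require 'date'

# Hypothetical contract record; field names are assumptions, not from the talk.
Contract = Struct.new(:category, :price_eur, :signed_on, :supplier_founded_on)

# Each heuristic is a named predicate over a contract.
HEURISTICS = {
  # IT contract where price > 1M euro
  'expensive-it-contract' => lambda { |c| c.category == 'IT' && c.price_eur > 1_000_000 },
  # Supplier company younger than 3 months when the contract was signed
  # (Date#>> shifts a date forward by whole months).
  'young-supplier' => lambda { |c| (c.supplier_founded_on >> 3) > c.signed_on }
}

# Returns the names of all heuristics the contract matches.
def matching_heuristics(contract)
  HEURISTICS.select { |_name, rule| rule.call(contract) }.keys
end

contract = Contract.new('IT', 2_500_000, Date.new(2011, 9, 1), Date.new(2011, 7, 15))
puts matching_heuristics(contract).inspect
```

Here the sample contract matches both rules: the price is above 1M euro and the supplier was founded less than three months before signing.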

The Solution
- Faceted search
  - Search
    - e.g. Find all contracts by Orange Slovakia
  - Analyze
    - e.g. Which department has the most contracts with Orange Slovakia?
    - e.g. What is the contract price distribution for Orange Slovakia?
    - …
- Define penalty heuristics

Facets
- Types
  - terms, range, histogram, statistical, geo distance

    $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?pretty' -d '{
        "query" : { "match_all" : { } },
        "facets" : {
          "tags_facet" : {
            "terms" : { "field" : "tags", "size" : 10 }
          }
        }
      }'

Facets - results

    {
      "took" : 2,
      …
      "hits" : { … },
      "facets" : {
        "tags_facet" : {
          "_type" : "terms",
          "missing" : 0,
          "total" : 2,
          "other" : 0,
          "terms" : [
            { "term" : "rocks", "count" : 1 },
            { "term" : "elasticsearch", "count" : 1 }
          ]
        }
      }
    }
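
A minimal sketch of reading such a response from Ruby: it turns the `terms` array of one facet into a term => count hash, ready for rendering facet options. The helper name `facet_counts` is an assumption; the sample data is the facet part of the response shown above.

```ruby
require 'json'

# Facet part of the response from the slide, parsed into a Ruby hash.
response = JSON.parse(<<~JSON)
  {
    "facets" : {
      "tags_facet" : {
        "_type" : "terms", "missing" : 0, "total" : 2, "other" : 0,
        "terms" : [
          { "term" : "rocks", "count" : 1 },
          { "term" : "elasticsearch", "count" : 1 }
        ]
      }
    }
  }
JSON

# Hypothetical helper: maps each term of the named facet to its count.
def facet_counts(response, facet_name)
  entries = response.fetch('facets').fetch(facet_name).fetch('terms')
  entries.map { |entry| [entry['term'], entry['count']] }.to_h
end

puts facet_counts(response, 'tags_facet').inspect
```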

Facets - advanced
- Problem
  - Generate options for facets with some selected restrictions
- Solution
  - global facet
  - facet_filter

    {
      "facets" : {
        "<FACET NAME>" : {
          "<FACET TYPE>" : { ... },
          "global" : true,
          "facet_filter" : {
            "term" : { "supplier.untouched" : "Orange Slovakia, a.s." }
          }
        }
      }
    }
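
One way to build such a request body from Ruby, as a sketch: the facet is marked `global` so it is not narrowed by the main query, and `facet_filter` then applies only the restrictions that should affect this facet. The helper name and the `department` field are assumptions; `supplier.untouched` and the filter value come from the slide.

```ruby
require 'json'

# Hypothetical helper: a terms facet computed globally, restricted only
# by a single term filter.
def restricted_terms_facet(name, field, filter_field, filter_value, size = 10)
  {
    'facets' => {
      name => {
        'terms' => { 'field' => field, 'size' => size },
        'global' => true,
        'facet_filter' => { 'term' => { filter_field => filter_value } }
      }
    }
  }
end

# 'department' is an assumed field name used only for illustration.
body = restricted_terms_facet('departments', 'department',
                              'supplier.untouched', 'Orange Slovakia, a.s.')
puts JSON.pretty_generate(body)
```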

Percolate
- Problem
  - New contract/document added, which heuristics does it match?
- Solution
  1. Save heuristics/searches in percolator index
  2. Percolate new documents

Percolate

    $ curl -XPUT 'localhost:9200/_percolator/rubyslava/heuristic-1' -d '{
        "query" : { "term" : { "tags" : "rubyslava" } }
      }'

    $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_percolate' -d '{
        "doc" : { "tags" : ["rubyslava", "rocks", "too"] }
      }'
    => {"ok":true,"matches":["heuristic-1"]}
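
The same two calls can be sketched with Ruby's standard `Net::HTTP`. The helper names are assumptions, and the endpoints simply mirror the curl commands above; later elasticsearch versions changed the percolator API, so treat this as a sketch for the version used in the talk. Nothing is sent unless `send_to_elasticsearch` is called against a running node.

```ruby
require 'json'
require 'net/http'

# Step 1: save a heuristic (a query) under a name in the percolator index.
def register_heuristic_request(index, name, query)
  request = Net::HTTP::Put.new("/_percolator/#{index}/#{name}")
  request.body = { 'query' => query }.to_json
  request
end

# Step 2: ask which saved heuristics match a new document.
def percolate_request(index, type, doc)
  request = Net::HTTP::Post.new("/#{index}/#{type}/_percolate")
  request.body = { 'doc' => doc }.to_json
  request
end

# Sends a prepared request; assumes a node listening on localhost:9200.
def send_to_elasticsearch(request, host = 'localhost', port = 9200)
  response = Net::HTTP.start(host, port) { |http| http.request(request) }
  JSON.parse(response.body)
end

register = register_heuristic_request('rubyslava', 'heuristic-1',
                                      { 'term' => { 'tags' => 'rubyslava' } })
percolate = percolate_request('rubyslava', 'talks',
                              { 'tags' => %w[rubyslava rocks too] })
puts "#{register.method} #{register.path} #{register.body}"
puts "#{percolate.method} #{percolate.path} #{percolate.body}"
```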

Scroll
- Problem
  - New heuristic added and matches many (1K+) documents
  - Add heuristic to all matching documents
  - + Offset performance problem known in RDBMS
- Solution
  - Use async background job
  - Scroll through results (a.k.a. cursor)

Scroll

    $ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
        "query": { "match_all" : {} }
      }'
    => {
      "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
      "took" : 2,
      …
      "hits" : […]
    }

    $ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
    => more results & repeat

Ruby Scroll API
- Mimics find_each in ActiveRecord

    def find_each(query, &block)
      scroll_id = nil
      processed = 0
      begin
        if scroll_id
          # Later pages: continue the open scroll.
          result = scroll(scroll_id)
        else
          # First page: start the scroll and remember its id.
          result = initiate_scroll(query)
          scroll_id = result.scroll_id
        end
        result.hits.each do |document|
          yield document
        end
        processed += result.hits.size
      end while processed < result.hits.total
    end
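
To try the loop without a running node, here is a sketch with in-memory stand-ins. `FakeScrollClient`, `Hits`, `Result` and the page size are all assumptions made up for this example; only the `find_each` control flow follows the slide. It adds one guard not on the slide: it also stops on an empty page, so the loop cannot spin forever if fewer documents come back than `total` promised.

```ruby
# Minimal stand-ins for the result objects used on the slide.
class Hits
  attr_reader :total

  def initialize(documents, total)
    @documents = documents
    @total = total
  end

  def each(&block)
    @documents.each(&block)
  end

  def size
    @documents.size
  end
end

Result = Struct.new(:scroll_id, :hits)

# Hypothetical client that pages through an array instead of elasticsearch.
class FakeScrollClient
  PAGE_SIZE = 2

  def initialize(documents)
    @documents = documents
    @offsets = {}
  end

  def initiate_scroll(_query)
    @offsets['scroll-1'] = 0
    scroll('scroll-1')
  end

  def scroll(scroll_id)
    offset = @offsets.fetch(scroll_id)
    page = @documents[offset, PAGE_SIZE] || []
    @offsets[scroll_id] = offset + page.size
    Result.new(scroll_id, Hits.new(page, @documents.size))
  end

  # Same flow as the slide, plus a stop on an empty page.
  def find_each(query)
    scroll_id = nil
    processed = 0
    loop do
      result = scroll_id ? scroll(scroll_id) : initiate_scroll(query)
      scroll_id = result.scroll_id
      result.hits.each { |document| yield document }
      processed += result.hits.size
      break if result.hits.size.zero? || processed >= result.hits.total
    end
  end
end

client = FakeScrollClient.new(%w[talk-1 talk-2 talk-3 talk-4 talk-5])
client.find_each({ 'match_all' => {} }) { |document| puts document }
```

With five documents and a page size of two, the block is called once per document across three pages.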

Tutorials & Guides
- http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
- http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
- http://www.elasticsearch.org/guide/
