TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems

© 2016 Knorex

Outline
1. Architecture
2. Ingredients
   • Data gathering
   • Content extraction
   • Preprocessing
   • Modelling: terms -> phrases, entities -> documents
3. Elasticsearch
   • Basic analysis, faceting and filtering
   • Do you mean
   • Percolator
   • Recommendation
   • Deduplication
4. Summary

Ingredients
1. Data gathering
   • Deep crawler
   • Lazy crawler
   • Visual scraper
   • Social media adapters
2. Content extraction
   • Take news article as an example
     • Title
     • Content
     • Published date
     • Author
     • Image
     • …

Ingredients
3. Modelling
   • Goal: synthesize words and tokens into larger units and attach meaning to them
   • Key phrase extraction
   • Named entity recognition
     • Basic building block of knowledge
     • Basis for computing relatedness and extracting relations
   • Sentiment analysis
     • Social media snippet
     • General article or towards concepts / named entities
     • Emotion
   • Document classification
     • Group search results into faceted categories
     • Recommend related articles by category

Elasticsearch
• First released Feb 2010, among the fastest-growing open-source projects, total funding $104M (3 rounds)
• Based on Apache Lucene (same as Solr)
• Written in Java, supports an HTTP interface and schema-free JSON documents (yay, no XML!)
• Designed to be scalable, distributed in nature

Analysis
• Successful!
• Default analyzer as-is: [“https”, “www.facebook.com”, “events”, “194454270949757”]
• No hits! WTH… it is not working!!!!
• url => not_analyzed / keyword analyzer
• Use match query instead of term filter / term query: field analyzer awareness
• Custom analyzer: e.g. keyword tokenizer + lowercase filter
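The "no hits" surprise can be reproduced without a cluster. The sketch below is only a rough stand-in for the analyzers, not Elasticsearch itself: `standard_like_analyze` is a hypothetical regex approximation of the default analyzer, the URL is reconstructed from the token list on the slide, and `term_query` / `match_query` are illustrative helpers rather than client calls.

```python
import re

def standard_like_analyze(text):
    """Rough approximation of the default analyzer (an assumption, not the real
    tokenizer): lowercase, split on punctuation, keep dots between letters/digits."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)]

def keyword_lowercase_analyze(text):
    """Custom analyzer from the slide: keyword tokenizer + lowercase filter,
    i.e. the whole input becomes one lowercased token."""
    return [text.lower()]

def term_query(index_tokens, term):
    """A term query is NOT analyzed: the raw term must equal an indexed token."""
    return term in index_tokens

def match_query(index_tokens, text, analyzer):
    """A match query analyzes the query text with the field's analyzer first,
    then matches if any resulting token is indexed."""
    return any(token in index_tokens for token in analyzer(text))

# URL reconstructed from the tokens shown on the slide (assumption).
URL = "https://www.facebook.com/events/194454270949757"

default_tokens = standard_like_analyze(URL)
keyword_tokens = keyword_lowercase_analyze(URL)

print(default_tokens)
print("term query, default analyzer:", term_query(default_tokens, URL))
print("match query, default analyzer:", match_query(default_tokens, URL, standard_like_analyze))
print("term query, keyword analyzer:", term_query(keyword_tokens, URL))
```

With the default analyzer the full URL never exists as a single token, so the exact term lookup fails, while the match query succeeds because it is tokenized the same way as the indexed text.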

Analysis
[Diagram: search analyzer and index analyzer around the Elasticsearch index]
• Decide carefully which fields searches will run on most frequently
• Determine which analyzer to use for each field (experimentally, based on application needs)
• Search analyzer and index analyzer might be different for the same field
• Use match query instead of term filter / term query: field analyzer awareness
• Exploit multi-field

Do you mean
• “grok” -> “grokking”, “sear” -> “search”
• Natural approach: DON’T!!!
  • Compute terms aggregation (facet) across all text fields
    • title
    • description
    • content
  • Use regex to filter matched terms, sort DESC by frequency, take most popular terms to suggest

Do you mean
• Limitations
  • Single terms only. Cannot suggest phrases
  • Terms occurring frequently might not be useful
• Improvements
  • Building another field “phrases” in the document
    • Adding entire title
    • Using key phrase extraction, named entity recognition to populate meaningful phrases
  • Custom tokenizers: keyword, edgeNGram
    • edgeNGram example: “grokking” => “gro”, “grok”, “grokk”
    • Query: “burs mal” => matched: “bursa malaysia”
    • Memory explosion!!!
  • Custom scoring (importance, popularity score) instead of term frequency
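A minimal sketch of the edge n-gram idea in plain Python, not the Elasticsearch tokenizer itself. The `min_gram=3` and `max_gram=5` defaults are assumptions chosen so the output reproduces the “grokking” example; `index_phrase` and `matches` are hypothetical helpers that treat every query word as a prefix to look up.

```python
def edge_ngrams(token, min_gram=3, max_gram=5):
    """Return the prefixes of `token` with length min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def index_phrase(phrase, min_gram=3, max_gram=5):
    """Index side: expand every word of the phrase into its edge n-grams."""
    grams = set()
    for word in phrase.lower().split():
        grams.update(edge_ngrams(word, min_gram, max_gram))
    return grams

def matches(query, phrase):
    """Search side: the query words are NOT expanded, only looked up as-is.
    The phrase matches when every query word is one of its indexed prefixes."""
    grams = index_phrase(phrase)
    return all(word in grams for word in query.lower().split())

print(edge_ngrams("grokking"))                # ['gro', 'grok', 'grokk']
print(matches("burs mal", "bursa malaysia"))  # True
print(len(index_phrase("bursa malaysia")))    # 6 indexed terms for 2 words
```

Two words already produce six indexed terms here, which is where the memory growth comes from: every additional word and every wider gram range multiplies the number of terms stored.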

Do you mean
• Elasticsearch built-in suggester
  • FST example. Source: https://www.elastic.co/blog/you-complete-me
• Features:
  • Speed & scale: FST per-segment, built in real time, scales horizontally
  • Analysis: synonym, fuzzy
  • Supports custom ordering and scoring
• Limitations: can’t find a word anywhere within a phrase

Do you mean
• Speed test: 1 million articles, 2.7 GB index size, on a single laptop with SSD
  • Regex terms facet: 296.5 ms
  • Terms suggester: 13 ms
• Cautions
  • Don’t add all terms/phrases to suggestion (only meaningful ones!)
  • Don’t start suggesting immediately. How many words start with “c”?
  • Don’t suggest terms that yield no search results
  • Apply the same filter conditions as the current query to the term suggestion query

Recommendation
• Natural approach
  • More-like-this or fuzzy-like-this on title, content
    • Not accurate, bag-of-words approach
    • Tricky to determine the threshold. A “good value” varies across document types and domains
    • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, then accuracy drops
• Proposed approaches
  • Utilize NLP results (modelling step):
    • Category: recommend articles from same categories
    • Key phrases: match and rank documents w.r.t. target documents by key phrases
    • Named entities: model with parent/child relationship
  • Combine with function score feature to rescore results
    • Example: applying a Gauss decay function to favor more recent results
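To make the Gauss decay example concrete, here is a sketch of the decay curve as described in the Elasticsearch function score documentation, written in plain Python. The parameter names follow that documentation; the 7-day `scale`, the sample documents, and the choice to multiply relevance by the decay value are illustrative assumptions, not the speaker's settings.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 at the origin, exactly `decay` at distance
    `scale` (+ offset) from the origin, falling off smoothly beyond that."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))

def rescore(hits, scale_days=7.0):
    """Multiply each relevance score by the decay of the article age (days),
    then sort by the combined score, highest first."""
    rescored = [
        (doc_id, score * gauss_decay(age_days, origin=0.0, scale=scale_days))
        for doc_id, score, age_days in hits
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Hypothetical hits: (id, relevance score, age in days).
hits = [("old-but-relevant", 2.0, 30.0), ("fresh", 1.5, 1.0)]
ranked = rescore(hits)
print(ranked[0][0])
```

The older article starts with the higher relevance score but ends up ranked below the fresh one once the recency decay is applied.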

Deduplication
• Natural approach
  • Term matching on URL, title
    • Fails if these are slightly different (very common!)
  • More-like-this or fuzzy-like-this on content, with high matching threshold: e.g. 70%, 80%
    • Not accurate, bag-of-words approach
    • Tricky to determine the threshold. A “good value” varies across document types and domains
    • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, then accuracy drops
• Proposed approach
  • Semantic hashing: minhash, simhash
    • For a document, compute a hash value
    • Convert the hash value to binary string form
    • Robust and efficient, can cater to near-duplicates
  • Implement Hamming distance search using Elasticsearch fuzzy_like_this
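The two hashing steps (compute a hash value, convert it to a binary string) can be sketched with a basic simhash. This is a generic textbook variant, not necessarily the one used in the talk: whitespace tokenization, unweighted tokens, and MD5 as the per-token hash are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width, matching the 64-bit example in the slides

def token_hash(token):
    """Stable 64-bit hash of one token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")

def simhash(text):
    """Each token votes +1/-1 on every bit position according to its own hash;
    the fingerprint keeps a 1 wherever the positive votes win."""
    votes = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint

def to_bit_string(value):
    """Binary string form of the fingerprint, zero-padded to BITS characters."""
    return format(value, "0{}b".format(BITS))

doc = "Bursa Malaysia closes higher on strong buying interest"
print(to_bit_string(simhash(doc)))
```

Because every token only nudges the per-bit votes, changing a few words tends to flip only a few bits, which is what makes a small Hamming distance a usable near-duplicate signal.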

Deduplication
• Do not index duplicates at all, or
• Collapse similar items in search results, display only the one with the highest score
  • Assign the same id to articles that are duplicates (call it groupid)
  • Use Elasticsearch Top Hits aggregation to collapse results by groupid
• Example
  • 64-bit hash: 1000010001000111101001011011110010111101000011100101101001011101
  • Modified version: 1010010001000111101011011011110010111101000011100101101000011101
  • Hamming distance: 3
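The Hamming distance on this slide can be checked directly, and the collapse step can be sketched client-side. The two bit strings are copied from the slide (split at the same point where the slide wraps them); `collapse_by_group` and the sample hits are hypothetical and only mimic what collapsing by groupid is meant to achieve, they are not an Elasticsearch call.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

# 64-bit hashes from the slide.
ORIGINAL = ("1000010001000111101001011011110010111101000011100"
            "101101001011101")
MODIFIED = ("1010010001000111101011011011110010111101000011100"
            "101101000011101")

def collapse_by_group(hits):
    """Keep only the highest-scoring hit per groupid, best score first."""
    best = {}
    for hit in hits:
        group = hit["groupid"]
        if group not in best or hit["score"] > best[group]["score"]:
            best[group] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

# Hypothetical search hits: a1 and a2 are near-duplicates sharing a groupid.
hits = [
    {"id": "a1", "groupid": "g1", "score": 1.2},
    {"id": "a2", "groupid": "g1", "score": 1.9},
    {"id": "b1", "groupid": "g2", "score": 1.5},
]
collapsed = collapse_by_group(hits)

print(hamming_distance(ORIGINAL, MODIFIED))  # 3
print([hit["id"] for hit in collapsed])      # ['a2', 'b1']
```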

Summary
• ES is very flexible with numerous features and knobs
• Critical to understand basic analysis, different types of queries
• Indexing time and search time tradeoff
• Precision and recall tradeoff
• Complexity and memory estimation
• Use NLP techniques as modelling step to improve search quality
• Pay great attention to data input and data gathering step

About Knorex
• Founded in 2010 as a spin-off from the Data Mining Dept. of A*STAR, Singapore
• Mission: enabling our customers to make smarter discovery and turn it into actionable insight
