
Stop Stupid Fuzzy Searches

Fuzzy logic is a must-have for almost any search application, especially in e-commerce. Most search tools include fuzzy logic to handle plurals, misspellings, term decomposition and other near-matches. The goal is to increase recall for a given search and ultimately improve results. In practice, however, it often reduces precision and produces inconsistent results for queries with the same meaning.

Andreas Wagner

June 14, 2017

Transcript

  1. Why we need it / Distribution of spelling errors

    [Chart: share of queries per category, shown for two series, Frequency and Value/Search. Categories: Edit distance 0, Edit distance 1, Edit distance 2, Edit distance >2, Singular & Plural, Decomposition. Values as extracted: 37%, 23%, 9%, 5%, 15%, 11% and 26%, 20%, 10%, 12%, 18%, 14%.]
  2. Why we need it / Distribution of spelling errors by device type

    [Chart: share of queries per category, shown for two series, Desktop and Mobile. Categories: Insert, Delete, Replace, Transpose, Singular & Plural, Decomposition. Values as extracted: 10%, 25%, 27%, 10%, 17%, 11% and 6%, 23%, 34%, 7%, 16%, 14%.]
  3. Causes of spelling errors

    [Chart with the columns Query-Intent, Error-type, query and Result size. Error types: format, phonetic, typo, decomposition. Spellings shown for the query intent spannbettlaken: spannbettlacken, spanbettlaken, spannbettllaken, spammbettlaken, Spann bettlaken, Bettlaken zum spannen, spann-bettlaken, …42 additional spellings. Values as extracted: 1%, 3%, 13%, 9%, 7%, 1%, 4%, 1%; 61%, 4%, 22%, 8%, 5%; result sizes 0, 83, 56, 50, 47, 0, 43, 0.]
  4. How it works

    Edit distance 1 generates 835 candidates: GET catalog/products/_search { "query": { "fuzzy": { "title": { "value": "spannbettlacken", "fuzziness": 1 } } } }
    Edit distance 2 generates ~650k candidates: GET catalog/products/_search { "query": { "fuzzy": { "title": { "value": "spannbettlacken", "fuzziness": 2 } } } }
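The candidate counts on this slide can be illustrated with a brute-force edit generator in the style of Peter Norvig's spelling corrector. This is only a sketch of how large the search space is, not a description of how Elasticsearch executes a fuzzy query internally. It assumes a 26-letter lowercase alphabet (the slide does not state one) and counts raw candidates before deduplication; the function names are made up for this example.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: 26 letters, no umlauts


def raw_edits1(word: str) -> list[str]:
    """All strings one edit away from `word`, duplicates included."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return deletes + transposes + replaces + inserts


def count_edits2(word: str) -> int:
    """Number of distinct strings reachable by applying two edits."""
    return len({e2 for e1 in set(raw_edits1(word)) for e2 in raw_edits1(e1)})


query = "spannbettlacken"
# 15 deletes + 14 transposes + 390 replaces + 416 inserts
print(len(raw_edits1(query)))  # 835
```

For a word of length n this yields 54n + 25 raw candidates at distance 1, and the second edit multiplies that by roughly the same factor, which is why the distance-2 figure lands in the hundreds of thousands.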
  5. Resulting in / high recall but low precision

    [Chart: Precision (PREC) plotted against Recall (TPR), both axes from 0 to 1.]
  6. Resulting in / low search throughput: ~0.1 seconds for spelling a short word

    [Chart: Searches per Second (0 to 35,000) by number of Query Terms (1 to 6) for four query types: or - term, and - term, or - fuzzy 2, and - fuzzy 2.]
  7. Observations

    + Searches for all possible candidates inside a given edit distance
    + Natively implemented in Elasticsearch and Lucene
    - Increased CPU usage and query response time
    - Inconsistent and not always relevant results
    - Skewed search analytics
  8. Our Solution / smart query rewrites

    [Diagram: the queries spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken and spanmbettlaken pass through two steps, "Cluster similar Queries" and "Test & Select MasterQuery". The resulting MasterQuery, spannbettlaken, is sent to the Search Engine.]
  9. Our Solution / smart query rewrites: Cluster similar Queries

    Based on deep learning and crafted algorithms, we clean and cluster queries with the same meaning. We use the concept of controlled precision reduction: Exact Match, Fingerprint, Lemmatization & Phonemes, Fuzzy Match.
    Example queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken.
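The idea of controlled precision reduction can be sketched as a cascade that tries the strictest comparison first and only falls back to looser ones when it fails. The sketch below is an assumption about how such a cascade might look, not the search|hub implementation: the normalization rules, the crude phonetic key and the distance threshold of 2 are all hypothetical, and lemmatization is left out entirely.

```python
import re
from typing import Optional


def fingerprint(query: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9äöüß]", "", query.lower())


def phonetic_key(query: str) -> str:
    """Crude stand-in for a phonetic code (hypothetical rules)."""
    key = fingerprint(query).replace("sch", "s").replace("ck", "k")
    return re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match_stage(query: str, master: str, max_distance: int = 2) -> Optional[str]:
    """Name of the first (most precise) stage at which query matches master."""
    if query == master:
        return "exact"
    if fingerprint(query) == fingerprint(master):
        return "fingerprint"
    if phonetic_key(query) == phonetic_key(master):
        return "phonetic"
    if levenshtein(fingerprint(query), fingerprint(master)) <= max_distance:
        return "fuzzy"
    return None


for q in ("spann-bettlaken", "spannbettlacken", "spammbettlaken"):
    print(q, "->", match_stage(q, "spannbettlaken"))
```

Because each query is assigned at the most precise stage that accepts it, the expensive and error-prone fuzzy comparison only sees the queries that nothing stricter could place.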
  10. Our Solution / smart query rewrites: Test & Select MasterQuery

    Based on tracking KPIs, deep learning and global parameter optimization, we test and select the query that maximises the balance between the search result interaction probability and the economic outcome.
    Example queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken.
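A very reduced sketch of the selection step, under loud assumptions: the slide does not say how interaction probability and economic outcome are combined, so the score below (interaction rate times revenue per search) is purely illustrative, and the field names and all numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    query: str
    searches: int    # how often this variant was searched
    clicks: int      # searches followed by a result interaction
    revenue: float   # revenue attributed to those searches


def score(stats: QueryStats) -> float:
    """Hypothetical score: interaction probability x revenue per search."""
    if stats.searches == 0:
        return 0.0
    interaction_probability = stats.clicks / stats.searches
    revenue_per_search = stats.revenue / stats.searches
    return interaction_probability * revenue_per_search


def select_master(cluster: list[QueryStats]) -> str:
    """Pick the variant with the best score as the cluster's MasterQuery."""
    return max(cluster, key=score).query


# Invented tracking data for one cluster of equivalent queries.
cluster = [
    QueryStats("spannbettlaken", searches=1000, clicks=400, revenue=2000.0),
    QueryStats("spannbettlacken", searches=100, clicks=0, revenue=0.0),
    QueryStats("spann-bettlaken", searches=50, clicks=10, revenue=40.0),
]
print(select_master(cluster))
```

Every query in the cluster would then be rewritten to the selected master before it reaches the search engine.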
  11. CXP search|hub / Query Intelligence Platform

    [Architecture diagram. Components shown: Frontend; Search Endpoint; High performance Caching & Logging; Data|hub; Semantic Query Parsing; Site Search Analytics; Guided Selling; Personalization; …; Smart|Query; Query Segmentation; Query Scoping; Search Engine (Solr, Elasticsearch, FACT-Finder, Fredhopper, Celebros, Algolia, ACS).]
  12. Impact – top-10 ecom player A

    Uses an already highly optimized, state-of-the-art eCommerce search solution.
    [Chart: monthly values from Jan to Dec on a scale from 90% to 140%, without smart|query vs. with smart|query.]
  13. Impact – top-50 ecom player B

    Uses an optimized Solr implementation.
    [Chart: monthly values from Apr to Mar on a scale from 90% to 140%, without smart|query vs. with smart|query.]
  14. Resulting in / High recall & high precision

    [Chart: Precision (PREC) and Recall (TPR) plotted over the number of Queries (0 to 5,000,000). The value axes run from 0.97 to 1 and from 0.5 to 1.]
  15. Resulting in / insane query performance: ~0.00005 seconds for spelling a short word – 80 ops/ms

    [Chart: Searches per Second (0 to 35,000) by number of Query Terms (1 to 6), search|hub & Elastic, for four query types: or - term, and - term, or - fuzzy 2, and - fuzzy 2.]
  16. Observations

    + more relevant results
    + consistent results
    + reduced manual effort for curated search results
    + save CPU usage
    + improved query response time
    + consistent site search analytics
    - additional complexity
  17. search|hub - PreDictLib: fast & accurate spell correction at scale

    Quick Highlights:
    § extremely fast & constant index access
    § truly language independent edit distance
    § ability to add records to the index at runtime without performance decrease
    Based on one of the most efficient spell correction implementations out there, SymSpell by Wolf Garbe.
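The highlights follow from SymSpell's symmetric delete idea: instead of generating inserts, replaces and transposes at query time, only deletions are generated, once for every dictionary term at index time and once for the query at lookup time. The sketch below is a simplified illustration of that idea, not the PreDict or SymSpell code; the class and function names are hypothetical.

```python
from collections import defaultdict


def deletes(word: str, max_distance: int) -> set[str]:
    """`word` plus every string made by deleting up to max_distance chars."""
    result, frontier = {word}, {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        result |= frontier
    return result


def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


class DeleteIndex:
    def __init__(self, max_distance: int = 2):
        self.max_distance = max_distance
        self.index = defaultdict(set)  # delete variant -> original terms

    def add(self, term: str) -> None:
        """Terms can be added at any time; each costs a few hash inserts."""
        for key in deletes(term, self.max_distance):
            self.index[key].add(term)

    def lookup(self, query: str) -> list[str]:
        candidates = set()
        for key in deletes(query, self.max_distance):
            candidates |= self.index.get(key, set())
        # Delete-only matching over-generates, so verify with a real distance.
        scored = sorted((osa_distance(query, c), c) for c in candidates)
        return [c for dist, c in scored if dist <= self.max_distance]


index = DeleteIndex()
for term in ("spannbettlaken", "bettlaken", "bettdecke"):
    index.add(term)
print(index.lookup("spannbettlacken"))
```

A 15-letter query produces only about a hundred delete variants at distance 2, each resolved with a hash lookup, and none of them depends on an alphabet, which is what makes the approach language independent.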
  18. SymSpell / some Benchmarks

    [Chart: Throughput vs Accuracy. Accuracy / Searches/sec per system, in legend order: Lucene WordCorrect 88.7% / 1.0%; ElasticSearch WordCorrect 45.8% / 1.0%; No.2 eCommerce Search 69.2% / 1.7%; No.1 in eCommerce Search 88.3% / 2.2%; SymSpell 88.7% / 100.0%.]
  19. search|hub - PreDictLib: fast & accurate spell correction at scale

    • replaced the plain edit distance with a weighted edit distance
    • replaced the Damerau-Levenshtein distance with a weighted Damerau-Levenshtein distance that takes keyboard distance into account
    • re-rank the candidate list by applying additional similarity algorithms
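A keyboard-aware weighting can be sketched by making substitutions between neighbouring keys cheaper than other edits. The layout (German QWERTZ without row stagger), the neighbour rule and the cost of 0.5 are assumptions chosen for illustration; the slide does not give the weights PreDict uses, and the re-ranking step is not shown.

```python
ROWS = ("qwertzuiopü", "asdfghjklöä", "yxcvbnm")  # assumption: QWERTZ layout
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def substitution_cost(a: str, b: str) -> float:
    """0 for equal chars, 0.5 for neighbouring keys, 1 otherwise (hypothetical)."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5  # neighbouring keys: a likely slip of the finger
    return 1.0


def weighted_distance(a: str, b: str) -> float:
    """Damerau-Levenshtein (optimal string alignment) with weighted substitutions."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                   # delete
                d[i][j - 1] + 1.0,                                   # insert
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)        # transpose
    return d[len(a)][len(b)]


# "m" and "n" are neighbours, so two such typos cost 1.0 instead of 2.0.
print(weighted_distance("spammbettlaken", "spannbettlaken"))  # 1.0
```

With weights like these, a candidate that differs only by plausible finger slips ranks ahead of one that needs the same number of unrelated substitutions.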
  20. search|hub – PreDict (CE) & PreDict (EE) / some Benchmarks

    [Chart: Throughput vs. Accuracy. Accuracy / Searches/sec per system, in legend order: Lucene WordCorrect 89% / 1%; ElasticSearch WordCorrect 46% / 1%; No.2 eCommerce Search 69% / 1%; No.1 in eCommerce Search 88% / 2%; SymSpell 89% / 86%; CXP PreDict (CE) 89% / 100%; CXP Searchhub 99% / 98%.]
  21. CXP SmartQuery – PreDictLib (CE): fast & accurate spell correction at scale

    What you'll get:
    § the Lib as Java source
    § accuracy and benchmark tests
    § real-life test data
    https://github.com/searchhub/preDict