Building a Fast Fuzzy Searcher and Spell Checker

Ling Zhang (Software Engineer @ Aiden.ai) @ Moscow Python Conf 2017
"Spelling is hard, really hard. It's an everyday user frustration to try to search for a friend's name or the name of a restaurant that they heard but end up writing it wrong. In this talk, we will cover a python implementation of a single fast algorithm that can recover from spelling errors, typing errors, and even transliteration mistakes! We will also integrate this with a language model to make it context aware. With this technique, you can build powerful fuzzy text searchers and spell checkers".
Video: https://conf.python.ru/building-fast-fuzzy-searcher-and-spell-checker/

Moscow Python Meetup

October 20, 2017

Transcript

  1. Why Does Spell Check Matter?
     • Misspelled searches = product failure
       ◦ Can’t find a restaurant on Yelp
       ◦ Can’t find a friend on Facebook
       ◦ Can’t find a shop on Google Maps
       ◦ Can’t find a song on Spotify
     • Necessary for machine learning / NLP
       ◦ Pre-processing step for NLP
       ◦ Sentiment analysis, intent detection, entity extraction
     • Speech recognition disambiguation
     • Chatbots
  2. Error Distribution
     • 23% of queries were spelled wrong.
     • So 1 in 4 uses of our app will result in failure.
     • There is a long tail of rare errors.
  3. Distribution Within Errors
     Queries Captured | Spelling Variations
     90%              | 18
     95%              | 40
     99%              | 196
     99.9%            | 519
     • Improving accuracy by ~10% requires roughly 10x more variations.
  4. Top Spelling Variations
     Rank | Spelling
     1    | britney spears
     2    | brittany spears
     3    | brittney spears
     4    | britany spears
     5    | britny spears
     6    | briteny spears
     7    | britteny spears
     8    | briney spears
     9    | brittny spears
  5. Levenshtein Distance
     • Count the total edits required to go from one string to another:
       ◦ Insertion
       ◦ Deletion
       ◦ Substitution
     • Example: the Levenshtein distance between “elllu” and “hello” is 3.
       Str 1 |     -     | E | L | L |    L     |      U       |
       Str 2 |     H     | E | L | L |    -     |      O       |
       Edit  | Insertion | - | - | - | Deletion | Substitution |
       Cost  |     1     | 0 | 0 | 0 |    1     |      1       | Total: 3
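For concreteness, here is a standard dynamic-programming implementation of this distance (a textbook version, not code from the talk):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))    # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("elllu", "hello") == 3   # the slide's example
```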
  6. Peter Norvig’s Spelling Correction Algorithm
     • Check whether the word is in the dictionary; if so, return it.
     • Generate all words 1 Levenshtein distance away. If any of these are in the dictionary, return them.
     • Generate all words 2 Levenshtein distances away. If any of these are in the dictionary, return them.
     • If there is more than one candidate at a step, take the one with the highest score (scoring has an open definition).
     • Norvig’s edit set extends plain Levenshtein by also allowing transpositions.
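A sketch of the generator at the heart of this approach, closely following Norvig's published corrector (norvig.com/spell-correct.html); `WORDS` is assumed to be a Counter of word frequencies built from a corpus, with frequency serving as the score.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word, WORDS: Counter):
    """Most frequent known word within edit distance 0, then 1, then 2."""
    known = lambda ws: {w for w in ws if w in WORDS}
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS[w])
```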
  7. Is Levenshtein Good Enough?
     Rank | Spelling        | Norvig Distance
     1    | britney spears  | 0
     2    | brittany spears | 3
     3    | brittney spears | 1
     4    | britany spears  | 2
     5    | britny spears   | 1
     6    | briteny spears  | 1
     7    | britteny spears | 2
     8    | briney spears   | 1
     9    | brittny spears  | 2
     Total proportion caught: 94% (70% of errors).
  8. Balancing Speed and Accuracy
     Norvig Distance | Variants Generated | Correction Accuracy | Worst Case Execution Time
     0               | 1                  | 0%                  | < 1 ms
     1               | ~800               | 54%                 | 1.4 ms
     2               | ~600,000           | 70%                 | 320 ms
     3               | ~57,000,000        | Unknown             | 397,000 ms (6.5 minutes)
     Conclusion:
     • Accuracy is not great, yet we cannot scale beyond distance 2.
     • Still too slow: spell checking a 200-word Facebook post at distance 2 takes 7 seconds (35 ms per word on average).
     • We generate 600,000 variants just to find the 40 variants actually in use.
  9. We need a smarter way to get from 70% to 95%. And it needs to be faster.
  10. Top Spelling Errors Analyzed
      Spelling | Error
      brittany | Vowel added
      brittney | Phonetically correct
      britany  | Vowel added
      britny   | Phonetically correct
      briteny  | Vowel added
      britteny | Vowel added
      briney   | Vowel added
      brittny  | Phonetically correct
      ritany   | Consonant removed
      • Errors are not randomly distributed; keyboard errors are in fact rare.
      • Most errors are just one phonetic difference away.
      • There are multiple ways to spell the same phonetic sound (“ney”, “ny”).
  11. A Phonetic Point of View
      • The real error distribution is very far from the uniform one assumed by Levenshtein / Norvig.
      • So work in phonetic space, dealing with sounds.
      • Model the word as a sequence of consonants and vowels, and do insertions / deletions / substitutions on these.
      • Give each insertion / deletion / substitution a score:
        ◦ Insertion: vowels are more likely to be added.
        ◦ Deletion: vowels are more likely to be deleted.
        ◦ Substitution: vowels are all somewhat similar; certain consonants are similar, e.g. (s, sh) and (b, p), but others are very different (sh, p).
  12. Phoneticization
      • We will use the International Phonetic Alphabet (IPA).
      • It is a standard alphabet of sounds, independent of language. Each symbol is a phone.
      • English is a mixed language, so its phoneticization sometimes seems illogical:
        English | h | e | ll | o
        IPA     | h | ə | l  | ʊ
  13. Obtaining a Mapping
      • We could use deep learning (I’ve tried):
        ◦ It doesn’t work that well.
        ◦ It’s slow to execute.
      • Instead, we can do a simple letter mapping.
      • Wikipedia has IPA mappings for most languages.
      • English is a hard language to map; many other languages have straightforward mappings (Spanish, Arabic).
      • Being approximately correct is good enough.
      • There are also ready-made phonetic dictionaries: https://github.com/cmusphinx/cmudict (a loading sketch follows below).
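For example, a minimal sketch of loading cmudict into a word-to-pronunciations map. The file name and comment conventions are assumptions about the repository's plain-text format, and the phones are ARPAbet symbols rather than IPA:

```python
def load_cmudict(path="cmudict.dict"):
    """word -> list of pronunciations, each a list of ARPAbet phone strings."""
    pronunciations = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split(" #")[0].strip()      # drop inline comments
            if not line or line.startswith(";;;"):  # skip header comments
                continue
            word, phones = line.split(None, 1)      # "hello HH AH0 L OW1"
            word = word.split("(")[0]               # "hello(2)" -> "hello"
            pronunciations.setdefault(word, []).append(phones.split())
    return pronunciations

# load_cmudict()["hello"] would give [["HH", "AH0", "L", "OW1"], ...]
```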
  14. English Mapping
      • The English -> IPA mapping fits in ~70 rules.
      • It is a many-to-many mapping.
      • Vowels are particularly tricky in English.
  15. Phoneticization Code
      ‘hello’ -> ‘_h_e_l_l_o’ -> ‘$h$ɛ$l$l$oʊ’ -> (‘h’, ‘ɛ’, ‘l’, ‘oʊ’)
      • Many variants of the phoneticization are produced.
      • Some are incorrect, but that doesn’t matter: they are unlikely to match any word.
      • Incorrect variants can even be a positive: they let us correct the spelling when the user has read the pronunciation of the word wrong. (A sketch follows below.)
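A minimal sketch of this pipeline. The rule table is a tiny illustrative subset of the ~70 many-to-many rules mentioned on the previous slide; the specific IPA choices here are assumptions, not the talk's table.

```python
from itertools import product

RULES = {   # grapheme -> possible IPA phones (illustrative subset)
    "h": ["h"],
    "e": ["ɛ", "ə", "i"],
    "l": ["l"],
    "ll": ["l"],
    "o": ["oʊ", "ɒ", "ʊ"],
}

def segment(word):
    """Greedily split a word into known graphemes, longest match first."""
    chunks, i = [], 0
    while i < len(word):
        for size in (2, 1):
            piece = word[i:i + size]
            if piece in RULES:
                chunks.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return chunks

def phoneticize(word):
    """Return every phone tuple the rule table allows for this word."""
    return list(product(*(RULES[c] for c in segment(word))))

# phoneticize("hello") yields 9 variants, including ('h', 'ɛ', 'l', 'oʊ');
# most of the others will simply never match a dictionary entry.
```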
  16. Generating Correction Candidates
      • Goal: generate variations of a pronunciation, each with a numeric distance score.
      • This includes insertions, deletions and substitutions.
      • The cost varies per phone.
        (‘h’, ‘ɛ’, ‘l’, ‘o’)
        (‘h’, ‘ɛ’, ‘l’, ‘o’, ‘s’)   Insertion
        (‘ɛ’, ‘l’, ‘o’)             Deletion
        (‘h’, ‘a’, ‘l’, ‘o’)        Substitution
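A minimal sketch of scored one-edit candidate generation over phone tuples. The cost numbers and the VOWELS set are illustrative assumptions (vowels are cheaper to insert or delete), not the talk's actual weights.

```python
VOWELS = {"a", "ɛ", "ə", "i", "o", "oʊ", "ʊ"}

def edit_cost(phone):
    """Inserting or deleting a vowel is cheaper than a consonant."""
    return 0.5 if phone in VOWELS else 1.0

def sub_cost(p, q):
    """Stand-in for a real phone-similarity function (see the next slides)."""
    return 0.4 if p in VOWELS and q in VOWELS else 1.0

def one_edit_variants(phones, alphabet):
    """Yield (variant, cost) pairs one weighted edit away from `phones`."""
    for i in range(len(phones) + 1):            # insertions
        for p in alphabet:
            yield phones[:i] + (p,) + phones[i:], edit_cost(p)
    for i in range(len(phones)):                # deletions
        yield phones[:i] + phones[i + 1:], edit_cost(phones[i])
    for i, old in enumerate(phones):            # substitutions
        for p in alphabet:
            if p != old:
                yield phones[:i] + (p,) + phones[i + 1:], sub_cost(old, p)

# one_edit_variants(('h', 'ɛ', 'l', 'o'), alphabet) yields, among others,
# (('h', 'ɛ', 'l', 'o', 's'), 1.0), (('ɛ', 'l', 'o'), 1.0) and
# (('h', 'a', 'l', 'o'), 0.4): the slide's three examples.
```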
  17. Phone Features in Python
      • Many distance metrics exist in the research literature; this is just one of them.
      • Every phone is a ternary feature vector (True, False, None).
  18. Distance Algorithm
      • The distance between two phones is just the number of features that don’t match.
      • Some features are more important than others, so they contribute different weights to the distance. (A sketch covering both slides follows below.)
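A minimal sketch of slides 17 and 18 together: phones as ternary feature vectors and a weighted mismatch count. The feature inventory, values, and weights are illustrative assumptions, not PyPhone's actual tables.

```python
# Each phone is a ternary feature vector: True / False / None, where None
# means the feature does not apply to that phone.
FEATURES = {
    "b": dict(voiced=True,  nasal=False, vowel=False, rounded=None),
    "p": dict(voiced=False, nasal=False, vowel=False, rounded=None),
    "m": dict(voiced=True,  nasal=True,  vowel=False, rounded=None),
    "ʊ": dict(voiced=True,  nasal=False, vowel=True,  rounded=True),
}

# Some features matter more than others.
WEIGHTS = {"voiced": 0.5, "nasal": 1.0, "vowel": 2.0, "rounded": 0.5}

def phone_distance(p, q):
    """Weighted count of features on which two phones disagree."""
    fp, fq = FEATURES[p], FEATURES[q]
    return sum(
        WEIGHTS[name]
        for name in WEIGHTS
        if fp[name] is not None
        and fq[name] is not None
        and fp[name] != fq[name]
    )

assert phone_distance("b", "p") == 0.5   # differ only in voicing: similar
assert phone_distance("b", "ʊ") == 2.0   # consonant vs vowel: far apart
```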
  19. Option: Levenshtein with Phones
      • We could do a weighted Levenshtein distance with phones instead of letters.
      • Advantage: we have ways to estimate how expensive inserting or deleting a phone is, and also the similarity between two phones for substitution.
  20. Problem: Too Slow!
      • This can give good results, but it is too slow!
      • Weighted Levenshtein requires us to sort candidates by cost, and sorting is slow.
      • With ~50 phones and a 6-letter word, the number of variants explodes:
        # Edits | Variations
        1       | ~600
        2       | ~360,000
        3       | ~216,000,000
        (At one edit: roughly 7 insertion positions x 50 phones, plus 6 deletions, plus 6 x 49 substitutions, about 650 variants.)
      • Computing the costs and sorting to find the best results is too slow, much slower than Norvig’s algorithm.
  21. Solution: Cluster the Phones
      • There exist natural groupings of similar phones.
      • We can collapse each group into a single symbol.
      • Bonus: if we only allow substitutions within a cluster, a substitution never changes the cluster symbol. We remove half of our variations and no longer need to consider substitution edits at all.
  22. Idea: Clustering Algorithm
      • Create a distance matrix between all phones in a language using PyPhone.
      • Cluster the phones into groups using this distance matrix.
  23. SciPy to the Rescue
      • Everything is easy in Python.
      • We can calculate the triangular distance matrix and create the clusters in ~20 lines of Python (see the sketch below).
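A minimal sketch of the clustering step with SciPy's hierarchical clustering, reusing phone_distance() and the toy phone list from the previous sketch; the 1.0 cut threshold is an illustrative assumption.

```python
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

phones = ["b", "p", "m", "ʊ"]

# Condensed (upper-triangular) distance matrix, in the order SciPy expects.
condensed = [phone_distance(p, q) for p, q in combinations(phones, 2)]

# Agglomerative clustering, then cut the tree at a maximum merge distance
# to obtain flat cluster ids.
tree = linkage(condensed, method="average")
cluster_ids = fcluster(tree, t=1.0, criterion="distance")

clusters = dict(zip(phones, cluster_ids))
# With the toy tables above, 'b' and 'p' end up sharing a cluster id,
# while 'm' and 'ʊ' each get their own.
```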
  24. Phonex (Phone Clusters)
      • Phonetic indexing by PyPhone.
      • Inspired by Soundex (developed in 1918).
      • Clusters of sounds.
      • Automatically generated for any language set.
  25. Candidate Generation
      • Similar to a reduced Levenshtein, with only deletion and insertion.
      • Substitutions are already handled by our clustering!
  26. Fuzzy Search Algorithm Overview
      Tokenization:    ‘hello’
      Phoneticization: ‘_h_e_l_l_o’ -> ‘$h$ɛ$l$l$oʊ’ -> (‘h’, ‘ɛ’, ‘l’, ‘oʊ’)
      Strip Repeats + Clustering: (19, 1, 18, 3)
      Insertion: (19, 1, 18, 3, 5)    Deletion: (1, 18, 3)
      Dictionary Lookup, then Repeat with more edits if nothing matched.
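Tying the stages together: a minimal end-to-end sketch of this loop, reusing phoneticize() and the clusters mapping from the earlier sketches. The index layout and all names here are illustrative assumptions, not the talk's actual API.

```python
from itertools import groupby

def to_key(phones, clusters):
    """Map phones to cluster ids and strip runs of repeated ids."""
    ids = [clusters[p] for p in phones]
    return tuple(cid for cid, _ in groupby(ids))

def build_index(vocabulary, clusters):
    """Index every phoneticization of every word by its cluster-id key."""
    index = {}
    for word in vocabulary:
        for phones in phoneticize(word):
            index.setdefault(to_key(phones, clusters), set()).add(word)
    return index

def one_edit_keys(key, alphabet):
    """Keys one insertion or deletion away (substitution is free in-cluster)."""
    for i in range(len(key) + 1):
        for cid in alphabet:
            yield key[:i] + (cid,) + key[i:]   # insertion
    for i in range(len(key)):
        yield key[:i] + key[i + 1:]            # deletion

def fuzzy_lookup(word, index, clusters, alphabet, max_edits=2):
    """Widen the search one edit at a time until a dictionary key matches."""
    keys = {to_key(p, clusters) for p in phoneticize(word)}
    for _ in range(max_edits + 1):
        matches = set().union(*(index.get(k, set()) for k in keys))
        if matches:
            return matches
        keys = {k2 for k in keys for k2 in one_edit_keys(k, alphabet)}
    return set()
```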
  27. Comparing Results
      Algorithm             | Norvig Levenshtein Spell Check | Fast Fuzzy Search
      Errors Corrected      | 71%                            | 95%
      Average Query Time    | 33 ms                          | 25 ms
      Worst Case Query Time | 670 ms                         | 300 ms
      • 34% more errors caught, taking us to 95%.
      • Average query time reduced by 25%.
      • Worst case query time reduced by over 50%.
      • Overall, 99% of all queries end up handled correctly.
  28. Combining with Other Tools
      • Spell checking enhances your tooling; it doesn’t replace it.
      • You can run Norvig + fast fuzzy phonetic search: one handles keyboard errors well, the other handles not knowing how to spell.
      • You can combine this with traditional tools like Elasticsearch or indexed DB searches.
      • Use spell checking as a pre-processing step in NLP to boost performance.
      • We can do cross-language search with phonetic fuzzy search.
      • We can recover from errors in speech recognition, especially when dealing with closed vocabularies.
  29. Recap
      • People are bad at spelling.
      • Spelling errors break products!
      • Spelling variants are many, but they follow a predictable (and far from random!) distribution.
      • Most spelling errors are phonetic errors.
      • It’s possible to build a powerful spell corrector / fuzzy searcher in only a few lines of Python.
      • Fast fuzzy search can boost the accuracy of existing tools like indexed DB searches, Elasticsearch, NLP pipelines, and speech recognition.
      • You can cross language barriers using a phonetic search technique.