Building a Fast Fuzzy Searcher and Spell Checker

Ling Zhang (Software Engineer @ Aiden.ai) @ Moscow Python Conf 2017
"Spelling is hard, really hard. It's an everyday user frustration to try to search for a friend's name or the name of a restaurant that they heard but end up writing it wrong. In this talk, we will cover a python implementation of a single fast algorithm that can recover from spelling errors, typing errors, and even transliteration mistakes! We will also integrate this with a language model to make it context aware. With this technique, you can build powerful fuzzy text searchers and spell checkers".
Video: https://conf.python.ru/building-fast-fuzzy-searcher-and-spell-checker/

Moscow Python Meetup

October 20, 2017

Transcript

  1. Fast Fuzzy Search
    & Spelling Correction
    (A Phonetic Approach)
    Ling Zhang
    Software Engineer at Aiden.ai
    [email protected]

  2. Awesome searching
    in 30 lines of Python

  3. Spelling is hard. Really hard.

  4. Why Does Spell Check Matter?
    ● Misspelled searches = Product Failure
    ○ Can’t find restaurant on Yelp
    ○ Can’t find friend on Facebook
    ○ Can’t find shop on Google Maps
    ○ Can’t find song on Spotify
    ● Necessary for machine learning / NLP
    ○ Pre-processing step for NLP
    ○ Sentiment analysis, intent detection, entity extraction
    ● Speech Recognition disambiguation
    ● Chatbots

  5. How many ways do people
    spell Britney Spears?

  6. 593 ways to spell
    Britney Spears

  7. Error Distribution
    - 23% of queries were spelt wrong.
    - 1 in 4 uses of our app will result in failure.
    - Long tail of errors.

  8. Distribution Within Errors

    Queries Captured | Spelling Variations
    90%              | 18
    95%              | 40
    99%              | 196
    99.9%            | 519

    Improving accuracy by ~10% requires 10x more variations.

  9. Goal: can we correct the error 95% of the time?

  10. Top Spelling Variations

    Rank | Spelling
    1    | britney spears
    2    | brittany spears
    3    | brittney spears
    4    | britany spears
    5    | britny spears
    6    | briteny spears
    7    | britteny spears
    8    | briney spears
    9    | brittny spears

  11. Peter Norvig Spelling Corrector
    Levenshtein
    Distance
    http://norvig.com/spell-correct.html
    Fits on one page!

  12. Levenshtein Distance
    ● Count the total edits required to go from one string to another
      ○ Insertion
      ○ Deletion
      ○ Substitution
    ● Example: the Levenshtein distance between “elllu” and “hello” is 3

    Str 1 | -         | E | L | L | L        | U            | Total
    Str 2 | H         | E | L | L | -        | O            |
    Edit  | Insertion | - | - | - | Deletion | Substitution |
    Cost  | 1         | 0 | 0 | 0 | 1        | 1            | 3
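
    For reference, a minimal dynamic-programming implementation of Levenshtein distance (a standard textbook version, not the talk's own code):

        def levenshtein(a: str, b: str) -> int:
            """Classic edit distance: insert / delete / substitute, cost 1 each."""
            prev = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
            for i, ca in enumerate(a, 1):
                curr = [i]  # deleting all i characters of a[:i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(
                        prev[j] + 1,               # deletion
                        curr[j - 1] + 1,           # insertion
                        prev[j - 1] + (ca != cb),  # substitution (free if equal)
                    ))
                prev = curr
            return prev[-1]

        assert levenshtein("elllu", "hello") == 3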

  13. Peter Norvig Spelling Correction Algorithm
    ● Check if the word is in the dictionary; if so, return it.
    ● Generate all words 1 Levenshtein distance away. If any of these are in the dictionary, return.
    ● Generate all words 2 Levenshtein distance away. If any of these are in the dictionary, return.
    ● If there is more than one candidate at a step, take the one with the highest score (scoring has an open definition).
    ● Norvig’s algorithm uses an extended Levenshtein distance that also allows transpositions.
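
    As a reference point, a compact sketch in the spirit of Norvig's corrector (http://norvig.com/spell-correct.html); the toy WORDS counter here stands in for a frequency table built from a large corpus:

        from collections import Counter

        WORDS = Counter({"britney": 100, "spears": 80, "hello": 50})  # toy corpus
        LETTERS = "abcdefghijklmnopqrstuvwxyz"

        def edits1(word):
            """All strings one edit away: deletes, transposes, replaces, inserts."""
            splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
            deletes = [l + r[1:] for l, r in splits if r]
            transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
            replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
            inserts = [l + c + r for l, r in splits for c in LETTERS]
            return set(deletes + transposes + replaces + inserts)

        def known(words):
            return {w for w in words if w in WORDS}

        def correct(word):
            """Prefer the word itself, then words 1 edit away, then 2 edits away."""
            candidates = (known([word])
                          or known(edits1(word))
                          or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                          or {word})
            return max(candidates, key=lambda w: WORDS[w])  # highest frequency wins

        print(correct("briteny"))  # -> 'britney' (one transposition away)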

  14. Is Levenshtein good enough?

    Rank | Spelling        | Norvig Distance
    1    | britney spears  | 0
    2    | brittany spears | 3
    3    | brittney spears | 1
    4    | britany spears  | 2
    5    | britny spears   | 1
    6    | briteny spears  | 1
    7    | britteny spears | 2
    8    | briney spears   | 1
    9    | brittny spears  | 2

    Total proportion caught: 94% (70% of errors)

  15. Balancing Speed and Accuracy

    Norvig Distance | Variants Generated | Correction Accuracy | Worst-Case Execution Time
    0               | 1                  | 0%                  | < 1 ms
    1               | ~800               | 54%                 | 1.4 ms
    2               | ~600,000           | 70%                 | 320 ms
    3               | ~57,000,000        | Unknown             | 397,000 ms (6.5 minutes)

    Conclusion:
    - Accuracy is not great, and it cannot scale beyond distance 2.
    - Still too slow: spell checking a 200-word Facebook post at distance 2 takes ~7 seconds (35 ms average per word).
    - We generate 600,000 variants just to find the ~40 variants people actually use.

  16. We need a smarter way to get from
    70% -> 95%
    And it needs to be faster

  17. Top Spelling Errors Analyzed

    Spelling | Error
    brittany | Vowel added
    brittney | Phonetically correct
    britany  | Vowel added
    britny   | Phonetically correct
    briteny  | Vowel added
    britteny | Vowel added
    briney   | Vowel added
    brittny  | Phonetically correct
    ritany   | Consonant removed

    ● Errors are not randomly distributed
      ○ Keyboard errors are in fact rare
    ● Most errors are just one phonetic difference away
    ● There are multiple ways to spell the same phonetic sound (ney, ny)

  18. A Phonetic Point of View
    ● The distribution of errors is very far from the random one assumed by Levenshtein / Norvig.
    ● Work in phonetic space (deal with sounds).
    ● Model the word as a series of consonants + vowels. Do insertions / deletions / substitutions on these.
    ● Give each insertion / deletion / substitution a score.
    ● Insertion
      ○ Vowels are more likely to be added
    ● Deletion
      ○ Vowels are more likely to be deleted
    ● Substitution
      ○ Vowels are all somewhat similar
      ○ Certain consonants are similar, e.g. (s, sh), (b, p), but others are very different (sh, p)

  19. Fast Fuzzy Search Code

  20. Parts to fill in
    ● Phoneticization
    ● Candidate Generation

  21. Phoneticization
    ● We will use the International Phonetic Alphabet (IPA)
    ● Standard alphabet of sounds, independent of language. Each symbol is a phone.
    ● English is a mixed language, so sometimes phoneticization seems illogical

    English | h | e | ll | o
    IPA     | h | ə | l  | ʊ

  22. Obtaining a mapping
    ● We could use deep learning (I’ve tried)
      ○ It doesn’t work that well
      ○ It’s slow to execute
    ● We can do a simple letter mapping.
    ● Wikipedia has IPA mappings for most languages
    ● English is a hard language to map
    ● Many other languages have straightforward mappings (Spanish, Arabic)
    ● Being approximately correct is good enough
    ● Also, we have phonetic dictionaries:
      https://github.com/cmusphinx/cmudict

  23. English IPA
    https://en.wikipedia.org/wiki/Help:IPA/English

  24. Russian IPA
    https://en.wikipedia.org/wiki/Help:IPA/Russian

  25. English Mapping
    ● English -> IPA mapping in ~70 rules
    ● Many-to-many mapping
    ● Vowels are particularly tricky in English

  26. Phoneticization Code
    ● 40 lines of Python

  27. Phoneticization Code
    ‘hello’ → ‘_h_e_l_l_o’ → ‘$h$ɛ$l$l$oʊ’ → (‘h’, ‘ɛ’, ‘l’, ‘oʊ’)
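
    A hypothetical sketch of this pipeline; the rules below are illustrative stand-ins for the talk's ~70 letter-to-IPA rules:

        import re

        # toy rewrite rules: '_' marks an unconsumed letter, '$' marks a phone
        RULES = [
            ("_h", "$h"),
            ("_e_l", "$ɛ$l"),  # 'e' before 'l' -> /ɛ/
            ("_l", "$l"),
            ("_o$", "$oʊ"),    # word-final 'o' -> /oʊ/ ('$' in the pattern = end of string)
        ]

        def phoneticize(word: str):
            s = "".join("_" + c for c in word)       # 'hello' -> '_h_e_l_l_o'
            for pattern, phone in RULES:
                s = re.sub(pattern, phone, s)        # -> '$h$ɛ$l$l$oʊ'
            phones = [p for p in s.split("$") if p]  # split on phone markers
            # strip adjacent repeats ('l', 'l' -> 'l'), as in the slide's output
            return tuple(p for i, p in enumerate(phones) if i == 0 or p != phones[i - 1])

        print(phoneticize("hello"))  # ('h', 'ɛ', 'l', 'oʊ')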

  28. Phoneticization Code
    ‘hello’ → ‘_h_e_l_l_o’ → ‘$h$ɛ$l$l$oʊ’ → (‘h’, ‘ɛ’, ‘l’, ‘oʊ’)
    ● Many variants of phoneticization are produced.
    ● Some are incorrect, but that doesn’t matter: they are unlikely to match any word.
    ● Incorrect variants can even be positive: they let us correct spelling when the user reads the pronunciation of the word wrong.

  29. Parts to fill in
    ● Phoneticization
    ● Candidate Generation

  30. Generating Correction Candidates
    ● Goal: generate variations of the pronunciation, each with a numeric distance score.
    ● Includes insertions, deletions and substitutions.
    ● Variable cost per phone.

    (‘h’, ‘ɛ’, ‘l’, ‘o’) → (‘h’, ‘ɛ’, ‘l’, ‘o’, ‘s’)   Insertion
    (‘h’, ‘ɛ’, ‘l’, ‘o’) → (‘ɛ’, ‘l’, ‘o’)             Deletion
    (‘h’, ‘ɛ’, ‘l’, ‘o’) → (‘h’, ‘a’, ‘l’, ‘o’)        Substitution
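
    A minimal sketch of scored candidate generation over phone tuples; the phone inventory and cost numbers are illustrative assumptions, not the talk's actual values:

        VOWELS = {"a", "ɛ", "o", "ʊ"}
        CONSONANTS = {"h", "l", "s", "p"}
        PHONES = VOWELS | CONSONANTS

        def ins_del_cost(p: str) -> float:
            return 0.5 if p in VOWELS else 1.0   # vowels come and go more easily

        def sub_cost(p: str, q: str) -> float:
            same_class = (p in VOWELS) == (q in VOWELS)
            return 0.5 if same_class else 1.5    # vowel <-> vowel swaps are cheap

        def variants(phones: tuple):
            """Yield (candidate, score) pairs one edit away from `phones`."""
            for i in range(len(phones) + 1):                 # insertions
                for p in PHONES:
                    yield phones[:i] + (p,) + phones[i:], ins_del_cost(p)
            for i in range(len(phones)):                     # deletions
                yield phones[:i] + phones[i + 1:], ins_del_cost(phones[i])
            for i, old in enumerate(phones):                 # substitutions
                for p in PHONES - {old}:
                    yield phones[:i] + (p,) + phones[i + 1:], sub_cost(old, p)

        # three cheapest candidates for ('h', 'ɛ', 'l', 'o')
        print(sorted(variants(("h", "ɛ", "l", "o")), key=lambda cs: cs[1])[:3])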

  31. Phonetic Distance using Feature Vectors
    Source: http://idiom.ucsd.edu/~bakovic/grad-phonology/fa14/stuff/pdf/PhonChart_v1102.pdf

  32. PyPhone: Python Phonology Library
    Open source on Github:
    https://github.com/lingz/pyphone

  33. Phone Features in Python
    ● There are many distance metrics in the research literature; this is just one.
    ● Every phone is a ternary feature vector (True, False, None)

  34. Distance Algorithm
    ● The distance between two phones is just the number of features that don’t match.
    ● Some features are more important than others, so they contribute different weights to the distance.
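
    A sketch of this weighted feature-mismatch distance. The feature set, vectors and weights below are toy assumptions; the real ones live in the PyPhone library:

        FEATURES = ["voiced", "nasal", "rounded"]
        WEIGHTS = {"voiced": 2.0, "nasal": 1.0, "rounded": 0.5}

        # each phone is a ternary feature vector: True / False / None
        PHONES = {
            "b": {"voiced": True,  "nasal": False, "rounded": None},
            "p": {"voiced": False, "nasal": False, "rounded": None},
            "m": {"voiced": True,  "nasal": True,  "rounded": None},
        }

        def phone_distance(p: str, q: str) -> float:
            """Sum the weights of the features on which the two phones disagree."""
            return sum(WEIGHTS[f] for f in FEATURES if PHONES[p][f] != PHONES[q][f])

        print(phone_distance("b", "p"))  # 2.0 -- differ only in voicing
        print(phone_distance("p", "m"))  # 3.0 -- differ in voicing and nasality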

  35. Option: Levenshtein with Phones
    ● We could do weighted Levenshtein distance with phones instead of letters.
    ● Advantage: we have ways to estimate how expensive inserting/deleting phones is, and also the similarity between two phones for substitution.

  36. Problem: Too slow!
    ● Can get good results, but this is too slow!
    ● Weighted Levenshtein requires us to sort, and sorting is slow.
    ● Also, with 50 phones and a 6-letter word, the number of variations explodes:

      # Edits | Variations
      1       | ~600
      2       | ~360,000
      3       | ~216,000,000

    ● Computing the costs and sorting to find the best results is too slow!
    ● Much slower than Norvig’s algorithm.

  37. Solution: Cluster the Phones
    ● There exist natural groupings of similar phones.
    ● We could cluster each group into a single symbol.
    ● Bonus: if we only allow substitutions within a cluster, we remove half our variations and no longer need to consider substitution edits.

  38. Idea: Clustering Algorithm
    ● Create a distance matrix between all phones in a language using PyPhone.
    ● Cluster these phones into groups using this distance matrix.

  39. SciPy to the rescue
    ● Everything is easy in Python.
    ● We can easily calculate the triangular distance matrix and create clusters in ~20 lines of Python.
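
    A sketch of this step with SciPy's hierarchical clustering; the three-phone inventory, distances and threshold are illustrative assumptions:

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import squareform

        phones = ["b", "p", "m"]  # stand-in for a full language inventory

        # toy symmetric pairwise distances (e.g. from the feature distance above)
        dist = np.array([
            [0.0, 2.0, 1.0],
            [2.0, 0.0, 3.0],
            [1.0, 3.0, 0.0],
        ])

        condensed = squareform(dist)                 # condensed triangular form
        tree = linkage(condensed, method="average")  # agglomerative clustering
        labels = fcluster(tree, t=2.0, criterion="distance")  # cut at a threshold

        print(dict(zip(phones, labels)))  # e.g. {'b': 1, 'p': 2, 'm': 1}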

  40. Phonex (Phone Clusters)
    ● Phonetic Indexing by PyPhone
    ● Inspired by SoundEx (developed 1918)
    ● Clusters of sounds
    ● Automatically generated for any language set.

  41. Phoneticization Code + Phonex
    ‘hello’ → ‘_h_e_l_l_o’ → ‘$h$ɛ$l$l$oʊ’ → (‘h’, ‘ɛ’, ‘l’, ‘oʊ’) → (19, 1, 18, 3)

  42. Candidate Generation
    Similar to a reduced Levenshtein, with only deletion and insertion.
    Substitutions are handled by our clustering already!
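
    A sketch of this reduced generation over cluster-id tuples: only deletions and insertions, since clustering already absorbs substitutions. The cluster inventory here is a hypothetical stand-in:

        CLUSTER_IDS = range(1, 21)  # hypothetical Phonex cluster inventory

        def one_edit(clusters: tuple) -> set:
            """All cluster sequences one deletion or one insertion away."""
            results = set()
            for i in range(len(clusters)):                # deletions
                results.add(clusters[:i] + clusters[i + 1:])
            for i in range(len(clusters) + 1):            # insertions
                for c in CLUSTER_IDS:
                    results.add(clusters[:i] + (c,) + clusters[i:])
            return results

        print(len(one_edit((19, 1, 18, 3))))  # ~100 variants, vs ~800 for letters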

  43. Fuzzy Search Algorithm Overview
    ‘hello’ → Tokenization → Phoneticization: ‘_h_e_l_l_o’ → ‘$h$ɛ$l$l$oʊ’
    → Strip Repeats: (‘h’, ‘ɛ’, ‘l’, ‘oʊ’) → Clustering: (19, 1, 18, 3)
    → Insertion: (19, 1, 18, 3, 5) / Deletion: (1, 18, 3)
    → Dictionary Lookup → Repeat
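
    An end-to-end sketch of this loop, tying the pieces together. Every helper is a simplified stand-in for the versions sketched earlier; the letter-to-cluster mapping and dictionary are toy assumptions:

        from collections import defaultdict

        def phonex(word: str) -> tuple:
            """Stand-in for phoneticize + strip repeats + cluster lookup."""
            CLUSTERS = {"h": 19, "e": 1, "l": 18, "o": 3, "u": 3}  # toy mapping
            phones = [c for c in word.lower() if c in CLUSTERS]
            deduped = [p for i, p in enumerate(phones) if i == 0 or p != phones[i - 1]]
            return tuple(CLUSTERS[p] for p in deduped)

        def one_edit(clusters: tuple) -> set:
            out = {clusters[:i] + clusters[i + 1:] for i in range(len(clusters))}
            for i in range(len(clusters) + 1):
                for c in range(1, 21):
                    out.add(clusters[:i] + (c,) + clusters[i:])
            return out

        # build the index once over the dictionary
        index = defaultdict(set)
        for word in ["hello", "hull", "yellow"]:
            index[phonex(word)].add(word)

        def fuzzy_search(query: str, max_edits: int = 2) -> set:
            frontier = {phonex(query)}
            for _ in range(max_edits + 1):
                hits = set().union(*(index.get(key, set()) for key in frontier))
                if hits:
                    return hits                    # dictionary lookup succeeded
                frontier = set().union(*(one_edit(key) for key in frontier))  # repeat
            return set()

        print(fuzzy_search("helu"))  # {'hello'}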

  44. Final Code
    30 Lines of Code
    + PyPhone library

  45. Comparing Results

    Algorithm             | Norvig Levenshtein Spell Check | Fast Fuzzy Search
    Errors Corrected      | 71%                            | 95%
    Average Query Time    | 33 ms                          | 25 ms
    Worst-Case Query Time | 670 ms                         | 300 ms

    ● 34% more errors caught, taking us to 95%.
    ● Average query time reduced by 25%.
    ● Worst-case query time reduced by over 50%.
    ● Absolute accuracy at 99% of all queries.

  46. Combining with Other Tools
    ● Spell checking enhances your tooling; it doesn’t replace it.
    ● Can do Norvig + fast fuzzy phonetic search: one handles keyboard errors well, the other handles not knowing how to spell.
    ● You can combine this with traditional tools like Elasticsearch or indexed DB searches.
    ● Use spell checking as a pre-processing step in NLP to boost performance.
    ● We can do cross-language search with phonetic fuzzy search.
    ● We can recover from errors in speech recognition, especially when we are dealing with closed vocabularies.

  47. Recap
    ● People are bad at spelling.
    ● Spelling errors break products!
    ● Spelling variants are many, but they have a predictable (non-random!) distribution.
    ● Most spelling errors are phonetic errors.
    ● It’s possible to build a powerful spell corrector / fuzzy searcher in only a few lines of Python.
    ● Fast fuzzy search can boost the accuracy of existing tools like indexed DB searches, Elasticsearch, NLP, and speech recognition.
    ● You can cross language barriers using a phonetic search technique.

  48. Fork it on Github!
    ● PyPhone: https://github.com/lingz/PyPhone
    ● fast_fuzzy_search: https://github.com/lingz/fast_fuzzy_search
    All the code from today is open source on Github under the MIT License.

  49. Questions?
    [email protected]
