Building a Fast Fuzzy Searcher and Spell Checker

Ling Zhang (Software Engineer @ Aiden.ai) @ Moscow Python Conf 2017
"Spelling is hard, really hard. It's an everyday user frustration to try to search for a friend's name or the name of a restaurant that they heard but end up writing it wrong. In this talk, we will cover a python implementation of a single fast algorithm that can recover from spelling errors, typing errors, and even transliteration mistakes! We will also integrate this with a language model to make it context aware. With this technique, you can build powerful fuzzy text searchers and spell checkers".
Video: https://conf.python.ru/building-fast-fuzzy-searcher-and-spell-checker/

Moscow Python Meetup

October 20, 2017

Transcript

  1. Why Does Spell Check Matter?
     • Misspelled searches = product failure
       ◦ Can’t find a restaurant on Yelp
       ◦ Can’t find a friend on Facebook
       ◦ Can’t find a shop on Google Maps
       ◦ Can’t find a song on Spotify
     • Necessary for machine learning / NLP
       ◦ Pre-processing step for NLP
       ◦ Sentiment analysis, intent detection, entity extraction
     • Speech recognition disambiguation
     • Chatbots
  2. Error Distribution
     • 23% of queries were spelled wrong.
     • So 1 in 4 uses of our app will result in failure.
     • There is a long tail of rare errors.
  3. Distribution Within Errors
     Queries Captured | Spelling Variations
     90%              | 18
     95%              | 40
     99%              | 196
     99.9%            | 519
     • Improving accuracy by ~10% requires roughly 10x more variations.
  4. Top Spelling Variations
     Rank | Spelling
     1    | britney spears
     2    | brittany spears
     3    | brittney spears
     4    | britany spears
     5    | britny spears
     6    | briteny spears
     7    | britteny spears
     8    | briney spears
     9    | brittny spears
  5. Levenshtein Distance
     • Count the total edits required to go from one string to another:
       ◦ Insertion
       ◦ Deletion
       ◦ Substitution
     • Example: the Levenshtein distance between “elllu” and “hello” is 3.
       Str 1 |     -     | E | L | L |    L     |      U       |
       Str 2 |     H     | E | L | L |    -     |      O       |
       Edit  | Insertion | - | - | - | Deletion | Substitution |
       Cost  |     1     | 0 | 0 | 0 |    1     |      1       | Total: 3
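For concreteness, here is a standard dynamic-programming implementation of this distance (a textbook version, not code from the talk):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))    # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("elllu", "hello") == 3   # the slide's example
```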
  6. Peter Norvig’s Spelling Correction Algorithm
     • Check whether the word is in the dictionary; if so, return it.
     • Generate all words 1 Levenshtein distance away. If any of these are in the dictionary, return them.
     • Generate all words 2 Levenshtein distances away. If any of these are in the dictionary, return them.
     • If there is more than one candidate at a step, take the one with the highest score (scoring has an open definition).
     • Norvig’s edit set extends plain Levenshtein by also allowing transpositions.
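A sketch of the generator at the heart of this approach, closely following Norvig's published corrector (norvig.com/spell-correct.html); `WORDS` is assumed to be a Counter of word frequencies built from a corpus, with frequency serving as the score.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word, WORDS: Counter):
    """Most frequent known word within edit distance 0, then 1, then 2."""
    known = lambda ws: {w for w in ws if w in WORDS}
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS[w])
```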
  7. Is Levenshtein Good Enough?
     Rank | Spelling        | Norvig Distance
     1    | britney spears  | 0
     2    | brittany spears | 3
     3    | brittney spears | 1
     4    | britany spears  | 2
     5    | britny spears   | 1
     6    | briteny spears  | 1
     7    | britteny spears | 2
     8    | briney spears   | 1
     9    | brittny spears  | 2
     Total proportion caught: 94% (70% of errors).
  8. Balancing Speed and Accuracy
     Norvig Distance | Variants Generated | Correction Accuracy | Worst Case Execution Time
     0               | 1                  | 0%                  | < 1 ms
     1               | ~800               | 54%                 | 1.4 ms
     2               | ~600,000           | 70%                 | 320 ms
     3               | ~57,000,000        | Unknown             | 397,000 ms (6.5 minutes)
     Conclusion:
     • Accuracy is not great, yet we cannot scale beyond distance 2.
     • Still too slow: spell checking a 200-word Facebook post at distance 2 takes 7 seconds (35 ms per word on average).
     • We generate 600,000 variants just to find the 40 variants actually in use.
  9. We need a smarter way to get from 70% to 95%. And it needs to be faster.
  10. Top Spelling Errors Analyzed
      Spelling | Error
      brittany | Vowel added
      brittney | Phonetically correct
      britany  | Vowel added
      britny   | Phonetically correct
      briteny  | Vowel added
      britteny | Vowel added
      briney   | Vowel added
      brittny  | Phonetically correct
      ritany   | Consonant removed
      • Errors are not randomly distributed; keyboard errors are in fact rare.
      • Most errors are just one phonetic difference away.
      • There are multiple ways to spell the same phonetic sound (“ney”, “ny”).
  11. A Phonetic Point of View
      • The real error distribution is very far from the uniform one assumed by Levenshtein / Norvig.
      • So work in phonetic space, dealing with sounds.
      • Model the word as a sequence of consonants and vowels, and do insertions / deletions / substitutions on these.
      • Give each insertion / deletion / substitution a score:
        ◦ Insertion: vowels are more likely to be added.
        ◦ Deletion: vowels are more likely to be deleted.
        ◦ Substitution: vowels are all somewhat similar; certain consonants are similar, e.g. (s, sh) and (b, p), but others are very different (sh, p).
  12. Phoneticization
      • We will use the International Phonetic Alphabet (IPA).
      • It is a standard alphabet of sounds, independent of language. Each symbol is a phone.
      • English is a mixed language, so its phoneticization sometimes seems illogical:
        English | h | e | ll | o
        IPA     | h | ə | l  | ʊ
  13. Obtaining a Mapping
      • We could use deep learning (I’ve tried):
        ◦ It doesn’t work that well.
        ◦ It’s slow to execute.
      • Instead, we can do a simple letter mapping.
      • Wikipedia has IPA mappings for most languages.
      • English is a hard language to map; many other languages have straightforward mappings (Spanish, Arabic).
      • Being approximately correct is good enough.
      • There are also ready-made phonetic dictionaries: https://github.com/cmusphinx/cmudict (a loading sketch follows below).
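For example, a minimal sketch of loading cmudict into a word-to-pronunciations map. The file name and comment conventions are assumptions about the repository's plain-text format, and the phones are ARPAbet symbols rather than IPA:

```python
def load_cmudict(path="cmudict.dict"):
    """word -> list of pronunciations, each a list of ARPAbet phone strings."""
    pronunciations = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split(" #")[0].strip()      # drop inline comments
            if not line or line.startswith(";;;"):  # skip header comments
                continue
            word, phones = line.split(None, 1)      # "hello HH AH0 L OW1"
            word = word.split("(")[0]               # "hello(2)" -> "hello"
            pronunciations.setdefault(word, []).append(phones.split())
    return pronunciations

# load_cmudict()["hello"] would give [["HH", "AH0", "L", "OW1"], ...]
```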
  14. English Mapping
      • The English -> IPA mapping fits in ~70 rules.
      • It is a many-to-many mapping.
      • Vowels are particularly tricky in English.
  15. Phoneticization Code
      ‘hello’ -> ‘_h_e_l_l_o’ -> ‘$h$ɛ$l$l$oʊ’ -> (‘h’, ‘ɛ’, ‘l’, ‘oʊ’)
      • Many variants of the phoneticization are produced.
      • Some are incorrect, but that doesn’t matter: they are unlikely to match any word.
      • Incorrect variants can even be a positive: they let us correct the spelling when the user has read the pronunciation of the word wrong. (A sketch follows below.)
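A minimal sketch of this pipeline. The rule table is a tiny illustrative subset of the ~70 many-to-many rules mentioned on the previous slide; the specific IPA choices here are assumptions, not the talk's table.

```python
from itertools import product

RULES = {   # grapheme -> possible IPA phones (illustrative subset)
    "h": ["h"],
    "e": ["ɛ", "ə", "i"],
    "l": ["l"],
    "ll": ["l"],
    "o": ["oʊ", "ɒ", "ʊ"],
}

def segment(word):
    """Greedily split a word into known graphemes, longest match first."""
    chunks, i = [], 0
    while i < len(word):
        for size in (2, 1):
            piece = word[i:i + size]
            if piece in RULES:
                chunks.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return chunks

def phoneticize(word):
    """Return every phone tuple the rule table allows for this word."""
    return list(product(*(RULES[c] for c in segment(word))))

# phoneticize("hello") yields 9 variants, including ('h', 'ɛ', 'l', 'oʊ');
# most of the others will simply never match a dictionary entry.
```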
  16. Generating Correction Candidates
      • Goal: generate variations of a pronunciation, each with a numeric distance score.
      • This includes insertions, deletions and substitutions.
      • The cost varies per phone.
        (‘h’, ‘ɛ’, ‘l’, ‘o’)
        (‘h’, ‘ɛ’, ‘l’, ‘o’, ‘s’)   Insertion
        (‘ɛ’, ‘l’, ‘o’)             Deletion
        (‘h’, ‘a’, ‘l’, ‘o’)        Substitution
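A minimal sketch of scored one-edit candidate generation over phone tuples. The cost numbers and the VOWELS set are illustrative assumptions (vowels are cheaper to insert or delete), not the talk's actual weights.

```python
VOWELS = {"a", "ɛ", "ə", "i", "o", "oʊ", "ʊ"}

def edit_cost(phone):
    """Inserting or deleting a vowel is cheaper than a consonant."""
    return 0.5 if phone in VOWELS else 1.0

def sub_cost(p, q):
    """Stand-in for a real phone-similarity function (see the next slides)."""
    return 0.4 if p in VOWELS and q in VOWELS else 1.0

def one_edit_variants(phones, alphabet):
    """Yield (variant, cost) pairs one weighted edit away from `phones`."""
    for i in range(len(phones) + 1):            # insertions
        for p in alphabet:
            yield phones[:i] + (p,) + phones[i:], edit_cost(p)
    for i in range(len(phones)):                # deletions
        yield phones[:i] + phones[i + 1:], edit_cost(phones[i])
    for i, old in enumerate(phones):            # substitutions
        for p in alphabet:
            if p != old:
                yield phones[:i] + (p,) + phones[i + 1:], sub_cost(old, p)

# one_edit_variants(('h', 'ɛ', 'l', 'o'), alphabet) yields, among others,
# (('h', 'ɛ', 'l', 'o', 's'), 1.0), (('ɛ', 'l', 'o'), 1.0) and
# (('h', 'a', 'l', 'o'), 0.4): the slide's three examples.
```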
  17. Phone Features in Python
      • Many distance metrics exist in the research literature; this is just one of them.
      • Every phone is a ternary feature vector (True, False, None).
  18. Distance Algorithm
      • The distance between two phones is just the number of features that don’t match.
      • Some features are more important than others, so they contribute different weights to the distance. (A sketch covering both slides follows below.)
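A minimal sketch of slides 17 and 18 together: phones as ternary feature vectors and a weighted mismatch count. The feature inventory, values, and weights are illustrative assumptions, not PyPhone's actual tables.

```python
# Each phone is a ternary feature vector: True / False / None, where None
# means the feature does not apply to that phone.
FEATURES = {
    "b": dict(voiced=True,  nasal=False, vowel=False, rounded=None),
    "p": dict(voiced=False, nasal=False, vowel=False, rounded=None),
    "m": dict(voiced=True,  nasal=True,  vowel=False, rounded=None),
    "ʊ": dict(voiced=True,  nasal=False, vowel=True,  rounded=True),
}

# Some features matter more than others.
WEIGHTS = {"voiced": 0.5, "nasal": 1.0, "vowel": 2.0, "rounded": 0.5}

def phone_distance(p, q):
    """Weighted count of features on which two phones disagree."""
    fp, fq = FEATURES[p], FEATURES[q]
    return sum(
        WEIGHTS[name]
        for name in WEIGHTS
        if fp[name] is not None
        and fq[name] is not None
        and fp[name] != fq[name]
    )

assert phone_distance("b", "p") == 0.5   # differ only in voicing: similar
assert phone_distance("b", "ʊ") == 2.0   # consonant vs vowel: far apart
```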
  19. Option: Levenshtein with Phones
      • We could do a weighted Levenshtein distance with phones instead of letters.
      • Advantage: we have ways to estimate how expensive inserting or deleting a phone is, and also the similarity between two phones for substitution.
  20. Problem: Too Slow!
      • This can give good results, but it is too slow!
      • Weighted Levenshtein requires us to sort candidates by cost, and sorting is slow.
      • With ~50 phones and a 6-letter word, the number of variants explodes:
        # Edits | Variations
        1       | ~600
        2       | ~360,000
        3       | ~216,000,000
        (At one edit: roughly 7 insertion positions x 50 phones, plus 6 deletions, plus 6 x 49 substitutions, about 650 variants.)
      • Computing the costs and sorting to find the best results is too slow, much slower than Norvig’s algorithm.
  21. Solution: Cluster the Phones
      • There exist natural groupings of similar phones.
      • We can collapse each group into a single symbol.
      • Bonus: if we only allow substitutions within a cluster, a substitution never changes the cluster symbol. We remove half of our variations and no longer need to consider substitution edits at all.
  22. Idea: Clustering Algorithm
      • Create a distance matrix between all phones in a language using PyPhone.
      • Cluster the phones into groups using this distance matrix.
  23. SciPy to the Rescue
      • Everything is easy in Python.
      • We can calculate the triangular distance matrix and create the clusters in ~20 lines of Python (see the sketch below).
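A minimal sketch of the clustering step with SciPy's hierarchical clustering, reusing phone_distance() and the toy phone list from the previous sketch; the 1.0 cut threshold is an illustrative assumption.

```python
from itertools import combinations
from scipy.cluster.hierarchy import fcluster, linkage

phones = ["b", "p", "m", "ʊ"]

# Condensed (upper-triangular) distance matrix, in the order SciPy expects.
condensed = [phone_distance(p, q) for p, q in combinations(phones, 2)]

# Agglomerative clustering, then cut the tree at a maximum merge distance
# to obtain flat cluster ids.
tree = linkage(condensed, method="average")
cluster_ids = fcluster(tree, t=1.0, criterion="distance")

clusters = dict(zip(phones, cluster_ids))
# With the toy tables above, 'b' and 'p' end up sharing a cluster id,
# while 'm' and 'ʊ' each get their own.
```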
  24. Phonex (Phone Clusters)
      • Phonetic indexing by PyPhone.
      • Inspired by Soundex (developed in 1918).
      • Clusters of sounds.
      • Automatically generated for any language set.
  25. Candidate Generation
      • Similar to a reduced Levenshtein, with only deletion and insertion.
      • Substitutions are already handled by our clustering!
  26. Fuzzy Search Algorithm Overview
      Tokenization:    ‘hello’
      Phoneticization: ‘_h_e_l_l_o’ -> ‘$h$ɛ$l$l$oʊ’ -> (‘h’, ‘ɛ’, ‘l’, ‘oʊ’)
      Strip Repeats + Clustering: (19, 1, 18, 3)
      Insertion: (19, 1, 18, 3, 5)    Deletion: (1, 18, 3)
      Dictionary Lookup, then Repeat with more edits if nothing matched.
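Tying the stages together: a minimal end-to-end sketch of this loop, reusing phoneticize() and the clusters mapping from the earlier sketches. The index layout and all names here are illustrative assumptions, not the talk's actual API.

```python
from itertools import groupby

def to_key(phones, clusters):
    """Map phones to cluster ids and strip runs of repeated ids."""
    ids = [clusters[p] for p in phones]
    return tuple(cid for cid, _ in groupby(ids))

def build_index(vocabulary, clusters):
    """Index every phoneticization of every word by its cluster-id key."""
    index = {}
    for word in vocabulary:
        for phones in phoneticize(word):
            index.setdefault(to_key(phones, clusters), set()).add(word)
    return index

def one_edit_keys(key, alphabet):
    """Keys one insertion or deletion away (substitution is free in-cluster)."""
    for i in range(len(key) + 1):
        for cid in alphabet:
            yield key[:i] + (cid,) + key[i:]   # insertion
    for i in range(len(key)):
        yield key[:i] + key[i + 1:]            # deletion

def fuzzy_lookup(word, index, clusters, alphabet, max_edits=2):
    """Widen the search one edit at a time until a dictionary key matches."""
    keys = {to_key(p, clusters) for p in phoneticize(word)}
    for _ in range(max_edits + 1):
        matches = set().union(*(index.get(k, set()) for k in keys))
        if matches:
            return matches
        keys = {k2 for k in keys for k2 in one_edit_keys(k, alphabet)}
    return set()
```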
  27. Comparing Results
      Algorithm             | Norvig Levenshtein Spell Check | Fast Fuzzy Search
      Errors Corrected      | 71%                            | 95%
      Average Query Time    | 33 ms                          | 25 ms
      Worst Case Query Time | 670 ms                         | 300 ms
      • 34% more errors caught, taking us to 95%.
      • Average query time reduced by 25%.
      • Worst case query time reduced by over 50%.
      • Overall, 99% of all queries end up handled correctly.
  28. Combining with Other Tools
      • Spell checking enhances your tooling; it doesn’t replace it.
      • You can run Norvig + fast fuzzy phonetic search: one handles keyboard errors well, the other handles not knowing how to spell.
      • You can combine this with traditional tools like Elasticsearch or indexed DB searches.
      • Use spell checking as a pre-processing step in NLP to boost performance.
      • We can do cross-language search with phonetic fuzzy search.
      • We can recover from errors in speech recognition, especially when dealing with closed vocabularies.
  29. Recap
      • People are bad at spelling.
      • Spelling errors break products!
      • Spelling variants are many, but they follow a predictable (and far from random!) distribution.
      • Most spelling errors are phonetic errors.
      • It’s possible to build a powerful spell corrector / fuzzy searcher in only a few lines of Python.
      • Fast fuzzy search can boost the accuracy of existing tools like indexed DB searches, Elasticsearch, NLP pipelines, and speech recognition.
      • You can cross language barriers using a phonetic search technique.