Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Going The Distance
Search
Richard Schneeman
September 22, 2014
Programming
3
13k
Going The Distance
Levenshtein Distance and the beauty of algorithms
Richard Schneeman
September 22, 2014
Tweet
Share
More Decks by Richard Schneeman
See All by Richard Schneeman
[RubyConf] Beware the Dreaded Dead End
schneems
0
230
[Kaigi] Beware the Dead End
schneems
0
81
Threads Aren't Evil
schneems
0
370
Bayes is BAE
schneems
0
2.9k
Testing the Untestable
schneems
1
490
SLOMO
schneems
2
730
Saving Sprockets
schneems
8
16k
Memory Leaks, Tweaks, and Techniques
schneems
1
150
Speed Science
schneems
20
34k
Other Decks in Programming
See All in Programming
RustでAWS Lambda functionをいい感じに書く
taiki45
1
180
AWS Infrastructure as Code の新機能を総まとめ！ 2023.1-2024.5
konokenj
7
1.8k
TypeScriptとGraphQLで実現する 型安全なAPI実装 / TSKaigi 2024
hokaccha
5
3k
Runtime Objects in Rust
mitsuhiko
0
230
Long journey of Ruby standard library RubyKaigi 2024
andpad
3
2k
Namespace, What and Why
tagomoris
8
5.3k
How to implement a RubyVM with PHP?
memory1994
PRO
3
1.3k
CQRS+ES解体新書 / CQRS ES Disassembly Book
nrslib
7
610
20240525_社内でPower Platform勉強会開いたら900人来た話
ponponmikankan
0
180
Using "modern" Ruby to build a better, faster Homebrew
mikemcquaid
3
500
Getting along with YAML comments with Psych
qnighy
2
430
Hi, have you met Kotlin Multiplatform? | Kotlin User Group Berlin
prof18
0
130
Featured
See All Featured
Mobile First: as difficult as doing things right
swwweet
217
8.6k
Build your cross-platform service in a week with App Engine
jlugia
227
17k
Embracing the Ebb and Flow
colly
80
4.2k
Design by the Numbers
sachag
274
18k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
22
1.4k
The Illustrated Children's Guide to Kubernetes
chrisshort
32
47k
VelocityConf: Rendering Performance Case Studies
addyosmani
321
23k
StorybookのUI Testing Handbookを読んだ
zakiyama
13
4.7k
Imperfection Machines: The Place of Print at Facebook
scottboms
261
12k
Pencils Down: Stop Designing & Start Developing
hursman
117
11k
Building Adaptive Systems
keathley
32
1.9k
How to name files
jennybc
65
94k
Transcript
Going The Distance @schneems
They Call me @Schneems
Ruby Schneems
Ruby Python
None
Code Triage .com
Challenge: Comment on an issue
Docs Doctor .org
None
Top 50 Rails Contrib
Cats
None
Can you keep Ruby weird?
%CP%CP
'XGT[QPG%CP%CP
Thank You!
Algorithms:
I went to Georgia Tech for…
Mechanical Engineering
Self taught Programmer ~8 years
CS is boring to me
Programming is interesting
Building programs that accomplish tasks
But…
Those CS students are on to something
Algorithms
Are
Beautiful
<3
Algorithms solve problems
Introducting A Problem…
Spelling
When you are tired, or distracted spelling becomes harder
$ git commmit -m first WARNING: You called a Git
command named 'commmit', which does not exist. Continuing under the assumption that you meant 'commit' in 0.1 seconds automatically...
How does Git know?
Introducing: Edge Distance
The “cost” to change one word to another
Less “cost” means less more likely match
Cost of? zat => bat
Cost of? zat => bat 1
Cost of? zzat => bat
Cost of? bat 2 zzat =>
How do we code it though?
My First Attempt
def distance(str1, str2) cost = 0 str1.each_char.with_index do |char, index|
cost += 1 if str2[index] != char end cost end zat bat =>
def distance(str1, str2) cost = 0 str1.each_char.with_index do |char, index|
cost += 1 if str2[index] != char end cost end zat bat => Cost = 1
Perfect?
saturday sunday Cost = ?
saturday sunday Cost = 7
Wat?
Turns out I almost recreated
Hamming Distance
Hamming AKA: Signal Distance
Measures: the errors introduced in a string Hamming
Only valid for strings of same length Hamming
Good for: Detecting and correcting errors in binary and telecommunications
Hamming
Bad for: mis- spelled words Hamming
Does not include: Insertion Deletion Hamming
Introducing: An Algorithm!
Introducing: Levenshtein Distance
How do we calculate deletion?
distance("schneems", "zschneems")
distance("schneems", "zschneems") Match! deletion
str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] deletion
How do we calculate insertion?
distance("schneems", "chneems")
distance("schneems", "chneems") Match! insertion
str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 insertion
substitution?
distance("zchneems", "schneems") Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] Substitution
How do we calculate Distance?
Pretend we have a distance() method
Strings of different lengths
distance(“”, “foo”) # => 3 distance(“foo”, “”) # => 3
return str2.length if str1.empty? return str1.length if str2.empty? Different Lengths
Calculate distance between every substring
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution Calculate costs
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution cost = 1 + [l1,l2,l3].min Take minimum
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution cost = 1 + [l1,l2,l3].min Why did we add one?
Pick the cheapest operation, then add one
What about when characters match?
distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] Match str1 = “saturday”
str2 = “sunday”
Our powers combined, we form!
None
`
Recursive Levenshtein Distance!
def distance(str1, str2) # Different lengths return str2.length if str1.empty?
return str1.length if str2.empty? ! return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2) # insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution return 1 + [l1,l2,l3].min # increment cost end Recursive
distance(“saturday”, “sunday”) # => 3 Recursive Levenshtein
Much better
What does that look like?
None
github.com/ schneems/ going_the_distance
Hmm…
“Dirty Distance” took 8 comparisons
“Recursive” took 1647 comparisons
No trophy, no flowers, no flashbulbs, no wine,
Ouch
Can we do better?
If you watch the recursive algorithm closely, you notice repeats
Maybe we can store substring distance and use to calculate
total distance
I want you to join the club
A members only club
Matrix: Levenshtein Distance
Matrix: Levenshtein Distance
“” => “saturday” Cost?
+---+---+ | | S | +---+---+ | | 1 |
+---+---+ Matrix
+---+---+---+ | | S | A | +---+---+---+ | |
1 | 2 | +---+---+---+ Matrix
+---+---+---+---+ | | S | A | T | +---+---+---+---+
| | 1 | 2 | 3 | +---+---+---+---+ Matrix
+---+---+---+---+---+ | | S | A | T | U
| +---+---+---+---+---+ | | 1 | 2 | 3 | 4 | +---+---+---+---+---+ Matrix
+---+---+---+---+---+---+ | | S | A | T | U
| R | +---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | +---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | +---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | A | +---+---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | A | Y | +---+---+---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+ Matrix
“sunday” => “” Cost?
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ | Y | 6 | +---+---+ Matrix
Now, fill in the matrix
+---+---+---+---+---+---+---+---+---+---+ | | | S | A | T |
U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ | Y | 6 | +---+---+ Matrix
Break it down
How much does it cost to change “s” into “s”?
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ Match! Cost = 0
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! Cost = ?
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ Insertion! Cost = 1
How do we calculate insertion programmatically?
str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 Insertion!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] Insertion!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+ cost of change (+1)
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] + 1
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ Insertion! Cost = 1 str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] + 1
Keep Going
+---+---+---+---+---+---+---+---+---+---+ | | | S | A | T |
U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ Insertion(s)
Next Char “su” => “s”
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Action?
Change “su” to “s” is a deletion. How do we
calculate?
Deletion
str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] deletion
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | 1 | +---+---+---+ Deletion Cost = 1 str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
- Insertion - Deletion - Substitution
- Insertion - Deletion - Substitution
Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] Substitution!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | 1 | +---+---+---+---+ Substitution Cost = 1 str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1] + 1
- Insertion - Deletion - Substitution
What about match?
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ Match! str1 = “schneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
Why?
If the current character matches, cost is to change previous
character to previous sub string
i.e. changing “” to “” +---+---+---+ | | | S
| +---+---+---+ | | 0 | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+
Now What?
Algorithm str2.each_char.each_with_index do |char1,i| str1.each_char.each_with_index do |char2, j| if char1
== char2 puts [:skip, matrix[i][j]].inspect matrix[i + 1 ][j + 1 ] = matrix[i][j] else actions = { deletion: matrix[i][j + 1] + 1, insert: matrix[i + 1][j] + 1, substitution: matrix[i][j] + 1 } action = actions.sort {|(k,v), (k2, v2)| v <=> v2 }.first puts action.inspect matrix[i + 1 ][j + 1 ] = action.last end each_step.call(matrix) if each_step end end
Iterate!
None
Final Cost +---+---+---+---+---+---+---+---+---+---+ | | | S | A |
T | U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | +---+---+---+---+---+---+---+---+---+---+ | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | +---+---+---+---+---+---+---+---+---+---+ | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | +---+---+---+---+---+---+---+---+---+---+ 3
We can also get cost of sub strings
“sun” => “sat”
Final Cost +---+---+---+---+---+---+---+---+---+---+ | | | S | A |
T | U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | +---+---+---+---+---+---+---+---+---+---+ | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | +---+---+---+---+---+---+---+---+---+---+ | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | +---+---+---+---+---+---+---+---+---+---+ 2
Better than Recursive?
As they speed through the finish, the flags go down.
48 iterations
Wayyyyyy better than 1647
bit.ly/ going_the_distance
My Problem
I am human
I get tired
Machines don’t understand tired
One day I tried typing
$ rails generate migration
But accidentally typed
$ rails generate migratoon
ERROR
None
Stress is increased when we fail at simple tasks
Why? It’s not hard
Why can’t my software be more like git?
We know what you’re trying to accomplish. Let’s help you
out
None
When you have ERROR
Compare given command to possible commands
Recommend smallest distance.
Google:
Google:
Read A lot of words from real books
~1+ million words
Count Each word
Higher count, higher probability
Get edit distance between input and dictionary
Lower edit, higher probability
Show Suggestion
None
Cache correct spelling suggestions
did_you_mean gem
None
More distance measurements
Levenshtein Distance
Hamming Distance
longest common subsequence Distance
Manhattan Distance
Tree Distance
Jaro-Winkler Distance
Many Many More
Algorithms are Awesome
Where to go next?
I want to learn more about algorithms?
Wikipedia!
Rosetta code
Give an algorithm talk
Everyone suggests a new Algorithm!
Algorithms are a way of sharing knowledge
Expand your knowledge Explore Algorithms
Antepenultimate Slide
YAY!
Questions @schneems
Going The Distance