Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Speaker Deck
PRO
Sign in
Sign up
for free
Going The Distance
Richard Schneeman
September 22, 2014
Programming
3
12k
Going The Distance
Levenshtein Distance and the beauty of algorithms
Richard Schneeman
September 22, 2014
Tweet
Share
More Decks by Richard Schneeman
See All by Richard Schneeman
schneems
0
260
schneems
0
1.8k
schneems
1
190
schneems
2
360
schneems
8
14k
schneems
1
100
schneems
20
28k
schneems
7
260
schneems
2
400
Other Decks in Programming
See All in Programming
nishigori
2
1.5k
startjapan
4
480
mlopskr
0
930
fkubota
12
7.3k
ktgrstsh
1
260
masuda220
9
3k
star_zero
0
120
kampersanda
28
9.5k
wisetlaloc
0
240
s_uryu
0
470
mlopskr
0
310
kamilahsantos
1
460
Featured
See All Featured
eitanlees
116
10k
jakevdp
776
200k
scottboms
252
11k
3n
163
22k
jnunemaker
PRO
42
4.8k
chriscoyier
683
180k
62gerente
587
210k
jeffersonlam
332
16k
bermonpainter
344
27k
schacon
147
6.7k
lara
172
10k
danielanewman
3
650
Transcript
Going The Distance @schneems
They Call me @Schneems
Ruby Schneems
Ruby Python
None
Code Triage .com
Challenge: Comment on an issue
Docs Doctor .org
None
Top 50 Rails Contrib
Cats
None
Can you keep Ruby weird?
%CP%CP
'XGT[QPG%CP%CP
Thank You!
Algorithms:
I went to Georgia Tech for…
Mechanical Engineering
Self taught Programmer ~8 years
CS is boring to me
Programming is interesting
Building programs that accomplish tasks
But…
Those CS students are on to something
Algorithms
Are
Beautiful
<3
Algorithms solve problems
Introducting A Problem…
Spelling
When you are tired, or distracted spelling becomes harder
$ git commmit -m first WARNING: You called a Git
command named 'commmit', which does not exist. Continuing under the assumption that you meant 'commit' in 0.1 seconds automatically...
How does Git know?
Introducing: Edge Distance
The “cost” to change one word to another
Less “cost” means less more likely match
Cost of? zat => bat
Cost of? zat => bat 1
Cost of? zzat => bat
Cost of? bat 2 zzat =>
How do we code it though?
My First Attempt
def distance(str1, str2) cost = 0 str1.each_char.with_index do |char, index|
cost += 1 if str2[index] != char end cost end zat bat =>
def distance(str1, str2) cost = 0 str1.each_char.with_index do |char, index|
cost += 1 if str2[index] != char end cost end zat bat => Cost = 1
Perfect?
saturday sunday Cost = ?
saturday sunday Cost = 7
Wat?
Turns out I almost recreated
Hamming Distance
Hamming AKA: Signal Distance
Measures: the errors introduced in a string Hamming
Only valid for strings of same length Hamming
Good for: Detecting and correcting errors in binary and telecommunications
Hamming
Bad for: mis- spelled words Hamming
Does not include: Insertion Deletion Hamming
Introducing: An Algorithm!
Introducing: Levenshtein Distance
How do we calculate deletion?
distance("schneems", "zschneems")
distance("schneems", "zschneems") Match! deletion
str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] deletion
How do we calculate insertion?
distance("schneems", "chneems")
distance("schneems", "chneems") Match! insertion
str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 insertion
substitution?
distance("zchneems", "schneems") Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] Substitution
How do we calculate Distance?
Pretend we have a distance() method
Strings of different lengths
distance(“”, “foo”) # => 3 distance(“foo”, “”) # => 3
return str2.length if str1.empty? return str1.length if str2.empty? Different Lengths
Calculate distance between every substring
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution Calculate costs
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution cost = 1 + [l1,l2,l3].min Take minimum
l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2)
# insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution cost = 1 + [l1,l2,l3].min Why did we add one?
Pick the cheapest operation, then add one
What about when characters match?
distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] Match str1 = “saturday”
str2 = “sunday”
Our powers combined, we form!
None
`
Recursive Levenshtein Distance!
def distance(str1, str2) # Different lengths return str2.length if str1.empty?
return str1.length if str2.empty? ! return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match l1 = distance(str1, str2[1..-1]) # deletion l2 = distance(str1[1..-1], str2) # insertion l3 = distance(str1[1..-1], str2[1..-1]) # substitution return 1 + [l1,l2,l3].min # increment cost end Recursive
distance(“saturday”, “sunday”) # => 3 Recursive Levenshtein
Much better
What does that look like?
None
github.com/ schneems/ going_the_distance
Hmm…
“Dirty Distance” took 8 comparisons
“Recursive” took 1647 comparisons
No trophy, no flowers, no flashbulbs, no wine,
Ouch
Can we do better?
If you watch the recursive algorithm closely, you notice repeats
Maybe we can store substring distance and use to calculate
total distance
I want you to join the club
A members only club
Matrix: Levenshtein Distance
Matrix: Levenshtein Distance
“” => “saturday” Cost?
+---+---+ | | S | +---+---+ | | 1 |
+---+---+ Matrix
+---+---+---+ | | S | A | +---+---+---+ | |
1 | 2 | +---+---+---+ Matrix
+---+---+---+---+ | | S | A | T | +---+---+---+---+
| | 1 | 2 | 3 | +---+---+---+---+ Matrix
+---+---+---+---+---+ | | S | A | T | U
| +---+---+---+---+---+ | | 1 | 2 | 3 | 4 | +---+---+---+---+---+ Matrix
+---+---+---+---+---+---+ | | S | A | T | U
| R | +---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | +---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | +---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | A | +---+---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+ Matrix
+---+---+---+---+---+---+---+---+---+ | | S | A | T | U
| R | D | A | Y | +---+---+---+---+---+---+---+---+---+ | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+ Matrix
“sunday” => “” Cost?
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ Matrix
+---+---+ | | | +---+---+ | | 0 | +---+---+
| S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ | Y | 6 | +---+---+ Matrix
Now, fill in the matrix
+---+---+---+---+---+---+---+---+---+---+ | | | S | A | T |
U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | +---+---+ | U | 2 | +---+---+ | N | 3 | +---+---+ | D | 4 | +---+---+ | A | 5 | +---+---+ | Y | 6 | +---+---+ Matrix
Break it down
How much does it cost to change “s” into “s”?
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ Match! Cost = 0
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! Cost = ?
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ Insertion! Cost = 1
How do we calculate insertion programmatically?
str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 Insertion!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] Insertion!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
+ cost of change (+1)
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] + 1
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ Insertion! Cost = 1 str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] + 1
Keep Going
+---+---+---+---+---+---+---+---+---+---+ | | | S | A | T |
U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ Insertion(s)
Next Char “su” => “s”
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Action?
Change “su” to “s” is a deletion. How do we
calculate?
Deletion
str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] deletion
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | 1 | +---+---+---+ Deletion Cost = 1 str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
- Insertion - Deletion - Substitution
- Insertion - Deletion - Substitution
Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] Substitution!
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index
- 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
+---+---+---+---+ | | | S | A | +---+---+---+---+ |
| 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | 1 | +---+---+---+---+ Substitution Cost = 1 str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1] + 1
- Insertion - Deletion - Substitution
What about match?
+---+---+---+ | | | S | +---+---+---+ | | 0
| 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ Match! str1 = “schneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
Why?
If the current character matches, cost is to change previous
character to previous sub string
i.e. changing “” to “” +---+---+---+ | | | S
| +---+---+---+ | | 0 | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+
Now What?
Algorithm str2.each_char.each_with_index do |char1,i| str1.each_char.each_with_index do |char2, j| if char1
== char2 puts [:skip, matrix[i][j]].inspect matrix[i + 1 ][j + 1 ] = matrix[i][j] else actions = { deletion: matrix[i][j + 1] + 1, insert: matrix[i + 1][j] + 1, substitution: matrix[i][j] + 1 } action = actions.sort {|(k,v), (k2, v2)| v <=> v2 }.first puts action.inspect matrix[i + 1 ][j + 1 ] = action.last end each_step.call(matrix) if each_step end end
Iterate!
None
Final Cost +---+---+---+---+---+---+---+---+---+---+ | | | S | A |
T | U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | +---+---+---+---+---+---+---+---+---+---+ | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | +---+---+---+---+---+---+---+---+---+---+ | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | +---+---+---+---+---+---+---+---+---+---+ 3
We can also get cost of sub strings
“sun” => “sat”
Final Cost +---+---+---+---+---+---+---+---+---+---+ | | | S | A |
T | U | R | D | A | Y | +---+---+---+---+---+---+---+---+---+---+ | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | +---+---+---+---+---+---+---+---+---+---+ | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | +---+---+---+---+---+---+---+---+---+---+ | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | +---+---+---+---+---+---+---+---+---+---+ | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 | +---+---+---+---+---+---+---+---+---+---+ | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 | +---+---+---+---+---+---+---+---+---+---+ | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 | +---+---+---+---+---+---+---+---+---+---+ 2
Better than Recursive?
As they speed through the finish, the flags go down.
48 iterations
Wayyyyyy better than 1647
bit.ly/ going_the_distance
My Problem
I am human
I get tired
Machines don’t understand tired
One day I tried typing
$ rails generate migration
But accidentally typed
$ rails generate migratoon
ERROR
None
Stress is increased when we fail at simple tasks
Why? It’s not hard
Why can’t my software be more like git?
We know what you’re trying to accomplish. Let’s help you
out
None
When you have ERROR
Compare given command to possible commands
Recommend smallest distance.
Google:
Google:
Read A lot of words from real books
~1+ million words
Count Each word
Higher count, higher probability
Get edit distance between input and dictionary
Lower edit, higher probability
Show Suggestion
None
Cache correct spelling suggestions
did_you_mean gem
None
More distance measurements
Levenshtein Distance
Hamming Distance
longest common subsequence Distance
Manhattan Distance
Tree Distance
Jaro-Winkler Distance
Many Many More
Algorithms are Awesome
Where to go next?
I want to learn more about algorithms?
Wikipedia!
Rosetta code
Give an algorithm talk
Everyone suggests a new Algorithm!
Algorithms are a way of sharing knowledge
Expand your knowledge Explore Algorithms
Antepenultimate Slide
YAY!
Questions @schneems
Going The Distance