Going The Distance
Richard Schneeman
September 22, 2014
Levenshtein Distance and the beauty of algorithms
Transcript
Going The Distance @schneems
They Call me @Schneems
Ruby Schneems
Ruby Python
CodeTriage.com
Challenge: Comment on an issue
DocsDoctor.org
Top 50 Rails Contrib
Cats
Can you keep Ruby weird?
Can Can
Everyone Can Can
Thank You!
Algorithms:
I went to Georgia Tech for…
Mechanical Engineering
Self taught Programmer ~8 years
CS is boring to me
Programming is interesting
Building programs that accomplish tasks
But…
Those CS students are on to something
Algorithms
Are
Beautiful
<3
Algorithms solve problems
Introducing A Problem…
Spelling
When you are tired or distracted, spelling becomes harder
$ git commmit -m first
WARNING: You called a Git command named 'commmit', which does not exist.
Continuing under the assumption that you meant 'commit'
in 0.1 seconds automatically...
How does Git know?
Introducing: Edit Distance
The “cost” to change one word to another
Less “cost” means a more likely match
Cost of? zat => bat
Cost of? zat => bat: 1
Cost of? zzat => bat
Cost of? zzat => bat: 2
How do we code it though?
My First Attempt
def distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    cost += 1 if str2[index] != char
  end
  cost
end

zat => bat
Cost = 1
Perfect?
saturday sunday Cost = ?
saturday sunday Cost = 7
Wat?
Turns out I almost recreated
Hamming Distance
Hamming, AKA: Signal Distance
Hamming measures: the errors introduced in a string
Hamming is only valid for strings of the same length
Hamming is good for: detecting and correcting errors in binary and telecommunications
Hamming is bad for: misspelled words
Hamming does not include: insertion, deletion
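The Hamming slides above can be sketched in Ruby. This is a minimal illustration, not code from the talk: the method name `hamming_distance` and the choice to raise on unequal lengths are assumptions.

```ruby
# Hamming distance: count the positions where two equal-length strings differ.
# Raising on unequal lengths is an assumption of this sketch; it reflects the
# "only valid for strings of the same length" rule.
def hamming_distance(str1, str2)
  unless str1.length == str2.length
    raise ArgumentError, "strings must be the same length"
  end

  # Pair up characters position by position and count the mismatches.
  str1.chars.zip(str2.chars).count { |a, b| a != b }
end

puts hamming_distance("zat", "bat")      # one position differs
puts hamming_distance("10110", "10011")  # two positions differ
```

Calling it with "saturday" and "sunday" raises, which is the point: with no insertion or deletion there is no sensible answer for strings of different lengths.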
Introducing: An Algorithm!
Introducing: Levenshtein Distance
How do we calculate deletion?
distance("schneems", "zschneems")
Match! (deletion)
str1 = "schneems"
str2 = "zschneems"
str1 == str2[1..-1] # deletion
How do we calculate insertion?
distance("schneems", "chneems")
Match! (insertion)
str1 = "schneems"
str2 = "chneems"
str1[1..-1] == str2 # insertion
How do we calculate substitution?
distance("zchneems", "schneems")
str1 = "zchneems"
str2 = "schneems"
str1[1..-1] == str2[1..-1] # substitution
How do we calculate Distance?
Pretend we have a distance() method
Strings of different lengths
distance("", "foo") # => 3
distance("foo", "") # => 3

Different Lengths
return str2.length if str1.empty?
return str1.length if str2.empty?
Calculate distance between every substring
Calculate costs
l1 = distance(str1, str2[1..-1])        # deletion
l2 = distance(str1[1..-1], str2)        # insertion
l3 = distance(str1[1..-1], str2[1..-1]) # substitution

Take minimum
cost = 1 + [l1,l2,l3].min

Why did we add one?
Pick the cheapest operation, then add one
What about when characters match?
Match
str1 = "saturday"
str2 = "sunday"
distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]
Our powers combined, we form!
Recursive Levenshtein Distance!
def distance(str1, str2)
  # Different lengths
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

  l1 = distance(str1, str2[1..-1])        # deletion
  l2 = distance(str1[1..-1], str2)        # insertion
  l3 = distance(str1[1..-1], str2[1..-1]) # substitution
  return 1 + [l1,l2,l3].min               # increment cost
end

Recursive Levenshtein
distance("saturday", "sunday") # => 3
Much better
What does that look like?
github.com/schneems/going_the_distance
Hmm…
“Dirty Distance” took 8 comparisons
“Recursive” took 1647 comparisons
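One way to see the blow-up is to count calls. The sketch below is the recursive method from the slides with a counter threaded through; the name `counted_distance` and the counter hash are additions for illustration. It counts method calls, which may not be what the slide counts as a "comparison", so the total is not guaranteed to reproduce the number above.

```ruby
# Recursive Levenshtein distance, instrumented with a call counter.
# counter is a plain hash such as { calls: 0 } (an assumption of this sketch).
def counted_distance(str1, str2, counter)
  counter[:calls] += 1

  # Different lengths: the rest of the other string is pure cost.
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  # Match: move on without adding cost.
  if str1[0] == str2[0]
    return counted_distance(str1[1..-1], str2[1..-1], counter)
  end

  l1 = counted_distance(str1, str2[1..-1], counter)         # deletion
  l2 = counted_distance(str1[1..-1], str2, counter)         # insertion
  l3 = counted_distance(str1[1..-1], str2[1..-1], counter)  # substitution
  1 + [l1, l2, l3].min
end

counter = { calls: 0 }
cost = counted_distance("saturday", "sunday", counter)
puts "cost: #{cost}, calls: #{counter[:calls]}"
```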
No trophy, no flowers, no flashbulbs, no wine,
Ouch
Can we do better?
If you watch the recursive algorithm closely, you notice repeats
Maybe we can store each substring distance and use it to calculate the total distance
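Storing substring distances can be tried directly on the recursive version before moving to a matrix. The following is a sketch of that idea using memoization, which is not the approach the slides take next; the name `memo_distance` and the memo hash keyed on the pair of remaining strings are assumptions.

```ruby
# Memoized recursive Levenshtein distance.
# memo maps [str1, str2] pairs to distances that were already computed.
def memo_distance(str1, str2, memo = {})
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  key = [str1, str2]
  return memo[key] if memo.key?(key)

  memo[key] =
    if str1[0] == str2[0]
      memo_distance(str1[1..-1], str2[1..-1], memo)           # match
    else
      l1 = memo_distance(str1, str2[1..-1], memo)             # deletion
      l2 = memo_distance(str1[1..-1], str2, memo)             # insertion
      l3 = memo_distance(str1[1..-1], str2[1..-1], memo)      # substitution
      1 + [l1, l2, l3].min
    end
end

puts memo_distance("saturday", "sunday")
```

Each distinct pair of suffixes is now solved once, so the repeated work the recursive version does is skipped.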
I want you to join the club
A members only club
Matrix: Levenshtein Distance
“” => “saturday” Cost?
Matrix
+---+---+---+---+---+---+---+---+---+
|   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+
|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+
“sunday” => “” Cost?
Matrix
+---+---+
|   |   |
+---+---+
|   | 0 |
+---+---+
| S | 1 |
+---+---+
| U | 2 |
+---+---+
| N | 3 |
+---+---+
| D | 4 |
+---+---+
| A | 5 |
+---+---+
| Y | 6 |
+---+---+
Now, fill in the matrix
Matrix
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 |
+---+---+
| U | 2 |
+---+---+
| N | 3 |
+---+---+
| D | 4 |
+---+---+
| A | 5 |
+---+---+
| Y | 6 |
+---+---+
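The starting matrix above, with its first row and first column filled in, could be built like this. It is a minimal sketch: the name `initial_matrix` and the use of `nil` for cells that are not filled in yet are assumptions.

```ruby
# Build the starting matrix: one row per character of str2 plus one for "",
# one column per character of str1 plus one for "".
def initial_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1) }

  # First row: cost of changing "" into each prefix of str1.
  (0..str1.length).each { |j| matrix[0][j] = j }
  # First column: cost of changing each prefix of str2 into "".
  (0..str2.length).each { |i| matrix[i][0] = i }

  matrix
end

matrix = initial_matrix("saturday", "sunday")
p matrix[0]            # first row
p matrix.map(&:first)  # first column
```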
Break it down
How much does it cost to change “s” into “s”?
Match! Cost = 0
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+

Insertion! Cost = ?
+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 |   |
+---+---+---+---+

Insertion! Cost = 1
+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 | 1 |
+---+---+---+---+
How do we calculate insertion programmatically?
Insertion!
str1 = "schneems"
str2 = "chneems"
str1[1..-1] == str2

matrix[row_index][column_index - 1] + cost of change (+1)
matrix[row_index][column_index - 1] + 1

+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 | 1 |
+---+---+---+---+
Insertion! Cost = 1
Keep Going
Insertion(s)
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+---+---+
Next Char “su” => “s”
Action?
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
| U | 2 |   |
+---+---+---+
Changing “su” to “s” is a deletion. How do we calculate it?
Deletion
str1 = "schneems"
str2 = "zschneems"
str1 == str2[1..-1] # deletion

matrix[row_index - 1][column_index] + 1

+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
| U | 2 | 1 |
+---+---+---+
Deletion Cost = 1
- Insertion
- Deletion
- Substitution
Substitution
str1 = "zchneems"
str2 = "schneems"
str1[1..-1] == str2[1..-1] # substitution

matrix[row_index - 1][column_index - 1] + 1

+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 | 1 |
+---+---+---+---+
| U | 2 | 1 | 1 |
+---+---+---+---+
Substitution Cost = 1
- Insertion
- Deletion
- Substitution
What about match?
Match!
str1 = "schneems"
str2 = "schneems"
str1[1..-1] == str2[1..-1]
matrix[row_index - 1][column_index - 1]

+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
Why?
If the current characters match, the cost is the cost of changing the previous substring into the previous substring,
i.e. changing “” to “”
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
Now What?
Algorithm
str2.each_char.each_with_index do |char1, i|
  str1.each_char.each_with_index do |char2, j|
    if char1 == char2
      # Match: carry the diagonal cost over unchanged
      puts [:skip, matrix[i][j]].inspect
      matrix[i + 1][j + 1] = matrix[i][j]
    else
      actions = {
        deletion:     matrix[i][j + 1] + 1,
        insert:       matrix[i + 1][j] + 1,
        substitution: matrix[i][j] + 1
      }
      # Pick the cheapest action as a [name, cost] pair
      action = actions.sort {|(k,v), (k2, v2)| v <=> v2 }.first
      puts action.inspect
      matrix[i + 1][j + 1] = action.last
    end
    each_step.call(matrix) if each_step
  end
end
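The loop on the Algorithm slide assumes a `matrix` that already exists and an optional `each_step` callback. Below is a self-contained sketch that sets up the matrix and runs the same fill rules without the logging; the names `levenshtein_matrix` and `matrix_distance` are assumptions, not names from the talk.

```ruby
# Matrix Levenshtein distance.
# Rows follow str2 and columns follow str1, as in the slides.
def levenshtein_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1, 0) }
  (0..str1.length).each { |j| matrix[0][j] = j }  # first row
  (0..str2.length).each { |i| matrix[i][0] = i }  # first column

  str2.each_char.with_index do |row_char, i|
    str1.each_char.with_index do |col_char, j|
      matrix[i + 1][j + 1] =
        if row_char == col_char
          matrix[i][j]            # match: no extra cost
        else
          1 + [
            matrix[i][j + 1],     # deletion (cell above)
            matrix[i + 1][j],     # insertion (cell to the left)
            matrix[i][j]          # substitution (diagonal cell)
          ].min
        end
    end
  end

  matrix
end

# The total distance is the bottom-right cell.
def matrix_distance(str1, str2)
  levenshtein_matrix(str1, str2).last.last
end

puts matrix_distance("saturday", "sunday")
```

Returning the whole matrix rather than only the last cell keeps every intermediate cost available, since each cell holds the distance between two prefixes.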
Iterate!
Final Cost: 3
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+---+---+
| U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+---+---+---+
| N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+---+---+---+
| D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
+---+---+---+---+---+---+---+---+---+---+
| A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
+---+---+---+---+---+---+---+---+---+---+
| Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
+---+---+---+---+---+---+---+---+---+---+
We can also get cost of sub strings
“sun” => “sat”
Cost: 2 (same matrix, read at row N, column T)
+---+---+---+---+---+
|   |   | S | A | T |
+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 |
+---+---+---+---+---+
| S | 1 | 0 | 1 | 2 |
+---+---+---+---+---+
| U | 2 | 1 | 1 | 2 |
+---+---+---+---+---+
| N | 3 | 2 | 2 | 2 |
+---+---+---+---+---+
Better than Recursive?
As they speed through the finish, the flags go down.
48 iterations
Wayyyyyy better than 1647
bit.ly/going_the_distance
My Problem
I am human
I get tired
Machines don’t understand tired
One day I tried typing
$ rails generate migration
But accidentally typed
$ rails generate migratoon
ERROR
Stress is increased when we fail at simple tasks
Why? It’s not hard
Why can’t my software be more like git?
We know what you’re trying to accomplish. Let’s help you out
When you get an ERROR
Compare given command to possible commands
Recommend smallest distance.
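The two steps above, compare and recommend, can be sketched as follows. The command list is hypothetical and the names `edit_distance` and `suggest` are assumptions; a real tool would supply its own list of valid commands. The distance method is a compact Levenshtein that keeps only the previous matrix row.

```ruby
# Compact Levenshtein distance that keeps only the previous matrix row.
def edit_distance(str1, str2)
  previous = (0..str1.length).to_a
  str2.each_char.with_index do |row_char, i|
    current = [i + 1]
    str1.each_char.with_index do |col_char, j|
      cost = row_char == col_char ? 0 : 1
      current << [
        previous[j + 1] + 1,  # deletion
        current[j] + 1,       # insertion
        previous[j] + cost    # substitution or match
      ].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical list of valid commands, for illustration only.
COMMANDS = %w[migration model controller scaffold].freeze

# Recommend the command with the smallest distance to the input.
def suggest(input, commands = COMMANDS)
  commands.min_by { |command| edit_distance(input, command) }
end

puts "Did you mean '#{suggest('migratoon')}'?"
```

In practice a cutoff on the distance would probably be wanted too, so that input unlike every command gets no suggestion at all.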
Google:
Read a lot of words from real books
~1+ million words
Count Each word
Higher count, higher probability
Get edit distance between input and dictionary
Lower edit distance, higher probability
Show Suggestion
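The steps above can be sketched on a toy scale. This is not Google's actual system: the corpus string is a hypothetical stand-in for "a lot of words from real books", and the names `edit_distance`, `WORD_COUNTS`, and `correct` are assumptions. It needs Ruby 2.7 or newer for `tally`.

```ruby
# Compact Levenshtein distance that keeps only the previous matrix row.
def edit_distance(str1, str2)
  previous = (0..str1.length).to_a
  str2.each_char.with_index do |row_char, i|
    current = [i + 1]
    str1.each_char.with_index do |col_char, j|
      cost = row_char == col_char ? 0 : 1
      current << [previous[j + 1] + 1, current[j] + 1, previous[j] + cost].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical tiny corpus standing in for real books.
CORPUS = "the cat sat on the mat and the cat ran to the other cat".freeze

# Count each word: a higher count is treated as a higher probability.
WORD_COUNTS = CORPUS.split.tally

# Prefer the lowest edit distance, then the highest word count.
def correct(word, counts = WORD_COUNTS)
  counts.keys.min_by do |candidate|
    [edit_distance(word, candidate), -counts[candidate]]
  end
end

puts correct("caat")
puts correct("bat")
```

For "bat", three dictionary words sit one edit away, and the word count breaks the tie.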
Cache correct spelling suggestions
did_you_mean gem
More distance measurements
Levenshtein Distance
Hamming Distance
Longest Common Subsequence Distance
Manhattan Distance
Tree Distance
Jaro-Winkler Distance
Many Many More
Algorithms are Awesome
Where to go next?
Want to learn more about algorithms?
Wikipedia!
Rosetta code
Give an algorithm talk
Everyone suggests a new Algorithm!
Algorithms are a way of sharing knowledge
Expand your knowledge Explore Algorithms
Antepenultimate Slide
YAY!
Questions @schneems
Going The Distance