Going The Distance
Richard Schneeman
September 22, 2014
Levenshtein Distance and the beauty of algorithms
Transcript
Going The Distance @schneems
They Call me @Schneems
Ruby Schneems
Ruby Python
Code Triage .com
Challenge: Comment on an issue
Docs Doctor .org
Top 50 Rails Contrib
Cats
Can you keep Ruby weird?
Can Can
Everyone Can Can
Thank You!
Algorithms:
I went to Georgia Tech for…
Mechanical Engineering
Self taught Programmer ~8 years
CS is boring to me
Programming is interesting
Building programs that accomplish tasks
But…
Those CS students are on to something
Algorithms
Are
Beautiful
<3
Algorithms solve problems
Introducing a Problem…
Spelling
When you are tired or distracted, spelling becomes harder
$ git commmit -m first
WARNING: You called a Git command named 'commmit', which does not exist.
Continuing under the assumption that you meant 'commit'
in 0.1 seconds automatically...
How does Git know?
Introducing: Edit Distance
The “cost” to change one word to another
Lower “cost” means a more likely match
Cost of? zat => bat
Cost of? zat => bat 1
Cost of? zzat => bat
Cost of? zzat => bat 2
How do we code it though?
My First Attempt
def distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    cost += 1 if str2[index] != char
  end
  cost
end

zat => bat
Cost = 1
Perfect?
saturday sunday Cost = ?
saturday sunday Cost = 7
Wat?
Turns out I almost recreated
Hamming Distance
AKA: Signal Distance
Measures: the errors introduced in a string
Only valid for strings of the same length
Good for: detecting and correcting errors in binary and telecommunications
Bad for: misspelled words
Does not include: insertion, deletion
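For comparison, here is a minimal sketch of a strict Hamming distance in Ruby. The method name hamming_distance and the choice to raise on unequal lengths are assumptions for illustration, not something taken from the slides.

```ruby
# Hamming distance: number of positions at which two equal-length strings differ.
# Raising on unequal lengths is an illustrative choice, not the talk's code.
def hamming_distance(str1, str2)
  unless str1.length == str2.length
    raise ArgumentError, "strings must be the same length"
  end

  str1.chars.zip(str2.chars).count { |a, b| a != b }
end

puts hamming_distance("zat", "bat")
puts hamming_distance("1011101", "1001001")
```

Unlike the first attempt above, this version refuses "saturday" vs "sunday" outright instead of returning a misleading number.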
Introducing: An Algorithm!
Introducing: Levenshtein Distance
How do we calculate deletion?
distance("schneems", "zschneems")
Match! (deletion)
str1 = "schneems"
str2 = "zschneems"
str1 == str2[1..-1] # deletion
How do we calculate insertion?
distance("schneems", "chneems")
Match! (insertion)
str1 = "schneems"
str2 = "chneems"
str1[1..-1] == str2 # insertion
substitution?
distance("zchneems", "schneems")
str1 = "zchneems"
str2 = "schneems"
str1[1..-1] == str2[1..-1] # substitution
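The three slicing checks can be collected into one small helper. This is only a sketch: first_char_edit is a hypothetical name, and it looks at the first character only, so it does not compute a distance.

```ruby
# Classifies a single edit at the FIRST character, using the slicing checks
# from the slides. Hypothetical helper; returns nil if no check applies.
def first_char_edit(str1, str2)
  return :match        if str1 == str2
  return :deletion     if str1 == str2[1..-1]
  return :insertion    if str1[1..-1] == str2
  return :substitution if str1[1..-1] == str2[1..-1]

  nil
end

puts first_char_edit("schneems", "zschneems").inspect
puts first_char_edit("schneems", "chneems").inspect
puts first_char_edit("zchneems", "schneems").inspect
```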
How do we calculate Distance?
Pretend we have a distance() method
Strings of different lengths
distance("", "foo") # => 3
distance("foo", "") # => 3

Different lengths:
return str2.length if str1.empty?
return str1.length if str2.empty?
Calculate distance between every substring
Calculate costs:
l1 = distance(str1, str2[1..-1])        # deletion
l2 = distance(str1[1..-1], str2)        # insertion
l3 = distance(str1[1..-1], str2[1..-1]) # substitution

Take minimum:
cost = 1 + [l1, l2, l3].min

Why did we add one?
Pick the cheapest operation, then add one
What about when characters match?
Match:
distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]
str1 = "saturday"
str2 = "sunday"
Our powers combined, we form!
Recursive Levenshtein Distance!
def distance(str1, str2)
  # Different lengths
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

  l1 = distance(str1, str2[1..-1])        # deletion
  l2 = distance(str1[1..-1], str2)        # insertion
  l3 = distance(str1[1..-1], str2[1..-1]) # substitution
  return 1 + [l1, l2, l3].min             # increment cost
end

distance("saturday", "sunday") # => 3
Much better
What does that look like?
github.com/schneems/going_the_distance
Hmm…
“Dirty Distance” took 8 comparisons
“Recursive” took 1647 comparisons
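One way to see the blow-up is to count how often the recursive method gets called. The sketch below assumes a hypothetical counter hash passed along for bookkeeping. It counts method calls, which is not necessarily how the slides count "comparisons", so the number it prints may not be 1647.

```ruby
# Recursive Levenshtein distance that also counts its own invocations.
# `counter` is a hypothetical bookkeeping hash, not part of the original code.
def counted_distance(str1, str2, counter)
  counter[:calls] += 1
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  rest1 = str1[1..-1]
  rest2 = str2[1..-1]
  return counted_distance(rest1, rest2, counter) if str1[0] == str2[0] # match

  1 + [
    counted_distance(str1, rest2, counter),  # deletion
    counted_distance(rest1, str2, counter),  # insertion
    counted_distance(rest1, rest2, counter)  # substitution
  ].min
end

counter = { calls: 0 }
result = counted_distance("saturday", "sunday", counter)
puts "distance=#{result} calls=#{counter[:calls]}"
```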
No trophy, no flowers, no flashbulbs, no wine,
Ouch
Can we do better?
If you watch the recursive algorithm closely, you notice repeats
Maybe we can store each substring distance and use it to calculate the total distance
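Before moving to a matrix, the "store and reuse" idea can be tried directly on the recursive version by caching each pair of substrings. This is a sketch of plain memoization, which the slides do not show. The name memo_distance and the memo hash are assumptions.

```ruby
# Recursive Levenshtein distance with a cache keyed on the substring pair.
# Illustrative memoization only; the talk itself goes on to use a matrix.
def memo_distance(str1, str2, memo = {})
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  key = [str1, str2]
  return memo[key] if memo.key?(key)

  rest1 = str1[1..-1]
  rest2 = str2[1..-1]
  memo[key] =
    if str1[0] == str2[0]
      memo_distance(rest1, rest2, memo)      # match: no added cost
    else
      1 + [
        memo_distance(str1, rest2, memo),    # deletion
        memo_distance(rest1, str2, memo),    # insertion
        memo_distance(rest1, rest2, memo)    # substitution
      ].min
    end
end

puts memo_distance("saturday", "sunday")
```

Each distinct pair of suffixes is computed at most once, which is the same saving the matrix delivers in a more explicit form.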
I want you to join the club
A members only club
Matrix: Levenshtein Distance
“” => “saturday” Cost?
+---+---+---+---+---+---+---+---+---+
|   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+
|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+
“sunday” => “” Cost?
+---+---+
|   |   |
+---+---+
|   | 0 |
+---+---+
| S | 1 |
+---+---+
| U | 2 |
+---+---+
| N | 3 |
+---+---+
| D | 4 |
+---+---+
| A | 5 |
+---+---+
| Y | 6 |
+---+---+
Now, fill in the matrix
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 |
+---+---+
| U | 2 |
+---+---+
| N | 3 |
+---+---+
| D | 4 |
+---+---+
| A | 5 |
+---+---+
| Y | 6 |
+---+---+
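The starting matrix, with only its first row and first column filled in, could be built roughly like this. The name initial_matrix and its parameter names are assumptions for illustration.

```ruby
# Builds the starting matrix: row 0 holds the cost of building each prefix of
# column_word from "", column 0 the cost of reducing each prefix of row_word
# to "". All other cells start as nil. Hypothetical helper name.
def initial_matrix(column_word, row_word)
  matrix = Array.new(row_word.length + 1) { Array.new(column_word.length + 1) }
  (0..column_word.length).each { |j| matrix[0][j] = j }
  (0..row_word.length).each    { |i| matrix[i][0] = i }
  matrix
end

matrix = initial_matrix("saturday", "sunday")
matrix.each { |row| puts row.map { |cell| cell.nil? ? "." : cell }.join(" ") }
```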
Break it down
How much does it cost to change “s” into “s”?
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
Match! Cost = 0
+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 | 1 |
+---+---+---+---+
Insertion! Cost = 1
How do we calculate insertion programmatically?
Insertion!
str1 = "schneems"
str2 = "chneems"
str1[1..-1] == str2
matrix[row_index][column_index - 1] + 1 # + cost of change (+1)
+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 | 1 |
+---+---+---+---+
Cost = 1
Keep Going
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+---+---+
Insertion(s)
Next Char “su” => “s”
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
| U | 2 |   |
+---+---+---+
Action?
Changing “su” to “s” is a deletion. How do we calculate it?
Deletion
str1 = "schneems"
str2 = "zschneems"
str1 == str2[1..-1]
matrix[row_index - 1][column_index] + 1
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
| U | 2 | 1 |
+---+---+---+
Deletion Cost = 1
- Insertion
- Deletion
- Substitution
Substitution
str1 = "zchneems"
str2 = "schneems"
str1[1..-1] == str2[1..-1]
matrix[row_index - 1][column_index - 1] + 1
+---+---+---+---+
|   |   | S | A |
+---+---+---+---+
|   | 0 | 1 | 2 |
+---+---+---+---+
| S | 1 | 0 | 1 |
+---+---+---+---+
| U | 2 | 1 | 1 |
+---+---+---+---+
Substitution Cost = 1
- Insertion
- Deletion
- Substitution
What about match?
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
Match!
str1 = "schneems"
str2 = "schneems"
str1[1..-1] == str2[1..-1]
matrix[row_index - 1][column_index - 1]
Why?
If the current characters match, the cost is whatever it cost to change the preceding substring of one word into the preceding substring of the other
i.e. changing "" to ""
+---+---+---+
|   |   | S |
+---+---+---+
|   | 0 | 1 |
+---+---+---+
| S | 1 | 0 |
+---+---+---+
Now What?
Algorithm
str2.each_char.each_with_index do |char1, i|
  str1.each_char.each_with_index do |char2, j|
    if char1 == char2
      puts [:skip, matrix[i][j]].inspect
      matrix[i + 1][j + 1] = matrix[i][j]
    else
      actions = {
        deletion:     matrix[i][j + 1] + 1,
        insert:       matrix[i + 1][j] + 1,
        substitution: matrix[i][j] + 1
      }
      action = actions.sort { |(k, v), (k2, v2)| v <=> v2 }.first
      puts action.inspect
      matrix[i + 1][j + 1] = action.last
    end
    each_step.call(matrix) if each_step
  end
end
Iterate!
Final Cost
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+---+---+
| U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+---+---+---+
| N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+---+---+---+
| D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
+---+---+---+---+---+---+---+---+---+---+
| A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
+---+---+---+---+---+---+---+---+---+---+
| Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
+---+---+---+---+---+---+---+---+---+---+
Cost = 3 (bottom-right cell)
We can also get cost of sub strings
“sun” => “sat”
Final Cost
+---+---+---+---+---+---+---+---+---+---+
|   |   | S | A | T | U | R | D | A | Y |
+---+---+---+---+---+---+---+---+---+---+
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+---+---+---+---+---+---+---+---+---+---+
| S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+---+---+---+---+---+---+---+---+---+
| U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+---+---+---+
| N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
+---+---+---+---+---+---+---+---+---+---+
| D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
+---+---+---+---+---+---+---+---+---+---+
| A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
+---+---+---+---+---+---+---+---+---+---+
| Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
+---+---+---+---+---+---+---+---+---+---+
Cost = 2 (row N, column T)
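Putting the pieces together, a self-contained version of the matrix approach might look like the sketch below. It follows the same cell rules as the slides but drops the debug output and the each_step hook. The name levenshtein_matrix and its parameter names are assumptions.

```ruby
# Fills the whole Levenshtein matrix and returns it, so both the final cost
# and the cost of any pair of prefixes can be read off afterwards.
# Hypothetical method name; rows follow row_word, columns follow column_word.
def levenshtein_matrix(column_word, row_word)
  rows = row_word.length
  cols = column_word.length
  matrix = Array.new(rows + 1) { Array.new(cols + 1, 0) }
  (0..cols).each { |j| matrix[0][j] = j }
  (0..rows).each { |i| matrix[i][0] = i }

  row_word.each_char.with_index do |row_char, i|
    column_word.each_char.with_index do |col_char, j|
      matrix[i + 1][j + 1] =
        if row_char == col_char
          matrix[i][j]                # match: carry the diagonal cost over
        else
          1 + [
            matrix[i][j + 1],         # deletion
            matrix[i + 1][j],         # insertion
            matrix[i][j]              # substitution
          ].min
        end
    end
  end
  matrix
end

matrix = levenshtein_matrix("saturday", "sunday")
puts matrix.last.last # cost for the full words
puts matrix[3][3]     # cost for "sun" => "sat"
```

Returning the matrix rather than a single number is a design choice made here so the substring lookup is visible. A method that only needs the final cost could return matrix.last.last instead.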
Better than Recursive?
As they speed through the finish, the flags go down.
48 iterations
Wayyyyyy better than 1647
bit.ly/going_the_distance
My Problem
I am human
I get tired
Machines don’t understand tired
One day I tried typing
$ rails generate migration
But accidentally typed
$ rails generate migratoon
ERROR
Stress is increased when we fail at simple tasks
Why? It’s not hard
Why can’t my software be more like git?
We know what you’re trying to accomplish. Let’s help you out
When you have ERROR
Compare given command to possible commands
Recommend smallest distance.
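A rough sketch of that recommendation step follows. The COMMANDS list and the suggest method are made up for illustration and are not the Rails generator API. The distance method is a compact two-row variant of the matrix algorithm.

```ruby
# Levenshtein distance keeping only the previous and current matrix rows.
def distance(str1, str2)
  previous = (0..str2.length).to_a
  str1.each_char.with_index do |char1, i|
    current = [i + 1]
    str2.each_char.with_index do |char2, j|
      cost = char1 == char2 ? 0 : 1
      current << [previous[j + 1] + 1, current[j] + 1, previous[j] + cost].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical list of valid commands, for illustration only.
COMMANDS = %w[migration model controller scaffold].freeze

# Returns the known command with the smallest distance to the input.
def suggest(input, commands = COMMANDS)
  commands.min_by { |command| distance(input, command) }
end

puts "Did you mean '#{suggest('migratoon')}'?"
```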
Google:
Read a lot of words from real books
~1+ million words
Count Each word
Higher count, higher probability
Get edit distance between input and dictionary
Lower edit, higher probability
Show Suggestion
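The steps above can be sketched on a toy scale. Everything here is an assumption for illustration: the tiny CORPUS string stands in for text from real books, max_distance of 2 is an arbitrary cutoff, and this is not a description of how Google actually implements suggestions.

```ruby
# Levenshtein distance keeping only the previous and current matrix rows.
def distance(str1, str2)
  previous = (0..str2.length).to_a
  str1.each_char.with_index do |char1, i|
    current = [i + 1]
    str2.each_char.with_index do |char2, j|
      cost = char1 == char2 ? 0 : 1
      current << [previous[j + 1] + 1, current[j] + 1, previous[j] + cost].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical stand-in for "a lot of words from real books".
CORPUS = "the cat sat on the mat the cat ate the rat".freeze

# Count each word: a higher count is treated as a higher probability.
WORD_COUNTS = CORPUS.split.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }

# Prefer the lowest edit distance, breaking ties with the higher word count.
def correct(word, counts = WORD_COUNTS, max_distance: 2)
  return word if counts.key?(word)

  candidates = counts.keys.select { |known| distance(word, known) <= max_distance }
  candidates.min_by { |known| [distance(word, known), -counts[known]] }
end

puts correct("caat")
puts correct("hat")
```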
Cache correct spelling suggestions
did_you_mean gem
More distance measurements
Levenshtein Distance
Hamming Distance
Longest Common Subsequence Distance
Manhattan Distance
Tree Distance
Jaro-Winkler Distance
Many Many More
Algorithms are Awesome
Where to go next?
Want to learn more about algorithms?
Wikipedia!
Rosetta code
Give an algorithm talk
Everyone suggests a new Algorithm!
Algorithms are a way of sharing knowledge
Expand your knowledge Explore Algorithms
Antepenultimate Slide
YAY!
Questions @schneems
Going The Distance