Richard Schneeman
September 22, 2014

# Going The Distance

Levenshtein Distance and the beauty of algorithms

## Transcript

31. ### \$ git commmit -m first

        WARNING: You called a Git command named 'commmit', which does not exist.
        Continuing under the assumption that you meant 'commit'
        in 0.1 seconds automatically...

42. ### zat, bat =>

        def distance(str1, str2)
          cost = 0
          str1.each_char.with_index do |char, index|
            cost += 1 if str2[index] != char
          end
          cost
        end

        zat
        bat
        =>

43. ### Hamming

        def distance(str1, str2)
          cost = 0
          str1.each_char.with_index do |char, index|
            cost += 1 if str2[index] != char
          end
          cost
        end

        zat
        bat
        => Cost = 1
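The position-by-position comparison above gives a sensible answer only when the two strings line up. A minimal sketch of where it stops being useful, assuming the slide's method is renamed `hamming_distance` (a name chosen here for illustration, not taken from the talk):

```ruby
# Position-by-position comparison, same logic as the slide.
# `hamming_distance` is a hypothetical name used only in this example.
def hamming_distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    cost += 1 if str2[index] != char
  end
  cost
end

puts hamming_distance("zat", "bat")            # => 1 (one substitution)
puts hamming_distance("schneems", "zschneems") # => 7 (one extra leading
                                               #    character shifts almost
                                               #    every position)
```

The second pair differs by a single added character, yet the positional count reports 7, which is the gap the edit-distance slides go on to close.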

75. ### Calculate costs

        l1 = distance(str1, str2[1..-1])          # deletion
        l2 = distance(str1[1..-1], str2)          # insertion
        l3 = distance(str1[1..-1], str2[1..-1])   # substitution

76. ### Take minimum

        l1 = distance(str1, str2[1..-1])          # deletion
        l2 = distance(str1[1..-1], str2)          # insertion
        l3 = distance(str1[1..-1], str2[1..-1])   # substitution
        cost = 1 + [l1, l2, l3].min

77. ### Why did we add one?

        l1 = distance(str1, str2[1..-1])          # deletion
        l2 = distance(str1[1..-1], str2)          # insertion
        l3 = distance(str1[1..-1], str2[1..-1])   # substitution
        cost = 1 + [l1, l2, l3].min

80. ### Match

        distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]

        str1 = "saturday"
        str2 = "sunday"

84. ### Recursive

        def distance(str1, str2)
          # Different lengths
          return str2.length if str1.empty?
          return str1.length if str2.empty?

          return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

          l1 = distance(str1, str2[1..-1])          # deletion
          l2 = distance(str1[1..-1], str2)          # insertion
          l3 = distance(str1[1..-1], str2[1..-1])   # substitution
          return 1 + [l1, l2, l3].min               # increment cost
        end
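To see the recursive version run, here is a small sketch that wraps the slide's method with a call counter. The global `$calls` is an addition made only for this example so that the amount of repeated work becomes visible.

```ruby
# Recursive edit distance as on the slide, plus a call counter.
# `$calls` is a hypothetical global added purely for illustration.
$calls = 0

def distance(str1, str2)
  $calls += 1
  # Different lengths: the rest of the other string must be added/removed.
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  # Match: no cost, move both strings forward.
  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]

  l1 = distance(str1, str2[1..-1])        # deletion
  l2 = distance(str1[1..-1], str2)        # insertion
  l3 = distance(str1[1..-1], str2[1..-1]) # substitution
  1 + [l1, l2, l3].min                    # increment cost
end

puts distance("saturday", "sunday") # => 3
puts $calls                         # number of recursive calls it took
```

Because each mismatch branches three ways, the same pairs of substrings are reached along several different paths and recomputed every time.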

96. ### Maybe we can store substring distances and use them to calculate the total distance
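One way to try out the "store substring distances" idea before building a matrix is top-down memoization. This is a different technique from the matrix the slides develop next, shown here only as a sketch; `memo_distance` and `memo` are names assumed for the example.

```ruby
# Recursive edit distance that remembers every substring pair it has solved.
# `memo_distance` and `memo` are hypothetical names, not from the talk.
def memo_distance(str1, str2, memo = {})
  key = [str1, str2]
  return memo[key] if memo.key?(key)

  memo[key] =
    if str1.empty?
      str2.length
    elsif str2.empty?
      str1.length
    elsif str1[0] == str2[0]
      memo_distance(str1[1..-1], str2[1..-1], memo) # match
    else
      1 + [
        memo_distance(str1, str2[1..-1], memo),        # deletion
        memo_distance(str1[1..-1], str2, memo),        # insertion
        memo_distance(str1[1..-1], str2[1..-1], memo)  # substitution
      ].min
    end
end

memo = {}
puts memo_distance("saturday", "sunday", memo) # => 3
puts memo.size # distinct substring pairs actually solved
```

The number of stored pairs can never exceed 9 x 7 = 63 for these two words (every suffix of one paired with every suffix of the other), which is exactly the size of the matrix built on the following slides.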

102. ### Matrix

         +---+---+
         |   | S |
         +---+---+
         |   | 1 |
         +---+---+

103. ### Matrix

         +---+---+---+
         |   | S | A |
         +---+---+---+
         |   | 1 | 2 |
         +---+---+---+

104. ### Matrix

         +---+---+---+---+
         |   | S | A | T |
         +---+---+---+---+
         |   | 1 | 2 | 3 |
         +---+---+---+---+

105. ### Matrix

         +---+---+---+---+---+
         |   | S | A | T | U |
         +---+---+---+---+---+
         |   | 1 | 2 | 3 | 4 |
         +---+---+---+---+---+

106. ### Matrix

         +---+---+---+---+---+---+
         |   | S | A | T | U | R |
         +---+---+---+---+---+---+
         |   | 1 | 2 | 3 | 4 | 5 |
         +---+---+---+---+---+---+

107. ### Matrix

         +---+---+---+---+---+---+---+
         |   | S | A | T | U | R | D |
         +---+---+---+---+---+---+---+
         |   | 1 | 2 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+

108. ### Matrix

         +---+---+---+---+---+---+---+---+
         |   | S | A | T | U | R | D | A |
         +---+---+---+---+---+---+---+---+
         |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+

109. ### Matrix

         +---+---+---+---+---+---+---+---+---+
         |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+
         |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+

111. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+

112. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+

113. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+

114. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+
         | D | 4 |
         +---+---+

115. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+
         | D | 4 |
         +---+---+
         | A | 5 |
         +---+---+

116. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+
         | D | 4 |
         +---+---+
         | A | 5 |
         +---+---+
         | Y | 6 |
         +---+---+

118. ### Matrix

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+
         | D | 4 |
         +---+---+
         | A | 5 |
         +---+---+
         | Y | 6 |
         +---+---+
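The starting matrix above can be built in a few lines. This sketch assumes, as drawn on the slide, that `str1` ("saturday") runs across the columns and `str2` ("sunday") runs down the rows; the variable name `matrix` matches the later slides, the rest is illustrative.

```ruby
str1 = "saturday" # columns
str2 = "sunday"   # rows

# One extra row and one extra column represent the empty string.
matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1) }

# Top row: cost of building each prefix of str1 from "" (0..8).
(0..str1.length).each { |j| matrix[0][j] = j }
# Left column: cost of building each prefix of str2 from "" (0..6).
(0..str2.length).each { |i| matrix[i][0] = i }

# Print the matrix, showing cells that are not yet filled as ".".
matrix.each do |row|
  puts row.map { |cell| cell.nil? ? "." : cell }.join(" ")
end
```

Every remaining cell is still empty at this point; the following slides fill them in one at a time from the values to the left, above, and diagonally above-left.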

121. ### Match! Cost = 0

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+

122. ### Insertion! Cost = ?

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 |   |
         +---+---+---+---+

123. ### Insertion! Cost = 1

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+

126. ### Insertion!

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 |   |
         +---+---+---+---+

         str1 = "schneems"
         str2 = "chneems"
         str1[1..-1] == str2
         matrix[row_index][column_index - 1]

131. ### Insertion!

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 |   |
         +---+---+---+---+

         str1 = "schneems"
         str2 = "chneems"
         str1[1..-1] == str2
         matrix[row_index][column_index - 1] + 1

132. ### Insertion! Cost = 1

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+

         str1 = "schneems"
         str2 = "chneems"
         str1[1..-1] == str2
         matrix[row_index][column_index - 1] + 1

134. ### Insertion(s)

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+---+---+

136. ### Action?

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 |   |
         +---+---+---+

calculate?

140. ### Deletion

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 |   |
         +---+---+---+

         str1 = "schneems"
         str2 = "zschneems"
         str1 == str2[1..-1]
         matrix[row_index - 1][column_index]
143. ### Deletion

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 |   |
         +---+---+---+

         str1 = "schneems"
         str2 = "zschneems"
         str1 == str2[1..-1]
         matrix[row_index - 1][column_index] + 1

144. ### Deletion Cost = 1

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 | 1 |
         +---+---+---+

         str1 = "schneems"
         str2 = "zschneems"
         str1 == str2[1..-1]
         matrix[row_index - 1][column_index] + 1

149. ### Substitution

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+
         | U | 2 | 1 |   |
         +---+---+---+---+

         str1 = "zchneems"
         str2 = "schneems"
         str1[1..-1] == str2[1..-1]
         matrix[row_index - 1][column_index - 1]
154. ### Substitution Cost = 1

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+
         | U | 2 | 1 | 1 |
         +---+---+---+---+

         str1 = "zchneems"
         str2 = "schneems"
         str1[1..-1] == str2[1..-1]
         matrix[row_index - 1][column_index - 1] + 1

157. ### Match!

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+

         str1 = "schneems"
         str2 = "schneems"
         str1[1..-1] == str2[1..-1]
         matrix[row_index - 1][column_index - 1]

159. ### If the current characters match, the cost is the cost of changing the previous substring of one string into the previous substring of the other

160. ### i.e. changing "" to ""

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
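The four cases walked through above (match, insertion, deletion, substitution) all read from the same three neighbouring cells. A minimal sketch of filling one cell, where `cell_cost` and its parameter names are assumptions made for this example rather than code from the talk:

```ruby
# Cost of one cell, computed from its three already-filled neighbours.
# `cell_cost` is a hypothetical helper, not a name used in the slides.
def cell_cost(matrix, row_index, column_index, row_char, column_char)
  diagonal = matrix[row_index - 1][column_index - 1]
  return diagonal if row_char == column_char # match: no extra cost

  deletion     = matrix[row_index - 1][column_index] + 1
  insertion    = matrix[row_index][column_index - 1] + 1
  substitution = diagonal + 1
  [deletion, insertion, substitution].min
end

# Rows are "", "S", "U"; columns are "", "S", "A", as on the slides.
matrix = [
  [0, 1, 2],
  [1, 0, 1],
  [2, 1, nil]
]

puts cell_cost(matrix, 1, 1, "S", "S") # => 0 (match)
puts cell_cost(matrix, 2, 2, "U", "A") # => 1 (substitution is cheapest)
```

The second call reproduces the "Substitution Cost = 1" cell: deletion and insertion would each cost 2, while substituting on top of the diagonal 0 costs 1.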

162. ### Algorithm

         str2.each_char.each_with_index do |char1, i|
           str1.each_char.each_with_index do |char2, j|
             if char1 == char2
               puts [:skip, matrix[i][j]].inspect
               matrix[i + 1][j + 1] = matrix[i][j]
             else
               actions = {
                 deletion:     matrix[i][j + 1] + 1,
                 insert:       matrix[i + 1][j] + 1,
                 substitution: matrix[i][j] + 1
               }
               action = actions.sort { |(k, v), (k2, v2)| v <=> v2 }.first
               puts action.inspect
               matrix[i + 1][j + 1] = action.last
             end
             each_step.call(matrix) if each_step
           end
         end
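The loop on the slide relies on a `matrix` and an `each_step` callback defined elsewhere. The sketch below wraps the same loop in a self-contained method; the method name `levenshtein`, the `each_step:` keyword, and the initialisation code are assumptions, and the debug `puts` calls are left out.

```ruby
# Matrix-based edit distance assembled around the slide's nested loops.
# `levenshtein` and the `each_step:` keyword are hypothetical; the slide
# shows only the loops themselves.
def levenshtein(str1, str2, each_step: nil)
  # Row 0 and column 0 hold the cost of building each prefix from "".
  matrix = Array.new(str2.length + 1) do |i|
    Array.new(str1.length + 1) { |j| i.zero? ? j : (j.zero? ? i : nil) }
  end

  str2.each_char.each_with_index do |char1, i|
    str1.each_char.each_with_index do |char2, j|
      matrix[i + 1][j + 1] =
        if char1 == char2
          matrix[i][j] # match: carry the diagonal value over
        else
          [
            matrix[i][j + 1] + 1, # deletion
            matrix[i + 1][j] + 1, # insertion
            matrix[i][j] + 1      # substitution
          ].min
        end
      each_step&.call(matrix) # optional hook, e.g. for printing each step
    end
  end

  matrix.last.last # bottom-right cell is the total distance
end

puts levenshtein("saturday", "sunday") # => 3
```

Each cell is computed exactly once, so the work grows with the product of the two string lengths rather than with the number of recursive paths.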

164. ### Final Cost

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+---+---+
         | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
         +---+---+---+---+---+---+---+---+---+---+
         | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
         +---+---+---+---+---+---+---+---+---+---+
         | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
         +---+---+---+---+---+---+---+---+---+---+

     3

167. ### Final Cost

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+---+---+
         | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
         +---+---+---+---+---+---+---+---+---+---+
         | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
         +---+---+---+---+---+---+---+---+---+---+
         | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
         +---+---+---+---+---+---+---+---+---+---+

     2
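Tying the final cost back to the `git commmit` slide: a suggestion feature can pick the known command with the smallest distance to what was typed. The sketch below is only an illustration of that idea, not how Git itself is implemented; `edit_distance`, `suggest`, and the command list are all assumptions. It keeps just the previous matrix row, a common space-saving variant of the full matrix.

```ruby
# Edit distance that keeps only the previous matrix row.
# `edit_distance` is a hypothetical name for this example.
def edit_distance(str1, str2)
  previous = (0..str1.length).to_a # top row of the matrix
  str2.each_char.with_index do |char2, i|
    current = [i + 1] # left-column value for this row
    str1.each_char.with_index do |char1, j|
      cost = if char1 == char2
               previous[j] # match: diagonal value
             else
               [previous[j + 1], current[j], previous[j]].min + 1
             end
      current << cost
    end
    previous = current
  end
  previous.last
end

# Return the candidate closest to the typo (hypothetical helper).
def suggest(typo, candidates)
  candidates.min_by { |candidate| edit_distance(typo, candidate) }
end

# Illustrative command list only; real Git has many more subcommands.
commands = %w[commit checkout clone config push pull]
puts suggest("commmit", commands) # => commit
```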
