
# Going The Distance

Levenshtein Distance and the beauty of algorithms

## Richard Schneeman

September 22, 2014

## Transcript

34. ### \$ git commmit -m first

WARNING: You called a Git command named 'commmit', which does not exist.
Continuing under the assumption that you meant 'commit' in 0.1 seconds automatically...

46. ### Hamming

        def distance(str1, str2)
          cost = 0
          str1.each_char.with_index do |char, index|
            cost += 1 if str2[index] != char
          end
          cost
        end

        zat
        bat
        => Cost = 1
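A minimal sketch of where this position-by-position comparison breaks down. It reuses the slide's logic under the illustrative name `hamming_distance` (not a name from the talk). One extra leading character shifts every later position, so most comparisons mismatch even though a single edit separates the two strings.

```ruby
# Position-by-position comparison, same logic as the slide's `distance`.
# `hamming_distance` is an illustrative name, not one used in the talk.
def hamming_distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    # Characters of str2 beyond str1's length are never looked at.
    cost += 1 if str2[index] != char
  end
  cost
end

puts hamming_distance("zat", "bat")            # strings differ in one position
puts hamming_distance("schneems", "zschneems") # second string is shifted by one
```

The second call reports a much larger cost than the single edit that actually separates the two strings, which is the gap the insertion and deletion steps later in the talk are meant to close.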

80. ### Calculate costs

        l1 = distance(str1, str2[1..-1])          # deletion
        l2 = distance(str1[1..-1], str2)          # insertion
        l3 = distance(str1[1..-1], str2[1..-1])   # substitution

    Take minimum

        cost = 1 + [l1, l2, l3].min

    Why did we add one?

83. ### Match

        distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]

        str1 = "saturday"
        str2 = "sunday"

88. ### Recursive

        def distance(str1, str2)
          # Different lengths
          return str2.length if str1.empty?
          return str1.length if str2.empty?

          # Match: compare only the first characters, then recurse on the rest
          return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]

          l1 = distance(str1, str2[1..-1])          # deletion
          l2 = distance(str1[1..-1], str2)          # insertion
          l3 = distance(str1[1..-1], str2[1..-1])   # substitution
          return 1 + [l1, l2, l3].min               # increment cost
        end

101. ### Maybe we can store substring distances and use them to calculate the total distance
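One way to read "store substring distances" is plain memoization of the recursive version. The sketch below is an assumption about how that could look, not the approach the talk goes on to build (the slides use a matrix). The name `memo_distance` and the `memo` hash are hypothetical.

```ruby
# Recursive edit distance that remembers the result for each pair of
# substrings, so no pair is computed twice.
# `memo_distance` and `memo` are illustrative names, not from the talk.
def memo_distance(str1, str2, memo = {})
  key = [str1, str2]
  return memo[key] if memo.key?(key)

  memo[key] =
    if str1.empty?
      str2.length
    elsif str2.empty?
      str1.length
    elsif str1[0] == str2[0]
      # Match: no cost for the first character, recurse on the rest.
      memo_distance(str1[1..-1], str2[1..-1], memo)
    else
      1 + [
        memo_distance(str1, str2[1..-1], memo),         # deletion
        memo_distance(str1[1..-1], str2, memo),         # insertion
        memo_distance(str1[1..-1], str2[1..-1], memo)   # substitution
      ].min
    end
end

puts memo_distance("saturday", "sunday")
```

The hash is keyed on the pair of remaining substrings, so each pair is solved at most once per top-level call.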

114. ### Matrix

         +---+---+---+---+---+---+---+---+---+
         |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+
         |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+

121. ### Matrix

         +---+---+
         |   |   |
         +---+---+
         |   | 0 |
         +---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+
         | D | 4 |
         +---+---+
         | A | 5 |
         +---+---+
         | Y | 6 |
         +---+---+

123. ### Matrix

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 |
         +---+---+
         | U | 2 |
         +---+---+
         | N | 3 |
         +---+---+
         | D | 4 |
         +---+---+
         | A | 5 |
         +---+---+
         | Y | 6 |
         +---+---+
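The starting matrix above can be sketched in a few lines of Ruby. This assumes `str1` runs along the columns and `str2` down the rows, which is how the later algorithm slide indexes the matrix. `initial_matrix` is a hypothetical helper name.

```ruby
# Build the starting matrix: str1 along the columns, str2 down the rows.
# `initial_matrix` is an illustrative helper name, not from the talk.
def initial_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1) }
  # First row: distance from the empty string to each prefix of str1.
  (0..str1.length).each { |column| matrix[0][column] = column }
  # First column: distance from the empty string to each prefix of str2.
  (0..str2.length).each { |row| matrix[row][0] = row }
  matrix
end

matrix = initial_matrix("saturday", "sunday")
p matrix[0]           # first row
p matrix.map(&:first) # first column
```

Every other cell starts out as `nil` and is filled in by the steps that follow.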

126. ### Match! Cost = 0

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
127. ### Insertion! Cost = ?

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 |   |
         +---+---+---+---+

128. ### Insertion! Cost = 1

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+

131. ### Insertion!

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 |   |
         +---+---+---+---+

         str1 = "schneems"
         str2 = "chneems"
         str1[1..-1] == str2
         matrix[row_index][column_index - 1]

137. ### Insertion! Cost = 1

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+

         str1 = "schneems"
         str2 = "chneems"
         str1[1..-1] == str2
         matrix[row_index][column_index - 1] + 1

139. ### Insertion(s)

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+---+---+

141. ### Action? Calculate?

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 |   |
         +---+---+---+

145. ### Deletion

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 |   |
         +---+---+---+

         str1 = "schneems"
         str2 = "zschneems"
         str1 == str2[1..-1]
         matrix[row_index - 1][column_index]

149. ### Deletion Cost = 1

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+
         | U | 2 | 1 |
         +---+---+---+

         str1 = "schneems"
         str2 = "zschneems"
         str1 == str2[1..-1]
         matrix[row_index - 1][column_index] + 1

154. ### Substitution

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+
         | U | 2 | 1 |   |
         +---+---+---+---+

         str1 = "zchneems"
         str2 = "schneems"
         str1[1..-1] == str2[1..-1]
         matrix[row_index - 1][column_index - 1]

159. ### Substitution Cost = 1

         +---+---+---+---+
         |   |   | S | A |
         +---+---+---+---+
         |   | 0 | 1 | 2 |
         +---+---+---+---+
         | S | 1 | 0 | 1 |
         +---+---+---+---+
         | U | 2 | 1 | 1 |
         +---+---+---+---+

         str1 = "zchneems"
         str2 = "schneems"
         str1[1..-1] == str2[1..-1]
         matrix[row_index - 1][column_index - 1] + 1

162. ### Match!

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+

         str1 = "schneems"
         str2 = "schneems"
         str1[1..-1] == str2[1..-1]
         matrix[row_index - 1][column_index - 1]

164. ### If the current characters match, the cost is the cost of changing one previous substring into the other

165. ### i.e. changing "" to ""

         +---+---+---+
         |   |   | S |
         +---+---+---+
         |   | 0 | 1 |
         +---+---+---+
         | S | 1 | 0 |
         +---+---+---+

167. ### Algorithm

         str2.each_char.each_with_index do |char1, i|
           str1.each_char.each_with_index do |char2, j|
             if char1 == char2
               puts [:skip, matrix[i][j]].inspect
               matrix[i + 1][j + 1] = matrix[i][j]
             else
               actions = {
                 deletion:     matrix[i][j + 1] + 1,
                 insert:       matrix[i + 1][j] + 1,
                 substitution: matrix[i][j] + 1
               }
               action = actions.sort { |(k, v), (k2, v2)| v <=> v2 }.first
               puts action.inspect
               matrix[i + 1][j + 1] = action.last
             end
             each_step.call(matrix) if each_step
           end
         end
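The loop on the slide relies on a `matrix` and an `each_step` callback that are defined elsewhere in the talk. Below is a self-contained sketch of the same fill order, assuming the matrix starts with the first row and column shown on the earlier slides. The method name `levenshtein_matrix` is hypothetical, and the logging and `each_step` hook are left out.

```ruby
# Fill the whole matrix and return it; the distance is the bottom-right cell.
# `levenshtein_matrix` is an illustrative name; the setup is an assumption
# based on the first row and column shown on the earlier slides.
def levenshtein_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1, 0) }
  (0..str1.length).each { |j| matrix[0][j] = j }
  (0..str2.length).each { |i| matrix[i][0] = i }

  str2.each_char.with_index do |char1, i|
    str1.each_char.with_index do |char2, j|
      matrix[i + 1][j + 1] =
        if char1 == char2
          matrix[i][j] # match: carry the diagonal value over unchanged
        else
          [
            matrix[i][j + 1], # deletion: cell above
            matrix[i + 1][j], # insertion: cell to the left
            matrix[i][j]      # substitution: diagonal cell
          ].min + 1
        end
    end
  end
  matrix
end

matrix = levenshtein_matrix("saturday", "sunday")
puts matrix.last.last
```

Taking `min` over the three neighbours gives the same cell value as sorting the `actions` hash on the slide; the sketch only drops the action label.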

170. ### Final Cost

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+---+---+
         | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
         +---+---+---+---+---+---+---+---+---+---+
         | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
         +---+---+---+---+---+---+---+---+---+---+
         | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
         +---+---+---+---+---+---+---+---+---+---+

         3

173. ### Final Cost

         +---+---+---+---+---+---+---+---+---+---+
         |   |   | S | A | T | U | R | D | A | Y |
         +---+---+---+---+---+---+---+---+---+---+
         |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
         +---+---+---+---+---+---+---+---+---+---+
         | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +---+---+---+---+---+---+---+---+---+---+
         | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
         +---+---+---+---+---+---+---+---+---+---+
         | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
         +---+---+---+---+---+---+---+---+---+---+
         | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
         +---+---+---+---+---+---+---+---+---+---+
         | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
         +---+---+---+---+---+---+---+---+---+---+

         2
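To tie the final cost back to the `git commmit` slide, here is a rough sketch of a "did you mean" lookup built on the matrix approach. The command list and the names `edit_distance`, `suggest`, and `COMMANDS` are made up for illustration. This is not how Git itself picks its suggestion.

```ruby
# Edit distance via the matrix approach; returns only the bottom-right cell.
# `edit_distance` is an illustrative name, not from the talk.
def edit_distance(str1, str2)
  # Each row starts with its own index; the rest is filled in below.
  matrix = Array.new(str2.length + 1) { |i| [i] + [0] * str1.length }
  (0..str1.length).each { |j| matrix[0][j] = j }

  str2.each_char.with_index do |char1, i|
    str1.each_char.with_index do |char2, j|
      matrix[i + 1][j + 1] =
        if char1 == char2
          matrix[i][j]
        else
          [matrix[i][j + 1], matrix[i + 1][j], matrix[i][j]].min + 1
        end
    end
  end
  matrix.last.last
end

# Hypothetical command list for illustration only.
COMMANDS = %w[add branch checkout commit merge push].freeze

# Pick the known command with the smallest edit distance to the typo.
def suggest(typo, commands = COMMANDS)
  commands.min_by { |command| edit_distance(typo, command) }
end

puts suggest("commmit")
```

A real tool would likely also cap the distance so that wildly different input produces no suggestion at all; that threshold is left out here.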
