
Going The Distance

Levenshtein Distance and the beauty of algorithms

Richard Schneeman

September 22, 2014

Transcript

  1. Going The Distance @schneems

  2. They Call me @Schneems

  3. Ruby Schneems

  4. Ruby Python

  5. None
  6. Code Triage .com

  7. Challenge: Comment on an issue

  8. Docs Doctor .org

  9. None
  10. Top 50 Rails Contrib

  11. Cats

  12. None
  13. Can you keep Ruby weird?

  14. Can Can

  15. Everyone Can Can

  16. Thank You!

  17. Algorithms:

  18. I went to Georgia Tech for…

  19. Mechanical Engineering

  20. Self taught Programmer ~8 years

  21. CS is boring to me

  22. Programming is interesting

  23. Building programs that accomplish tasks

  24. But…

  25. Those CS students are on to something

  26. Algorithms

  27. Are

  28. Beautiful

  29. <3

  30. Algorithms solve problems

  31. Introducing A Problem…

  32. Spelling

  33. When you are tired or distracted, spelling becomes harder

  34. $ git commmit -m first
      WARNING: You called a Git command named 'commmit', which does not exist.
      Continuing under the assumption that you meant 'commit'
      in 0.1 seconds automatically...
  35. How does Git know?

  36. Introducing: Edit Distance

  37. The “cost” to change one word to another

  38. Less “cost” means a more likely match

  39. Cost of? zat => bat

  40. Cost of? zat => bat 1

  41. Cost of? zzat => bat

  42. Cost of? zzat => bat 2

  43. How do we code it though?

  44. My First Attempt

  45. def distance(str1, str2)
        cost = 0
        str1.each_char.with_index do |char, index|
          cost += 1 if str2[index] != char
        end
        cost
      end

      zat bat =>

  46. def distance(str1, str2)
        cost = 0
        str1.each_char.with_index do |char, index|
          cost += 1 if str2[index] != char
        end
        cost
      end

      zat bat => Cost = 1
  47. Perfect?

  48. saturday sunday Cost = ?

  49. saturday sunday Cost = 7

  50. Wat?

  51. Turns out I almost recreated

  52. Hamming Distance

  53. Hamming AKA: Signal Distance

  54. Measures: the errors introduced in a string Hamming

  55. Only valid for strings of same length Hamming

  56. Good for: Detecting and correcting errors in binary and telecommunications Hamming

  57. Bad for: misspelled words Hamming

  58. Does not include: Insertion, Deletion Hamming
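
The Hamming slides describe a measure that only makes sense for strings of the same length. A minimal Ruby sketch of that idea follows. The method name `hamming_distance` and the choice to raise `ArgumentError` on unequal lengths are assumptions for illustration, not code from the talk.

```ruby
# Hypothetical helper: counts the positions at which two strings differ.
def hamming_distance(str1, str2)
  # Hamming distance is only defined for strings of the same length.
  unless str1.length == str2.length
    raise ArgumentError, "strings must be the same length"
  end

  # Pair up characters by position and count the mismatches.
  str1.chars.zip(str2.chars).count { |a, b| a != b }
end

puts hamming_distance("zat", "bat")         # prints 1
puts hamming_distance("1011101", "1001001") # prints 2
```

Calling it with "saturday" and "sunday" raises instead of returning a misleading number, which is the limitation the slides point at: there is no notion of insertion or deletion here.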

  59. Introducing: An Algorithm!

  60. Introducing: Levenshtein Distance

  61. How do we calculate deletion?

  62. distance("schneems", "zschneems")

  63. distance("schneems", "zschneems") Match! deletion

  64. deletion
      str1 = "schneems"
      str2 = "zschneems"
      str1 == str2[1..-1]

  65. How do we calculate insertion?

  66. distance("schneems", "chneems")

  67. distance("schneems", "chneems") Match! insertion

  68. insertion
      str1 = "schneems"
      str2 = "chneems"
      str1[1..-1] == str2

  69. substitution?

  70. distance("zchneems", "schneems") Substitution

  71. Substitution
      str1 = "zchneems"
      str2 = "schneems"
      str1[1..-1] == str2[1..-1]

  72. How do we calculate Distance?

  73. Pretend we have a distance() method

  74. Strings of different lengths

  75. distance("", "foo") # => 3
      distance("foo", "") # => 3

  76. Different Lengths
      return str2.length if str1.empty?
      return str1.length if str2.empty?

  77. Calculate distance between every substring

  78. Calculate costs
      l1 = distance(str1, str2[1..-1])        # deletion
      l2 = distance(str1[1..-1], str2)        # insertion
      l3 = distance(str1[1..-1], str2[1..-1]) # substitution

  79. Take minimum
      l1 = distance(str1, str2[1..-1])        # deletion
      l2 = distance(str1[1..-1], str2)        # insertion
      l3 = distance(str1[1..-1], str2[1..-1]) # substitution
      cost = 1 + [l1,l2,l3].min

  80. Why did we add one?
      l1 = distance(str1, str2[1..-1])        # deletion
      l2 = distance(str1[1..-1], str2)        # insertion
      l3 = distance(str1[1..-1], str2[1..-1]) # substitution
      cost = 1 + [l1,l2,l3].min
  81. Pick the cheapest operation, then add one

  82. What about when characters match?

  83. Match
      str1 = "saturday"
      str2 = "sunday"
      distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0]
  84. Our powers combined, we form!

  85. None
  86. `

  87. Recursive Levenshtein Distance!

  88. Recursive

      def distance(str1, str2)
        # Different lengths
        return str2.length if str1.empty?
        return str1.length if str2.empty?

        return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

        l1 = distance(str1, str2[1..-1])        # deletion
        l2 = distance(str1[1..-1], str2)        # insertion
        l3 = distance(str1[1..-1], str2[1..-1]) # substitution
        return 1 + [l1,l2,l3].min               # increment cost
      end

  89. distance("saturday", "sunday") # => 3 Recursive Levenshtein

  90. Much better

  91. What does that look like?

  92. None
  93. github.com/schneems/going_the_distance

  94. Hmm…

  95. “Dirty Distance” took 8 comparisons

  96. “Recursive” took 1647 comparisons

  97. No trophy, no flowers, no flashbulbs, no wine,

  98. Ouch

  99. Can we do better?

  100. If you watch the recursive algorithm closely, you notice repeats

  101. Maybe we can store substring distances and use them to calculate the total distance
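
One way to picture "storing substring distances" before moving to the matrix is to cache the results of the recursive method. The sketch below is an assumption for illustration, not the talk's code: the name `memoized_distance` and the Hash cache keyed on the pair of remaining substrings are hypothetical.

```ruby
# Sketch: the recursive approach plus a cache, so each distinct pair of
# remaining substrings is computed at most once.
def memoized_distance(str1, str2, cache = {})
  # Different lengths: the rest must be inserted or deleted.
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  key = [str1, str2]
  return cache[key] if cache.key?(key)

  rest1 = str1[1..-1]
  rest2 = str2[1..-1]

  cache[key] =
    if str1[0] == str2[0]
      memoized_distance(rest1, rest2, cache)   # match: no extra cost
    else
      1 + [
        memoized_distance(str1, rest2, cache), # deletion
        memoized_distance(rest1, str2, cache), # insertion
        memoized_distance(rest1, rest2, cache) # substitution
      ].min
    end
end

puts memoized_distance("saturday", "sunday") # prints 3
```

Because every call works on suffixes of the two inputs, the cache can hold at most one entry per pair of suffixes, which is the same bookkeeping the matrix makes explicit.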
  102. I want you to join the club

  103. A members only club

  104. Matrix: Levenshtein Distance

  105. Matrix: Levenshtein Distance

  106. “” => “saturday” Cost?

  107. Matrix
    +---+---+
    |   | S |
    +---+---+
    |   | 1 |
    +---+---+

  108. Matrix
    +---+---+---+
    |   | S | A |
    +---+---+---+
    |   | 1 | 2 |
    +---+---+---+

  109. Matrix
    +---+---+---+---+
    |   | S | A | T |
    +---+---+---+---+
    |   | 1 | 2 | 3 |
    +---+---+---+---+

  110. Matrix
    +---+---+---+---+---+
    |   | S | A | T | U |
    +---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 |
    +---+---+---+---+---+

  111. Matrix
    +---+---+---+---+---+---+
    |   | S | A | T | U | R |
    +---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 |
    +---+---+---+---+---+---+

  112. Matrix
    +---+---+---+---+---+---+---+
    |   | S | A | T | U | R | D |
    +---+---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+

  113. Matrix
    +---+---+---+---+---+---+---+---+
    |   | S | A | T | U | R | D | A |
    +---+---+---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+

  114. Matrix
    +---+---+---+---+---+---+---+---+---+
    |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+
  115. “sunday” => “” Cost?

  116. Matrix
    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+

  117. Matrix
    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+

  118. Matrix
    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+

  119. Matrix
    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+

  120. Matrix
    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
    | A | 5 |
    +---+---+

  121. Matrix
    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
    | A | 5 |
    +---+---+
    | Y | 6 |
    +---+---+
  122. Now, fill in the matrix

  123. Matrix
    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
    | A | 5 |
    +---+---+
    | Y | 6 |
    +---+---+
  124. Break it down

  125. How much does it cost to change “s” into “s”?

  126. Match! Cost = 0
    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+

  127. Insertion! Cost = ?
    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 |   |
    +---+---+---+---+

  128. Insertion! Cost = 1
    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+
  129. How do we calculate insertion programmatically?

  130. Insertion!
      str1 = "schneems"
      str2 = "chneems"
      str1[1..-1] == str2

  131. +---+---+---+---+ | | | S | A | +---+---+---+---+ |

    | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1] Insertion!
  132. +---+---+---+---+ | | | S | A | +---+---+---+---+ |

    | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
  133. +---+---+---+---+ | | | S | A | +---+---+---+---+ |

    | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
  134. +---+---+---+---+ | | | S | A | +---+---+---+---+ |

    | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | | +---+---+---+---+ Insertion! str1 = “schneems” str2 = “chneems” str1[1..-1] == str2 matrix[row_index][column_index - 1]
  135. + cost of change (+1)

  136. Insertion!
    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 |   |
    +---+---+---+---+
    str1 = "schneems"
    str2 = "chneems"
    str1[1..-1] == str2
    matrix[row_index][column_index - 1] + 1

  137. Insertion! Cost = 1
    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+
    str1 = "schneems"
    str2 = "chneems"
    str1[1..-1] == str2
    matrix[row_index][column_index - 1] + 1
  138. Keep Going

  139. Insertion(s)
    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+---+---+
  140. Next Char “su” => “s”

  141. Action?
    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    | U | 2 |   |
    +---+---+---+
  142. Changing “su” to “s” is a deletion. How do we calculate it?
  143. Deletion

  144. deletion
      str1 = "schneems"
      str2 = "zschneems"
      str1 == str2[1..-1]

  145. +---+---+---+ | | | S | +---+---+---+ | | 0

    | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
  146. +---+---+---+ | | | S | +---+---+---+ | | 0

    | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
  147. +---+---+---+ | | | S | +---+---+---+ | | 0

    | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index]
  148. +---+---+---+ | | | S | +---+---+---+ | | 0

    | 1 | +---+---+---+ | S | 1 | 0 | +---+---+---+ | U | 2 | | +---+---+---+ Deletion str1 = “schneems” str2 = “zschneems” str1 == str2[1..-1] matrix[row_index - 1][column_index] + 1
  149. Deletion Cost = 1
    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    | U | 2 | 1 |
    +---+---+---+
    str1 = "schneems"
    str2 = "zschneems"
    str1 == str2[1..-1]
    matrix[row_index - 1][column_index] + 1
  150. - Insertion - Deletion - Substitution

  151. - Insertion - Deletion - Substitution

  152. Substitution

  153. Substitution!
      str1 = "zchneems"
      str2 = "schneems"
      str1[1..-1] == str2[1..-1]

  154. +---+---+---+---+ | | | S | A | +---+---+---+---+ |

    | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
  155. +---+---+---+---+ | | | S | A | +---+---+---+---+ |

    | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index - 1][column_index- 1]
  156. str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index

    - 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
  157. str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index

    - 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
  158. str1 = “zchneems” str2 = “schneems” str1[1..-1] == str2[1..-1] matrix[row_index

    - 1][column_index- 1] +---+---+---+---+ | | | S | A | +---+---+---+---+ | | 0 | 1 | 2 | +---+---+---+---+ | S | 1 | 0 | 1 | +---+---+---+---+ | U | 2 | 1 | | +---+---+---+---+ Substitution
  159. Substitution Cost = 1
    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+
    | U | 2 | 1 | 1 |
    +---+---+---+---+
    str1 = "zchneems"
    str2 = "schneems"
    str1[1..-1] == str2[1..-1]
    matrix[row_index - 1][column_index - 1] + 1
  160. - Insertion - Deletion - Substitution

  161. What about match?

  162. Match!
    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    str1 = "schneems"
    str2 = "schneems"
    str1[1..-1] == str2[1..-1]
    matrix[row_index - 1][column_index - 1]
  163. Why?

  164. If the current characters match, the cost is whatever it cost to change the previous substring into the previous substring
  165. i.e. changing “” to “”
    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
  166. Now What?

  167. Algorithm
      str2.each_char.each_with_index do |char1, i|
        str1.each_char.each_with_index do |char2, j|
          if char1 == char2
            puts [:skip, matrix[i][j]].inspect
            matrix[i + 1][j + 1] = matrix[i][j]
          else
            actions = {
              deletion:     matrix[i][j + 1] + 1,
              insert:       matrix[i + 1][j] + 1,
              substitution: matrix[i][j] + 1
            }
            action = actions.sort {|(k,v), (k2, v2)| v <=> v2 }.first
            puts action.inspect
            matrix[i + 1][j + 1] = action.last
          end
          each_step.call(matrix) if each_step
        end
      end
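
The loop on the Algorithm slide relies on a `matrix` and an `each_step` callback that are set up elsewhere. A self-contained sketch that fills in those pieces might look like the following. The method name `matrix_distance` is hypothetical, the initialization follows the first row and first column shown on the earlier matrix slides, and an optional block stands in for `each_step`.

```ruby
# Hypothetical wrapper around the loop from the slide. Rows follow str2 and
# columns follow str1, matching the matrices drawn in the talk.
def matrix_distance(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1, 0) }
  (0..str1.length).each { |j| matrix[0][j] = j } # first row: 0, 1, 2, ...
  (0..str2.length).each { |i| matrix[i][0] = i } # first column: 0, 1, 2, ...

  str2.each_char.with_index do |char1, i|
    str1.each_char.with_index do |char2, j|
      matrix[i + 1][j + 1] =
        if char1 == char2
          matrix[i][j]        # match: carry the diagonal cost over
        else
          [
            matrix[i][j + 1], # deletion (cell above)
            matrix[i + 1][j], # insertion (cell to the left)
            matrix[i][j]      # substitution (diagonal cell)
          ].min + 1
        end
      yield matrix if block_given? # stands in for the slide's each_step
    end
  end

  matrix.last.last # bottom-right cell holds the total cost
end

puts matrix_distance("saturday", "sunday") # prints 3

steps = 0
matrix_distance("saturday", "sunday") { steps += 1 }
puts steps # prints 48, one step per cell (6 rows x 8 columns)
```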
  168. Iterate!

  169. None
  170. Final Cost
    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+---+---+
    | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
    +---+---+---+---+---+---+---+---+---+---+
    | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
    +---+---+---+---+---+---+---+---+---+---+
    | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
    +---+---+---+---+---+---+---+---+---+---+
    3
  171. We can also get cost of sub strings

  172. “sun” => “sat”

  173. Final Cost
    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+---+---+
    | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
    +---+---+---+---+---+---+---+---+---+---+
    | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
    +---+---+---+---+---+---+---+---+---+---+
    | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
    +---+---+---+---+---+---+---+---+---+---+
    2
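
Reading a substring cost out of the finished matrix is just an index lookup. The sketch below copies the final matrix from the slide into a Ruby array. The constant `FINAL_MATRIX` and the helper `prefix_cost` are hypothetical names used only for illustration.

```ruby
# Final matrix copied from the slide. Rows follow "sunday", columns follow
# "saturday"; row 0 and column 0 stand for the empty string.
FINAL_MATRIX = [
  [0, 1, 2, 3, 4, 5, 6, 7, 8],
  [1, 0, 1, 2, 3, 4, 5, 6, 7],
  [2, 1, 1, 2, 2, 3, 4, 5, 6],
  [3, 2, 2, 2, 3, 3, 4, 5, 6],
  [4, 3, 3, 3, 3, 4, 3, 4, 5],
  [5, 4, 3, 4, 4, 4, 4, 3, 4],
  [6, 5, 4, 4, 5, 5, 5, 4, 3]
].freeze

# Hypothetical helper: cost between the first row_len characters of the row
# word and the first col_len characters of the column word.
def prefix_cost(matrix, row_len, col_len)
  matrix[row_len][col_len]
end

puts prefix_cost(FINAL_MATRIX, "sun".length, "sat".length)         # prints 2
puts prefix_cost(FINAL_MATRIX, "sunday".length, "saturday".length) # prints 3
```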
  174. Better than Recursive?

  175. As they speed through the finish, the flags go down.

  176. 48 iterations

  177. Wayyyyyy better than 1647

  178. bit.ly/going_the_distance

  179. My Problem

  180. I am human

  181. I get tired

  182. Machines don’t understand tired

  183. One day I tried typing

  184. $ rails generate migration

  185. But accidentally typed

  186. $ rails generate migratoon

  187. ERROR

  188. None
  189. Stress is increased when we fail at simple tasks

  190. Why? It’s not hard

  191. Why can’t my software be more like git?

  192. We know what you’re trying to accomplish. Let’s help you out
  193. None
  194. When you have ERROR

  195. Compare given command to possible commands

  196. Recommend smallest distance.
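
A rough sketch of "compare the given command to possible commands and recommend the smallest distance" could look like this. Everything here is an assumption for illustration: `levenshtein` is a compact two-row variant of the matrix approach rather than the talk's code, `suggest` is a hypothetical helper, and the `GENERATORS` list is a made-up sample, not the real set of Rails generators.

```ruby
# Compact two-row variant of the matrix approach (keeps only the previous row).
def levenshtein(str1, str2)
  previous = (0..str2.length).to_a
  str1.each_char.with_index(1) do |char1, i|
    current = [i]
    str2.each_char.with_index(1) do |char2, j|
      cost =
        if char1 == char2
          previous[j - 1]
        else
          [previous[j], current[j - 1], previous[j - 1]].min + 1
        end
      current << cost
    end
    previous = current
  end
  previous.last
end

# Hypothetical helper: returns the candidate with the smallest edit distance.
def suggest(input, candidates)
  candidates.min_by { |candidate| levenshtein(input, candidate) }
end

GENERATORS = %w[controller migration model scaffold].freeze # sample list only

puts suggest("migratoon", GENERATORS) # prints migration
```

A real tool would probably also want a cutoff, so that input far from every known command produces no suggestion at all.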

  197. Google:

  198. Google:

  199. Read a lot of words from real books

  200. ~1+ million words

  201. Count Each word

  202. Higher count, higher probability
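
The counting step can be sketched in a few lines. This is a toy illustration under stated assumptions: `word_counts` and `probability` are hypothetical helper names, and the one-sentence sample stands in for the roughly one million words mentioned on the slides.

```ruby
# Hypothetical helper: counts how often each word appears in a text.
def word_counts(text)
  text.downcase.scan(/[a-z']+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Hypothetical helper: share of all counted words that are this word.
def probability(word, counts)
  total = counts.values.sum
  total.zero? ? 0.0 : counts[word].to_f / total
end

sample = "the cat sat on the mat the end" # stand-in for a real corpus
counts = word_counts(sample)

puts counts["the"]              # prints 3
puts probability("the", counts) # prints 0.375
puts probability("dog", counts) # prints 0.0
```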

  203. Get edit distance between input and dictionary

  204. Lower edit, higher probability

  205. Show Suggestion

  206. None
  207. Cache correct spelling suggestions

  208. did_you_mean gem

  209. None
  210. More distance measurements

  211. Levenshtein Distance

  212. Hamming Distance

  213. Longest Common Subsequence Distance

  214. Manhattan Distance

  215. Tree Distance

  216. Jaro-Winkler Distance

  217. Many Many More

  218. Algorithms are Awesome

  219. Where to go next?

  220. I want to learn more about algorithms?

  221. Wikipedia!

  222. Rosetta code

  223. Give an algorithm talk

  224. Everyone suggests a new Algorithm!

  225. Algorithms are a way of sharing knowledge

  226. Expand your knowledge Explore Algorithms

  227. Antepenultimate Slide

  228. YAY!

  229. Questions @schneems

  230. Going The Distance