Going The Distance

Levenshtein Distance and the beauty of algorithms

Richard Schneeman

September 22, 2014

Transcript

  1. 5.
  2. 9.
  3. 11.
  4. 12.
  5. 14.
  6. 24.
  7. 27.

    Are

  8. 28.
  9. 29.

    <3

  10. 32.
  11. 34.

    $ git commmit -m first
    WARNING: You called a Git command named 'commmit', which does not exist.
    Continuing under the assumption that you meant 'commit'
    in 0.1 seconds automatically...
  12. 45.

    def distance(str1, str2)
      cost = 0
      str1.each_char.with_index do |char, index|
        cost += 1 if str2[index] != char
      end
      cost
    end

    zat
    bat
    =>
  13. 46.

    def distance(str1, str2)
      cost = 0
      str1.each_char.with_index do |char, index|
        cost += 1 if str2[index] != char
      end
      cost
    end

    zat
    bat
    => Cost = 1
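The position-by-position comparison on this slide can be run as-is. The sketch below renames it `positional_distance` (a hypothetical name, chosen only to keep it apart from the later versions) and tries it on the "schneems" strings that appear later in the deck. It suggests why strings of different lengths need more care: one extra leading letter shifts every position, and trailing characters in `str2` are never inspected.

```ruby
# Position-by-position comparison, as on the slide.
# `positional_distance` is a hypothetical name used only in this sketch.
def positional_distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    # Count every index where the two strings disagree.
    cost += 1 if str2[index] != char
  end
  cost
end

puts positional_distance("zat", "bat")            # one position differs
puts positional_distance("schneems", "zschneems") # one extra leading letter shifts most positions
puts positional_distance("bat", "bats")           # the trailing "s" in str2 is never looked at
```

A single added letter is intuitively one edit, yet this version reports a much larger cost for it, and reports no cost at all for a trailing addition.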
  14. 47.
  15. 50.
  16. 78.

    Calculate costs

    l1 = distance(str1, str2[1..-1])        # deletion
    l2 = distance(str1[1..-1], str2)        # insertion
    l3 = distance(str1[1..-1], str2[1..-1]) # substitution
  17. 79.

    Take minimum

    l1 = distance(str1, str2[1..-1])        # deletion
    l2 = distance(str1[1..-1], str2)        # insertion
    l3 = distance(str1[1..-1], str2[1..-1]) # substitution
    cost = 1 + [l1,l2,l3].min
  18. 80.

    Why did we add one?

    l1 = distance(str1, str2[1..-1])        # deletion
    l2 = distance(str1[1..-1], str2)        # insertion
    l3 = distance(str1[1..-1], str2[1..-1]) # substitution
    cost = 1 + [l1,l2,l3].min
  19. 85.
  20. 86.


  21. 88.

    Recursive

    def distance(str1, str2)
      # Different lengths
      return str2.length if str1.empty?
      return str1.length if str2.empty?

      return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

      l1 = distance(str1, str2[1..-1])        # deletion
      l2 = distance(str1[1..-1], str2)        # insertion
      l3 = distance(str1[1..-1], str2[1..-1]) # substitution
      return 1 + [l1,l2,l3].min               # increment cost
    end
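The recursive definition recomputes the same suffix pairs many times. A minimal sketch of one way to avoid that, assuming a cache keyed on the two remaining strings, follows. The name `memo_distance` and the `memo` hash are illustrative additions, not part of the slides. The deletion, insertion and substitution labels follow the slide's convention.

```ruby
# Memoized variant of the recursive definition on the slide.
# `memo_distance` and `memo` are hypothetical names for this sketch.
def memo_distance(str1, str2, memo = {})
  # Different lengths: the remaining characters all have to be added or removed.
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  key = [str1, str2]
  return memo[key] if memo.key?(key)

  rest1 = str1[1..-1]
  rest2 = str2[1..-1]

  memo[key] =
    if str1[0] == str2[0]
      memo_distance(rest1, rest2, memo)      # match, no extra cost
    else
      l1 = memo_distance(str1, rest2, memo)  # deletion
      l2 = memo_distance(rest1, str2, memo)  # insertion
      l3 = memo_distance(rest1, rest2, memo) # substitution
      1 + [l1, l2, l3].min                   # increment cost
    end
end

puts memo_distance("saturday", "sunday")
puts memo_distance("schneems", "zschneems")
```

Each distinct pair of suffixes is now solved once, which is the same idea the matrix on the following slides makes explicit.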
  22. 92.
  23. 94.
  24. 98.
  25. 107.
  26. 108.

    Matrix

    +---+---+---+
    |   | S | A |
    +---+---+---+
    |   | 1 | 2 |
    +---+---+---+
  27. 109.

    Matrix

    +---+---+---+---+
    |   | S | A | T |
    +---+---+---+---+
    |   | 1 | 2 | 3 |
    +---+---+---+---+
  28. 110.

    Matrix

    +---+---+---+---+---+
    |   | S | A | T | U |
    +---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 |
    +---+---+---+---+---+
  29. 111.

    Matrix

    +---+---+---+---+---+---+
    |   | S | A | T | U | R |
    +---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 |
    +---+---+---+---+---+---+
  30. 112.

    Matrix

    +---+---+---+---+---+---+---+
    |   | S | A | T | U | R | D |
    +---+---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+
  31. 113.

    Matrix

    +---+---+---+---+---+---+---+---+
    |   | S | A | T | U | R | D | A |
    +---+---+---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+
  32. 114.

    Matrix

    +---+---+---+---+---+---+---+---+---+
    |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+
    |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+
  33. 116.

    Matrix

    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
  34. 117.

    Matrix

    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
  35. 118.

    Matrix

    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
  36. 119.

    Matrix

    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
  37. 120.

    Matrix

    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
    | A | 5 |
    +---+---+
  38. 121.

    Matrix

    +---+---+
    |   |   |
    +---+---+
    |   | 0 |
    +---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
    | A | 5 |
    +---+---+
    | Y | 6 |
    +---+---+
  39. 123.

    Matrix

    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 |
    +---+---+
    | U | 2 |
    +---+---+
    | N | 3 |
    +---+---+
    | D | 4 |
    +---+---+
    | A | 5 |
    +---+---+
    | Y | 6 |
    +---+---+
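The starting grid on this slide, with the top row counting up along "saturday" and the left column counting up along "sunday", can be sketched in a few lines. The helper name `build_matrix` is an assumption made for illustration. Cells that have not been computed yet are left as `nil`.

```ruby
# Build the starting grid: one column per character of str1 (plus one),
# one row per character of str2 (plus one).
# `build_matrix` is a hypothetical helper name for this sketch.
def build_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1) }

  # Top row: cost of reaching each prefix of str1 from the empty string.
  (0..str1.length).each { |col| matrix[0][col] = col }

  # Left column: cost of reaching each prefix of str2 from the empty string.
  (0..str2.length).each { |row| matrix[row][0] = row }

  matrix
end

matrix = build_matrix("saturday", "sunday")
puts matrix[0].inspect            # top row
puts matrix.map(&:first).inspect  # left column
```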
  40. 126.

    Match! Cost = 0

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
  41. 127.

    Insertion! Cost = ?

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 |   |
    +---+---+---+---+
  42. 128.

    Insertion! Cost = 1

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+
  43. 131.

    Insertion!

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 |   |
    +---+---+---+---+

    str1 = "schneems"
    str2 = "chneems"
    str1[1..-1] == str2

    matrix[row_index][column_index - 1]
  44. 132.
  45. 133.
  46. 134.
  47. 136.

    Insertion!

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 |   |
    +---+---+---+---+

    str1 = "schneems"
    str2 = "chneems"
    str1[1..-1] == str2

    matrix[row_index][column_index - 1] + 1
  48. 137.

    Insertion! Cost = 1

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+

    str1 = "schneems"
    str2 = "chneems"
    str1[1..-1] == str2

    matrix[row_index][column_index - 1] + 1
  49. 138.
  50. 139.

    Insertion(s)

    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+---+---+
  51. 141.

    Action?

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    | U | 2 |   |
    +---+---+---+
  52. 143.
  53. 145.

    Deletion

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    | U | 2 |   |
    +---+---+---+

    str1 = "schneems"
    str2 = "zschneems"
    str1 == str2[1..-1]

    matrix[row_index - 1][column_index]
  54. 146.
  55. 147.
  56. 148.

    Deletion

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    | U | 2 |   |
    +---+---+---+

    str1 = "schneems"
    str2 = "zschneems"
    str1 == str2[1..-1]

    matrix[row_index - 1][column_index] + 1
  57. 149.

    Deletion Cost = 1

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
    | U | 2 | 1 |
    +---+---+---+

    str1 = "schneems"
    str2 = "zschneems"
    str1 == str2[1..-1]

    matrix[row_index - 1][column_index] + 1
  58. 154.

    Substitution

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+
    | U | 2 | 1 |   |
    +---+---+---+---+

    str1 = "zchneems"
    str2 = "schneems"
    str1[1..-1] == str2[1..-1]

    matrix[row_index - 1][column_index - 1]
  59. 155.
  60. 156.
  61. 157.
  62. 158.
  63. 159.

    Substitution Cost = 1

    +---+---+---+---+
    |   |   | S | A |
    +---+---+---+---+
    |   | 0 | 1 | 2 |
    +---+---+---+---+
    | S | 1 | 0 | 1 |
    +---+---+---+---+
    | U | 2 | 1 | 1 |
    +---+---+---+---+

    str1 = "zchneems"
    str2 = "schneems"
    str1[1..-1] == str2[1..-1]

    matrix[row_index - 1][column_index - 1] + 1
  64. 162.

    Match!

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+

    str1 = "schneems"
    str2 = "schneems"
    str1[1..-1] == str2[1..-1]

    matrix[row_index - 1][column_index - 1]
  65. 163.
  66. 164.
  67. 165.

    i.e. changing "" to ""

    +---+---+---+
    |   |   | S |
    +---+---+---+
    |   | 0 | 1 |
    +---+---+---+
    | S | 1 | 0 |
    +---+---+---+
  68. 166.
  69. 167.

    Algorithm

    str2.each_char.each_with_index do |char1, i|
      str1.each_char.each_with_index do |char2, j|
        if char1 == char2
          puts [:skip, matrix[i][j]].inspect
          matrix[i + 1][j + 1] = matrix[i][j]
        else
          actions = {
            deletion:     matrix[i][j + 1] + 1,
            insert:       matrix[i + 1][j] + 1,
            substitution: matrix[i][j] + 1
          }
          action = actions.sort { |(k, v), (k2, v2)| v <=> v2 }.first
          puts action.inspect
          matrix[i + 1][j + 1] = action.last
        end
        each_step.call(matrix) if each_step
      end
    end
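The loop on this slide relies on a `matrix` and an `each_step` callback defined elsewhere in the talk. A self-contained sketch that sets up the grid and runs the same fill, without the logging or the callback, could look like the following. The function name `levenshtein_matrix` is an assumption made for this example.

```ruby
# Fill the whole grid and return it.
# `levenshtein_matrix` is a hypothetical name for this sketch.
def levenshtein_matrix(str1, str2)
  # Rows follow str2, columns follow str1, as on the slides.
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1, 0) }
  (0..str1.length).each { |col| matrix[0][col] = col }
  (0..str2.length).each { |row| matrix[row][0] = row }

  str2.each_char.with_index do |char1, i|
    str1.each_char.with_index do |char2, j|
      matrix[i + 1][j + 1] =
        if char1 == char2
          matrix[i][j] # match: carry the diagonal value over unchanged
        else
          deletion     = matrix[i][j + 1] # cell above
          insertion    = matrix[i + 1][j] # cell to the left
          substitution = matrix[i][j]     # diagonal cell
          [deletion, insertion, substitution].min + 1
        end
    end
  end

  matrix
end

matrix = levenshtein_matrix("saturday", "sunday")
puts matrix.last.last # the bottom-right cell holds the final cost
```

Taking the minimum directly replaces the slide's sort over the `actions` hash. The resulting numbers are the same, but the name of the chosen action is no longer available for printing.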
  70. 168.
  71. 169.
  72. 170.

    Final Cost

    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+---+---+
    | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
    +---+---+---+---+---+---+---+---+---+---+
    | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
    +---+---+---+---+---+---+---+---+---+---+
    | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
    +---+---+---+---+---+---+---+---+---+---+

    3
  73. 173.

    Final Cost

    +---+---+---+---+---+---+---+---+---+---+
    |   |   | S | A | T | U | R | D | A | Y |
    +---+---+---+---+---+---+---+---+---+---+
    |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
    +---+---+---+---+---+---+---+---+---+---+
    | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
    +---+---+---+---+---+---+---+---+---+---+
    | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
    +---+---+---+---+---+---+---+---+---+---+
    | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
    +---+---+---+---+---+---+---+---+---+---+
    | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
    +---+---+---+---+---+---+---+---+---+---+
    | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
    +---+---+---+---+---+---+---+---+---+---+

    2
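With a final cost available for any pair of strings, the 'commmit' typo from earlier in the deck can be revisited. The sketch below is illustrative only and makes no claim about how Git itself picks its suggestion. The command list and both method names are hypothetical. Because only the bottom-right cell is needed, it keeps just the previous row of the grid rather than the whole matrix, which is a swapped-in variation on the slides' approach.

```ruby
# Same fill as the matrix, but only the previous row is kept.
# `two_row_distance` and `closest_command` are hypothetical names.
def two_row_distance(str1, str2)
  previous = (0..str1.length).to_a # top row of the grid

  str2.each_char.with_index do |char2, i|
    current = [i + 1] # left column value for this row
    str1.each_char.with_index do |char1, j|
      cell =
        if char1 == char2
          previous[j] # match: diagonal value
        else
          [previous[j + 1], current[j], previous[j]].min + 1
        end
      current << cell
    end
    previous = current
  end

  previous.last
end

# Pick the candidate with the smallest distance to the typo.
def closest_command(typo, commands)
  commands.min_by { |command| two_row_distance(typo, command) }
end

commands = %w[add branch checkout commit merge push] # illustrative list only
puts closest_command("commmit", commands)
```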
  74. 179.
  75. 180.
  76. 187.
  77. 188.
  78. 193.
  79. 197.
  80. 198.
  81. 206.
  82. 209.
  83. 221.
  84. 228.