
Going The Distance

Levenshtein Distance and the beauty of algorithms

Richard Schneeman

September 22, 2014

Transcript

  1. Are

  2. <3

  3. $ git commmit -m first
     WARNING: You called a Git command named 'commmit', which does not exist.
     Continuing under the assumption that you meant 'commit'
     in 0.1 seconds automatically...
  4. def distance(str1, str2)
       cost = 0
       str1.each_char.with_index do |char, index|
         cost += 1 if str2[index] != char
       end
       cost
     end

     zat bat =>
  5. def distance(str1, str2)
       cost = 0
       str1.each_char.with_index do |char, index|
         cost += 1 if str2[index] != char
       end
       cost
     end

     zat bat => Cost = 1
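
The position-by-position comparison above handles a single substitution, but it appears to break down as soon as a character is added or removed. The following is a minimal sketch of that limitation; the name `positional_distance` is an assumption made here to keep it apart from the later versions of `distance`.

```ruby
# Position-by-position comparison, as on the slide above.
# `positional_distance` is a hypothetical name used only for this sketch.
def positional_distance(str1, str2)
  cost = 0
  str1.each_char.with_index do |char, index|
    cost += 1 if str2[index] != char
  end
  cost
end

puts positional_distance("zat", "bat")            # => 1, one substitution
puts positional_distance("schneems", "zschneems") # => 7, though only one character was added
puts positional_distance("bat", "bats")           # => 0, the extra trailing "s" is never visited
```

One extra leading character shifts every later position, so almost every comparison fails, and characters past the end of `str1` are ignored entirely.
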
  6. l1 = distance(str1, str2[1..-1])        # deletion
     l2 = distance(str1[1..-1], str2)        # insertion
     l3 = distance(str1[1..-1], str2[1..-1]) # substitution

     Calculate costs
  7. l1 = distance(str1, str2[1..-1])        # deletion
     l2 = distance(str1[1..-1], str2)        # insertion
     l3 = distance(str1[1..-1], str2[1..-1]) # substitution
     cost = 1 + [l1,l2,l3].min

     Take minimum
  8. l1 = distance(str1, str2[1..-1])        # deletion
     l2 = distance(str1[1..-1], str2)        # insertion
     l3 = distance(str1[1..-1], str2[1..-1]) # substitution
     cost = 1 + [l1,l2,l3].min

     Why did we add one?
  9. `

  10. def distance(str1, str2)
        # Different lengths
        return str2.length if str1.empty?
        return str1.length if str2.empty?

        return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

        l1 = distance(str1, str2[1..-1])        # deletion
        l2 = distance(str1[1..-1], str2)        # insertion
        l3 = distance(str1[1..-1], str2[1..-1]) # substitution
        return 1 + [l1,l2,l3].min               # increment cost
      end

      Recursive
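
With the recursive version in hand, a suggestion like the git warning from earlier can be approximated by picking the known command with the smallest distance to what was typed. This is only a sketch: the command list and the `closest_command` helper are assumptions for illustration, and the talk does not show how git itself chooses its suggestion.

```ruby
# Recursive distance, as on the slide above.
def distance(str1, str2)
  return str2.length if str1.empty?
  return str1.length if str2.empty?

  return distance(str1[1..-1], str2[1..-1]) if str1[0] == str2[0] # match

  l1 = distance(str1, str2[1..-1])        # deletion
  l2 = distance(str1[1..-1], str2)        # insertion
  l3 = distance(str1[1..-1], str2[1..-1]) # substitution
  1 + [l1, l2, l3].min                    # increment cost
end

# Hypothetical command list, for illustration only.
COMMANDS = %w[add commit push pull status].freeze

# Return the known command with the smallest distance to `typo`.
def closest_command(typo, commands = COMMANDS)
  commands.min_by { |command| distance(typo, command) }
end

puts closest_command("commmit") # => commit
```

The recursion revisits the same substring pairs many times, so this form is only practical for short strings like command names.
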
  11. +---+---+---+
      |   | S | A |
      +---+---+---+
      |   | 1 | 2 |
      +---+---+---+
      Matrix
  12. +---+---+---+---+
      |   | S | A | T |
      +---+---+---+---+
      |   | 1 | 2 | 3 |
      +---+---+---+---+
      Matrix
  13. +---+---+---+---+---+
      |   | S | A | T | U |
      +---+---+---+---+---+
      |   | 1 | 2 | 3 | 4 |
      +---+---+---+---+---+
      Matrix
  14. +---+---+---+---+---+---+
      |   | S | A | T | U | R |
      +---+---+---+---+---+---+
      |   | 1 | 2 | 3 | 4 | 5 |
      +---+---+---+---+---+---+
      Matrix
  15. +---+---+---+---+---+---+---+
      |   | S | A | T | U | R | D |
      +---+---+---+---+---+---+---+
      |   | 1 | 2 | 3 | 4 | 5 | 6 |
      +---+---+---+---+---+---+---+
      Matrix
  16. +---+---+---+---+---+---+---+---+
      |   | S | A | T | U | R | D | A |
      +---+---+---+---+---+---+---+---+
      |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
      +---+---+---+---+---+---+---+---+
      Matrix
  17. +---+---+---+---+---+---+---+---+---+
      |   | S | A | T | U | R | D | A | Y |
      +---+---+---+---+---+---+---+---+---+
      |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
      +---+---+---+---+---+---+---+---+---+
      Matrix
  18. +---+---+
      |   |   |
      +---+---+
      |   | 0 |
      +---+---+
      | S | 1 |
      +---+---+
      Matrix
  19. +---+---+
      |   |   |
      +---+---+
      |   | 0 |
      +---+---+
      | S | 1 |
      +---+---+
      | U | 2 |
      +---+---+
      Matrix
  20. +---+---+
      |   |   |
      +---+---+
      |   | 0 |
      +---+---+
      | S | 1 |
      +---+---+
      | U | 2 |
      +---+---+
      | N | 3 |
      +---+---+
      Matrix
  21. +---+---+
      |   |   |
      +---+---+
      |   | 0 |
      +---+---+
      | S | 1 |
      +---+---+
      | U | 2 |
      +---+---+
      | N | 3 |
      +---+---+
      | D | 4 |
      +---+---+
      Matrix
  22. +---+---+
      |   |   |
      +---+---+
      |   | 0 |
      +---+---+
      | S | 1 |
      +---+---+
      | U | 2 |
      +---+---+
      | N | 3 |
      +---+---+
      | D | 4 |
      +---+---+
      | A | 5 |
      +---+---+
      Matrix
  23. +---+---+
      |   |   |
      +---+---+
      |   | 0 |
      +---+---+
      | S | 1 |
      +---+---+
      | U | 2 |
      +---+---+
      | N | 3 |
      +---+---+
      | D | 4 |
      +---+---+
      | A | 5 |
      +---+---+
      | Y | 6 |
      +---+---+
      Matrix
  24. +---+---+---+---+---+---+---+---+---+---+
      |   |   | S | A | T | U | R | D | A | Y |
      +---+---+---+---+---+---+---+---+---+---+
      |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
      +---+---+---+---+---+---+---+---+---+---+
      | S | 1 |
      +---+---+
      | U | 2 |
      +---+---+
      | N | 3 |
      +---+---+
      | D | 4 |
      +---+---+
      | A | 5 |
      +---+---+
      | Y | 6 |
      +---+---+
      Matrix
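
The starting matrix above can be built in a few lines. The sketch below assumes the columns follow the first string and the rows follow the second, as in the SATURDAY/SUNDAY layout on the slides; `initial_matrix` is a hypothetical helper name, not something shown in the talk.

```ruby
# Build the starting matrix: rows follow str2, columns follow str1.
# `initial_matrix` is a hypothetical helper name used only for this sketch.
def initial_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1) }
  (0..str1.length).each { |column| matrix[0][column] = column } # top row: 0, 1, 2, ...
  (0..str2.length).each { |row| matrix[row][0] = row }          # left column: 0, 1, 2, ...
  matrix
end

matrix = initial_matrix("saturday", "sunday")
p matrix[0]           # => [0, 1, 2, 3, 4, 5, 6, 7, 8]
p matrix.map(&:first) # => [0, 1, 2, 3, 4, 5, 6]
```

The top row and left column are the costs of building each prefix from an empty string; every other cell starts out as `nil` and is filled in later.
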
  25. +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
      Match! Cost = 0
  26. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 |   |
      +---+---+---+---+
      Insertion! Cost = ?
  27. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 | 1 |
      +---+---+---+---+
      Insertion! Cost = 1
  28. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 |   |
      +---+---+---+---+
      str1 = "schneems"
      str2 = "chneems"
      str1[1..-1] == str2
      matrix[row_index][column_index - 1]
      Insertion!
  29. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 |   |
      +---+---+---+---+
      Insertion!
      str1 = "schneems"
      str2 = "chneems"
      str1[1..-1] == str2
      matrix[row_index][column_index - 1]
  32. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 |   |
      +---+---+---+---+
      Insertion!
      str1 = "schneems"
      str2 = "chneems"
      str1[1..-1] == str2
      matrix[row_index][column_index - 1] + 1
  33. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 | 1 |
      +---+---+---+---+
      Insertion! Cost = 1
      str1 = "schneems"
      str2 = "chneems"
      str1[1..-1] == str2
      matrix[row_index][column_index - 1] + 1
  34. +---+---+---+---+---+---+---+---+---+---+
      |   |   | S | A | T | U | R | D | A | Y |
      +---+---+---+---+---+---+---+---+---+---+
      |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
      +---+---+---+---+---+---+---+---+---+---+
      | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
      +---+---+---+---+---+---+---+---+---+---+
      Insertion(s)
  35. +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
      | U | 2 |   |
      +---+---+---+
      Action?
  36. +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
      | U | 2 |   |
      +---+---+---+
      Deletion
      str1 = "schneems"
      str2 = "zschneems"
      str1 == str2[1..-1]
      matrix[row_index - 1][column_index]
  39. +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
      | U | 2 |   |
      +---+---+---+
      Deletion
      str1 = "schneems"
      str2 = "zschneems"
      str1 == str2[1..-1]
      matrix[row_index - 1][column_index] + 1
  40. +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
      | U | 2 | 1 |
      +---+---+---+
      Deletion Cost = 1
      str1 = "schneems"
      str2 = "zschneems"
      str1 == str2[1..-1]
      matrix[row_index - 1][column_index] + 1
  41. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 | 1 |
      +---+---+---+---+
      | U | 2 | 1 |   |
      +---+---+---+---+
      Substitution
      str1 = "zchneems"
      str2 = "schneems"
      str1[1..-1] == str2[1..-1]
      matrix[row_index - 1][column_index - 1]
  43. str1 = "zchneems"
      str2 = "schneems"
      str1[1..-1] == str2[1..-1]
      matrix[row_index - 1][column_index - 1]
      +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 | 1 |
      +---+---+---+---+
      | U | 2 | 1 |   |
      +---+---+---+---+
      Substitution
  46. +---+---+---+---+
      |   |   | S | A |
      +---+---+---+---+
      |   | 0 | 1 | 2 |
      +---+---+---+---+
      | S | 1 | 0 | 1 |
      +---+---+---+---+
      | U | 2 | 1 | 1 |
      +---+---+---+---+
      Substitution Cost = 1
      str1 = "zchneems"
      str2 = "schneems"
      str1[1..-1] == str2[1..-1]
      matrix[row_index - 1][column_index - 1] + 1
  47. +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
      Match!
      str1 = "schneems"
      str2 = "schneems"
      str1[1..-1] == str2[1..-1]
      matrix[row_index - 1][column_index - 1]
  48. i.e. changing “” to “”
      +---+---+---+
      |   |   | S |
      +---+---+---+
      |   | 0 | 1 |
      +---+---+---+
      | S | 1 | 0 |
      +---+---+---+
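
The four cases walked through above (match, insertion, deletion, substitution) can be folded into one rule for a single cell. The following is a sketch of that rule applied to the small S/A by S/U matrix from the slides; `cell_cost` and its parameter names are assumptions made for illustration.

```ruby
# Compute one cell from its three neighbours. Row 0 and column 0 hold the
# empty-string costs, so real characters start at index 1.
# `cell_cost` is a hypothetical helper name used only for this sketch.
def cell_cost(matrix, row_index, column_index, row_char, column_char)
  diagonal = matrix[row_index - 1][column_index - 1]
  return diagonal if row_char == column_char # match: no extra cost

  insertion    = matrix[row_index][column_index - 1] + 1
  deletion     = matrix[row_index - 1][column_index] + 1
  substitution = diagonal + 1
  [insertion, deletion, substitution].min
end

# The small matrix from the slides: columns S, A and rows S, U.
matrix = [
  [0, 1, 2],
  [1, nil, nil],
  [2, nil, nil]
]
matrix[1][1] = cell_cost(matrix, 1, 1, "S", "S") # match
matrix[1][2] = cell_cost(matrix, 1, 2, "S", "A") # insertion is cheapest
matrix[2][1] = cell_cost(matrix, 2, 1, "U", "S") # deletion is cheapest
matrix[2][2] = cell_cost(matrix, 2, 2, "U", "A") # substitution is cheapest
matrix.each { |row| p row }
# [0, 1, 2]
# [1, 0, 1]
# [2, 1, 1]
```

Each cell only needs the cells to its left, above, and diagonally up-left, which is why filling the matrix row by row works.
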
  49. Algorithm

      str2.each_char.each_with_index do |char1,i|
        str1.each_char.each_with_index do |char2, j|
          if char1 == char2
            puts [:skip, matrix[i][j]].inspect
            matrix[i + 1][j + 1] = matrix[i][j]
          else
            actions = {
              deletion: matrix[i][j + 1] + 1,
              insert: matrix[i + 1][j] + 1,
              substitution: matrix[i][j] + 1
            }
            action = actions.sort {|(k,v), (k2, v2)| v <=> v2 }.first
            puts action.inspect
            matrix[i + 1][j + 1] = action.last
          end
          each_step.call(matrix) if each_step
        end
      end
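
The loop on the slide relies on a `matrix` and an `each_step` callback that are defined elsewhere in the talk. A self-contained sketch of the same idea, with the matrix setup included and the logging and callback left out, might look like this; `distance_matrix` is a hypothetical name.

```ruby
# Matrix-based distance. Rows follow str2 and columns follow str1, matching the
# loop order on the slide. `distance_matrix` is a hypothetical name.
def distance_matrix(str1, str2)
  matrix = Array.new(str2.length + 1) { Array.new(str1.length + 1, 0) }
  (0..str1.length).each { |j| matrix[0][j] = j } # top row: cost from ""
  (0..str2.length).each { |i| matrix[i][0] = i } # left column: cost from ""

  str2.each_char.with_index do |char1, i|
    str1.each_char.with_index do |char2, j|
      matrix[i + 1][j + 1] =
        if char1 == char2
          matrix[i][j] # match: carry the diagonal over unchanged
        else
          [
            matrix[i][j + 1] + 1, # deletion
            matrix[i + 1][j] + 1, # insertion
            matrix[i][j] + 1      # substitution
          ].min
        end
    end
  end

  matrix
end

matrix = distance_matrix("saturday", "sunday")
puts matrix.last.last # => 3, the bottom-right cell is the final cost
```

Returning the whole matrix rather than just the last cell keeps the intermediate costs available for printing or stepping through.
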
  50. Final Cost
      +---+---+---+---+---+---+---+---+---+---+
      |   |   | S | A | T | U | R | D | A | Y |
      +---+---+---+---+---+---+---+---+---+---+
      |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
      +---+---+---+---+---+---+---+---+---+---+
      | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
      +---+---+---+---+---+---+---+---+---+---+
      | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
      +---+---+---+---+---+---+---+---+---+---+
      | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
      +---+---+---+---+---+---+---+---+---+---+
      | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
      +---+---+---+---+---+---+---+---+---+---+
      | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
      +---+---+---+---+---+---+---+---+---+---+
      | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
      +---+---+---+---+---+---+---+---+---+---+
      3
  51. Final Cost
      +---+---+---+---+---+---+---+---+---+---+
      |   |   | S | A | T | U | R | D | A | Y |
      +---+---+---+---+---+---+---+---+---+---+
      |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
      +---+---+---+---+---+---+---+---+---+---+
      | S | 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
      +---+---+---+---+---+---+---+---+---+---+
      | U | 2 | 1 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
      +---+---+---+---+---+---+---+---+---+---+
      | N | 3 | 2 | 2 | 2 | 3 | 3 | 4 | 5 | 6 |
      +---+---+---+---+---+---+---+---+---+---+
      | D | 4 | 3 | 3 | 3 | 3 | 4 | 3 | 4 | 5 |
      +---+---+---+---+---+---+---+---+---+---+
      | A | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 4 |
      +---+---+---+---+---+---+---+---+---+---+
      | Y | 6 | 5 | 4 | 4 | 5 | 5 | 5 | 4 | 3 |
      +---+---+---+---+---+---+---+---+---+---+
      2
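
For anyone who wants to reproduce tables like the ones above, a cost matrix can be printed in the same ASCII style. The sketch below assumes single-character cell values and uses two hypothetical helpers, `render_matrix` and `format_row`; neither comes from the talk.

```ruby
# Format one table row; nil cells are printed as blanks.
def format_row(cells)
  "|" + cells.map { |cell| " #{cell.to_s.ljust(1)} |" }.join
end

# Render a cost matrix in the ASCII style of the slides.
# `column_word` labels the columns and `row_word` labels the rows.
def render_matrix(matrix, column_word, row_word)
  border = "+" + "---+" * (matrix.first.length + 1)
  lines = [border, format_row(["", "", *column_word.upcase.chars]), border]
  matrix.each_with_index do |row, index|
    # Row 0 belongs to the empty string, so it has no letter label.
    label = index.zero? ? "" : row_word.upcase[index - 1]
    lines << format_row([label, *row]) << border
  end
  lines.join("\n")
end

# The small matrix from the substitution slides.
puts render_matrix([[0, 1, 2], [1, 0, 1], [2, 1, 1]], "sa", "su")
```

Passing a partially filled matrix prints empty cells, which gives the step-by-step frames used throughout the deck.
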