
Dual Decomposition for Parsing with Non-Projective Head Automata

Alexander Rush
October 16, 2012


Transcript

  1. Dual Decomposition for Parsing with Non-Projective Head Automata Terry Koo,

    Alexander M. Rush, Michael Collins, David Sontag, and Tommi Jaakkola
  2. The Cost of Model Complexity We are always looking for

    better ways to model natural language. Tradeoff: Richer models ⇒ Harder decoding Added complexity is both computational and implementational.
  3. The Cost of Model Complexity We are always looking for

    better ways to model natural language. Tradeoff: Richer models ⇒ Harder decoding Added complexity is both computational and implementational. Tasks with challenging decoding problems: Speech Recognition Sequence Modeling (e.g. extensions to HMM/CRF) Parsing Machine Translation
  4. The Cost of Model Complexity We are always looking for

    better ways to model natural language. Tradeoff: Richer models ⇒ Harder decoding Added complexity is both computational and implementational. Tasks with challenging decoding problems: Speech Recognition Sequence Modeling (e.g. extensions to HMM/CRF) Parsing Machine Translation y∗ = arg max y f (y) Decoding
  5. Non-Projective Dependency Parsing *0 John1 saw2 a3 movie4 today5 that6

    he7 liked8 *0 John1 saw2 a3 movie4 today5 that6 he7 liked8 Important problem in many languages. Problem is NP-Hard for all but the simplest models.
  6. Dual Decomposition A classical technique for constructing decoding algorithms. Solve

    complicated models y∗ = arg max y f (y) by decomposing into smaller problems. Upshot: Can utilize a toolbox of combinatorial algorithms. Dynamic programming Minimum spanning tree Shortest path Min-Cut ...
  7. A Dual Decomposition Algorithm for Non-Projective Dependency Parsing Simple -

    Uses basic combinatorial algorithms Efficient - Faster than previously proposed algorithms Strong Guarantees - Gives a certificate of optimality when exact Solves 98% of examples exactly, even though the problem is NP-Hard Widely Applicable - Similar techniques extend to other problems
  8. Non-Projective Dependency Parsing *0 John1 saw2 a3 movie4 today5 that6

    he7 liked8 *0 John1 saw2 a3 movie4 today5 that6 he7 liked8 Starts at the root symbol *. Each word has exactly one parent word. Produces a tree structure (no cycles). Dependencies can cross.
  9. Algorithm Outline
    [Figure: the example sentence *0 John1 saw2 a3 movie4 today5 that6 he7 liked8 parsed under the Arc-Factored Model and under the Sibling Model]
  10. Algorithm Outline
    [Figure: the Arc-Factored Model and the Sibling Model on the same sentence, combined via Dual Decomposition]
  11. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

    f (y) = score(head =∗0, mod =saw2) +score(saw2, John1)
  12. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

    f (y) = score(head =∗0, mod =saw2) +score(saw2, John1) +score(saw2, movie4)
  13. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

    f (y) = score(head =∗0, mod =saw2) +score(saw2, John1) +score(saw2, movie4) +score(saw2, today5)
  14. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

    f (y) = score(head =∗0, mod =saw2) +score(saw2, John1) +score(saw2, movie4) +score(saw2, today5) +score(movie4, a3) + ...
  15. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

    f (y) = score(head =∗0, mod =saw2) +score(saw2, John1) +score(saw2, movie4) +score(saw2, today5) +score(movie4, a3) + ... e.g. score(∗0, saw2) = log p(saw2|∗0) (generative model)
  16. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

    f (y) = score(head =∗0, mod =saw2) +score(saw2, John1) +score(saw2, movie4) +score(saw2, today5) +score(movie4, a3) + ... e.g. score(∗0, saw2) = log p(saw2|∗0) (generative model) or score(∗0, saw2) = w · φ(saw2, ∗0) (CRF/perceptron model)
  17. Arc-Factored *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    f(y) = score(head = ∗0, mod = saw2) + score(saw2, John1) + score(saw2, movie4) + score(saw2, today5) + score(movie4, a3) + ...
    e.g. score(∗0, saw2) = log p(saw2 | ∗0) (generative model)
    or score(∗0, saw2) = w · φ(saw2, ∗0) (CRF/perceptron model)
    y∗ = arg max_y f(y) ⇐ Minimum Spanning Tree Algorithm
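    A minimal sketch (not from the talk) of arc-factored decoding as a maximum spanning arborescence, assuming networkx's Chu-Liu/Edmonds implementation behaves as documented; the score function and word list are placeholders.

        # Sketch: arc-factored decoding = best spanning arborescence rooted at *0.
        # `score(h, m)` stands in for log p(mod | head) or w . phi(head, mod).
        import networkx as nx

        def decode_arc_factored(words, score):
            """words[0] is the root symbol *; returns {modifier index: head index}."""
            G = nx.DiGraph()
            for m in range(1, len(words)):        # every non-root word needs a head
                for h in range(len(words)):
                    if h != m:
                        G.add_edge(h, m, weight=score(h, m))
            # No edges point into node 0, so any spanning arborescence is rooted at *0.
            tree = nx.maximum_spanning_arborescence(G, attr="weight")
            return {m: h for h, m in tree.edges()}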
  18. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2)
  19. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2) +score(saw2, NULL, John1)
  20. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2) +score(saw2, NULL, John1) +score(saw2, NULL, movie4)
  21. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2) +score(saw2, NULL, John1) +score(saw2, NULL, movie4) +score(saw2,movie4, today5) + ...
  22. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2) +score(saw2, NULL, John1) +score(saw2, NULL, movie4) +score(saw2,movie4, today5) + ... e.g. score(saw2, movie4, today5) = log p(today5|saw2, movie4)
  23. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2) +score(saw2, NULL, John1) +score(saw2, NULL, movie4) +score(saw2,movie4, today5) + ... e.g. score(saw2, movie4, today5) = log p(today5|saw2, movie4) or score(saw2, movie4, today5) = w · φ(saw2, movie4, today5)
  24. Sibling Models *0 John1 saw2 a3 movie4 today5 that6 he7

    liked8 f (y) = score(head = ∗0, prev = NULL, mod = saw2) +score(saw2, NULL, John1) +score(saw2, NULL, movie4) +score(saw2,movie4, today5) + ... e.g. score(saw2, movie4, today5) = log p(today5|saw2, movie4) or score(saw2, movie4, today5) = w · φ(saw2, movie4, today5) y∗ = arg max y f (y) ⇐ NP-Hard
  25. Thought Experiment: Individual Decoding *0 John1 saw2 a3 movie4 today5

    that6 he7 liked8 score(saw2, NULL, John1) + score(saw2, NULL, movie4) +score(saw2, movie4, today5)
  26. Thought Experiment: Individual Decoding *0 John1 saw2 a3 movie4 today5

    that6 he7 liked8 score(saw2, NULL, John1) + score(saw2, NULL, movie4) +score(saw2, movie4, today5) score(saw2, NULL, John1) + score(saw2, NULL, that6)
  27. Thought Experiment: Individual Decoding *0 John1 saw2 a3 movie4 today5

    that6 he7 liked8 score(saw2, NULL, John1) + score(saw2, NULL, movie4) +score(saw2, movie4, today5) score(saw2, NULL, John1) + score(saw2, NULL, that6) score(saw2, NULL, a3) + score(saw2, a3, he7)
  28. Thought Experiment: Individual Decoding *0 John1 saw2 a3 movie4 today5

    that6 he7 liked8
    score(saw2, NULL, John1) + score(saw2, NULL, movie4) + score(saw2, movie4, today5)
    score(saw2, NULL, John1) + score(saw2, NULL, that6)
    score(saw2, NULL, a3) + score(saw2, a3, he7)
    2^(n−1) possibilities
  29. Thought Experiment: Individual Decoding *0 John1 saw2 a3 movie4 today5

    that6 he7 liked8
    score(saw2, NULL, John1) + score(saw2, NULL, movie4) + score(saw2, movie4, today5)
    score(saw2, NULL, John1) + score(saw2, NULL, that6)
    score(saw2, NULL, a3) + score(saw2, a3, he7)
    2^(n−1) possibilities
    Under the Sibling Model, we can solve this for each head word with Viterbi decoding (see the sketch below).
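    A minimal sketch (not the authors' code) of that per-head Viterbi search under the sibling model, for one head word and its candidate modifiers on one side, in order; score(head, prev, mod) is a placeholder and the final STOP transition is omitted.

        # Sketch: reduce the 2^(n-1) modifier subsets for ONE head to an O(n^2) Viterbi.
        # DP state = last attached modifier; None plays the role of NULL.
        def decode_head(head, candidates, score):
            best = {None: (0.0, [])}                  # last modifier -> (score, modifiers)
            for mod in candidates:
                new_best = dict(best)                 # option 1: skip `mod`
                for prev, (s, mods) in best.items():  # option 2: attach `mod` after `prev`
                    cand = s + score(head, prev, mod)
                    if mod not in new_best or cand > new_best[mod][0]:
                        new_best[mod] = (cand, mods + [mod])
                best = new_best
            return max(best.values(), key=lambda t: t[0])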
  30. Thought Experiment Continued *0 John1 saw2 a3 movie4 today5 that6

    he7 liked8 Idea: Do individual decoding for each head word using dynamic programming. If we’re lucky, we’ll end up with a valid final tree.
  38. Thought Experiment Continued *0 John1 saw2 a3 movie4 today5 that6

    he7 liked8 Idea: Do individual decoding for each head word using dynamic programming. If we’re lucky, we’ll end up with a valid final tree. But we might violate some constraints.
  39. Dual Decomposition Idea
                        No Constraints          Tree Constraints
    Arc-Factored        —                       Minimum Spanning Tree
    Sibling Model       Individual Decoding     Dual Decomposition
  40. Dual Decomposition Structure
    Goal: y∗ = arg max_{y∈Y} f(y)
    Rewrite as: arg max_{z∈Z, y∈Y} f(z) + g(y) such that z = y
  41. Dual Decomposition Structure (as above; Z is labeled "All Possible")
  42. Dual Decomposition Structure (as above; Y is labeled "Valid Trees")
  43. Dual Decomposition Structure (as above; f is labeled "Sibling")
  44. Dual Decomposition Structure (as above; g is labeled "Arc-Factored")
  45. Dual Decomposition Structure (as above; z = y is labeled "Constraint")
  46. Algorithm Sketch Set penalty weights equal to 0 for all

    edges. For k = 1 to K z(k) ← Decode (f (z) + penalty) by Individual Decoding
  47. Algorithm Sketch Set penalty weights equal to 0 for all

    edges. For k = 1 to K z(k) ← Decode (f (z) + penalty) by Individual Decoding y(k) ← Decode (g(y) − penalty) by Minimum Spanning Tree
  48. Algorithm Sketch Set penalty weights equal to 0 for all

    edges. For k = 1 to K z(k) ← Decode (f (z) + penalty) by Individual Decoding y(k) ← Decode (g(y) − penalty) by Minimum Spanning Tree If y(k)(i, j) = z(k)(i, j) for all i, j Return (y(k), z(k))
  49. Algorithm Sketch
    Set penalty weights equal to 0 for all edges.
    For k = 1 to K:
      z(k) ← Decode (f(z) + penalty) by Individual Decoding
      y(k) ← Decode (g(y) − penalty) by Minimum Spanning Tree
      If y(k)(i, j) = z(k)(i, j) for all i, j: Return (y(k), z(k))
      Else: Update penalty weights based on y(k)(i, j) − z(k)(i, j)
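    A minimal sketch of this loop (not the released implementation), assuming two black-box subproblem solvers decode_sibling(u) and decode_mst(u) that return arc indicators {(i, j): 0 or 1} under penalties u; the step-size schedule and stopping rule are simplified.

        # Sketch of the subgradient loop from the Algorithm Sketch slide.
        from collections import defaultdict

        def dual_decompose(decode_sibling, decode_mst, arcs, K=5000, step=1.0):
            u = defaultdict(float)                    # penalties u(i, j), initially 0
            y = {}
            for k in range(1, K + 1):
                z = decode_sibling(u)                 # max f(z) + sum_{i,j} u(i,j) z(i,j)
                y = decode_mst(u)                     # max g(y) - sum_{i,j} u(i,j) y(i,j)
                if all(z.get(a, 0) == y.get(a, 0) for a in arcs):
                    return y, True                    # agreement gives a certificate
                alpha = step / k                      # simple diminishing step size
                for a in arcs:
                    u[a] -= alpha * (z.get(a, 0) - y.get(a, 0))
            return y, False                           # no certificate; approximate answer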
  50. Individual Decoding *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    z∗ = arg max_{z∈Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) )
    Minimum Spanning Tree *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    y∗ = arg max_{y∈Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )
    Penalties: u(i, j) = 0 for all i, j
    Key: f(z) ⇐ Sibling Model, g(y) ⇐ Arc-Factored Model, Z ⇐ No Constraints, Y ⇐ Tree Constraints, y(i, j) = 1 if y contains dependency (i, j)
  54. Individual Decoding *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    z∗ = arg max_{z∈Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) )
    Minimum Spanning Tree *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    y∗ = arg max_{y∈Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )
    Penalties: u(i, j) = 0 for all i, j
    Iteration 1: u(8, 1) = −1, u(4, 6) = −1, u(2, 6) = 1, u(8, 7) = 1
    Key: f(z) ⇐ Sibling Model, g(y) ⇐ Arc-Factored Model, Z ⇐ No Constraints, Y ⇐ Tree Constraints, y(i, j) = 1 if y contains dependency (i, j)
  57. Individual Decoding *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    z∗ = arg max_{z∈Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) )
    Minimum Spanning Tree *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    y∗ = arg max_{y∈Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )
    Penalties: u(i, j) = 0 for all i, j
    Iteration 1: u(8, 1) = −1, u(4, 6) = −1, u(2, 6) = 1, u(8, 7) = 1
    Iteration 2: u(8, 1) = −1, u(4, 6) = −2, u(2, 6) = 2, u(8, 7) = 1
    Key: f(z) ⇐ Sibling Model, g(y) ⇐ Arc-Factored Model, Z ⇐ No Constraints, Y ⇐ Tree Constraints, y(i, j) = 1 if y contains dependency (i, j)
  61. Individual Decoding *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    z∗ = arg max_{z∈Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) )
    Minimum Spanning Tree *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    y∗ = arg max_{y∈Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )
    Penalties: u(i, j) = 0 for all i, j
    Iteration 1: u(8, 1) = −1, u(4, 6) = −1, u(2, 6) = 1, u(8, 7) = 1
    Iteration 2: u(8, 1) = −1, u(4, 6) = −2, u(2, 6) = 2, u(8, 7) = 1
    Converged: y∗ = arg max_{y∈Y} f(y) + g(y)
    Key: f(z) ⇐ Sibling Model, g(y) ⇐ Arc-Factored Model, Z ⇐ No Constraints, Y ⇐ Tree Constraints, y(i, j) = 1 if y contains dependency (i, j)
  62. Guarantees Theorem If at any iteration y(k) = z(k), then

    (y(k), z(k)) is the global optimum. In experiments, we find the global optimum on 98% of examples.
  63. Guarantees Theorem If at any iteration y(k) = z(k), then

    (y(k), z(k)) is the global optimum. In experiments, we find the global optimum on 98% of examples. If we do not converge to a match, we can still return an approximate solution (more in the paper).
  64. Extensions
    Grandparent Models *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    f(y) = ... + score(gp = ∗0, head = saw2, prev = movie4, mod = today5)
    Head Automata (Eisner, 2000): a generalization of sibling models that allows arbitrary automata as local scoring functions.
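    One illustrative way (an assumed interface, not the paper's code) to picture a head automaton: a stateful scorer that reads a head's ordered modifiers, with the sibling model as the special case where the state is just the previous modifier.

        # Sketch: a head automaton as a stateful scorer over one head's modifiers.
        class SiblingAutomaton:
            def __init__(self, head, score):
                self.head, self.score = head, score
            def start(self):
                return None                          # NULL: no sibling seen yet
            def step(self, state, mod):
                """Return (next state, score of attaching `mod` after `state`)."""
                return mod, self.score(self.head, state, mod)

        def automaton_score(automaton, modifiers):
            state, total = automaton.start(), 0.0
            for mod in modifiers:
                state, s = automaton.step(state, mod)
                total += s
            return total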
  65. Experiments
    Properties: Exactness, Parsing Speed, Parsing Accuracy, Comparison to Individual Decoding, Comparison to LP/ILP
    Training: Averaged Perceptron (more details in paper)
    Experiments on: CoNLL Datasets, English Penn Treebank, Czech Dependency Treebank
  66. How often do we exactly solve the problem?
    [Chart: percentage of examples where dual decomposition finds an exact solution, per language (Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur)]
  67. Parsing Speed
    [Charts: number of sentences parsed per second, per language (Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur), for the Sibling model and the Grandparent model]
    Comparable to dynamic programming for projective parsing
  68. Accuracy
           Arc-Factored   Prev Best   Grandparent
    Dan    89.7           91.5        91.8
    Dut    82.3           85.6        85.8
    Por    90.7           92.1        93.0
    Slo    82.4           85.6        86.2
    Swe    88.9           90.6        91.4
    Tur    75.7           76.4        77.6
    Eng    90.1           —           92.5
    Cze    84.4           —           87.3
    Prev Best: best reported results for the CoNLL-X data set, including approximate search (McDonald and Pereira, 2006), loopy belief propagation (Smith and Eisner, 2008), and (integer) linear programming (Martins et al., 2009).
  69. Comparison to Subproblems
    [Chart: dependency accuracy (F1) on English for Individual decoding, MST, and Dual decomposition]
  70. Comparison to LP/ILP
    Martins et al. (2009) propose two representations of non-projective dependency parsing as linear programming relaxations, as well as an exact ILP: LP (1), LP (2), ILP.
    These use an LP/ILP solver for decoding. We compare: accuracy, exactness, speed.
    Both the LP/ILP methods and dual decomposition use the same model, features, and weights w.
  71. Comparison to LP/ILP: Accuracy
    [Chart: dependency accuracy for LP(1), LP(2), ILP, and Dual]
    All decoding methods have comparable accuracy.
  72. Comparison to LP/ILP: Exactness and Speed
    [Charts: percentage of examples with an exact solution, and sentences parsed per second, for LP(1), LP(2), ILP, and Dual]
  73. Deriving the Algorithm
    Goal: y∗ = arg max_{y∈Y} f(y)
    Rewrite: arg max_{z∈Z, y∈Y} f(z) + g(y) s.t. z(i, j) = y(i, j) for all i, j
    Lagrangian: L(u, y, z) = f(z) + g(y) + Σ_{i,j} u(i, j) (z(i, j) − y(i, j))
  74. Deriving the Algorithm
    Goal: y∗ = arg max_{y∈Y} f(y)
    Rewrite: arg max_{z∈Z, y∈Y} f(z) + g(y) s.t. z(i, j) = y(i, j) for all i, j
    Lagrangian: L(u, y, z) = f(z) + g(y) + Σ_{i,j} u(i, j) (z(i, j) − y(i, j))
    The dual problem is to find min_u L(u), where
    L(u) = max_{y∈Y, z∈Z} L(u, y, z) = max_{z∈Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) ) + max_{y∈Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )
    The dual is an upper bound: L(u) ≥ f(z∗) + g(y∗) for any u.
  75. A Subgradient Algorithm for Minimizing L(u)
    L(u) = max_{z∈Z} ( f(z) + Σ_{i,j} u(i, j) z(i, j) ) + max_{y∈Y} ( g(y) − Σ_{i,j} u(i, j) y(i, j) )
    L(u) is convex, but not differentiable.
    A subgradient of L(u) at u is a vector g_u such that for all v, L(v) ≥ L(u) + g_u · (v − u).
    Subgradient methods use updates u ← u − α·g_u.
    In fact, for our L(u), g_u(i, j) = z∗(i, j) − y∗(i, j).
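    Why z∗ − y∗ is a subgradient (a one-line check, not spelled out on the slide; here z∗ and y∗ are the maximizers of the two subproblems at u): for any v, L(v) ≥ ( f(z∗) + Σ_{i,j} v(i, j) z∗(i, j) ) + ( g(y∗) − Σ_{i,j} v(i, j) y∗(i, j) ), since z∗ and y∗ are feasible but not necessarily optimal for the two maximizations in L(v); the right-hand side equals L(u) + (v − u) · (z∗ − y∗).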
  76. Related Work
    Methods that use general-purpose linear programming or integer linear programming solvers (Martins et al., 2009; Riedel and Clarke, 2006; Roth and Yih, 2005)
    Dual decomposition / Lagrangian relaxation in combinatorial optimization (Dantzig and Wolfe, 1960; Held and Karp, 1970; Fisher, 1981)
    Dual decomposition for inference in MRFs (Komodakis et al., 2007; Wainwright et al., 2005)
    Methods that incorporate combinatorial solvers within loopy belief propagation (Duchi et al., 2007; Smith and Eisner, 2008)
  77. Summary y∗ = arg max_y f(y) ⇐ NP-Hard
    [Figure: the example sentence *0 John1 saw2 a3 movie4 today5 that6 he7 liked8 under the Arc-Factored Model and the Sibling Model]
  78. Summary y∗ = arg max_y f(y) ⇐ NP-Hard
    [Figure: the Arc-Factored Model and the Sibling Model on the same sentence, combined via Dual Decomposition]
  79. Other Applications Dual decomposition can be applied to other decoding

    problems. Rush et al. (2010) focuses on integrated dynamic programming algorithms. Integrated Parsing and Tagging Integrated Constituency and Dependency Parsing
  80. Parsing and Tagging y∗ = arg max_y f(y) ⇐ Slow
    [Figure: the example sentence with an HMM tag sequence (HMM Model) and a CFG parse tree (CFG Model)]
  81. Parsing and Tagging y∗ = arg max_y f(y) ⇐ Slow
    [Figure: the HMM Model and the CFG Model combined via Dual Decomposition]
  82. Dependency and Constituency y∗ = arg max_y f(y) ⇐ Slow
    [Figure: the example sentence with a dependency parse (Dependency Model) and a lexicalized CFG parse (Lexicalized CFG)]
  83. Dependency and Constituency y∗ = arg max_y f(y) ⇐ Slow
    [Figure: the Dependency Model and the Lexicalized CFG combined via Dual Decomposition]
  84. Future Directions
    There is much more to explore around dual decomposition in NLP.
    Known techniques: generalization to more than two models; k-best decoding; approximate subgradient; heuristics for branch-and-bound-style search.
    Possible NLP applications: machine translation; speech recognition; "loopy" sequence models.
    Open questions: Can we speed up subalgorithms when they are run repeatedly? What are the trade-offs of different decompositions? Are there better methods for optimizing the dual?
  85. Training the Model *0 John1 saw2 a3 movie4 today5 that6 he7 liked8
    f(y) = ... + score(saw2, movie4, today5) + ...
    score(saw2, movie4, today5) = w · φ(saw2, movie4, today5)
    Weight vector w is trained using the averaged perceptron. (More details in the paper.)
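    The slide only names the training procedure; below is a generic averaged-perceptron sketch (an assumed, unoptimized structure, not the paper's implementation), with placeholder helpers decode(w, x) and features(x, y).

        # Sketch: averaged structured perceptron; real implementations average lazily.
        from collections import defaultdict

        def averaged_perceptron(data, decode, features, epochs=10):
            w, w_sum, t = defaultdict(float), defaultdict(float), 0
            for _ in range(epochs):
                for x, gold in data:
                    pred = decode(w, x)
                    if pred != gold:                          # standard structured update
                        for f, v in features(x, gold).items():
                            w[f] += v
                        for f, v in features(x, pred).items():
                            w[f] -= v
                    t += 1
                    for f, v in w.items():                    # accumulate for averaging
                        w_sum[f] += v
            return {f: v / max(t, 1) for f, v in w_sum.items()}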
  86. Early Stopping
    [Chart: % validation UAS, % certificates, and % match against the maximum number of dual decomposition iterations (up to K = 5000)]
  87. Caching
    [Chart: % of head automata recomputed per iteration of dual decomposition, for the grandparent+sibling and sibling models]