
On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Alexander Rush
October 16, 2012

Transcript

  1. On Dual Decomposition and Linear Programming Relaxations for Natural
    Language Processing. Alexander M. Rush, David Sontag, Michael Collins,
    and Tommi Jaakkola.
  2. Dynamic Programming. Dynamic programming is a dominant technique in NLP:
    fast, exact, and easy to implement. Examples: the Viterbi algorithm for
    hidden Markov models and the CKY algorithm for weighted context-free
    grammars. Decoding: y^* = \arg\max_y f(y).
  3. Model Complexity. Unfortunately, dynamic programming algorithms do not
    scale well with model complexity: as our models become more complex,
    these algorithms can explode in computational or implementational cost.
    Integration: decoding f is easy, decoding g is easy, but decoding f + g
    is hard.
  4. Integration (1). [Figure: a parse tree for "Red flies some large jet"
    scored by f(y), alongside the tag sequence N V D A N scored by g(z).]
    Decoding f(y) + g(z) jointly is a classical problem in NLP. The dynamic
    programming intersection is prohibitively slow and complicated to
    implement.
  5. Integration (2). [Figure: the same parse tree alongside a dependency
    parse of "Red flies some large jet" rooted at *0.] Decoding f(y) + g(z)
    jointly is important for improving parsing accuracy. The dynamic
    programming intersection is slow and complicated to implement.
  6. Dual Decomposition. A general technique for constructing decoding
    algorithms: solve complicated models y^* = \arg\max_y f(y) by decomposing
    them into smaller problems. Upshot: we can use a toolbox of combinatorial
    algorithms: dynamic programming, minimum spanning tree, shortest path,
    min-cut, and more.
  7. Dual Decomposition Algorithms. Simple: uses basic dynamic programming
    algorithms. Efficient: faster than full dynamic programming
    intersections. Strong guarantees: gives a certificate of optimality when
    exact; in experiments, we find the global optimum on 99% of examples.
    Widely applicable: similar techniques extend to other problems.
  8. Integrated Parsing and Tagging. [Figure: the sentence "Red flies some
    large jet" with its HMM tagging N V D A N and its CFG parse tree.]
  9. Integrated Parsing and Tagging. [Same figure, with dual decomposition
    mediating between the HMM and the CFG.]
  10-11. HMM for Tagging. [Figure: "Red flies some large jet" tagged
    N V D A N.] Let Z be the set of all valid taggings of a sentence and g(z)
    be a scoring function, e.g. g(z) = log p(Red_1 | N) + log p(V | N) + ...
    Decoding: z^* = \arg\max_{z \in Z} g(z), computed by Viterbi decoding
    (a runnable sketch follows this item).
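To make the tagging subproblem concrete, here is a minimal sketch of Viterbi decoding for a bigram HMM, not the talk's implementation; the table names `trans` and `emit` (log-probability dictionaries) and the start symbol are assumptions for the example.

```python
# A minimal sketch of Viterbi decoding for a bigram HMM tagger.
# `trans[s][t]` and `emit[t][w]` are assumed log-probability dictionaries
# (hypothetical names), with a distinguished start symbol "<s>" in `trans`.
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    # pi[i][t] = best log score of any tagging of words[:i+1] ending in tag t
    pi = [{t: trans[start].get(t, -math.inf) + emit[t].get(words[0], -math.inf)
           for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        pi.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: pi[i - 1][s] + trans[s].get(t, -math.inf))
            pi[i][t] = (pi[i - 1][prev] + trans[prev].get(t, -math.inf)
                        + emit[t].get(words[i], -math.inf))
            back[i][t] = prev
    # Follow back-pointers from the best final tag
    z = [max(tags, key=lambda t: pi[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        z.append(back[i][z[-1]])
    return list(reversed(z))
```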
  12-13. CFG for Parsing. [Figure: the parse tree for "Red flies some large
    jet".] Let Y be the set of all valid parse trees for a sentence and f(y)
    be a scoring function, e.g. f(y) = log p(S → NP VP | S) +
    log p(NP → N | NP) + ... Decoding: y^* = \arg\max_{y \in Y} f(y),
    computed by the CKY algorithm (a runnable sketch follows this item).
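For the parsing subproblem, here is a scores-only CKY sketch, again not the talk's implementation; `binary` and `lexical` (rule tables keyed by children and by word) are assumed names for the example.

```python
# A minimal sketch of CKY for a weighted grammar in Chomsky normal form.
# `binary` maps (B, C) -> list of (A, log score) and `lexical` maps
# word -> list of (A, log score); both table names are assumptions.
from collections import defaultdict

def cky(words, binary, lexical, root="S"):
    n = len(words)
    chart = defaultdict(dict)   # chart[(i, j)][A] = best score of A over words[i:j]
    for i, w in enumerate(words):
        for A, sc in lexical.get(w, []):
            if sc > chart[(i, i + 1)].get(A, float("-inf")):
                chart[(i, i + 1)][A] = sc
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                      # split point
                for B, sb in chart[(i, k)].items():
                    for C, sc in chart[(k, j)].items():
                        for A, sr in binary.get((B, C), []):
                            total = sr + sb + sc
                            if total > chart[(i, j)].get(A, float("-inf")):
                                chart[(i, j)][A] = total
    return chart[(0, n)].get(root, float("-inf"))          # best full parse score
```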
  14. Problem Definition. [Figure: the parse tree for "Red flies some large
    jet".] Find the parse tree that optimizes score(S → NP VP) +
    score(VP → V NP) + ... + score(Red_1, N) + score(V, N) + ...
    Conventional approach (Bar-Hillel et al., 1961): replace rules like
    S → NP VP with rules like S_{N,N} → NP_{N,V} VP_{V,N}. Painful: an
    O(t^6) increase in complexity for trigram tagging, where t is the number
    of tags.
  15-20. The Integrated Parsing and Tagging Problem. Find
    \arg\max_{y \in Y, z \in Z} f(y) + g(z) such that y(i,t) = z(i,t) for all
    i, t, where y(i,t) = 1 if the parse includes tag t at position i and
    z(i,t) = 1 if the tagging includes tag t at position i. (These slides
    build up the annotations one at a time: Y is the set of trees, scored by
    the CFG f; Z is the set of taggings, scored by the HMM g; the equalities
    are the agreement constraints. The full problem is displayed below.)
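In display form, the combined problem these slides build up is:

```latex
\max_{y \in \mathcal{Y},\; z \in \mathcal{Z}} \; f(y) + g(z)
\qquad \text{s.t.} \quad y(i,t) = z(i,t) \ \text{for all } i, t .
```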
  21-25. Algorithm Sketch (a runnable sketch follows this item):
    Set penalty weights u(i,t) = 0 for the tag at each position.
    For k = 1 to K:
      y^(k) ← decode f(y) + penalty by the CKY algorithm
      z^(k) ← decode g(z) − penalty by Viterbi decoding
      If y^(k)(i,t) = z^(k)(i,t) for all i, t: return (y^(k), z^(k))
      Else: update the penalty weights based on y^(k)(i,t) − z^(k)(i,t)
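The outer loop in code, under the assumption of two black-box decoders; `cky_decode` and `viterbi_decode` are hypothetical names for subroutines that return the tag indicators {(i, t): 1} of the best parse under f(y) + Σ u(i,t) y(i,t) and the best tagging under g(z) − Σ u(i,t) z(i,t):

```python
# A sketch of the dual decomposition loop from slides 21-25, assuming two
# black-box decoders (hypothetical names) as described above.
from collections import defaultdict

def dual_decompose(cky_decode, viterbi_decode, K=50, step=1.0):
    u = defaultdict(float)                   # penalty weights u(i, t), all 0
    for k in range(1, K + 1):
        y = cky_decode(u)                    # CKY under +penalty
        z = viterbi_decode(u)                # Viterbi under -penalty
        if y == z:                           # agreement for all (i, t):
            return y, z, True                # certificate of optimality
        alpha = step / k                     # a decaying step size (the talk
                                             # does not specify the rule)
        for key in set(y) | set(z):          # u <- u - alpha * (y - z)
            u[key] -= alpha * (y.get(key, 0) - z.get(key, 0))
    return y, z, False                       # no certificate (see slide 40)
```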
  26-38. Worked example, iteration by iteration. Key: f(y) ⇐ CFG,
    g(z) ⇐ HMM, Y ⇐ parse trees, Z ⇐ taggings; y(i,t) = 1 if y contains tag
    t at position i. The two subproblems at each iteration are
      y^* = \arg\max_{y \in Y} (f(y) + \sum_{i,t} u(i,t) y(i,t))   (CKY parsing)
      z^* = \arg\max_{z \in Z} (g(z) − \sum_{i,t} u(i,t) z(i,t))   (Viterbi decoding)
    With all penalties u(i,t) = 0, the parse tags "Red flies some large jet"
    as A N D A V while the tagger outputs N V D A N: disagreement at
    positions 1, 2, and 5. Iteration 1 penalties: u(1,A) = −1, u(1,N) = 1,
    u(2,N) = −1, u(2,V) = 1, u(5,V) = −1, u(5,N) = 1. Re-decoding, the parse
    now yields N V D A N while the tagging is A N D A N; the updates at
    positions 1 and 2 cancel, leaving iteration 2 penalties u(5,V) = −1,
    u(5,N) = 1. Re-decoding once more, parse and tagging both give N V D A N.
    Converged: y^* = \arg\max_{y \in Y} f(y) + g(y).
  39-40. Guarantees. Theorem: if at any iteration y^(k)(i,t) = z^(k)(i,t)
    for all i, t, then (y^(k), z^(k)) is the global optimum (a two-line
    argument follows this item). In experiments, we find the global optimum
    on 99% of examples. If we do not converge to a match, we can still get a
    result (more in the paper).
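Why agreement certifies optimality, sketched from the Lagrangian setup of slides 81-82: for any penalties u, L(u) upper-bounds the constrained optimum, and when y^(k) = z^(k) the penalty terms cancel so the bound is attained:

```latex
L(u^{(k)}) = \Big( f(y^{(k)}) + \sum_{i,t} u^{(k)}(i,t)\, y^{(k)}(i,t) \Big)
           + \Big( g(z^{(k)}) - \sum_{i,t} u^{(k)}(i,t)\, z^{(k)}(i,t) \Big)
           = f(y^{(k)}) + g(z^{(k)}).
```

A feasible pair whose value meets an upper bound on the optimum must itself be optimal.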
  41-42. Integrated CFG and Dependency Parsing. [Figure: "Red flies some
    large jet" with a dependency parse rooted at *0 and a lexicalized CFG
    parse headed by S(flies); slide 42 adds dual decomposition mediating
    between the two models.]
  43-44. Dependency Parsing. [Figure: dependency parse of "*0 Red1 flies2
    some3 large4 jet5".] Let Z be the set of all valid dependency parses of
    a sentence and g(z) be a scoring function, e.g. g(z) =
    log p(some_3 | jet_5, large_4) + ... Decoding: z^* =
    \arg\max_{z \in Z} g(z), computed by the Eisner (2000) algorithm
    (a sketch follows this item).
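For reference, a first-order, scores-only sketch of Eisner's O(n^3) projective dependency parser, an illustrative stand-in rather than the talk's code; the dense arc-score matrix `s` is an assumption, and the single-root constraint is omitted for brevity.

```python
# Eisner's algorithm over spans: incomplete (I) and complete (C) items,
# headed at the left (r) or right (l) end. `s[h][m]` is the assumed score
# of the arc head h -> modifier m, with index 0 the root token *0.
import numpy as np

def eisner_score(s):
    n = s.shape[0] - 1                       # number of words (0 is the root)
    I_r = np.zeros((n + 1, n + 1))           # incomplete span, head at left end
    I_l = np.zeros((n + 1, n + 1))           # incomplete span, head at right end
    C_r = np.zeros((n + 1, n + 1))           # complete span, head at left end
    C_l = np.zeros((n + 1, n + 1))           # complete span, head at right end
    for width in range(1, n + 1):
        for i in range(n + 1 - width):
            j = i + width
            # attach: join two complete halves with a new arc i -> j or j -> i
            best = max(C_r[i, k] + C_l[k + 1, j] for k in range(i, j))
            I_r[i, j] = best + s[i, j]
            I_l[i, j] = best + s[j, i]
            # complete: absorb a finished incomplete span
            C_r[i, j] = max(I_r[i, k] + C_r[k, j] for k in range(i + 1, j + 1))
            C_l[i, j] = max(C_l[i, k] + I_l[k, j] for k in range(i, j))
    return C_r[0, n]                         # score of the best projective tree
```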
  45-46. Lexicalized PCFG. [Figure: the lexicalized parse tree headed by
    S(flies) for "Red flies some large jet".] Let Y be the set of all valid
    lexicalized parse trees for a sentence and f(y) be a scoring function,
    e.g. f(y) = log p(S(flies) → NP(Red) VP(flies) | S(flies)) + ...
    Decoding: y^* = \arg\max_{y \in Y} f(y), computed by a modified CKY
    algorithm.
  47-52. The Integrated Constituency and Dependency Parsing Problem. Find
    \arg\max_{y \in Y, z \in Z} f(y) + g(z) such that y(i,j) = z(i,j) for
    all i, j, where y(i,j) = 1 if the constituency parse includes a
    dependency from word i to word j, and z(i,j) = 1 if the dependency parse
    includes a dependency from word i to word j. (As before, the slides build
    up the annotations: Y is the set of trees, scored by the lexicalized CFG
    f; Z is the set of dependency trees, scored by the dependency model g;
    the equalities are the agreement constraints.)
  53-61. Worked example, iteration by iteration. Key: f(y) ⇐ CFG,
    g(z) ⇐ dependency model, Y ⇐ parse trees, Z ⇐ dependency trees;
    y(i,j) = 1 if y contains the dependency (i, j). The two subproblems are
      y^* = \arg\max_{y \in Y} (f(y) + \sum_{i,j} u(i,j) y(i,j))   (CKY parsing)
      z^* = \arg\max_{z \in Z} (g(z) − \sum_{i,j} u(i,j) z(i,j))   (dependency parsing)
    With all penalties u(i,j) = 0, the constituency parse attaches some_3 to
    flies_2 while the dependency model attaches some_3 to jet_5. Iteration 1
    penalties: u(2,3) = −1, u(5,3) = 1. Re-decoding, both models attach
    some_3 to jet_5 and the two structures agree. Converged:
    y^* = \arg\max_{y \in Y} f(y) + g(y).
  62. Experiments. Properties evaluated: exactness and parsing accuracy.
    Data: the English Penn Treebank. Models: Collins (1997) Model 1, the
    semi-supervised dependency parser of Koo (2008), and the trigram tagger
    of Toutanova (2000).
  63. How quickly do the models converge? [Plots: percentage of examples
    converged vs. number of iterations (≤1 through ≤50), for integrated
    dependency parsing and for integrated POS tagging.]
  64. Integrated Constituency and Dependency Parsing: Accuracy. [Bar chart
    of F1 scores in the 87-92 range for Collins (1997) Model 1, fixed
    first-best dependencies from Koo (2008), and dual decomposition.]
  65. Integrated Parsing and Tagging: Accuracy. [Bar chart of F1 scores in
    the 87-92 range for fixed first-best tags from Toutanova (2000) and dual
    decomposition.]
  66. Dual Decomposition and Linear Programming Relaxations. Theorem: if the
    dual decomposition algorithm converges, then (y^(k), z^(k)) is the global
    optimum. Questions: what problem is dual decomposition solving, and why
    doesn't the algorithm always converge? Answer: dual decomposition
    searches over a linear programming relaxation of the original problem.
  67-70. Convex Hulls for CKY. A parse tree can be represented as a binary
    vector y ∈ Y, with y(A → B C, i, j, k) = 1 if rule A → B C is used over
    the span (i, j, k). [Figure: the discrete set Y and its convex hull
    conv(Y), with the optimum y^* in the direction of the weight vector w.]
    If f is linear, \arg\max_{y \in conv(Y)} f(y) is a linear program. The
    best point in an LP is a vertex, so CKY solves this LP (in symbols
    below).
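The claim in display form, with a linear objective f(y) = w · y over the rule indicators: a linear objective over a polytope attains its maximum at a vertex, and the vertices of conv(Y) are exactly the parse-tree vectors, so

```latex
\max_{y \in \mathcal{Y}} \; w \cdot y
  \;=\; \max_{\mu \in \operatorname{conv}(\mathcal{Y})} \; w \cdot \mu .
```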
  71-77. Combined Problem. The integral set is
    Q = {(y, z) : y ∈ Y, z ∈ Z, y(i,t) = z(i,t) for all (i,t)},
    with convex hull conv(Q). Dual decomposition instead searches over the
    relaxation
    Q′ = {(μ, ν) : μ ∈ conv(Y), ν ∈ conv(Z), μ(i,t) = ν(i,t) for all (i,t)}.
    [Figure: Q, conv(Q), and the strictly larger polytope Q′.] Depending on
    the weight vector w, the optimum (y^*, z^*) ∈ Q′ can be a point of Q
    (the algorithm converges with a certificate) or a fractional point in
    the strict outer bound. (The precise relationship is displayed below.)
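The standard Lagrangian-duality fact behind this slide: the minimum of the dual L(u) from slide 82 equals the maximum of the objective over the relaxed polytope Q′,

```latex
\min_u L(u) \;=\; \max_{(\mu, \nu) \in Q'} \; f(\mu) + g(\nu),
\qquad
Q' = \{ (\mu, \nu) : \mu \in \operatorname{conv}(\mathcal{Y}),\;
        \nu \in \operatorname{conv}(\mathcal{Z}),\;
        \mu(i,t) = \nu(i,t) \ \text{for all } (i,t) \}.
```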
  78. Are there points strictly in the outer bound? Yes. For a three-word
    sentence w1 w2 w3, take the convex combination of taggings
    0.5 × (A A A) + 0.5 × (A B B) and the convex combination of parses
    0.5 × (a tree tagging w1 w2 w3 as A A B) + 0.5 × (a tree tagging them as
    A B A). The per-position tag marginals of the two mixtures agree, so the
    pair lies in Q′, yet no single parse/tagging pair agrees: the best
    result can be a fractional solution, a convex combination of these
    structures.
  79. Summary. A dual decomposition algorithm for integrated decoding.
    Simple: uses only simple, off-the-shelf dynamic programming algorithms
    to solve a harder problem. Efficient: faster than classical methods for
    dynamic programming intersection. Strong guarantees: solves a linear
    programming relaxation, which gives a certificate of optimality; finds
    the exact solution on 99% of the examples. Widely applicable: similar
    techniques extend to other problems.
  80. Iterative Progress. [Plot: F1 score, % certificates, and % match as a
    function of the maximum number of dual decomposition iterations, up to
    K = 50.]
  81-82. Deriving the Algorithm.
    Goal: y^* = \arg\max_{y \in Y} f(y) + g(y).
    Rewrite: \arg\max_{y \in Y, z \in Z} f(y) + g(z) s.t. y(i,j) = z(i,j)
    for all i, j.
    Lagrangian: L(u, y, z) = f(y) + g(z) + \sum_{i,j} u(i,j) (y(i,j) − z(i,j)).
    The dual problem is to find \min_u L(u), where
    L(u) = \max_{y \in Y, z \in Z} L(u, y, z)
         = \max_{y \in Y} (f(y) + \sum_{i,j} u(i,j) y(i,j))
         + \max_{z \in Z} (g(z) − \sum_{i,j} u(i,j) z(i,j)).
    The dual is an upper bound: L(u) ≥ f(y^*) + g(z^*) for any u.
  83. A Subgradient Algorithm for Minimizing L(u).
    L(u) = \max_{y \in Y} (f(y) + \sum_{i,j} u(i,j) y(i,j))
         + \max_{z \in Z} (g(z) − \sum_{i,j} u(i,j) z(i,j)).
    L(u) is convex, but not differentiable. A subgradient of L(u) at u is a
    vector g_u such that for all v, L(v) ≥ L(u) + g_u · (v − u). Subgradient
    methods use updates u′ = u − α g_u. In fact, for our L(u),
    g_u(i,j) = y^*(i,j) − z^*(i,j), as verified below.
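A one-line check that this g_u is indeed a subgradient: evaluate the maxima defining L(v) at the maximizers y^*, z^* found at u (feasible there, though possibly suboptimal at v), writing a · b for \sum_{i,j} a(i,j) b(i,j):

```latex
L(v) \;\ge\; \big( f(y^*) + v \cdot y^* \big) + \big( g(z^*) - v \cdot z^* \big)
     \;=\; L(u) + (y^* - z^*) \cdot (v - u).
```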
  84. Related Work. Methods that use general-purpose linear programming or
    integer linear programming solvers (Martins et al., 2009; Riedel and
    Clarke, 2006; Roth and Yih, 2005). Dual decomposition / Lagrangian
    relaxation in combinatorial optimization (Dantzig and Wolfe, 1960; Held
    and Karp, 1970; Fisher, 1981). Dual decomposition for inference in MRFs
    (Komodakis et al., 2007; Wainwright et al., 2005). Methods that
    incorporate combinatorial solvers within loopy belief propagation (Duchi
    et al., 2007; Smith and Eisner, 2008).