Slides presented at my lab: an introduction to LP-MERT (Galley & Quirk, EMNLP 2011), an exact search algorithm for minimum error rate training in statistical machine translation.
Michel Galley and Chris Quirk, EMNLP 2011. Presenter: Tetsuo Kiso. MT study group meeting, November 22, 2012. Feel free to email me if these slides contain any mistakes! [email protected]
Thursday, November 29, 12
Common practice: parameters are tuned using MERT on a small development set (pipeline labels: parallel corpus, monolingual corpus, moses). Image url: http://wyofile.com/wp-content/uploads/2010/11/chiefjoe-haultruck.jpg
• directly optimizes the evaluation metric, e.g., BLEU (Papineni+, 2002), TER (Snover+, 2006)
• originally from speech recognition (as always)
• typically, the search space is approximated by N-best lists
  - Extensions: lattice MERT (Macherey+, 2008), hypergraph MERT (Kumar+, 2009)
• Och's line minimization efficiently searches the error surface when tuning a single parameter
• But it remains inexact when tuning parameters simultaneously (the multi-dimensional case)
• This paper: an exact search algorithm over parameter spaces in the multi-dimensional case (called LP-MERT)
• In practice, LP-MERT is often computationally expensive over thousands of sentences; use approximations:
  - search only promising regions
  - w/ beam search
Consider:

\hat{w} = \arg\min_{w} \Big\{ \sum_{s=1}^{S} E\big(r_s, \hat{e}(f_s; w)\big) \Big\}
        = \arg\min_{w} \Big\{ \sum_{s=1}^{S} \sum_{n=1}^{N} E(r_s, e_{s,n}) \, \delta\big(e_{s,n}, \hat{e}(f_s; w)\big) \Big\}

where \hat{e}(f_s; w) = \arg\max_{n \in \{1, \dots, N\}} w^{\top} h_{s,n}

r_s: reference; e_{s,n}: candidate translation; f_s: source sentence; E: error (e.g., 1 - BLEU); \delta: Kronecker delta; w, h \in \mathbb{R}^D: weight and feature vectors.
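The objective above is easy to state in code. A minimal sketch with plain Python lists standing in for real N-best lists (the helper names are mine, not from the paper):

```python
def best_hypothesis(w, feats):
    """e_hat(f; w): index n of the model-best hypothesis, maximizing w . h_n."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in feats]
    return max(range(len(feats)), key=scores.__getitem__)

def corpus_error(w, nbests, losses):
    """The MERT objective: total task loss E of the model-best hypotheses."""
    return sum(loss[best_hypothesis(w, feats)]
               for feats, loss in zip(nbests, losses))
```

MERT searches for the w that minimizes corpus_error; the rest of the talk is about doing that search exactly.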
Candidate space: e1, e2, ..., eN. Feature space (axes: feature 1, feature 2): each candidate e maps to a feature vector h(f, e); each point corresponds to the feature vector of one translation. NOTE: Henceforth, we simply write hi.
Task loss (lower is better) for each point (e.g., 0.37, h4: 0.31). By exhaustively enumerating the N translations, we can find the one which yields the minimal task loss. Assumption: task loss can be computed at sentence level.
Can a given translation be made the argmax for some w? This can be done by checking whether its feature vector is inside the convex hull or not. Interior points cannot be maximized!
Extreme points (vertices of the convex hull) can be maximized!
• We can compute the convex hull with standard algorithms (...1977; Barber+, 1996)
• Once we construct the convex hull, we can find the extreme points with the lowest loss!
• But this does not scale: computing the convex hull is very expensive! Time complexity of the best known convex hull algorithm (Barber+, 1996): O(N^(⌊D/2⌋+1))
• We don't need to construct the hull explicitly; we only need to test whether each point is extreme.
• We use linear programming (LP), because a polynomial-time algorithm (Karmarkar, 1984) is known that achieves this.
Time complexity: O(ND^3.5) ≪ O(N^(⌊D/2⌋+1))
Testing whether h is extreme with Karmarkar's algorithm (we don't explicitly construct the convex hull):
- return w (≠ 0) if h is extreme; the weight vector ŵ maximizes h
- return 0 otherwise (h is interior)
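The extremeness test can be sketched with an off-the-shelf LP solver in place of Karmarkar's algorithm. This follows the paper's LP formulation (maximize c·w with c = h_i − h_mean, subject to (h_j − h_i)·w ≤ 0), but the box bounds on w are my own addition to keep the LP bounded, and scipy is an assumption:

```python
import numpy as np
from scipy.optimize import linprog

def find_maximizing_w(H, i, tol=1e-9):
    """Return a weight vector w that makes H[i] the argmax of w.h over the
    rows of H, or the zero vector if H[i] is not extreme."""
    H = np.asarray(H, dtype=float)
    h_i = H[i]
    others = np.delete(H, i, axis=0)
    A_ub = others - h_i                  # (h_j - h_i) . w <= 0 for all j != i
    b_ub = np.zeros(len(others))
    c = h_i - H.mean(axis=0)             # push w toward h_i, away from the centroid
    # linprog minimizes, so negate c; box-bound w so the LP stays bounded
    res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(-1.0, 1.0)] * H.shape[1])
    if res.status == 0 and -res.fun > tol:
        return res.x                     # H[i] is extreme; w maximizes it
    return np.zeros(H.shape[1])          # H[i] is interior (or on a face)
```

A positive optimum certifies a strict separating direction, mirroring the slide's "return w (≠ 0) if h is extreme, 0 otherwise".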
Single-sentence LP-MERT (example losses: 0.28, 0.37, 0.31, 0.21 for h1-h4):
• compute task loss for the N-best translations
• sort the N-best by increasing loss (① ② ③ ④)
• for each candidate:
  - w ← run Karmarkar's algorithm
  - if w ≠ 0, return w (done!)
• return 0
Time complexity: O(N²D^3.5)
Figure 2 (Galley and Quirk, 2011): Running times to exactly optimize N-best lists with an increasing number of dimensions (seconds, log scale, vs. dimensions 2-20; curves: QuickHull vs. LP). To determine which feature vectors were on the hull, they use either linear programming (Karmarkar, 1984), O(ND^3.5), or one of the most efficient convex hull computation tools (Barber et al., 1996), O(N^(⌊D/2⌋+1)).
Tuning set: f1, f2, ..., fS; candidate space: an N-best list e_{s,1}, ..., e_{s,N} per sentence. Typically, we are using thousands of sentences (i.e., 1,000 < S < 10,000).
Naive extension to S sentences: enumerate all N^S hypothesis combinations and, for each combination, check whether it is extreme or not.
Combinations: O(N^S); per LP: O(N^S D^3.5); total: O(N^(2S) D^3.5).
Hereafter, we present several improvements:
⇒ Sparse hypothesis combination: O(NSD^3.5) per LP
⇒ Lazy enumeration, divide-and-conquer
(Galley and Quirk, 2011, Fig. 3 (a)-(b)) Think about S = 2: given two N-best lists, we ask whether we can find a weight vector that maximizes both h_{1,1} and h_{2,1}. Geometrically, translate an N-best list so that h_{1,1} = h'_{2,1}.
Then, we use LP with the new set of 2N-1 points to determine whether h_{1,1} is on the convex hull (i.e., check whether h_{1,1} is extreme) ⇒ yes (Fig. 3 (c))
Is (h_{1,1}, h_{2,2}) extreme? ⇒ No (Fig. 3 (d)). Figure 3 (Galley and Quirk, 2011): Given two N-best lists, (a) and (b), we use linear programming to determine which hypothesis combinations are extreme.
• The LP needs only S(N-1) + 1 points instead of N^S to determine whether a given point is extreme.
• This trick does not sacrifice the optimality of LP-MERT. See the paper's appendix for details.
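A sketch of how that reduced point set can be built, under the geometric reading above: translate each N-best list so that its picked point coincides with the combined vector h[m], then pool the shifted points (the function name is mine):

```python
import numpy as np

def sparse_combination_points(nbest_lists, picks):
    """Build the S(N-1)+1 points of sparse hypothesis combination (sketch).

    For a pick (n_1, ..., n_S), the combined vector h[m] is the sum of the
    picked vectors. Instead of enumerating all N^S combined points, translate
    every list so its picked point lands on h[m], and pool the results.
    """
    h_m = sum(lst[n] for lst, n in zip(nbest_lists, picks))
    points = [h_m]
    for lst, n in zip(nbest_lists, picks):
        shift = h_m - lst[n]           # translate list s so h_{s,n} -> h[m]
        points.extend(lst[j] + shift for j in range(len(lst)) if j != n)
    return np.array(points)
```

The LP then only has to decide whether h[m] is extreme among these S(N-1)+1 points.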
Enumerate all N^S hypothesis combinations and, for each combination, check whether it is extreme or not.
Combinations: O(N^S)
⇒ Sparse hypothesis combination: O(NSD^3.5) per LP, O(N^(S+1) S D^3.5) in total
⇒ Lazy enumeration, divide-and-conquer
The case of S = 2 (sparse hypothesis combination):
• Process combinations in increasing order of task loss
  - For each N-best list, compute task loss & sort all points (① ② ③ ④ ⑤ ⑥)
• This ensures the first extreme combination is optimal!
Lazy enumeration
• Process combinations in increasing order of task loss
  - For each N-best list, compute task loss & sort all points: O(SN log N)
• N.B.: for a non-decomposable metric (task loss not computable at sentence level), this costs O(N^S)
• Divide and conquer: split the S N-best lists into smaller groups, compute extreme combinations, and merge two N-best lists at a time
• Because S-ary lazy enumeration runs in O(N^S); the merge is similar to Algorithm 3 of Huang & Chiang (2005)
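The two-list merge at each tree node can be sketched as a heap-based frontier expansion, in the spirit of Algorithm 3 of Huang & Chiang (2005). This is only the enumeration half; in LP-MERT each popped pair would additionally be LP-checked for extremeness:

```python
import heapq

def lazy_pairs(losses_a, losses_b):
    """Lazily yield (total, i, j) for two loss-sorted lists, in increasing
    order of total loss losses_a[i] + losses_b[j], expanding the frontier
    one neighbor at a time instead of materializing all N^2 pairs."""
    heap = [(losses_a[0] + losses_b[0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        total, i, j = heapq.heappop(heap)
        yield total, i, j
        for ni, nj in ((i + 1, j), (i, j + 1)):   # expand the frontier
            if ni < len(losses_a) and nj < len(losses_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (losses_a[ni] + losses_b[nj], ni, nj))
```

Because the inputs are sorted, the heap always pops the next-cheapest untried combination, which is exactly the order the search needs.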
The case of 4 sentences: a binary tree {f1, f2, f3, f4} → {f1, f2}, {f3, f4} → f1, f2, f3, f4 (leaves h11, h21, h31, h41). NOTE: For each N-best list, the task loss is computed & all points are sorted in increasing order.
The case of 4 sentences, merging {f1, f2}:
• A table stores the losses of combinations: for (h11, h21), compute the sum of the losses E11 + E21 = 69.1
• Combine & check whether the combination is extreme: in this case, interior
• Enumerate the next best combinations (ExpandFrontier): (h11, h22) with E11 + E22 = 69.2 and (h12, h21) with E12 + E21 = 69.3 become the frontier nodes
• Choose the best frontier, combine & check whether the combination is extreme: extreme!
At scale there are still too many LPs to be solved.
• Approximation #1: require cos(w, w0) ≥ t
  - w0: a reasonable approximation of ŵ given by 1D-MERT
• Approximation #2:
  - prune the output of combined N-best lists by model score w.r.t. w_best (the best model on the entire tuning set)
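Approximation #1 amounts to a simple filter on candidate weight vectors; a sketch (the default threshold value is illustrative, not from the paper):

```python
import math

def passes_cosine_constraint(w, w0, t=0.95):
    """Approximation #1 (sketch): keep a candidate weight vector w only if it
    stays within a cone around w0, the 1D-MERT estimate: cos(w, w0) >= t."""
    dot = sum(a * b for a, b in zip(w, w0))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w0))
    return norm > 0 and dot / norm >= t
```

Candidates failing the test are skipped before any LP is solved, which is where the speedup in Figure 7 comes from.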
• Evaluate exact LP-MERT
  - effect of the number of features on runtime
  - scalability in the number of sentences
• Evaluate LP-MERT w/ beam search
  - BLEU scores for the dev set
  - end-to-end MERT comparison
• decoder: with cube pruning
• task: WMT 2010 English-to-German
  - dev set: WMT 2009 test set
  - test set: WMT 2010 test set
• features: 13 (using two language models)
• N-best list size: 100
• evaluation: sentence BLEU (Lin & Och, 2004)
• Baseline: 1D-MERT with random restarts and random walks (Moore & Quirk, 2008)
  - the values of the parameters they used in the experiments are not clear from the paper
"[LP-MERT and] 1D-MERT are evaluated on the same combined N-best lists, even though running multiple iterations of MERT with either LP-MERT or 1D-MERT would normally produce different combined N-best lists." (Galley and Quirk, 2011)
Figure 5 (Galley and Quirk, 2011): Line graph of sorted differences in BLEUn4r1[%] scores between LP-MERT and 1D-MERT on 1000 tuning sets of size S = 2, 4, 8 (1000 overlapping subsets for each S; experimented on 374-best lists after 5 iterations of MERT). NOTE: the x-coordinate is sorted by ΔBLEU; positive differences mean LP-MERT is better, negative mean 1D-MERT is better.
Table 1 (Galley and Quirk, 2011), N = 374?: Number of tested combinations ("tested" = lazy enumeration, "total" = full combination) for the experiments of Fig. 5.

  length   tested comb.   total comb.     order
  8        639,960        1.33 × 10^20    O(N^8)
  4        134,454        2.31 × 10^10    O(2N^4)
  2        49,969         430,336         O(4N^2)
  1        1,059          2,624           O(8N)

LP-MERT with S = 8 checks only 600K full combinations on average, much less than the total number of combinations (which is more than 10^20).
Figure 6 (Galley and Quirk, 2011): Effect of the number of features (runtime in seconds on 1 CPU of a modern computer, vs. dimension D = 2-9). Each curve represents a different number of tuning sentences (1 to 1024).
Figure 7 (Galley and Quirk, 2011): Effect of a constraint on w, cos(w0, w) ≥ t (runtime in seconds on 1 CPU, vs. cosine threshold; curves: tuning-set sizes 1 to 1024). A tighter constraint (with pruning) gives faster search.
• Evaluate exact LP-MERT
  - effect of the number of features on runtime
  - scalability in the number of sentences
• Evaluate LP-MERT w/ beam search
  - BLEU scores for the dev set
  - end-to-end MERT comparison
Table 2 (Galley and Quirk, 2011): BLEUn4r1[%] scores for English-German on WMT09 for tuning sets ranging from 32 to 1024 sentences.

  S          32      64      128     256     512     1024
  1D-MERT    22.93   20.70   18.57   16.07   15.00   15.44
  our work   25.25   22.28   19.86   17.05   15.56   15.67
  Δ          +2.32   +1.59   +1.29   +0.98   +0.56   +0.23

The gains are fairly substantial, with gains of 0.5 BLEU point or more in all cases where S ≤ 512. Finally, they perform an end-to-end MERT comparison.
End-to-end comparison (this table is created from the results of Galley and Quirk, 2011):

               dev set   WMT 2010 test set
  1D-MERT (9)  15.97     16.91
  LP-MERT (7)  16.21     17.08
  Δ            +0.24     +0.17
• Proposed LP-MERT, an exact search algorithm on N-best lists in the multi-dimensional case
  - gives us the optimal parameters as ground truth
  - in practice, two approximations are offered
• Experimental results: slightly improves over Och's MERT in terms of sentence BLEU