Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Reference: https://en.wikipedia.org/wiki/Transfer-based_machine_translation

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

◼ Notation
  ⚫ Desired output $\boldsymbol{y}^\star \in \mathcal{Y}$
    ▶ $\mathcal{Y} \coloneqq \mathcal{V}_Y^*$
  ⚫ Input $\boldsymbol{x} \in \mathcal{X}$
    ▶ $\mathcal{X} \coloneqq \mathcal{V}_X^*$
  ⚫ $\mathcal{V}_X^*, \mathcal{V}_Y^*$: sets of all token sequences over the source and target vocabularies
  ⚫ Candidate outputs $\boldsymbol{y}_1, \boldsymbol{y}_2, \boldsymbol{y}_3, \boldsymbol{y}_4, \dots$

Slide 6

Slide 6 text

◼ Notation
  ⚫ Desired output $\boldsymbol{y}^\star \in \mathcal{Y}$
    ▶ $\mathcal{Y} \coloneqq \mathcal{V}_Y^*$
  ⚫ Input $\boldsymbol{x} \in \mathcal{X}$
    ▶ $\mathcal{X} \coloneqq \mathcal{V}_X^*$
  ⚫ $\mathcal{V}_X^*, \mathcal{V}_Y^*$: sets of all token sequences over the source and target vocabularies

Slide 7

Slide 7 text

https://repositorio.ul.pt/bitstream/10451/10945/2/ulfl155512_tm_2.pdf

Slide 8

Slide 8 text

◼ Translation model $p(\boldsymbol{y} \mid \boldsymbol{x}; \theta)$
  ⚫ Probability of output $\boldsymbol{y}$ given input $\boldsymbol{x}$
◼ $\boldsymbol{x} \in \mathcal{V}_X^*$, $\boldsymbol{y} \in \mathcal{V}_Y^*$
  ⚫ $\mathcal{V}_X^*, \mathcal{V}_Y^*$: sets of all token sequences over the source and target vocabularies
◼ $\theta$: model parameters
  ⚫ Example: $p(\text{``This book is interesting''} \mid \boldsymbol{x}; \theta) = 0.8434$, $p(\text{``This book is delicious''} \mid \boldsymbol{x}; \theta) = 0.0013$

Slide 9

Slide 9 text

◼ Translation model $p(\boldsymbol{y} \mid \boldsymbol{x}; \theta)$: probability of output $\boldsymbol{y}$ given input $\boldsymbol{x}$
  ⚫ Autoregressive factorization:
    $p(\boldsymbol{y} \mid \boldsymbol{x}; \theta) = p(y_1 \mid \boldsymbol{x}; \theta)\, p(y_2 \mid y_1, \boldsymbol{x}; \theta)\, p(y_3 \mid y_2, y_1, \boldsymbol{x}; \theta) \cdots$
◼ $\boldsymbol{x} \in \mathcal{V}_X^*$, $\boldsymbol{y} \in \mathcal{V}_Y^*$
  ⚫ $\mathcal{V}_X^*, \mathcal{V}_Y^*$: sets of all token sequences over the source and target vocabularies
◼ $\theta$: model parameters
  ⚫ Example: $p(\text{interesting} \mid \text{``This book is''}, \boldsymbol{x}; \theta) = 0.2875$, $p(\text{delicious} \mid \text{``This book is''}, \boldsymbol{x}; \theta) = 0.0003$
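
Below is a minimal Python sketch of how the factorization above is evaluated in practice. `next_token_distribution` is a hypothetical stand-in for the model's per-step softmax output; it is not part of any particular library.

```python
import math
from typing import Callable, Sequence

# Hypothetical interface: given the source tokens and the target prefix,
# return a dict mapping each candidate next token to its probability.
NextTokenDist = Callable[[Sequence[str], Sequence[str]], dict[str, float]]

def sequence_log_prob(
    src: Sequence[str],
    tgt: Sequence[str],
    next_token_distribution: NextTokenDist,
) -> float:
    """Chain-rule score: log p(y|x) = sum_t log p(y_t | y_<t, x)."""
    log_p = 0.0
    for t, token in enumerate(tgt):
        step_dist = next_token_distribution(src, tgt[:t])  # p(. | y_<t, x)
        log_p += math.log(step_dist.get(token, 1e-12))     # guard against log(0)
    return log_p
```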

Slide 10

Slide 10 text

◼ Translation model $p(\boldsymbol{y} \mid \boldsymbol{x}; \theta)$: probability of output $\boldsymbol{y}$ given input $\boldsymbol{x}$
◼ MAP decoding: find $\boldsymbol{y}_{\mathrm{MAP}_\theta} \in \mathcal{Y}$ with
  $\boldsymbol{y}_{\mathrm{MAP}_\theta} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} p(\boldsymbol{y} \mid \boldsymbol{x}; \theta) = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \prod_{t=1}^{|\boldsymbol{y}|} p(y_t \mid \boldsymbol{y}_{<t}, \boldsymbol{x}; \theta)$
◼ $\mathcal{X} \coloneqq \mathcal{V}_X^*$, $\mathcal{Y} \coloneqq \mathcal{V}_Y^*$
  ⚫ Candidate outputs $\boldsymbol{y}_1, \boldsymbol{y}_2, \boldsymbol{y}_3, \boldsymbol{y}_4, \dots$
◼ $\theta$: model parameters

Slide 11

Slide 11 text

◼ Translation model $p(\boldsymbol{y} \mid \boldsymbol{x}; \theta)$: probability of output $\boldsymbol{y}$ given input $\boldsymbol{x}$
◼ MAP decoding: find $\boldsymbol{y}_{\mathrm{MAP}_\theta} \in \mathcal{Y}$ with
  $\boldsymbol{y}_{\mathrm{MAP}_\theta} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} p(\boldsymbol{y} \mid \boldsymbol{x}; \theta) = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \prod_{t=1}^{|\boldsymbol{y}|} p(y_t \mid \boldsymbol{y}_{<t}, \boldsymbol{x}; \theta)$
◼ $\mathcal{X} \coloneqq \mathcal{V}_X^*$, $\mathcal{Y} \coloneqq \mathcal{V}_Y^*$
  ⚫ Candidate outputs $\boldsymbol{y}_1, \boldsymbol{y}_2, \boldsymbol{y}_3, \boldsymbol{y}_4, \dots$
◼ $\theta$: model parameters

Slide 12

Slide 12 text

◼ Translation model $p(\boldsymbol{y} \mid \boldsymbol{x}; \theta)$: probability of output $\boldsymbol{y}$ given input $\boldsymbol{x}$
◼ MAP decoding: find $\boldsymbol{y}_{\mathrm{MAP}_\theta} \in \mathcal{Y}$ with
  $\boldsymbol{y}_{\mathrm{MAP}_\theta} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} p(\boldsymbol{y} \mid \boldsymbol{x}; \theta) = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \prod_{t=1}^{|\boldsymbol{y}|} p(y_t \mid \boldsymbol{y}_{<t}, \boldsymbol{x}; \theta)$
◼ $\mathcal{X} \coloneqq \mathcal{V}_X^*$, $\mathcal{Y} \coloneqq \mathcal{V}_Y^*$
  ⚫ Candidate outputs $\boldsymbol{y}_1, \boldsymbol{y}_2, \boldsymbol{y}_3, \boldsymbol{y}_4, \dots$
◼ $\theta$: model parameters
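
The argmax over all of $\mathcal{Y}$ is intractable, so MAP decoding is usually approximated by greedy or beam search. A minimal greedy sketch, reusing the hypothetical `next_token_distribution` interface from the earlier sketch:

```python
from typing import Callable, Sequence

NextTokenDist = Callable[[Sequence[str], Sequence[str]], dict[str, float]]

def greedy_decode(
    src: Sequence[str],
    next_token_distribution: NextTokenDist,
    eos: str = "</s>",
    max_len: int = 128,
) -> list[str]:
    """Greedy approximation to argmax_y prod_t p(y_t | y_<t, x):
    pick the single most probable next token at every step."""
    tgt: list[str] = []
    for _ in range(max_len):
        step_dist = next_token_distribution(src, tgt)
        next_token = max(step_dist, key=step_dist.get)
        if next_token == eos:
            break
        tgt.append(next_token)
    return tgt
```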

Slide 13

Slide 13 text

◼ Problems with MAP decoding
  ⚫ The model can assign high probability to degenerate outputs, including the empty sequence $p(\text{``''} \mid \boldsymbol{x}; \theta)$ (Ott+, ICML2018; Stahlberg & Byrne, EMNLP2019)
Ott+, ICML2018, “Analyzing Uncertainty in Neural Machine Translation”.
Stahlberg & Byrne, EMNLP2019, “On NMT Search Errors and Model Errors: Cat Got Your Tongue?”

Slide 14

Slide 14 text

◼ Bayes risk: $\mathrm{Risk}(\boldsymbol{y}) = \mathbb{E}_{\boldsymbol{y}' \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[\mathcal{L}(\boldsymbol{y}, \boldsymbol{y}')\bigr]$
◼ Minimum Bayes risk (MBR) decoding: select the output that minimizes the risk
  ⚫ $\operatorname*{argmin}_{\boldsymbol{y} \in \mathcal{Y}} \mathrm{Risk}(\boldsymbol{y})$
◼ $\mathcal{L}\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: loss function
◼ $\Pr(\cdot \mid \boldsymbol{x})$: true distribution over outputs given $\boldsymbol{x}$
Goel & Byrne, CS&L Vol. 14, 2000, “Minimum Bayes-risk automatic speech recognition”.
Kumar & Byrne, NAACL2004, “Minimum Bayes-Risk Decoding for Statistical Machine Translation”.

Slide 15

Slide 15 text

◼ Bayes risk: $\mathrm{Risk}(\boldsymbol{y}) = \mathbb{E}_{\boldsymbol{y}' \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[\mathcal{L}(\boldsymbol{y}, \boldsymbol{y}')\bigr]$
◼ Minimum Bayes risk (MBR) decoding: select the output that minimizes the risk
  ⚫ $\operatorname*{argmin}_{\boldsymbol{y} \in \mathcal{Y}} \mathrm{Risk}(\boldsymbol{y})$
◼ $\mathcal{L}\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: loss function
◼ $\Pr(\cdot \mid \boldsymbol{x})$: true distribution over outputs given $\boldsymbol{x}$
Goel & Byrne, CS&L Vol. 14, 2000, “Minimum Bayes-risk automatic speech recognition”.
Kumar & Byrne, NAACL2004, “Minimum Bayes-Risk Decoding for Statistical Machine Translation”.

Slide 16

Slide 16 text

◼ Expected utility theory (von Neumann & Morgenstern, 1944)
  ⚫ Prefer the option with the higher expected utility
  ⚫ Option 1: $\$1500 \times 0.75 + \$3000 \times 0.25 = \$1875$
  ⚫ Option 2: $\$1500 \times 0.25 + \$3000 \times 0.75 = \$2625$
von Neumann & Morgenstern, 1944, “Theory of Games and Economic Behavior”.

Slide 17

Slide 17 text

◼ Utility function
  ⚫ $u\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
  ⚫ Preference: $\boldsymbol{y} \succcurlyeq \boldsymbol{y}' \Leftrightarrow u(\boldsymbol{y}, \boldsymbol{r}) \ge u(\boldsymbol{y}', \boldsymbol{r})$
◼ $u\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: utility function
◼ $\succcurlyeq$: $\boldsymbol{y}$ is preferred at least as much as $\boldsymbol{y}'$
◼ $\boldsymbol{r} \in \mathcal{Y}$: reference output

Slide 18

Slide 18 text

◼ Utility function
  ⚫ $u\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
  ⚫ Preference: $\boldsymbol{y} \succcurlyeq \boldsymbol{y}' \Leftrightarrow u(\boldsymbol{y}, \boldsymbol{r}) \ge u(\boldsymbol{y}', \boldsymbol{r})$
◼ $u\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: utility function
◼ $\succcurlyeq$: $\boldsymbol{y}$ is preferred at least as much as $\boldsymbol{y}'$
◼ $\boldsymbol{r} \in \mathcal{Y}$: reference output

Slide 19

Slide 19 text

◼ MBR decoding with the true distribution:
  $\boldsymbol{y}_{\mathrm{MBR}}^{\mathrm{true}} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \mathbb{E}_{\boldsymbol{r} \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[u(\boldsymbol{y}, \boldsymbol{r})\bigr]$
◼ Equivalent to minimizing the Bayes risk:
  $\operatorname*{argmin}_{\boldsymbol{y} \in \mathcal{Y}} \mathrm{Risk}(\boldsymbol{y}) = \operatorname*{argmin}_{\boldsymbol{y} \in \mathcal{Y}} \mathbb{E}_{\boldsymbol{y}' \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[\mathcal{L}(\boldsymbol{y}, \boldsymbol{y}')\bigr]$
◼ $u\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: utility function
◼ $\Pr(\cdot \mid \boldsymbol{x})$: true distribution over outputs given $\boldsymbol{x}$
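
For completeness, the two formulations coincide when the loss is taken to be the negative utility, $\mathcal{L} \coloneqq -u$ (a common convention; the slide itself does not fix this choice):

```latex
\[
\operatorname*{argmin}_{\boldsymbol{y} \in \mathcal{Y}}
  \mathbb{E}_{\boldsymbol{y}' \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[\mathcal{L}(\boldsymbol{y}, \boldsymbol{y}')\bigr]
= \operatorname*{argmin}_{\boldsymbol{y} \in \mathcal{Y}}
  \mathbb{E}_{\boldsymbol{r} \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[-u(\boldsymbol{y}, \boldsymbol{r})\bigr]
= \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}}
  \mathbb{E}_{\boldsymbol{r} \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[u(\boldsymbol{y}, \boldsymbol{r})\bigr]
= \boldsymbol{y}_{\mathrm{MBR}}^{\mathrm{true}}.
\]
```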

Slide 20

Slide 20 text

◼ $\boldsymbol{y}_{\mathrm{MBR}}^{\mathrm{true}} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \mathbb{E}_{\boldsymbol{r} \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[u(\boldsymbol{y}, \boldsymbol{r})\bigr]$
◼ Obstacles in practice
  ⚫ The output space $\mathcal{Y}$ cannot be searched exhaustively
  ⚫ The true distribution $\Pr(\cdot \mid \boldsymbol{x})$ is unknown

Slide 21

Slide 21 text

◼ $\boldsymbol{y}_{\mathrm{MBR}}^{\mathrm{true}} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \mathbb{E}_{\boldsymbol{r} \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[u(\boldsymbol{y}, \boldsymbol{r})\bigr]$
◼ Obstacles in practice
  ⚫ The output space $\mathcal{Y}$ cannot be searched exhaustively
    ▶ Restrict the search to a hypothesis set $\mathcal{H} \subseteq \mathcal{Y}$
  ⚫ The true distribution $\Pr(\cdot \mid \boldsymbol{x})$ is unknown

Slide 22

Slide 22 text

◼ $\boldsymbol{y}_{\mathrm{MBR}}^{\mathrm{true}} = \operatorname*{argmax}_{\boldsymbol{y} \in \mathcal{Y}} \mathbb{E}_{\boldsymbol{r} \sim \Pr(\cdot \mid \boldsymbol{x})}\bigl[u(\boldsymbol{y}, \boldsymbol{r})\bigr]$
◼ Obstacles in practice
  ⚫ The output space $\mathcal{Y}$ cannot be searched exhaustively
    ▶ Restrict the search to a hypothesis set $\mathcal{H} \subseteq \mathcal{Y}$
  ⚫ The true distribution $\Pr(\cdot \mid \boldsymbol{x})$ is unknown

Slide 23

Slide 23 text

◼ Monte Carlo estimate (Eikema & Aziz, COLING2020)
  ⚫ Pseudo-references sampled from the model:
    $\hat{\mathcal{R}} \coloneqq \{\boldsymbol{r}_i \in \mathcal{Y} \mid \boldsymbol{r}_i \sim p(\boldsymbol{r} \mid \boldsymbol{x}; \theta)\}_{i=1}^{|\hat{\mathcal{R}}|}$
  ⚫ Empirical distribution, expected utility, and decision rule:
    $p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}}) \coloneqq \frac{m_{\hat{\mathcal{R}}}(\boldsymbol{r})}{|\hat{\mathcal{R}}|}$
    $\mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}}) \coloneqq \sum_{\boldsymbol{r} \in \mathrm{Supp}(\hat{\mathcal{R}})} p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}})\, u(\boldsymbol{h}, \boldsymbol{r})$
    $\boldsymbol{y}_{\mathrm{MBR}_\theta}^{\mathrm{MC}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} \mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}})$
◼ $\mathcal{H} \subseteq \mathcal{Y}$: hypothesis set
◼ $\hat{\mathcal{R}}$: multiset of sampled pseudo-references
◼ $\mathrm{Supp}(\hat{\mathcal{R}}) \subseteq \mathcal{Y}$: support (distinct elements) of $\hat{\mathcal{R}}$
◼ $m_{\hat{\mathcal{R}}}\colon \mathcal{Y} \to \mathbb{Z}_+$: count of occurrences in $\hat{\mathcal{R}}$
Eikema & Aziz, COLING2020, “Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation”.

Slide 24

Slide 24 text

◼ Monte Carlo estimate (Eikema & Aziz, COLING2020)
  ⚫ $\hat{\mathcal{R}} \coloneqq \{\boldsymbol{r}_i \in \mathcal{Y} \mid \boldsymbol{r}_i \sim p(\boldsymbol{r} \mid \boldsymbol{x}; \theta)\}_{i=1}^{|\hat{\mathcal{R}}|}$
  ⚫ $p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}}) \coloneqq \frac{m_{\hat{\mathcal{R}}}(\boldsymbol{r})}{|\hat{\mathcal{R}}|}$
  ⚫ $\mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}}) \coloneqq \sum_{\boldsymbol{r} \in \mathrm{Supp}(\hat{\mathcal{R}})} p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}})\, u(\boldsymbol{h}, \boldsymbol{r})$
  ⚫ $\boldsymbol{y}_{\mathrm{MBR}_\theta}^{\mathrm{MC}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} \mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}})$
Eikema & Aziz, COLING2020, “Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation”.
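
A minimal Python sketch of the Monte Carlo estimate above. The utility here is a toy token-level F1 standing in for the metrics used in practice (BLEU, chrF, COMET); duplicates among the sampled pseudo-references act as the counts $m_{\hat{\mathcal{R}}}$.

```python
from collections import Counter
from typing import Sequence

def token_f1(hyp: str, ref: str) -> float:
    """Toy utility u(h, r): token-level F1 overlap (stand-in for chrF/BLEU/COMET)."""
    h, r = hyp.split(), ref.split()
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(h), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def mbr_decode_mc(hypotheses: Sequence[str], pseudo_refs: Sequence[str]) -> str:
    """Monte Carlo MBR: pick the hypothesis with the highest average utility
    against the sampled pseudo-references (duplicates act as the MC weights)."""
    def expected_utility(h: str) -> float:
        return sum(token_f1(h, r) for r in pseudo_refs) / len(pseudo_refs)
    return max(hypotheses, key=expected_utility)
```

In practice the hypothesis set $\mathcal{H}$ is often simply the same set of samples $\hat{\mathcal{R}}$, as in the setup described later.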

Slide 25

Slide 25 text

◼ Monte Carlo MBR decoding (Eikema & Aziz, COLING2020):
  $\boldsymbol{y}_{\mathrm{MBR}_\theta}^{\mathrm{MC}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} \mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}})$
Eikema & Aziz, COLING2020, “Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation”.

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

◼ Setup
  ⚫ Hypothesis set: $\mathcal{H} = \hat{\mathcal{R}}$
  ⚫ $\epsilon = 0.02$

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Slide 30

Slide 30 text

◼ mbrs: a library for minimum Bayes risk decoding
  ⚫ Deguchi+, arXiv:2408.04167, “mbrs: A Library for Minimum Bayes Risk Decoding”.

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

◼ Computational cost: $\mathcal{O}(|\mathcal{H}||\hat{\mathcal{R}}|)$ utility evaluations
  ⚫ $\mathcal{O}(N^2)$ when $N \coloneqq |\mathcal{H}| = |\hat{\mathcal{R}}|$
◼ $\mathcal{H} \subseteq \mathcal{Y}$: hypothesis set
◼ $\hat{\mathcal{R}}$: set of sampled pseudo-references

Slide 34

Slide 34 text

◼ Aggregating references
  ⚫ Reference aggregation (DeNero+, ACL2009; Vamvas & Sennrich, ACL2024)
  ⚫ Centroid-based MBR (Deguchi+, Findings of ACL2024)
◼ Pruning candidates
  ⚫ Confidence-based pruning (Cheng & Vlachos, EMNLP2023)
◼ Approximating the utility matrix
  ⚫ Low-rank matrix completion (Trabelsi+, 2024)
DeNero+, ACL2009, “Fast Consensus Decoding over Translation Forests”.
Vamvas & Sennrich, ACL2024, “Linear-time Minimum Bayes Risk Decoding with Reference Aggregation”.
Deguchi+, Findings of ACL2024, “Centroid-Based Efficient Minimum Bayes Risk Decoding”.
Cheng & Vlachos, EMNLP2023, “Faster Minimum Bayes Risk Decoding with Confidence-based Pruning”.
Trabelsi+, 2024, “Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms”.

Slide 35

Slide 35 text

◼ Reference aggregation (DeNero+, ACL2009; Vamvas & Sennrich, ACL2024)
  ⚫ Represent each output by a feature vector $\phi(\boldsymbol{y})$ (e.g., n-gram statistics or an embedding)
  ⚫ Aggregate the references into a single weighted feature vector:
    $\bar{\phi}(\hat{\mathcal{R}}) = \sum_{\boldsymbol{r} \in \mathrm{Supp}(\hat{\mathcal{R}})} p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}})\, \phi(\boldsymbol{r})$
  ⚫ Score each hypothesis against the aggregate instead of against every reference:
    $\boldsymbol{y}_{\mathrm{RAMBR}_\theta}^{\mathrm{MC}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} s\bigl(\phi(\boldsymbol{h}), \bar{\phi}(\hat{\mathcal{R}})\bigr)$
  ⚫ Utility evaluations reduced from $\mathcal{O}(|\mathcal{H}||\hat{\mathcal{R}}|)$ to $\mathcal{O}(|\mathcal{H}| + |\hat{\mathcal{R}}|)$
◼ $\mathcal{H} \subseteq \mathcal{Y}$: hypothesis set
◼ $\hat{\mathcal{R}}$: set of sampled pseudo-references
◼ $\phi$: feature extraction function
◼ $s$: similarity function
DeNero+, ACL2009, “Fast Consensus Decoding over Translation Forests”.
Vamvas & Sennrich, ACL2024, “Linear-time Minimum Bayes Risk Decoding with Reference Aggregation”.
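
A toy Python sketch of the aggregation idea, assuming a bag-of-unigram feature map and cosine similarity. The cited papers aggregate n-gram statistics or sentence embeddings; none of the names below come from their code.

```python
from collections import Counter
from math import sqrt
from typing import Sequence

def phi(sentence: str) -> Counter:
    """Toy feature map: bag of unigrams (for illustration only)."""
    return Counter(sentence.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rambr_decode(hypotheses: Sequence[str], pseudo_refs: Sequence[str]) -> str:
    """Reference aggregation: average the reference features once, then score
    each hypothesis against the single aggregate (O(|H| + |R|) feature passes)."""
    agg: Counter = Counter()
    for r in pseudo_refs:
        for tok, cnt in phi(r).items():
            agg[tok] += cnt / len(pseudo_refs)   # uniform weights = p_MC
    return max(hypotheses, key=lambda h: cosine(phi(h), agg))
```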

Slide 36

Slide 36 text

◼ Centroid-based MBR (Deguchi+, Findings of ACL2024)
  ⚫ Represent each output by a $D$-dimensional feature vector $\phi\colon \mathcal{Y} \to \mathbb{R}^D$
  ⚫ Cluster the pseudo-references into $k$ centroids with $k$-means
  ⚫ Evaluate utilities against the $k$ centroids instead of all references
  ⚫ Cost: $\mathcal{O}(|\mathcal{H}|k + |\hat{\mathcal{R}}|k)$
Deguchi+, Findings of ACL2024, “Centroid-Based Efficient Minimum Bayes Risk Decoding”.
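
A simplified sketch of the centroid idea, assuming each sentence has already been mapped to a $D$-dimensional embedding and using cosine similarity as the utility. This illustrates the complexity argument rather than the paper's exact algorithm.

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Plain k-means returning k centroid vectors (shape: [k, D])."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids

def centroid_mbr(hyp_vecs: np.ndarray, ref_vecs: np.ndarray, k: int = 8) -> int:
    """Score each hypothesis embedding against k reference centroids only
    (O(|H|k + |R|k) utility evaluations); utility = cosine similarity here."""
    centroids = kmeans(ref_vecs, min(k, len(ref_vecs)))
    def cos(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a) + 1e-9)
    scores = np.array([cos(h, centroids).mean() for h in hyp_vecs])
    return int(scores.argmax())   # index of the selected hypothesis
```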

Slide 37

Slide 37 text

◼ Confidence-based pruning (Cheng & Vlachos, EMNLP2023)
  ⚫ Estimate utilities on growing subsets of the references and prune hypotheses that are unlikely to end up as the winner
Cheng & Vlachos, EMNLP2023, “Faster Minimum Bayes Risk Decoding with Confidence-based Pruning”.
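
A deliberately simplified sketch of the pruning idea: score hypotheses against a growing subset of references and discard the weakest fraction each round. The paper's criterion is confidence-based rather than this fixed keep-fraction; the `token_f1` utility from the earlier sketch can be plugged in.

```python
from typing import Callable, Sequence

def pruned_mbr(
    hypotheses: Sequence[str],
    pseudo_refs: Sequence[str],
    utility: Callable[[str, str], float],   # u(h, r), e.g. token_f1 above
    schedule: Sequence[int] = (8, 16, 32),  # number of references per round
    keep_frac: float = 0.5,                 # fraction of hypotheses kept per round
) -> str:
    """Prune hypotheses in rounds: score against a growing subset of references
    and keep only the best-scoring fraction before the next, larger round."""
    alive = list(hypotheses)
    for n_refs in schedule:
        refs = pseudo_refs[:n_refs]
        scored = sorted(
            alive,
            key=lambda h: sum(utility(h, r) for r in refs) / len(refs),
            reverse=True,
        )
        alive = scored[: max(1, int(len(scored) * keep_frac))]
        if len(alive) == 1:
            break
    # final decision among the survivors using all pseudo-references
    return max(alive, key=lambda h: sum(utility(h, r) for r in pseudo_refs))
```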

Slide 38

Slide 38 text

◼ Low-rank matrix completion (Trabelsi+, 2024)
  ⚫ The $|\mathcal{H}| \times |\hat{\mathcal{R}}|$ matrix of pairwise utilities is approximately low-rank
  ⚫ Compute only a subset of its entries
    ▶ Recover the remaining entries by matrix completion
  ⚫ Factorization: $H \in \mathbb{R}^{r \times |\mathcal{H}|}$, $R \in \mathbb{R}^{r \times |\hat{\mathcal{R}}|}$ with $M \approx H^\top R$
Trabelsi+, 2024, “Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms”.
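
A toy sketch of the completion step, assuming the utility has already been computed for a random subset of (hypothesis, reference) pairs indicated by `mask`; it fits $M \approx H^\top R$ by gradient descent on the observed entries only. This illustrates the idea rather than the paper's specific completion algorithm.

```python
import numpy as np

def complete_low_rank(M_obs: np.ndarray, mask: np.ndarray, rank: int = 4,
                      steps: int = 2000, lr: float = 0.05, seed: int = 0) -> np.ndarray:
    """Fill in a partially observed utility matrix by fitting M ≈ Hᵀ R,
    minimizing squared error on the observed entries only."""
    rng = np.random.default_rng(seed)
    n_h, n_r = M_obs.shape
    H = 0.1 * rng.standard_normal((rank, n_h))
    R = 0.1 * rng.standard_normal((rank, n_r))
    for _ in range(steps):
        err = mask * (H.T @ R - M_obs)   # error only where entries were observed
        H -= lr * (R @ err.T)            # gradient step w.r.t. H
        R -= lr * (H @ err)              # gradient step w.r.t. R
    return H.T @ R                       # completed utility matrix

# Usage sketch: compute a small random fraction of u(h, r), complete the rest,
# then pick the hypothesis with the largest row mean of the completed matrix.
```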

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

◼ Model-based MBR decoding (Jinnai+, ICML2024)
  ⚫ Weight each reference by its model probability instead of its sample count:
    $p_{\mathrm{MB}}(\boldsymbol{r} \mid \boldsymbol{x}; \mathcal{R}, \theta) \coloneqq \frac{p(\boldsymbol{r} \mid \boldsymbol{x}; \theta)}{\sum_{\boldsymbol{r}' \in \mathcal{R}} p(\boldsymbol{r}' \mid \boldsymbol{x}; \theta)}$
    $\mu_{\mathrm{MB}}(\boldsymbol{h}; \mathcal{R}, \theta) \coloneqq \sum_{\boldsymbol{r} \in \mathcal{R}} p_{\mathrm{MB}}(\boldsymbol{r} \mid \boldsymbol{x}; \mathcal{R}, \theta)\, u(\boldsymbol{h}, \boldsymbol{r})$
    $\boldsymbol{y}_{\mathrm{MBR}_\theta}^{\mathrm{MB}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} \mu_{\mathrm{MB}}(\boldsymbol{h}; \mathcal{R}, \theta)$
◼ $\mathcal{H} \subseteq \mathcal{Y}$: hypothesis set
◼ $\mathcal{R}$: set of references
Jinnai+, ICML2024, “Model-Based Minimum Bayes Risk Decoding for Text Generation”.
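
A minimal Python sketch of the model-based weighting, assuming the sequence log-probabilities $\log p(\boldsymbol{r} \mid \boldsymbol{x}; \theta)$ of the references are available (e.g., from a scoring pass of the model); any pairwise utility such as the toy `token_f1` above can be plugged in.

```python
import math
from typing import Callable, Sequence

def mbr_decode_mb(
    hypotheses: Sequence[str],
    references: Sequence[str],
    ref_logprobs: Sequence[float],            # log p(r | x; θ) for each reference
    utility: Callable[[str, str], float],     # u(h, r), e.g. token_f1 above
) -> str:
    """Model-based MBR: weight each reference by its renormalized model
    probability rather than by how often it was sampled."""
    # softmax over the reference log-probabilities = p_MB(r | x; R, θ)
    m = max(ref_logprobs)
    weights = [math.exp(lp - m) for lp in ref_logprobs]
    z = sum(weights)
    weights = [w / z for w in weights]
    def expected_utility(h: str) -> float:
        return sum(w * utility(h, r) for w, r in zip(weights, references))
    return max(hypotheses, key=expected_utility)
```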

Slide 41

Slide 41 text

◼ Model-based estimate:
  $p_{\mathrm{MB}}(\boldsymbol{r} \mid \boldsymbol{x}; \mathcal{R}, \theta) \coloneqq \frac{p(\boldsymbol{r} \mid \boldsymbol{x}; \theta)}{\sum_{\boldsymbol{r}' \in \mathcal{R}} p(\boldsymbol{r}' \mid \boldsymbol{x}; \theta)}$,
  $\mu_{\mathrm{MB}}(\boldsymbol{h}; \mathcal{R}, \theta) \coloneqq \sum_{\boldsymbol{r} \in \mathcal{R}} p_{\mathrm{MB}}(\boldsymbol{r} \mid \boldsymbol{x}; \mathcal{R}, \theta)\, u(\boldsymbol{h}, \boldsymbol{r})$,
  $\boldsymbol{y}_{\mathrm{MBR}_\theta}^{\mathrm{MB}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} \mu_{\mathrm{MB}}(\boldsymbol{h}; \mathcal{R}, \theta)$
◼ Monte Carlo estimate:
  $p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}}) \coloneqq \frac{m_{\hat{\mathcal{R}}}(\boldsymbol{r})}{|\hat{\mathcal{R}}|}$,
  $\mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}}) \coloneqq \sum_{\boldsymbol{r} \in \mathrm{Supp}(\hat{\mathcal{R}})} p_{\mathrm{MC}}(\boldsymbol{r} \mid \boldsymbol{x}; \hat{\mathcal{R}})\, u(\boldsymbol{h}, \boldsymbol{r})$,
  $\boldsymbol{y}_{\mathrm{MBR}_\theta}^{\mathrm{MC}} = \operatorname*{argmax}_{\boldsymbol{h} \in \mathcal{H}} \mu_{\mathrm{MC}}(\boldsymbol{h}; \hat{\mathcal{R}})$
◼ $\mathcal{H} \subseteq \mathcal{Y}$: hypothesis set; $\hat{\mathcal{R}}$: sampled pseudo-references; $\mathcal{R}$: set of references
Jinnai+, ICML2024, “Model-Based Minimum Bayes Risk Decoding for Text Generation”.

Slide 42

Slide 42 text

◼ mbrs: a library for minimum Bayes risk decoding
Deguchi+, arXiv:2408.04167, “mbrs: A Library for Minimum Bayes Risk Decoding”.

Slide 43

Slide 43 text

◼ Utility function $u$
Deguchi+, arXiv:2408.04167, “mbrs: A Library for Minimum Bayes Risk Decoding”.

Slide 44

Slide 44 text

Deguchi+, arXiv:2408.04167, “mbrs: A Library for Minimum Bayes Risk Decoding”.

Slide 45

Slide 45 text

No content