Slide 1

Slide 1 text

Unbalanced Optimal Transport for Unbalanced Word Alignment Yuki Arase Associate Professor, Osaka University, Japan NLP

Slide 2

Slide 2 text

About Me
• Career
  • 2010-2014 Associate Researcher, Microsoft Research Asia, China
  • 2014- Associate Professor, Osaka University, Japan
• Research Interest
  • Paraphrase recognition and generation
  • NLP for language education and healthcare
• Community Service
  • PC Chair @ IJCNLP-AACL2023
  • MAL @ Asian Federation of NLP

Slide 3

Slide 3 text

Monolingual Word Alignment
• Identifies semantically corresponding words in a sentence pair
• Is crucial for modelling semantic interactions between sentences:
  • Paraphrase & entailment recognition
  • Summarization & sentence fusion
  • Interpretability, provenance of LLM outputs
Example sentence pair:
  The agency described in a statement that the information was a pack of lies
  It said in a bulletin that reports about the incident are cheap lies and news rumors

Slide 4

Slide 4 text

Unbalanced Word Alignment
• Null alignment is prevalent in semantically divergent sentences, which makes the alignment problem challenging
  • Null alignment ratio can be ~64% in entailment sentence pairs
• Identification of null alignment is useful to declare semantic gaps and reason about semantic (dis)similarity
Example sentence pair:
  The agency described in a statement that the information was a pack of lies
  It said in a bulletin that reports about the incident are cheap lies and news rumors

Slide 5

Slide 5 text

Unbalanced Word Alignment
• Null alignment is prevalent in semantically divergent sentences, which makes the alignment problem challenging
  • Null alignment ratio can be ~64% in entailment sentence pairs
• Identification of null alignment is useful to declare semantic gaps and reason about semantic (dis)similarity
Example sentence pair:
  Two days later , a 28-year-old man died in a shark attack in Avon , North Carolina .
  A shark attacked a human being .

Slide 6

Slide 6 text

Related Work
• Bilingual word alignment has been commonly studied for MT, e.g., (Garg et al. 2019, Zenkel et al. 2020)
  • Assumes the availability of a large-scale parallel corpus
• Monolingual word alignment commonly uses supervised learning, e.g., (Yao et al. 2013, Lan et al. 2021)
  • Models word alignment with a CRF, treating source words as observations and target words as hidden states
• Null alignment has received less attention.
  • Critical to handle semantically divergent sentence pairs

Slide 7

Slide 7 text

Optimal Transport (OT) Problems
[Figure: OT illustration with a cost matrix (color scale 0.0 to 1.0)]
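For reference, the balanced OT problem pictured on this slide can be written in its standard Kantorovich form. The notation is assumed here rather than taken from the slide: C is the n x m cost matrix, a and b are the mass vectors of the two sides, and P is the transport matrix whose entries P_{i,j} say how much mass moves from item i to item j.

```latex
\min_{P \in \mathbb{R}_{\ge 0}^{n \times m}} \; \langle C, P \rangle
  = \sum_{i=1}^{n} \sum_{j=1}^{m} C_{i,j} \, P_{i,j}
\quad \text{s.t.} \quad
P \mathbf{1}_m = a, \qquad P^{\top} \mathbf{1}_n = b
```

Both marginal constraints are equalities, so every unit of mass on either side must be transported. This is why the balanced problem on its own has no way to leave a word unaligned.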

Slide 8

Slide 8 text

Partial and Unbalanced OT
[Figure: cost matrix (color scale 0.0 to 1.0) with null alignment marked]
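As a hedged sketch of how the two relaxations are usually stated in the OT literature (the exact variants used in the talk may differ): with an n x m cost matrix C, mass vectors a and b, and transport matrix P, partial OT fixes the total transported mass \mu, while unbalanced OT replaces the hard marginal constraints with penalties. The symbols \mu, \tau_a, and \tau_b are notation assumed here.

```latex
% Partial OT: only a total mass \mu \le \min(\|a\|_1, \|b\|_1) is transported
\min_{P \ge 0} \; \langle C, P \rangle
\quad \text{s.t.} \quad
P \mathbf{1}_m \le a, \qquad
P^{\top} \mathbf{1}_n \le b, \qquad
\mathbf{1}_n^{\top} P \, \mathbf{1}_m = \mu

% Unbalanced OT: marginal deviations are penalized (here with KL) instead of forbidden
\min_{P \ge 0} \; \langle C, P \rangle
  + \tau_a \, \mathrm{KL}\!\left( P \mathbf{1}_m \,\|\, a \right)
  + \tau_b \, \mathrm{KL}\!\left( P^{\top} \mathbf{1}_n \,\|\, b \right)
```

In both forms, mass that is left untransported corresponds to the null alignment marked on the slide.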

Slide 9

Slide 9 text

Unbalanced Word Alignment as OT
[Figure: distance matrix (color scale 0.0 to 1.0) between source words w^s_1, ..., w^s_8 and target words w^t_1, ..., w^t_7]

Slide 10

Slide 10 text

Link to Statistical Word Alignment
[Figure: distance matrix (color scale 0.0 to 1.0) between source words w^s_1, ..., w^s_8 and target words w^t_1, ..., w^t_7, annotated with Distortion (IBM Model 2) and Fertility (IBM Model 3)]
Brown et al. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. CL.

Slide 11

Slide 11 text

Optimal Transport Alignment: OTAlign
• Leverage balanced, partial, and unbalanced OT for unbalanced word alignment
• Obtain contextualized word embeddings using a pretrained LM, namely BERT
  • Cost: Cosine and Euclidean distances of embeddings
  • Fertility: L2-norms / Uniform
• Sparsify alignments
  • Regularization on OT makes the alignment matrix dense
  • Prune alignments whose probability is smaller than a threshold
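The pipeline on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released OTAlign code: the embeddings are toy vectors standing in for BERT outputs, only the balanced, entropy-regularized case with cosine cost and uniform fertility is shown, and the function names, the `eps` and `threshold` values, and the max-normalization before thresholding are all hypothetical choices.

```python
import numpy as np


def cosine_cost(src, tgt):
    """Cosine distance between every source and every target embedding."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return 1.0 - src @ tgt.T


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized balanced OT solved with Sinkhorn iterations."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match target marginals
        u = a / (K @ v)              # rescale to match source marginals
    return u[:, None] * K * v[None, :]


def align(src, tgt, threshold=0.5, eps=0.1):
    """Return the set of (source index, target index) pairs kept after pruning."""
    n, m = len(src), len(tgt)
    a = np.full(n, 1.0 / n)          # uniform fertility (assumption: "Uniform" option)
    b = np.full(m, 1.0 / m)
    plan = sinkhorn(cosine_cost(src, tgt), a, b, eps=eps)
    plan = plan / plan.max()         # hypothetical normalization so the threshold is scale-free
    rows, cols = np.nonzero(plan > threshold)
    return {(int(i), int(j)) for i, j in zip(rows, cols)}
```

Because the regularized plan is dense, the thresholding step is what turns it into a sparse set of links; words whose row or column ends up with no surviving entry are the ones treated as null-aligned.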

Slide 12

Slide 12 text

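Experiment Settings
• Datasets with human alignment
  • MSR-RTE, Edinburgh++
  • MultiMWA: MTRef, Wiki, Newsela, and ArXiv
• Evaluation metrics
  $$\mathrm{precision} = \frac{|\hat{\mathbb{Y}}_a \cap \mathbb{Y}_a| + |\hat{\mathbb{Y}}_\emptyset \cap \mathbb{Y}_\emptyset|}{|\hat{\mathbb{Y}}_a| + |\hat{\mathbb{Y}}_\emptyset|}, \qquad \mathrm{recall} = \frac{|\hat{\mathbb{Y}}_a \cap \mathbb{Y}_a| + |\hat{\mathbb{Y}}_\emptyset \cap \mathbb{Y}_\emptyset|}{|\mathbb{Y}_a| + |\mathbb{Y}_\emptyset|}$$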
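The metrics above count correct null alignments alongside correct alignment links. A minimal sketch of that computation follows, under two assumptions that the slide does not spell out: a word counts as null-aligned when it appears in no alignment pair, and F1 is the usual harmonic mean of precision and recall. The function names are hypothetical.

```python
def null_set(pairs, n_src, n_tgt):
    """Words on either side that take part in no alignment pair."""
    aligned_src = {i for i, _ in pairs}
    aligned_tgt = {j for _, j in pairs}
    src_null = {("src", i) for i in range(n_src) if i not in aligned_src}
    tgt_null = {("tgt", j) for j in range(n_tgt) if j not in aligned_tgt}
    return src_null | tgt_null


def precision_recall_f1(pred, gold, n_src, n_tgt):
    """Precision, recall, and F1 over alignment links plus null alignments."""
    pred_null = null_set(pred, n_src, n_tgt)
    gold_null = null_set(gold, n_src, n_tgt)
    hits = len(pred & gold) + len(pred_null & gold_null)
    pred_total = len(pred) + len(pred_null)
    gold_total = len(gold) + len(gold_null)
    precision = hits / pred_total if pred_total else 0.0
    recall = hits / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, with three source words, two target words, gold links `{(0, 0), (1, 1)}` and predicted links `{(0, 0), (2, 1)}`, only the link `(0, 0)` is shared and the predicted and gold null words differ, so precision and recall are both 1/3.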

Slide 13

Slide 13 text

Unsupervised Alignment: Per Corpus
[Observation 1] The best OT problem depends on null alignment ratios

Corpus (sparse ↔ dense)     MSR-RTE       Newsela       EDB++         MTRef         Arxiv         Wiki
Alignment links             S      S+P    S      S+P    S      S+P    S      S+P    S      S+P    S
Null rate (%)               63.8   59.0   33.3   23.5   27.4   19.0   18.7   11.2   12.8   12.2   8.3
fast-align                  42.3   41.6   58.4   56.5   59.6   60.8   58.1   58.0   80.5   80.5   87.2
SimAlign                    85.4   81.5   76.7   77.3   74.7   78.9   74.8   75.8   91.7   91.9   94.8

Type  Reg.  cost    mass
BOT   --    cosine  uniform 20.6   22.5   41.4   46.9   49.0   55.0   50.4   55.5   65.6   66.2   66.5
BOT   Sk    cosine  uniform 88.8   83.0   83.7   79.4   84.4   82.8   77.3   77.2   90.4   90.9   93.9
POT   --    cosine  uniform 89.0   84.0   77.1   76.2   78.4   78.7   75.6   76.2   84.3   89.9   94.5
POT   Sk    cosine  uniform 92.2   86.4   84.6   79.8   83.8   82.3   77.0   76.6   91.5   90.3   93.9
UOT   Sk    cosine  uniform 90.2   84.5   83.1   79.1   84.7   82.5   77.2   77.1   90.0   89.6   93.8

Slide 14

Slide 14 text

Unsupervised Alignment: Per Null Rate
[Figure: Alignment F1 (%) versus null ratio (%). Series: fast-align; SimAlign; BOT: cos, uniform; Regularised BOT: cos, uniform; POT: cos, uniform; Regularised POT: cos, uniform; UOT: cos, uniform]

Slide 15

Slide 15 text

Unsupervised Alignment: Per Null Rate
[Figure: Alignment F1 (%) versus null ratio (%). Series: fast-align; SimAlign; BOT: cos, uniform; Regularised BOT: cos, uniform; POT: cos, uniform; Regularised POT: cos, uniform; UOT: cos, uniform]
[Observation 2] Thresholding on the alignment matrix makes it unbalanced.

Slide 16

Slide 16 text

Supervised Alignment
• The entropy-regularized OT is differentiable and thus can be directly integrated into neural models.
• Fine-tune the entire model by minimizing the binary cross-entropy loss:
  $$\mathcal{L}(P_{i,j}, Y_{i,j}) = -Y_{i,j} \log P_{i,j} - (1 - Y_{i,j}) \log(1 - P_{i,j})$$
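The loss above is defined per cell of the alignment matrix. A small NumPy sketch of evaluating it over a whole matrix is given below; averaging over cells and clipping probabilities away from 0 and 1 are assumptions made here for numerical convenience, and `bce_loss` is a hypothetical name. In actual fine-tuning the same expression would be computed in an autodiff framework so that gradients flow back through the OT layer.

```python
import numpy as np


def bce_loss(P, Y, tiny=1e-12):
    """Mean binary cross-entropy between predicted alignment probabilities P
    and gold 0/1 alignment labels Y (both arrays of the same shape)."""
    P = np.clip(P, tiny, 1.0 - tiny)   # avoid log(0); an implementation assumption
    per_cell = -Y * np.log(P) - (1.0 - Y) * np.log(1.0 - P)
    return float(np.mean(per_cell))
```

A prediction of 0.5 in every cell gives a loss of log 2 regardless of the gold labels, which is a handy sanity check when wiring this up.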

Slide 17

Slide 17 text

Supervised Alignment: Per Corpus
[Observation 3] OT-based alignment is competitive against the SoTA methods on datasets with higher null alignment ratios.

Corpus (sparse ↔ dense)     MSR-RTE       Newsela       EDB++         MTRef         Arxiv         Wiki
Alignment links             S      S+P    S      S+P    S      S+P    S      S+P    S      S+P    S
Null rate (%)               63.8   59.0   33.3   23.5   27.4   19.0   18.7   11.2   12.8   12.2   8.3
(Lan et al. 2021)           95.1   89.2   86.7   85.3   88.3   87.8   83.4   86.1   95.2   95.0   96.6
(Nagata et al. 2020)        95.0   89.2   79.4   82.4   86.9   87.2   82.9   88.0   89.1   89.5   96.5

Type  cost    mass
BOT   cosine  norm          94.6   88.4   86.5   84.4   85.7   85.4   82.9   87.3   91.7   93.0   96.5
POT   cosine  norm          94.6   88.4   84.0   81.4   85.5   83.7   82.0   85.2   93.0   92.2   95.5
UOT   cosine  norm          94.8   89.0   86.8   84.7   86.7   86.6   82.9   87.4   92.5   92.8   96.7

Slide 18

Slide 18 text

Supervised Alignment: Per Null Rate
[Figure: Alignment F1 (%), shown from 70% to 100%, versus null ratio (%). Series: (Lan et al., 2021); (Nagata et al., 2020); Regularised BOT: cos, norm; Regularised POT: cos, norm; UOT: cos, norm]

Slide 19

Slide 19 text

OTAlign Example
[Figure: alignment example comparing the state of the art (Lan et al. 2021) with OTAlign (Unbalanced OT)]

Slide 20

Slide 20 text

Unsupervised Bilingual Word Alignment
• Applied OTAlign to bilingual word alignment
• Multilingual pre-trained model: LaBSE

Corpus                               de-en  sv-en  fr-en  ro-en  ja-en  zh-en
Awesome-align (Dou and Neubig 2021)  82.5   90.2   94.3   72.1   54.5   82.1
AccAlign (Wang et al. 2022)          84.0   92.6   95.5   79.2   56.7   83.8

Type  cost    mass
BOT   cosine  norm                   82.1   90.5   92.8   76.6   51.8   84.0
UOT   cosine  norm                   85.3   93.6   96.3   79.9   59.5   84.8

* Hyper-parameters were tuned on the dev set (cs-en)

Slide 21

Slide 21 text

Summary
This is the first study that connects the paradigms of unbalanced word alignment and the OT problems.
We empirically
1. showed that OTAlign is a natural and powerful tool for unbalanced word alignment without tailor-made techniques
2. provided a comprehensive picture that unveils the characteristics of the OT problems on unbalanced word alignment
OTAlign:

Slide 22

Slide 22 text

All Resources Are Available!
• Yuki Arase, Han Bao, and Sho Yokoi. 2023. Unbalanced Optimal Transport for Unbalanced Word Alignment. In Proc. of ACL 2023.
  • OTAlign:
• Yuki Arase and Jun’ichi Tsujii. 2020. Compositional Phrase Alignment and Beyond. In Proc. of EMNLP 2020.
• Sora Kadotani and Yuki Arase. 2023. Monolingual Phrase Alignment as Parse Forest Mapping. In Proc. of *SEM 2023.
  • Phrase Aligner: