Slide 1

Deep Copycat Networks for Text-to-Text Generation
Julia Ive, Pranava Madhyastha, Lucia Specia (EMNLP 2019)
Presenter: 平澤 寅庄 (TMU, 小町研, M1)
3 February 2020 @小町研

Slide 2

0 Overview
● Proposed a flexible pointer network framework for text-to-text generation tasks
  ○ copycat network based on the Transformer
  ○ Dual-source copycat network
  ○ Dual-source double-attention copycat network
● Two tasks
  ○ Summarisation
  ○ Automatic Post-Editing (APE)
● Competitive results for summarisation
● Better novel/repetitive n-gram rates for APE

Slide 3

1 Introduction
● Seq-to-seq models are "over-creative"
● copycat framework
  ○ Based on the Transformer network
  ○ Does not use a coverage penalty
● Better performance w.r.t.
  ○ Novel n-gram generation rate (higher is better)
  ○ Repetition rate (lower is better)
● Dual-source setting: automatic post-editing task
  ○ Conditioned on two inputs and can copy from either of them
    ■ Source-language sentence
    ■ Original machine translation output

Slide 4

2 Related work
● Abstractive summarisation
  ○ Pointer network [See et al., 2017]
  ○ Reinforcement learning
    ■ ROUGE and maximum likelihood [Paulus et al., 2017]
    ■ Global quality estimator [Li et al., 2018]
  ○ Content selection [Gehrmann et al., 2018]
● Automatic Post-Editing (APE)
  ○ Machine translation [Simard and Foster, 2013]
  ○ Multi-source Transformer architectures [Junczys-Dowmunt and Grundkiewicz, 2018]
  ○ Require large-scale datasets: an unrealistic scenario
  ○ SOTA APE models modify ~30% of inputs [Chatterjee et al., 2018]
    ■ Only ~50% of the modifications are positive changes

Slide 5

Pointer-Generator Network [See et al., 2017]
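
For reference, the pointer-generator output distribution of See et al. (2017) interpolates the vocabulary distribution with the attention-based copy distribution (the standard formulation, shown here for context):

    P(w) = p_{\mathrm{gen}} P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a^t_i

where p_gen is a sigmoid gate computed from decoder states and a^t are the attention weights over the source positions.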

Slide 6

3 copycat Networks
1. Transformer Architecture
2. Pointer-generator Network
3. Dual-source extension

Slide 7

3.1 Transformer Architecture
● Based on Vaswani et al., 2017 (a minimal sketch follows below)
  ○ 6 layers each for encoder and decoder
  ○ Model dimension of 512
● Encoder
  ○ Sub-layers: multi-head self-attention, position-wise feed-forward network
  ○ Output:
● Decoder
  ○ Sub-layers: multi-head self-attention, multi-head cross-attention, position-wise FFNN
  ○ Output:
● Other details
  ○ Vocabulary: Byte Pair Encoding (BPE) applied
  ○ Training: cross-entropy loss
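
A rough illustration of this backbone (not the authors' code): a minimal PyTorch sketch with the layer count and model dimension from the slide; the head count and feed-forward size are assumed defaults from Vaswani et al., 2017.

    import torch
    import torch.nn as nn

    # Standard Transformer with the hyperparameters listed on the slide.
    model = nn.Transformer(
        d_model=512,            # model dimension (slide)
        nhead=8,                # assumed number of attention heads
        num_encoder_layers=6,   # slide: 6 encoder layers
        num_decoder_layers=6,   # slide: 6 decoder layers
        dim_feedforward=2048,   # assumed FFN size (Vaswani et al. default)
    )

    # Dummy pass with already-embedded inputs, shape (length, batch, d_model).
    src = torch.rand(400, 2, 512)   # articles truncated to 400 subwords
    tgt = torch.rand(60, 2, 512)
    out = model(src, tgt)           # -> (60, 2, 512)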

Slide 8

3.2 Pointer-generator Network
● Based on See et al., 2017
● Attention over the encoder
  ○ Dot-product attention
  ○ Computed from the states of the last decoder layer, just before the normalisation layer
● Generation probability
  ○ Sigmoid function
  ○ Computed from the states of the last decoder layer, just before the normalisation layer
  ○ Interpolation of the decoder's vocabulary probability and the copy probability (see the sketch below)
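
A minimal PyTorch sketch of one pointer-generator step as described above (an illustration, not the authors' code): dot-product attention over the encoder states, a sigmoid gate p_gen from the decoder state, and interpolation of the two distributions. The tensor names and the gate parameterisation here are assumptions.

    import torch
    import torch.nn.functional as F

    def pointer_generator_step(dec_state, enc_states, src_ids, vocab_logits, w_gate, vocab_size):
        """dec_state: (d,), enc_states: (src_len, d), src_ids: (src_len,) long tensor of
        source token ids, vocab_logits: (vocab_size,), w_gate: (d,) gate weights."""
        # Dot-product attention over encoder states
        attn = F.softmax(enc_states @ dec_state, dim=0)                # (src_len,)
        # Generation probability from the decoder state (sigmoid gate)
        p_gen = torch.sigmoid(w_gate @ dec_state)                      # scalar
        # Vocabulary distribution from the decoder
        p_vocab = F.softmax(vocab_logits, dim=0)                       # (vocab_size,)
        # Copy distribution: scatter attention mass onto source token ids
        p_copy = torch.zeros(vocab_size).scatter_add(0, src_ids, attn)
        # Interpolate generation and copy distributions
        return p_gen * p_vocab + (1 - p_gen) * p_copy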

Slide 9

3.3 Dual-source extension 1/2
● Standard dual-source copycat framework (see the sketch below)
  ○ Encoder for the machine-translated output:
  ○ Stacked decoder sub-layer:
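
One common way to realise a "stacked" dual-source decoder block is a cross-attention sub-layer over the source encoder followed by a second one over the MT-output encoder. The sketch below is an illustrative assumption of that layout, not necessarily the paper's exact formulation; head count and FFN size are also assumed.

    import torch.nn as nn

    class StackedDualSourceBlock(nn.Module):
        def __init__(self, d_model=512, nhead=8, d_ff=2048):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, nhead)
            self.src_attn = nn.MultiheadAttention(d_model, nhead)   # over source encoder
            self.mt_attn = nn.MultiheadAttention(d_model, nhead)    # over MT-output encoder
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

        def forward(self, tgt, src_mem, mt_mem):
            # Each sub-layer is residual + layer norm, as in the standard Transformer.
            x = self.norms[0](tgt + self.self_attn(tgt, tgt, tgt)[0])
            x = self.norms[1](x + self.src_attn(x, src_mem, src_mem)[0])  # source cross-attention
            x = self.norms[2](x + self.mt_attn(x, mt_mem, mt_mem)[0])     # stacked MT cross-attention
            return self.norms[3](x + self.ffn(x))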

Slide 10

3.3 Dual-source extension 2/2
● Dual-source with double attention
  ○ Encoder for the machine-translated output:
  ○ Stacked decoder sub-layer:
  ○ Attention over the encoders:
  ○ Pointer generation with multiple sources:
  ○ Switching function: (one plausible form is sketched below)
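
As a hedged illustration (the exact parameterisation in the paper may differ), one plausible form of the double-attention copy distribution uses a switching probability λ_t, e.g. a sigmoid over the decoder state, to arbitrate between copying from the source and from the MT output:

    P(w) = p_{\mathrm{gen}} P_{\mathrm{vocab}}(w)
           + (1 - p_{\mathrm{gen}}) \Big[ \lambda_t \sum_{i:\, x^{\mathrm{src}}_i = w} a^{\mathrm{src}}_{t,i}
           + (1 - \lambda_t) \sum_{j:\, x^{\mathrm{mt}}_j = w} a^{\mathrm{mt}}_{t,j} \Big]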

Slide 11

4 Data and Settings: Data
● Abstractive summarisation: CNN/Daily Mail dataset [Hermann et al., 2015; Nallapati et al., 2016]
  ○ Online news articles (800 tokens on average) and their abstracts (60 tokens on average)
  ○ 300k for training, 13k for validation, 11k for test
  ○ BPE vocabulary of 32k; articles truncated to 400 subwords
● APE: two variants
  ○ English-German (En-De): WMT18/WMT19 APE shared task (IT domain)
    ■ Translated by an NMT system (45.8 BLEU)
    ■ 13k for training, 1k for validation, 1k for test
  ○ English-Latvian (En-Lv): life sciences domain [Specia et al., 2018]
    ■ Translated by an NMT system (38.4 BLEU)
    ■ 13k for training, 1k for validation, 1k for test

Slide 12

4 Data and Settings: Metrics and hyperparameters
● Evaluation metrics for APE
  ○ HTER [Snover et al., 2009] (formula below)
    ■ The minimum number of edits (substitution, insertion, deletion and shift) needed to turn the MT output into its post-edited version
    ■ 0.15 HTER for En-De, 0.29 HTER for En-Lv on average
  ○ BLEU [Papineni et al., 2002]
    ■ n-gram precision between MT hypotheses and post-edits
● Hyperparameters
  ○ Embeddings shared between encoder and decoder
  ○ Beam search of size 5 (summarisation), 10 (APE)
  ○ Adam optimizer
    ■ Initial learning rate of 0.044, 4k warmup steps (summarisation)
    ■ Initial learning rate of 0.050, 8k warmup steps (APE)
  ○ Batch size of 50
  ○ Early stopping: patience of 10 epochs based on ROUGE-L (summarisation) and BLEU (APE)
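
For reference, HTER is TER computed against the human post-edit, i.e. the minimum edit count normalised by the length of the post-edited reference:

    \mathrm{HTER} = \frac{\#\,\text{edits (substitutions + insertions + deletions + shifts)}}{\#\,\text{words in the post-edited reference}}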

Slide 13

5 Results: Summarisation
(Results table grouped by baseline type: pointer network, reinforcement learning, content selection.)

Slide 14

5 Results: Repetitive/novel n-grams
● Less effective for higher-order n-grams than See et al., 2017 (~0 for 4-grams)
● Generated more unique 4-grams than See et al., 2017 (~10%)

Slide 15

5 Results: Example of Summarisation

Slide 16

5 Results: APE
● BASE: the raw MT output
● PNT-TRG: copycat with attention over the MT output
● PNT-TRG+SRC: copycat with double attention
● Pre-training the models is mandatory
  ○ Pre-train on synthetic data created by randomly deleting, inserting, shifting and substituting words on the target side of a parallel corpus, matching the HTER level of the APE dataset (toy sketch below)
  ○ eSCAPE corpus (En-De)
  ○ Europarl and EMEA corpora (En-Lv)
  ○ Fine-tuning with a learning rate of 0.001
● The performance of copycat systems depends on the difficulty of the task
  ○ Minor improvement for German
● Major post-edits
  ○ Punctuation (e.g., deletion of a comma)
  ○ Auxiliary words (e.g., insertion/deletion of prepositions or articles)

eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
EMEA: European Medicines Agency
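
A toy sketch of the kind of synthetic noising described above (the exact operations, edit rates and sampling used in the paper are not given on the slide, so everything here is an assumption): random word-level edits are applied to the target side so that roughly a chosen fraction of words is modified; the noised sentence then plays the role of the MT output and the original target sentence is the post-edit.

    import random

    def add_noise(tokens, edit_rate=0.15, filler=("the", "a", "of", "to")):
        """Return a noised copy of `tokens`; edit_rate mimics the desired HTER level."""
        out = list(tokens)
        n_edits = max(1, int(edit_rate * len(out)))
        for _ in range(n_edits):
            op = random.choice(["delete", "insert", "substitute", "shift"])
            i = random.randrange(len(out))
            if op == "delete" and len(out) > 1:
                del out[i]
            elif op == "insert":
                out.insert(i, random.choice(filler))
            elif op == "substitute":
                out[i] = random.choice(filler)
            elif op == "shift" and len(out) > 1:
                j = random.randrange(len(out))
                out.insert(j, out.pop(i))
        return out

    print(add_noise("this is a small example sentence".split()))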

Slide 17

5 Results: Examples of APE

Slide 18

6 Conclusion