
BottleSum


Masato Umakoshi

October 07, 2021

  1. BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle. Peter West, Ari Holtzman, Jan Buys, Yejin Choi. Paper reading by Masato Umakoshi, Kyoto University
  2. Summary • Methods • Unsupervised summarization • Self-supervised summarization • Using the Information Bottleneck principle for summarization • Outcomes • Better results on automatic and human evaluation • Better results on domains where sentence-summary pairs are not available
  3. What is the task? • Sentence summarization (compression) • Given a sentence, summarize it into a shorter one • Assume that a good sentence summary contains the information relevant to the broader context while discarding less significant details
  4. How to solve it? • Unsupervised methods • Why unsupervised? • Sentence-summary pairs are not always available • Current unsupervised methods use an autoencoder (AE) at their core • The source sentence should be accurately predictable from the summary • This goes against the fundamental goal of summarization, which crucially needs to forget all but the “relevant” information • Therefore, use the Information Bottleneck (IB) to discard irrelevant information
  5. Example • The next sentence is about control, so the summary should only contain information about control • However, a summary produced by an autoencoder necessarily refers to the population in order to restore that information
  6. Methods • BottleSum^Ex is based mainly on the Information Bottleneck (IB) principle • BottleSum^Ex • Extractive, unsupervised method • No training needed • Takes two consecutive sentences as input • BottleSum^Self • Abstractive, self-supervised method • Uses the output of BottleSum^Ex as training data • Takes a single sentence as input
  7. What is the Information Bottleneck? (1/2) • Given a source S, an external (relevance) variable Y, and a summary S̃ • Learn a conditional distribution p(S̃|S) minimizing: I(S̃; S) − β·I(S̃; Y) • β: coefficient balancing the two terms • I: mutual information between two variables • I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) • Given information about Y, how much does the ambiguity of X decrease? • See the appendix
  8. What is the Information Bottleneck? (2/2) • Minimizing: I(S̃; S) − β·I(S̃; Y)   (1) • Generate a good summary S̃ from the source S • First term: pruning term • If S̃ decreases the ambiguity of S, then I(S̃; S) increases • Ensures irrelevant information is discarded • Second term: relevance term • If S̃ decreases the ambiguity of Y, then −I(S̃; Y) decreases • Ensures S̃ and Y share information
  9. Why is IB better? • Suppose S contains some information Z that is irrelevant to Y • In IB: • Not containing Z is better • If Z is kept, I(S̃; S) increases while I(S̃; Y) is unaffected • In AE: • Containing Z is better • Z carries information about S, so keeping it decreases the reconstruction loss • Reconstruction loss: suppose S′ is reconstructed from S̃; the difference between S′ and S
  10. Implementing IB • Given a sentence s, use the next sentence s_next as the relevance variable Y • Use a deterministic function mapping the sentence s to a summary s̃ • Therefore p(s̃|s) = 1 • In this setting, minimizing (1) is equivalent to minimizing: −log p(s̃) − β₁·p(s_next|s̃)·p(s̃)·log p(s_next|s̃) • Use a pretrained language model to estimate these distributions • GPT-2 was used
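
A minimal sketch of how these two quantities might be estimated (an assumed setup, not the authors' released code): it scores text with GPT-2 through the Hugging Face transformers library, and the helper names log_p and log_p_next are hypothetical. Token boundaries at the concatenation point are handled only approximately.

```python
# Sketch: estimating log p(s~) and log p(s_next | s~) with a pretrained GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_p(text: str) -> float:
    """Total log-probability of `text` under GPT-2 (sum over predicted tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token.
    return -out.loss.item() * (ids.shape[1] - 1)

def log_p_next(next_sentence: str, summary: str) -> float:
    """Log-probability of the next sentence, conditioned on the summary."""
    ctx_len = tokenizer(summary, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(summary + " " + next_sentence, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :ctx_len] = -100        # do not score the conditioning prefix
    with torch.no_grad():
        out = model(full, labels=labels)
    n_scored = (labels != -100).sum().item()
    return -out.loss.item() * n_scored
```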
  11. Algorithm (1/8) • Implements the extractive method • Iteratively delete words or phrases from candidates, starting with the original sentence • At each elimination step, only consider candidate deletions that decrease the value of the pruning term • When expanding candidates, choose the few candidates with the highest relevance scores, to optimize the relevance term
  12. Algorithm (2/8) • Input: • s: sentence • s_next: context (the next sentence) • Hyperparameters: • m: the maximum number of words to delete at each step • k: the number of candidates kept in the search
  13. Algorithm (3/8) • E.g. • s = "Unsupervised methods use autoencoder as core of methods" • k = 1 • m = 3
  14. Algorithm (4/8) • s* = "Unsupervised methods use autoencoder as core of methods" • List the next candidates s** obtained by removing up to m words (l.7-l.9); here, deleting one word: "methods use autoencoder as core of methods", "Unsupervised use autoencoder as core of methods", "Unsupervised methods autoencoder as core of methods", …, "Unsupervised methods use autoencoder as core of"
  15. Algorithm (5/8) • s* = "Unsupervised methods use autoencoder as core of methods" • List the next candidates s** obtained by removing up to m words (l.7-l.9); here, deleting two consecutive words: "use autoencoder as core of methods", "Unsupervised autoencoder as core of methods", "Unsupervised methods as core of methods", …, "Unsupervised methods use autoencoder as core"
  16. Algorithm (6/8) • s* = "Unsupervised methods use autoencoder as core of methods" • List the next candidates s** obtained by removing up to m words (l.7-l.9); here, deleting three consecutive words: "autoencoder as core of methods", "Unsupervised as core of methods", "Unsupervised methods core of methods", …, "Unsupervised methods use autoencoder as"
  17. Algorithm (7/8) • Discard bad candidates (l.10-l.11) • For every candidate, estimate p(s**) • If p(s*) < p(s**), then keep s** as a candidate • This procedure corresponds to decreasing the value of the pruning term • Surviving candidates: "methods use autoencoder as core of methods", "Unsupervised use autoencoder as core of methods", "Unsupervised methods use autoencoder as core methods", "Unsupervised autoencoder as core of methods", "Unsupervised methods as core of methods", …
  18. Algorithm (8/8) • Choose the next s* from the candidates (l.4-l.5) • Sort the candidates by p(s_next|s*) in descending order • Choose the top k candidates as the next s* • This procedure corresponds to decreasing the value of the relevance term • Top candidates: "Unsupervised methods use autoencoder as core methods", "Unsupervised use autoencoder as core of methods", "methods use autoencoder as core of methods"
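
A minimal sketch of the candidate-deletion search walked through above (illustrative only, and simplified relative to the paper's actual algorithm, whose stopping and final-selection details differ): it assumes the hypothetical scoring helpers log_p and log_p_next from the earlier sketch, deletes contiguous spans of up to m words, filters by the pruning condition p(s**) > p(s*), and keeps the top k candidates by relevance.

```python
# Sketch of the extractive search loop (not the authors' implementation).
# Assumes log_p(text) and log_p_next(next_sentence, summary) as sketched earlier.

def deletion_candidates(words, m):
    """All word lists obtained by deleting one contiguous span of 1..m words."""
    out = []
    for span in range(1, m + 1):
        for i in range(len(words) - span + 1):
            out.append(words[:i] + words[i + span:])
    return out

def bottleneck_summarize(sentence, next_sentence, k=1, m=3):
    current = [sentence.split()]          # the k candidates kept at each step
    best = current[0]
    while current:
        survivors = []
        for cand in current:
            base = log_p(" ".join(cand))
            for shorter in deletion_candidates(cand, m):
                # Pruning term: keep a deletion only if it makes the summary
                # more probable under the language model, i.e. p(s**) > p(s*).
                if log_p(" ".join(shorter)) > base:
                    survivors.append(shorter)
        if not survivors:
            break
        # Relevance term: rank survivors by p(s_next | s**), keep the top k.
        survivors.sort(key=lambda c: log_p_next(next_sentence, " ".join(c)),
                       reverse=True)
        current = survivors[:k]
        best = current[0]
    return " ".join(best)
```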
  19. Note • It does not train anything at all • About β₁ • In this algorithm, both the pruning term and the relevance term are guaranteed to improve • Thus, the pruning term and the relevance term are never compared directly • Therefore, the choice of β₁ matters little
  20. Abstractive: BottleSum^Self (1/2) • Abstractive summarization method • Train a GPT-2 model for summarization • Self-supervised learning • Use BottleSum^Ex's outputs as training data • Aims • Remove the restriction to extractive summaries • Learn an explicit compression function that does not require a next sentence
  21. Abstractive: BottleSum^Self (2/2) • Fine-tune a GPT-2 model for summarization • Training the language model • Input: [ sentence + "TL;DR:" + summary ] • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR: Hong Kong was once under British Rule. • Generating a summary • Input: [ sentence + "TL;DR:" ] • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR:
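
A small sketch of how the "TL;DR:" format above might be wired up with the transformers API (an assumed setup, not the authors' code; the fine-tuning loop itself is omitted, and the model here stands in for the GPT-2 checkpoint after self-supervised fine-tuning):

```python
# Sketch: building the "TL;DR:" training string and generating a summary.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # fine-tuned checkpoint in practice

def training_example(sentence: str, extractive_summary: str) -> str:
    # Target string for ordinary language-model fine-tuning on BottleSumEx outputs.
    return f"{sentence} TL;DR: {extractive_summary}"

def summarize(sentence: str, max_new_tokens: int = 20) -> str:
    ids = tokenizer(f"{sentence} TL;DR:", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

print(summarize("Hong Kong, a bustling metropolis with a population over "
                "7 million, was once under British Rule."))
```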
  22. Unsupervised Models • BottleSum^Ex, BottleSum^Self • Use k = 1, m = 3 • Recon^Ex • Follows BottleSum^Ex, but replaces the next sentence with the source sentence itself • Probes the role of the next sentence • SEQ³ • Trained with an autoencoding objective paired with a topic loss and a language-model prior loss • Previously the highest unsupervised result on DUC • PREFIX • The first 75 bytes of the source sentence • INPUT • The full input sentence
  23. Supervised Models • ABS • Supervised SOTA result on the DUC-2003 dataset • Li et al. • Supervised SOTA result on the DUC-2004 dataset
  24. Dataset & Evaluation Method • Evaluate models on three datasets • DUC-2003 and DUC-2004 datasets • Automatic ROUGE metrics • Sentence-summary pairs are available • CNN corpus • Summaries are not available • Human evaluation • Compare two summaries from different models on 3 attributes: coherence, conciseness, and agreement with the input
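
As an aside, a hedged example of how ROUGE-1/2/L scores might be computed with the rouge_score Python package (the package choice and the reference/candidate strings are assumptions for illustration, not the paper's exact evaluation setup):

```python
# Illustrative ROUGE computation with the `rouge_score` package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Hong Kong was once under British rule."
candidate = "Hong Kong was under British rule."
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")
```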
  25. DUC-2003, DUC-2004 • BottleSum^Ex achieves the highest R-1 and R-L scores among unsupervised methods on both datasets • BottleSum^Self achieves the second-highest scores • R-2 scores for BottleSum^Ex are lower than the baselines • Possibly due to a lack of fluency
  26. CNN corpus • Use the CNN corpus • Only sentences are available • SEQ³ is trained on the CNN corpus • ABS is not retrained • The model is used as originally trained on the Gigaword sentence dataset • Attribute scores are the average of three annotators' scores • Attribute scores are averaged on a scale of 1 (better), 0 (equal), and -1 (worse)
  27. CNN corpus • BottleSum^Ex and BottleSum^Self show stronger performance • The BottleSum^Self score is better than that of BottleSum^Ex • A combination of abstractiveness and learning a cohesive underlying model of summarization yields summaries that humans find more favorable
  28. Model Comparison • ABS requires learning on a large supervised training set • Poor out-of-domain performance • SEQ³ is unsupervised, but still needs extensive training on a large corpus of in-domain text • BottleSum^Ex requires neither
  29. Conclusion • Methods • Unsupervised summarization • Self-supervised summarization • Using the Information Bottleneck principle for summarization • Outcomes • Better results on automatic and human evaluation • Better results on domains where sentence-summary pairs are not available
  30. Definition of mutual information • Event E, probability P(E) • Information: −log P(E) (generally, base 2) • Entropy: H(P) = −Σ_{E∈Ω} P(E) log P(E) • E.g. P(sunny) = 0.5, P(rain) = 0.5 • H(P) = −P(sunny) log P(sunny) − P(rain) log P(rain) = −0.5·(−1) − 0.5·(−1) = 1 • E.g. P(sunny) = 0.9, P(rain) = 0.1 • H(P) = −P(sunny) log P(sunny) − P(rain) log P(rain) ≈ −0.9·(−0.15) − 0.1·(−3.32) ≈ 0.47
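
These two entropy values can be checked with a few lines of Python (base-2 logarithm, as above):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```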
  31. Definition of mutual information • Conditional entropy • H(X|y) = −Σ_{x∈X} P(x|y) log P(x|y) • H(X|Y) = Σ_y P(y) H(X|y) = ⋯ = −Σ_{x∈X, y∈Y} P(x, y) log P(x|y) • Mutual information • I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) • Given information about Y, how much does the ambiguity of X decrease? • E.g. if P(X) = P(X|Y) • then H(X) = H(X|Y), so I(X; Y) = 0
  32. Definition of mutual information • E.g. X = weather, Y = sky observation • P(contrails) = 0.2, P(bluesky) = 0.8 • P(sunny|contrails) = 0.2, P(rain|contrails) = 0.8 • P(sunny|bluesky) = 0.8, P(rain|bluesky) = 0.2 • These imply P(sunny) = 0.2·0.2 + 0.8·0.8 = 0.68 and P(rain) = 0.32 → I(X; Y) = H(X) − H(X|Y) = −Σ_x P(x) log P(x) + Σ_{x,y} P(x, y) log P(x|y) = (0.68·0.56 + 0.32·1.64) + (0.04·(−2.32) + 0.16·(−0.32) + 0.64·(−0.32) + 0.16·(−2.32)) ≈ 0.90 − 0.72 ≈ 0.18
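
The worked example can be verified numerically; the snippet below builds the joint distribution from P(y) and P(x|y), which fixes the marginal P(sunny) = 0.68, and recovers I(X; Y) ≈ 0.18 bits:

```python
import math

# Joint distribution implied by P(y) and P(x|y) from the example above.
p_y = {"contrails": 0.2, "bluesky": 0.8}
p_x_given_y = {"contrails": {"sunny": 0.2, "rain": 0.8},
               "bluesky":   {"sunny": 0.8, "rain": 0.2}}

p_xy = {(x, y): p_y[y] * p_x_given_y[y][x]
        for y in p_y for x in ("sunny", "rain")}
p_x = {x: sum(p_xy[(x, y)] for y in p_y) for x in ("sunny", "rain")}

h_x = -sum(p * math.log2(p) for p in p_x.values())
h_x_given_y = -sum(p_xy[(x, y)] * math.log2(p_x_given_y[y][x]) for (x, y) in p_xy)

print(p_x)               # {'sunny': 0.68, 'rain': 0.32}
print(h_x - h_x_given_y) # ~0.18 bits
```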