
BottleSum


Masato Umakoshi

October 07, 2021

  1. BottleSum: Unsupervised and Self-supervised Sentence Summarization using the Information Bottleneck Principle. Peter West, Ari Holtzman, Jan Buys, Yejin Choi. Paper reading by Masato Umakoshi, Kyoto University
  2. Summary • Methods • Unsupervised summarization • Self-supervised summarization • Using the Information Bottleneck principle for summarization • Outcomes • Better results on automatic and human evaluation • Better results on domains where sentence-summary pairs are not available
  3. What is the task? • Sentence summarization (compression) • Given a sentence, summarize it into a shorter one • Assume that a good sentence summary contains the information relevant to the broader context while discarding less significant details
  4. How to solve it? • Unsupervised methods • Why unsupervised? • Sentence-summary pairs are not always available • Current unsupervised methods use an autoencoder (AE) at their core • The source sentence should be accurately predictable from the summary • This goes against the fundamental goal of summarization, which crucially needs to forget all but the “relevant” information • Therefore, use the Information Bottleneck (IB) to discard irrelevant information
  5. Example • The next sentence is about control, so the summary should only contain information about control • However, a summary produced by an autoencoder necessarily refers to the population in order to restore that information
  6. Methods • BottleSum^Ex is based mainly on the Information Bottleneck (IB) principle • BottleSum^Ex • Extractive, unsupervised method • No training needed • Takes two consecutive sentences as input • BottleSum^Self • Abstractive, self-supervised method • Uses the output of BottleSum^Ex as training data • Takes a single sentence as input
  7. What is the Information Bottleneck? (1/2) • Given a source S, an external (relevance) variable Y, and a summary S̃ • Learn a conditional distribution p(S̃|S) minimizing: I(S̃; S) − β·I(S̃; Y) • β: coefficient balancing the two terms • I: mutual information between two variables • I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) • Given information about Y, how much does the ambiguity of X decrease? • See the appendix
  8. What is the Information Bottleneck? (2/2) • Minimizing: I(S̃; S) − β·I(S̃; Y)   (1) • Generate a good summary S̃ from the source S • First term: pruning term • If S̃ decreases the ambiguity of S, then I(S̃; S) increases • Ensures irrelevant information is discarded • Second term: relevance term • If S̃ decreases the ambiguity of Y, then −I(S̃; Y) decreases • Ensures S̃ and Y share information
  9. Why is IB better? • Suppose S contains some information Z that is irrelevant to Y • In IB: • Not containing Z is better • If Z is kept, I(S̃; S) increases while I(S̃; Y) is unaffected • In AE: • Containing Z is better • Z carries information about S, so keeping it decreases the reconstruction loss • Reconstruction loss: suppose S′ is reconstructed from S̃; the difference between S′ and S
  10. Implementing IB • Given a sentence s, use the next sentence s_next as the relevance variable Y • Use a deterministic function mapping the sentence s to a summary s̃ • Therefore p(s̃|s) = 1 • In this setting, minimizing (1) is equivalent to minimizing: −log p(s̃) − β₁·p(s_next|s̃)·p(s̃)·log p(s_next|s̃) • Use a pretrained language model to estimate these distributions • GPT-2 was used
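
A minimal sketch of how these two quantities might be estimated (an assumed setup, not the authors' released code): it scores text with GPT-2 through the Hugging Face transformers library, and the helper names log_p and log_p_next are hypothetical. Token boundaries at the concatenation point are handled only approximately.

```python
# Sketch: estimating log p(s~) and log p(s_next | s~) with a pretrained GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def log_p(text: str) -> float:
    """Total log-probability of `text` under GPT-2 (sum over predicted tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token.
    return -out.loss.item() * (ids.shape[1] - 1)

def log_p_next(next_sentence: str, summary: str) -> float:
    """Log-probability of the next sentence, conditioned on the summary."""
    ctx_len = tokenizer(summary, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(summary + " " + next_sentence, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :ctx_len] = -100        # do not score the conditioning prefix
    with torch.no_grad():
        out = model(full, labels=labels)
    n_scored = (labels != -100).sum().item()
    return -out.loss.item() * n_scored
```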
  11. Algorithm (1/8) • Implements the extractive method • Iteratively delete words or phrases from candidates, starting with the original sentence • At each elimination step, only consider candidate deletions that decrease the value of the pruning term • When expanding candidates, choose the few candidates with the highest relevance scores, to optimize the relevance term
  12. Algorithm (2/8) • Input: • s: sentence • s_next: context (the next sentence) • Hyperparameters: • m: the maximum number of words to delete at each step • k: the number of candidates kept in the search
  13. Algorithm (3/8) • E.g. • s = "Unsupervised methods use autoencoder as core of methods" • k = 1 • m = 3
  14. Algorithm (4/8) • s* = "Unsupervised methods use autoencoder as core of methods" • List the next candidates s** obtained by removing up to m words (l.7-l.9); here, deleting one word: "methods use autoencoder as core of methods", "Unsupervised use autoencoder as core of methods", "Unsupervised methods autoencoder as core of methods", …, "Unsupervised methods use autoencoder as core of"
  15. Algorithm (5/8) • s* = "Unsupervised methods use autoencoder as core of methods" • List the next candidates s** obtained by removing up to m words (l.7-l.9); here, deleting two consecutive words: "use autoencoder as core of methods", "Unsupervised autoencoder as core of methods", "Unsupervised methods as core of methods", …, "Unsupervised methods use autoencoder as core"
  16. Algorithm (6/8) • s* = "Unsupervised methods use autoencoder as core of methods" • List the next candidates s** obtained by removing up to m words (l.7-l.9); here, deleting three consecutive words: "autoencoder as core of methods", "Unsupervised as core of methods", "Unsupervised methods core of methods", …, "Unsupervised methods use autoencoder as"
  17. Algorithm (7/8) • Discard bad candidates (l.10-l.11) • For every candidate, estimate p(s**) • If p(s*) < p(s**), then keep s** as a candidate • This procedure corresponds to decreasing the value of the pruning term • Surviving candidates: "methods use autoencoder as core of methods", "Unsupervised use autoencoder as core of methods", "Unsupervised methods use autoencoder as core methods", "Unsupervised autoencoder as core of methods", "Unsupervised methods as core of methods", …
  18. Algorithm (8/8) • Choose the next s* from the candidates (l.4-l.5) • Sort the candidates by p(s_next|s*) in descending order • Choose the top k candidates as the next s* • This procedure corresponds to decreasing the value of the relevance term • Top candidates: "Unsupervised methods use autoencoder as core methods", "Unsupervised use autoencoder as core of methods", "methods use autoencoder as core of methods"
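
A minimal sketch of the candidate-deletion search walked through above (illustrative only, and simplified relative to the paper's actual algorithm, whose stopping and final-selection details differ): it assumes the hypothetical scoring helpers log_p and log_p_next from the earlier sketch, deletes contiguous spans of up to m words, filters by the pruning condition p(s**) > p(s*), and keeps the top k candidates by relevance.

```python
# Sketch of the extractive search loop (not the authors' implementation).
# Assumes log_p(text) and log_p_next(next_sentence, summary) as sketched earlier.

def deletion_candidates(words, m):
    """All word lists obtained by deleting one contiguous span of 1..m words."""
    out = []
    for span in range(1, m + 1):
        for i in range(len(words) - span + 1):
            out.append(words[:i] + words[i + span:])
    return out

def bottleneck_summarize(sentence, next_sentence, k=1, m=3):
    current = [sentence.split()]          # the k candidates kept at each step
    best = current[0]
    while current:
        survivors = []
        for cand in current:
            base = log_p(" ".join(cand))
            for shorter in deletion_candidates(cand, m):
                # Pruning term: keep a deletion only if it makes the summary
                # more probable under the language model, i.e. p(s**) > p(s*).
                if log_p(" ".join(shorter)) > base:
                    survivors.append(shorter)
        if not survivors:
            break
        # Relevance term: rank survivors by p(s_next | s**), keep the top k.
        survivors.sort(key=lambda c: log_p_next(next_sentence, " ".join(c)),
                       reverse=True)
        current = survivors[:k]
        best = current[0]
    return " ".join(best)
```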
  19. Note • It does not train anything at all • About β₁ • In this algorithm, both the pruning term and the relevance term are guaranteed to improve • Thus, the pruning term and the relevance term are never compared directly • Therefore, the choice of β₁ matters little
  20. Abstractive: BottleSum^Self (1/2) • Abstractive summarization method • Train a GPT-2 model for summarization • Self-supervised learning • Use BottleSum^Ex's outputs as training data • Aims • Remove the restriction to extractive summaries • Learn an explicit compression function that does not require a next sentence
  21. Abstractive: BottleSum^Self (2/2) • Fine-tune a GPT-2 model for summarization • Training the language model • Input: [ sentence + "TL;DR:" + summary ] • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR: Hong Kong was once under British Rule. • Generating a summary • Input: [ sentence + "TL;DR:" ] • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR:
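
A small sketch of how the "TL;DR:" format above might be wired up with the transformers API (an assumed setup, not the authors' code; the fine-tuning loop itself is omitted, and the model here stands in for the GPT-2 checkpoint after self-supervised fine-tuning):

```python
# Sketch: building the "TL;DR:" training string and generating a summary.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # fine-tuned checkpoint in practice

def training_example(sentence: str, extractive_summary: str) -> str:
    # Target string for ordinary language-model fine-tuning on BottleSumEx outputs.
    return f"{sentence} TL;DR: {extractive_summary}"

def summarize(sentence: str, max_new_tokens: int = 20) -> str:
    ids = tokenizer(f"{sentence} TL;DR:", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

print(summarize("Hong Kong, a bustling metropolis with a population over "
                "7 million, was once under British Rule."))
```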
  22. Unsupervised Models • BottleSum^Ex, BottleSum^Self • Use k = 1, m = 3 • Recon^Ex • Follows BottleSum^Ex, but replaces the next sentence with the source sentence itself • Probes the role of the next sentence • SEQ³ • Trained with an autoencoding objective paired with a topic loss and a language-model prior loss • Previously the highest unsupervised result on DUC • PREFIX • The first 75 bytes of the source sentence • INPUT • The full input sentence
  23. Supervised Models • ABS • Supervised SOTA result on the DUC-2003 dataset • Li et al. • Supervised SOTA result on the DUC-2004 dataset
  24. Dataset & Evaluation Method • Evaluate models on three datasets • DUC-2003 and DUC-2004 datasets • Automatic ROUGE metrics • Sentence-summary pairs are available • CNN corpus • Summaries are not available • Human evaluation • Compare two summaries from different models on 3 attributes: coherence, conciseness, and agreement with the input
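
As an aside, a hedged example of how ROUGE-1/2/L scores might be computed with the rouge_score Python package (the package choice and the reference/candidate strings are assumptions for illustration, not the paper's exact evaluation setup):

```python
# Illustrative ROUGE computation with the `rouge_score` package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Hong Kong was once under British rule."
candidate = "Hong Kong was under British rule."
for name, s in scorer.score(reference, candidate).items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F={s.fmeasure:.3f}")
```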
  25. DUC-2003, DUC-2004 • BottleSum^Ex achieves the highest R-1 and R-L scores among unsupervised methods on both datasets • BottleSum^Self achieves the second-highest scores • R-2 scores for BottleSum^Ex are lower than the baselines • Possibly due to a lack of fluency
  26. CNN corpus • Use the CNN corpus • Only sentences are available • SEQ³ is trained on the CNN corpus • ABS is not retrained • The model is used as originally trained on the Gigaword sentence dataset • Attribute scores are the average of three annotators' scores • Attribute scores are averaged on a scale of 1 (better), 0 (equal), and -1 (worse)
  27. CNN corpus • BottleSum^Ex and BottleSum^Self show stronger performance • The BottleSum^Self score is better than that of BottleSum^Ex • A combination of abstractiveness and learning a cohesive underlying model of summarization yields summaries that humans find more favorable
  28. Model Comparison • ABS requires learning on a large supervised training set • Poor out-of-domain performance • SEQ³ is unsupervised, but still needs extensive training on a large corpus of in-domain text • BottleSum^Ex requires neither
  29. Conclusion • Methods • Unsupervised summarization • Self-supervised summarization • Using the Information Bottleneck principle for summarization • Outcomes • Better results on automatic and human evaluation • Better results on domains where sentence-summary pairs are not available
  30. Definition of mutual information • Event E, probability P(E) • Information: −log P(E) (generally, base 2) • Entropy: H(P) = −Σ_{E∈Ω} P(E) log P(E) • E.g. P(sunny) = 0.5, P(rain) = 0.5 • H(P) = −P(sunny) log P(sunny) − P(rain) log P(rain) = −0.5·(−1) − 0.5·(−1) = 1 • E.g. P(sunny) = 0.9, P(rain) = 0.1 • H(P) = −P(sunny) log P(sunny) − P(rain) log P(rain) ≈ −0.9·(−0.15) − 0.1·(−3.32) ≈ 0.47
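
These two entropy values can be checked with a few lines of Python (base-2 logarithm, as above):

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
```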
  31. Definition of mutual information • Conditional entropy • H(X|y) = −Σ_{x∈X} P(x|y) log P(x|y) • H(X|Y) = Σ_y P(y) H(X|y) = ⋯ = −Σ_{x∈X, y∈Y} P(x, y) log P(x|y) • Mutual information • I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) • Given information about Y, how much does the ambiguity of X decrease? • E.g. if P(X) = P(X|Y) • then H(X) = H(X|Y), so I(X; Y) = 0
  32. Definition of mutual information • E.g. X = weather, Y = sky observation • P(contrails) = 0.2, P(bluesky) = 0.8 • P(sunny|contrails) = 0.2, P(rain|contrails) = 0.8 • P(sunny|bluesky) = 0.8, P(rain|bluesky) = 0.2 • These imply P(sunny) = 0.2·0.2 + 0.8·0.8 = 0.68 and P(rain) = 0.32 → I(X; Y) = H(X) − H(X|Y) = −Σ_x P(x) log P(x) + Σ_{x,y} P(x, y) log P(x|y) = (0.68·0.56 + 0.32·1.64) + (0.04·(−2.32) + 0.16·(−0.32) + 0.64·(−0.32) + 0.16·(−2.32)) ≈ 0.90 − 0.72 ≈ 0.18
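
The worked example can be verified numerically; the snippet below builds the joint distribution from P(y) and P(x|y), which fixes the marginal P(sunny) = 0.68, and recovers I(X; Y) ≈ 0.18 bits:

```python
import math

# Joint distribution implied by P(y) and P(x|y) from the example above.
p_y = {"contrails": 0.2, "bluesky": 0.8}
p_x_given_y = {"contrails": {"sunny": 0.2, "rain": 0.8},
               "bluesky":   {"sunny": 0.8, "rain": 0.2}}

p_xy = {(x, y): p_y[y] * p_x_given_y[y][x]
        for y in p_y for x in ("sunny", "rain")}
p_x = {x: sum(p_xy[(x, y)] for y in p_y) for x in ("sunny", "rain")}

h_x = -sum(p * math.log2(p) for p in p_x.values())
h_x_given_y = -sum(p_xy[(x, y)] * math.log2(p_x_given_y[y][x]) for (x, y) in p_xy)

print(p_x)               # {'sunny': 0.68, 'rain': 0.32}
print(h_x - h_x_given_y) # ~0.18 bits
```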