
Beyond Sentences and Paragraphs: Towards Document-level and Multi-document Understanding

wing.nus
December 30, 2021


Seminar page: https://wing-nus.github.io/ir-seminar/speaker-arman
YouTube Video recording: https://www.youtube.com/watch?v=IMRgFS7GKB0

Transcript

  1. Motivation
     • Significant progress on token- and sentence-level tasks in NLP
     • Many practical NLP problems require full document understanding
     • Some require making connections between documents
     • Advances in full document processing open up new approaches and application areas (inside and outside NLP)
  2. Key challenges for document-level NLP
     • Annotating full documents or multiple documents can be difficult
     • End tasks often require combining information spread over long distances; models need to ignore a lot of irrelevant text
     • Many popular algorithms are designed for the short-sequence setting and have limitations on long inputs:
       ◦ RNN/LSTM: process input sequentially → slow for long sequences
       ◦ Transformers: self-attention is O(L²) → cannot process long input with current hardware; many pre-trained LMs are limited to 512 tokens (e.g. BERT)
     • Models for multi-document tasks are often task-specific and complex
  3. This talk
     • How to learn good document-level representations ➝ SPECTER
     • How to extend Transformers for long sequences ➝ Longformer
     • General language models for multi-document processing ➝ CDLM (encoder) and PRIMER (encoder-decoder)
     • Benchmarks (SciFact, TLDR, QASPER)
  4. Outline, part 1: Document-level representation learning
     SPECTER: Document-level Representation Learning using Citation-informed Transformers (ACL 2020)
     Arman Cohan*, Sergey Feldman*, Iz Beltagy, Doug Downey, Daniel S. Weld
  5. Document representation learning
     • Good document representations transfer well to downstream tasks
     • Goal: a task-independent representation that is easy to use in downstream tasks
     (Figure: SPECTER embeddings feed task-specific models for recommendation, classification, and search)
  6. SPECTER idea
     • SPECTER: Scientific Paper Embeddings using Citation-informed TransformERs
     • Documents connected by citation links are more similar than random documents
       ◦ Their vector representations should be closer
     • Contrastive learning using the citation graph
     (Figure: an example citation graph with nodes A-F)
  7. Baseline model: how to represent a document?
     • Encode the document Q with a Transformer and use the final [CLS] representation as the document embedding f(Q)
     (Figure: example input "[CLS] The traffic light was red.")
  8. Implementation using the citation network
     (Figure: the citation graph with nodes A-F, showing training triplets built with easy negatives vs. hard negatives; a code sketch of the triplet objective follows)
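     Code sketch: a minimal illustration of the contrastive (triplet) objective above, assuming a Hugging Face encoder checkpoint (SciBERT here) and a single hand-built triplet; in the real pipeline the triplets are mined from the citation graph.

         import torch
         from transformers import AutoModel, AutoTokenizer

         tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
         encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

         def embed(title_and_abstract: str) -> torch.Tensor:
             # The final [CLS] vector of the title + abstract is the paper embedding f(Q).
             inputs = tokenizer(title_and_abstract, truncation=True, max_length=512, return_tensors="pt")
             return encoder(**inputs).last_hidden_state[:, 0]

         query = embed("Paper Q: title and abstract ...")
         positive = embed("A paper cited by Q ...")                 # linked in the citation graph
         negative = embed("A random or hard negative paper ...")    # not cited by Q

         # Pull cited papers closer, push negatives away (triplet margin loss).
         loss = torch.nn.functional.triplet_margin_loss(query, positive, negative, margin=1.0)
         loss.backward()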
  9. Hard negatives
                             Classification   User activity   Citation   Recomm.   Avg.
     SPECTER                 84.2             88.4            91.5       36.9      80.0
     No hard negatives       82.4             85.8            89.8       36.8      78.4
  10. SPECTER recap
     • Improved document representations trained using the citation network
     • Improved zero-shot performance on a variety of downstream tasks
     • Currently in use at Semantic Scholar and some other academic search engines
  11. Outline, part 2: Long document understanding
     Longformer: The Long Document Transformer (2020)
     Iz Beltagy*, Matthew E. Peters*, Arman Cohan*
     • Language data consists of many long documents
     • Existing Transformers (BERT, T5) process up to 512 tokens
  12. Limitation of Transformers
     (Figure: a stack of encoder layers 1-24, each built from self-attention and feed-forward sublayers, mapping the input through FF + softmax to the output)
  13. Limitation of Transformers
     • Self-attention is expensive
       ◦ Quadratic complexity in sequence length: O(n²)
       ◦ Input length is therefore limited (see the sketch below)
     Fig from: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC
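     Code sketch: a back-of-the-envelope illustration of why the O(n²) score matrix limits input length (the numbers assume 12 heads and fp32 activations).

         def attention_scores_gib(seq_len: int, num_heads: int = 12, bytes_per_float: int = 4) -> float:
             # One (seq_len x seq_len) score matrix per head, per layer.
             return seq_len * seq_len * num_heads * bytes_per_float / 2**30

         for n in (512, 4096, 16384):
             print(f"n={n:>6}: ~{attention_scores_gib(n):5.2f} GiB of attention scores per layer")
         # 512 tokens need ~0.01 GiB, 4K tokens ~0.75 GiB, 16K tokens ~12 GiB, before the backward pass.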
  14. Prior work (scaling to long documents)
     • Chunk, extract, combine
     • Compressing memory in Transformers: Transformer-XL (2019), Adaptive Span (2019), Compressive (2019)
     • Optimizing the self-attention computation: Reformer (2020), Sparse Transformer (2019), BP-Transformer (2020), Linformer (2020)
     • Limitation 1: only intrinsic evaluation on language modeling tasks
     • Limitation 2: no pretrained general model that can be used for downstream tasks
  15. Proposed method: Longformer
     • Efficient Transformer model for long documents
     • Allows processing documents up to 16K tokens long
     • General model that can be applied to various document-level NLP tasks
  16. Longformer: sparse attention matrix
     • Avoid computing the full attention matrix
     • Tokens attend to each other following an "attention pattern" (see the sketch below)
     • Stacked layers give a large receptive field
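     Code sketch: a minimal sliding-window attention pattern, assuming a window of w tokens around each position, so only O(n*w) score entries are kept instead of O(n²).

         import torch

         def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
             # mask[i, j] is True iff token j lies within window // 2 positions of token i.
             idx = torch.arange(seq_len)
             return (idx[None, :] - idx[:, None]).abs() <= window // 2

         scores = torch.randn(8, 8)                                       # toy full attention scores
         scores = scores.masked_fill(~sliding_window_mask(8, window=4), float("-inf"))
         attn = scores.softmax(dim=-1)                                    # banded attention weights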
  17. Longformer: local and global attention
     • Indirect information flow through stacked local windows is not always sufficient; some tasks need direct attention
     • QA: [cls] [q1 q2 q3] [sep] [t1 t2 t3 t4 t5 t6 t7 ...], with global attention on the question tokens and local attention windows over the document
     • Classification: [cls] [t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 ...], with global attention on [cls] and local attention windows over the document
  18. Longformer: local and global attention
     • Global attention on a few locations of interest
     • Global attention is user-defined based on the task
     • Separate projection weights for global attention
     • Local attention captures local context
  19. Implementation
     • Requires banded matrix multiplication, which is not supported in existing DL libraries
     • Implementation 1: sliding chunks: split into chunks, then mask out the extra elements (see the sketch below)
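     Code sketch: a simplified version of the chunking idea, assuming non-overlapping blocks; the actual implementation uses overlapping chunks of size 2w plus masking to recover the exact band.

         import torch

         def blockwise_scores(q: torch.Tensor, k: torch.Tensor, chunk: int) -> torch.Tensor:
             # q, k: (seq_len, dim); seq_len must be divisible by chunk.
             seq_len, dim = q.shape
             q_chunks = q.view(seq_len // chunk, chunk, dim)
             k_chunks = k.view(seq_len // chunk, chunk, dim)
             # Shape (num_chunks, chunk, chunk): only local score blocks are ever materialized.
             return torch.bmm(q_chunks, k_chunks.transpose(1, 2)) / dim ** 0.5

         scores = blockwise_scores(torch.randn(4096, 64), torch.randn(4096, 64), chunk=512)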
  20. Implementation
     • Requires banded matrix multiplication, which is not supported in existing DL libraries
     • Implementation 2: custom CUDA kernel
       ◦ Supports dilation
       ◦ Only computes the non-zero elements (memory efficient)
       ◦ Slightly harder to use (requires the custom kernel)
  21. Intrinsic evaluation: character-level LM (small models)
     • Longformer is trained on sequences 32K tokens long
     (Figure: test BPC, lower is better, on text8 and enwik8, comparing T12 (2018), Transformer-XL (2019), Reformer (2020), Adaptive Span (2019), BP-Transformer (2019/2020), and Longformer)
  22. Intrinsic evaluation: character-level LM (large models)
     (Figure: enwik8 test BPC for large models: Transformer-XL (88M params), Sparse Transformer (100M), Transformer-XL 24 layers (277M), Adaptive Span (209M), Compressive Transformer (277M), and Longformer (102M))
  23. Longformer in the transfer learning setting
     • Goal: a BERT-like model for long-document NLP tasks
     • Pretrain with self-supervision, then fine-tune on the end task
     • Naïve pre-training is expensive
     (Figure: Longformer + unlabeled data → self-supervised learning → pre-trained Longformer → long-document tasks: QA, classification, summarization)
  24. Longformer pre-training
     • Careful initialization to avoid the high cost of pretraining (see the sketch below):
       ◦ Start from RoBERTa weights (which support 512 tokens)
       ◦ Increase the size of the position embedding matrix and initialize it by repeatedly copying the first 512 learned positions
       ◦ Continue masked LM (MLM) pre-training on a corpus of long documents
       ◦ Copy the Q, K, V linear projections to get separate projections for global attention
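     Code sketch: the position-embedding initialization described above, assuming the Hugging Face RoBERTa implementation (whose position embedding matrix holds 512 learned positions plus 2 special rows); the matching updates to the position_ids buffer and the attention pattern are omitted.

         import torch
         from transformers import RobertaModel

         model = RobertaModel.from_pretrained("roberta-base")
         old = model.embeddings.position_embeddings.weight.data       # shape (514, 768)
         new_len = 4096 + 2
         new_embed = torch.nn.Embedding(new_len, old.size(1))
         new_embed.weight.data[:2] = old[:2]                          # keep the two special rows
         step = old.size(0) - 2                                       # the 512 learned positions
         for start in range(2, new_len, step):
             end = min(start + step, new_len)
             new_embed.weight.data[start:end] = old[2:2 + end - start]  # tile the learned positions
         model.embeddings.position_embeddings = new_embed
         model.config.max_position_embeddings = new_len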
  25. Experiments on downstream tasks
     • HotpotQA: multihop reasoning and evidence extraction
     • TriviaQA: QA dataset from long Wikipedia articles
     • WikiHop: QA requiring combining facts spread across multiple paragraphs
     • Coref: coreference chains across the document
     • IMDB: document classification
     • Hyperpartisan news: document classification
  26. Finetuning on downstream tasks: global attention (usage sketch below)
     • Classification: on the [CLS] token
     • QA: on all question tokens
     • Coref: local attention only
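     Code sketch: setting the global attention mask with the released checkpoint on the Hugging Face hub; local attention is used everywhere else.

         import torch
         from transformers import LongformerModel, LongformerTokenizer

         tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
         model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

         question, document = "Who proposed Longformer?", "long document text ... " * 500
         inputs = tokenizer(question, document, truncation=True, max_length=4096, return_tensors="pt")

         global_mask = torch.zeros_like(inputs["input_ids"])
         global_mask[:, 0] = 1                                            # classification: [CLS] only
         first_sep = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0].item()
         global_mask[:, :first_sep] = 1                                   # QA: also all question tokens

         outputs = model(**inputs, global_attention_mask=global_mask)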
  27. Finetuning on downstream tasks (large model)
     • SOTA on WikiHop and TriviaQA
     • Competitive results on HotpotQA
       ◦ Longformer is simpler; the better-performing models use GNNs over entities
  28. Encoder-decoder setting
     • Long-document seq2seq tasks, e.g. long-document summarization
     • LED: Longformer Encoder-Decoder (usage sketch below)
     (Figure: encoder layers use Longformer self-attention over the document; decoder layers use full self-attention; FF + softmax produces the summary)
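     Code sketch: generating a summary with the released allenai/led-large-16384 checkpoint, with global attention on the first token.

         import torch
         from transformers import LEDForConditionalGeneration, LEDTokenizer

         tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384")
         model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

         document = "full text of a long paper ..."
         inputs = tokenizer(document, truncation=True, max_length=16384, return_tensors="pt")
         global_mask = torch.zeros_like(inputs["input_ids"])
         global_mask[:, 0] = 1                                            # global attention on the first token

         summary_ids = model.generate(
             inputs["input_ids"],
             attention_mask=inputs["attention_mask"],
             global_attention_mask=global_mask,
             num_beams=4,
             max_length=256,
         )
         print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))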
  29. Summarization results
     • arXiv dataset: generate the abstract given the full paper
                                  R-2     R-L
     Discourse-aware (2018)       11.05   31.80
     Extr-Abst-TLM (2020)         14.69   38.03
     Dancer (2020)                16.54   38.44
     Pegasus (2019)               16.95   38.83
     BigBird-4K (2020)            19.02   41.77
     LED-large-4K (ours)          17.94   39.76
     LED-large-16K (ours)         19.62   41.83
  30. Summary: Longformer
     • An efficient Transformer model for processing very long documents
     • State-of-the-art results on document-level classification, summarization, and QA tasks
     • Widely used by the community
  31. Outline, part 3: Beyond single documents: multi-document tasks
     Cross-Document Language Modeling (Findings of EMNLP 2021)
     Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, Ido Dagan
     • Existing works often use complex task-specific architectures for multi-document tasks
     • They do not leverage the pre-training / fine-tuning paradigm
     • CDLM: Cross-Document Language Modeling
       ◦ A pre-training method that exploits shared information between documents
       ◦ A general, easy-to-use model for multi-document tasks
  32. CDLM: Cross-document language modeling
     • A pre-training approach to implicitly encode cross-document relationships
     • Uses a cluster of related documents when pre-training Longformer
     • Example cluster:
       ◦ Doc 1: "...Harry Shearer is suing Vivendi's Universal Music for $125 million for allegedly fraudulent ..."
       ◦ Doc 2: "...Harry Shearer alleges parent company of Universal Music and Studio Canal withheld millions ..."
       ◦ Doc 3: "In October 2016, Shearer's Century of Progress Productions sued UMG, StudioCanal and their parent company, Vivendi, over unpaid royalties."
  33. Longformer for multi-document tasks
     (Figure: the question and Doc 1 through Doc 4 are concatenated into a single long input, fed through Longformer, which produces the answer)
  34. CDLM: Cross-document language modeling (see the sketch below)
     (Figure: a cluster of related documents is concatenated with <doc-sep> separators, e.g. "Harry Shearer is suing Vivendi's <doc-sep> Harry Shearer [MASK] parent company <doc-sep>", and fed to Longformer; the masked token gets global attention while the remaining tokens use local attention, and the model predicts the masked word "alleges" using evidence from the other document)
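     Code sketch: assembling a cross-document input in the spirit of CDLM, starting from the public Longformer checkpoint and adding a <doc-sep> special token by hand; the checkpoint and token handling here are assumptions, not the released CDLM model.

         import torch
         from transformers import LongformerForMaskedLM, LongformerTokenizer

         tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
         model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
         tokenizer.add_special_tokens({"additional_special_tokens": ["<doc-sep>"]})
         model.resize_token_embeddings(len(tokenizer))

         docs = [
             "Harry Shearer is suing Vivendi's Universal Music ...",
             "Harry Shearer <mask> parent company of Universal Music ...",
         ]
         inputs = tokenizer(" <doc-sep> ".join(docs), return_tensors="pt")

         # Global attention on masked tokens and document separators; local attention elsewhere.
         doc_sep_id = tokenizer.convert_tokens_to_ids("<doc-sep>")
         global_mask = ((inputs["input_ids"] == tokenizer.mask_token_id) |
                        (inputs["input_ids"] == doc_sep_id)).long()
         outputs = model(**inputs, global_attention_mask=global_mask)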
  35. Cross-Document Language Modeling (CDLM)
     • Start from Longformer
     • Continue pre-training on the Multi-News corpus (Fabbri et al., 2019)
       ◦ 45K document clusters (each with 2 to 10 documents)
       ◦ Train for 25K steps
  36. CDLM: Finetuning
     • Concatenate the documents and process them through CDLM
       ◦ Task-specific finetuning, similar to Longformer
  37. Intrinsic evaluation: cross-document perplexity (MLM PPL, lower is better)
     Longformer                3.94
     CDLM (local attention)    3.84
     CDLM (random docs)        3.81
     CDLM (prefix attention)   3.41
     CDLM                      3.39
  38. Results: Cross-document coreference resolution
     (Figure: CoNLL F1 bar charts comparing CDLM with Barhom et al. (2019), Meged et al. (2020), Cattan et al. (2020), Zeng et al. (2020), and Yu et al. (2020) on entity coreference, and with Barhom et al. (2019) and Cattan et al. (2020) on event coreference)
  39. Summary: CDLM
     • Pre-training on multiple related documents improves multi-document performance
     • A general Transformer model for multi-document tasks
       ◦ Encoder-only
       ◦ Eliminates the need for task-specific multi-document modeling
  40. Outline, part 4: Beyond single documents: multi-document generation tasks
     PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2021)
     Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan
     • How can we extend Transformers to multi-document tasks that involve generation?
     • We focus on multi-document summarization
  41. Previous methods
     • Dataset-specific models (graph-based, hierarchical): require domain-specific additional information and use customized architectures, so it is hard to leverage pre-trained LMs
     • Pre-trained generation models:
       ◦ General-purpose (BART, T5, ...): require a large amount of data to fine-tune
       ◦ Task-specific (PEGASUS): mainly focus on single-document input
     • PRIMER: mainly for multi-document summarization
  42. Overview of PRIMER
     • Architecture: Longformer Encoder-Decoder (LED)
     • Input structure:
       ◦ Documents separated with a document separator token (<doc-sep>)
       ◦ Global attention on <doc-sep> tokens
  43. Overview of PRIMER
     • Architecture: Longformer (with local and global attention)
     • Input structure:
       ◦ Documents separated with a document separator token (<doc-sep>)
       ◦ Global attention on <doc-sep> tokens
     • Pre-training:
       ◦ Goal: teach the model to identify and aggregate salient information across a "cluster" of related documents
       ◦ Multi-document corpus: NewSHead (360K clusters, 3.5 documents per cluster on average)
       ◦ Objective: Gap Sentence Generation
       ◦ Strategy: Entity Pyramid
  44. Pre-training objective
     • Motivation: stay close to the multi-document summarization task
     • Gap Sentence Generation [Zhang et al., 2019]:
       ◦ Select several SALIENT sentences from the input documents (as a pseudo-summary)
       ◦ Mask out the selected sentences
       ◦ Recover them in order in the decoder
  45. How to select SALIENT sentences?
     Previous work for single-document input (PEGASUS):
     • Random
     • Lead-K
     • Principle (Best): for each sentence, compute the ROUGE score between that sentence and the rest of the document, then select the top-scoring sentences greedily (see the sketch below)
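     Code sketch: a simplified version of the Principle selection, with a unigram-overlap stand-in for ROUGE so the example stays self-contained; each sentence is scored against the rest of the document and the top-k are chosen for masking.

         def unigram_recall(sentence: str, rest: list[str]) -> float:
             # Crude ROUGE-1-recall proxy: fraction of the sentence's words that appear in the rest of the document.
             sent = set(sentence.lower().split())
             rest_words = set(" ".join(rest).lower().split())
             return len(sent & rest_words) / max(len(sent), 1)

         def select_gap_sentences(sentences: list[str], num_masked: int) -> list[int]:
             scored = [(unigram_recall(s, sentences[:i] + sentences[i + 1:]), i)
                       for i, s in enumerate(sentences)]
             top = sorted(scored, reverse=True)[:num_masked]
             return sorted(i for _, i in top)                     # indices of the sentences to mask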
  46. How to select SALIENT sentences?
     • However, multi-document input tends to be more redundant
     • Such a strategy prefers exact matches between sentences, resulting in the selection of less representative information
  47. New masking strategy: Entity Pyramid
     • Select sentences that best represent the entire cluster of input documents
     • Based on the Pyramid evaluation framework (Nenkova and Passonneau, 2004)
     • Goal: encourage the model to generate missing information using the other documents in the input
  48. Recall: Pyramid evaluation with multiple references (Nenkova and Passonneau, 2004)
     Given: multiple gold human-written summaries (references) for each input.
     Human annotators:
     • Annotate Summary Content Units (SCUs): phrases or clauses
     • Weight each SCU by the number of references it appears in
     • For each candidate summary, label which SCUs the summary matches
     • The score is the sum of the weights of the matched SCUs
  49. Recall: Pyramid evaluation (with multiple references)
     • REF #1: In 1998 two Libyans indicted in 1991 for the Lockerbie bombing were still in Libya.
     • REF #2: Two Libyans were indicted in 1991 for blowing up a Pan Am jumbo jet over Lockerbie, Scotland in 1988.
     • REF #3: Two Libyans, accused by the United States and Britain of bombing a New York bound Pan Am jet over Lockerbie, Scotland in 1988, killing 270 people, for 10 years were harbored by Libya who claimed the suspects could not get a fair trial in America or Britain.
     • REF #4: Two Libyan suspects were indicted in 1991.
  50. Recall: Pyramid evaluation (with multiple references), same references as above
     • SCU #1 (w=4): two Libyans were officially accused
  51. Recall: Pyramid evaluation (with multiple references), same references as above
     • SCU #2 (w=3): the indictment of the two Lockerbie suspects was in 1991
  52. Recall: Pyramid evaluation with multiple references (Nenkova and Passonneau, 2004)
     • The relative importance of a fact to be conveyed is quantified by the number of references it appears in
     • SCU #1: w=4, SCU #2: w=3, SCU #3: ..., forming a "pyramid" of weights (worked example below)
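     Code sketch: a tiny worked example of the pyramid score, using the two SCUs from the slides plus one invented SCU with weight 2; a candidate's score is the sum of the weights of the SCUs it covers, normalized by the best achievable sum for the same number of SCUs.

         scu_weights = {
             "two Libyans were officially accused": 4,                 # SCU #1 from the slides
             "the indictment was in 1991": 3,                          # SCU #2 from the slides
             "the jet was blown up over Lockerbie, Scotland": 2,       # invented for illustration
         }
         candidate_scus = {"two Libyans were officially accused", "the indictment was in 1991"}

         score = sum(scu_weights[s] for s in candidate_scus)                            # 4 + 3 = 7
         best = sum(sorted(scu_weights.values(), reverse=True)[:len(candidate_scus)])   # also 7
         print(score / best)                                                            # 1.0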
  53. New masking strategy: Entity Pyramid
     • Idea: the relative importance of facts in the input documents can be quantified in the same way; the more documents a fact appears in, the more important it is in the cluster
     • We use entities to identify the facts, as a proxy for human-labeled SCUs
  54. New masking strategy: Entity Pyramid (see the sketch below)
     • Step 1: Entity extraction
     • Step 2: Build the Entity Pyramid
     • Step 3: Select the most representative sentence for each entity
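     Code sketch: the three Entity Pyramid steps, assuming spaCy for entity extraction and a crude "longest mentioning sentence" stand-in for the ROUGE-based choice of the most representative sentence.

         from collections import Counter
         import spacy

         nlp = spacy.load("en_core_web_sm")

         def entity_pyramid_sentences(documents: list[str]) -> list[str]:
             docs = [nlp(d) for d in documents]
             # Step 1: entity extraction.
             doc_entities = [{ent.text.lower() for ent in d.ents} for d in docs]
             # Step 2: build the pyramid by weighting each entity by its document frequency.
             pyramid = Counter(e for ents in doc_entities for e in ents)
             salient = [e for e, w in pyramid.most_common() if w > 1]     # entities in 2+ documents
             # Step 3: pick one representative sentence per salient entity.
             selected = []
             for entity in salient:
                 mentions = [s.text for d in docs for s in d.sents if entity in s.text.lower()]
                 if mentions:
                     selected.append(max(mentions, key=len))
             return selected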
  55. Ablation on pre-training strategies
     Compare with PEGASUS (2020):
     • Same architecture (LED-base)
     • Same input structure
     • Same pre-training objective
     • Zero-/few-shot setting
  56. Ablation on pre-training strategies
     Compare with the PEGASUS Principle strategy (2020):
     • Same architecture (LED-base)
     • Same input structure
     • Same pre-training objective
     • Zero-/few-shot setting
     • The Entity Pyramid strategy works better than the Principle strategy used in PEGASUS
  57. Experiments
     • Evaluation datasets:
       ◦ Multi-document summarization: Multi-News, Multi-XScience, WCEP, WikiSum, DUC 2004
       ◦ Single-document summarization: arXiv
     • Evaluation metric: ROUGE (R-1, R-2, and R-L); see the scoring sketch below
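     Code sketch: computing the reported metrics with Google's rouge-score package (pip install rouge-score); the example strings are placeholders.

         from rouge_score import rouge_scorer

         scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
         scores = scorer.score(
             target="two libyans were indicted in 1991 for the lockerbie bombing",
             prediction="two libyans were charged in 1991 over the lockerbie bombing",
         )
         print(scores["rouge2"].fmeasure)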
  58. Experiments
     • Settings:
       ◦ Zero-shot
       ◦ Few-shot: 10/100 training examples, 5 runs per model
       ◦ Fully supervised
     • Compared models:
       ◦ BART
       ◦ PEGASUS
       ◦ Longformer Encoder-Decoder (LED)
       ◦ Prior SOTA models (fully supervised only)
  59. Results: zero-shot
     • PEGASUS is also pre-trained for the summarization downstream task, so it performs better than the other two baselines
     • Our model outperforms all the other pre-trained models on most of the datasets (by up to 5 ROUGE points)
  60. Results: few-shot
     • Our model outperforms all the other pre-trained models on all the datasets
  61. Results: full fine-tuning
     • Our model achieves SOTA on several multi-document summarization datasets, as well as on a single-document summarization dataset
  62. Takeaway
     • PRIMER is a pre-trained model for multi-document summarization
     • It is pre-trained with a new strategy, Entity Pyramid
     • PRIMER reduces the need for dataset-specific architectures and large labeled datasets
     • PRIMER achieves SOTA on multiple datasets in the zero-shot, few-shot, and fully supervised settings
  63. New benchmarks
     • SciFact: scientific fact verification
     • TLDR: extreme summarization of scientific papers
     • QASPER: question answering over scientific papers
  64. SciFact: Motivation
     Fact or Fiction: Verifying Scientific Claims (EMNLP 2020)
     David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, Hannaneh Hajishirzi
     (Figure: the scientific claim "The coronavirus cannot thrive in warmer climates" is checked by a claim verification system against a research corpus; "...a 1°C increase in local temperature reduces transmission by 13%..." SUPPORTS the claim, while "...summer temperatures will not substantially limit pandemic growth..." REFUTES it)
  65. Generating and verifying claims
     (Figure: the citation sentence "The results of Weld et al. suggest that COVID-19 infection is mediated by ACE-2 receptors [45, 46]..." from a citing paper is rewritten as the claim "ACE-2 receptors are involved in coronavirus infection"; the cited papers 45 and 46 serve as evidence that SUPPORTS or REFUTES it)
     ✓ Scalable: rewrite existing citation sentences as claims
     ✓ Important findings, as judged by other scientists
     ✓ Evidence documents are specified by the citations
  66. Dataset statistics
     Domain                  Wikipedia       Biomedicine
     Claim type              Synthetic       Natural
     Number of claims        185k            1.4k
     Annotator agreement     0.68 (5-way)    0.75 (2-way)
  67. TLDR: Extreme summarization of scientific papers (Findings of EMNLP 2020)
     Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld
     TLDR: summarizing scientific papers in a single sentence. Examples:
     • Title: A Large-Scale, Automated Study of Language Surrounding Artificial Intelligence (Autumn Toney; cs.CL)
       Abstract: This work presents a large-scale analysis of artificial intelligence (AI) and machine learning (ML) references within news articles and scientific publications between 2011 and 2019. We implement word association measurements that automatically identify shifts in language co-occurring with AI/ML and quantify the strength of these word associations. Our results highlight the evolution of perceptions and definitions around AI/ML and detect emerging application areas, models, and systems (e.g., blockchain and cybersecurity). Recent small-scale, manual studies have explored AI/ML discourse within the general public, the policymaker community, and researcher community, but are limited in their scalability and longevity. Our methods provide new views into public perceptions and subject-area expert discussions of AI/ML and greatly exceed the explanative power of prior work.
       tl;dr: Uses word association measurements for analyses of AI-related references in news and scientific articles.
     • Title: LazyFormer: Self Attention with Lazy Update (Chengxuan Ying, Guolin Ke, Di He, Tie-Yan Liu; cs.CL, cs.AI)
       Abstract: Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer composes of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is only computed once in the first layer and then is reused in all upper layers. In this way, the cost of computation could be largely saved. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
       tl;dr: An efficient approach for Transformer self-attention that computes the self-attention once in the first layer and reuses it in upper layers.

  69. Data collection
     • Peer reviews include a summary of the paper in the first paragraph
     • Rewrite the information in that first paragraph as a TLDR
  70. QASPER: Question answering over scientific papers
     A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers (NAACL 2021)
     Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner
     • 5,049 questions, written and answered by NLP professionals / students
  71. Takeaways
     • Many practical NLP problems require full document understanding
     • Extending existing models to document-level and multi-document tasks:
       ◦ SPECTER: document-level representation learning
       ◦ Longformer: long-document Transformer
       ◦ CDLM: cross-document language model
       ◦ PRIMER: multi-document summarization model
     • New document-level benchmarks (SciFact, TLDR, QASPER)