
Beyond Sentences and Paragraphs: Towards Document-level and Multi-document Understanding

wing.nus
December 30, 2021


Seminar page: https://wing-nus.github.io/ir-seminar/speaker-arman
YouTube Video recording: https://www.youtube.com/watch?v=IMRgFS7GKB0

Transcript

  1. Motivation
     • Significant progress on token- and sentence-level tasks in NLP
     • Many practical NLP problems require full document understanding
     • Some require making connections between documents
     • Advances in full document processing open up new approaches and application areas (inside and outside NLP)
  2. Key challenges for document-level NLP
     • Annotating full documents or multiple documents can be difficult
     • End tasks often require combining information spread over long distances; models need to ignore a lot of irrelevant text
     • Many popular algorithms are designed for the short-sequence setting and have limitations on long inputs:
       ◦ RNN/LSTM: process input sequentially → slow for long sequences
       ◦ Transformers: self-attention is O(L²) → cannot process long input with current hardware; many pre-trained LMs are limited to 512 tokens (e.g. BERT)
     • Models for multi-document tasks are often task-specific and complex
  3. This talk
     • How to learn good document-level representations ➝ SPECTER
     • How to extend Transformers for long sequences ➝ Longformer
     • General language models for multi-document processing ➝ CDLM (encoder) and PRIMER (encoder-decoder)
     • Benchmarks (SciFact, TLDR, QASPER)
  4. Outline, part 1: Document-level representation learning
     SPECTER: Document-level Representation Learning using Citation-informed Transformers (ACL 2020)
     Arman Cohan*, Sergey Feldman*, Iz Beltagy, Doug Downey, Daniel S. Weld
  5. Document representation learning
     • Good document representations transfer well to downstream tasks
     • Goal: a task-independent representation that is easy to use in downstream tasks
     (Figure: SPECTER embeddings feed task-specific models for recommendation, classification, and search)
  6. SPECTER idea
     • SPECTER: Scientific Paper Embeddings using Citation-informed TransformERs
     • Documents connected by citation links are more similar than random documents
       ◦ Their vector representations should be closer
     • Contrastive learning using the citation graph
     (Figure: an example citation graph with nodes A-F)
  7. Baseline model: how to represent a document?
     • Encode the document Q with a Transformer and use the final [CLS] representation as the document embedding f(Q)
     (Figure: example input "[CLS] The traffic light was red.")
  8. Implementation using the citation network
     (Figure: the citation graph with nodes A-F, showing training triplets built with easy negatives vs. hard negatives; a code sketch of the triplet objective follows)
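     Code sketch: a minimal illustration of the contrastive (triplet) objective above, assuming a Hugging Face encoder checkpoint (SciBERT here) and a single hand-built triplet; in the real pipeline the triplets are mined from the citation graph.

         import torch
         from transformers import AutoModel, AutoTokenizer

         tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
         encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

         def embed(title_and_abstract: str) -> torch.Tensor:
             # The final [CLS] vector of the title + abstract is the paper embedding f(Q).
             inputs = tokenizer(title_and_abstract, truncation=True, max_length=512, return_tensors="pt")
             return encoder(**inputs).last_hidden_state[:, 0]

         query = embed("Paper Q: title and abstract ...")
         positive = embed("A paper cited by Q ...")                 # linked in the citation graph
         negative = embed("A random or hard negative paper ...")    # not cited by Q

         # Pull cited papers closer, push negatives away (triplet margin loss).
         loss = torch.nn.functional.triplet_margin_loss(query, positive, negative, margin=1.0)
         loss.backward()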
  9. Hard negatives
                             Classification   User activity   Citation   Recomm.   Avg.
     SPECTER                 84.2             88.4            91.5       36.9      80.0
     No hard negatives       82.4             85.8            89.8       36.8      78.4
  10. SPECTER recap
     • Improved document representations trained using the citation network
     • Improved zero-shot performance on a variety of downstream tasks
     • Currently in use at Semantic Scholar and some other academic search engines
  11. Outline, part 2: Long document understanding
     Longformer: The Long Document Transformer (2020)
     Iz Beltagy*, Matthew E. Peters*, Arman Cohan*
     • Language data consists of many long documents
     • Existing Transformers (BERT, T5) process up to 512 tokens
  12. Limitation of Transformers
     (Figure: a stack of encoder layers 1-24, each built from self-attention and feed-forward sublayers, mapping the input through FF + softmax to the output)
  13. Limitation of Transformers
     • Self-attention is expensive
       ◦ Quadratic complexity in sequence length: O(n²)
       ◦ Input length is therefore limited (see the sketch below)
     Fig from: https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC
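     Code sketch: a back-of-the-envelope illustration of why the O(n²) score matrix limits input length (the numbers assume 12 heads and fp32 activations).

         def attention_scores_gib(seq_len: int, num_heads: int = 12, bytes_per_float: int = 4) -> float:
             # One (seq_len x seq_len) score matrix per head, per layer.
             return seq_len * seq_len * num_heads * bytes_per_float / 2**30

         for n in (512, 4096, 16384):
             print(f"n={n:>6}: ~{attention_scores_gib(n):5.2f} GiB of attention scores per layer")
         # 512 tokens need ~0.01 GiB, 4K tokens ~0.75 GiB, 16K tokens ~12 GiB, before the backward pass.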
  14. Prior work (scaling to long documents)
     • Chunk, extract, combine
     • Compressing memory in Transformers: Transformer-XL (2019), Adaptive Span (2019), Compressive (2019)
     • Optimizing the self-attention computation: Reformer (2020), Sparse Transformer (2019), BP-Transformer (2020), Linformer (2020)
     • Limitation 1: only intrinsic evaluation on language modeling tasks
     • Limitation 2: no pretrained general model that can be used for downstream tasks
  15. Proposed method: Longformer
     • Efficient Transformer model for long documents
     • Allows processing documents up to 16K tokens long
     • General model that can be applied to various document-level NLP tasks
  16. Longformer: sparse attention matrix
     • Avoid computing the full attention matrix
     • Tokens attend to each other following an "attention pattern" (see the sketch below)
     • Stacked layers give a large receptive field
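     Code sketch: a minimal sliding-window attention pattern, assuming a window of w tokens around each position, so only O(n*w) score entries are kept instead of O(n²).

         import torch

         def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
             # mask[i, j] is True iff token j lies within window // 2 positions of token i.
             idx = torch.arange(seq_len)
             return (idx[None, :] - idx[:, None]).abs() <= window // 2

         scores = torch.randn(8, 8)                                       # toy full attention scores
         scores = scores.masked_fill(~sliding_window_mask(8, window=4), float("-inf"))
         attn = scores.softmax(dim=-1)                                    # banded attention weights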
  17. Longformer: local and global attention
     • Indirect information flow through stacked local windows is not always sufficient; some tasks need direct attention
     • QA: [cls] [q1 q2 q3] [sep] [t1 t2 t3 t4 t5 t6 t7 ...], with global attention on the question tokens and local attention windows over the document
     • Classification: [cls] [t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 ...], with global attention on [cls] and local attention windows over the document
  18. Longformer: local and global attention
     • Global attention on a few locations of interest
     • Global attention is user-defined based on the task
     • Separate projection weights for global attention
     • Local attention captures local context
  19. Implementation
     • Requires banded matrix multiplication, which is not supported in existing DL libraries
     • Implementation 1: sliding chunks: split into chunks, then mask out the extra elements (see the sketch below)
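     Code sketch: a simplified version of the chunking idea, assuming non-overlapping blocks; the actual implementation uses overlapping chunks of size 2w plus masking to recover the exact band.

         import torch

         def blockwise_scores(q: torch.Tensor, k: torch.Tensor, chunk: int) -> torch.Tensor:
             # q, k: (seq_len, dim); seq_len must be divisible by chunk.
             seq_len, dim = q.shape
             q_chunks = q.view(seq_len // chunk, chunk, dim)
             k_chunks = k.view(seq_len // chunk, chunk, dim)
             # Shape (num_chunks, chunk, chunk): only local score blocks are ever materialized.
             return torch.bmm(q_chunks, k_chunks.transpose(1, 2)) / dim ** 0.5

         scores = blockwise_scores(torch.randn(4096, 64), torch.randn(4096, 64), chunk=512)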
  20. Implementation
     • Requires banded matrix multiplication, which is not supported in existing DL libraries
     • Implementation 2: custom CUDA kernel
       ◦ Supports dilation
       ◦ Only computes the non-zero elements (memory efficient)
       ◦ Slightly harder to use (requires the custom kernel)
  21. Intrinsic evaluation: character-level LM (small models)
     • Longformer is trained on sequences 32K tokens long
     (Figure: test BPC, lower is better, on text8 and enwik8, comparing T12 (2018), Transformer-XL (2019), Reformer (2020), Adaptive Span (2019), BP-Transformer (2019/2020), and Longformer)
  22. Intrinsic evaluation: character-level LM (large models)
     (Figure: enwik8 test BPC for large models: Transformer-XL (88M params), Sparse Transformer (100M), Transformer-XL 24 layers (277M), Adaptive Span (209M), Compressive Transformer (277M), and Longformer (102M))
  23. Longformer in the transfer learning setting
     • Goal: a BERT-like model for long-document NLP tasks
     • Pretrain with self-supervision, then fine-tune on the end task
     • Naïve pre-training is expensive
     (Figure: Longformer + unlabeled data → self-supervised learning → pre-trained Longformer → long-document tasks: QA, classification, summarization)
  24. Longformer pre-training
     • Careful initialization to avoid the high cost of pretraining (see the sketch below):
       ◦ Start from RoBERTa weights (which support 512 tokens)
       ◦ Increase the size of the position embedding matrix and initialize it by repeatedly copying the first 512 learned positions
       ◦ Continue masked LM (MLM) pre-training on a corpus of long documents
       ◦ Copy the Q, K, V linear projections to get separate projections for global attention
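     Code sketch: the position-embedding initialization described above, assuming the Hugging Face RoBERTa implementation (whose position embedding matrix holds 512 learned positions plus 2 special rows); the matching updates to the position_ids buffer and the attention pattern are omitted.

         import torch
         from transformers import RobertaModel

         model = RobertaModel.from_pretrained("roberta-base")
         old = model.embeddings.position_embeddings.weight.data       # shape (514, 768)
         new_len = 4096 + 2
         new_embed = torch.nn.Embedding(new_len, old.size(1))
         new_embed.weight.data[:2] = old[:2]                          # keep the two special rows
         step = old.size(0) - 2                                       # the 512 learned positions
         for start in range(2, new_len, step):
             end = min(start + step, new_len)
             new_embed.weight.data[start:end] = old[2:2 + end - start]  # tile the learned positions
         model.embeddings.position_embeddings = new_embed
         model.config.max_position_embeddings = new_len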
  25. Experiments on downstream tasks
     • HotpotQA: multihop reasoning and evidence extraction
     • TriviaQA: QA dataset from long Wikipedia articles
     • WikiHop: QA requiring combining facts spread across multiple paragraphs
     • Coref: coreference chains across the document
     • IMDB: document classification
     • Hyperpartisan news: document classification
  26. Finetuning on downstream tasks: global attention (usage sketch below)
     • Classification: on the [CLS] token
     • QA: on all question tokens
     • Coref: local attention only
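     Code sketch: setting the global attention mask with the released checkpoint on the Hugging Face hub; local attention is used everywhere else.

         import torch
         from transformers import LongformerModel, LongformerTokenizer

         tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
         model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

         question, document = "Who proposed Longformer?", "long document text ... " * 500
         inputs = tokenizer(question, document, truncation=True, max_length=4096, return_tensors="pt")

         global_mask = torch.zeros_like(inputs["input_ids"])
         global_mask[:, 0] = 1                                            # classification: [CLS] only
         first_sep = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0].item()
         global_mask[:, :first_sep] = 1                                   # QA: also all question tokens

         outputs = model(**inputs, global_attention_mask=global_mask)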
  27. Finetuning on downstream tasks (large model)
     • SOTA on WikiHop and TriviaQA
     • Competitive results on HotpotQA
       ◦ Longformer is simpler; the better-performing models use GNNs over entities
  28. Encoder-decoder setting
     • Long-document seq2seq tasks, e.g. long-document summarization
     • LED: Longformer Encoder-Decoder (usage sketch below)
     (Figure: encoder layers use Longformer self-attention over the document; decoder layers use full self-attention; FF + softmax produces the summary)
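     Code sketch: generating a summary with the released allenai/led-large-16384 checkpoint, with global attention on the first token.

         import torch
         from transformers import LEDForConditionalGeneration, LEDTokenizer

         tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384")
         model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

         document = "full text of a long paper ..."
         inputs = tokenizer(document, truncation=True, max_length=16384, return_tensors="pt")
         global_mask = torch.zeros_like(inputs["input_ids"])
         global_mask[:, 0] = 1                                            # global attention on the first token

         summary_ids = model.generate(
             inputs["input_ids"],
             attention_mask=inputs["attention_mask"],
             global_attention_mask=global_mask,
             num_beams=4,
             max_length=256,
         )
         print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))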
  29. Summarization results
     • arXiv dataset: generate the abstract given the full paper
                                  R-2     R-L
     Discourse-aware (2018)       11.05   31.80
     Extr-Abst-TLM (2020)         14.69   38.03
     Dancer (2020)                16.54   38.44
     Pegasus (2019)               16.95   38.83
     BigBird-4K (2020)            19.02   41.77
     LED-large-4K (ours)          17.94   39.76
     LED-large-16K (ours)         19.62   41.83
  30. Summary: Longformer
     • An efficient Transformer model for processing very long documents
     • State-of-the-art results on document-level classification, summarization, and QA tasks
     • Widely used by the community
  31. Outline, part 3: Beyond single documents: multi-document tasks
     Cross-Document Language Modeling (Findings of EMNLP 2021)
     Avi Caciularu, Arman Cohan, Iz Beltagy, Matthew E. Peters, Arie Cattan, Ido Dagan
     • Existing works often use complex task-specific architectures for multi-document tasks
     • They do not leverage the pre-training / fine-tuning paradigm
     • CDLM: Cross-Document Language Modeling
       ◦ A pre-training method that exploits shared information between documents
       ◦ A general, easy-to-use model for multi-document tasks
  32. CDLM: Cross-document language modeling
     • A pre-training approach to implicitly encode cross-document relationships
     • Uses a cluster of related documents when pre-training Longformer
     • Example cluster:
       ◦ Doc 1: "...Harry Shearer is suing Vivendi's Universal Music for $125 million for allegedly fraudulent ..."
       ◦ Doc 2: "...Harry Shearer alleges parent company of Universal Music and Studio Canal withheld millions ..."
       ◦ Doc 3: "In October 2016, Shearer's Century of Progress Productions sued UMG, StudioCanal and their parent company, Vivendi, over unpaid royalties."
  33. Longformer for multi-document tasks
     (Figure: the question and Doc 1 through Doc 4 are concatenated into a single long input, fed through Longformer, which produces the answer)
  34. CDLM: Cross-document language modeling (see the sketch below)
     (Figure: a cluster of related documents is concatenated with <doc-sep> separators, e.g. "Harry Shearer is suing Vivendi's <doc-sep> Harry Shearer [MASK] parent company <doc-sep>", and fed to Longformer; the masked token gets global attention while the remaining tokens use local attention, and the model predicts the masked word "alleges" using evidence from the other document)
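     Code sketch: assembling a cross-document input in the spirit of CDLM, starting from the public Longformer checkpoint and adding a <doc-sep> special token by hand; the checkpoint and token handling here are assumptions, not the released CDLM model.

         import torch
         from transformers import LongformerForMaskedLM, LongformerTokenizer

         tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
         model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")
         tokenizer.add_special_tokens({"additional_special_tokens": ["<doc-sep>"]})
         model.resize_token_embeddings(len(tokenizer))

         docs = [
             "Harry Shearer is suing Vivendi's Universal Music ...",
             "Harry Shearer <mask> parent company of Universal Music ...",
         ]
         inputs = tokenizer(" <doc-sep> ".join(docs), return_tensors="pt")

         # Global attention on masked tokens and document separators; local attention elsewhere.
         doc_sep_id = tokenizer.convert_tokens_to_ids("<doc-sep>")
         global_mask = ((inputs["input_ids"] == tokenizer.mask_token_id) |
                        (inputs["input_ids"] == doc_sep_id)).long()
         outputs = model(**inputs, global_attention_mask=global_mask)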
  35. Cross-Document Language Modeling (CDLM)
     • Start from Longformer
     • Continue pre-training on the Multi-News corpus (Fabbri et al., 2019)
       ◦ 45K document clusters (each with 2 to 10 documents)
       ◦ Train for 25K steps
  36. CDLM: Finetuning
     • Concatenate the documents and process them through CDLM
       ◦ Task-specific finetuning, similar to Longformer
  37. Intrinsic evaluation: cross-document perplexity (MLM PPL, lower is better)
     Longformer                3.94
     CDLM (local attention)    3.84
     CDLM (random docs)        3.81
     CDLM (prefix attention)   3.41
     CDLM                      3.39
  38. Results: Cross-document coreference resolution
     (Figure: CoNLL F1 bar charts comparing CDLM with Barhom et al. (2019), Meged et al. (2020), Cattan et al. (2020), Zeng et al. (2020), and Yu et al. (2020) on entity coreference, and with Barhom et al. (2019) and Cattan et al. (2020) on event coreference)
  39. Summary: CDLM
     • Pre-training on multiple related documents improves multi-document performance
     • A general Transformer model for multi-document tasks
       ◦ Encoder-only
       ◦ Eliminates the need for task-specific multi-document modeling
  40. Outline, part 4: Beyond single documents: multi-document generation tasks
     PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2021)
     Wen Xiao, Iz Beltagy, Giuseppe Carenini, Arman Cohan
     • How can we extend Transformers to multi-document tasks that involve generation?
     • We focus on multi-document summarization
  41. Previous methods
     • Dataset-specific models (graph-based, hierarchical): require domain-specific additional information and use customized architectures, so it is hard to leverage pre-trained LMs
     • Pre-trained generation models:
       ◦ General-purpose (BART, T5, ...): require a large amount of data to fine-tune
       ◦ Task-specific (PEGASUS): mainly focus on single-document input
     • PRIMER: mainly for multi-document summarization
  42. Overview of PRIMER
     • Architecture: Longformer Encoder-Decoder (LED)
     • Input structure:
       ◦ Documents separated with a document separator token (<doc-sep>)
       ◦ Global attention on <doc-sep> tokens
  43. Overview of PRIMER
     • Architecture: Longformer (with local and global attention)
     • Input structure:
       ◦ Documents separated with a document separator token (<doc-sep>)
       ◦ Global attention on <doc-sep> tokens
     • Pre-training:
       ◦ Goal: teach the model to identify and aggregate salient information across a "cluster" of related documents
       ◦ Multi-document corpus: NewSHead (360K clusters, 3.5 documents per cluster on average)
       ◦ Objective: Gap Sentence Generation
       ◦ Strategy: Entity Pyramid
  44. Pre-training objective
     • Motivation: stay close to the multi-document summarization task
     • Gap Sentence Generation [Zhang et al., 2019]:
       ◦ Select several SALIENT sentences from the input documents (as a pseudo-summary)
       ◦ Mask out the selected sentences
       ◦ Recover them in order in the decoder
  45. How to select SALIENT sentences?
     Previous work for single-document input (PEGASUS):
     • Random
     • Lead-K
     • Principle (Best): for each sentence, compute the ROUGE score between that sentence and the rest of the document, then select the top-scoring sentences greedily (see the sketch below)
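     Code sketch: a simplified version of the Principle selection, with a unigram-overlap stand-in for ROUGE so the example stays self-contained; each sentence is scored against the rest of the document and the top-k are chosen for masking.

         def unigram_recall(sentence: str, rest: list[str]) -> float:
             # Crude ROUGE-1-recall proxy: fraction of the sentence's words that appear in the rest of the document.
             sent = set(sentence.lower().split())
             rest_words = set(" ".join(rest).lower().split())
             return len(sent & rest_words) / max(len(sent), 1)

         def select_gap_sentences(sentences: list[str], num_masked: int) -> list[int]:
             scored = [(unigram_recall(s, sentences[:i] + sentences[i + 1:]), i)
                       for i, s in enumerate(sentences)]
             top = sorted(scored, reverse=True)[:num_masked]
             return sorted(i for _, i in top)                     # indices of the sentences to mask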
  46. How to select SALIENT sentences?
     • However, multi-document input tends to be more redundant
     • Such a strategy prefers exact matches between sentences, resulting in the selection of less representative information
  47. New masking strategy: Entity Pyramid
     • Select sentences that best represent the entire cluster of input documents
     • Based on the Pyramid evaluation framework (Nenkova and Passonneau, 2004)
     • Goal: encourage the model to generate missing information using the other documents in the input
  48. Recall: Pyramid evaluation with multiple references (Nenkova and Passonneau, 2004)
     Given: multiple gold human-written summaries (references) for each input.
     Human annotators:
     • Annotate Summary Content Units (SCUs): phrases or clauses
     • Weight each SCU by the number of references it appears in
     • For each candidate summary, label which SCUs the summary matches
     • The score is the sum of the weights of the matched SCUs
  49. Recall: Pyramid evaluation (with multiple references)
     • REF #1: In 1998 two Libyans indicted in 1991 for the Lockerbie bombing were still in Libya.
     • REF #2: Two Libyans were indicted in 1991 for blowing up a Pan Am jumbo jet over Lockerbie, Scotland in 1988.
     • REF #3: Two Libyans, accused by the United States and Britain of bombing a New York bound Pan Am jet over Lockerbie, Scotland in 1988, killing 270 people, for 10 years were harbored by Libya who claimed the suspects could not get a fair trial in America or Britain.
     • REF #4: Two Libyan suspects were indicted in 1991.
  50. Recall: Pyramid evaluation (with multiple references), same references as above
     • SCU #1 (w=4): two Libyans were officially accused
  51. Recall: Pyramid evaluation (with multiple references), same references as above
     • SCU #2 (w=3): the indictment of the two Lockerbie suspects was in 1991
  52. Recall: Pyramid evaluation with multiple references (Nenkova and Passonneau, 2004)
     • The relative importance of a fact to be conveyed is quantified by the number of references it appears in
     • SCU #1: w=4, SCU #2: w=3, SCU #3: ..., forming a "pyramid" of weights (worked example below)
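     Code sketch: a tiny worked example of the pyramid score, using the two SCUs from the slides plus one invented SCU with weight 2; a candidate's score is the sum of the weights of the SCUs it covers, normalized by the best achievable sum for the same number of SCUs.

         scu_weights = {
             "two Libyans were officially accused": 4,                 # SCU #1 from the slides
             "the indictment was in 1991": 3,                          # SCU #2 from the slides
             "the jet was blown up over Lockerbie, Scotland": 2,       # invented for illustration
         }
         candidate_scus = {"two Libyans were officially accused", "the indictment was in 1991"}

         score = sum(scu_weights[s] for s in candidate_scus)                            # 4 + 3 = 7
         best = sum(sorted(scu_weights.values(), reverse=True)[:len(candidate_scus)])   # also 7
         print(score / best)                                                            # 1.0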
  53. New masking strategy: Entity Pyramid
     • Idea: the relative importance of facts in the input documents can be quantified in the same way; the more documents a fact appears in, the more important it is in the cluster
     • We use entities to identify the facts, as a proxy for human-labeled SCUs
  54. New masking strategy: Entity Pyramid (see the sketch below)
     • Step 1: Entity extraction
     • Step 2: Build the Entity Pyramid
     • Step 3: Select the most representative sentence for each entity
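     Code sketch: the three Entity Pyramid steps, assuming spaCy for entity extraction and a crude "longest mentioning sentence" stand-in for the ROUGE-based choice of the most representative sentence.

         from collections import Counter
         import spacy

         nlp = spacy.load("en_core_web_sm")

         def entity_pyramid_sentences(documents: list[str]) -> list[str]:
             docs = [nlp(d) for d in documents]
             # Step 1: entity extraction.
             doc_entities = [{ent.text.lower() for ent in d.ents} for d in docs]
             # Step 2: build the pyramid by weighting each entity by its document frequency.
             pyramid = Counter(e for ents in doc_entities for e in ents)
             salient = [e for e, w in pyramid.most_common() if w > 1]     # entities in 2+ documents
             # Step 3: pick one representative sentence per salient entity.
             selected = []
             for entity in salient:
                 mentions = [s.text for d in docs for s in d.sents if entity in s.text.lower()]
                 if mentions:
                     selected.append(max(mentions, key=len))
             return selected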
  55. Ablation on pre-training strategies
     Compare with PEGASUS (2020):
     • Same architecture (LED-base)
     • Same input structure
     • Same pre-training objective
     • Zero-/few-shot setting
  56. Ablation on pre-training strategies
     Compare with the PEGASUS Principle strategy (2020):
     • Same architecture (LED-base)
     • Same input structure
     • Same pre-training objective
     • Zero-/few-shot setting
     • The Entity Pyramid strategy works better than the Principle strategy used in PEGASUS
  57. Experiments
     • Evaluation datasets:
       ◦ Multi-document summarization: Multi-News, Multi-XScience, WCEP, WikiSum, DUC 2004
       ◦ Single-document summarization: arXiv
     • Evaluation metric: ROUGE (R-1, R-2, and R-L); see the scoring sketch below
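     Code sketch: computing the reported metrics with Google's rouge-score package (pip install rouge-score); the example strings are placeholders.

         from rouge_score import rouge_scorer

         scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
         scores = scorer.score(
             target="two libyans were indicted in 1991 for the lockerbie bombing",
             prediction="two libyans were charged in 1991 over the lockerbie bombing",
         )
         print(scores["rouge2"].fmeasure)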
  58. Experiments
     • Settings:
       ◦ Zero-shot
       ◦ Few-shot: 10/100 training examples, 5 runs per model
       ◦ Fully supervised
     • Compared models:
       ◦ BART
       ◦ PEGASUS
       ◦ Longformer Encoder-Decoder (LED)
       ◦ Prior SOTA models (fully supervised only)
  59. Results: zero-shot
     • PEGASUS is also pre-trained for the summarization downstream task, so it performs better than the other two baselines
     • Our model outperforms all the other pre-trained models on most of the datasets (by up to 5 ROUGE points)
  60. Results: few-shot
     • Our model outperforms all the other pre-trained models on all the datasets
  61. Results: full fine-tuning
     • Our model achieves SOTA on several multi-document summarization datasets, as well as on a single-document summarization dataset
  62. Takeaway
     • PRIMER is a pre-trained model for multi-document summarization
     • It is pre-trained with a new strategy, Entity Pyramid
     • PRIMER reduces the need for dataset-specific architectures and large labeled datasets
     • PRIMER achieves SOTA on multiple datasets in the zero-shot, few-shot, and fully supervised settings
  63. New benchmarks
     • SciFact: scientific fact verification
     • TLDR: extreme summarization of scientific papers
     • QASPER: question answering over scientific papers
  64. SciFact: Motivation
     Fact or Fiction: Verifying Scientific Claims (EMNLP 2020)
     David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, Hannaneh Hajishirzi
     (Figure: the scientific claim "The coronavirus cannot thrive in warmer climates" is checked by a claim verification system against a research corpus; "...a 1°C increase in local temperature reduces transmission by 13%..." SUPPORTS the claim, while "...summer temperatures will not substantially limit pandemic growth..." REFUTES it)
  65. Generating and verifying claims
     (Figure: the citation sentence "The results of Weld et al. suggest that COVID-19 infection is mediated by ACE-2 receptors [45, 46]..." from a citing paper is rewritten as the claim "ACE-2 receptors are involved in coronavirus infection"; the cited papers 45 and 46 serve as evidence that SUPPORTS or REFUTES it)
     ✓ Scalable: rewrite existing citation sentences as claims
     ✓ Important findings, as judged by other scientists
     ✓ Evidence documents are specified by the citations
  66. Dataset statistics
     Domain                  Wikipedia       Biomedicine
     Claim type              Synthetic       Natural
     Number of claims        185k            1.4k
     Annotator agreement     0.68 (5-way)    0.75 (2-way)
  67. TLDR: Extreme summarization of scientific papers (Findings of EMNLP 2020)
     Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld
     TLDR: summarizing scientific papers in a single sentence. Examples:
     • Title: A Large-Scale, Automated Study of Language Surrounding Artificial Intelligence (Autumn Toney; cs.CL)
       Abstract: This work presents a large-scale analysis of artificial intelligence (AI) and machine learning (ML) references within news articles and scientific publications between 2011 and 2019. We implement word association measurements that automatically identify shifts in language co-occurring with AI/ML and quantify the strength of these word associations. Our results highlight the evolution of perceptions and definitions around AI/ML and detect emerging application areas, models, and systems (e.g., blockchain and cybersecurity). Recent small-scale, manual studies have explored AI/ML discourse within the general public, the policymaker community, and researcher community, but are limited in their scalability and longevity. Our methods provide new views into public perceptions and subject-area expert discussions of AI/ML and greatly exceed the explanative power of prior work.
       tl;dr: Uses word association measurements for analyses of AI-related references in news and scientific articles.
     • Title: LazyFormer: Self Attention with Lazy Update (Chengxuan Ying, Guolin Ke, Di He, Tie-Yan Liu; cs.CL, cs.AI)
       Abstract: Improving the efficiency of Transformer-based language pre-training is an important task in NLP, especially for the self-attention module, which is computationally expensive. In this paper, we propose a simple but effective solution, called LazyFormer, which computes the self-attention distribution infrequently. LazyFormer composes of multiple lazy blocks, each of which contains multiple Transformer layers. In each lazy block, the self-attention distribution is only computed once in the first layer and then is reused in all upper layers. In this way, the cost of computation could be largely saved. We also provide several training tricks for LazyFormer. Extensive experiments demonstrate the effectiveness of the proposed method.
       tl;dr: An efficient approach for Transformer self-attention that computes the self-attention once in the first layer and reuses it in upper layers.

  69. Data collection
     • Peer reviews include a summary of the paper in the first paragraph
     • Rewrite the information in that first paragraph as a TLDR
  70. QASPER: Question answering over scientific papers
     A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers (NAACL 2021)
     Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, Matt Gardner
     • 5,049 questions, written and answered by NLP professionals / students
  71. Takeaways
     • Many practical NLP problems require full document understanding
     • Extending existing models to document-level and multi-document tasks:
       ◦ SPECTER: document-level representation learning
       ◦ Longformer: long-document Transformer
       ◦ CDLM: cross-document language model
       ◦ PRIMER: multi-document summarization model
     • New document-level benchmarks (SciFact, TLDR, QASPER)