
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring


Unofficial slides introducing the paper.

izuna385

May 04, 2020

Transcript

  1. Summary
     • Combines the merits of both the Bi-encoder and the Cross-encoder:
       the Bi-encoder is faster but does not consider cross-attention, while
       the Cross-encoder performs better thanks to cross-attention but is slow.
     • (Additional:) A pre-training strategy that uses datasets similar to the
       downstream tasks.
  2. RQ and Solution
     • Research Question: how can the merits of the Bi-encoder and the
       Cross-encoder be combined?
     • Solution: cache the candidate representations and use attention from
       the candidates to the contexts.
  3. Encoder Structure (A.) Bi-encoder
     • (Figure: context and entity encoders each produce a [CLS] embedding.)
       The [CLS] embeddings can be cached for fast search, but cross-attention
       cannot be considered.
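
A minimal sketch of the Bi-encoder scoring described on this slide, assuming Hugging Face transformers with bert-base-uncased, simple [CLS] pooling, and made-up example strings; it illustrates the idea, not the authors' implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Bi-encoder sketch: context and candidates are encoded independently, so the
# candidate vectors can be pre-computed and cached; scoring is a dot product.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]                   # [CLS] vector per text

# Hypothetical example data for illustration.
candidate_descriptions = ["Paris is the capital of France.",
                          "Paris Hilton is an American media personality."]
mention_context = "She visited Paris last summer."

candidate_vecs = cls_embedding(candidate_descriptions)   # cached offline
context_vec = cls_embedding([mention_context])            # encoded at query time
scores = context_vec @ candidate_vecs.T                   # fast, but no cross-attention
```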
  4. Encoder Structure (B.) Cross-Encoder
     • Example: Zero-shot Entity Linking [Logeswaran et al., ACL'19].
       Input: "[CLS] mention context [ENT] candidate entity description"
       [Devlin et al., '18], with a special embedding indicating the mention
       location.
  5. Encoder Structure (B.) Cross-Encoder
     • For each candidate entity generated per mention, the mention context and
       the candidate entity description are fed in together, and the [CLS]
       output is used for scoring [Logeswaran et al., ACL'19; Devlin et al., '18].
  6. Encoder Structure (B.) Cross-Encoder
     • (Same setup as above.) The score takes mention-entity cross-attention
       into account.
  7. Encoder Structure (B.) Cross-Encoder
     • (Same setup as above.) Cross-attention improves accuracy, but inference
       is slow: each mention must be re-encoded together with every one of its
       candidates. A code sketch follows below.
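
For contrast, a hedged sketch of Cross-Encoder scoring under the same assumptions (bert-base-uncased, an untrained linear scoring head added for illustration, and BERT's standard [SEP] separator in place of a learned [ENT] token). Every (mention, candidate) pair needs its own forward pass, which is why inference is slow:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # illustrative, untrained

def cross_encoder_scores(mention_context, candidate_descriptions):
    # Pair the mention context with every candidate description so the
    # transformer applies full cross-attention between them.  Nothing can be
    # cached: each (mention, candidate) pair requires its own forward pass.
    contexts = [mention_context] * len(candidate_descriptions)
    batch = tokenizer(contexts, candidate_descriptions,
                      padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]        # [CLS] per pair
    return score_head(cls).squeeze(-1)

scores = cross_encoder_scores("She visited Paris last summer.",
                              ["Paris is the capital of France.",
                               "Paris Hilton is an American media personality."])
```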
  8. Poly-Encoder
     • Both the global context features and the candidate embeddings can be
       cached → fast inference.
     • Attention from the candidates → extracts the parts of the context that
       are pertinent to each candidate.
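
A rough sketch of the scoring step described on this slide; the shapes and the names ctx_token_states, cand_vec, and code_vecs are assumptions for illustration. The m learned codes attend over the context tokens to give m global context features (computed once per context), and the cached candidate embedding then attends over those features before the final dot-product score:

```python
import torch
import torch.nn.functional as F

def poly_encoder_score(ctx_token_states, cand_vec, code_vecs):
    """Illustrative poly-encoder scoring (shapes and names are assumptions).

    ctx_token_states: (T, d)  token outputs of the context encoder
    cand_vec:         (d,)    cached candidate embedding
    code_vecs:        (m, d)  m learned context codes
    """
    # Each of the m codes attends over the context tokens, giving m global
    # context features that are computed once per context and then reused.
    attn = F.softmax(code_vecs @ ctx_token_states.T, dim=-1)   # (m, T)
    ctx_vecs = attn @ ctx_token_states                          # (m, d)

    # The candidate attends over the m features, extracting the parts of the
    # context that are pertinent to this particular candidate.
    w = F.softmax(ctx_vecs @ cand_vec, dim=-1)                  # (m,)
    ctx_final = w @ ctx_vecs                                    # (d,)
    return ctx_final @ cand_vec                                 # scalar score

# Quick shape check with random tensors (T=20 tokens, d=768, m=4 codes).
score = poly_encoder_score(torch.randn(20, 768), torch.randn(768), torch.randn(4, 768))
```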
  9. In-Batch Training
     • In-batch negative sampling [Henderson et al., '17; Gillick et al., '18].
       (Figure: dot products between each context and each gold label within
       one batch.)
  10. In-Batch Training
      • In-batch negative sampling [Henderson et al., '17; Gillick et al., '18].
        Within one batch, each context's own gold label is the positive and the
        other gold labels in the batch act as negatives.
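
A small sketch of the objective these two slides depict, with an assumed helper name and random tensors standing in for real encoder outputs: the dot products between every context and every gold label in a batch form a B x B score matrix whose diagonal holds the gold pairs.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(context_vecs, gold_label_vecs):
    """context_vecs, gold_label_vecs: (B, d) outputs of the context / label encoders.

    Each context's own gold label is its positive; the other B-1 gold labels
    in the batch are reused as negatives (in-batch negative sampling).
    """
    scores = context_vecs @ gold_label_vecs.T        # (B, B) dot-product scores
    targets = torch.arange(scores.size(0))           # diagonal entries = gold pairs
    return F.cross_entropy(scores, targets)

loss = in_batch_negative_loss(torch.randn(8, 768), torch.randn(8, 768))
```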
  11. Results (b.) Comparison of Bi- / Cross- / Poly-encoders
      • See Table 4 of the original paper.
      • They also examined the effect of changing the pre-training data for BERT.