Slide 1

Poly-Encoder: unofficial slides by @izuna385

Slide 2

Summary
• Combines the merits of both the Bi-encoder and the Cross-encoder:
  - Bi-encoder: fast, but does not consider cross-attention.
  - Cross-encoder: better performance thanks to cross-attention, but slow.
• (Additional:) A pretraining strategy using datasets similar to the downstream tasks.

Slide 3

RQ and Solution
• Research Question: How can we combine the merits of the Bi-encoder and the Cross-encoder?
• Solution: Cache the candidate representations and let them attend to the context.

Slide 4

Encoder Structure (A.) Bi-encoder
[Figure: the mention is encoded on its own, with its [CLS] output as the representation]

Slide 5

Encoder Structure (A.) Bi-encoder
[Figure: the mention and the entity are encoded separately, each represented by its [CLS] output]

Slide 6

Encoder Structure (A.) Bi-encoder
[Figure: as above, mention and entity encoded separately]
• Entity representations can be cached for fast search.

Slide 7

Encoder Structure (A.) Bi-encoder
[Figure: as above, mention and entity encoded separately]
• Caching enables fast search, but the model can't consider mention-entity cross-attention.
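
To make the caching point concrete, here is a minimal PyTorch sketch of bi-encoder scoring. It is illustrative only: mention_encoder and entity_encoder stand in for BERT-like modules assumed to return per-token hidden states, and none of these names come from the paper's code.

    import torch

    # Bi-encoder sketch: mention and entities are encoded separately,
    # so entity vectors can be precomputed once and cached.

    def cls_embedding(encoder, token_ids):
        """Return the [CLS] vector (position 0) for a batch of token ids."""
        hidden = encoder(token_ids)        # (batch, seq_len, dim), assumed output shape
        return hidden[:, 0, :]             # (batch, dim)

    def build_entity_cache(entity_encoder, all_entity_token_ids):
        """Encode every entity once, offline; this cache enables fast search."""
        with torch.no_grad():
            return cls_embedding(entity_encoder, all_entity_token_ids)  # (num_entities, dim)

    def bi_encoder_scores(mention_encoder, mention_token_ids, entity_cache):
        """Score each mention against all cached entities with a dot product."""
        mention_vec = cls_embedding(mention_encoder, mention_token_ids)  # (batch, dim)
        return mention_vec @ entity_cache.T                              # (batch, num_entities)

Because the two inputs never see each other inside the encoders, there is no cross-attention between mention and entity; that is the trade-off the next slides address.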

Slide 8

Encoder Structure (B.) Cross-Encoder
• Example: Zero-shot Entity Linking [Logeswaran et al., ACL'19]
• Input: [CLS] mention context [ENT] candidate entity description, encoded with BERT [Devlin et al., '18]
• A dedicated embedding indicates the mention location in the context.

Slide 9

Encoder Structure (B.) Cross-Encoder
• For each candidate entity generated per mention, the input [CLS] mention context [ENT] candidate entity description is encoded [Devlin et al., '18; Logeswaran et al., ACL'19], and the [CLS] output is used for scoring.

Slide 10

Encoder Structure (B.) Cross-Encoder
• For each candidate entity generated per mention, the input [CLS] mention context [ENT] candidate entity description is encoded, and the [CLS] output is used for scoring.
• This considers mention-entity cross-attention.

Slide 11

Encoder Structure (B.) Cross-Encoder
• For each candidate entity generated per mention, the input [CLS] mention context [ENT] candidate entity description is encoded, and the [CLS] output is used for scoring.
• This considers mention-entity cross-attention, but inference is slow: the encoder must be re-run for every mention and each of its candidates.
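
A minimal sketch of this scoring step, under stated assumptions: encoder is a BERT-like module returning per-token hidden states, w is a learned scoring vector, and CLS_ID / ENT_ID are the ids of the [CLS] / [ENT] tokens. These names are illustrative, not from the paper's code.

    import torch

    # Cross-encoder sketch: context and candidate share one input sequence,
    # so full cross-attention applies, but the encoder must be re-run for
    # every (mention, candidate) pair, which is why inference is slow.

    def cross_encoder_score(encoder, w, context_ids, candidate_ids, CLS_ID, ENT_ID):
        input_ids = torch.cat([
            torch.tensor([CLS_ID]), context_ids,    # [CLS] mention context
            torch.tensor([ENT_ID]), candidate_ids,  # [ENT] candidate description
        ]).unsqueeze(0)                              # (1, seq_len)
        cls_vec = encoder(input_ids)[:, 0, :]        # (1, dim), assumed output shape
        return cls_vec @ w                           # (1,) relevance score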

Slide 12

Poly-Encoder
• Both the candidate embeddings and the learned context codes can be cached. → Fast inference.
• Attention from the candidates → extracts the pertinent parts of the context per candidate.
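
A minimal PyTorch sketch of the two attention steps (shapes and names are illustrative assumptions, not the authors' implementation): m learned codes first attend over the context tokens to form m global context features, then each cached candidate attends over those features before the final dot product.

    import torch
    import torch.nn.functional as F

    def poly_encoder_scores(ctxt_out, codes, cand_emb):
        """ctxt_out: (seq_len, dim) context token outputs
           codes:    (m, dim) learned context codes
           cand_emb: (num_cands, dim) cached candidate embeddings"""
        # 1) Each code attends over the context tokens -> m global features.
        attn = F.softmax(codes @ ctxt_out.T, dim=-1)      # (m, seq_len)
        y_ctxt = attn @ ctxt_out                          # (m, dim)
        # 2) Each candidate attends over the m features, extracting the
        #    parts of the context pertinent to that candidate.
        attn2 = F.softmax(cand_emb @ y_ctxt.T, dim=-1)    # (num_cands, m)
        ctxt_per_cand = attn2 @ y_ctxt                    # (num_cands, dim)
        # 3) Final score: dot product between candidate and context vectors.
        return (cand_emb * ctxt_per_cand).sum(dim=-1)     # (num_cands,)

Since cand_emb is precomputed, only the cheap attention over m vectors runs per candidate at inference time, which is how the poly-encoder keeps near bi-encoder speed.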

Slide 13

In-Batch Training
• In-batch negative sampling [Henderson et al., '17; Gillick et al., '18]
[Figure: within one batch, each context is scored against every gold label via a dot product]

Slide 14

In-Batch Training
• In-batch negative sampling [Henderson et al., '17; Gillick et al., '18]
[Figure: within one batch, each context's own label is the gold; the other labels in the batch act as negatives]
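
A minimal sketch of this training objective, assuming precomputed context and gold-label embeddings: the diagonal of the score matrix holds each context's gold score, and the remaining labels in the batch serve as negatives.

    import torch
    import torch.nn.functional as F

    def in_batch_loss(ctxt_vecs, label_vecs):
        """ctxt_vecs, label_vecs: (batch, dim). Row i of `scores` pairs
        context i with its gold label (column i); the other batch - 1
        labels in the row act as in-batch negatives."""
        scores = ctxt_vecs @ label_vecs.T          # (batch, batch) dot products
        targets = torch.arange(scores.size(0))     # gold labels lie on the diagonal
        return F.cross_entropy(scores, targets)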

Slide 15

Results (a.) Effect of negatives in a batch

Slide 16

Results (b.) Comparison of Bi- / Cross- / Poly-encoders
• See Table 4 of the original paper.
• They also checked the effect of changing the pretraining data for BERT.

Slide 17

Results (c.) Inference Speed Comparison

Slide 18

Conclusions