Transformer Memory as a Differentiable Search Index / Yi Tay (Google Research)

Transformer Memory as a Differentiable Search Index
Yi Tay, Research Scientist, Google

Premise
• Search/IR has primarily been about learning a scoring function s(d, q) for document-query pairs and ranking documents by it.
• Dual encoders are the well-established SOTA.

Dual Encoders
• Project document and query into a low-dimensional space and compute similarity.
• Trained using contrastive learning.
• Maximum inner product search (MIPS) is then used to find nearest neighbours.
• MIPS is discrete.
  ◦ Cannot be trained end-to-end (E2E).
• What about cross attention?
  ◦ Typically used in re-rankers.
  ◦ Expensive.
    ▪ Docs can’t be cached.
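
A minimal sketch of the dual-encoder pipeline described on this slide, assuming the encoders have already produced vectors. The brute-force `mips_top_k` stands in for a real vector search server, and using in-batch negatives for the contrastive loss is an assumption; the slide only says "contrastive learning". All function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product, used as the similarity score s(d, q)."""
    return sum(x * y for x, y in zip(a, b))


def mips_top_k(query_vec: Sequence[float],
               doc_vecs: List[Sequence[float]], k: int) -> List[int]:
    """Exhaustive maximum inner product search: indices of the k best docs.

    The sort/top-k step is discrete, so no gradient flows through it.
    """
    scores = [dot(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def in_batch_contrastive_loss(query_vecs: List[Sequence[float]],
                              doc_vecs: List[Sequence[float]]) -> float:
    """Softmax cross-entropy where doc i is the positive for query i and
    every other doc in the batch acts as a negative (an assumed setup)."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, d) for d in doc_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)
```

Training lowers the loss by pulling matching query and document vectors together; retrieval then happens outside the model through the MIPS step.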

What If?
What if we could parameterize a search system with just a single Transformer? 🤯
[Figure: Query → Complex Retrieval System → Result Set, with a single Transformer 😎 in place of the complex retrieval system.]

But how?
• How do we formulate this problem?
• Why hasn’t this been done before?
  ◦ Because it’s extremely hard!
• How do we frame retrieval as generation?
  ◦ Can we decode documents?
• How can we train this?
  ◦ How can we index documents?
  ◦ How can we retrieve them?
• Are you sure we’re just going to do this with a single model?

Problems with Existing Retrieval Paradigms
• Ranking is a cumbersome problem.
  ◦ We train with a pointwise/pairwise loss, but we have to score all (q, d) pairs during inference for evaluation.
  ◦ IR eval is painful.
• Retrieval systems that rely on MIPS have a non-differentiable (discrete) component.
  ◦ You’ll always have to spin up a fast vector search server to compute nearest neighbours.
  ◦ Goes against the paradigm of unified models (T5, UnifiedQA, etc.).

Towards Next Generation IR: Differentiable Search Index

DSI / Differentiable Search Index
• A Transformer Encoder-Decoder is used to parameterize a search system.
• All information about the corpus is stored in the model.
• We introduce an “indexing” task to allow DSI to ingest documents.
  ◦ All aspects of retrieval are now mapped to well-known ML problems.
• We then introduce a “retrieval” task to retrieve from Transformer memory (parameters).
• Indexing and retrieval are co-trained.

DSI Core Tasks
• Indexing
  ◦ Document tokens => Docid
• Retrieval
  ◦ Query tokens => Docid
• The two tasks are distinguished with a prompt token (is this a retrieval or an indexing task?)
  ◦ As per T5 style.
• But what are docids?
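
As a rough illustration, both tasks can be written as plain (input, target) text pairs that differ only in a task prompt. The prefix strings `"index: "` and `"retrieve: "` and the function names are hypothetical; the slides do not give the actual prompt tokens.

```python
from typing import Tuple

Example = Tuple[str, str]  # (input text, target text)

# Hypothetical task prompts; the real ones are not given in the slides.
INDEX_PREFIX = "index: "
RETRIEVE_PREFIX = "retrieve: "


def indexing_example(doc_text: str, docid: str) -> Example:
    """Indexing task: document tokens => docid."""
    return (INDEX_PREFIX + doc_text, docid)


def retrieval_example(query: str, docid: str) -> Example:
    """Retrieval task: query tokens => docid."""
    return (RETRIEVE_PREFIX + query, docid)


examples = [
    indexing_example("Paris is the capital of France", "<doc_17>"),
    retrieval_example("what is the capital of france", "<doc_17>"),
]
```

Both examples share the same target, which is what lets a single seq2seq model learn to map a query to the docid it memorized during indexing.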

Docids
• Naive String Docids
  ◦ We assign random docids to documents, e.g., <doc_198384>.
  ◦ This is treated as a string and tokenized with the default tokenizer.
  ◦ The model generates this with beam search.
• Atomic Docids
  ◦ We construct an additional doc embedding matrix with a predefined budget (e.g., 300k docs).
  ◦ This is like a subword vocab.
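
The two docid schemes can be sketched as follows. This assumes atomic docids are realized as one extra token id per document appended after the base vocabulary; the function names, the id range, and the vocabulary size used in the usage line are illustrative, not taken from the slides.

```python
import random
from typing import Dict, List


def naive_string_docids(doc_keys: List[str], seed: int = 0) -> Dict[str, str]:
    """Assign each document a unique random number, rendered as a string
    such as '<doc_198384>' that the default tokenizer would split up."""
    rng = random.Random(seed)
    numbers = rng.sample(range(10 ** 6), len(doc_keys))  # unique by construction
    return {key: f"<doc_{n}>" for key, n in zip(doc_keys, numbers)}


def atomic_docids(doc_keys: List[str], base_vocab_size: int,
                  budget: int) -> Dict[str, int]:
    """Give each document its own token id after the base vocabulary.

    `budget` is the predefined number of rows in the extra doc embedding
    matrix; exceeding it is an error.
    """
    if len(doc_keys) > budget:
        raise ValueError("more documents than the docid budget allows")
    return {key: base_vocab_size + i for i, key in enumerate(doc_keys)}


string_ids = naive_string_docids(["doc_a", "doc_b", "doc_c"])
atomic_ids = atomic_docids(["doc_a", "doc_b", "doc_c"], 32000, 300_000)
```

With string docids the decoder emits several ordinary tokens per document, while with atomic docids it emits a single dedicated token, much like predicting one entry of a subword vocabulary.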

Beam Searching Document Clusters
• Advanced method that leverages corpus information to assign docids.
• The resulting docids are like naive docids, but semantically informed.
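
The slide does not spell out how corpus information is used. One possibility, assumed here, is to cluster document embeddings hierarchically and use the path of cluster indices as the docid, so that similar documents share a prefix. The tiny k-means, the function names, and the parameters `k` and `leaf_size` (both assumed to be at most 10 so each level is one digit) are illustrative only.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _dist2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 10) -> List[int]:
    """Small deterministic k-means with farthest-point initialization."""
    centers: List[Vector] = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: _dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def assign_docids(doc_keys: List[str], vecs: Dict[str, Vector],
                  k: int, leaf_size: int) -> Dict[str, str]:
    """Docid = digits of the cluster path, then the position inside the leaf."""
    if len(doc_keys) <= leaf_size:
        return {d: str(i) for i, d in enumerate(doc_keys)}
    labels = kmeans([vecs[d] for d in doc_keys], k)
    if len(set(labels)) < 2:  # clustering failed to split; stop recursing
        return {d: str(i) for i, d in enumerate(doc_keys)}
    out: Dict[str, str] = {}
    for j in range(k):
        members = [d for d, l in zip(doc_keys, labels) if l == j]
        for d, suffix in assign_docids(members, vecs, k, leaf_size).items():
            out[d] = str(j) + suffix
    return out
```

Because each docid is a path through the cluster tree, decoding it token by token with beam search narrows the candidate set one cluster level at a time.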

Document Representation
• How do we represent document tokens?
• Direct indexing
  ◦ Take the document “as it is”.
• Set indexing
  ◦ Permutation invariant. Removes duplicated words and stopwords.
• Inverted indexing
  ◦ Sliding window of chunks.
  ◦ Take turns to associate docids to different parts of the document.
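
The three representations can be sketched over a pre-tokenized document. Whitespace tokenization, the tiny stopword set, and the `max_len`, `window`, and `stride` parameters are assumptions for illustration; set indexing keeps first-occurrence order here only to make the output deterministic.

```python
from typing import List, Sequence, Set


def direct_index(tokens: Sequence[str], max_len: int) -> List[str]:
    """Direct indexing: keep the first max_len tokens as they are."""
    return list(tokens[:max_len])


def set_index(tokens: Sequence[str], max_len: int,
              stopwords: Set[str]) -> List[str]:
    """Set indexing: drop stopwords and repeated words, then truncate."""
    seen: Set[str] = set()
    kept: List[str] = []
    for tok in tokens:
        if tok in stopwords or tok in seen:
            continue
        seen.add(tok)
        kept.append(tok)
    return kept[:max_len]


def inverted_index(tokens: Sequence[str], window: int,
                   stride: int) -> List[List[str]]:
    """Inverted indexing: sliding-window chunks, each later paired with
    the same docid as a separate training example."""
    return [list(tokens[s:s + window]) for s in range(0, len(tokens), stride)]


tokens = "the cat sat on the mat the cat slept".split()
direct = direct_index(tokens, 4)
as_set = set_index(tokens, 4, {"the", "on"})
chunks = inverted_index(tokens, window=4, stride=4)
```

Direct and set indexing yield one training input per document, whereas inverted indexing yields one per chunk, so different parts of a document take turns being associated with its docid.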

Indexing Task Options
• Seq2seq tasks
  ◦ Doc tokens -> Docid
  ◦ Docid -> Doc tokens
  ◦ Bidirectional (both doc -> docid and docid -> doc tokens)
  ◦ Span corruption
    ▪ Concatenate the docid to the document tokens and do regular T5-style pretraining.
• RQ: Which is the best way to learn the doc -> docid association?
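
A small sketch of how the first three options differ as training pairs. The mode names and the direction prefixes used in the bidirectional case are hypothetical, and span corruption is left out because it depends on a T5-style masking routine.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, target text)


def indexing_examples(doc_text: str, docid: str, mode: str) -> List[Example]:
    """Build indexing-task examples for one document under a given option."""
    if mode == "doc_to_docid":
        return [(doc_text, docid)]
    if mode == "docid_to_doc":
        return [(docid, doc_text)]
    if mode == "bidirectional":
        # Hypothetical prefixes tell the model which direction is requested.
        return [("doc2id: " + doc_text, docid), ("id2doc: " + docid, doc_text)]
    raise ValueError(f"unknown indexing mode: {mode}")


forward = indexing_examples("Paris is the capital of France", "<doc_17>", "doc_to_docid")
both = indexing_examples("Paris is the capital of France", "<doc_17>", "bidirectional")
```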

Model, Training and Optimization
• The main model is a simple Encoder-Decoder Transformer initialized from a T5 model.
• We optimize indexing and retrieval together with multi-task learning.
• We also explored indexing first for K steps, then learning retrieval.
  ◦ Did worse, but this may be due to optimization issues.
• The interplay is tricky, and the ratio between the indexing and retrieval tasks is important.
  ◦ Setting the ratio well can improve performance by 2x or even 3x.
• Optimization is tricky for DSI. Getting this right seems paramount for this new paradigm.
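
One simple way to control the indexing-to-retrieval ratio during co-training is a fixed interleaving schedule, sketched below. The function name, the deterministic schedule, and the 3:1 ratio in the usage lines are assumptions; the slides do not state how examples were mixed or which ratio worked best. Both example lists are assumed to be non-empty.

```python
from itertools import cycle, islice
from typing import Iterator, List, Tuple

Example = Tuple[str, str]  # (input text, target text)


def mix_tasks(indexing: List[Example], retrieval: List[Example],
              index_per_retrieval: int) -> Iterator[Tuple[str, Example]]:
    """Endlessly yield (task name, example), emitting `index_per_retrieval`
    indexing examples for every retrieval example."""
    idx_stream, ret_stream = cycle(indexing), cycle(retrieval)
    while True:
        for _ in range(index_per_retrieval):
            yield "indexing", next(idx_stream)
        yield "retrieval", next(ret_stream)


indexing_data = [("doc a text", "<doc_1>"), ("doc b text", "<doc_2>")]
retrieval_data = [("query about a", "<doc_1>")]
schedule = list(islice(mix_tasks(indexing_data, retrieval_data, 3), 8))
```

Changing `index_per_retrieval` is the knob the slide refers to: it shifts how often the model rehearses memorizing documents versus answering queries.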

Experiments: An exciting proof-of-concept

NaturalQuestions
• 307k question-answer pairs.
• Each QA pair comes with a corresponding Wikipedia article (document).
• We construct 3 subsets from NQ (to understand how corpus size affects DSI).
  ◦ 10k docs, 100k docs, and the full 307k docs.
• During training:
  ◦ The model sees all docs in the subset.
  ◦ Validation query-doc examples are not seen by the model.

Baselines
• BM25 (implemented in gensim)
• T5 Dual Encoders
  ◦ For fairest comparison.
  ◦ Trained with contrastive learning using the T5X framework.
  ◦ Documents are retrieved using MIPS.
• Zero-shot (Unsupervised)
  ◦ No labeled query-doc pairs.
  ◦ BM25
  ◦ Raw T5 embeddings for computing similarity.
  ◦ SentenceT5 (Ni et al. 2021)
    ▪ SOTA for unsupervised sentence similarity learning.

NaturalQuestions

Zero-shot Retrieval
• How well can our model do without seeing supervised data?
• Here, supervised data refers to:
  ◦ Query-doc pairs.
  ◦ In DSI, this means the model is not trained at all on the retrieval task.

Effect of Indexing Strategies
• Doc => Docid works the best.
  ◦ Bidirectional (both doc => docid and the reverse) works okay, but not better.
• Docid => Doc gets 0% retrieval accuracy (does not work at all).
• Span corruption + docid does not work at all (0% accuracy).
• TL;DR: The design of the indexing task is crucial, but simple works best here.

Effect of Doc Representation
• Direct indexing works best.
• Going to a high sequence length hurts performance. A length of 32/64 tokens is sufficient for NQ.
  ◦ Caveat: this might not hold for all datasets down the line.

Scaling Laws

Conclusion
• We present a new paradigm for retrieval.
• DSI rethinks the traditional IR paradigm.
• It has multiple advantages:
  ◦ All aspects of search are now mapped to well-known ML problems.
    ▪ No more discrete MIPS.
    ▪ Easy to unify with natural language processing applications.
    ▪ May be convenient for training retrieval-augmented models.
• New avenues in prompt tuning.

Thank you! Questions?
