Transformer Memory as a Differentiable Search Index / Yi Tay (Google Research)

March 01, 2022

In this talk, I will discuss our latest work from Google AI, the “differentiable search index” (DSI). DSI demonstrates that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. DSI is a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.

Yi Tay is a Senior Research Scientist and Tech Lead at Google AI. Yi is mainly an ML/NLP researcher with a keen focus on Transformer models. Yi’s research work has earned him the ICLR 2021 Best Paper Award, the WSDM 2020 Best Paper Award (runner-up), and the WSDM 2021 Best Paper Award (runner-up). He also sometimes serves as Area Chair or Senior PC for top-tier conferences. Before joining Google, Yi earned his PhD from NTU Singapore, where he also won the best thesis award. To date, Yi has published quite a lot of papers but is now more interested in retweets than peer-reviewed papers. Homepage: https://vanzytay.github.io/

Slides link (via Speakerdeck): https://speakerdeck.com/wingnus/transformer-memory-as-a-differentiable-search-index
YouTube Video recording: https://youtu.be/27rNqGrTdSI
Related Link: https://wing-nus.github.io/nlp-seminar/speaker-yi




  1. Yi Tay, Research Scientist, Google: Transformer Memory as a Differentiable Search Index
  2. Premise
      • Search/IR has primarily been about learning a scoring function s(d, q) and ranking documents by it.
      • Dual encoders are the well-established SOTA.
  3. Dual Encoders
      • Project doc and query into a low-dimensional space and compute similarity.
      • Trained using contrastive learning.
      • MIPS is then used to find nearest neighbours.
      • MIPS is discrete.
        ◦ Cannot be trained E2E.
      • What about cross attention?
        ◦ Typically used in re-rankers.
        ◦ Expensive.
          ▪ Docs can’t be cached.
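
To make the dual-encoder pipeline on this slide concrete, here is a minimal sketch in plain Python, assuming toy embedding vectors in place of a trained encoder. The function names `in_batch_contrastive_loss` and `brute_force_mips` are illustrative and not from the talk. Using in-batch negatives is one common contrastive setup assumed here, not necessarily the one used for the baselines, and exhaustive scoring stands in for the approximate MIPS index a real system would use.

```python
import math


def dot(u, v):
    """Inner product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, doc_vecs):
    """Softmax cross-entropy where doc i is the positive for query i
    and every other doc in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) for d in doc_vecs]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


def brute_force_mips(query_vec, doc_vecs, k=2):
    """Exact maximum inner product search: indices of the top-k docs."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda j: dot(query_vec, doc_vecs[j]),
                    reverse=True)
    return ranked[:k]


# Toy vectors (assumed, not real embeddings).
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
top = brute_force_mips([0.1, 0.9, 0.2], docs, k=2)  # [1, 2]
```

The top-k selection in `brute_force_mips` is the discrete step: gradients do not flow through the sort, which is why the retrieval stage sits outside end-to-end training.
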
  4. What if we could parameterize a search system with just a single Transformer? 🤯
      [Diagram labels: Query → Complex Retrieval System → Result Set; Transformer 😎]
  5. But how?
      • How do we formulate this problem?
      • Why hasn’t this been done before?
        ◦ Because it’s extremely hard!
      • How do we frame retrieval as generation?
        ◦ Can we decode documents?
      • How can we train this?
        ◦ How can we index documents?
        ◦ How can we retrieve them?
      • Are you sure we’re just going to do this with a single model?
  6. Problems with Existing Retrieval Paradigms
      • Ranking is a cumbersome problem.
        ◦ We train with a pointwise/pairwise loss, but at inference we have to score all (q, d) pairs for evaluation.
        ◦ IR eval is painful.
      • Retrieval systems that rely on MIPS have a discrete, non-differentiable component.
        ◦ You’ll always have to spin up a fast vector search server to compute nearest neighbours quickly.
        ◦ Goes against the paradigm of unified models (T5, UnifiedQA, etc.).
  7. Towards Next Generation IR: Differentiable Search Index

  8. DSI / Differentiable Search Index
      • A Transformer Encoder-Decoder parameterizes the search system.
      • All information about the corpus is stored in the model.
      • We introduce an “indexing” task to allow DSI to ingest documents.
        ◦ All aspects of retrieval are now mapped to well-known ML problems.
      • We then introduce a “retrieval” task to retrieve from Transformer memory (parameters).
      • Indexing and retrieval are co-trained.
  9. DSI Core Tasks
      • Indexing
        ◦ Document tokens => Docid
      • Retrieval
        ◦ Query tokens => Docid
      • The two tasks are distinguished with a prompt token (is this a retrieval or indexing task?)
        ◦ As per T5 style.
      • But what are docids?
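
The two core tasks can be sketched as text-to-text examples that share one output space. This is a minimal illustration, assuming a plain-text task prefix plays the role of the prompt token. The prefixes `index:` and `retrieve:`, the helper names, and the sample document and query are all hypothetical.

```python
def make_indexing_example(doc_tokens, docid):
    """Indexing task: document tokens => docid."""
    return {"inputs": "index: " + " ".join(doc_tokens), "targets": docid}


def make_retrieval_example(query, docid):
    """Retrieval task: query tokens => docid."""
    return {"inputs": "retrieve: " + query, "targets": docid}


# Hypothetical toy data; the docid is an arbitrary identifier string.
index_ex = make_indexing_example(
    ["paris", "is", "the", "capital", "of", "france"], "<doc_198384>")
retrieve_ex = make_retrieval_example(
    "what is the capital of france", "<doc_198384>")
```

Because both tasks emit the same docid string, co-training them lets a query land on the identifier that the indexing task bound to the document.
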
  10. Docids
      • Naive String Docids
        ◦ We assign random docids to documents, e.g., <doc_198384>.
        ◦ This is treated as a string and tokenized with the default tokenizer.
        ◦ The model generates this with beam search.
      • Atomic Docids
        ◦ We construct an additional doc embedding matrix with a predefined budget (e.g., 300k docs).
        ◦ This is like a subword vocabulary.
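
The difference between the two docid schemes can be sketched as follows. This is an illustration under stated assumptions: the helper names, the `id_space` range, and the base vocabulary size of 32,000 are hypothetical choices, and only the `<doc_...>` string format and the 300k budget come from the slide.

```python
import random


def assign_naive_docids(num_docs, id_space=1_000_000, seed=0):
    """Naive string docids: unique random integers rendered as strings.

    The strings are later split into ordinary subword tokens by the
    model's default tokenizer and generated token by token.
    """
    rng = random.Random(seed)
    return [f"<doc_{n}>" for n in rng.sample(range(id_space), num_docs)]


def assign_atomic_docids(num_docs, base_vocab_size, budget=300_000):
    """Atomic docids: one new output token per document.

    Each document gets a single token id placed after the existing
    vocabulary, up to the predefined budget.
    """
    if num_docs > budget:
        raise ValueError("corpus exceeds the predefined docid budget")
    return [base_vocab_size + i for i in range(num_docs)]


naive = assign_naive_docids(3)
atomic = assign_atomic_docids(3, base_vocab_size=32_000)  # [32000, 32001, 32002]
```

A naive docid costs several decoding steps per document, while an atomic docid costs one step but adds one embedding row per document.
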
  11. Beam Searching Document Clusters
      • An advanced method that leverages corpus information to assign docids.
        ◦ The resulting docids are like naive docids, but semantically informed.
  12. Document Representation
      • How to represent document tokens.
      • Direct indexing
        ◦ Take the document “as it is”.
      • Set indexing
        ◦ Permutation invariant. Removes duplicated words and stopwords.
      • Inverted indexing
        ◦ Sliding window of chunks.
        ◦ Take turns to associate docids with different parts of the document.
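
The three document representations can be sketched on a token list. This is a minimal illustration, not the authors' preprocessing: the tiny `STOPWORDS` set, the function names, and the `stride` parameter are assumptions, and the sketch pairs every chunk with the same docid instead of modelling how chunks take turns during training.

```python
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # illustrative only


def direct_indexing(tokens, max_len):
    """Take the document as it is: keep the first max_len tokens."""
    return tokens[:max_len]


def set_indexing(tokens, max_len):
    """Drop stopwords and repeated words, keeping first occurrences."""
    seen, kept = set(), []
    for tok in tokens:
        if tok in STOPWORDS or tok in seen:
            continue
        seen.add(tok)
        kept.append(tok)
    return kept[:max_len]


def inverted_indexing(tokens, chunk_len, stride=None):
    """Slide a window over the document; each chunk maps to the same docid."""
    stride = stride or chunk_len
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), stride)]


doc = "the cat sat on the mat and the cat slept".split()
docid = "<doc_198384>"
examples = [(chunk, docid) for chunk in inverted_indexing(doc, chunk_len=4)]
```

With `chunk_len=4` the ten-token toy document yields three (chunk, docid) training pairs, whereas direct and set indexing each yield one.
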
  13. Indexing Task Options
      • Seq2seq tasks
        ◦ Doc tokens -> docid
        ◦ Docid -> doc tokens
        ◦ Bidirectional (both doc tokens -> docid and docid -> doc tokens)
        ◦ Span corruption
          ▪ Concatenate docids with the document and do regular T5-style pretraining.
      • RQ: Which is best for learning the doc -> docid association?
  14. Model, Training and Optimization
      • The main model is a simple Encoder-Decoder Transformer initialized from a T5 model.
      • We optimize indexing and retrieval together with multi-task learning.
      • We also explored indexing first for K steps and then learning retrieval.
        ◦ This did worse, but that may be due to optimization issues.
      • The interplay is tricky, and the ratio between the indexing and retrieval tasks is important.
        ◦ Setting the ratio well can improve performance by 2x or even 3x.
      • Optimization is tricky for DSI. Getting this right seems paramount for this new paradigm.
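
The ratio between indexing and retrieval examples can be sketched as a sampling probability. This is a minimal sketch, assuming examples are drawn independently at each step. The function name `mix_tasks`, the parameter `index_to_retrieval_ratio`, and the ratio value used below are hypothetical and not taken from the talk.

```python
import random


def mix_tasks(indexing_examples, retrieval_examples,
              index_to_retrieval_ratio, num_steps, seed=0):
    """Yield (task, example) pairs with roughly ratio:1 indexing to retrieval."""
    rng = random.Random(seed)
    p_index = index_to_retrieval_ratio / (index_to_retrieval_ratio + 1)
    for _ in range(num_steps):
        if rng.random() < p_index:
            yield "index", rng.choice(indexing_examples)
        else:
            yield "retrieve", rng.choice(retrieval_examples)


# Hypothetical toy data and ratio.
indexing = [("doc text A", "<doc_1>"), ("doc text B", "<doc_2>")]
retrieval = [("query about A", "<doc_1>")]
stream = list(mix_tasks(indexing, retrieval,
                        index_to_retrieval_ratio=4, num_steps=1000))
num_index = sum(1 for task, _ in stream if task == "index")
```

Treating the ratio as a single knob makes it easy to sweep, which matters given how sensitive the results are reported to be to this setting.
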
  15. An exciting proof-of-concept: Experiments

  16. NaturalQuestions
      • 307k question-answer pairs.
      • Each QA pair comes with a corresponding Wikipedia article (document).
      • We construct 3 subsets from NQ (to understand how corpus size affects DSI):
        ◦ 10k docs, 100k docs, and the full 307k docs.
      • During training:
        ◦ The model sees all docs in the subset.
        ◦ Validation query-doc examples are not seen by the model.
  17. Baselines
      • BM25 (implemented in gensim)
      • T5 Dual Encoders
        ◦ For fairest comparison.
        ◦ Trained with contrastive learning using the T5X framework.
        ◦ Documents are retrieved using MIPS.
      • Zero-shot (Unsupervised)
        ◦ No labeled query-doc pair data.
        ◦ BM25
        ◦ Raw T5 embeddings for computing similarity.
        ◦ SentenceT5 (Ni et al. 2021)
          ▪ SOTA for unsupervised sentence similarity learning.
  18. NaturalQuestions

  19. Zero-shot Retrieval
      • How well can our model do without seeing supervised data?
      • Here supervised data refers to:
        ◦ Query-doc pairs.
        ◦ In DSI, the model is not trained at all on retrieval.
  20. Effect of Indexing Strategies
      • Doc => Docid works the best.
        ◦ Bidirectional (both doc => docid and the reverse) works okay but not better.
      • Docid => Doc gets 0% retrieval accuracy (does not work at all).
      • Span corruption + docid does not work at all (0% accuracy).
      • TL;DR: the design of the indexing task is crucial, but simple works best here.
  21. Effect of Doc Representation
      • Direct indexing works best.
      • Going to a higher sequence length hurts performance. 32/64 is sufficient for NQ.
        ◦ Caveat: this might not hold for all datasets down the line.
  22. Scaling Laws

  23. Conclusion
      • We present a new paradigm for retrieval.
      • DSI rethinks the traditional IR paradigm.
      • It has multiple advantages:
        ◦ All aspects of search are now mapped to well-known ML problems.
          ▪ No more discrete MIPS.
          ▪ Easy to unify with natural language processing applications.
          ▪ May be convenient for training retrieval-augmented models.
      • New avenues in prompt tuning.
  24. Thank you! Questions?