Transformer Memory as a Differentiable Search Index / Yi Tay (Google Research)

wing.nus
March 01, 2022

ABSTRACT:
In this talk, I will discuss our latest work from Google AI, the “differentiable search index” (DSI). DSI demonstrates that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. DSI is a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.

BIO-DATA:
Yi Tay is a Senior Research Scientist and Tech Lead at Google AI. Yi is mainly an ML/NLP researcher with a keen focus on Transformer models. Yi's research has earned him the ICLR 2021 Best Paper Award, the WSDM 2020 Best Paper Award (runner-up), and the WSDM 2021 Best Paper Award (runner-up). He also sometimes serves as Area Chair or Senior PC for top-tier conferences. Before joining Google, Yi earned his PhD from NTU Singapore, where he also won the best thesis award. To date, Yi has published quite a lot of papers but is now more interested in retweets than peer-reviewed papers. Homepage: https://vanzytay.github.io/

Slides link (via Speakerdeck): https://speakerdeck.com/wingnus/transformer-memory-as-a-differentiable-search-index
YouTube Video recording: https://youtu.be/27rNqGrTdSI
Related Link: https://wing-nus.github.io/nlp-seminar/speaker-yi


Transcript

  1. Yi Tay
    Research Scientist, Google
    Transformer Memory as a
    Differentiable Search Index

  2. Premise
    ● Search/IR has primarily been about learning a scoring function s(d, q)
    and ranking documents by it.
    ● Dual encoders are the well-established SOTA.

  3. Dual Encoders
    ● Project doc and query into a low-dimensional space and compute
    similarity.
    ● Trained using contrastive learning.
    ● MIPS is then used to find nearest neighbours.
    ● MIPS is discrete.
    ○ Cannot be trained E2E.
    ● What about cross attention?
    ○ Typically used in re-rankers.
    ○ Expensive.
    ■ Docs can’t be cached.
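The pipeline above can be sketched in a few lines. The `encode` function below is a random stand-in for a trained Transformer tower (an assumption for illustration), and brute-force inner products stand in for an approximate MIPS index:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(texts, dim=8):
    # Stand-in for a learned dual-encoder tower (e.g., a T5 encoder
    # trained with a contrastive loss); here just random vectors.
    return rng.standard_normal((len(texts), dim))

docs = ["doc a", "doc b", "doc c"]
doc_embs = encode(docs)                # computed offline; can be cached

query_emb = encode(["some query"])[0]  # computed at query time

# MIPS: score every doc by inner product and take the top-k.
scores = doc_embs @ query_emb
topk = np.argsort(-scores)[:2]         # discrete argsort: not trainable E2E
```

The final argsort is the discrete step the slide refers to: gradients cannot flow through the nearest-neighbour selection, so the retrieval stage cannot be trained end-to-end.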

  4. What If?
    We could parameterize a search system with just a single Transformer? 🤯
    (Diagram: Query → Complex Retrieval System → Result Set, with the whole
    system replaced by a single Transformer 😎)

  5. But how?
    ● How do we formulate this problem?
    ● Why hasn’t this been done before?
    ○ Because it’s extremely hard!
    ● How do we frame retrieval as generation?
    ○ Can we decode documents?
    ● How can we train this?
    ○ How can we index documents?
    ○ How can we retrieve them?
    ● Are you sure we’re just going to do this with a single model?

  6. Problems with Existing Retrieval paradigms
    ● Ranking is a cumbersome problem.
    ○ We train with a pointwise/pairwise loss, but at inference we have to
    score all (q, d) pairs for evaluation.
    ○ IR eval is painful.
    ● Retrieval systems that rely on MIPS have a discrete, non-differentiable component.
    ○ You’ll always have to spin up a fast vector search server to compute
    fast nearest neighbours.
    ○ Goes against the paradigm of unified models (T5, UnifiedQA etc).

  7. Towards Next Generation IR
    Differentiable Search Index

  8. DSI / Differentiable Search Index
    ● Transformer Encoder-Decoder to parameterize a search system.
    ● All information about the corpus is stored in the model.
    ● We introduce an “indexing” task to allow DSI to ingest documents.
    ○ All aspects of retrieval are now mapped to well-known ML
    problems.
    ● We then introduce a “retrieval” task to retrieve from Transformer
    memory (parameters).
    ● Indexing and Retrieving are co-trained.

  9. DSI Core Tasks
    ● Indexing
    ○ Document tokens => Docid
    ● Retrieval
    ○ Query tokens => Docid
    ● Distinguished with a prompt token (is this a retrieval or indexing task?)
    ○ As per T5 style.
    ● But what are docids?
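In T5's text-to-text style, both tasks reduce to (input string, target string) pairs that differ only in their task prompt. The prompt strings below are illustrative assumptions, not the exact ones from the paper:

```python
# Hypothetical prompt strings; the real ones may differ.
def indexing_example(doc_tokens: str, docid: str):
    # Indexing task: document tokens => docid
    return (f"index document: {doc_tokens}", docid)

def retrieval_example(query: str, docid: str):
    # Retrieval task: query tokens => docid
    return (f"retrieve query: {query}", docid)

inp, tgt = indexing_example("transformers are sequence models ...", "4721")
q_inp, q_tgt = retrieval_example("what are transformers?", "4721")
```

Both tasks share the same target vocabulary of docids, so one decoder serves both.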

  10. Docids
    ● Naive String DocIds
    ○ We assign random docids to documents.
    ○ This is treated as a string and tokenized with the default tokenizer.
    ○ The model generates this with beam search.
    ● Atomic Docids
    ○ We construct an additional doc embedding matrix with a
    predefined budget (e.g., 300k docs).
    ○ Each atomic docid gets its own token, like an entry in a subword vocab.
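A minimal sketch of the two docid flavours, using a toy docid and toy vocabulary (both assumptions): a naive string docid is just text for the default tokenizer, while an atomic docid is a single dedicated entry in an extra output embedding matrix:

```python
docid = "4721"

# Naive string docid: tokenized like any other string (character-level here,
# a toy stand-in for the real subword tokenizer), decoded with beam search.
string_target = list(docid)

# Atomic docid: one dedicated token per document, drawn from an additional
# embedding matrix with a predefined budget (e.g., 300k docs).
DOC_BUDGET = 300_000
atomic_vocab = {f"doc_{i}": i for i in range(DOC_BUDGET)}
atomic_target = [atomic_vocab[f"doc_{docid}"]]  # a single token id
```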

  11. Beam Searching Document Clusters
    ● An advanced method that leverages corpus information to assign docids.
    They are like naive string docids, but semantically informed: similar
    documents share docid prefixes.
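The idea can be sketched by recursively clustering document embeddings and appending the chosen cluster index at each level, so that semantically similar documents end up sharing docid prefixes. The median split below is a toy stand-in for the real clustering step (an assumption for illustration):

```python
import numpy as np

def assign_docids(embs, ids, prefix="", leaf_size=2):
    # Leaf: finish each docid with the document's position in the cluster.
    if len(ids) <= leaf_size:
        return {doc: prefix + str(i) for i, doc in enumerate(ids)}
    # Toy "clustering": split on the highest-variance embedding dimension.
    axis = embs.var(axis=0).argmax()
    median = np.median(embs[:, axis])
    left = embs[:, axis] <= median
    out = {}
    out.update(assign_docids(embs[left],
                             [d for d, m in zip(ids, left) if m],
                             prefix + "0", leaf_size))
    out.update(assign_docids(embs[~left],
                             [d for d, m in zip(ids, left) if not m],
                             prefix + "1", leaf_size))
    return out

rng = np.random.default_rng(0)
embs = rng.standard_normal((8, 4))  # stand-in for learned doc embeddings
docids = assign_docids(embs, [f"doc{i}" for i in range(8)])
```

Because docids are built prefix-by-prefix, beam search over the docid string effectively descends the cluster tree one level per decoding step.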

  12. Document Representation
    ● How should we represent document tokens?
    ● Direct indexing.
    ○ Take the document “as it is”
    ● Set indexing
    ○ Permutation invariant. Removes duplicated words and stopwords.
    ● Inverted indexing
    ○ Sliding window of chunks.
    ○ Take turns to associate docids to different parts of the document.
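A rough sketch of the three representations on toy token lists; the stopword list, window size, and truncation length are illustrative choices, not the paper's exact settings:

```python
STOPWORDS = {"the", "a", "of", "is"}  # toy stopword list

def direct_indexing(tokens, max_len=64):
    # Take the document "as it is", truncated to the first max_len tokens.
    return tokens[:max_len]

def set_indexing(tokens):
    # Drop duplicate words and stopwords (permutation-invariant content).
    seen, out = set(), []
    for t in tokens:
        if t not in seen and t not in STOPWORDS:
            seen.add(t)
            out.append(t)
    return out

def inverted_indexing(tokens, window=4, stride=4):
    # Sliding chunks; each chunk takes a turn being paired with the docid.
    return [tokens[i:i + window] for i in range(0, len(tokens), stride)]

doc = "the cat sat on the mat the cat".split()
```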

  13. Indexing Task Options
    ● Seq2seq tasks
    ○ Doc tokens -> docid
    ○ Docid -> Doc tokens
    ○ Bidirectional (both doc->docid and docid->doctokens)
    ○ Span corruption
    ■ Concatenate docids with doc tokens and do regular T5-style pretraining
    ● RQ: Which is the best to learn doc->docid association?
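The four variants can be written down as (input, target) pairs; the toy strings below are illustrative:

```python
doc_tokens, docid = "transformers are sequence models", "4721"

inputs2target = (doc_tokens, docid)             # doc tokens -> docid
target2inputs = (docid, doc_tokens)             # docid -> doc tokens
bidirectional = [inputs2target, target2inputs]  # train both directions

# Span corruption variant: concatenate the docid with the doc tokens and
# apply regular T5-style span-corruption pretraining to the combined text.
span_corruption_text = f"{docid} {doc_tokens}"
```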

  14. Model, Training and Optimization
    ● The main model is a simple Encoder-Decoder Transformer initialized
    from a pretrained T5 model.
    ● We optimize indexing and retrieval together with multi-task learning.
    ● We also explored indexing first for K steps, then learning retrieval.
    ○ This did worse, but that may be due to optimization issues.
    ● Interplay is tricky and ratio between indexing and retrieval task is
    important.
    ○ Setting the ratio well can double or even triple performance.
    ● Optimization is tricky for DSI. Getting this right seems paramount for
    this new paradigm.
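Multi-task co-training can be sketched as ratio-controlled sampling between the two tasks; the 32:1 ratio below is purely illustrative, not the paper's setting:

```python
import random

def sample_task(ratio_indexing=32, ratio_retrieval=1, rng=random.Random(0)):
    # Pick the task for one training example according to the mixing ratio.
    total = ratio_indexing + ratio_retrieval
    return "indexing" if rng.random() < ratio_indexing / total else "retrieval"

# Over many samples, roughly 32 indexing examples per retrieval example.
batch = [sample_task() for _ in range(1000)]
```

Tuning this single knob is one of the "tricky optimization" choices the slide mentions.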

  15. An exciting proof-of-concept
    Experiments

  16. NaturalQuestions
    ● 307k Question-Answer pairs.
    ● Each QA pair comes with corresponding Wikipedia article (document).
    ● We construct 3 subsets from NQ (to understand how corpus size affects
    DSI)
    ○ 10k docs, 100k docs, and the full 307k docs.
    ● During training:
    ○ The model sees all docs in the subset.
    ○ Validation query-doc examples are not seen by the model.

  17. Baselines
    ● BM25 (implemented in gensim)
    ● T5 Dual Encoders
    ○ For the fairest comparison.
    ○ Trained with contrastive learning using T5X framework.
    ○ Documents are retrieved using MIPS.
    ● Zero-shot (Unsupervised)
    ○ No query-doc pair labeled data.
    ○ BM25
    ○ Raw T5 Embeddings for computing similarity.
    ○ SentenceT5 (Ni et al. 2021)
    ■ SOTA for unsupervised sentence similarity learning.
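For reference, a from-scratch sketch of the BM25 scoring used as a baseline (the deck uses gensim's implementation; this minimal version uses the common k1/b defaults and a standard smoothed idf):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    # docs: list of tokenized documents; query: list of query tokens.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["deep", "learning"], ["bm25", "ranking", "function"], ["cat"]]
scores = bm25_scores(["bm25", "ranking"], docs)
```

Unlike the learned baselines, BM25 needs no query-doc training pairs, which is why it also appears in the zero-shot comparison.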

  18. NaturalQuestions

  19. Zero-shot Retrieval
    ● How well can our model do without seeing supervised data?
    ● Here supervised data refers to:
    ○ Query-doc pairs.
    ○ In DSI, the model is not trained at all on retrieval.

  20. Effect of Indexing Strategies
    ● Doc => DocId works the best
    ○ Bidirectional (both doc=>docid and the reverse) works okay but not
    better.
    ● Docid => Doc gets 0% retrieval (does not work at all)
    ● Span corruption + Docid does not work at all (0% accuracy).
    ● TL;DR: the design of the indexing task is crucial, but simple works best here.

  21. Effect of Doc Representation
    ● Direct Indexing works best.
    ● Going to high sequence lengths hurts performance; 32/64 tokens is sufficient for NQ.
    ○ Caveat: might not hold for all datasets down the line.

  22. Scaling Laws

  23. Conclusion
    ● We present a new paradigm for retrieval.
    ● DSI rethinks the traditional IR paradigm.
    ● Has multiple advantages:
    ○ All aspects of search are now mapped to well-known ML problems.
    ■ No more discrete MIPS
    ■ Easy to unify with natural language processing applications.
    ■ May be convenient to train retrieval augmented models.
    ● New avenues in prompt tuning

  24. Thank you!
    Questions?
