
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

wing.nus
June 10, 2021


Open-Domain Question Answering is the task of answering natural language questions with short factual answers. These questions are not accompanied by evidence, and can be from an open set of domains. Models must understand questions, search for and assemble evidence necessary to answer the question, and then generate an answer. Models which directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared to conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models lack the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available training QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically-generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) whilst retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.


Transcript

  1. PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them.
     Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler,
     Aleksandra Piktus, Pontus Stenetorp, Sebastian Riedel
  2. Facebook AI. "ODQA is emerging as a benchmark method of measuring systems'
     abilities to read, represent, and retrieve knowledge expressed in all of the
     documents on the web." (EfficientQA Organizers, NeurIPS 2020)
     Open-Domain Question Answering (ODQA) examples:
     Q: who has the right of way in international waters? A: Neither Vessel
     Q: when was puerto rico added to the usa? A: 1950
     Q: who's hosting the super bowl in 2019? A: Atlanta, Georgia
     Q: how many seasons are there of grey's anatomy? A: 14
     Q: who plays the voice of maui in moana? A: dwayne johnson
  3. Two families of ODQA model, illustrated with "Q: last time la dodgers won
     the world series? A: 1988":
     • "Retrieve-and-read": at train and test time, a retriever (e.g. DPR,
       TF-IDF) searches Wikipedia and feeds a reader (e.g. BERT, RAG, FiD).
       Accurate and interpretable, but slow and large.
     • "Closed-book QA": a seq2seq model (e.g. T5, BART, GPT-{1,2,3}) trained
       on QA pairs by minimizing -log P(a|q). Faster and smaller, but less
       accurate and a black box.
  4. Overview. Generation: (1) a passage selector, (2) a question generator,
     (3) an answer extractor, and (4) global filtering run over Wikipedia to
     produce PAQ, 65M probably-asked questions. Retrieval: RePAQ (BART) matches
     a test question such as "who was the film chariots of fire about" against
     PAQ and training data, returning the single best-matched QA pair (e.g.
     "Q: who was the main character in chariots of fire A: Eric Liddell"); when
     confidence is low, it backs off to FiD (e.g. for "who was originally cast
     to play indiana jones"). PAQ + RePAQ enables highly accurate, efficient,
     fast, interpretable, and well-calibrated QA: 48% EM on NQ, up to 1000s of
     questions per second, +10% over SOTA at 50% coverage, 2x faster, and a
     winner at EfficientQA 2020.
  5. Why generate PAQ in the first place?
  6. Question Answering Competencies:
     1. "Question Memorization" - recall answers to questions seen at training
        time. Train Q: who's hosting the super bowl in 2019? A: Atlanta, Georgia.
        Test Q: where will the super bowl be in 2019? A: Atlanta, Georgia
     2. "Answer Classification" - answer novel questions at test time with
        answers seen at training time. Train Q: who's hosting the super bowl in
        2019? A: Atlanta, Georgia. Test Q: who hosted the 1996 Olympic games?
        A: Atlanta, Georgia
     3. "QA Generalization" - answer novel test questions with novel answers.
        Train Q: who's hosting the super bowl in 2019? A: Atlanta, Georgia.
        Test Q: who plays the voice of maui in moana? A: dwayne johnson
  7. Question Answering Competencies. Roughly 60% of test questions only need
     "Answer Classification" to answer correctly, and roughly 30% only need
     "Question Memorization". Answer overlap with the training set:
     WebQuestions 58%, TriviaQA 72%, Natural Questions 64%. Question overlap:
     Natural Questions 33%, TriviaQA 34%, WebQuestions 28%.
  8. Question Answering Competencies - examples of test/train question overlap:
     Answer             | Test Question                                      | Train Question
     Jason Marsden      | Who plays Max' voice in a goofy movie              | Who does max voice in a goofy movie
     Alan Shearer       | Who has scored more goals in the premiere league   | Most goals scored by a premier league player
     Francisco Pizarro  | Who led the conquest of the incas in south america | Conquistador who defeated the incan empire in peru
  9. How well do models do on question competencies? [Chart: NaturalQuestions
     Exact Match by competency] RAG (retrieve-and-read): 44% overall, with 71%
     on Question Memorization, 35% on Answer Classification, and 25% on QA
     Generalization. BART (closed-book QA): 27% overall, with 68% on Question
     Memorization, 10% on Answer Classification, and 1% on QA Generalization.
  10. A QA-pair retriever over a QA database (just the training set): given
      "Q: last time la dodgers won the world series?", the retriever returns
      the stored pair "Q: when is the last time the dodgers won a world series
      A: 1988".
  11. A failure case: given "Q: who sings i don't wanna miss a thing", the
      retriever returns the near-miss pair "Q: who wrote i don't wanna miss a
      thing A: Diane Warren".
  12. QA-pair retriever - RePAQ. Adding a reranker fixes this: for "Q: who
      sings i don't wanna miss a thing", the retriever returns candidates
      "Q: who wrote i don't wanna miss a thing A: Diane Warren", "Q: who sang
      i don't wanna miss a thing first A: Aerosmith", and "Q: movie with
      i don't want to miss a thing A: Armageddon"; the reranker then selects
      Aerosmith.
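The retrieve-then-rerank idea above can be sketched in a few lines. This is a toy illustration only: a bag-of-words cosine scorer stands in for RePAQ's learned dense retriever, and a Jaccard scorer with a tiny hand-written synonym table (an assumption for illustration) stands in for its cross-encoder reranker.

```python
import math
from collections import Counter

# Toy QA database (stand-in for PAQ / the training set).
QA_PAIRS = [
    ("who wrote i don't wanna miss a thing", "Diane Warren"),
    ("who sang i don't wanna miss a thing first", "Aerosmith"),
    ("movie with i don't want to miss a thing", "Armageddon"),
]

def embed(text):
    """Bag-of-words vector; a stand-in for a dense question encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=3):
    """Stage 1: the k stored QA pairs whose questions look most similar."""
    q = embed(query)
    return sorted(QA_PAIRS, key=lambda qa: cosine(q, embed(qa[0])), reverse=True)[:k]

# Illustrative synonym table so 'sings'/'sang' match but 'wrote' does not;
# RePAQ learns such distinctions with a cross-encoder instead.
SYNONYMS = {"sings": "sing", "sang": "sing", "sung": "sing"}

def rerank(query, candidates):
    """Stage 2: re-score candidates with a finer-grained (here, toy) scorer."""
    qtoks = {SYNONYMS.get(t, t) for t in query.lower().split()}
    def score(qa):
        ctoks = {SYNONYMS.get(t, t) for t in qa[0].lower().split()}
        return len(qtoks & ctoks) / len(qtoks | ctoks)  # Jaccard similarity
    return max(candidates, key=score)
```

With these toy scorers, the retriever's top hit for "who sings i don't wanna miss a thing" is the "who wrote" pair, and the reranker corrects it to the Aerosmith pair, mirroring the slide's example.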
  13. How well do models do on question competencies? [Chart: NaturalQuestions
      Exact Match by competency] Overall: RAG 46%, BART CBQA 27%, QA-pair
      retriever (training set only) 31%.
  14. Aim: generate QA pairs at scale to:
      • Pre-empt and cache questions we may be asked at test time
      • Convert QA Generalization and Answer Classification questions into
        Question Memorization questions
      • An alternative view: reduce open-domain QA to community QA
  15. Expanding the coverage of the QA-pair KB -> PAQ
  16. PAQ: Probably-Asked Questions. Increase the coverage of QA pairs by
      generating probable QA pairs offline, at scale. Pipeline: a Passage
      Selector (RoBERTa) picks passages from Wikipedia (e.g. "The Tomb of
      Absalom, also called Absalom's Pillar, is an ancient monumental rock-cut
      tomb [...] contains a burial chamber with three burial sites"); an Answer
      Extractor (RoBERTa) proposes answer spans ("three"); a Question Generator
      (BART) writes questions; and a Consistency Filter (a QA model) keeps
      pairs such as "Q: how many burial sites are in the tomb of Absalom
      A: three".
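The four generation stages can be wired together as below. This is a minimal sketch of the control flow only: trivial heuristics (digit spans, a question template, a caller-supplied `qa_model` callable) stand in for the RoBERTa selector/extractor, the BART generator, and the consistency-checking QA model.

```python
def select_passages(corpus):
    """Passage selector: keep passages likely to yield good QA pairs.
    Toy heuristic (the real selector is a learned RoBERTa model):
    keep passages that contain a number."""
    return [p for p in corpus if any(c.isdigit() for c in p)]

def extract_answers(passage):
    """Answer extractor: propose candidate answer spans.
    Toy heuristic: numeric tokens only; the real extractor also
    proposes entities and other spans."""
    return [tok for tok in passage.split() if tok.isdigit()]

def generate_question(passage, answer):
    """Question generator: a toy template standing in for a BART model
    conditioned on (answer, passage)."""
    return f"what number appears in the passage beginning '{passage.split()[0]}'?"

def consistent(question, answer, qa_model):
    """Consistency filter: keep the pair only if a QA model, asked the
    generated question, reproduces the intended answer."""
    return qa_model(question) == answer

def build_paq(corpus, qa_model):
    """Run selection -> extraction -> generation -> filtering over a corpus."""
    pairs = []
    for passage in select_passages(corpus):
        for answer in extract_answers(passage):
            question = generate_question(passage, answer)
            if consistent(question, answer, qa_model):
                pairs.append((question, answer))
    return pairs
```

Run over a two-passage toy corpus with a stub `qa_model`, only the numeric passage survives selection and only the pair the QA model can re-answer survives filtering.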
  17. Pipeline example. Passage ranking selects the passage: "Book of a
      Thousand Days is a 2007 young adult fantasy novel by Shannon Hale. It is
      based on the Brothers Grimm fairy tale Maid Maleen. Dashti, a mucker from
      steppes of the Eight Realms, begins a diary as she looks for a job after
      her mother dies of illness. Eventually, she finds and accepts a position
      as the new maid of Lady Saren, the youngest child of the lord of Titor's
      Garden. Saren has defied her father's declaration that she will marry
      Lord Khasar of Thoughts of Under and revealed that she is engage…"
      Answer extraction proposes: Shannon Hale; Maid Maleen; Lord Khasar; 2007;
      Lady Saren; maid of Lady Saren; steppes of the Eight Realms.
      Question generation produces: who wrote the book of a thousand days; who
      is the book of a thousand days based on; who does lady saren marry in
      book of a thousand days; when was book of a thousand days written; who is
      the maid in book of a thousand days; who does dashti play in book of a
      thousand days; where does book of a thousand days take place; who does
      she marry in book of a thousand days? Filtering then prunes the
      generated pairs.
  18. PAQ - Probably-Asked Questions: 65 million QA pairs (650x the size of
      NaturalQuestions), generated from 1B words of Wikipedia (about 50% of
      Wikipedia).
  19. Question Answering Results

  20. [Chart: Exact Match score on Open-Natural Questions for BART-large CBQA,
      T5-11B+SSM CBQA, DensePhrases, RAG, FiD-large, RePAQ, and RePAQ +
      FiD-large backoff]
  21. Global QA-pair filtering matters:
      • No filter: keep all generated QA pairs
      • Local filter: check generated questions are consistent using an MRC
        model
      • Global filter: ensure generated questions are consistent using an ODQA
        model
      [Chart: Exact Match score improves from No Filter to Local Filter to
      Global Filter]
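The local/global distinction above can be sketched as follows. This is an illustrative reading of the slide: the local (MRC) check answers from the source passage alone, while the global (ODQA) check answers with the whole corpus in scope, which additionally rejects questions that become ambiguous once other passages are visible. The toy "readers" here just pick years out of text.

```python
def local_filter(question, answer, source_passage, mrc_answer):
    """Local filter: an MRC model answers the question from the source
    passage only; keep the pair if it recovers the intended answer."""
    return mrc_answer(question, source_passage) == answer

def global_filter(question, answer, odqa_answer):
    """Global filter: an ODQA model answers with the whole corpus in scope,
    so pairs whose answer changes under wider evidence are rejected."""
    return odqa_answer(question) == answer

# Toy corpus and toy models for illustration.
corpus = {
    "p1": "The Dodgers won the World Series in 1988.",
    "p2": "The Dodgers won the World Series again in 2020.",
}

def mrc_answer(question, passage):
    # Toy reader: return the first year mentioned in the passage.
    return next(t.strip(".") for t in passage.split() if t.strip(".").isdigit())

def odqa_answer(question):
    # Toy open-domain reader: scan all passages, prefer the latest year.
    return max(mrc_answer(question, p) for p in corpus.values())
```

A pair like ("when did the dodgers last win the world series", "1988"), generated from p1, passes the local check but fails the global one, since open-domain evidence points to 2020.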
  22. More globally-filtered questions ➞ better results:
      • More questions per answer span gives better scores
      • Combining QA pairs from different generators is better
      • Empirically, RePAQ always improves with more globally-filtered QA pairs
      [Chart: Exact Match score for 1 Q/A, 4 Q/A, and + diverse models]
  23. A closer look at RePAQ

  24. Selective QA: refuse to answer when confidence is low. [Chart: accuracy
      (%) vs fraction of questions answered (%) for RePAQ and FiD]
  25. Selective QA: refuse to answer when confidence is low. RePAQ + FiD
      backoff: RePAQ retrieves its best match from the 65M probably-asked
      questions (e.g. "Q: who played indiana jones in the original A: Harrison
      Ford"); if confidence is high, that answer is returned, otherwise the
      question is passed to FiD. This gives the best of both speed and
      accuracy.
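The backoff logic is a single confidence gate. A minimal sketch, assuming `repaq` returns an (answer, confidence) pair, `fid` returns an answer, and a threshold tuned on dev data (0.75 here is an arbitrary illustrative value):

```python
def answer_with_backoff(question, repaq, fid, threshold=0.75):
    """Selective QA: answer with the fast QA-pair retriever when it is
    confident, otherwise back off to the slower but more accurate reader.
    Returns (answer, which_model_answered)."""
    answer, confidence = repaq(question)
    if confidence >= threshold:
        return answer, "repaq"
    return fid(question), "fid"
```

With stub models mirroring the slide's examples, a well-cached question is answered cheaply by RePAQ, while a poorly matched one (RePAQ's low-confidence "Harrison Ford" guess for "who was originally cast to play indiana jones") is routed to FiD.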
  26. Inference speed:
      Model                      | Retriever | Reranker | Exact Match | Qs/sec
      FiD-large                  | -         | -        | 51.4        | 0.5
      RePAQ                      | base      | -        | 40.9        | 1400
      RePAQ                      | xlarge    | -        | 41.5        | 800
      RePAQ                      | base      | base     | 45.7        | 55
      RePAQ                      | xlarge    | xxlarge  | 47.6        | 6
      RePAQ + FiD-large backoff  |           |          | 52.3        | 1
  27. CBQA BART-large struggles to memorize PAQ. [Chart: NaturalQuestions
      Exact Match by competency (Question Memorization, Answer Classification,
      QA Generalization) for BART w/ NQ, BART w/ NQ+PAQ + final NQ finetune,
      RePAQ w/ NQ, and RePAQ w/ NQ+PAQ]
  28. Efficient QA

  29. EfficientQA Competition:
      • Develop a QA system that contains all of the knowledge required to
        answer open-domain questions
      • The knowledge could be in documents, databases, the parameters of a
        neural network, or any other form
      • Encourage systems that store and access knowledge using the smallest
        number of bytes, including code, corpora, and model parameters
  30. EfficientQA Competition (concretely): "Build a self-contained QA system
      docker image, submit it to our server, and we'll ask it 1800 hidden,
      newly-annotated questions." Four tracks/prizes:
      1. The smallest system that achieves >25% accuracy
      2. The most accurate system under 500MB
      3. The most accurate system under 6GB
      4. The highest-scoring system (no constraints)
  31. EfficientQA Competition (concretely): system size is measured as the
      size of the docker image before evaluation begins; disk space used
      during evaluation is not counted. Systems run on a 16-core machine with
      100GB RAM and 2 GPUs, and are allowed 6 hours to evaluate (12 seconds
      per question). The real task here is therefore how to compress
      "knowledge" and the mechanisms to access it, not building the lightest,
      fastest, or most efficient system.
  32. Implementing a tiny QA system:
      • Database: 140K QA pairs (as few as possible); build the index on the fly
      • Retriever: BERT-base -> TF-IDF (-220MB)
      • Reranker: BERT-base -> ALBERT-base (-200MB)
      • Debian -> Alpine Linux (-110MB)
      • PyTorch-CPU -> TFLite (-99MB)
      • Python -> C++ (-65MB)
      • Multi-stage builds, compression, optimization (-30MB)
      *All models stored as fp16; system accuracy dropped with Int8
      quantization.
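The tiny system's first stage, TF-IDF matching of an incoming question against the stored QA database with the index built on the fly, can be sketched in pure Python. This is an assumption-laden illustration: a minimal smoothed-idf cosine matcher, not the actual competition code (which pairs this with an ALBERT reranker and ships in fp16 / TFLite).

```python
import math
from collections import Counter

class TinyQA:
    """Toy TF-IDF nearest-question lookup over a small QA database."""

    def __init__(self, qa_pairs):
        self.qa_pairs = qa_pairs
        n = len(qa_pairs)
        df = Counter()                     # document frequency per term
        for question, _ in qa_pairs:
            df.update(set(question.lower().split()))
        # Smoothed idf so terms occurring in every question still score > 0.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        # Index built on the fly at startup, as in the tiny system.
        self.vecs = [self._vec(q) for q, _ in qa_pairs]

    def _vec(self, text):
        tf = Counter(text.lower().split())
        return {t: c * self.idf.get(t, 0.0) for t, c in tf.items()}

    @staticmethod
    def _cos(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def answer(self, question):
        """Return the answer of the stored QA pair whose question best
        matches the input question."""
        qv = self._vec(question)
        best = max(range(len(self.qa_pairs)),
                   key=lambda i: self._cos(qv, self.vecs[i]))
        return self.qa_pairs[best][1]
```

Even this crude matcher returns the right stored answer when a test question paraphrases a cached one, which is exactly the regime PAQ is designed to expand.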
  33. Implementing a tiny QA system:
      • Accuracy: 26.8%
      • Final size: 28MB (reranker 21M, QA database 1.6M, bash 0.7M,
        TFLite 0.4M, tokenizer 0.2M)
      • The same size as: 1 image in RAW, 7 Bibles, 20 floppy disks, or 90
        seconds of YouTube video
      • 16x smaller than the next-smallest entry
  34. Facebook AI 35 28Mb system visualized as an image:

  35. Implementing a 500MB system - scale the approach back up!
      • Still not enough space for GPU drivers or PyTorch
      • Limited by the time limit, not model size
      • Implement inference in NumPy
      • Database: 2.4M QA pairs
      • Retriever: replace TF-IDF with a neural model (+22MB)
      • Reranker: ALBERT-base -> ALBERT-large (+14MB)
      • Final size: 336MB (2nd-smallest model submitted)
      • Accuracy: 33.4%
      • Outperforms models with 100x more parameters
  36. [Chart: comparison of Ours-tiny, Ours-500MB, and Ours-Unconstrained
      against REALM, T5-XL, and T5-base under the 500MB and 6GB size
      constraints]
  37. Related work.
      MRC/extractive QA: Hirschman et al. 1999; Rajpurkar et al. 2016; Joshi
      et al. 2017; Kwiatkowski et al. 2019, inter alia.
      Open-domain QA: Voorhees and Tice 1999; Chen et al. 2017, inter alia.
      Neural memory models: Graves et al. 2014; Weston et al. 2015; Sukhbaatar
      et al. 2015; Graves et al. 2016, inter alia.
      Knowledge-grounded dialogue: Weston et al. 2018; Dinan et al. 2019,
      inter alia.
      Non-parametric memory models: Grave et al. 2017; Khandelwal et al. 2020,
      inter alia.
      REALM: Guu et al. 2020.
      Recent parametric memory literature: Petroni et al. 2019; Roberts et al.
      2020, inter alia.
      Recent ODQA literature: Lee et al. 2019; Karpukhin et al. 2020; Izacard
      and Grave 2020, inter alia.
  38. Collaborators: Aleksandra Piktus, Heinrich Küttler, Pasquale Minervini,
      Yuxiang Wu, Sebastian Riedel, Pontus Stenetorp, Linqing Liu, + UCLNLP
      and FAIR