Efficient Passage Retrieval with Hashing for Open-domain Question Answering (ACL 2021)

Ikuya Yamada

May 20, 2022

Transcript

  1. Open-domain Question Answering

    The task of answering arbitrary factoid questions.
    Diagram: Question ("Who wrote the novel "I Am a Cat"?") → Open-domain QA model, with a knowledge base (e.g., Wikipedia) → Answer: Sōseki Natsume
  2. Retriever-Reader Architecture

    A pipelined approach to extract an answer from a knowledge base.
    Diagram: Question: Who wrote the novel "I Am a Cat"?
  3. Retriever-Reader Architecture

    A pipelined approach to extract an answer from a knowledge base.
    Diagram: Question → Retriever, over a knowledge base (e.g., Wikipedia) → Top-k relevant passages
  4. Retriever-Reader Architecture

    A pipelined approach to extract an answer from a knowledge base.
    Diagram: Question → Retriever, over a knowledge base (e.g., Wikipedia) → Top-k relevant passages → Reader → Answer: Sōseki Natsume
    The Retriever and the Reader together form the open-domain QA model.
  5. Dense Passage Retriever (DPR) (Karpukhin et al., 2020)

    State-of-the-art retriever commonly used in open-domain QA models.
    Two independent BERT models are used to encode passages and questions.
    Diagram: Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; Relevance score: inner product of the two vectors
  6. Extreme Memory Cost of DPR

    The size of each passage vector:
    4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes
  7. Extreme Memory Cost of DPR

    The size of each passage vector:
    4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes
    The index of the entire 21M English Wikipedia passages:
    3072 bytes * 21,000,000 ≈ 65GB
  8. Extreme Memory Cost of DPR

    The size of each passage vector:
    4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes
    The index of the entire 21M English Wikipedia passages:
    3072 bytes * 21,000,000 ≈ 65GB
    This index must be stored in memory!!
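The arithmetic on slides 6 to 8 can be checked with a few lines of Python. This is only a sketch of the calculation; the constant names are illustrative, and the values are the ones quoted on the slides.

```python
BYTES_PER_FLOAT32 = 4
DIMENSIONS = 768            # BERT output dimensions (from the slide)
NUM_PASSAGES = 21_000_000   # English Wikipedia passages (from the slide)

bytes_per_vector = BYTES_PER_FLOAT32 * DIMENSIONS
index_bytes = bytes_per_vector * NUM_PASSAGES

print(bytes_per_vector)             # 3072
print(round(index_bytes / 1e9, 1))  # 64.5 (decimal GB; the slide rounds this to 65GB)
```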
  9. Binary Passage Retriever (BPR)

    BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.
  10. Binary Passage Retriever (BPR)

    BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.
    Key Approaches:
    • Learning-to-hash that learns a hash function to convert continuous vectors to binary codes in an end-to-end manner
  11. Binary Passage Retriever (BPR)

    BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.
    Key Approaches:
    • Learning-to-hash that learns a hash function to convert continuous vectors to binary codes in an end-to-end manner
    • Two-stage approach consisting of candidate generation and reranking
  12. DPR Architecture

    State-of-the-art retriever based on two independent BERT encoders.
    • Two independent BERT models are used to encode questions and passages
    Diagram: Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; Relevance score: inner product of the two vectors
  13. DPR Architecture

    State-of-the-art retriever based on two independent BERT encoders.
    • Two independent BERT models are used to encode questions and passages
    • The relevance score of a passage given a question is computed using the inner product
    Diagram: as on slide 12
  14. DPR Architecture

    State-of-the-art retriever based on two independent BERT encoders.
    • Two independent BERT models are used to encode questions and passages
    • The relevance score of a passage given a question is computed using the inner product
    • Top-k passages are obtained by nearest neighbor search over all passage vectors
    Diagram: as on slide 12
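A minimal sketch of the scoring and search described on slides 12 to 14, assuming the passage vectors have already been produced by an encoder. The function names and the toy 3-dimensional vectors are illustrative and not taken from the DPR code; a real system would search 21M vectors with a nearest neighbor library rather than a Python loop.

```python
import heapq

def inner_product(u, v):
    """Relevance score of a passage given a question."""
    return sum(a * b for a, b in zip(u, v))

def top_k_passages(question_vec, passage_vecs, k):
    """Exhaustive search: score every passage, return the indices of the k best."""
    scored = [(inner_product(question_vec, p), i) for i, p in enumerate(passage_vecs)]
    return [i for _, i in heapq.nlargest(k, scored)]

# Toy vectors standing in for encoder outputs.
question = [0.2, -0.7, 0.3]
passages = [
    [1.1, -0.3, 0.1],   # score about 0.46
    [-0.5, 0.9, 0.0],   # score about -0.73
    [0.4, -1.0, 0.8],   # score about 1.02
]
print(top_k_passages(question, passages, k=2))  # [2, 0]
```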
  15. Model: Hash Layer

    • A hash layer is placed on top of each encoder
    • The continuous vector computed by each encoder is converted to a binary code
    • The hash layer is implemented using the sign function, applied element-wise to the encoder output
    Diagram: Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d → Hash layer → [1, -1, …, 1] ∈ {-1, 1}^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d → Hash layer → [1, -1, …, 1] ∈ {-1, 1}^d; Candidate generation: Hamming distance; Reranking: inner product
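The hash layer on slide 15 can be illustrated with a short sketch. Two details below are assumptions rather than something the slide states: zero is mapped to +1, and the {-1, 1} code is packed into an integer with +1 stored as bit 1. The function names are hypothetical.

```python
def sign_hash(vector):
    """Element-wise sign. Mapping 0 to +1 is an assumption made for this sketch."""
    return [1 if x >= 0 else -1 for x in vector]

def pack_bits(code):
    """Pack a {-1, 1} code into an integer, storing +1 as bit 1 and -1 as bit 0."""
    packed = 0
    for value in code:
        packed = (packed << 1) | (1 if value == 1 else 0)
    return packed

code = sign_hash([1.1, -0.3, 0.1])
print(code)                  # [1, -1, 1]
print(bin(pack_bits(code)))  # 0b101
```

Packing is what makes the index compact: each dimension costs one bit instead of a 4-byte float.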
  16. Model: Approximating Sign Function Using Scaled Tanh Function

    Problem: The sign function is incompatible with backpropagation.
    Solution: During training, we approximate the sign function using the differentiable scaled tanh function tanh(βx) (Cao et al. 2017), where β is increased at every training step.
    Plot: green: sign(x); blue: tanh(βx); training step: 0 (β=1, γ=0.1)
  17. Model: Approximating Sign Function Using Scaled Tanh Function

    Problem and solution: as on slide 16.
    Plot: green: sign(x); blue: tanh(βx); training step: 240 (β=5, γ=0.1)
  18. Model: Approximating Sign Function Using Scaled Tanh Function

    Problem and solution: as on slide 16.
    Plot: green: sign(x); blue: tanh(βx); training step: 990 (β=10, γ=0.1)
  19. Model: Approximating Sign Function Using Scaled Tanh Function

    Problem and solution: as on slide 16.
    Plot: green: sign(x); blue: tanh(βx); training step: 8990 (β=30, γ=0.1)
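The transcript does not show how β depends on the training step. The sketch below assumes the schedule β = sqrt(γ · step + 1), which is a reconstruction: it reproduces every (step, β) pair shown on slides 16 to 19 with γ = 0.1, but should be checked against the paper before being relied on. The function names are illustrative.

```python
import math

GAMMA = 0.1  # value shown on the slides

def beta_at(step, gamma=GAMMA):
    """Assumed schedule; it matches the (step, β) pairs shown on the slides."""
    return math.sqrt(gamma * step + 1)

def soft_sign(x, step):
    """Differentiable stand-in for sign(x) that is used during training."""
    return math.tanh(beta_at(step) * x)

for step in (0, 240, 990, 8990):
    print(step, round(beta_at(step), 3), round(soft_sign(0.1, step), 3))
```

As β grows from 1 to 30, tanh(βx) for a fixed nonzero x moves toward +1 or -1, so the relaxed codes seen during training approach the binary codes used at inference.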
  20. Model: Two-stage Approach of Candidate Generation and Reranking

    Diagram: Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d → Hash layer → [1, -1, …, 1] ∈ {-1, 1}^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d → Hash layer → [-1, 1, …, 1] ∈ {-1, 1}^d; Candidate generation: Hamming distance; Reranking: inner product
  21. Model: Two-stage Approach of Candidate Generation and Reranking

    Candidate Generation:
    • A small number of candidates are obtained efficiently based on Hamming distance
      ◦ Question: binary code
      ◦ Passage: binary code
    Diagram: as on slide 20
  22. Model: Two-stage Approach of Candidate Generation and Reranking

    Candidate Generation:
    • A small number of candidates are obtained efficiently based on Hamming distance
      ◦ Question: binary code
      ◦ Passage: binary code
    Reranking:
    • The candidates are re-ranked based on the expressive inner product
      ◦ Question: continuous vector
      ◦ Passage: binary code
    Diagram: as on slide 20
  23. Model: Two-stage Approach of Candidate Generation and Reranking

    Candidate Generation:
    • A small number of candidates are obtained efficiently based on Hamming distance
      ◦ Question: binary code
      ◦ Passage: binary code
    Reranking:
    • The candidates are re-ranked based on the expressive inner product
      ◦ Question: continuous vector
      ◦ Passage: binary code
    Diagram: as on slide 20, with reranking marked as more expressive
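The two stages on slides 20 to 23 can be sketched as follows. This is an illustration under simplifying assumptions, not the BPR implementation: codes are plain Python lists rather than packed bits, candidates are found by sorting rather than with a binary index, and all names and toy values are hypothetical.

```python
def sign_hash(vector):
    return [1 if x >= 0 else -1 for x in vector]

def hamming_distance(code_a, code_b):
    """Number of positions where the two binary codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(question_vec, passage_codes, num_candidates, k):
    question_code = sign_hash(question_vec)
    # Stage 1, candidate generation: binary question code vs. binary passage codes.
    by_distance = sorted(
        range(len(passage_codes)),
        key=lambda i: hamming_distance(question_code, passage_codes[i]),
    )
    candidates = by_distance[:num_candidates]
    # Stage 2, reranking: continuous question vector vs. binary passage codes.
    reranked = sorted(
        candidates,
        key=lambda i: inner_product(question_vec, passage_codes[i]),
        reverse=True,
    )
    return reranked[:k]

question = [0.2, -0.7, 0.3, 0.1]
codes = [
    [1, -1, 1, 1],    # Hamming distance 0, inner product about 1.3
    [-1, -1, 1, 1],   # Hamming distance 1, inner product about 0.9
    [1, -1, 1, -1],   # Hamming distance 1, inner product about 1.1
    [-1, 1, -1, -1],  # Hamming distance 4, inner product about -1.3
]
print(retrieve(question, codes, num_candidates=3, k=2))  # [0, 2]
```

Passages 1 and 2 are tied on Hamming distance; the inner product with the continuous question vector separates them, which is the sense in which reranking is more expressive.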
  24. Model: Multi-task Training

    BPR is trained by simultaneously optimizing two loss functions:
    • Ranking loss for candidate generation
    • Cross-entropy loss for reranking
    The final loss function combines the two.
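The loss formulas on slide 24 are not present in the transcript. As a rough sketch only, the code below shows one common way to write a margin-based ranking loss and a softmax cross-entropy loss over inner-product scores. The hinge form, the margin value, and the plain sum used to combine the losses are all assumptions made for illustration and may differ from the paper.

```python
import math

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def ranking_loss(q_code, pos_code, neg_codes, margin):
    """Hinge loss: the positive should outscore each negative by at least `margin`."""
    pos_score = inner_product(q_code, pos_code)
    return sum(
        max(0.0, margin - (pos_score - inner_product(q_code, neg)))
        for neg in neg_codes
    )

def cross_entropy_loss(q_vec, pos_code, neg_codes):
    """Negative log-probability of the positive passage under a softmax over scores."""
    scores = [inner_product(q_vec, pos_code)]
    scores += [inner_product(q_vec, neg) for neg in neg_codes]
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[0]

def total_loss(q_vec, q_code, pos_code, neg_codes, margin=2.0):
    # The margin value and the unweighted sum are hypothetical choices.
    return ranking_loss(q_code, pos_code, neg_codes, margin) + cross_entropy_loss(
        q_vec, pos_code, neg_codes
    )

q_vec = [0.2, -0.7]
q_code = [1, -1]
positive = [1, -1]
negatives = [[-1, 1], [1, 1]]
print(round(total_loss(q_vec, q_code, positive, negatives), 4))
```

During training the codes would be the relaxed tanh outputs rather than hard ±1 values, so that both terms remain differentiable.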
  25. Experiments: Comparison with DPR

    • BPR achieves similar or better performance than DPR when k ≥ 20
    Figures: Retrieval recall rates on Natural Questions; retrieval recall rates on TriviaQA
  26. Experiments: Comparison with DPR

    • BPR achieves similar or better performance than DPR when k ≥ 20
    • BPR significantly reduces the computational cost of DPR
      → Index size: 65GB -> 2GB
      → Query time: 457ms -> 38ms
    Figures: Retrieval recall rates on Natural Questions; retrieval recall rates on TriviaQA
  27. Experiments: Comparison with DPR

    • BPR achieves similar or better performance than DPR when k ≥ 20
    • BPR significantly reduces the computational cost of DPR
      → Index size: 65GB -> 2GB
      → Query time: 457ms -> 38ms
    The recall at small k is less important: the reader usually takes k ≥ 20 passages.
    Figures: Retrieval recall rates on Natural Questions; retrieval recall rates on TriviaQA
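The 2GB figure is consistent with storing one bit per dimension. The sketch below assumes the binary code has the same length as the BERT output (768 bits), which matches the {-1, 1}^d notation on the slides but is not stated as a number there; the variable names are illustrative.

```python
CODE_BITS = 768             # assumed: one bit per BERT output dimension
NUM_PASSAGES = 21_000_000   # from the slides

bytes_per_code = CODE_BITS // 8
binary_index_bytes = bytes_per_code * NUM_PASSAGES

print(bytes_per_code)                      # 96
print(round(binary_index_bytes / 1e9, 1))  # 2.0 (decimal GB)
print(3072 // bytes_per_code)              # 32, the per-passage size reduction vs. float32
print(round(457 / 38, 1))                  # 12.0, the query-time speedup from the slide
```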
  28. Experiments: Comparison with Quantization Methods

    • BPR achieves significantly better performance than DPR combined with post-hoc quantization methods: product quantization (PQ) and locality-sensitive hashing (LSH)
    Figures: Retrieval recall rates on Natural Questions; retrieval recall rates on TriviaQA
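The recall figures referenced on slides 25 to 28 are not reproduced in the transcript. For orientation, here is a minimal sketch of a top-k retrieval recall computation, assuming the common definition: the fraction of questions for which at least one of the top-k passages contains an answer string. The case-insensitive substring match and the function name are simplifications chosen for this sketch.

```python
def recall_at_k(ranked_passages_per_question, answers_per_question, k):
    """Fraction of questions with an answer string in at least one top-k passage."""
    hits = 0
    for passages, answers in zip(ranked_passages_per_question, answers_per_question):
        top_k = passages[:k]
        if any(ans.lower() in p.lower() for p in top_k for ans in answers):
            hits += 1
    return hits / len(ranked_passages_per_question)

# Toy data: two questions, each with two ranked passages.
ranked = [
    ["I Am a Cat is a novel by Natsume Sōseki.", "Cats are small mammals."],
    ["Paris is in France.", "Tokyo is the capital of Japan."],
]
gold = [["Natsume Sōseki"], ["Tokyo"]]
print(recall_at_k(ranked, gold, k=1))  # 0.5
print(recall_at_k(ranked, gold, k=2))  # 1.0
```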
  29. Experiments: End-to-end QA

    Exact match QA accuracy on Natural Questions (NQ) and TriviaQA (TQA):

    Model                   NQ    TQA
    BPR+extractive reader   41.6  56.8
    DPR+extractive reader   41.5  56.8

    The same BERT-based extractive reader is used for both models.
    • BPR achieves equivalent QA accuracy to DPR with substantially reduced computational cost
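Exact match is reported on slide 29 without a definition. The sketch below assumes the widely used SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) before comparing strings; the evaluation script actually used may differ in its details.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)

print(exact_match("The Sōseki Natsume.", ["Sōseki Natsume"]))  # True
print(exact_match("Natsume", ["Sōseki Natsume"]))              # False
```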
  30. Summary

    BPR significantly reduces the computational cost of state-of-the-art open-domain QA without a loss in accuracy.
    Paper: https://arxiv.org/abs/2106.00882
    Code & Model: https://github.com/studio-ousia/bpr
    Twitter: @ikuyamada