
# Efficient Passage Retrieval with Hashing for Open-domain Question Answering (ACL 2021)

May 20, 2022

## Transcript

1. ### Efficient Passage Retrieval with Hashing for Open-domain Question Answering

Ikuya Yamada, Akari Asai, Hanna Hajishirzi (Studio Ousia)

2. ### Open-domain QA Model

Question: Who wrote the novel "I Am a Cat"? Answer: Sōseki Natsume. Knowledge base (e.g., Wikipedia)
3. ### Retriever-Reader Architecture

A pipelined approach to extract an answer from a knowledge base
4. ### Retriever-Reader Architecture

A pipelined approach to extract an answer from a knowledge base. Question: Who wrote the novel "I Am a Cat"?
5. ### Retriever-Reader Architecture

A pipelined approach to extract an answer from a knowledge base. The Retriever takes the question "Who wrote the novel "I Am a Cat"?" and returns the top-k relevant passages from the knowledge base (e.g., Wikipedia).
6. ### Retriever-Reader Architecture

A pipelined approach to extract an answer from a knowledge base. The Retriever takes the question "Who wrote the novel "I Am a Cat"?" and returns the top-k relevant passages from the knowledge base (e.g., Wikipedia); the Reader extracts the answer "Sōseki Natsume" from them. Together they form the open-domain QA model.
7. ### Dense Passage Retriever (DPR) (Karpukhin et al., 2020)

State-of-the-art retriever commonly used in open-domain QA models. Two independent BERT models are used to encode passages and questions. (Diagram: the Passage Encoder and Question Encoder produce vectors [1.1, -0.3, …, 0.1] ∈ ℝᵈ and [0.2, -0.7, …, 0.3] ∈ ℝᵈ; the relevance score is their inner product.)
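The scoring step above can be sketched with plain NumPy, using toy 4-dimensional vectors in place of the 768-dimensional BERT encoder outputs (the encoders themselves are omitted):

```python
import numpy as np

# Toy stand-ins for the BERT encoder outputs (d=4 instead of 768).
question_vec = np.array([0.2, -0.7, 0.1, 0.3])
passage_vecs = np.array([
    [1.1, -0.3, 0.2, 0.1],   # passage 0
    [-0.5, 0.8, -0.1, 0.4],  # passage 1
])

# DPR relevance score: inner product between question and passage vectors.
scores = passage_vecs @ question_vec
top_passage = int(np.argmax(scores))
print(scores, top_passage)
```

In the real system the passage vectors are precomputed once for the whole corpus, and only the question is encoded at query time.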
8. ### Extreme Memory Cost of DPR

The size of each passage vector: 4 bytes × 768 dimensions (BERT output dimensions) = 3072 bytes (float32)
9. ### Extreme Memory Cost of DPR

The size of each passage vector: 4 bytes × 768 dimensions (BERT output dimensions) = 3072 bytes (float32). The index of all 21M English Wikipedia passages: 3072 bytes × 21,000,000 ≈ 65GB
10. ### Extreme Memory Cost of DPR

The size of each passage vector: 4 bytes × 768 dimensions (BERT output dimensions) = 3072 bytes (float32). The index of all 21M English Wikipedia passages: 3072 bytes × 21,000,000 ≈ 65GB. This index must be stored in memory!
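The arithmetic on these slides can be checked directly. The BPR figure assumes the 768-bit binary codes introduced later in the talk (96 bytes per passage) and decimal gigabytes:

```python
# Back-of-envelope index sizes for DPR vs. BPR, following the slides.
dpr_bytes_per_passage = 4 * 768          # float32: 4 bytes x 768 dims = 3072
n_passages = 21_000_000                  # English Wikipedia passages

dpr_index_gb = dpr_bytes_per_passage * n_passages / 1e9
print(f"DPR index: {dpr_index_gb:.1f} GB")   # ~65 GB

# BPR stores one bit per dimension: 768 bits = 96 bytes per passage.
bpr_bytes_per_passage = 768 // 8
bpr_index_gb = bpr_bytes_per_passage * n_passages / 1e9
print(f"BPR index: {bpr_index_gb:.1f} GB")   # ~2 GB
```

This 32x reduction per vector is exactly the 65GB → 2GB index shrinkage reported in the experiments.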
11. ### Binary Passage Retriever (BPR)

BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.
12. ### Binary Passage Retriever (BPR)

BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.

Key Approaches:
- Learning-to-hash, which learns a hash function that converts continuous vectors to binary codes in an end-to-end manner
13. ### Binary Passage Retriever (BPR)

BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.

Key Approaches:
- Learning-to-hash, which learns a hash function that converts continuous vectors to binary codes in an end-to-end manner
- A two-stage approach consisting of candidate generation and reranking
14. ### DPR Architecture

State-of-the-art retriever based on two independent BERT encoders. Two independent BERT models are used to encode questions and passages. (Diagram as in slide 7.)
15. ### DPR Architecture

State-of-the-art retriever based on two independent BERT encoders. Two independent BERT models are used to encode questions and passages. The relevance score of a passage given a question is computed using the inner product. (Diagram as in slide 7.)
16. ### DPR Architecture

State-of-the-art retriever based on two independent BERT encoders. Two independent BERT models are used to encode questions and passages. The relevance score of a passage given a question is computed using the inner product. Top-k passages are obtained by nearest-neighbor search over all passage vectors. (Diagram as in slide 7.)
17. ### Model: Hash Layer

A hash layer is placed on top of each encoder; the continuous vector computed by each encoder is converted to a binary code. The hash layer is implemented using the sign function, h = sign(e), applied element-wise to the encoder output e. (Diagram: encoder outputs [1.1, -0.3, …, 0.1] ∈ ℝᵈ and [0.2, -0.7, …, 0.3] ∈ ℝᵈ pass through the hash layers to become codes [1, -1, …, 1] ∈ {-1, 1}ᵈ; Hamming distance is used for candidate generation and the inner product for reranking.)
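At inference time the hash layer is just an element-wise sign function; a minimal sketch (the tie-break for sign(0) is an implementation assumption):

```python
import numpy as np

def hash_layer(e: np.ndarray) -> np.ndarray:
    """Convert a continuous encoder output to a {-1, 1} binary code
    via the sign function (inference-time behavior of the hash layer)."""
    code = np.sign(e)
    code[code == 0] = 1   # assumed tie-break: map sign(0) to +1
    return code

e = np.array([1.1, -0.3, 0.0, 0.1])
print(hash_layer(e))      # [ 1. -1.  1.  1.]
```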
18. ### Model: Approximating Sign Function Using Scaled Tanh Function

Problem: the sign function is incompatible with back-propagation. Solution: during training, we approximate the sign function with the differentiable scaled tanh function tanh(βx) (Cao et al., 2017), where β is increased at every training step. (Plot: green: sign(x), blue: tanh(βx); training step 0, β=1, γ=0.1.)
19. ### Model: Approximating Sign Function Using Scaled Tanh Function

(Slides 19-21 repeat the previous slide with the plot at later training steps: step 240 (β=5), step 990 (β=10), and step 8990 (β=30), with γ=0.1 throughout; tanh(βx) approaches sign(x) as β grows.)
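The β values shown on the slides (β = 1, 5, 10, 30 at steps 0, 240, 990, 8990) are consistent with the schedule β = √(γ·step + 1) for γ = 0.1; treating that schedule as an inference from the plotted values, a minimal sketch:

```python
import math

GAMMA = 0.1  # growth rate, matching the gamma on the slides

def beta(step: int) -> float:
    # Inferred schedule: reproduces beta = 1, 5, 10, 30
    # at steps 0, 240, 990, 8990.
    return math.sqrt(GAMMA * step + 1)

def scaled_tanh(x: float, step: int) -> float:
    # Differentiable surrogate for sign(x) used during training;
    # as beta grows, tanh(beta * x) sharpens toward a step function.
    return math.tanh(beta(step) * x)

for step in (0, 240, 990, 8990):
    print(step, round(beta(step), 3), round(scaled_tanh(0.1, step), 3))
```

In training one would backpropagate through `scaled_tanh` while using the hard sign at inference time.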
22. ### Model: Two-stage Approach of Candidate Generation and Reranking

(Diagram: the question's binary code is matched against passage binary codes by Hamming distance for candidate generation; the question's continuous vector is matched against passage binary codes by inner product for reranking.)
23. ### Model: Two-stage Approach of Candidate Generation and Reranking

Candidate Generation:
- A small number of candidates is obtained efficiently based on Hamming distance (question: binary code; passage: binary code)

(Diagram as in slide 22.)
24. ### Model: Two-stage Approach of Candidate Generation and Reranking

Candidate Generation:
- A small number of candidates is obtained efficiently based on Hamming distance (question: binary code; passage: binary code)

Reranking:
- The candidates are re-ranked based on the expressive inner product (question: continuous vector; passage: binary code)

(Diagram as in slide 22.)
25. ### Model: Two-stage Approach of Candidate Generation and Reranking

Candidate Generation:
- A small number of candidates is obtained efficiently based on Hamming distance (question: binary code; passage: binary code)

Reranking:
- The candidates are re-ranked based on the more expressive inner product (question: continuous vector; passage: binary code)

(Diagram as in slide 22.)
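The two stages can be sketched with NumPy on random toy data (the dimensions, candidate count, and unpacked {-1, 1} code layout are illustrative assumptions; a real index packs codes into bits and uses specialized Hamming search):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_passages, n_candidates, top_k = 32, 1000, 50, 5

# Precomputed passage binary codes, plus the query's two representations.
passage_codes = np.sign(rng.standard_normal((n_passages, d)))
query_code = np.sign(rng.standard_normal(d))   # binary code of the question
query_vec = rng.standard_normal(d)             # continuous question vector

# Stage 1 - candidate generation: Hamming distance between binary codes.
hamming = np.count_nonzero(passage_codes != query_code, axis=1)
candidates = np.argsort(hamming)[:n_candidates]

# Stage 2 - reranking: candidates are re-scored with the inner product
# of the continuous query vector and their binary codes.
rerank_scores = passage_codes[candidates] @ query_vec
top_k_ids = candidates[np.argsort(-rerank_scores)[:top_k]]
print(top_k_ids)
```

Only the cheap binary comparison touches all passages; the more expressive continuous-vector scoring runs on the small candidate set.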
26. ### Model: Multi-task Training

BPR is trained by simultaneously optimizing two loss functions: a ranking loss for candidate generation (computed on binary-code scores) and a cross-entropy loss for reranking (computed on continuous-vector scores). The final loss function combines the two.
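The loss formulas on the slide are images, so the following is a schematic reconstruction: the margin value, the negative-sampling setup, and the exact score pairings are assumptions, not given in the transcript.

```python
import numpy as np

def bpr_loss(q_vec, q_code, pos_code, neg_codes, margin=2.0):
    """Schematic multi-task loss for one question (margin is an assumed value).

    q_vec:     continuous question vector, shape (d,)
    q_code:    question binary code in {-1, 1}^d
    pos_code:  binary code of the positive passage
    neg_codes: binary codes of negative passages, shape (n, d)
    """
    # Ranking loss for candidate generation: hinge loss on binary-code
    # scores, pushing the positive passage above each negative.
    pos_bin = q_code @ pos_code
    neg_bin = neg_codes @ q_code
    l_cand = np.maximum(0.0, margin - (pos_bin - neg_bin)).sum()

    # Cross-entropy loss for reranking: softmax over inner products of the
    # continuous question vector with the passage binary codes.
    scores = np.concatenate(([q_vec @ pos_code], neg_codes @ q_vec))
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())   # stable log-sum-exp
    l_rerank = log_z - scores[0]                   # positive is index 0

    # Final loss: the two objectives are optimized simultaneously.
    return l_cand + l_rerank

rng = np.random.default_rng(1)
d = 8
q_vec = rng.standard_normal(d)
q_code = np.sign(rng.standard_normal(d))
pos_code = np.sign(rng.standard_normal(d))
neg_codes = np.sign(rng.standard_normal((3, d)))
loss = bpr_loss(q_vec, q_code, pos_code, neg_codes)
print(loss)
```

During training the hard sign in the codes would be replaced by the scaled tanh surrogate so both losses are differentiable.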
27. ### Experiments: Comparison with DPR

- BPR achieves similar or better performance than DPR when k ≥ 20

(Figures: retrieval recall rates on Natural Questions and TriviaQA.)
28. ### Experiments: Comparison with DPR

- BPR achieves similar or better performance than DPR when k ≥ 20
- BPR significantly reduces the computational cost of DPR: index size 65GB → 2GB; query time 457ms → 38ms

(Figures: retrieval recall rates on Natural Questions and TriviaQA.)
29. ### Experiments: Comparison with DPR

- BPR achieves similar or better performance than DPR when k ≥ 20
- BPR significantly reduces the computational cost of DPR: index size 65GB → 2GB; query time 457ms → 38ms
- Recall at small k is less important: the reader usually takes k ≥ 20 passages

(Figures: retrieval recall rates on Natural Questions and TriviaQA.)
30. ### Experiments: Comparison with Quantization Methods

- BPR achieves significantly better performance than DPR combined with post-hoc quantization methods: product quantization (PQ) and LSH

(Figures: retrieval recall rates on Natural Questions and TriviaQA.)
31. ### Experiments: End-to-end QA

Exact match QA accuracy on Natural Questions and TriviaQA (the same BERT-based extractive reader is used for both models):

| Model | NQ | TQA |
| --- | --- | --- |
| BPR + extractive reader | 41.6 | 56.8 |
| DPR + extractive reader | 41.5 | 56.8 |

- BPR achieves QA accuracy equivalent to DPR with substantially reduced computational cost
32. ### Summary

BPR significantly reduces the computational cost of state-of-the-art open-domain QA without a loss in accuracy.

Paper: https://arxiv.org/abs/2106.00882
Code & Model: https://github.com/studio-ousia/bpr
Contact: [email protected] / @ikuyamada