Slide 1

Efficient Passage Retrieval with Hashing for Open-domain Question Answering
Ikuya Yamada (STUDIO OUSIA), Akari Asai, Hannaneh Hajishirzi

Slide 2

Open-domain Question Answering
The task of answering arbitrary factoid questions
[Diagram] Question: Who wrote the novel "I Am a Cat"? → Open-domain QA model (backed by a knowledge base, e.g., Wikipedia) → Answer: Sōseki Natsume

Slide 3

Retriever-Reader Architecture
A pipelined approach to extract an answer from a knowledge base

Slide 4

Retriever-Reader Architecture
A pipelined approach to extract an answer from a knowledge base
[Diagram] Question: Who wrote the novel "I Am a Cat"?

Slide 5

Retriever-Reader Architecture
A pipelined approach to extract an answer from a knowledge base
[Diagram] Question: Who wrote the novel "I Am a Cat"? → Retriever (over a knowledge base, e.g., Wikipedia) → Top-k relevant passages

Slide 6

Retriever-Reader Architecture
A pipelined approach to extract an answer from a knowledge base
[Diagram] Question: Who wrote the novel "I Am a Cat"? → Retriever (over a knowledge base, e.g., Wikipedia) → Top-k relevant passages → Reader → Answer: Sōseki Natsume; the whole pipeline forms the open-domain QA model

Slide 7

Dense Passage Retriever (DPR) (Karpukhin et al., 2020)
State-of-the-art retriever commonly used in open-domain QA models
● Two independent BERT models are used to encode passages and questions
[Diagram] Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; Relevance score = inner product

Slide 8

Extreme Memory Cost of DPR
The size of each passage vector:
4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes

Slide 9

Extreme Memory Cost of DPR
The size of each passage vector:
4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes
The index of the entire 21M English Wikipedia passages:
3072 bytes * 21,000,000 ≈ 65GB

Slide 10

Extreme Memory Cost of DPR
The size of each passage vector:
4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes
The index of the entire 21M English Wikipedia passages:
3072 bytes * 21,000,000 ≈ 65GB
This index must be stored in memory!
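The arithmetic above is easy to sanity-check (a minimal sketch; the 96-byte binary code per passage is an assumption, consistent with the ~2GB BPR index reported later in the talk):

```python
# Memory footprint of the DPR passage index, as computed on the slides.
bytes_per_vector = 4 * 768          # float32 bytes * BERT output dimensions
num_passages = 21_000_000           # English Wikipedia passages
dpr_index_gb = bytes_per_vector * num_passages / 1e9   # ≈ 64.5, i.e. ~65GB

# For comparison: a 768-bit binary code per passage (96 bytes), an
# assumption consistent with the ~2GB BPR index reported later.
bpr_index_gb = (768 // 8) * num_passages / 1e9
```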

Slide 11

Binary Passage Retriever (BPR)
BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors

Slide 12

Binary Passage Retriever (BPR)
BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors
Key Approaches:
● Learning-to-hash that learns a hash function to convert continuous vectors to binary codes in an end-to-end manner

Slide 13

Binary Passage Retriever (BPR)
BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors
Key Approaches:
● Learning-to-hash that learns a hash function to convert continuous vectors to binary codes in an end-to-end manner
● Two-stage approach consisting of candidate generation and reranking

Slide 14

DPR Architecture
State-of-the-art retriever based on two independent BERT encoders
● Two independent BERT models are used to encode questions and passages
[Diagram] Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; Relevance score = inner product

Slide 15

DPR Architecture
State-of-the-art retriever based on two independent BERT encoders
● Two independent BERT models are used to encode questions and passages
● The relevance score of a passage given a question is computed using the inner product
[Diagram] Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; Relevance score = inner product

Slide 16

DPR Architecture
State-of-the-art retriever based on two independent BERT encoders
● Two independent BERT models are used to encode questions and passages
● The relevance score of a passage given a question is computed using the inner product
● Top-k passages are obtained via nearest neighbor search over all passage vectors
[Diagram] Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; Relevance score = inner product
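The retrieval step above can be sketched in a few lines (a toy NumPy example with random vectors; real DPR encodes text with BERT and typically searches with an optimized index such as FAISS):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768                                                   # BERT output dimensions
passages = rng.normal(size=(1000, d)).astype(np.float32)  # toy passage index
question = rng.normal(size=d).astype(np.float32)          # toy question vector

# Relevance score = inner product between question and passage vectors.
scores = passages @ question

# Top-k passages via exact nearest neighbor search by inner product.
k = 20
top_k = np.argsort(-scores)[:k]
```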

Slide 17

Model: Hash Layer
● A hash layer is placed on top of each encoder
● The continuous vector computed by each encoder is converted to a binary code
● The hash layer is implemented using the sign function: h = sign(e)
[Diagram] Passage/Question → Encoder → continuous vector ∈ ℝ^d → Hash layer → binary code ∈ {-1, 1}^d; candidate generation via Hamming distance, reranking via inner product

Slide 18

Model: Approximating the Sign Function Using a Scaled Tanh Function
Problem: The sign function is incompatible with backpropagation
Solution: During training, we approximate the sign function using the differentiable scaled tanh function tanh(βx) (Cao et al., 2017), where β is increased at every training step
[Plot] green: sign(x), blue: tanh(βx); training step: 0 (β=1, γ=0.1)

Slide 19

Model: Approximating the Sign Function Using a Scaled Tanh Function
Problem: The sign function is incompatible with backpropagation
Solution: During training, we approximate the sign function using the differentiable scaled tanh function tanh(βx) (Cao et al., 2017), where β is increased at every training step
[Plot] green: sign(x), blue: tanh(βx); training step: 240 (β=5, γ=0.1)

Slide 20

Model: Approximating the Sign Function Using a Scaled Tanh Function
Problem: The sign function is incompatible with backpropagation
Solution: During training, we approximate the sign function using the differentiable scaled tanh function tanh(βx) (Cao et al., 2017), where β is increased at every training step
[Plot] green: sign(x), blue: tanh(βx); training step: 990 (β=10, γ=0.1)

Slide 21

Model: Approximating the Sign Function Using a Scaled Tanh Function
Problem: The sign function is incompatible with backpropagation
Solution: During training, we approximate the sign function using the differentiable scaled tanh function tanh(βx) (Cao et al., 2017), where β is increased at every training step
[Plot] green: sign(x), blue: tanh(βx); training step: 8990 (β=30, γ=0.1)
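The annealing above can be sketched as follows (a minimal NumPy sketch; the schedule β = √(γ·step + 1) is inferred here to match the (step, β) pairs shown on the slides, with γ = 0.1):

```python
import numpy as np

def beta(step, gamma=0.1):
    # Annealing schedule matching the (step, beta) pairs on the slides:
    # step 0 -> beta 1, 240 -> 5, 990 -> 10, 8990 -> 30.
    return np.sqrt(gamma * step + 1)

def soft_sign(x, step):
    # Differentiable surrogate used during training.
    return np.tanh(beta(step) * x)

def hard_sign(x):
    # Hash layer used at inference: maps each dimension to {-1, 1}.
    return np.where(x >= 0, 1.0, -1.0)

# As beta grows, tanh(beta * x) approaches sign(x).
x = np.array([-1.0, -0.2, 0.3, 1.0])
early, late = soft_sign(x, 0), soft_sign(x, 8990)
```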

Slide 22

Model: Two-stage Approach of Candidate Generation and Reranking
[Diagram] Passage → Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d → Hash layer → [1, -1, …, 1] ∈ {-1, 1}^d; Question → Question Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d → Hash layer → [-1, 1, …, 1] ∈ {-1, 1}^d; candidate generation: Hamming distance; reranking: inner product

Slide 23

Model: Two-stage Approach of Candidate Generation and Reranking
Candidate Generation:
● A small number of candidates are obtained efficiently based on Hamming distance
  ○ Question: binary code
  ○ Passage: binary code
[Diagram] Passage → Passage Encoder → continuous vector → Hash layer → binary code; Question → Question Encoder → continuous vector → Hash layer → binary code; candidate generation: Hamming distance; reranking: inner product

Slide 24

Model: Two-stage Approach of Candidate Generation and Reranking
Candidate Generation:
● A small number of candidates are obtained efficiently based on Hamming distance
  ○ Question: binary code
  ○ Passage: binary code
Reranking:
● The candidates are re-ranked based on the expressive inner product
  ○ Question: continuous vector
  ○ Passage: binary code
[Diagram] Passage → Passage Encoder → continuous vector → Hash layer → binary code; Question → Question Encoder → continuous vector → Hash layer → binary code; candidate generation: Hamming distance; reranking: inner product

Slide 25

Model: Two-stage Approach of Candidate Generation and Reranking
Candidate Generation:
● A small number of candidates are obtained efficiently based on Hamming distance
  ○ Question: binary code
  ○ Passage: binary code
Reranking:
● The candidates are re-ranked based on the inner product, which is more expressive
  ○ Question: continuous vector
  ○ Passage: binary code
[Diagram] Passage → Passage Encoder → continuous vector → Hash layer → binary code; Question → Question Encoder → continuous vector → Hash layer → binary code; candidate generation: Hamming distance; reranking: inner product
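The two-stage search can be sketched as follows (a toy NumPy example with random codes; a real implementation packs the codes into bits and uses an optimized binary index rather than dense {-1, 1} arrays):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 768, 1000
# Toy index of passage binary codes in {-1, 1}^d.
passage_codes = np.where(rng.normal(size=(n, d)) >= 0, 1.0, -1.0)
question_vec = rng.normal(size=d)                        # continuous vector
question_code = np.where(question_vec >= 0, 1.0, -1.0)   # binary code

# Stage 1 (candidate generation): Hamming distance between binary codes.
# For {-1, 1} codes, Hamming distance = (d - <q_code, p_code>) / 2.
hamming = (d - passage_codes @ question_code) / 2
num_candidates = 100
candidates = np.argsort(hamming)[:num_candidates]

# Stage 2 (reranking): inner product between the continuous question
# vector and the binary codes of the candidate passages only.
rerank_scores = passage_codes[candidates] @ question_vec
top_k = candidates[np.argsort(-rerank_scores)[:20]]
```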

Slide 26

Model: Multi-task Training
BPR is trained by simultaneously optimizing two loss functions
Ranking loss for candidate generation (over binary codes; h_q is the question code, h_p+ the positive passage code, h_pj− the j-th negative, α a margin):
L_cand = Σ_j max(0, −(⟨h_q, h_p+⟩ − ⟨h_q, h_pj−⟩) + α)
Cross-entropy loss for reranking (continuous question vector e_q against binary passage codes):
L_rerank = −log( exp(⟨e_q, h_p+⟩) / (exp(⟨e_q, h_p+⟩) + Σ_j exp(⟨e_q, h_pj−⟩)) )
The final loss function: L = L_cand + L_rerank
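The two losses can be sketched numerically as follows (a toy NumPy version; `bpr_loss` and the margin value are hypothetical illustration choices, not the authors' code):

```python
import numpy as np

def bpr_loss(e_q, h_q, h_pos, h_negs, alpha=2.0):
    """Toy multi-task loss: hinge ranking loss on binary codes plus
    softmax cross-entropy for reranking. Shapes: e_q, h_q, h_pos are
    (d,); h_negs is (n_negatives, d). alpha is a hypothetical margin."""
    # Ranking loss for candidate generation (binary code scores).
    pos_score = h_q @ h_pos
    neg_scores = h_negs @ h_q
    l_cand = np.maximum(0.0, -(pos_score - neg_scores) + alpha).sum()

    # Cross-entropy loss for reranking: continuous question vector
    # scored against the positive and negative binary passage codes.
    logits = np.concatenate(([e_q @ h_pos], h_negs @ e_q))
    logits -= logits.max()                     # numerical stability
    l_rerank = -np.log(np.exp(logits[0]) / np.exp(logits).sum())

    return l_cand + l_rerank
```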

Slide 27

Experiments: Comparison with DPR
● BPR achieves similar or better performance than DPR when k ≥ 20
[Figures] Retrieval recall rates on Natural Questions and TriviaQA

Slide 28

Experiments: Comparison with DPR
● BPR achieves similar or better performance than DPR when k ≥ 20
● BPR significantly reduces the computational cost of DPR
  → Index size: 65GB → 2GB
  → Query time: 457ms → 38ms
[Figures] Retrieval recall rates on Natural Questions and TriviaQA

Slide 29

Experiments: Comparison with DPR
● BPR achieves similar or better performance than DPR when k ≥ 20
● BPR significantly reduces the computational cost of DPR
  → Index size: 65GB → 2GB
  → Query time: 457ms → 38ms
● Recall at small k is less important: the reader usually takes k ≥ 20 passages
[Figures] Retrieval recall rates on Natural Questions and TriviaQA

Slide 30

Experiments: Comparison with Quantization Methods
● BPR achieves significantly better performance than DPR combined with post-hoc quantization methods: product quantization (PQ) and LSH
[Figures] Retrieval recall rates on Natural Questions and TriviaQA

Slide 31

Experiments: End-to-end QA
● BPR achieves equivalent QA accuracy to DPR with substantially reduced computational cost
● The same BERT-based extractive reader is used for both models
Exact match QA accuracy on Natural Questions and TriviaQA:
Model | NQ | TQA
BPR + extractive reader | 41.6 | 56.8
DPR + extractive reader | 41.5 | 56.8

Slide 32

Summary
BPR significantly reduces the computational cost of state-of-the-art open-domain QA without a loss in accuracy
Paper: https://arxiv.org/abs/2106.00882
Code & Model: https://github.com/studio-ousia/bpr
Contact: [email protected] / @ikuyamada