Efficient Passage Retrieval with Hashing for Open-domain Question Answering (ACL 2021)

Ikuya Yamada, Akari Asai, Hanna Hajishirzi
Studio Ousia

Open-domain Question Answering
The task of answering arbitrary factoid questions
[Diagram: Question → Open-domain QA model, backed by a knowledge base (e.g., Wikipedia) → Answer]
Question: Who wrote the novel "I Am a Cat"?
Answer: Sōseki Natsume

Retriever-Reader Architecture
A pipelined approach to extract an answer from knowledge base
[Diagram of the open-domain QA model: Question → Retriever, which searches the knowledge base (e.g., Wikipedia) → Top-k relevant passages → Reader → Answer]
Question: Who wrote the novel "I Am a Cat"?
Answer: Sōseki Natsume

Dense Passage Retriever (DPR) (Karpukhin et al., 2020)
State-of-the-art retriever commonly used in open-domain QA models
Two independent BERT models are used to encode passages and questions
[Diagram: Passage → Passage Encoder and Question → Question Encoder produce continuous vectors (e.g., [1.1, -0.3, …, 0.1] ∈ ℝ^d and [0.2, -0.7, …, 0.3] ∈ ℝ^d); Relevance Score = Inner product of the two vectors]

Extreme Memory Cost of DPR
The size of each passage vector:
4 bytes (float32) * 768 dimensions (BERT output dimensions) = 3072 bytes
The index of entire 21M English Wikipedia passages:
3072 bytes * 21,000,000 ≒ 65GB
This index must be stored in memory!!
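
The arithmetic on this slide can be checked with a few lines of Python. The dense figures come straight from the slide. The binary figures are an assumption added for comparison: they suppose that a hashed passage stores one bit per dimension, which this slide does not state.

```python
# Figures taken from the slide.
BYTES_PER_FLOAT32 = 4
DIMENSIONS = 768            # BERT output dimensions
NUM_PASSAGES = 21_000_000   # English Wikipedia passages

dense_bytes_per_passage = BYTES_PER_FLOAT32 * DIMENSIONS
dense_index_gb = dense_bytes_per_passage * NUM_PASSAGES / 1e9

# Assumption (not on this slide): a binary code uses one bit per dimension.
binary_bytes_per_passage = DIMENSIONS // 8
binary_index_gb = binary_bytes_per_passage * NUM_PASSAGES / 1e9

print(f"dense:  {dense_bytes_per_passage} bytes/passage, {dense_index_gb:.1f} GB")
print(f"binary: {binary_bytes_per_passage} bytes/passage, {binary_index_gb:.1f} GB")
```

Under that one-bit assumption the per-passage cost drops by a factor of 32, which is consistent with the index sizes reported in the experiments.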

Binary Passage Retriever (BPR)
BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors
Key Approaches:
• Learning-to-hash that learns a hash function to convert continuous vectors to binary codes in an end-to-end manner
• Two-stage approach consisting of candidate generation and reranking

DPR Architecture
State-of-the-art retriever based on two independent BERT encoders
• Two independent BERT models are used to encode questions and passages
• Relevance score of a passage given a question is computed using inner product
• Top-k passages are obtained based on nearest neighbor search for all passage vectors
[Diagram: Passage → Passage Encoder and Question → Question Encoder produce continuous vectors (e.g., [1.1, -0.3, …, 0.1] ∈ ℝ^d and [0.2, -0.7, …, 0.3] ∈ ℝ^d); Relevance Score = Inner product of the two vectors]
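
The scoring and top-k steps can be sketched in plain Python. This is a minimal illustration, not the DPR implementation: the random vectors stand in for BERT encoder outputs, the dimension is shrunk to 8, and `dense_top_k` is a hypothetical helper that performs an exhaustive search.

```python
import heapq
import random


def inner_product(u, v):
    """Relevance score between a question vector and a passage vector."""
    return sum(a * b for a, b in zip(u, v))


def dense_top_k(question, passages, k):
    """Return the indices of the k passages with the highest inner product."""
    scores = [inner_product(question, passage) for passage in passages]
    # heapq.nlargest returns the indices sorted from best to worst score.
    return heapq.nlargest(k, range(len(passages)), key=scores.__getitem__)


# Stand-in data: random vectors instead of real encoder outputs (assumption).
random.seed(0)
d = 8
passages = [[random.gauss(0, 1) for _ in range(d)] for _ in range(100)]
question = [random.gauss(0, 1) for _ in range(d)]

top_ids = dense_top_k(question, passages, k=5)
print(top_ids)
```

Every passage vector has to be available for this search, which is why the full float32 index must sit in memory.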

Model: Hash Layer
• Hash layer is placed on top of each encoder
• A continuous vector computed by each encoder is converted to a binary code
• Hash layer is implemented using sign function, mapping a continuous vector in ℝ^d to a binary code in {-1, 1}^d
[Diagram: Passage → Passage Encoder and Question → Question Encoder produce continuous vectors (e.g., [1.1, -0.3, …, 0.1] ∈ ℝ^d and [0.2, -0.7, …, 0.3] ∈ ℝ^d); each passes through a Hash layer to give a binary code (e.g., [1, -1, …, 1] ∈ {-1, 1}^d); Candidate generation uses Hamming distance; Reranking uses Inner product]
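
A sign-based hash layer can be sketched as follows. The function name `sign_hash` is hypothetical, and the handling of an exact zero (mapped to -1 here) is an assumption, since the slide only says that the sign function is used.

```python
def sign_hash(vector):
    """Convert a continuous vector into a binary code in {-1, 1}^d."""
    # Assumption: values that are not strictly positive map to -1.
    return [1 if x > 0 else -1 for x in vector]


# Three-dimensional stand-ins for the example vectors in the diagram.
print(sign_hash([1.1, -0.3, 0.1]))
print(sign_hash([0.2, -0.7, 0.3]))
```

Both example vectors share the same sign pattern, so they hash to the same code even though their continuous values differ.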

Model: Approximating Sign Function Using Scaled Tanh Function
Problem: Sign function is incompatible with back propagation
Solution: During training, we approximate sign function using differentiable scaled tanh function tanh(βx) (Cao et al. 2017), where β is increased at every training step
[Figure: sign(x) in green and tanh(βx) in blue, shown at training step 0 (β=1, γ=0.1)]
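
The effect of increasing β can be seen numerically. The schedule below, β = sqrt(1 + γ · step), is an assumption: it agrees with the values shown on the slide (β=1 at step 0 with γ=0.1), but the slide itself does not spell out the formula.

```python
import math


def beta_at(step, gamma=0.1):
    """Assumed schedule: beta grows with the square root of the training step."""
    return math.sqrt(1.0 + gamma * step)


def scaled_tanh(x, beta):
    """Differentiable surrogate for sign(x) used during training."""
    return math.tanh(beta * x)


# As beta grows, tanh(beta * x) moves toward sign(x) for any nonzero x.
for step in (0, 100, 10_000):
    beta = beta_at(step)
    print(step, round(beta, 2), round(scaled_tanh(0.2, beta), 4))
```

At inference time the surrogate is no longer needed and the sign function itself produces the binary codes.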

Model: Two-stage Approach of Candidate Generation and Reranking
Candidate Generation:
• A small number of candidates are obtained efficiently based on Hamming distance
  ◦ Question: binary code
  ◦ Passage: binary code
Reranking:
• The candidates are re-ranked based on expressive inner product
  ◦ Question: continuous vector
  ◦ Passage: binary code
[Diagram: Passage → Passage Encoder and Question → Question Encoder produce continuous vectors (e.g., [1.1, -0.3, …, 0.1] ∈ ℝ^d and [0.2, -0.7, …, 0.3] ∈ ℝ^d); each passes through a Hash layer to give a binary code (e.g., [1, -1, …, 1] ∈ {-1, 1}^d and [-1, 1, …, 1] ∈ {-1, 1}^d); Candidate generation uses Hamming distance; Reranking uses Inner product, labeled "more expressive"]
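
The two stages can be sketched end to end in plain Python. This is an illustrative sketch under several assumptions: the helper names are hypothetical, codes are packed into Python integers so that Hamming distance becomes an XOR plus a bit count, and the passage codes are packed on every call for brevity, whereas a real index would precompute them.

```python
import heapq


def sign_hash(vector):
    """Binary code in {-1, 1}^d (assumption: non-positive values map to -1)."""
    return [1 if x > 0 else -1 for x in vector]


def pack_bits(code):
    """Pack a {-1, 1} code into an integer, one bit per dimension."""
    bits = 0
    for value in code:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def retrieve(question_vector, passage_codes, num_candidates, k):
    # Stage 1: candidate generation with binary question and passage codes.
    packed_question = pack_bits(sign_hash(question_vector))
    packed_passages = [pack_bits(code) for code in passage_codes]
    candidates = heapq.nsmallest(
        num_candidates,
        range(len(passage_codes)),
        key=lambda i: hamming_distance(packed_question, packed_passages[i]),
    )

    # Stage 2: reranking with the continuous question vector and binary passage codes.
    def score(i):
        return sum(q * h for q, h in zip(question_vector, passage_codes[i]))

    return heapq.nlargest(k, candidates, key=score)


question_vector = [0.9, -0.4, 0.3, -0.1]
passage_codes = [[-1, 1, -1, 1], [1, -1, 1, -1], [1, 1, 1, -1], [-1, -1, -1, -1]]
print(retrieve(question_vector, passage_codes, num_candidates=2, k=2))  # [1, 2]
```

Only the binary passage codes are needed in both stages, so the continuous passage vectors never have to be stored.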

Model: Multi-task Training
BPR is trained by simultaneously optimizing two loss functions:
• Ranking loss for candidate generation
• Cross-entropy loss for reranking
The final loss function combines the two losses
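
One way the two objectives could look is sketched below. All specifics here are assumptions, since the formulas are not part of this transcript: the ranking loss is written as a margin-based hinge over inner products of binary codes, the margin value of 2.0 is a placeholder, the cross-entropy loss is a softmax over the positive and negative passages, and the two are combined by an unweighted sum. During training the codes would be the tanh-relaxed values rather than exact ±1.

```python
import math


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(question_code, positive_code, negative_codes, margin=2.0):
    """Assumed hinge loss: the positive should beat each negative by `margin`."""
    positive_score = inner_product(question_code, positive_code)
    return sum(
        max(0.0, margin - positive_score + inner_product(question_code, negative))
        for negative in negative_codes
    )


def cross_entropy_loss(question_vector, positive_code, negative_codes):
    """Negative log-likelihood of the positive passage under a softmax."""
    logits = [inner_product(question_vector, positive_code)]
    logits += [inner_product(question_vector, negative) for negative in negative_codes]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_partition = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_partition - logits[0]


def total_loss(question_vector, question_code, positive_code, negative_codes):
    """Assumed combination: an unweighted sum of the two losses."""
    return ranking_loss(question_code, positive_code, negative_codes) + cross_entropy_loss(
        question_vector, positive_code, negative_codes
    )


question_vector = [0.9, -0.4, 0.3]
question_code = [1, -1, 1]
positive_code = [1, -1, 1]
negative_codes = [[-1, 1, -1], [1, 1, -1]]
loss = total_loss(question_vector, question_code, positive_code, negative_codes)
print(round(loss, 4))
```

The ranking term only sees binary codes, matching candidate generation, while the cross-entropy term pairs the continuous question vector with binary passage codes, matching reranking.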

Experiments: Comparison with DPR
• BPR achieves similar or better performance than DPR when k ≥ 20
• BPR significantly reduces the computational cost of DPR
  ◦ Index size: 65GB → 2GB
  ◦ Query time: 457ms → 38ms
The recall in small k is less important: the reader usually takes k ≥ 20 passages
[Figures: Retrieval recall rates on Natural Questions; Retrieval recall rates on TriviaQA]

Experiments: Comparison with Quantization Methods
• BPR achieves significantly better performance than DPR + post-hoc quantization methods: product quantization (PQ) and LSH
[Figures: Retrieval recall rates on Natural Questions; Retrieval recall rates on TriviaQA]
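
For contrast with learning-to-hash, a post-hoc hash can be applied to vectors that were trained without any hashing objective. The sketch below uses random-hyperplane LSH, one common LSH variant. That choice is an assumption for illustration: the slide does not say which LSH variant was used, and the helper names are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw random Gaussian hyperplanes; they are fixed, not learned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_code(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return [
        1 if sum(w * x for w, x in zip(plane, vector)) > 0 else -1
        for plane in hyperplanes
    ]


hyperplanes = make_hyperplanes(num_bits=8, dim=3)
code = lsh_code([1.1, -0.3, 0.1], hyperplanes)
print(code)
```

The hyperplanes here are independent of the encoders, whereas BPR trains the encoders and the hash layer together.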

Experiments: End-to-end QA
Exact match QA accuracy on Natural Questions (NQ) and TriviaQA (TQA)

Model                    NQ     TQA
BPR+extractive reader    41.6   56.8
DPR+extractive reader    41.5   56.8

• BPR achieves equivalent QA accuracy to DPR with substantially reduced computational cost
Same BERT-based extractive reader is used for both models

Summary
BPR significantly reduces the computational cost of state-of-the-art open-domain QA without a loss in accuracy
Paper: https://arxiv.org/abs/2106.00882
Code & Model: https://github.com/studio-ousia/bpr
@ikuyamada
