Open-domain QA model: a Retriever selects the top-k relevant passages from a knowledge base (e.g., Wikipedia), and a Reader produces the answer from them.
Example — Question: Who wrote the novel "I Am a Cat"? Answer: Sōseki Natsume
Problem: index size. Each passage vector: 4 bytes (float32) × 768 dimensions (BERT output dimensions) ≈ 3072 bytes.
The index of the entire 21M English Wikipedia passages: 3072 bytes × 21,000,000 ≈ 65GB. This index must be stored in memory!!
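A back-of-the-envelope check of these figures (a minimal sketch using the slide's numbers: 768-dimensional float32 vectors, 21M passages; the binary-code size anticipates the 2GB index reported later):

```python
# Index-size estimate using the figures from the slides.
NUM_PASSAGES = 21_000_000   # English Wikipedia passages
DIM = 768                   # BERT output dimensions

float32_bytes = 4 * DIM     # 3072 bytes per passage vector
binary_bytes = DIM // 8     # 96 bytes per passage as a 768-bit binary code

print(f"float32 index: {NUM_PASSAGES * float32_bytes / 1e9:.1f} GB")  # ~64.5 GB (≈65GB)
print(f"binary index:  {NUM_PASSAGES * binary_bytes / 1e9:.1f} GB")   # ~2.0 GB
```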
Proposed method: BPR (Binary Passage Retriever). BPR extends DPR using hashing to represent passages as compact binary codes rather than continuous vectors.
Key approaches:
• Learning-to-hash that learns a hash function to convert continuous vectors to binary codes in an end-to-end manner
• Two-stage approach consisting of candidate generation and reranking
DPR (Dense Passage Retriever): state-of-the-art retriever based on two independent BERT encoders that encode questions and passages.
• Each encoder outputs a continuous vector (e.g., [1.1, -0.3, …, 0.1] ∈ ℝᵈ and [0.2, -0.7, …, 0.3] ∈ ℝᵈ).
• The relevance score of a passage given a question is computed using the inner product of the two vectors.
• Top-k passages are obtained based on nearest neighbor search over all passage vectors.
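A minimal sketch of this dual-encoder scoring, assuming the Hugging Face transformers library and the published DPR checkpoints (facebook/dpr-question_encoder-single-nq-base, facebook/dpr-ctx_encoder-single-nq-base); any BERT-based encoders could be substituted:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Two independent BERT-based encoders for questions and passages.
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
p_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question = 'Who wrote the novel "I Am a Cat"?'
passages = [
    "I Am a Cat is a satirical novel written by Natsume Sōseki.",
    "Tokyo is the capital of Japan.",
]

with torch.no_grad():
    q_vec = q_enc(**q_tok(question, return_tensors="pt")).pooler_output                    # (1, 768)
    p_vecs = p_enc(**p_tok(passages, return_tensors="pt", padding=True)).pooler_output     # (2, 768)

# Relevance score = inner product; top-k passages are the highest-scoring ones.
scores = q_vec @ p_vecs.T
print(torch.topk(scores, k=2, dim=1).indices)
```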
BPR architecture:
• A hash layer is placed on top of each encoder: the continuous vector computed by each encoder (e.g., [1.1, -0.3, …, 0.1] ∈ ℝᵈ) is converted to a binary code (e.g., [1, -1, …, 1] ∈ {-1, 1}ᵈ).
• The hash layer is implemented using the sign function.
• Retrieval is two-stage: candidate generation based on the Hamming distance between binary codes, followed by reranking based on the inner product.
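A minimal sketch of this two-stage retrieval with random stand-in vectors (the encoders, dimensions, and candidate/k sizes below are assumptions for illustration; in BPR only the binary passage codes are stored, so the reranking here scores the continuous question vector against the candidates' binary codes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_passages = 768, 10_000

# Stand-ins for encoder outputs (in BPR these come from the BERT encoders).
passage_vecs = rng.normal(size=(num_passages, d)).astype(np.float32)
question_vec = rng.normal(size=d).astype(np.float32)

# Hash layer: sign function maps continuous vectors to {-1, +1}^d binary codes.
passage_codes = np.sign(passage_vecs)   # only these compact codes need to be stored
question_code = np.sign(question_vec)

# Stage 1 -- candidate generation via Hamming distance between binary codes.
# For codes in {-1, +1}, Hamming distance = (d - inner product) / 2.
hamming = (d - passage_codes @ question_code) / 2
candidates = np.argsort(hamming)[:1000]          # top candidates (1000 is an arbitrary choice)

# Stage 2 -- reranking via inner product with the continuous question vector.
rerank_scores = passage_codes[candidates] @ question_vec
top_k = candidates[np.argsort(-rerank_scores)[:20]]   # final top-k passages
print(top_k[:5])
```

In practice the binary codes would be bit-packed and the Hamming distances computed with popcount (e.g., a binary FAISS index) rather than with dense ±1 matrices as in this sketch.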
Training: the sign function is incompatible with back-propagation.
Solution: during training, we approximate the sign function using the differentiable scaled tanh function tanh(βx) (Cao et al. 2017), where β is increased at every training step.
[Figure: green = sign(x), blue = tanh(βx); the approximation sharpens over training — step 0 (β=1), step 240 (β=5), step 990 (β=10), step 8990 (β=30), with γ=0.1.]
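A small sketch of the annealing schedule implied by the β values above, assuming it has the form β = √(γ·step + 1) with γ = 0.1 (this form reproduces the slide's numbers exactly, but the exact schedule should be checked against the paper and code):

```python
import math

def beta(step: int, gamma: float = 0.1) -> float:
    """Scale for the tanh surrogate of sign(x); grows with the training step."""
    return math.sqrt(gamma * step + 1.0)

def scaled_tanh(x: float, step: int, gamma: float = 0.1) -> float:
    """Differentiable approximation of sign(x) used during training: tanh(beta * x)."""
    return math.tanh(beta(step, gamma) * x)

for step in (0, 240, 990, 8990):
    print(step, beta(step))   # -> 1.0, 5.0, 10.0, 30.0, matching the slides
```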
Retrieval results (recall rates on Natural Questions and TriviaQA):
• BPR achieves comparable or better performance than DPR when k ≥ 20.
• Recall at small k is less important: the reader usually takes k ≥ 20 passages.
• BPR significantly reduces the computational cost of DPR:
→ Index size: 65GB -> 2GB
→ Query time: 457ms -> 38ms
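For reference, a minimal sketch of how a top-k retrieval recall rate of this kind is commonly computed (the fraction of questions for which at least one of the top-k retrieved passages contains an answer string); this is an illustration, not the paper's evaluation script:

```python
from typing import List

def recall_at_k(retrieved: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Fraction of questions whose top-k passages contain at least one answer string."""
    hits = 0
    for passages, golds in zip(retrieved, answers):
        if any(gold.lower() in passage.lower()
               for passage in passages[:k] for gold in golds):
            hits += 1
    return hits / len(retrieved)

# Toy usage: one question answered within the top-2 passages, one missed.
retrieved = [["...Natsume Sōseki wrote I Am a Cat...", "Tokyo is the capital of Japan."],
             ["An unrelated passage."]]
answers = [["Natsume Sōseki"], ["Edison"]]
print(recall_at_k(retrieved, answers, k=2))   # 0.5
```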
End-to-end QA results (exact match QA accuracy on Natural Questions / TriviaQA; the same BERT-based extractive reader is used for both models):
• BPR + extractive reader: … / 56.8
• DPR + extractive reader: 41.5 / 56.8
• BPR achieves equivalent QA accuracy to DPR with substantially reduced computational cost.
Conclusion: BPR substantially reduces the computational cost of open-domain QA without a loss in accuracy.
Paper: https://arxiv.org/abs/2106.00882
Code & Model: https://github.com/studio-ousia/bpr
Contact: ikuya@ousia.jp (@ikuyamada)