
Efficient Passage Retrieval with Hashing for Open-domain Question Answering (ACL 2021)


Ikuya Yamada

May 20, 2022


Transcript

  1. Efficient Passage Retrieval with Hashing
    for Open-domain Question Answering
    Ikuya Yamada, Akari Asai, Hanna Hajishirzi
    STUDIO OUSIA


  2. Open-domain Question Answering
    The task of answering arbitrary factoid questions
    2
    Open-domain
    QA model
    Question:
    Who wrote the
    novel "I Am a Cat"?
    Answer:
    Sōseki Natsume
    Knowledge base
    (e.g., Wikipedia)


  3. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    3


  4. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    4
    Question:
    Who wrote the
    novel "I Am a Cat"?


  5. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    5
    Retriever
    Question:
    Who wrote the
    novel "I Am a Cat"?
    Top-k relevant
    passages
    Knowledge base
    (e.g., Wikipedia)


  6. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    6
    Retriever Reader
    Question:
    Who wrote the
    novel "I Am a Cat"?
    Answer:
    Sōseki Natsume
    Top-k relevant
    passages
    Knowledge base
    (e.g., Wikipedia)
    Open-domain QA model


  7. Dense Passage Retriever (DPR) (Karpukhin et al., 2020)
    7
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Relevance Score
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    State-of-the-art Retriever commonly used in open-domain QA models
    Two independent BERT models are
    used to encode passages and
    questions


  8. Extreme Memory Cost of DPR
    The size of each passage vector:
    8
    4 bytes * 768 dimensions ≒ 3072 bytes
    (float32) (BERT output dimensions)


  9. Extreme Memory Cost of DPR
    The size of each passage vector:
    9
    4 bytes * 768 dimensions ≒ 3072 bytes
    (float32) (BERT output dimensions)
    3072 bytes * 21,000,000 ≒ 65GB
    The index of entire 21M English Wikipedia passages:


  10. Extreme Memory Cost of DPR
    The size of each passage vector:
    10
    4 bytes * 768 dimensions ≒ 3072 bytes
    (float32) (BERT output dimensions)
    3072 bytes * 21,000,000 ≒ 65GB
    The index of entire 21M English Wikipedia passages:
    This index must be stored in memory!!
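The arithmetic on this slide can be checked with a short back-of-the-envelope script (the 21M passage count and the ~2GB BPR figure are taken from later slides):

```python
# Back-of-the-envelope index size for DPR's continuous passage vectors.
BYTES_PER_FLOAT32 = 4
DIMS = 768                   # BERT output dimensions
NUM_PASSAGES = 21_000_000    # English Wikipedia passages (from the slide)

bytes_per_vector = BYTES_PER_FLOAT32 * DIMS       # 3072 bytes per passage
index_bytes = bytes_per_vector * NUM_PASSAGES     # whole-index size
print(bytes_per_vector)                           # 3072
print(round(index_bytes / 1e9))                   # 65 (GB)

# With BPR, each passage becomes a 768-bit binary code instead:
bpr_bytes_per_code = DIMS // 8                    # 96 bytes per passage
print(round(bpr_bytes_per_code * NUM_PASSAGES / 1e9))  # 2 (GB)
```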


  11. Binary Passage Retriever (BPR)
    11
    BPR extends DPR using hashing to represent passages as
    compact binary codes rather than continuous vectors


  12. Binary Passage Retriever (BPR)
    12
    BPR extends DPR using hashing to represent passages as
    compact binary codes rather than continuous vectors
    Key Approaches:
    ● Learning-to-hash, which learns a hash function that converts
    continuous vectors to binary codes in an end-to-end manner


  13. Binary Passage Retriever (BPR)
    Key Approaches:
    ● Learning-to-hash, which learns a hash function that converts
    continuous vectors to binary codes in an end-to-end manner
    ● Two-stage approach consisting of candidate generation and
    reranking
    13
    BPR extends DPR using hashing to represent passages as
    compact binary codes rather than continuous vectors


  14. DPR Architecture
    14
    Two independent BERT models are used to
    encode questions and passages
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Relevance Score
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    State-of-the-art Retriever based on two independent BERT encoders


  15. DPR Architecture
    15
    Two independent BERT models are used to
    encode questions and passages
    Relevance score of a passage given a
    question is computed using inner product
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Relevance Score
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    State-of-the-art Retriever based on two independent BERT encoders


  16. DPR Architecture
    16
    Two independent BERT models are used to
    encode questions and passages
    Relevance score of a passage given a
    question is computed using inner product
    Top-k passages are obtained based on
    nearest neighbor search for
    all passage vectors
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Relevance Score
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    State-of-the-art Retriever based on two independent BERT encoders
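The scoring and top-k retrieval described on this slide can be sketched with toy vectors (placeholder random embeddings, not real BERT outputs; the tiny dimension is for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # toy dimension (BERT uses 768)
passages = rng.standard_normal((5, d))    # placeholder passage vectors
question = rng.standard_normal(d)         # placeholder question vector

# Relevance score = inner product between question and each passage vector;
# top-k passages come from exhaustive nearest-neighbor search over all of them.
scores = passages @ question
k = 2
top_k = np.argsort(-scores)[:k]           # indices of the k highest scores
print(top_k, scores[top_k])
```

In practice DPR uses an approximate nearest-neighbor library over millions of vectors rather than this exhaustive scan.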


  17. Model: Hash Layer
    17
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Reranking
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    Hash layer Hash layer
    [1, -1, …, 1]∈{-1, 1}d [1, -1, …, 1]∈{-1,1}d
    Hamming distance
    Candidate generation
    Hash layer is placed on top of each encoder
    A continuous vector computed by each
    encoder is converted to a binary code
    The hash layer is implemented using the
    sign function, applied element-wise:
    sign(x) = 1 if x > 0, else −1
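A minimal sketch of the hash layer at inference time: the element-wise sign of the encoder output (the embedding values are placeholders; mapping 0 to −1 is an assumption of this sketch):

```python
import numpy as np

def hash_layer(vec):
    """Convert a continuous vector to a {-1, 1} binary code via the sign
    function. Zeros map to -1 here (an assumption of this sketch)."""
    return np.where(vec > 0, 1, -1)

embedding = np.array([1.1, -0.3, 0.0, 0.1])   # placeholder encoder output
code = hash_layer(embedding)
print(code)   # [ 1 -1 -1  1]
```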


  18. Model: Approximating Sign Function Using Scaled Tanh Function
    18
    Problem:
    The sign function is incompatible with
    backpropagation
    Solution:
    During training, we approximate the
    sign function with the differentiable
    scaled tanh function tanh(βx) (Cao et al., 2017),
    where β is increased at every training
    step
    green: sign(x) blue: tanh(βx)
    training step: 0 (β=1, γ=0.1)


  19. Model: Approximating Sign Function Using Scaled Tanh Function
    19
    Problem:
    The sign function is incompatible with
    backpropagation
    Solution:
    During training, we approximate the
    sign function with the differentiable
    scaled tanh function tanh(βx) (Cao et al., 2017),
    where β is increased at every training
    step
    green: sign(x) blue: tanh(βx)
    training step: 240 (β=5, γ=0.1)


  20. Model: Approximating Sign Function Using Scaled Tanh Function
    20
    Problem:
    The sign function is incompatible with
    backpropagation
    Solution:
    During training, we approximate the
    sign function with the differentiable
    scaled tanh function tanh(βx) (Cao et al., 2017),
    where β is increased at every training
    step
    green: sign(x) blue: tanh(βx)
    training step: 990 (β=10, γ=0.1)


  21. Model: Approximating Sign Function Using Scaled Tanh Function
    21
    Problem:
    The sign function is incompatible with
    backpropagation
    Solution:
    During training, we approximate the
    sign function with the differentiable
    scaled tanh function tanh(βx) (Cao et al., 2017),
    where β is increased at every training
    step
    green: sign(x) blue: tanh(βx)
    training step: 8990 (β=30, γ=0.1)
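The approximation shown across these slides can be seen numerically: as β grows, tanh(βx) approaches sign(x) while remaining differentiable (the β values below are illustrative, matching the slide captions):

```python
import math

def sign(x):
    """Hard sign used at inference time (zero maps to -1 in this sketch)."""
    return 1.0 if x > 0 else -1.0

x = 0.3
for beta in (1, 5, 30):
    approx = math.tanh(beta * x)
    gap = abs(approx - sign(x))
    print(f"beta={beta:>2}  tanh(beta*x)={approx:.4f}  gap to sign(x)={gap:.4f}")
# The gap shrinks toward 0 as beta increases, so training with tanh(beta*x)
# gradually converges to the hard sign used at inference.
```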


  22. Model: Two-stage Approach of Candidate Generation and Reranking
    22
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Reranking
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    Hash layer Hash layer
    [1, -1, …, 1]∈{-1, 1}d [-1, 1, …, 1]∈{-1,1}d
    Hamming distance
    Candidate generation


  23. Model: Two-stage Approach of Candidate Generation and Reranking
    23
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Reranking
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    Hash layer Hash layer
    [1, -1, …, 1]∈{-1, 1}d [-1, 1, …, 1]∈{-1,1}d
    Hamming distance
    Candidate generation
    Candidate Generation:
    ● A small number of candidates are obtained
    efficiently based on Hamming distance
    ○ Question: binary code
    ○ Passage: binary code


  24. Model: Two-stage Approach of Candidate Generation and Reranking
    24
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Reranking
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    Hash layer Hash layer
    [1, -1, …, 1]∈{-1, 1}d [-1, 1, …, 1]∈{-1,1}d
    Hamming distance
    Candidate generation
    Candidate Generation:
    ● A small number of candidates are obtained
    efficiently based on Hamming distance
    ○ Question: binary code
    ○ Passage: binary code
    Reranking:
    ● The candidates are re-ranked based on
    expressive inner product
    ○ Question: continuous vector
    ○ Passage: binary code


  25. Candidate Generation:
    ● A small number of candidates are obtained
    efficiently based on Hamming distance
    ○ Question: binary code
    ○ Passage: binary code
    Reranking:
    ● The candidates are re-ranked based on
    expressive inner product
    ○ Question: continuous vector
    ○ Passage: binary code
    Model: Two-stage Approach of Candidate Generation and Reranking
    25
    Passage
    Encoder
    Question
    Encoder
    Passage Question
    Reranking
    Inner product
    [1.1, -0.3, …, 0.1]∈ℝd [0.2, -0.7, …, 0.3]∈ℝd
    Hash layer Hash layer
    [1, -1, …, 1]∈{-1, 1}d [-1, 1, …, 1]∈{-1,1}d
    Hamming distance
    Candidate generation
    more
    expressive
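The two-stage pipeline above can be sketched end to end with toy numpy vectors (placeholder embeddings; the candidate count l=10 and k=3 are illustrative, not the paper's settings). For {-1, 1} codes, the Hamming distance equals (d − ⟨q, p⟩) / 2, which is what the first stage exploits:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 100
passage_vecs = rng.standard_normal((n, d))          # placeholder continuous vectors
passage_codes = np.where(passage_vecs > 0, 1, -1)   # binary codes via sign

q_vec = rng.standard_normal(d)                      # continuous question vector
q_code = np.where(q_vec > 0, 1, -1)                 # binary question code

# Stage 1: candidate generation by Hamming distance between binary codes.
# For {-1, 1} codes: Hamming distance = (d - <q_code, p_code>) / 2.
hamming = (d - passage_codes @ q_code) // 2
candidates = np.argsort(hamming)[:10]               # top-l candidates (l=10 here)

# Stage 2: rerank candidates with the more expressive inner product between
# the *continuous* question vector and the passages' binary codes.
rerank_scores = passage_codes[candidates] @ q_vec
top_k = candidates[np.argsort(-rerank_scores)[:3]]
print(top_k)
```

Real implementations compute the Hamming stage with XOR and popcount over packed bits, which is what makes candidate generation so fast.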


  26. Model: Multi-task Training
    Ranking loss for candidate generation:
    Cross-entropy loss for reranking:
    The final loss function:
    26
    BPR is trained by simultaneously optimizing two loss functions
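A hedged sketch of the multi-task objective: a margin-based ranking loss over binary-code scores for candidate generation, plus a softmax cross-entropy loss over inner products for reranking, summed into one loss. The margin value, negative sampling, and toy vectors here are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def candidate_generation_loss(q_code, pos_code, neg_codes, margin=0.1):
    """Ranking loss on binary codes: the positive passage's score should
    beat each negative's by at least `margin` (margin value is illustrative)."""
    pos_score = q_code @ pos_code
    neg_scores = neg_codes @ q_code
    return np.maximum(0.0, margin - (pos_score - neg_scores)).sum()

def reranking_loss(q_vec, pos_code, neg_codes):
    """Cross-entropy over inner products between the continuous question
    vector and binary passage codes (softmax over positive + negatives)."""
    logits = np.concatenate([[q_vec @ pos_code], neg_codes @ q_vec])
    logits = logits - logits.max()                  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                          # positive is index 0

rng = np.random.default_rng(0)
d = 8
q_vec = rng.standard_normal(d)                      # toy question vector
q_code = np.where(q_vec > 0, 1, -1)
pos_code = q_code.copy()                            # toy positive, close to question
neg_codes = np.where(rng.standard_normal((4, d)) > 0, 1, -1)

# BPR optimizes both losses simultaneously: L = L_cand + L_rerank.
loss = candidate_generation_loss(q_code, pos_code, neg_codes) \
     + reranking_loss(q_vec, pos_code, neg_codes)
print(float(loss))
```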


  27. Experiments: Comparison with DPR
    27
    ● BPR achieves performance similar to or better than DPR when k ≥ 20
    Retrieval recall rates on
    Natural Questions
    Retrieval recall rates on
    TriviaQA


  28. Experiments: Comparison with DPR
    28
    ● BPR achieves performance similar to or better than DPR when k ≥ 20
    ● BPR significantly reduces the computational cost of DPR
    → Index size: 65GB → 2GB
    → Query time: 457ms → 38ms
    Retrieval recall rates on
    Natural Questions
    Retrieval recall rates on
    TriviaQA


  29. Experiments: Comparison with DPR
    29
    ● BPR achieves performance similar to or better than DPR when k ≥ 20
    ● BPR significantly reduces the computational cost of DPR
    → Index size: 65GB → 2GB
    → Query time: 457ms → 38ms
    Retrieval recall rates on
    Natural Questions
    Retrieval recall rates on
    TriviaQA
    Recall at small k is less important:
    the reader usually takes k ≥ 20 passages


  30. Experiments: Comparison with Quantization Methods
    30
    ● BPR achieves significantly better performance than
    DPR + post-hoc quantization methods: product quantization (PQ) and LSH
    Retrieval recall rates on
    Natural Questions
    Retrieval recall rates on
    TriviaQA


  31. Experiments: End-to-end QA
    31
    Model NQ TQA
    BPR+extractive reader 41.6 56.8
    DPR+extractive reader 41.5 56.8
    ● BPR achieves equivalent QA accuracy to DPR with
    substantially reduced computational cost
    Exact match QA accuracy on Natural Questions and TriviaQA
    Same BERT-based extractive
    reader is used for both models


  32. Summary
    32
    BPR significantly reduces the computational cost of
    state-of-the-art open-domain QA without a loss in accuracy
    [email protected]
    @ikuyamada
    Paper: https://arxiv.org/abs/2106.00882
    Code & Model: https://github.com/studio-ousia/bpr
