
Efficient Passage Retrieval with Hashing for Open-domain Question Answering (ACL 2021)


Ikuya Yamada

May 20, 2022

Transcript

  1. Efficient Passage Retrieval with Hashing
    for Open-domain Question Answering
    STUDIO OUSIA
    Hanna Hajishirzi
    Akari Asai
    Ikuya Yamada

  2. Open-domain Question Answering
    The task of answering arbitrary factoid questions
    [Diagram: Question "Who wrote the novel 'I Am a Cat'?" → Open-domain
    QA model + Knowledge base (e.g., Wikipedia) → Answer: Sōseki Natsume]

  3. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base

  4. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    [Diagram: Question "Who wrote the novel 'I Am a Cat'?"]

  5. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    [Diagram: Question "Who wrote the novel 'I Am a Cat'?" → Retriever +
    Knowledge base (e.g., Wikipedia) → Top-k relevant passages]

  6. Retriever-Reader Architecture
    A pipelined approach to extract an answer from a knowledge base
    [Diagram: Question "Who wrote the novel 'I Am a Cat'?" → Retriever +
    Knowledge base (e.g., Wikipedia) → Top-k relevant passages → Reader →
    Answer: Sōseki Natsume; the whole pipeline is the open-domain QA model]

  7. Dense Passage Retriever (DPR) (Karpukhin et al., 2020)
    State-of-the-art retriever commonly used in open-domain QA models
    Two independent BERT models are used to encode passages and questions
    [Diagram: Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d and Question
    Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; the relevance score is their
    inner product]
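The inner-product scoring above can be sketched with toy vectors (a minimal sketch; the array values are illustrative, and DPR's real dimension is 768):

```python
import numpy as np

# Toy stand-ins for the two encoders' outputs (d = 4 here; DPR uses d = 768).
passage_vecs = np.array([
    [1.1, -0.3, 0.5, 0.1],   # passage 0
    [0.4,  0.9, -0.2, 0.7],  # passage 1
])
question_vec = np.array([0.2, -0.7, 0.4, 0.3])

# Relevance score of each passage = inner product with the question vector.
scores = passage_vecs @ question_vec
best = int(np.argmax(scores))
print(scores, best)
```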

  8. Extreme Memory Cost of DPR
    The size of each passage vector:
    4 bytes × 768 dimensions = 3072 bytes
    (float32) (BERT output dimensions)

  9. Extreme Memory Cost of DPR
    The size of each passage vector:
    4 bytes × 768 dimensions = 3072 bytes
    (float32) (BERT output dimensions)
    The index of all 21M English Wikipedia passages:
    3072 bytes × 21,000,000 ≈ 65 GB

  10. Extreme Memory Cost of DPR
    The size of each passage vector:
    4 bytes × 768 dimensions = 3072 bytes
    (float32) (BERT output dimensions)
    The index of all 21M English Wikipedia passages:
    3072 bytes × 21,000,000 ≈ 65 GB
    This index must be stored in memory!
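The arithmetic above can be checked in a few lines (a sketch; the ~2 GB figure assumes one bit per dimension, matching the binary codes BPR introduces later):

```python
DIM = 768                 # BERT output dimensions
N_PASSAGES = 21_000_000   # English Wikipedia passages

vec_bytes = 4 * DIM                          # float32: 3072 bytes per passage
dpr_index_gb = vec_bytes * N_PASSAGES / 1e9
print(vec_bytes, round(dpr_index_gb, 1))     # ~64.5, i.e. the slide's "65 GB"

code_bytes = DIM // 8                        # 1 bit per dimension: 96 bytes
bpr_index_gb = code_bytes * N_PASSAGES / 1e9
print(code_bytes, round(bpr_index_gb, 1))    # ~2.0 GB
```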

  11. Binary Passage Retriever (BPR)
    BPR extends DPR using hashing to represent passages as
    compact binary codes rather than continuous vectors

  12. Binary Passage Retriever (BPR)
    BPR extends DPR using hashing to represent passages as
    compact binary codes rather than continuous vectors
    Key Approaches:
    ● Learning-to-hash that learns a hash function to convert
    continuous vectors to binary codes in an end-to-end manner

  13. Binary Passage Retriever (BPR)
    BPR extends DPR using hashing to represent passages as
    compact binary codes rather than continuous vectors
    Key Approaches:
    ● Learning-to-hash that learns a hash function to convert
    continuous vectors to binary codes in an end-to-end manner
    ● Two-stage approach consisting of candidate generation and
    reranking

  14. DPR Architecture
    State-of-the-art retriever based on two independent BERT encoders
    Two independent BERT models are used to encode questions and passages
    [Diagram: Passage Encoder → [1.1, -0.3, …, 0.1] ∈ ℝ^d and Question
    Encoder → [0.2, -0.7, …, 0.3] ∈ ℝ^d; the relevance score is their
    inner product]

  15. DPR Architecture
    State-of-the-art retriever based on two independent BERT encoders
    Two independent BERT models are used to encode questions and passages
    The relevance score of a passage given a question is computed using
    the inner product
    [Diagram: same DPR architecture as above]

  16. DPR Architecture
    State-of-the-art retriever based on two independent BERT encoders
    Two independent BERT models are used to encode questions and passages
    The relevance score of a passage given a question is computed using
    the inner product
    Top-k passages are obtained by nearest-neighbor search over all
    passage vectors
    [Diagram: same DPR architecture as above]

  17. Model: Hash Layer
    [Diagram: each encoder's continuous vector in ℝ^d (e.g., [1.1, -0.3, …, 0.1])
    passes through a hash layer to a binary code in {-1, 1}^d (e.g., [1, -1, …, 1]);
    candidate generation uses the Hamming distance between binary codes,
    reranking uses the inner product]
    A hash layer is placed on top of each encoder
    The continuous vector computed by each encoder is converted to a binary code
    The hash layer is implemented using the sign function
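A minimal sketch of such a hash layer (illustrative NumPy, not the paper's code); the sign is applied elementwise, mapping each dimension to ±1:

```python
import numpy as np

def hash_layer(v):
    """Binarize a continuous vector to a code in {-1, 1}^d via the sign function.
    np.sign would map 0 to 0, so zeros are treated as +1 to keep codes binary."""
    return np.where(v >= 0, 1, -1)

code = hash_layer(np.array([1.1, -0.3, 0.0, 0.1]))
print(code)  # [ 1 -1  1  1]
```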

  18. Model: Approximating Sign Function Using Scaled Tanh Function
    Problem: the sign function is incompatible with backpropagation
    Solution: during training, we approximate the sign function using the
    differentiable scaled tanh function (Cao et al. 2017), where β is
    increased at every training step
    [Plot: green = sign(x), blue = tanh(βx); training step 0 (β=1, γ=0.1)]

  19. Model: Approximating Sign Function Using Scaled Tanh Function
    (Same problem and solution as above; the plot is now at
    training step 240 (β=5, γ=0.1))

  20. Model: Approximating Sign Function Using Scaled Tanh Function
    (Same problem and solution as above; the plot is now at
    training step 990 (β=10, γ=0.1))

  21. Model: Approximating Sign Function Using Scaled Tanh Function
    (Same problem and solution as above; the plot is now at
    training step 8990 (β=30, γ=0.1))
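The approximation can be illustrated directly. The schedule β = √(γ·step + 1) below is inferred from the (step, β) pairs shown on the slides (0→1, 240→5, 990→10, 8990→30 with γ = 0.1); it is an assumption reconstructed from those numbers, not a quote from the paper:

```python
import math

def beta_schedule(step, gamma=0.1):
    # Reproduces the slides' (step, beta) pairs: (0, 1), (240, 5), (990, 10), (8990, 30).
    return math.sqrt(gamma * step + 1)

def approx_sign(x, beta):
    # Differentiable training-time surrogate for sign(x).
    return math.tanh(beta * x)

for step in (0, 240, 990, 8990):
    b = beta_schedule(step)
    print(step, round(b, 3), round(approx_sign(0.2, b), 4))
# As beta grows, tanh(beta * x) approaches sign(x) for x != 0.
```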

  22. Model: Two-stage Approach of Candidate Generation and Reranking
    [Diagram: BPR architecture. Each encoder's continuous vector in ℝ^d is
    hashed to a binary code in {-1, 1}^d; candidate generation uses the
    Hamming distance between the two binary codes, and reranking uses the
    inner product between the continuous question vector and the passage
    binary code]

  23. Model: Two-stage Approach of Candidate Generation and Reranking
    [Diagram: same BPR architecture as above]
    Candidate Generation:
    ● A small number of candidates are obtained
    efficiently based on Hamming distance
    ○ Question: binary code
    ○ Passage: binary code

  24. Model: Two-stage Approach of Candidate Generation and Reranking
    [Diagram: same BPR architecture as above]
    Candidate Generation:
    ● A small number of candidates are obtained
    efficiently based on Hamming distance
    ○ Question: binary code
    ○ Passage: binary code
    Reranking:
    ● The candidates are re-ranked based on the
    expressive inner product
    ○ Question: continuous vector
    ○ Passage: binary code

  25. Model: Two-stage Approach of Candidate Generation and Reranking
    [Diagram: same BPR architecture as above]
    Candidate Generation:
    ● A small number of candidates are obtained
    efficiently based on Hamming distance
    ○ Question: binary code
    ○ Passage: binary code
    Reranking:
    ● The candidates are re-ranked based on the
    more expressive inner product
    ○ Question: continuous vector
    ○ Passage: binary code
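The two stages above can be sketched end to end (toy data; `l`, the number of candidates, and all names are illustrative, not from the BPR codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, l = 64, 1000, 20

# Toy index: binary passage codes, plus the question's binary code
# and continuous vector.
passage_codes = np.where(rng.standard_normal((n, d)) >= 0, 1, -1)
q_code = np.where(rng.standard_normal(d) >= 0, 1, -1)
q_vec = rng.standard_normal(d)

# Stage 1 (candidate generation): top-l passages by Hamming distance.
# For codes in {-1, 1}^d, Hamming distance = (d - dot product) / 2.
dists = (d - passage_codes @ q_code) // 2
candidates = np.argsort(dists)[:l]

# Stage 2 (reranking): score candidates by the inner product between the
# continuous question vector and each candidate's binary code.
scores = passage_codes[candidates] @ q_vec
reranked = candidates[np.argsort(-scores)]
print(reranked[:5])
```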

  26. Model: Multi-task Training
    BPR is trained by simultaneously optimizing two loss functions:
    ● Ranking loss for candidate generation
    ● Cross-entropy loss for reranking
    The final loss function combines the two
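The loss formulas themselves did not survive extraction, so the sketch below uses generic forms (a hinge-style ranking loss and softmax cross-entropy); the exact definitions are in the paper, and all numbers here are toy values:

```python
import numpy as np

def ranking_loss(pos_score, neg_scores, margin=0.1):
    # Hinge-style ranking loss: the positive passage should beat each
    # negative by at least `margin` (illustrative form).
    return np.maximum(0.0, margin - (pos_score - neg_scores)).mean()

def cross_entropy_loss(pos_score, neg_scores):
    # Softmax cross-entropy with the positive passage as the target.
    all_scores = np.append(neg_scores, pos_score)
    return -pos_score + np.log(np.exp(all_scores).sum())

# Candidate-generation scores come from binary codes; reranking scores
# from the continuous question vector (toy numbers in both cases).
cand = ranking_loss(2.0, np.array([1.5, 2.1]))
rerank = cross_entropy_loss(3.0, np.array([1.0, 0.5]))
total = cand + rerank   # the final loss combines both objectives
print(cand, rerank, total)
```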

  27. Experiments: Comparison with DPR
    ● BPR achieves similar or better performance than DPR when k ≥ 20
    [Figures: retrieval recall rates on Natural Questions and TriviaQA]

  28. Experiments: Comparison with DPR
    ● BPR achieves similar or better performance than DPR when k ≥ 20
    ● BPR significantly reduces the computational cost of DPR
    → Index size: 65 GB → 2 GB
    → Query time: 457 ms → 38 ms
    [Figures: retrieval recall rates on Natural Questions and TriviaQA]

  29. Experiments: Comparison with DPR
    ● BPR achieves similar or better performance than DPR when k ≥ 20
    ● BPR significantly reduces the computational cost of DPR
    → Index size: 65 GB → 2 GB
    → Query time: 457 ms → 38 ms
    The recall at small k is less important:
    the reader usually takes k ≥ 20 passages
    [Figures: retrieval recall rates on Natural Questions and TriviaQA]

  30. Experiments: Comparison with Quantization Methods
    ● BPR achieves significantly better performance than
    DPR + post-hoc quantization methods: product quantization (PQ) and LSH
    [Figures: retrieval recall rates on Natural Questions and TriviaQA]

  31. Experiments: End-to-end QA
    ● BPR achieves equivalent QA accuracy to DPR with
    substantially reduced computational cost
    Exact match QA accuracy on Natural Questions and TriviaQA
    (the same BERT-based extractive reader is used for both models):
    Model                     NQ    TQA
    BPR + extractive reader   41.6  56.8
    DPR + extractive reader   41.5  56.8

  32. Summary
    BPR significantly reduces the computational cost of
    state-of-the-art open-domain QA without a loss in accuracy
    Paper: https://arxiv.org/abs/2106.00882
    Code & Model: https://github.com/studio-ousia/bpr
    Contact: [email protected] / @ikuyamada
